Update README.md
README.md
@@ -1,3 +1,50 @@
---
license: apache-2.0
datasets:
- dyngnosis/function_names_v2
---

A Phi-2 model fine-tuned to identify function names from disassembled binary functions. It outputs the function name as a JSON object. You can use the following code to identify a function name:

```python
import json

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "seanmor5/phi-2-function-identification",
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,
)
model.to(torch.device("cuda"))
tokenizer = AutoTokenizer.from_pretrained("seanmor5/phi-2-function-identification")


def prompt(code):
    # Build the instruction prompt for the function-naming task.
    return (
        "Input: Given the following disassembled code, provide a descriptive"
        + " function name for the code. Your function name should"
        + " accurately describe the purpose of the code. It should"
        + " be formatted in C style with lowercase and snakecase."
        + f" Only output the name as valid JSON, e.g. {json.dumps({'name': 'function_name'})}"
        + f"\nCode: {code}\nOutput:"
    )


def identify_function(code):
    # Stop on either the JSON-closing token or the regular end-of-text token.
    eos_tokens = tokenizer.convert_tokens_to_ids(['"}', "<|endoftext|>"])
    inputs = tokenizer(prompt(code), return_tensors="pt")
    inputs.to(torch.device("cuda"))

    outputs = model.generate(**inputs, max_new_tokens=64, eos_token_id=eos_tokens)
    # Decode only the newly generated tokens, not the prompt.
    text = tokenizer.batch_decode(outputs[:, inputs["input_ids"].shape[1] :])[0]
    return text


func = """
void fcn.140030b80(ulong param_1, ulong param_2, ulong param_3) {
ulong uVar1; uVar1 = fcn.140030ae0(param_3);
fcn.14002efc0(param_1, param_2, uVar1); return;
}
"""

print(identify_function(func))
```

The model tends to repeat itself, so you should include `"}` among the EOS tokens when generating. Generation then stops as soon as the JSON object is closed.
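
Because the model returns raw text rather than a parsed object, it can help to post-process the output before using it. The following is a minimal sketch, not part of the original model card: `parse_function_name` is a hypothetical helper, and it assumes the generated text contains a JSON object with a `"name"` key, possibly followed by a decoded `<|endoftext|>` marker or repeated output.

```python
import json
from typing import Optional


def parse_function_name(text: str) -> Optional[str]:
    """Return the "name" field from raw model output, or None if absent.

    Hypothetical helper: the output shape is an assumption based on the
    prompt, which asks for JSON such as {"name": "function_name"}.
    """
    # Drop the end-of-text marker if it was decoded along with the output.
    text = text.replace("<|endoftext|>", "")
    # Keep only the first JSON object in case the model kept generating.
    start = text.find("{")
    end = text.find("}", start)
    if start == -1 or end == -1:
        return None
    try:
        parsed = json.loads(text[start : end + 1])
    except json.JSONDecodeError:
        return None
    name = parsed.get("name") if isinstance(parsed, dict) else None
    return name if isinstance(name, str) else None
```

With the code above, this could be used as `parse_function_name(identify_function(func))`. Returning `None` rather than raising leaves it to the caller to decide how to handle malformed output.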