---
license: apache-2.0
datasets:
- lambdasec/cve-single-line-fixes
- lambdasec/gh-top-1000-projects-vulns
language:
- code
tags:
- code
programming_language:
- Java
- JavaScript
- Python
inference: false
model-index:
- name: SantaFixer
  results:
  - task:
      type: text-generation
    dataset:
      type: openai/human-eval-infilling
      name: HumanEval
    metrics:
    - name: single-line infilling pass@1
      type: pass@1
      value: 0.47
      verified: false
    - name: single-line infilling pass@10
      type: pass@10
      value: 0.74
      verified: false
  - task:
      type: text-generation
    dataset:
      type: lambdasec/gh-top-1000-projects-vulns
      name: GH Top 1000 Projects Vulnerabilities
    metrics:
    - name: pass@1 (Java)
      type: pass@1
      value: 0.26
      verified: false
    - name: pass@10 (Java)
      type: pass@10
      value: 0.48
      verified: false
    - name: pass@1 (Python)
      type: pass@1
      value: 0.31
      verified: false
    - name: pass@10 (Python)
      type: pass@10
      value: 0.56
      verified: false
    - name: pass@1 (JavaScript)
      type: pass@1
      value: 0.36
      verified: false
    - name: pass@10 (JavaScript)
      type: pass@10
      value: 0.62
      verified: false
---
# Model Card for SantaFixer
This is an LLM for code that focuses on generating bug fixes using infilling.
## Model Details
### Model Description
- **Developed by:** [codelion](https://huggingface.co/codelion)
- **Model type:** GPT-2
- **Finetuned from model:** [bigcode/santacoder](https://huggingface.co/bigcode/santacoder)
## How to Get Started with the Model
Use the code below to get started with the model.
```python
# pip install -q transformers torch
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "lambdasec/santafixer"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, trust_remote_code=True).to(device)

# Fill-in-the-middle prompt: the model generates the code that belongs
# between the prefix and the suffix.
input_text = (
    "<fim-prefix>def print_hello_world():\n    "
    "<fim-suffix>\n    print('Hello world!')"
    "<fim-middle>"
)
inputs = tokenizer.encode(input_text, return_tensors="pt").to(device)
outputs = model.generate(inputs)
print(tokenizer.decode(outputs[0]))
```
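To ask for a fix to a specific line, the prompt can be assembled from the code before and after that line. The helper below is a minimal sketch, not the card author's code: `build_fim_prompt` is a hypothetical name, and it assumes the `<fim-prefix>`, `<fim-suffix>`, and `<fim-middle>` tokens are laid out as in the example above, with the suspected buggy line left out so the model can regenerate it.

```python
FIM_PREFIX = "<fim-prefix>"
FIM_SUFFIX = "<fim-suffix>"
FIM_MIDDLE = "<fim-middle>"


def build_fim_prompt(source: str, buggy_line: int) -> str:
    """Build an infilling prompt that omits one line of `source`.

    `buggy_line` is 1-indexed. Everything before it becomes the prefix,
    everything after it becomes the suffix, and the model is expected to
    generate a replacement for the omitted line.
    """
    lines = source.splitlines(keepends=True)
    if not 1 <= buggy_line <= len(lines):
        raise ValueError(f"buggy_line must be between 1 and {len(lines)}")
    prefix = "".join(lines[: buggy_line - 1])
    suffix = "".join(lines[buggy_line:])
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"


# Illustrative input: line 2 is the line we want the model to rewrite.
example_source = "a = 1\nb = a +\nprint(b)\n"
example_prompt = build_fim_prompt(example_source, 2)
```

The resulting string can be passed to `tokenizer.encode` in place of `input_text`.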
## Training Details
- **GPU:** Tesla P100
- **Time:** ~5 hrs
### Training Data
The model was fine-tuned on the [CVE single line fixes dataset](https://huggingface.co/datasets/lambdasec/cve-single-line-fixes).
### Training Procedure
Supervised fine-tuning (SFT).
#### Training Hyperparameters
- **optim:** adafactor
- **gradient_accumulation_steps:** 4
- **gradient_checkpointing:** true
- **fp16:** false
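The names above match keyword arguments of `transformers.TrainingArguments`, so one plausible way to express them is shown below. This is a sketch under assumptions: the card does not say which trainer was used, `output_dir` is a hypothetical placeholder, and every setting not listed above (batch size, learning rate, epochs) is left at the library default because the card does not report it.

```python
# Hyperparameters reported in the card, keyed by TrainingArguments names.
training_kwargs = {
    "output_dir": "santafixer-sft",  # hypothetical path, not from the card
    "optim": "adafactor",
    "gradient_accumulation_steps": 4,
    "gradient_checkpointing": True,
    "fp16": False,
}


def make_training_arguments():
    """Create TrainingArguments from the reported settings.

    The import is deferred so the settings can be inspected without
    having transformers installed.
    """
    from transformers import TrainingArguments

    return TrainingArguments(**training_kwargs)
```

Adafactor and gradient checkpointing both reduce memory use, which fits training on a single Tesla P100; disabling `fp16` keeps training in full precision.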
## Evaluation
The model was tested with the [GitHub top 1000 projects vulnerabilities dataset](https://huggingface.co/datasets/lambdasec/gh-top-1000-projects-vulns).
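The metadata of this card reports pass@1 and pass@10 scores. The card does not state how they were computed; the sketch below assumes the commonly used unbiased estimator, where `n` completions are sampled per problem and `c` of them pass. The function names and the sample counts are illustrative, not taken from the evaluation itself.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k samples is correct.

    n: number of completions sampled for a problem
    c: number of those completions that pass
    k: sample budget being scored (k <= n)
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: any k samples include a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts: list[int], n: int, k: int) -> float:
    """Average pass@k over problems, given the pass count for each problem."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Illustrative counts only: three problems, 10 samples each.
example_score = mean_pass_at_k([0, 5, 10], n=10, k=1)
```

With `k=1` the estimator reduces to the fraction of passing samples per problem, averaged over problems.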