---
license: llama2
base_model:
- codellama/CodeLlama-13b-Instruct-hf
pipeline_tag: text-generation
---

# CodeLlama-13b-MORepair

CodeLlama-13b-MORepair is a program repair model fine-tuned from CodeLlama-13b-instruct using MOREPAIR, a multi-objective fine-tuning framework. The model is trained to learn both the code transformation itself and the reasoning behind the repair, improving automated program repair quality.

[Paper](https://arxiv.org/abs/2404.12636) | [Code](https://github.com/buaabarty/morepair)

## Model Description

- **Base Model**: CodeLlama-13b-instruct
- **Training Technique**: Multi-objective fine-tuning with MOREPAIR framework
- **Supported Languages**: Primarily tested on C++ and Java, but likely generalizes to other languages
- **Primary Use**: Automated program repair
- **License**: Llama 2 Community License

## Training Details

### Training Data
- **Dataset**: TUTORLLMCODE
- **Size**: 1,600 pairs of buggy and repaired code
- **Nature**: Programming task corrections with LLM-generated repair guidance

### Training Approach
The model was trained using MOREPAIR, which employs the following (a configuration sketch follows the list):
- Multi-objective learning with two objectives:
  1. Generating repaired code
  2. Producing repaired code with explanatory guidance
- QLoRA fine-tuning (only 1.84% of parameters modified)
- NEFTune for improved generalization
- LLM-generated guidance for understanding repair logic
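
For readers who want to reproduce a comparable setup, here is a minimal QLoRA + NEFTune configuration sketch using the `peft` and `transformers` libraries. The hyperparameter values and the output path are illustrative assumptions, not the values actually used to train this model:

```python
import torch
from transformers import (AutoModelForCausalLM, BitsAndBytesConfig,
                          TrainingArguments)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit NF4 quantization of the frozen base model (the "Q" in QLoRA)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base = AutoModelForCausalLM.from_pretrained(
    "codellama/CodeLlama-13b-Instruct-hf",
    quantization_config=bnb_config,
    device_map="auto",
)
base = prepare_model_for_kbit_training(base)

# Low-rank adapters: only these small matrices are trained, which is how
# the trainable-parameter fraction stays below 2% of the full model
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,  # illustrative values
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()

# NEFTune adds noise to embeddings during training and is supported
# natively by the Hugging Face Trainer through this single argument
args = TrainingArguments(
    output_dir="morepair-qlora",   # hypothetical output path
    neftune_noise_alpha=5.0,       # illustrative value
)
```

Under this reading of the two objectives, each buggy/fixed pair would contribute two training samples: one targeting the repaired code alone, and one targeting the repaired code together with the LLM-generated guidance.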

## Usage

Here's how to use the model with the Hugging Face Transformers library:

### Installation
```bash
pip install transformers torch accelerate bitsandbytes
```
`accelerate` is required for `device_map="auto"` and `bitsandbytes` for `load_in_8bit=True`.

### Basic Usage
````python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# Load model and tokenizer (8-bit loading requires bitsandbytes and accelerate)
model_name = "barty/CodeLlama-13B-MORepair"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    load_in_8bit=True,
    torch_dtype=torch.float16
)
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

def repair_code(buggy_code, filename="example.java"):
    # Construct the prompt in the instruction format the model expects;
    # the trailing open ```java fence invites the model to continue
    # with the repaired code.
    prompt = f"""[INST] This is an incorrect code({filename}):
```java
{buggy_code}
```
You are a software engineer. Can you repair the incorrect code?
[/INST]
```java
"""

    # Budget generation so prompt plus completion stays near 500 tokens,
    # with a floor of 64 new tokens for long prompts
    prompt_tokens = len(tokenizer.tokenize(prompt))
    max_new_tokens = max(64, 500 - prompt_tokens)

    # Generate a candidate repair
    output = pipe(
        prompt,
        min_length=prompt_tokens + 64,
        max_length=prompt_tokens + max_new_tokens,
        temperature=1.0,
        do_sample=True
    )

    # Extract the repaired code: take the text after [/INST], which begins
    # with the ```java fence opened in the prompt, and cut at the closing fence
    full_text = output[0]['generated_text']
    completion = full_text.split('[/INST]', 1)[1]
    fixed_code = completion.split('```java', 1)[1].split('```', 1)[0].strip()

    return full_text, fixed_code

# Example usage
buggy_code = """
public static int findMinRotated(int[] arr) {
    int left = 0;
    int right = arr.length - 1;

    while (left < right) {
        int mid = (left + right) / 2;
        if (arr[mid] > arr[right])
            left = mid;  // Bug: should be mid + 1
        else
            right = mid;
    }
    return arr[left];
}
"""

full_response, fixed_code = repair_code(buggy_code)
print("Fixed code:")
print(fixed_code)
````

### Important Parameters
- `load_in_8bit=True`: loads the weights in 8-bit precision via bitsandbytes, roughly halving memory use compared to fp16
- `temperature=1.0`: samples from the unmodified output distribution; lower values make generations more deterministic
- `do_sample=True`: enables sampling-based generation (set to `False` for greedy decoding)
- `min_length` / `max_length`: lower and upper bounds on the total sequence length, prompt tokens included
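
For deterministic single-shot output (for example, when benchmarking single-attempt repair), sampling can be disabled through standard `transformers` generation options; the snippet below is an illustrative variant, and the token cap is an arbitrary example:

```python
output = pipe(
    prompt,
    do_sample=False,       # greedy decoding: always take the most likely token
    max_new_tokens=256,    # cap on newly generated tokens (illustrative value)
)
```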

## Limitations

- Performance varies across different programming languages
- May require multiple attempts to generate correct fixes
- Should be used with appropriate test cases to validate repairs (a minimal generate-and-validate sketch follows this list)
- May not handle very complex or multi-file program repairs
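
Because decoding is sampled, pairing generation with a cheap validity check often recovers a correct fix within a few attempts. The sketch below is illustrative rather than part of the released tooling: `compiles` and `repair_with_retries` are hypothetical helpers, a local JDK is assumed, and `repair_code` is the function from Basic Usage above:

```python
import pathlib
import subprocess
import tempfile

def compiles(java_method: str, class_name: str = "Candidate") -> bool:
    """Cheap validity check: does javac accept the candidate?
    A real harness would also run the project's test suite."""
    with tempfile.TemporaryDirectory() as tmp:
        src = pathlib.Path(tmp) / f"{class_name}.java"
        # Wrap the bare method in a class so it forms a compilable unit
        src.write_text(f"public class {class_name} {{\n{java_method}\n}}\n")
        result = subprocess.run(["javac", str(src)], capture_output=True)
        return result.returncode == 0

def repair_with_retries(buggy_code: str, attempts: int = 5):
    """Sample candidate repairs and return the first one that compiles."""
    for _ in range(attempts):
        _, candidate = repair_code(buggy_code)
        if compiles(candidate):
            return candidate
    return None
```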

## Technical Specifications

- **Architecture**: Based on CodeLlama-13b-instruct
- **Parameters**: Same as base model (13B)
- **Fine-tuning Method**: QLoRA + NEFTune
- **Context Window**: Same as CodeLlama-13b-instruct
- **Input Format**: Code snippets with optional repair guidance

## Citation

If you use this model in your research, please cite:
```bibtex
@article{yang2024multi,
  title={Multi-Objective Fine-Tuning for Enhanced Program Repair with LLMs},
  author={Yang, Boyang and Tian, Haoye and Ren, Jiadong and Zhang, Hongyu and Klein, Jacques and Bissyandé, Tegawendé F. and Le Goues, Claire and Jin, Shunfu},
  journal={arXiv preprint arXiv:2404.12636},
  year={2024}
}
```

## Acknowledgments

This model builds upon the CodeLlama model family developed by Meta AI and incorporates the MOREPAIR framework for enhanced program repair capabilities.