---
license: apache-2.0
datasets:
- KAKA22/CodeRM-UnitTest
language:
- en
base_model:
- meta-llama/Llama-3.1-8B-Instruct
pipeline_tag: text-generation
tags:
- code
- llama
library_name: transformers
---

# Introduction

CodeRM-8B is a small yet powerful model designed to enable efficient and high-quality unit test generation. It is trained on a dataset of 60k high-quality synthetic Python unit tests generated with Llama3.1-70B-Instruct. These unit tests are synthesized based on two well-regarded code instruction tuning datasets: [CodeFeedback-Filtered-Instruction](https://huggingface.co/datasets/m-a-p/CodeFeedback-Filtered-Instruction) and the training set of [TACO](https://huggingface.co/datasets/BAAI/TACO). The training dataset used for unit test generation is openly available under [CodeRM-UnitTest](https://huggingface.co/datasets/KAKA22/CodeRM-UnitTest).

For further information and training details, refer to our paper "Dynamic Scaling of Unit Tests for Code Reward Modeling", available on [arXiv](https://arxiv.org/abs/2501.01054).

You can also visit the [homepage](https://code-reward-model.github.io/) and the GitHub [repo](https://github.com/RUCKBReasoning/CodeRM) of the paper.

# Model Information

The model is fine-tuned from [Llama3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct).

# Prompt Format

```
Below is a question and it's corresponding code answer. Please write test cases to check the correctness of the code answer. You need to use the unittest library in Python and create a test class for testing.

### question
{question}

### code solution
{code in function format}

Please add detailed comments to the test cases you write. You do not need to test the function's ability to throw exceptions.
```

# Performance

## Best-of-N

First, we evaluate the performance of CodeRM-8B using a best-of-N setting.
In this setup, an LLM (the policy model) generates 100 candidate code solutions for a given programming problem, while another LLM (the reward model) generates 100 unit tests. The best code solution is then selected by majority voting over the execution results of these unit tests.

Under this framework, our trained unit test generator performs comparably to Llama3.1-70B-Instruct despite having roughly 8x fewer parameters. The detailed evaluation results across three well-known benchmarks are as follows (pass rates in %, with the gain over the vanilla policy in parentheses):

| Model | Policy: Llama3-8B | Policy: Llama3-70B | Policy: GPT-3.5 | Policy: GPT-4o-mini |
| :------ | :------ | :------ | :------ | :------ |
| **Benchmark: HumanEval Plus** |||||
| Vanilla | 53.58 | 73.74 | 67.83 | 82.96 |
| Reward: Llama3.1-8B | 66.84 (+13.26) | 77.14 (+3.40) | 76.32 (+8.49) | 83.11 (+0.15) |
| Reward: Llama3.1-70B | **72.04 (+18.46)** | 78.54 (+4.80) | **79.76 (+11.93)** | 85.45 (+2.49) |
| Reward: CodeRM-8B | 72.01 (+18.43) | **78.69 (+4.95)** | 78.01 (+10.18) | **86.38 (+3.42)** |
| **Benchmark: MBPP Plus** |||||
| Vanilla | 49.20 | 69.33 | 70.53 | 71.59 |
| Reward: Llama3.1-8B | 64.31 (+15.11) | 71.64 (+2.31) | 74.18 (+3.65) | 74.48 (+2.89) |
| Reward: Llama3.1-70B | 65.26 (+16.06) | 71.85 (+2.52) | 75.72 (+5.19) | 74.96 (+3.37) |
| Reward: CodeRM-8B | **66.71 (+17.51)** | **72.44 (+3.11)** | **75.96 (+5.43)** | **75.20 (+3.61)** |
| **Benchmark: LiveCodeBench** |||||
| Vanilla | 11.98 | 25.30 | 20.55 | 34.83 |
| Reward: Llama3.1-70B | 13.28 (+1.30) | **28.46 (+3.16)** | **22.80 (+2.25)** | 38.60 (+3.77) |
| Reward: CodeRM-8B | **15.21 (+3.23)** | 27.73 (+2.43) | 21.76 (+1.21) | **39.20 (+4.37)** |

## Quality of Unit Test

We evaluate the quality of the unit tests generated by CodeRM-8B. Because each unit test acts as a classifier that separates correct solutions from incorrect ones, we first use accuracy and F1 score to assess its classification performance.
We further propose two new metrics that evaluate in more detail how likely a unit test is to make an incorrect judgment. False Acceptance Rate (FAR) measures the probability that wrong solutions are accepted by unit tests. False Rejection Rate (FRR) measures the probability that correct solutions are rejected by unit tests. The formulas for all four metrics are given in Appendix D of the paper.

The table below reports the quality of individual unit tests and of combinations of multiple unit tests on HumanEval Plus, using Llama3.1-8B as the policy model. The best result in each group is marked in **bold**.

| **Model**            | **Acc (↑)**   | **F1 (↑)**    | **FAR (↓)**   | **FRR (↓)**   |
|----------------------|---------------|---------------|---------------|---------------|
| **Quality of Individual Unit Tests** | | | | |
| Llama3.1-8B          | 60.02         | 44.97         | 13.66         | 46.13         |
| Llama3.1-70B         | **73.65**     | **70.15**     | **11.10**     | **34.51**     |
| *CodeRM-8B (Ours)*   | 69.64         | 63.63         | 11.17         | 38.55         |
| **Quality of Multiple Unit Tests** | | | | |
| Llama3.1-8B          | 74.21         | 74.35         | 20.44         | 30.55         |
| Llama3.1-70B         | 78.30         | 78.76         | 17.19         | 25.97         |
| *CodeRM-8B (Ours)*   | **80.46**     | **81.27**     | **16.48**     | **22.71**     |

# Citation

If you find our model helpful, please cite the original paper:

```
@misc{ma2025coderm,
      title={Dynamic Scaling of Unit Tests for Code Reward Modeling},
      author={Zeyao Ma and Xiaokang Zhang and Jing Zhang and Jifan Yu and Sijia Luo and Jie Tang},
      year={2025},
      eprint={2501.01054},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2501.01054},
}
```

# Contact

If you have any problems, feel free to raise an issue or reach out to us via email at: .
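
# Illustrative Sketch: Best-of-N Selection and Test Quality Metrics

The best-of-N selection and the FAR/FRR metrics described in the Performance section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the paper: the function names `select_best_solution` and `far_frr` and the boolean `passed` matrix are hypothetical, "majority voting" is read here as picking the candidate that passes the most unit tests, and FAR/FRR are computed directly from their verbal definitions. The exact formulas are in Appendix D of the paper and may aggregate results differently.

```python
from typing import List, Tuple


def select_best_solution(passed: List[List[bool]]) -> int:
    """Return the index of the candidate that passes the most unit tests.

    passed[i][j] is True when candidate solution i passes unit test j.
    Assumption: votes are simple pass counts; ties go to the lowest index.
    """
    if not passed:
        raise ValueError("need at least one candidate solution")
    votes = [sum(row) for row in passed]
    # max() returns the first maximal element, so ties favour the lowest index.
    return max(range(len(votes)), key=votes.__getitem__)


def far_frr(is_correct: List[bool], accepted: List[bool]) -> Tuple[float, float]:
    """Compute (FAR, FRR) for one unit test over a set of solutions.

    FAR: fraction of wrong solutions that the test accepts.
    FRR: fraction of correct solutions that the test rejects.
    A rate is reported as 0.0 when its denominator is empty (an assumption).
    """
    wrong = [a for c, a in zip(is_correct, accepted) if not c]
    right = [a for c, a in zip(is_correct, accepted) if c]
    far = sum(wrong) / len(wrong) if wrong else 0.0
    frr = sum(not a for a in right) / len(right) if right else 0.0
    return far, frr


# Toy execution matrix: 3 candidate solutions x 4 generated unit tests.
passed = [
    [True, False, False, True],
    [True, True, True, False],
    [False, False, True, False],
]
best_index = select_best_solution(passed)

# Suppose only candidate 1 is actually correct, and look at unit test 0.
is_correct = [False, True, False]
accepted_by_test_0 = [row[0] for row in passed]
far, frr = far_frr(is_correct, accepted_by_test_0)
```

In this toy example the pass counts are 2, 3, and 1, so candidate 1 is selected. Unit test 0 accepts one of the two wrong candidates and the single correct one, which gives a FAR of 0.5 and an FRR of 0.0 under the definitions assumed above.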