ComBack: A Versatile Dataset for Enhancing Compiler Backend Development Efficiency
ComBack is a large-scale multi-platform compiler backend code dataset. This repository contains all fine-tuned models and scripts for reproducing experimental results.
Dataset Information
Details can be found at https://huggingface.co/datasets/docz-ict/ComBack
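For a quick look at the data, the dataset can be loaded from the Hugging Face Hub. A minimal sketch, assuming the `datasets` library is installed; the configuration and split names below are assumptions, so check the dataset card for the actual ones:

```python
# Minimal sketch: loading ComBack from the Hugging Face Hub.
# NOTE: configuration/split names are illustrative assumptions --
# see https://huggingface.co/datasets/docz-ict/ComBack for the actual names.
from datasets import load_dataset

dataset = load_dataset("docz-ict/ComBack")  # pass a config name if one is required
print(dataset)              # available splits and row counts
print(dataset["train"][0])  # one input/ground-truth example
```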
Task Examples
- Statement-Level Completion: complete the current statement.
//Inputs:
...
adjustReg(MBB,LastFrameDestroy, DL, SPReg, FPReg, -StackSize+RVFI->getVarArgsSaveSize()
//Ground Truth:
MachineInstr::FrameDestroy);
- Next-Statement Suggestion: predict the next statement.
//Inputs:
...
maxCallFrameSize = (maxCallFrameSize + AlignMask) & ~AlignMask;
//Ground Truth:
MFI->setMaxCallFrameSize(maxCallFrameSize);
- Code Generation: generate a function from a function description in natural language.
//Inputs:
getPointerRegClass: Returns a TargetRegisterClass used for pointer values.
Target-Specific Value: Sparc, SP::I64RegsRegClass, SP::IntRegsRegClass.
//Ground Truth:
TargetRegisterClass *SparcRegisterInfo::getPointerRegClass(MachineFunction &MF, unsigned Kind) {
return Subtarget.is64Bit() ? &SP::I64RegsRegClass : &SP::IntRegsRegClass;
}
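Conceptually, the two completion tasks split a backend function at a statement boundary: everything before the split point is the model input, and the current statement's remainder (or the next statement) is the ground truth. Below is a minimal, hypothetical Python sketch of this pairing; it is not the dataset's actual construction code, which differs in tokenization and statement detection.

```python
# Hypothetical sketch of how a statement-level completion pair could be
# derived from a backend function; illustration only.
def make_completion_pair(statements, idx, cut):
    """Context = all statements before `idx` plus the first `cut` characters
    of statement `idx`; target = the rest of that statement."""
    context = "\n".join(statements[:idx]) + "\n" + statements[idx][:cut]
    target = statements[idx][cut:]
    return context, target

stmts = [
    "maxCallFrameSize = (maxCallFrameSize + AlignMask) & ~AlignMask;",
    "MFI->setMaxCallFrameSize(maxCallFrameSize);",
]
ctx, tgt = make_completion_pair(stmts, 1, len("MFI->"))
print(ctx)  # ends with "MFI->"
print(tgt)  # "setMaxCallFrameSize(maxCallFrameSize);"
```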
1. Dependencies
- python version == 3.8.1
- pip install -r requirements.txt
2. Fine-Tuning
We fine-tuned six pre-trained code language models on eight Tesla V100 GPUs (16 GB each). You can fine-tune each model on our datasets by running:
# Model Type Options: CodeBert, GraphCodeBert, UnixCoder, CodeT5, NatGen, CodeT5+
# Task Options: code-generation, code-completion, new-target-completion (CodeT5+ only), new-target-generation (CodeT5+ only)
bash ./Script/Model/{Model Type}/{Task}/run_fine_tuning*.sh
3. Reproducing Results in Table 2
Dataset split scheme
Split the data of all 178 backends into train/valid/test sets at a ratio of 80%:10%:10% (sketched below).
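As an illustration only (the released split files should be used for exact reproduction), such a split could look like:

```python
# Illustrative 80%/10%/10% split of a list of examples; the official
# train/valid/test files should be used to reproduce the paper's numbers.
import random

def split_80_10_10(examples, seed=42):
    examples = list(examples)
    random.Random(seed).shuffle(examples)
    n = len(examples)
    n_train, n_valid = int(n * 0.8), int(n * 0.1)
    return (examples[:n_train],
            examples[n_train:n_train + n_valid],
            examples[n_train + n_valid:])

train, valid, test = split_80_10_10(range(100))
print(len(train), len(valid), len(test))  # 80 10 10
```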
- Dataset Info
Task | Train | Valid | Test |
---|---|---|---|
Statement-Level Comp. | 128,899 (11.36M Tokens) | 16,112 (1.43M Tokens) | 16,113 (1.43M Tokens) |
Next-Statement Sugg. | 173,052 (15.69M Tokens) | 21,631 (1.99M Tokens) | 21,632 (1.98M Tokens) |
Code Generation | 36,236 (5.10M Tokens) | 4,530 (0.64M Tokens) | 4,530 (0.64M Tokens) |
Reproduce the results in Table 2 by running:
# Model Type Options: CodeBert, GraphCodeBert, UnixCoder, CodeT5, NatGen, CodeT5+
# Task Options: code-generation, code-completion
bash ./Script/Model/{Model Type}/{Task}/run_test.sh
Results
- Without Fine-Tuning

Stmt. Comp. | Stmt. Comp. | Next. Sugg. | Next. Sugg. | Code. Gen. | Code. Gen. | |
---|---|---|---|---|---|---|
Model | EM | ED | EM | ED | BLEU4 | ED |
CodeBert-c | 0.00 | 0.97 | 0.00 | 1.31 | 0.00 | 0.44 |
GraphCodeBert-c | 0.00 | 0.35 | 0.00 | 0.54 | 0.00 | 2.41 |
UnixCoder-base-nine | 0.07 | 27.56 | 15.93 | 29.11 | 0.00 | 31.81 |
CodeT5-base | 0.65 | 21.45 | 7.23 | 23.50 | 0.00 | 13.57 |
NatGen | 0.00 | 13.52 | 0.02 | 15.95 | 0.01 | 28.76 |
CodeT5+-220m | 0.02 | 7.24 | 0.12 | 9.87 | 0.00 | 12.33 |

- Fine-Tuned

Stmt. Comp. | Stmt. Comp. | Next. Sugg. | Next. Sugg. | Code. Gen. | Code. Gen. | |
---|---|---|---|---|---|---|
Model | EM | ED | EM | ED | BLEU4 | ED |
CodeBert-c | 53.84 | 77.44 | 52.67 | 70.82 | 23.54 | 54.63 |
GraphCodeBert-c | 43.00 | 71.89 | 47.10 | 61.31 | 20.73 | 48.83 |
UnixCoder-base-nine | 67.84 | 85.06 | 58.51 | 75.31 | 56.24 | 73.45 |
CodeT5-base | 66.38 | 84.34 | 58.52 | 76.03 | 70.87 | 80.45 |
NatGen | 67.47 | 84.83 | 60.30 | 76.84 | 71.73 | 81.39 |
CodeT5+-220m | 66.93 | 84.45 | 59.57 | 76.41 | 75.28 | 82.95 |
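Here, EM is the percentage of predictions that exactly match the reference, ED is an edit-distance-based similarity, and BLEU4 is 4-gram BLEU (computable with a standard package such as sacrebleu). Below is a minimal sketch of the two string metrics, assuming ED is a character-level Levenshtein similarity scaled to 0-100; the paper's exact normalization may differ.

```python
# Sketch of the two string metrics; assumes ED = normalized character-level
# Levenshtein similarity in [0, 100]. The paper's exact definition may differ.
def exact_match(preds, refs):
    """Percentage of predictions identical to their references."""
    return 100.0 * sum(p == r for p, r in zip(preds, refs)) / len(refs)

def levenshtein(a, b):
    """Standard dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def edit_sim(preds, refs):
    """Mean edit-distance similarity, scaled to [0, 100]."""
    scores = [100.0 * (1 - levenshtein(p, r) / max(len(p), len(r), 1))
              for p, r in zip(preds, refs)]
    return sum(scores) / len(scores)
```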
4. Reproducing Results in Table 3
Dataset split scheme
Use the RISC-V, ARC, and NVPTX data from both GCC and LLVM as the test set; split the remaining CPU, MPU, and GPU targets into train/valid sets at a ratio of 85%:15%, excluding RI5CY (RI5CY is customized based on RISC-V).
Dataset Info

Task | Train | Valid | Test |
---|---|---|---|
Statement-Level Comp. | 114,016 (10.20M Tokens) | 20,121 (1.81M Tokens) | 6,645 (0.58M Tokens) |
Next-Statement Sugg. | 152,114 (14.10M Tokens) | 26,844 (2.49M Tokens) | 9,313 (0.83M Tokens) |
Code Generation | 30,633 (4.44M Tokens) | 5,406 (0.79M Tokens) | 2,819 (0.37M Tokens) |
Input examples for ChatGPT-3.5-Turbo and Code-LLaMA-34B-Instruct
Statement-Level Completion
//Prompt: Complete the last statement of this code snippet:
...
adjustReg(MBB,LastFrameDestroy, DL, SPReg, FPReg, -StackSize+RVFI->getVarArgsSaveSize()
Next-Statement Suggestion
//Prompt: Predict the next statement of this code snippet:
...
maxCallFrameSize = (maxCallFrameSize + AlignMask) & ~AlignMask;
Code Generation
//Prompt: Create a function named "getPointerRegClass" for "Sparc" backend of LLVM Compiler.
//The description of this function is "Returns a TargetRegisterClass used for pointer values".
//It contains "Sparc", "SP::I64RegsRegClass", "SP::IntRegsRegClass" as target specific values.
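The prompts above follow a fixed template over the dataset fields (function name, target, compiler, description, target-specific values). A small illustrative helper is sketched below; `make_codegen_prompt` is hypothetical and not part of this repository's scripts.

```python
# Hypothetical helper that renders the code-generation prompt shown above
# from its parts; not part of the repository's evaluation scripts.
def make_codegen_prompt(func, target, compiler, description, specific_values):
    values = '", "'.join(specific_values)
    return (f'Create a function named "{func}" for "{target}" backend '
            f'of {compiler} Compiler.\n'
            f'The description of this function is "{description}".\n'
            f'It contains "{values}" as target specific values.')

print(make_codegen_prompt(
    "getPointerRegClass", "Sparc", "LLVM",
    "Returns a TargetRegisterClass used for pointer values",
    ["Sparc", "SP::I64RegsRegClass", "SP::IntRegsRegClass"]))
```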
Reproduce the results in Table 3 by running:
# Task Options: new-target-completion, new-target-generation
bash ./Script/Model/CodeT5+/{Task}/run_test_existing_type.sh
# ChatGPT
bash ./Script/Exp_Script/ChatGPT/run_chatgpt.sh
# Code-LLaMA
bash ./Script/Exp_Script/ChatGPT/run_codellama.sh
Results
- GCC
Stmt. Comp. | Stmt. Comp. | Stmt. Comp. | Stmt. Comp. | Stmt. Comp. | Stmt. Comp. | Next. Sugg. | Next. Sugg. | Next. Sugg. | Next. Sugg. | Next. Sugg. | Next. Sugg. | Code. Gen. | Code. Gen. | Code. Gen. | Code. Gen. | Code. Gen. | Code. Gen. | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
RISC-V | RISC-V | ARC | ARC | NVPTX | NVPTX | RISC-V | RISC-V | ARC | ARC | NVPTX | NVPTX | RISC-V | RISC-V | ARC | ARC | NVPTX | NVPTX | |
Model | EM | ED | EM | ED | EM | ED | EM | ED | EM | ED | EM | ED | BLEU4 | ED | BLEU4 | ED | BLEU4 | ED |
ChatGPT-3.5-Turbo | 10.34 | 38.41 | 15.35 | 42.94 | 12.01 | 41.47 | 6.44 | 12.9 | 9.75 | 20.79 | 7.97 | 17.79 | 1.37 | 24.12 | 1.67 | 28.26 | 1.57 | 26.97 |
Code-LLaMA-34B | 0.41 | 19.07 | 0.85 | 16.77 | 0.56 | 18.22 | 1.58 | 13.54 | 2.66 | 17.95 | 2.47 | 16.59 | 1.67 | 27.89 | 1.71 | 30.49 | 1.57 | 27.65 |
CodeT5+-220m | 51.16 | 75.32 | 52.45 | 74.57 | 50.56 | 75.52 | 49.11 | 67.84 | 38.26 | 59.21 | 38.33 | 56.31 | 32.56 | 58.67 | 19.94 | 50.27 | 25.47 | 52.60 |
- LLVM
Stmt. Comp. | Stmt. Comp. | Stmt. Comp. | Stmt. Comp. | Stmt. Comp. | Stmt. Comp. | Next. Sugg. | Next. Sugg. | Next. Sugg. | Next. Sugg. | Next. Sugg. | Next. Sugg. | Code. Gen. | Code. Gen. | Code. Gen. | Code. Gen. | Code. Gen. | Code. Gen. | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
RISC-V | RISC-V | ARC | ARC | NVPTX | NVPTX | RISC-V | RISC-V | ARC | ARC | NVPTX | NVPTX | RISC-V | RISC-V | ARC | ARC | NVPTX | NVPTX | |
Model | EM | ED | EM | ED | EM | ED | EM | ED | EM | ED | EM | ED | BLEU4 | ED | BLEU4 | ED | BLEU4 | ED |
ChatGPT-3.5-Turbo | 12.08 | 41.39 | 16.77 | 42.02 | 14.73 | 43.72 | 9.80 | 21.86 | 10.81 | 20.66 | 11.39 | 22.82 | 1.23 | 25.12 | 1.30 | 27.19 | 1.43 | 25.45 |
Code-LLaMA-34B | 0.45 | 17.61 | 0.61 | 17.21 | 0.99 | 17.23 | 1.75 | 15.04 | 0.42 | 11.27 | 2.42 | 16.25 | 1.43 | 27.24 | 1.61 | 32.12 | 1.59 | 28.08 |
CodeT5+-220m | 62.68 | 82.02 | 71.34 | 85.98 | 64.45 | 81.53 | 48.71 | 68.95 | 58.68 | 74.57 | 47.81 | 65.5 | 50.34 | 72.98 | 55.38 | 74.41 | 44.33 | 66.36 |
5. Reproducing Results in Figure 6
Reproduce the results in Figure 6 by running:
# Task Options: new-target-completion, new-target-generation
bash ./Script/Model/CodeT5+/{Task}/run_test_existing_type.sh
# Fork-Flow
bash ./Script/Exp_Script/ForkFlow/run_forkflow.sh
Results
- GCC
RISC-V | RISC-V | ARC | ARC | NVPTX | NVPTX | |
---|---|---|---|---|---|---|
Method | BLEU4 | ED | BLEU4 | ED | BLEU4 | ED |
ForkFlow Avg | 3.48 | 5.79 | 1.77 | 3.73 | 4.7 | 3.81 |
ForkFlow Max | 28.77 | 34.8 | 4.94 | 8.85 | 4.7 | 3.81 |
CodeT5+ | 32.56 | 58.67 | 25.47 | 52.6 | 19.94 | 50.27 |
- LLVM
RISC-V | RISC-V | ARC | ARC | NVPTX | NVPTX | |
---|---|---|---|---|---|---|
Method | BLEU4 | ED | BLEU4 | ED | BLEU4 | ED |
ForkFlow Avg | 12.45 | 22.18 | 19.98 | 33.43 | 15.06 | 28.73 |
ForkFlow Max | 27.32 | 46.47 | 41.8 | 60.62 | 18.81 | 39.04 |
CodeT5+ | 50.34 | 72.98 | 55.38 | 74.41 | 44.33 | 66.36 |
6. Reproducing Results in Table 4
Dataset split scheme
Use the ARC and NVPTX data from both GCC and LLVM as the test set; split the CPU targets (excluding RISC-V and RI5CY) into train/valid sets at a ratio of 85%:15%.
Dataset Info

Task | Train | Valid | Test |
---|---|---|---|
Statement-Level Comp. | 87,018 (7.78M Tokens) | 15,357 (1.37M Tokens) | 2,764 (0.26M Tokens) |
Next-Statement Sugg. | 113,684 (10.65M Tokens) | 20,063 (1.87M Tokens) | 4,029 (0.38M Tokens) |
Code Generation | 21,184 (3.14M Tokens) | 3,739 (0.55M Tokens) | 1,372 (0.18M Tokens) |
Reproduce the results in Table 4 by running:
# Task Options: new-target-completion, new-target-generation
bash ./Script/Model/CodeT5+/{Task}/run_test_new_type.sh
Results
- GCC

Stmt. Comp. | Stmt. Comp. | Stmt. Comp. | Stmt. Comp. | Next. Sugg. | Next. Sugg. | Next. Sugg. | Next. Sugg. | Code. Gen. | Code. Gen. | Code. Gen. | Code. Gen. | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
ARC(MPU) | ARC(MPU) | NVPTX(GPU) | NVPTX(GPU) | ARC(MPU) | ARC(MPU) | NVPTX(GPU) | NVPTX(GPU) | ARC(MPU) | ARC(MPU) | NVPTX(GPU) | NVPTX(GPU) | |
Dataset | EM | ED | EM | ED | EM | ED | EM | ED | BLEU4 | ED | BLEU4 | ED |
-w GPU and MPU | 52.45 | 74.57 | 50.56 | 75.52 | 38.26 | 59.21 | 38.33 | 56.31 | 19.94 | 50.27 | 25.47 | 52.6 |
-w/o GPU and MPU | 50.53 | 74.09 | 46.37 | 72.45 | 37.22 | 58.21 | 38.33 | 56.83 | 19.29 | 49.12 | 22.46 | 50.33 |
Decrease | 1.92 | 0.48 | 4.19 | 3.07 | 1.04 | 1.00 | 0.00 | -0.52 | 0.65 | 1.15 | 3.01 | 3.37 |

- LLVM

Stmt. Comp. | Stmt. Comp. | Stmt. Comp. | Stmt. Comp. | Next. Sugg. | Next. Sugg. | Next. Sugg. | Next. Sugg. | Code. Gen. | Code. Gen. | Code. Gen. | Code. Gen. | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
ARC(MPU) | ARC(MPU) | NVPTX(GPU) | NVPTX(GPU) | ARC(MPU) | ARC(MPU) | NVPTX(GPU) | NVPTX(GPU) | ARC(MPU) | ARC(MPU) | NVPTX(GPU) | NVPTX(GPU) | |
Dataset | EM | ED | EM | ED | EM | ED | EM | ED | BLEU4 | ED | BLEU4 | ED |
-w GPU and MPU | 71.34 | 85.98 | 64.45 | 81.53 | 58.68 | 74.57 | 47.81 | 65.50 | 55.38 | 74.41 | 44.33 | 66.36 |
-w/o GPU and MPU | 69.82 | 85.59 | 60.04 | 79.85 | 58.26 | 73.75 | 46.28 | 63.92 | 49.62 | 70.26 | 42.94 | 65.43 |
Decrease | 1.52 | 0.39 | 4.41 | 1.68 | 0.42 | 0.82 | 1.53 | 1.58 | 5.76 | 4.15 | 1.39 | 0.93 |
7. Reproducing Results in Table 5
Dataset split scheme
Use the RI5CY data in LLVM as the test set; split the CPU targets into train/valid sets at a ratio of 85%:15%, once excluding RISC-V and once including it.
Dataset Info
- Excluding RISC-V

Task | Train | Valid | Test |
---|---|---|---|
Statement-Level Comp. | 87,018 (7.78M Tokens) | 15,357 (1.37M Tokens) | 721 (0.04M Tokens) |
Next-Statement Sugg. | 113,684 (10.65M Tokens) | 20,063 (1.87M Tokens) | 1,035 (0.06M Tokens) |
Code Generation | 21,184 (3.14M Tokens) | 3,739 (0.55M Tokens) | 219 (0.02M Tokens) |

- Including RISC-V

Task | Train | Valid | Test |
---|---|---|---|
Statement-Level Comp. | 90,316 (8.06M Tokens) | 15,940 (1.42M Tokens) | 721 (0.04M Tokens) |
Next-Statement Sugg. | 118,175 (11.04M Tokens) | 20,856 (1.94M Tokens) | 1,035 (0.06M Tokens) |
Code Generation | 22,413 (3.30M Tokens) | 3,957 (0.58M Tokens) | 219 (0.02M Tokens) |
Reproduce the results in Table 5 by running:
# Task Options: new-target-completion, new-target-generation
bash ./Script/Model/CodeT5+/{Task}/run_test_itr_exp.sh
Results
Stmt. Comp. | Stmt. Comp. | Next. Sugg. | Next. Sugg. | Code. Gen. | Code. Gen. | |
---|---|---|---|---|---|---|
Dataset | EM | ED | EM | ED | BLEU4 | ED |
-w/o RISC-V | 66.16 | 83.79 | 57.29 | 74.73 | 54.41 | 75.41 |
-w RISC-V | 74.06 | 87.91 | 67.25 | 81.28 | 79.46 | 89.92 |
Diff | 7.90 | 4.12 | 9.96 | 6.55 | 25.05 | 14.51 |
Citation
@inproceedings{zhong2024comback,
title={ComBack: A Versatile Dataset for Enhancing Compiler Backend Development Efficiency},
author={Ming Zhong and Fang Lyu and Lulin Wang and Hongna Geng and Lei Qiu and Huimin Cui and Xiaobing Feng},
booktitle={Thirty-eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
year={2024}
}