VEGA: Automatically Generating Compiler Backends Using a Pre-Trained Transformer Model
VEGA is an AI-driven system aimed at easing the development of compiler backends for new targets. This repository contains code and data for replicating experimental results.
1. Directory Structure
VEGA_AE
├──dataset
├──models
│   ├──FT_Model
│   ├──New_FT_Model
│   └──UnixCoder
└──Scripts
    ├──Exp
    │   ├──Acc
    │   ├──Correction
    │   ├──ForkFlow
    │   ├──Perf
    │   └──Time
    └──UnixCoder
2. Hardware Dependency
- 8 Nvidia Tesla V100 GPUs, each with 16 GB of memory.
3. Software Dependency
- CUDA == 11.7.
- Python == 3.8.1.
- Conda (any version that supports installing Python 3.8.1).
4. Installation
- Download the artifact from https://huggingface.co/docz1105/VEGA_AE.
$ git lfs clone https://huggingface.co/docz1105/VEGA_AE
$ cd VEGA_AE
- Set up a Conda virtual environment.
We provide a pre-packaged Conda virtual environment in ./vega_ae.yml, which pins the required Python version and extension packages. The environment can be created directly with the following commands.
$ conda env create -f vega_ae.yml
$ conda activate vega_ae
Alternatively, the Conda environment can be created manually:
$ conda create -n vega_ae python=3.8.1
$ conda activate vega_ae
$ pip install -r requirements.txt
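To quickly verify that the environment is set up correctly, a minimal check such as the following (assuming PyTorch was installed via vega_ae.yml or requirements.txt) can confirm that the GPUs are visible:

```python
# Minimal environment check (optional): confirm PyTorch and CUDA are usable.
import torch

print("PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("Visible GPUs:", torch.cuda.device_count())
```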
5. Code Generation
We provide a fine-tuned model in ./models/FT_Model, which was fine-tuned with ./dataset/train.jsonl and ./dataset/valid.jsonl. The train.jsonl and valid.jsonl files contain function templates, feature vectors, and ground truth for 98 backends (excluding RISC-V, RI5CY, and xCORE) in our dataset.
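The exact schema of the dataset entries is not documented here; a minimal sketch like the following can be used to inspect the keys of an entry without assuming any field names:

```python
# Inspect the first entry of the training set without assuming its schema.
import json

with open("./dataset/train.jsonl", "r") as f:
    first_entry = json.loads(f.readline())

print("Keys in the first training entry:", list(first_entry.keys()))
```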
We also provide a script for a functionality test, which generates only a single function for RI5CY (recorded as PULP in our dataset) and takes less than 3 minutes on 8 Nvidia Tesla V100 GPUs.
- Run functionality test with:
$ bash run_function_test.sh
When the run_function_test.sh script begins execution, the command line displays:
" Start Function Inferencing !"
Upon completion of the code generation, the script outputs:
" Finished Function Inferencing."
The inference result will be saved in ./models/FT_Model/result.jsonl.
Check the generated code with:
$ cat ./models/FT_Model/result.jsonl
In the result.jsonl file, the meaning of each item in an entry is given in the following table:
Item | Description |
---|---|
vega_code | The model-generated code. |
ans_code | The ground truth of the code. |
vega_pre | The model-generated confidence score. |
ans_pre | The ground truth of the confidence score. |
File | The file to which this item belongs. |
Function | The function to which this item belongs. |
Module | The function module to which this item belongs. |
Target | The target to which this item belongs. Note that we use "PULP" to represent "RI5CY" in our dataset. |
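As an illustration only (not the official evaluation script), the fields above can be used to compute a rough statement-level exact-match rate per target, treating "matches the ground truth" as exact equality of both the code and the confidence score:

```python
# Rough per-target exact-match rate over result.jsonl (illustrative sketch only).
import json
from collections import defaultdict

matched = defaultdict(int)
total = defaultdict(int)

with open("./models/FT_Model/result.jsonl", "r") as f:
    for line in f:
        entry = json.loads(line)
        target = entry["Target"]
        total[target] += 1
        if entry["vega_code"] == entry["ans_code"] and entry["vega_pre"] == entry["ans_pre"]:
            matched[target] += 1

for target, n in total.items():
    print(f"{target}: {matched[target]}/{n} statements exactly matched")
```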
- Run code generation with:
The fine-tuned model will take the function templates and feature vectors for RISC-V, RI5CY, and xCORE from ./dataset/test.jsonl as input, generating code and confidence scores automatically.
$ bash run_test.sh
Customize parameters for code generation by modifying the following options in run_test.sh.
--model_name_or_path ../../models/UnixCoder \
--test_filename ../../dataset/test.jsonl \
--output_dir ../../models/FT_Model \
--beam_size 1 \
--train_batch_size 256 \
--eval_batch_size 256 \
--learning_rate 6e-5 \
--gradient_accumulation_steps 2 \
--num_train_epochs 10 \
--mse_loss_weight 0.9 \
--ce_loss_weight 0.1
Users can run inference with their own fine-tuned model by changing the --output_dir option.
When the run_test.sh script begins execution, the command line displays:
" Start Inferencing !"
Upon completion of the code generation, the script outputs:
" Finished Inferencing."
The inference result will be saved in ./models/FT_Model/result.jsonl.
Note that if a ./models/FT_Model/result.jsonl file already exists, it will be overwritten by the execution of run_function_test.sh or run_test.sh.
6. Fine-Tuning (Optional)
We provide the original UnixCoder-base-nine model in ./models/UnixCoder. It can also be downloaded from Hugging Face: https://huggingface.co/microsoft/unixcoder-base-nine.
The original UnixCoder-base-nine will be fine-tuned with the provided ./dataset/train.jsonl and ./dataset/valid.jsonl by the following command.
- Run fine-tuning with:
$ bash run_fine_tuning.sh
Customize parameters for fine-tuning by modifying the following options in run_fine_tuning.sh.
--model_name_or_path ../../models/UnixCoder \
--train_filename ../../dataset/train.jsonl \
--dev_filename ../../dataset/valid.jsonl \
--output_dir ../../models/New_FT_Model \
--beam_size 4 \
--train_batch_size 64 \
--eval_batch_size 48 \
--learning_rate 6e-5 \
--num_train_epochs 50 \
--mse_loss_weight 0.9 \
--ce_loss_weight 0.1
The fine-tuned model will be saved in the directory specified by --output_dir.
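As an illustration of how the two loss weights above might be combined (the actual training code shipped with the artifact may differ), one plausible weighted-sum formulation is:

```python
# Illustrative sketch only: one plausible combination of the two loss terms.
# ce_loss: token-level cross-entropy on the generated code;
# mse_loss: regression loss on the predicted confidence score.
import torch

def combined_loss(ce_loss: torch.Tensor, mse_loss: torch.Tensor,
                  ce_loss_weight: float = 0.1,
                  mse_loss_weight: float = 0.9) -> torch.Tensor:
    # Weighted sum mirroring the --ce_loss_weight / --mse_loss_weight options above.
    return ce_loss_weight * ce_loss + mse_loss_weight * mse_loss
```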
7. Reproducing Results in the Experiment
We provide the scripts to reproduce each Figure/Table from the paper, along with the corresponding output result files, in the following table:
Script | Description | Output | Figure/Table |
---|---|---|---|
./Scripts/Exp/Time/gen_time.py | Calculate the time overhead for VEGA to generate three backends. | ./Scripts/Exp/Time/Fig7.csv | Fig. 7 |
./Scripts/Exp/Acc/gen_accuracy.py | Calculate the function-level accuracy of the three VEGA-generated backends. | ./Scripts/Exp/Acc/Fig8_Acc.csv | Fig. 8 |
./Scripts/Exp/Acc/gen_purple.py | Calculate the results of the purple bar in Fig. 8. | ./Scripts/Exp/Acc/Fig8_Purple.csv | Fig. 8 |
./Scripts/Exp/Acc/gen_accuracy.py | Calculate the percentage of the three types of errors in the three VEGA-generated backends. | ./Scripts/Exp/Acc/Table2.csv | Table. 2 |
./Scripts/Exp/ForkFlow/gen_forkflow.py | Calculate the statement-level accuracy of VEGA-generated backends and ForkFlow-generated backends. | ./Scripts/Exp/ForkFlow/Fig9.csv | Fig. 9 |
./Scripts/Exp/ForkFlow/gen_forkflow.py | Calculate the number of statements accurately generated and requiring manual correction by VEGA for the three backends. | ./Scripts/Exp/ForkFlow/Table3.csv | Table. 3 |
./Scripts/Exp/Correction/gen_correct.py | Calculate the time required by two developers to modify the VEGA-generated RISC-V backend. | ./Scripts/Exp/Correction/Table4.csv | Table. 4 |
./Scripts/Exp/Perf/gen_perf.py | Calculate the speedup of LLVM-Base (-O3) and LLVM-VEGA (-O3) over LLVM-Base (-O0) on three benchmarks. | ./Scripts/Exp/Perf/Fig10.csv | Fig. 10 |
7.1 Results for Fig. 7
In the code generation process, we set a batch size of 256 on 8 Nvidia Tesla V100 GPUs (each with 16 GB memory), meaning each batch contains 256 statements. Since each batch may include statements from different function modules, we did not directly measure the generation time for each function module of the three targets (RISC-V, RI5CY, xCORE) during execution. Instead, we calculated the average inference time of each batch (25 seconds) and then derived the inference time of each statement (25/256 seconds). With the total number of statements within each function module of each target, we then calculated the total inference time required for each function module of each target.
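The following minimal sketch reproduces this arithmetic with a hypothetical statement count (the real counts come from the dataset):

```python
# Worked example of the derivation above: per-statement time is the measured
# average batch time divided by the batch size; module time is that value
# multiplied by the module's statement count.
BATCH_TIME_SECONDS = 25   # measured average inference time per batch
BATCH_SIZE = 256          # statements per batch

per_statement = BATCH_TIME_SECONDS / BATCH_SIZE   # ~0.098 s per statement

# Hypothetical statement count for one function module, for illustration only.
statement_counts = {"ExampleModule": 1000}
for module, count in statement_counts.items():
    print(module, round(count * per_statement, 1), "seconds")
```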
- Command:
$ python ./Scripts/Exp/Time/gen_time.py
- Results:
$ cat ./Scripts/Exp/Time/Fig7.csv
7.2 Results for Fig. 8
In our experiment, we employed the Pass@1 evaluation metric, which involves replacing each VEGA-generated function individually within the official LLVM (LLVM-Base), then running regression tests to verify the correctness of the replaced function. This process is highly time-consuming, as a single regression test run generally takes about half an hour. Thus, sequentially testing all 1,454 VEGA-generated functions across three targets would require approximately 727 hours.
To simplify this process, we recorded the ground truth for each statement based on the Pass@1 experiment results. Additionally, we documented a list of functions containing Err-Def errors (i.e., errors due to missing necessary statements in the function template; functions with Err-Def errors cannot pass all regression tests). This allowed us to transform the Pass@1 testing process into an Exact Match evaluation.
In this Exact Match evaluation, each statement is deemed correct if the VEGA-generated code matches the ground truth and the confidence score aligns. A function is considered correct if all statements within it are accurate and it is free from Err-Def errors.
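A minimal sketch of this Exact Match check (not the official gen_accuracy.py, and with a hypothetical, empty Err-Def list) might look like this:

```python
# Sketch of the function-level Exact Match check described above.
import json
from collections import defaultdict

# Hypothetical placeholder: the set of function names on the documented Err-Def list.
err_def_functions = set()

statement_ok = defaultdict(list)
with open("./models/FT_Model/result.jsonl", "r") as f:
    for line in f:
        e = json.loads(line)
        key = (e["Target"], e["File"], e["Function"])
        ok = e["vega_code"] == e["ans_code"] and e["vega_pre"] == e["ans_pre"]
        statement_ok[key].append(ok)

# A function is correct only if all of its statements match and it has no Err-Def error.
correct = sum(1 for key, oks in statement_ok.items()
              if all(oks) and key[2] not in err_def_functions)
print(f"Correct functions: {correct}/{len(statement_ok)}")
```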
- Command:
$ cp ./models/FT_Model/result.jsonl ./Scripts/Exp/Acc
$ python ./Scripts/Exp/Acc/gen_accuracy.py
This script automatically analyzes VEGA's output from result.jsonl and compares the generated code and confidence scores with the ground truth. Based on this comparison, it determines whether each function is correct.
- Accuracy Results:
$ cat ./Scripts/Exp/Acc/Fig8_Acc.csv
We also provide a script for calculating the proportion of "Accurate Functions with Integrated Statements Across Multiple Targets", which corresponds to the purple bar in Fig. 8.
- Command:
$ python ./Scripts/Exp/Acc/gen_purple.py
- Results:
$ cat ./Scripts/Exp/Acc/Fig8_Purple.csv
7.3 Results for Table. 2
Executing the script in 7.2 will also yield the proportion of the three types of errors for each target.
- Command:
$ python ./Scripts/Exp/Acc/gen_accuracy.py
- Results:
$ cat ./Scripts/Exp/Acc/Table2.csv
7.4 Results for Fig. 9
We modified the functions generated by VEGA and the functions in the MIPS backend (ForkFlow) to ensure they run correctly on the RISC-V, RI5CY, and xCORE backends, respectively. The function code for the MIPS backend is kept in the ./Scripts/Exp/ForkFlow/Mips_Code directory, and the manually fixed code for the RISC-V, RI5CY, and xCORE LLVM backends is in ./Scripts/Exp/ForkFlow/Std_Code. Additionally, the script in 7.2 automatically writes the VEGA-generated code from result.jsonl into the ./Scripts/Exp/ForkFlow/VEGA_Code directory for comparison. Executing the following script automatically calculates the proportion of accurate and modified statements for the VEGA-generated functions and the ForkFlow process.
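As a rough illustration of such a statement-level comparison (the real gen_forkflow.py may differ), one could count how many generated lines already match the manually fixed reference; the file names below are hypothetical:

```python
# Illustrative sketch: fraction of non-empty generated lines that also appear
# in the manually fixed reference file.
def unchanged_ratio(generated_path: str, reference_path: str) -> float:
    with open(generated_path) as g, open(reference_path) as r:
        gen_lines = [line.strip() for line in g if line.strip()]
        ref_lines = {line.strip() for line in r if line.strip()}
    return sum(1 for line in gen_lines if line in ref_lines) / max(len(gen_lines), 1)

# Example call with hypothetical file names:
# unchanged_ratio("./Scripts/Exp/ForkFlow/VEGA_Code/<file>.cpp",
#                 "./Scripts/Exp/ForkFlow/Std_Code/<file>.cpp")
```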
- Command:
$ python ./Scripts/Exp/ForkFlow/gen_forkflow.py
- Results:
$ cat ./Scripts/Exp/ForkFlow/Fig9.csv
7.5 Results for Table. 3
Executing the script in 7.4 will also output the number of statements accurately generated and requiring manual correction by VEGA across seven function modules for RISC-V, RI5CY, and xCORE.
- Command:
$ python ./Scripts/Exp/ForkFlow/gen_forkflow.py
- Results:
$ cat ./Scripts/Exp/ForkFlow/Table3.csv
7.6 Results for Table. 4
The data in Table. 4 show the time two developers needed to modify the VEGA-generated RISC-V backend. Since this is a human-based experiment, we only provide the recorded modification time for each function.
The following script computes the total time spent by Developers A and B to modify each function module in the VEGA-generated RISC-V backend, based on the recorded times for each function.
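A minimal sketch of this aggregation (the record format and values below are hypothetical, not the actual recorded data) is:

```python
# Sum per-function modification times by developer and function module.
from collections import defaultdict

# (developer, function module, function, minutes) -- placeholder values only.
records = [
    ("A", "ExampleModule", "ExampleFunction", 12),
    ("B", "ExampleModule", "ExampleFunction", 15),
]

totals = defaultdict(float)
for developer, module, _function, minutes in records:
    totals[(developer, module)] += minutes

for (developer, module), minutes in sorted(totals.items()):
    print(f"Developer {developer}, {module}: {minutes} minutes")
```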
- Command:
$ python ./Scripts/Exp/Correction/gen_correct.py
- Results:
$ cat ./Scripts/Exp/Correction/Table4.csv
7.7 Results for Fig. 10
Due to commercial licensing restrictions, we cannot provide the source code for the SPEC 2017 CPU benchmark used in this experiment. Additionally, testing all benchmarks including SPEC 2017 CPU is time-intensive, requiring around 565 hours in total. To address these constraints, we provide our recorded experimental data.
Running the following script will automatically calculate the speedup of the VEGA-generated LLVM backend (LLVM-VEGA) with the "-O3" optimization over the performance of the official LLVM backend (LLVM-Base) with "-O0", as well as the speedup of LLVM-Base with "-O3" over its own performance with "-O0".
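As a worked example of this speedup definition with placeholder times (not the recorded data):

```python
# Speedup over LLVM-Base -O0 is the -O0 running time divided by the
# configuration's running time. All values below are placeholders.
base_O0 = 10.0   # seconds, LLVM-Base, -O0
base_O3 = 4.0    # seconds, LLVM-Base, -O3
vega_O3 = 4.2    # seconds, LLVM-VEGA, -O3

print("LLVM-Base -O3 speedup over LLVM-Base -O0:", base_O0 / base_O3)
print("LLVM-VEGA -O3 speedup over LLVM-Base -O0:", base_O0 / vega_O3)
```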
- Command:
$ python ./Scripts/Exp/Perf/gen_perf.py
- Results:
$ cat ./Scripts/Exp/Perf/Fig10.csv
8. Experiment Customization
Users can run this experiment in different software environments, but they must ensure that the PyTorch version is compatible with the CUDA version in those environments. The experiment can also be conducted on different hardware, but the batch sizes for fine-tuning and inference must be adjusted to the available GPU memory. We have fixed the random seed and parameters in the provided scripts to ensure consistent code generation accuracy within the same hardware and software environment. However, if the model is re-fine-tuned under a different hardware or software environment, the accuracy of the newly fine-tuned model may exhibit slight variations.
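An optional helper sketch (not part of the artifact) for checking the available GPU memory before choosing batch sizes:

```python
# Print each visible GPU and its total memory to guide batch size choices.
import torch

if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(i)
        print(f"GPU {i}: {props.name}, {props.total_memory / 1024**3:.0f} GB")
else:
    print("No CUDA device visible; check the CUDA/PyTorch installation.")
```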
We further conducted code generation tests on a machine with an Nvidia A100 GPU (80GB memory) and CUDA Version == 12.0. Under the provided Conda virtual environment, the experimental results showed a 25-minute reduction in the time overhead of the code generation process (Fig. 7). This reduction is due to the A100 GPU's higher computational efficiency compared to the V100, as well as the additional time costs in the previous setup with 8 V100 GPUs caused by synchronization requirements across multiple GPUs. Notably, code accuracy remained unchanged (Fig. 8, Fig. 9, Table. 2, Table. 3). This confirms that our experiment is adaptable across different hardware and software environments.
Citation
@inproceedings{zhong2025vega,
  title={VEGA: Automatically Generating Compiler Backends Using a Pre-Trained Transformer Model},
  author={Ming Zhong and Fang Lv and Lulin Wang and Lei Qiu and Yingying Wang and Ying Liu and Huimin Cui and Xiaobing Feng and Jingling Xue},
  booktitle={2025 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)},
  year={2025}
}