
# ComBack: A Versatile Dataset for Enhancing Compiler Backend Development Efficiency

ComBack is a large-scale, multi-platform compiler backend code dataset. This repository contains all fine-tuned models used in the ComBack experiments.

- language: C++/C
- metrics: Exact Match (EM), Edit Distance (ED), BLEU4
- tags: code; compiler backend
- license: CC-BY-4.0
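
The exact metric implementations are not spelled out in this README, so the sketch below only illustrates how scores of this kind are typically computed: EM as the fraction of predictions that match the reference exactly after whitespace normalization, and ED reported as a normalized character-level Levenshtein similarity scaled to 0-100 (higher is better); BLEU4 is the standard 4-gram BLEU. The helper names `exact_match` and `edit_similarity` are ours, not part of ComBack.

```python
# Illustrative metric sketch -- not the official ComBack evaluation code.
# Assumes ED is reported as a normalized character-level Levenshtein
# similarity in [0, 100] and EM as exact match after whitespace
# normalization; both definitions are assumptions.

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (ca != cb)))    # substitution
        prev = curr
    return prev[-1]

def edit_similarity(pred: str, ref: str) -> float:
    """Edit-distance-based similarity scaled to 0-100 (higher is better)."""
    if not pred and not ref:
        return 100.0
    return 100.0 * (1.0 - levenshtein(pred, ref) / max(len(pred), len(ref)))

def exact_match(pred: str, ref: str) -> bool:
    """Exact match after collapsing runs of whitespace."""
    return " ".join(pred.split()) == " ".join(ref.split())

# Example: scoring one Next-Statement Suggestion prediction (a near-miss:
# not an exact match, but the edit similarity stays high).
ref = "MFI -> setMaxCallFrameSize(maxCallFrameSize);"
pred = "MFI -> setMaxCallFrameSize(maxSize);"
print(exact_match(pred, ref), round(edit_similarity(pred, ref), 2))
```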

## Task Information

- Statement-Level Completion: complete the current statement.

  ```cpp
  // Inputs:
  ...
  adjustReg(MBB, LastFrameDestroy, DL, SPReg, FPReg, -StackSize + RVFI->getVarArgsSaveSize()
  // Ground Truth:
  MachineInstr::FrameDestroy);
  ```

- Next-Statement Suggestion: predict the next statement.

  ```cpp
  // Inputs:
  ...
  maxCallFrameSize = (maxCallFrameSize + AlignMask) & ~AlignMask;
  // Ground Truth:
  MFI -> setMaxCallFrameSize(maxCallFrameSize);
  ```

- Code Generation: generate a function from a function description in natural language.

  ```cpp
  // Inputs:
  getPointerRegClass: Returns a TargetRegisterClass used for pointer values.
  Target-Specific Value: Sparc, SP::I64RegsRegClass, SP::IntRegsRegClass.
  // Ground Truth:
  TargetRegisterClass *SparcRegisterInfo::getPointerRegClass(MachineFunction &MF, unsigned Kind) {
    return Subtarget.is64Bit() ? &SP::I64RegsRegClass : &SP::IntRegsRegClass;
  }
  ```

## Organization

- `Existing_Targets/*`: data of all 178 backends, split into train/valid/test sets in the ratio 80%:10%:10%.

  - Dataset Info

    | Task | Train | Valid | Test |
    | --- | --- | --- | --- |
    | Statement-Level Comp. | 128,899 (11.36M tokens) | 16,112 (1.43M tokens) | 16,113 (1.43M tokens) |
    | Next-Statement Sugg. | 173,052 (15.69M tokens) | 21,631 (1.99M tokens) | 21,632 (1.98M tokens) |
    | Code Generation | 36,236 (5.10M tokens) | 4,530 (0.64M tokens) | 4,530 (0.64M tokens) |

  We fine-tuned six representative models across the three tasks; an illustrative inference sketch follows the result tables below.

  - Without Fine-Tuning

    | Model | Stmt. Comp. EM | Stmt. Comp. ED | Next. Sugg. EM | Next. Sugg. ED | Code Gen. BLEU4 | Code Gen. ED |
    | --- | --- | --- | --- | --- | --- | --- |
    | CodeBert-c | 0.00 | 0.97 | 0.00 | 1.31 | 0.00 | 0.44 |
    | GraphCodeBert-c | 0.00 | 0.35 | 0.00 | 0.54 | 0.00 | 2.41 |
    | UnixCoder-base-nine | 0.07 | 27.56 | 15.93 | 29.11 | 0.00 | 31.81 |
    | CodeT5-base | 0.65 | 21.45 | 7.23 | 23.50 | 0.00 | 13.57 |
    | NatGen | 0.00 | 13.52 | 0.02 | 15.95 | 0.01 | 28.76 |
    | CodeT5+-220m | 0.02 | 7.24 | 0.12 | 9.87 | 0.00 | 12.33 |

  - Fine-Tuned

    | Model | Stmt. Comp. EM | Stmt. Comp. ED | Next. Sugg. EM | Next. Sugg. ED | Code Gen. BLEU4 | Code Gen. ED |
    | --- | --- | --- | --- | --- | --- | --- |
    | CodeBert-c | 53.84 | 77.44 | 52.67 | 70.82 | xxx | xxx |
    | GraphCodeBert-c | 43.00 | 71.89 | 47.10 | 61.31 | xxx | xxx |
    | UnixCoder-base-nine | 67.84 | 85.06 | 58.51 | 75.31 | 56.24 | 73.45 |
    | CodeT5-base | 66.38 | 84.34 | 58.52 | 76.03 | 70.87 | 80.45 |
    | NatGen | 67.47 | 84.83 | 60.30 | 76.84 | 71.73 | 81.39 |
    | CodeT5+-220m | 66.93 | 84.45 | 59.57 | 76.41 | 75.28 | 82.95 |
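
  As a rough illustration of how these fine-tuned checkpoints can be queried, the sketch below runs a statement-level completion through a CodeT5+-220m-style sequence-to-sequence model with Hugging Face `transformers`. The checkpoint path is a placeholder for one of the fine-tuned model directories in this repository, and the decoding settings are illustrative assumptions rather than the hyperparameters used in the experiments.

  ```python
  # Illustrative inference sketch -- not the official evaluation script.
  # MODEL_DIR is a placeholder: point it at one of the fine-tuned
  # checkpoint directories shipped in this repository.
  from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

  MODEL_DIR = "path/to/fine-tuned-codet5p-220m"  # hypothetical local path

  tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
  model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_DIR)

  # A statement-level completion prompt in the style of the task examples above.
  prompt = (
      "adjustReg(MBB, LastFrameDestroy, DL, SPReg, FPReg, "
      "-StackSize + RVFI->getVarArgsSaveSize()"
  )

  inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512)
  # Beam search and the token budget below are assumptions, not paper settings.
  outputs = model.generate(**inputs, max_new_tokens=64, num_beams=5)
  print(tokenizer.decode(outputs[0], skip_special_tokens=True))
  ```
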
- `New_Targets/All_Types/*`: data of RISC-V, ARC, and NVPTX (in both GCC and LLVM) is used as the test set; the remaining 171 targets (178 - 2*3 - 1) are split into train/valid sets in the ratio 85%:15%. RI5CY is excluded because it is customized based on RISC-V.

  - Dataset Info

    | Task | Train | Valid | Test |
    | --- | --- | --- | --- |
    | Statement-Level Comp. | 114,016 (10.20M tokens) | 20,121 (1.81M tokens) | 6,645 (0.58M tokens) |
    | Next-Statement Sugg. | 152,114 (14.10M tokens) | 26,844 (2.49M tokens) | 9,313 (0.83M tokens) |
    | Code Generation | 30,633 (4.44M tokens) | 5,406 (0.79M tokens) | 2,819 (0.37M tokens) |

  We fine-tuned only CodeT5+ across the three tasks and compared it with ChatGPT-3.5-Turbo and Code-LLaMA-34B given similar inputs.

  - GCC

    Statement-Level Completion

    | Model | RISC-V EM | RISC-V ED | ARC EM | ARC ED | NVPTX EM | NVPTX ED |
    | --- | --- | --- | --- | --- | --- | --- |
    | ChatGPT-3.5-Turbo | 10.34 | 38.41 | 15.35 | 42.94 | 12.01 | 41.47 |
    | Code-LLaMA-34B | 0.41 | 19.07 | 0.85 | 16.77 | 0.56 | 18.22 |
    | CodeT5+-220m | 51.16 | 75.32 | 52.45 | 74.57 | 50.56 | 75.52 |

    Next-Statement Suggestion

    | Model | RISC-V EM | RISC-V ED | ARC EM | ARC ED | NVPTX EM | NVPTX ED |
    | --- | --- | --- | --- | --- | --- | --- |
    | ChatGPT-3.5-Turbo | 6.44 | 12.9 | 9.75 | 20.79 | 7.97 | 17.79 |
    | Code-LLaMA-34B | 1.58 | 13.54 | 2.66 | 17.95 | 2.47 | 16.59 |
    | CodeT5+-220m | 49.11 | 67.84 | 38.26 | 59.21 | 38.33 | 56.31 |

    Code Generation

    | Model | RISC-V BLEU4 | RISC-V ED | ARC BLEU4 | ARC ED | NVPTX BLEU4 | NVPTX ED |
    | --- | --- | --- | --- | --- | --- | --- |
    | ChatGPT-3.5-Turbo | 7.33 | 30.83 | 7.35 | 32.34 | 8.12 | 32.71 |
    | Code-LLaMA-34B | 9.38 | 35.53 | 11.06 | 37.15 | 8.24 | 33.00 |
    | CodeT5+-220m | 32.56 | 58.67 | 19.94 | 50.27 | 25.47 | 52.60 |

  - LLVM

    Statement-Level Completion

    | Model | RISC-V EM | RISC-V ED | ARC EM | ARC ED | NVPTX EM | NVPTX ED |
    | --- | --- | --- | --- | --- | --- | --- |
    | ChatGPT-3.5-Turbo | 12.08 | 41.39 | 16.77 | 42.02 | 14.73 | 43.72 |
    | Code-LLaMA-34B | 0.45 | 17.61 | 0.61 | 17.21 | 0.99 | 17.23 |
    | CodeT5+-220m | 62.68 | 82.02 | 71.34 | 85.98 | 64.45 | 81.53 |

    Next-Statement Suggestion

    | Model | RISC-V EM | RISC-V ED | ARC EM | ARC ED | NVPTX EM | NVPTX ED |
    | --- | --- | --- | --- | --- | --- | --- |
    | ChatGPT-3.5-Turbo | 9.80 | 21.86 | 10.81 | 20.66 | 11.39 | 22.82 |
    | Code-LLaMA-34B | 1.75 | 15.04 | 0.42 | 11.27 | 2.42 | 16.25 |
    | CodeT5+-220m | 48.71 | 68.95 | 58.68 | 74.57 | 47.81 | 65.5 |

    Code Generation

    | Model | RISC-V BLEU4 | RISC-V ED | ARC BLEU4 | ARC ED | NVPTX BLEU4 | NVPTX ED |
    | --- | --- | --- | --- | --- | --- | --- |
    | ChatGPT-3.5-Turbo | 9.24 | 32.13 | 11.96 | 35.33 | 10.07 | 32.90 |
    | Code-LLaMA-34B | 6.92 | 32.54 | 8.95 | 38.22 | 8.20 | 34.16 |
    | CodeT5+-220m | 50.34 | 72.98 | 55.38 | 74.41 | 44.33 | 66.36 |
- `New_Targets/CPU_Only/*`: data of ARC and NVPTX (in both GCC and LLVM) is used as the test set; the CPU targets, excluding RISC-V and RI5CY, are split into train/valid sets in the ratio 85%:15%.

  - Dataset Info

    | Task | Train | Valid | Test |
    | --- | --- | --- | --- |
    | Statement-Level Comp. | 87,018 (7.78M tokens) | 15,357 (1.37M tokens) | 2,764 (0.26M tokens) |
    | Next-Statement Sugg. | 113,684 (10.65M tokens) | 20,063 (1.87M tokens) | 4,029 (0.38M tokens) |
    | Code Generation | 21,184 (3.14M tokens) | 3,739 (0.55M tokens) | 1,372 (0.18M tokens) |

  We fine-tuned only CodeT5+ across the three tasks and compared its accuracy on ARC (MPU) and NVPTX (GPU) against the corresponding results in `New_Targets/All_Types/*`.

  - GCC

    Statement-Level Completion

    | Dataset | ARC (MPU) EM | ARC (MPU) ED | NVPTX (GPU) EM | NVPTX (GPU) ED |
    | --- | --- | --- | --- | --- |
    | w/ GPU and MPU | 52.45 | 74.57 | 50.56 | 75.52 |
    | w/o GPU and MPU | 50.53 | 74.09 | 46.37 | 72.45 |
    | Decrease | 1.92 | 0.48 | 4.19 | 3.07 |

    Next-Statement Suggestion

    | Dataset | ARC (MPU) EM | ARC (MPU) ED | NVPTX (GPU) EM | NVPTX (GPU) ED |
    | --- | --- | --- | --- | --- |
    | w/ GPU and MPU | 38.26 | 59.21 | 38.33 | 56.31 |
    | w/o GPU and MPU | 37.22 | 58.21 | 38.33 | 56.83 |
    | Decrease | 1.04 | 1.00 | 0.00 | -0.52 |

    Code Generation

    | Dataset | ARC (MPU) BLEU4 | ARC (MPU) ED | NVPTX (GPU) BLEU4 | NVPTX (GPU) ED |
    | --- | --- | --- | --- | --- |
    | w/ GPU and MPU | 19.94 | 50.27 | 25.47 | 52.6 |
    | w/o GPU and MPU | 19.29 | 49.12 | 22.46 | 50.33 |
    | Decrease | 0.65 | 1.15 | 3.01 | 3.37 |

  - LLVM

    Statement-Level Completion

    | Dataset | ARC (MPU) EM | ARC (MPU) ED | NVPTX (GPU) EM | NVPTX (GPU) ED |
    | --- | --- | --- | --- | --- |
    | w/ GPU and MPU | 71.34 | 85.98 | 64.45 | 81.53 |
    | w/o GPU and MPU | 69.82 | 85.59 | 60.04 | 79.85 |
    | Decrease | 1.52 | 0.39 | 4.41 | 1.68 |

    Next-Statement Suggestion

    | Dataset | ARC (MPU) EM | ARC (MPU) ED | NVPTX (GPU) EM | NVPTX (GPU) ED |
    | --- | --- | --- | --- | --- |
    | w/ GPU and MPU | 58.68 | 74.57 | 47.81 | 65.50 |
    | w/o GPU and MPU | 58.26 | 73.75 | 46.28 | 63.92 |
    | Decrease | 0.42 | 0.82 | 1.53 | 1.58 |

    Code Generation

    | Dataset | ARC (MPU) BLEU4 | ARC (MPU) ED | NVPTX (GPU) BLEU4 | NVPTX (GPU) ED |
    | --- | --- | --- | --- | --- |
    | w/ GPU and MPU | 55.38 | 74.41 | 44.33 | 66.36 |
    | w/o GPU and MPU | 49.62 | 70.26 | 42.94 | 65.43 |
    | Decrease | 5.76 | 4.15 | 1.39 | 0.93 |
- `New_Targets/Itr_Expansion/*`: data of RI5CY in LLVM is used as the test set; the CPU targets are split into train/valid sets in the ratio 85%:15%, once excluding RISC-V and once including RISC-V.

  - Dataset Info

    Excluding RISC-V

    | Task | Train | Valid | Test |
    | --- | --- | --- | --- |
    | Statement-Level Comp. | 87,018 (7.78M tokens) | 15,357 (1.37M tokens) | 721 (0.04M tokens) |
    | Next-Statement Sugg. | 113,684 (10.65M tokens) | 20,063 (1.87M tokens) | 1,035 (0.06M tokens) |
    | Code Generation | 21,184 (3.14M tokens) | 3,739 (0.55M tokens) | 219 (0.02M tokens) |

    Including RISC-V

    | Task | Train | Valid | Test |
    | --- | --- | --- | --- |
    | Statement-Level Comp. | 90,316 (8.06M tokens) | 15,940 (1.42M tokens) | 721 (0.04M tokens) |
    | Next-Statement Sugg. | 118,175 (11.04M tokens) | 20,856 (1.94M tokens) | 1,035 (0.06M tokens) |
    | Code Generation | 22,413 (3.30M tokens) | 3,957 (0.58M tokens) | 219 (0.02M tokens) |

  We fine-tuned only CodeT5+ across the three tasks and compared its accuracy on RI5CY when RISC-V is excluded from versus included in the training data.

  | Dataset | Stmt. Comp. EM | Stmt. Comp. ED | Next. Sugg. EM | Next. Sugg. ED | Code Gen. BLEU4 | Code Gen. ED |
  | --- | --- | --- | --- | --- | --- | --- |
  | w/o RISC-V | 66.16 | 83.79 | 57.29 | 74.73 | 54.41 | 75.41 |
  | w/ RISC-V | 74.06 | 87.91 | 67.25 | 81.28 | 79.46 | 89.92 |
  | Diff | 7.90 | 4.12 | 9.96 | 6.55 | 25.05 | 14.51 |