Terry Zhuo committed
Commit cd5ba8d
1 Parent(s): 52ee73d

fix: add more notes

Files changed (2):
  1. app.py +1 -1
  2. src/text_content.py +1 -1
app.py CHANGED
@@ -226,7 +226,7 @@ with demo:
  - <u>Complete</u>: Code Completion based on the (verbose) structured docstring. This variant tests if the models are good at coding.
  - <u>Instruct</u> (🔥Vibe Check🔥): Code Generation based on the (less verbose) NL-oriented instructions. This variant tests whether the models truly understand human intent when coding.
  - `complete` and `instruct` represent the calibrated Pass@1 score on the BigCodeBench benchmark variants.
- - `elo_mle` represents the task-level Bootstrap of Maximum Likelihood Elo rating on `BigCodeBench-Complete`.
+ - `elo_mle` represents the task-level Bootstrap of Maximum Likelihood Elo rating on `BigCodeBench-Complete`, which starts from 1000 and is bootstrapped 500 times.
  - `size` is the amount of activated model weight during inference.
  - Model providers have the responsibility to avoid data contamination. Models trained on closed data can be affected by contamination.
  - For more details check the 📝 About section.
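For context on the `elo_mle` note added above, here is a minimal sketch of a task-level bootstrapped MLE Elo computation in the style of the Chatbot Arena notebook. This is an illustration under assumptions, not the code shipped in this repository: the function names, the `battles` frame with `model_a`/`model_b`/`winner` columns, and the seed are invented for the example, while the base-10 / 400-scale / 1000-start constants follow the standard Elo convention the note refers to.

```python
# Sketch only (assumptions throughout; not the repository's actual code):
# fit a logistic-regression Bradley-Terry model over pairwise "battles",
# map coefficients onto the Elo scale anchored at 1000, and bootstrap
# the battles 500 times, reporting the median rating.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

def compute_mle_elo(battles: pd.DataFrame,
                    base: float = 10.0, scale: float = 400.0,
                    init_rating: float = 1000.0) -> pd.Series:
    """battles: columns model_a, model_b, winner ('model_a' or 'model_b')."""
    models = pd.unique(battles[["model_a", "model_b"]].values.ravel())
    idx = {m: i for i, m in enumerate(models)}
    X = np.zeros((len(battles), len(models)))
    for row, (a, b) in enumerate(zip(battles["model_a"], battles["model_b"])):
        X[row, idx[a]] = np.log(base)   # odds of winning scale with rating gap
        X[row, idx[b]] = -np.log(base)
    y = (battles["winner"] == "model_a").astype(int).to_numpy()
    lr = LogisticRegression(fit_intercept=False)  # assumes both outcomes occur
    lr.fit(X, y)
    # Coefficients live on the logistic scale; rescale to Elo, start at 1000.
    return pd.Series(scale * lr.coef_[0] + init_rating, index=models)

def bootstrap_mle_elo(battles: pd.DataFrame, rounds: int = 500,
                      seed: int = 42) -> pd.Series:
    rng = np.random.default_rng(seed)
    samples = [
        compute_mle_elo(battles.sample(frac=1.0, replace=True,
                                       random_state=int(rng.integers(2**31))))
        for _ in range(rounds)
    ]
    # The median across bootstrap rounds is what gets reported.
    return pd.DataFrame(samples).median().sort_values(ascending=False)
```

Resampling the battles with replacement and taking the median over the 500 rounds makes the reported rating robust to any individual pairing.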
src/text_content.py CHANGED
@@ -42,7 +42,7 @@ pip install bigcodebench[generate] --upgrade
 
  ### Scoring and Rankings
  - Models are ranked according to Pass@1 using greedy decoding. Setup details can be found <a href="https://github.com/bigcode-project/bigcodebench/blob/main/bigcodebench/generate.py">here</a>.
- - The code to compute Elo rating is based on [Chatbot Arena Notebook](https://colab.research.google.com/drive/1RAWb22-PFNI-X1gPVzc927SGUdfr6nsR#scrollTo=JdiJbB6pZB1B&line=2&uniqifier=1). We only compute the Elo rating for the `BigCodeBench-Complete` variant.
+ - The code to compute the Elo rating is [here](https://github.com/bigcode-project/bigcodebench/blob/main/analysis/get_results.py); it is based on the [Chatbot Arena Notebook](https://colab.research.google.com/drive/1RAWb22-PFNI-X1gPVzc927SGUdfr6nsR#scrollTo=JdiJbB6pZB1B&line=2&uniqifier=1). We only compute the Elo rating for the `BigCodeBench-Complete` variant.
 
  ### Contact
  If you have any questions, feel free to reach out to us at [[email protected]](mailto:[email protected]) or [[email protected]](mailto:[email protected])
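As a side note on the ranking rule in the text above: with greedy decoding there is exactly one completion per task, so Pass@1 reduces to the fraction of tasks whose completion passes all of its tests. A tiny sketch under that assumption follows; the helper name and result format are invented for illustration, and this is not the BigCodeBench evaluation harness (in particular it ignores the calibration step the leaderboard notes mention).

```python
# Sketch: Pass@1 under greedy decoding (one sample per task).
from typing import Dict

def pass_at_1_greedy(passed: Dict[str, bool]) -> float:
    """passed maps task_id -> whether the single greedy sample passed."""
    return sum(passed.values()) / len(passed) if passed else 0.0

# Example: 3 of 4 tasks pass, so Pass@1 = 0.75.
print(pass_at_1_greedy({"t1": True, "t2": True, "t3": False, "t4": True}))
```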