Terry Zhuo committed
Commit cd5ba8d
1 Parent(s): 52ee73d

fix: add more notes

Files changed (2):
  1. app.py +1 -1
  2. src/text_content.py +1 -1
app.py CHANGED
@@ -226,7 +226,7 @@ with demo:
  - <u>Complete</u>: Code Completion based on the (verbose) structured docstring. This variant tests if the models are good at coding.
  - <u>Instruct</u> (🔥Vibe Check🔥): Code Generation based on the (less verbose) NL-oriented instructions. This variant tests whether the models truly understand human intent when coding.
  - `complete` and `instruct` represent the calibrated Pass@1 score on the BigCodeBench benchmark variants.
- - `elo_mle` represents the task-level Bootstrap of Maximum Likelihood Elo rating on `BigCodeBench-Complete`.
+ - `elo_mle` represents the task-level Bootstrap of Maximum Likelihood Elo rating on `BigCodeBench-Complete`, which starts from 1000 and is bootstrapped 500 times.
  - `size` is the amount of activated model weight during inference.
  - Model providers have the responsibility to avoid data contamination. Models trained on closed data can be affected by contamination.
  - For more details check the 📝 About section.
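For context on the `elo_mle` note added above, here is a minimal sketch of a task-level bootstrapped MLE Elo computation in the style of the Chatbot Arena notebook. This is an illustration under assumptions, not the code shipped in this repository: the function names, the `battles` frame with `model_a`/`model_b`/`winner` columns, and the seed are invented for the example, while the base-10 / 400-scale / 1000-start constants follow the standard Elo convention the note refers to.

```python
# Sketch only (assumptions throughout; not the repository's actual code):
# fit a logistic-regression Bradley-Terry model over pairwise "battles",
# map coefficients onto the Elo scale anchored at 1000, and bootstrap
# the battles 500 times, reporting the median rating.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

def compute_mle_elo(battles: pd.DataFrame,
                    base: float = 10.0, scale: float = 400.0,
                    init_rating: float = 1000.0) -> pd.Series:
    """battles: columns model_a, model_b, winner ('model_a' or 'model_b')."""
    models = pd.unique(battles[["model_a", "model_b"]].values.ravel())
    idx = {m: i for i, m in enumerate(models)}
    X = np.zeros((len(battles), len(models)))
    for row, (a, b) in enumerate(zip(battles["model_a"], battles["model_b"])):
        X[row, idx[a]] = np.log(base)   # odds of winning scale with rating gap
        X[row, idx[b]] = -np.log(base)
    y = (battles["winner"] == "model_a").astype(int).to_numpy()
    lr = LogisticRegression(fit_intercept=False)  # assumes both outcomes occur
    lr.fit(X, y)
    # Coefficients live on the logistic scale; rescale to Elo, start at 1000.
    return pd.Series(scale * lr.coef_[0] + init_rating, index=models)

def bootstrap_mle_elo(battles: pd.DataFrame, rounds: int = 500,
                      seed: int = 42) -> pd.Series:
    rng = np.random.default_rng(seed)
    samples = [
        compute_mle_elo(battles.sample(frac=1.0, replace=True,
                                       random_state=int(rng.integers(2**31))))
        for _ in range(rounds)
    ]
    # The median across bootstrap rounds is what gets reported.
    return pd.DataFrame(samples).median().sort_values(ascending=False)
```

Resampling the battles with replacement and taking the median over the 500 rounds makes the reported rating robust to any individual pairing.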
src/text_content.py CHANGED
@@ -42,7 +42,7 @@ pip install bigcodebench[generate] --upgrade
 
  ### Scoring and Rankings
  - Models are ranked according to Pass@1 using greedy decoding. Setup details can be found <a href="https://github.com/bigcode-project/bigcodebench/blob/main/bigcodebench/generate.py">here</a>.
- - The code to compute Elo rating is based on [Chatbot Arena Notebook](https://colab.research.google.com/drive/1RAWb22-PFNI-X1gPVzc927SGUdfr6nsR#scrollTo=JdiJbB6pZB1B&line=2&uniqifier=1). We only compute the Elo rating for the `BigCodeBench-Complete` variant.
+ - The code to compute the Elo rating is [here](https://github.com/bigcode-project/bigcodebench/blob/main/analysis/get_results.py); it is based on the [Chatbot Arena Notebook](https://colab.research.google.com/drive/1RAWb22-PFNI-X1gPVzc927SGUdfr6nsR#scrollTo=JdiJbB6pZB1B&line=2&uniqifier=1). We only compute the Elo rating for the `BigCodeBench-Complete` variant.
 
  ### Contact
  If you have any questions, feel free to reach out to us at [[email protected]](mailto:[email protected]) or [[email protected]](mailto:[email protected])
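As a side note on the ranking rule in the text above: with greedy decoding there is exactly one completion per task, so Pass@1 reduces to the fraction of tasks whose completion passes all of its tests. A tiny sketch under that assumption follows; the helper name and result format are invented for illustration, and this is not the BigCodeBench evaluation harness (in particular it ignores the calibration step the leaderboard notes mention).

```python
# Sketch: Pass@1 under greedy decoding (one sample per task).
from typing import Dict

def pass_at_1_greedy(passed: Dict[str, bool]) -> float:
    """passed maps task_id -> whether the single greedy sample passed."""
    return sum(passed.values()) / len(passed) if passed else 0.0

# Example: 3 of 4 tasks pass, so Pass@1 = 0.75.
print(pass_at_1_greedy({"t1": True, "t2": True, "t3": False, "t4": True}))
```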