Gregor Betz committed
Commit 44ef4de
1 Parent(s): 689a5b6
Files changed (1)
  1. src/display/about.py +27 -3
src/display/about.py CHANGED
@@ -26,15 +26,38 @@ TITLE = """<h1 align="center" id="space-title"><code>/\/</code> &nbsp; Open CoT
 
 # What does your leaderboard evaluate?
 INTRODUCTION_TEXT = """
-Intro text
+The `/\/` Open CoT Leaderboard tracks the reasoning skills of LLMs, measured as their ability to generate **effective chain-of-thought reasoning traces**.
+
+The leaderboard reports **accuracy gains** achieved by using CoT, i.e.
+
+> _accuracy gain Δ_ = _CoT accuracy_ - _baseline accuracy_.
+
+See the "About" tab for more details and motivation.
 """
 
 # Which evaluations are you running? how can people reproduce what you have?
-LLM_BENCHMARKS_TEXT = f"""
+LLM_BENCHMARKS_TEXT = """
 ## How it works
 
+A CoT `regime` consists of a prompt chain and the decoding parameters used to generate a reasoning trace. To assess the reasoning skill of a given `model`, we carry out the following steps for each `task` (test dataset) and each `regime`:
+
+1. Generate CoT reasoning traces for all problems in the test dataset with `model` and according to `regime`.
+2. Let the model answer the test dataset problems and record the resulting _baseline accuracy_.
+3. Let the model answer the test dataset problems _with the reasoning traces appended_ to the prompt and record the resulting _CoT accuracy_.
+4. Compute the _accuracy gain Δ_ = _CoT accuracy_ - _baseline accuracy_ for the given `model`, `task`, and `regime`.
+
+Each regime yields a different accuracy gain Δ; the leaderboard reports the best Δ achieved by any regime (see the sketch below).
+
+
+## How is it different from other leaderboards?
+
+...
+
+## Test dataset selection (`tasks`)
+
+
 ## Reproducibility
-To reproduce our results, here is the commands you can run:
+To reproduce our results, check out the repository [cot-eval](https://github.com/logikon-ai/cot-eval).
 
 """
 
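To make steps 2-4 of the procedure above concrete, here is a minimal Python sketch of the bookkeeping. Everything in it is illustrative: `RegimeResult`, the regime names, and the accuracy numbers are hypothetical stand-ins, and the LLM calls of steps 1-3 (trace generation, answering) are assumed to have already produced the accuracies; the actual evaluation is done by the cot-eval pipeline.

```python
from dataclasses import dataclass


@dataclass
class RegimeResult:
    """Accuracies measured for one (model, task) pair under one CoT regime."""
    regime: str
    baseline_accuracy: float  # step 2: accuracy without a reasoning trace
    cot_accuracy: float       # step 3: accuracy with the trace appended


def accuracy_gain(r: RegimeResult) -> float:
    """Step 4: accuracy gain Δ = CoT accuracy - baseline accuracy."""
    return r.cot_accuracy - r.baseline_accuracy


# Toy numbers for illustration only.
results = [
    RegimeResult("regime-A", baseline_accuracy=0.55, cot_accuracy=0.62),  # Δ = +0.07
    RegimeResult("regime-B", baseline_accuracy=0.55, cot_accuracy=0.59),  # Δ = +0.04
]

# The leaderboard entry for this (model, task) pair is the best Δ across regimes.
best = max(results, key=accuracy_gain)
print(f"best regime: {best.regime}, Δ = {accuracy_gain(best):+.2f}")
```

With these toy numbers, regime-A gives Δ = 0.62 - 0.55 = +0.07, and that best Δ is what the leaderboard would report for this model/task pair.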
@@ -75,4 +98,5 @@ We're populating the Open CoT Leaderboard step by step. The idea is to grow a di
 
 CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results"
 CITATION_BUTTON_TEXT = r"""
+Logikon AI Team. (2024). Open CoT Leaderboard. Retrieved from https://huggingface.co/spaces/logikon/open_cot_leaderboard
 """