Gregor Betz committed
Commit c2ba07b
1 Parent(s): 44ef4de

description

Files changed (1)
  1. src/display/about.py +7 -9
src/display/about.py CHANGED
@@ -28,9 +28,7 @@ TITLE = """<h1 align="center" id="space-title"><code>/\/</code> &nbsp; Open CoT
 INTRODUCTION_TEXT = """
 The `/\/` Open CoT Leaderboard tracks the reasoning skills of LLMs, measured as their ability to generate **effective chain-of-thought reasoning traces**.
 
-The leaderboard reports **accuracy gains** achieved by using CoT, i.e.
-
-> _accuracy gain Δ_ = _CoT accuracy_ - _baseline accuracy_.
+The leaderboard reports **accuracy gains** achieved by using CoT, i.e.: _accuracy gain Δ_ = _CoT accuracy_ - _baseline accuracy_.
 
 See the "About" tab for more details and motivation.
 """
@@ -39,14 +37,14 @@ See the "About" tab for more details and motivation.
 LLM_BENCHMARKS_TEXT = """
 ## How it works
 
-A CoT `regime` consists in a prompt chain and decoding parameters used to generate a reasoning trace. To assess the reasoning skill of a given `model`, we carry out the following steps for each `task` (test dataset) and each `regime`:
+To assess the reasoning skill of a given `model`, we carry out the following steps for each `task` (test dataset) and each CoT `regime`. (A CoT `regime` consists of a prompt chain and the decoding parameters used to generate a reasoning trace.)
 
-1. Generate CoT reasoning traces for all problems in the test dataset with `model` and according to `regime`.
-2. Let the model answer the test dataset problems and record the resulting _baseline accuracy_.
-3. Let the model answer the test dataset problems _with the reasoning traces appended_ to the prompt and record the resulting _CoT accuracy_.
-4. Compute the _accuracy gain Δ_ = _CoT accuracy_ - _baseline accuracy_ for the given `model`, `task`, and `regime`.
+1. Let the `model` generate CoT reasoning traces for all problems in the test dataset according to `regime`.
+2. Let the `model` answer the test dataset problems, and record the resulting _baseline accuracy_.
+3. Let the `model` answer the test dataset problems _with the reasoning traces appended_ to the prompt, and record the resulting _CoT accuracy_.
+4. Compute the _accuracy gain Δ_ = _CoT accuracy_ - _baseline accuracy_ for the given `model`, `task`, and `regime`.
 
-Each regime has a different accuracy gain Δ, and the leaderboard reports the best Δ achieved by a regime.
+Each `regime` has a different accuracy gain Δ, and the leaderboard reports the best Δ achieved by any regime.
 
 
 ## How is it different from other leaderboards?
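
For illustration only (not part of the commit): a minimal Python sketch of the scoring logic described in the revised "How it works" text, i.e. computing Δ per regime and reporting the best Δ. All names and accuracy values below are hypothetical placeholders; the actual leaderboard code may differ.

```python
# Sketch of the accuracy-gain computation described above.
# Regime names and accuracy values are made-up examples.
from dataclasses import dataclass


@dataclass
class RegimeResult:
    """Evaluation of one (model, task, regime) combination."""
    regime: str
    baseline_accuracy: float  # answers without reasoning traces (step 2)
    cot_accuracy: float       # answers with reasoning traces appended (step 3)


def accuracy_gain(result: RegimeResult) -> float:
    """Step 4: Δ = CoT accuracy - baseline accuracy."""
    return result.cot_accuracy - result.baseline_accuracy


def best_gain(results: list[RegimeResult]) -> float:
    """The leaderboard reports the best Δ achieved by any regime."""
    return max(accuracy_gain(r) for r in results)


results = [
    RegimeResult("regime-a", baseline_accuracy=0.55, cot_accuracy=0.63),
    RegimeResult("regime-b", baseline_accuracy=0.55, cot_accuracy=0.60),
]
print(f"best Δ = {best_gain(results):.2f}")  # prints: best Δ = 0.08
```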