Gregor Betz committed on
Commit
992caee
1 Parent(s): 0841987

description

Files changed (1)
  1. src/display/about.py +1 -1
src/display/about.py CHANGED
@@ -44,7 +44,7 @@ To assess the reasoning skill of a given `model`, we carry out the following ste
  3. `model` answers the test dataset problems _with the reasoning traces appended_ to the prompt, we record the resulting _CoT accuracy_.
  4. We compute the _accuracy gain Δ_ = _CoT accuracy_ — _baseline accuracy_ for the given `model`, `task`, and `regime`.
 
- Each `regime` has a different _accuracy gain Δ_, and the leaderboard reports (for every `model`/`task`) the best Δ achieved by any regime. All models are evaluated with the same set of regimes.
+ Each `regime` yields a different _accuracy gain Δ_, and the leaderboard reports (for every `model`/`task`) the best Δ achieved by any regime. All models are evaluated against the same set of regimes.
 
 
  ## How is it different from other leaderboards?
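The evaluation procedure described in the changed text can be sketched as follows. This is a minimal illustrative sketch, not the repository's actual code: the function name, argument names, and sample numbers are all hypothetical.

```python
def best_accuracy_gain(baseline_accuracy: float,
                       cot_accuracy_by_regime: dict[str, float]) -> float:
    """Best accuracy gain Δ achieved by any regime for one model/task.

    Δ = CoT accuracy − baseline accuracy, computed per regime; the
    leaderboard reports the maximum Δ across all regimes.
    (Hypothetical helper; names are illustrative.)
    """
    gains = {regime: cot_acc - baseline_accuracy
             for regime, cot_acc in cot_accuracy_by_regime.items()}
    return max(gains.values())


# Example with made-up accuracies for two regimes:
delta = best_accuracy_gain(0.50, {"regime_a": 0.62, "regime_b": 0.55})
```

Here `delta` is 0.62 − 0.50 = 0.12, the best gain across the two regimes.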