Gregor Betz committed
Commit ad554f1 • Parent: 992caee

description

Files changed (1): src/display/about.py (+5 -1)
src/display/about.py CHANGED

@@ -34,7 +34,7 @@ See the "About" tab for more details and motivation.
 """
 
 # Which evaluations are you running? how can people reproduce what you have?
-LLM_BENCHMARKS_TEXT = """
+LLM_BENCHMARKS_TEXT = f"""
 ## How it works (roughly)
 
 To assess the reasoning skill of a given `model`, we carry out the following steps for each `task` (test dataset) and different CoT `regimes`. (A CoT `regime` consists of a prompt chain and decoding parameters used to generate a reasoning trace.)
@@ -53,6 +53,10 @@ Performance leaderboards like the [🤗 Open LLM Leaderboard](https://huggingfac
 
 Unlike these leaderboards, the `/\/` Open CoT Leaderboard assesses a model's ability to effectively reason about a `task`:
 
+| Leaderboard | Measures | Metric | Focus |
+|:---|:---|:---|:---|
+| 🤗 Open LLM Leaderboard | Task performance | Absolute accuracy | Task performance |
+
 ### 🤗 Open LLM Leaderboard
 * Can `model` solve `task`?
 * Measures `task` performance.
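
The one-line code change in this commit turns `LLM_BENCHMARKS_TEXT` from a plain triple-quoted string into an f-string. Here is a minimal sketch of what that enables, assuming (hypothetically) that later revisions interpolate values into the text; the `leaderboard_name` variable below is illustrative and not part of this commit:

```python
# Sketch: plain string vs. f-string for a module-level docs constant.
# `leaderboard_name` is a hypothetical placeholder, not from this commit.
leaderboard_name = "Open CoT Leaderboard"

PLAIN_TEXT = """
Welcome to the {leaderboard_name}.
"""  # plain string: the braces are kept literally, no substitution

LLM_BENCHMARKS_TEXT = f"""
Welcome to the {leaderboard_name}.
"""  # f-string: {leaderboard_name} is substituted when the module is imported

print(LLM_BENCHMARKS_TEXT)
```

One caveat with this switch: inside an f-string, any literal `{` or `}` in the markdown must be doubled (`{{`, `}}`), or Python will try to interpolate it. The added table contains no braces, so it is unaffected.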
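The quoted docs define a CoT `regime` as a prompt chain plus decoding parameters used to generate a reasoning trace. A minimal sketch of how such a regime might be represented, under assumed field names (none of these identifiers appear in the repo):

```python
from dataclasses import dataclass, field

@dataclass
class CoTRegime:
    """Hypothetical container: a prompt chain plus decoding parameters
    used to generate a reasoning trace (all field names are assumptions)."""
    name: str
    prompt_chain: list[str]       # prompts sent in sequence to the model
    temperature: float = 0.7      # decoding parameters
    top_p: float = 0.95
    max_new_tokens: int = 512

# Example regime: a single-step "think step by step" prompt.
regime = CoTRegime(
    name="plain-cot",
    prompt_chain=["Think step by step, then answer: {question}"],
)
print(regime)
```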