Gregor Betz committed
Commit 44ef4de
1 Parent(s): 689a5b6
Files changed (1)
  1. src/display/about.py +27 -3
src/display/about.py CHANGED
@@ -26,15 +26,38 @@ TITLE = """<h1 align="center" id="space-title"><code>/\/</code> &nbsp; Open CoT
 
 # What does your leaderboard evaluate?
 INTRODUCTION_TEXT = """
-Intro text
+The `/\/` Open CoT Leaderboard tracks the reasoning skills of LLMs, measured as their ability to generate **effective chain-of-thought reasoning traces**.
+
+The leaderboard reports **accuracy gains** achieved by using CoT, i.e.
+
+> _accuracy gain Δ_ = _CoT accuracy_ - _baseline accuracy_.
+
+See the "About" tab for more details and motivation.
 """
 
 # Which evaluations are you running? how can people reproduce what you have?
-LLM_BENCHMARKS_TEXT = f"""
+LLM_BENCHMARKS_TEXT = """
 ## How it works
 
+A CoT `regime` consists of a prompt chain and the decoding parameters used to generate a reasoning trace. To assess the reasoning skill of a given `model`, we carry out the following steps for each `task` (test dataset) and each `regime`:
+
+1. Generate CoT reasoning traces for all problems in the test dataset with `model` and according to `regime`.
+2. Let the model answer the test dataset problems and record the resulting _baseline accuracy_.
+3. Let the model answer the test dataset problems _with the reasoning traces appended_ to the prompt and record the resulting _CoT accuracy_.
+4. Compute the _accuracy gain Δ_ = _CoT accuracy_ - _baseline accuracy_ for the given `model`, `task`, and `regime`.
+
+Each regime yields a different accuracy gain Δ; the leaderboard reports the best Δ achieved by any regime (see the sketch below).
+
+
+## How is it different from other leaderboards?
+
+...
+
+## Test dataset selection (`tasks`)
+
+
 ## Reproducibility
-To reproduce our results, here is the commands you can run:
+To reproduce our results, check out the repository [cot-eval](https://github.com/logikon-ai/cot-eval).
 
 """
 
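To make steps 2-4 of the procedure above concrete, here is a minimal Python sketch of the bookkeeping. Everything in it is illustrative: `RegimeResult`, the regime names, and the accuracy numbers are hypothetical stand-ins, and the LLM calls of steps 1-3 (trace generation, answering) are assumed to have already produced the accuracies; the actual evaluation is done by the cot-eval pipeline.

```python
from dataclasses import dataclass


@dataclass
class RegimeResult:
    """Accuracies measured for one (model, task) pair under one CoT regime."""
    regime: str
    baseline_accuracy: float  # step 2: accuracy without a reasoning trace
    cot_accuracy: float       # step 3: accuracy with the trace appended


def accuracy_gain(r: RegimeResult) -> float:
    """Step 4: accuracy gain Δ = CoT accuracy - baseline accuracy."""
    return r.cot_accuracy - r.baseline_accuracy


# Toy numbers for illustration only.
results = [
    RegimeResult("regime-A", baseline_accuracy=0.55, cot_accuracy=0.62),  # Δ = +0.07
    RegimeResult("regime-B", baseline_accuracy=0.55, cot_accuracy=0.59),  # Δ = +0.04
]

# The leaderboard entry for this (model, task) pair is the best Δ across regimes.
best = max(results, key=accuracy_gain)
print(f"best regime: {best.regime}, Δ = {accuracy_gain(best):+.2f}")
```

With these toy numbers, regime-A gives Δ = 0.62 - 0.55 = +0.07, and that best Δ is what the leaderboard would report for this model/task pair.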
@@ -75,4 +98,5 @@ We're populating the Open CoT Leaderboard step by step. The idea is to grow a di
 
 CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results"
 CITATION_BUTTON_TEXT = r"""
+Logikon AI Team. (2024). Open CoT Leaderboard. Retrieved from https://huggingface.co/spaces/logikon/open_cot_leaderboard
 """