mfajcik committed on
Commit
7d29744
•
1 Parent(s): ef3bb57

Update content.py

Files changed (1)
  1. content.py +15 -6
content.py CHANGED
@@ -7,8 +7,14 @@ HEADER_MARKDOWN = """
  Welcome to the leaderboard!
  Here you can compare models on tasks in the Czech language and/or submit your own model. We use our modified fork of [lm-evaluation-harness](https://github.com/DCGM/lm-evaluation-harness) to evaluate every model under the same protocol.
 
+
  - Head to the **Submission** page to learn about submission details.
- - See the **About** page for a brief description of our evaluation protocol & win score mechanism, citation information, and future directions for this benchmark.
+ - See the **About** page for a brief description of our evaluation protocol & win score mechanism, citation information, and future directions for this benchmark.
+ - __How scoring works__:
+   - On each task, the __Duel Win Score__ reports the proportion of won duels.
+   - Category scores are obtained by averaging across the tasks in a category.
+   - The __Average__ Duel Win Score is the average over category scores.
+ - All public submissions are shared in the [CZLC/LLM_benchmark_data](https://huggingface.co/datasets/CZLC/LLM_benchmark_data) dataset.
  - On the **Submission** page, __you can obtain results on the leaderboard without publishing them__.
  - The first step is "pre-submission"; after it is done (significance tests can take up to an hour), the results can be submitted if you'd like to.
 
@@ -87,7 +93,6 @@ We use the following metrics for the following tasks:
  - Fixed-class Classification: average area under the curve (one-vs-all average)
  - Multichoice Classification: accuracy
  - Question Answering: exact match
- - Summarization: rouge-raw (2-gram)
  - Language Modeling: word-level perplexity
 
  On every task, for every metric, we compute a test for statistical significance at α=0.05, i.e., the probability that the performance of model A is equal to the performance of model B is estimated to be less than 0.05.
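As a side note on the first metric in the list above, "average area under the curve (one-vs-all average)" corresponds to a macro-averaged one-vs-rest ROC-AUC. A minimal sketch with scikit-learn follows; the labels and probabilities are invented for illustration and are not taken from the evaluation harness.

```python
# Illustrative only: macro-averaged one-vs-rest ROC-AUC for a 3-class task.
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 2, 1, 2, 0, 1])   # gold class labels (made up)
y_prob = np.array([                      # model's class probabilities (made up)
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.2, 0.6, 0.2],
    [0.3, 0.3, 0.4],
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
])

# One-vs-rest AUC per class, averaged over classes ("one-vs-all average").
auc = roc_auc_score(y_true, y_prob, multi_class="ovr", average="macro")
print(f"macro one-vs-rest AUC: {auc:.3f}")
```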
@@ -97,8 +102,12 @@ We use the following tests, with varying statistical power:
  - summarization & perplexity: bootstrapping.
 
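To make the bootstrapping test above concrete, here is a rough sketch of a paired bootstrap comparison of two models' per-example scores at α=0.05. It is purely illustrative and not the leaderboard's actual implementation; the score arrays, variable names, and resample count are assumptions.

```python
# Illustrative paired bootstrap test: is model A significantly better than model B?
# Per-example scores and the number of resamples below are made up.
import numpy as np

rng = np.random.default_rng(0)

scores_a = rng.normal(0.55, 0.1, size=500)   # e.g., per-document scores for model A
scores_b = rng.normal(0.50, 0.1, size=500)   # e.g., per-document scores for model B
diffs = scores_a - scores_b                  # paired differences on the same examples

n_resamples = 10_000
boot_means = np.array([
    rng.choice(diffs, size=len(diffs), replace=True).mean()
    for _ in range(n_resamples)
])

# One-sided p-value: how often the resampled mean difference is <= 0.
p_value = np.mean(boot_means <= 0.0)
print(f"p = {p_value:.4f}, significant at alpha=0.05: {p_value < 0.05}")
```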
  ### Duel Scoring Mechanism, Win Score
- - On each task, each model is compared to each other model (up to the top-50 currently submitted models). For each model, we record the proportion of won duels: the **Win Score** (WS).
- - Next, the **Category Win Score** (CWS) is computed as an average over the model's WSs in that category. Similarly, the 🇨🇿 **BenCzechMark Win Score** is computed as the model's average CWS across categories.
+ - We refer to each test of significance as a **duel**. Model A _wins_ the duel if it is significantly better than model B.
+ - On each task, models are compared in duels with up to the top-50 currently submitted models.
+ - For each model, the **Win Score** (WS) is calculated as the proportion of duels won by that model.
+ - The **Category Win Score** (CWS) is the average of a model's WS across all tasks in a specific category.
+ - The 🇨🇿 **BenCzechMark Win Score** is the overall score for a model, computed as the average of its CWS across all categories.
+
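A minimal sketch of the aggregation described by the new bullets above (and by the "How scoring works" summary in the header). Everything here is illustrative: the duel outcomes, task-to-category mapping, and function names are assumptions, not the leaderboard's code.

```python
# Illustrative aggregation of duel results into Win Score, Category Win Score,
# and the overall BenCzechMark Win Score. Data and names are made up.
from statistics import mean

# duel_wins[task][model] = outcomes of duels against other models (True = won).
duel_wins = {
    "sentiment":  {"model-x": [True, True, False],  "model-y": [False, True, False]},
    "qa":         {"model-x": [True, False, False], "model-y": [True, True, True]},
    "perplexity": {"model-x": [True, True, True],   "model-y": [False, False, True]},
}
categories = {
    "classification":    ["sentiment"],
    "understanding":     ["qa"],
    "language_modeling": ["perplexity"],
}

def win_score(task: str, model: str) -> float:
    """Win Score (WS): proportion of duels the model won on one task."""
    outcomes = duel_wins[task][model]
    return sum(outcomes) / len(outcomes)

def category_win_score(category: str, model: str) -> float:
    """Category Win Score (CWS): average WS over the tasks in a category."""
    return mean(win_score(task, model) for task in categories[category])

def benczechmark_win_score(model: str) -> float:
    """Overall score: average CWS across all categories."""
    return mean(category_win_score(cat, model) for cat in categories)

for m in ("model-x", "model-y"):
    print(m, round(benczechmark_win_score(m), 3))
```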
  The properties of this ranking mechanism include:
  - Ranking can change after every submission.
  - The across-task aggregation is interpretable: in words, it measures the average proportion of duels in which the model is better.
@@ -108,14 +117,14 @@ The properties of this ranking mechanism include:
  The models submitted to the leaderboard by the authors were evaluated in the following setup:
  - max input length: 2048 tokens
  - number of shown examples (few-shot mechanism): 3-shot
- - truncation: smart truncation
+ - truncation: smart truncation (few-shot samples are truncated before the task description)
  - log-probability aggregation: average-pooling
  - chat templates: not used
 
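A quick illustration of the "log-probability aggregation: average-pooling" setting above: when scoring answer options by log-probability, each option can be scored by the mean of its token log-probs rather than their sum, so longer options are not penalized for having more tokens. The snippet below is a hedged sketch of that idea, not the harness's actual implementation; the options and their token log-probs are invented.

```python
# Illustrative average-pooling of token log-probabilities when scoring answer options.
# Not the evaluation harness's actual code; the numbers below are made up.

# Per-token log-probs the model assigned to each candidate answer's tokens.
option_token_logprobs = {
    "ano":            [-0.9, -0.8],                   # short option, 2 tokens
    "ano, souhlasim": [-0.7, -0.6, -0.8, -0.7],       # longer option, 4 tokens
}

def sum_pool(logprobs):       # total log-probability (tends to favor short options)
    return sum(logprobs)

def average_pool(logprobs):   # mean per-token log-probability (length-normalized)
    return sum(logprobs) / len(logprobs)

for option, lps in option_token_logprobs.items():
    print(f"{option!r}: sum={sum_pool(lps):.2f}, average={average_pool(lps):.2f}")

# With average-pooling, the predicted answer is the option with the highest mean;
# here the two pooling strategies pick different options.
best = max(option_token_logprobs, key=lambda o: average_pool(option_token_logprobs[o]))
print("predicted with average-pooling:", best)
```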
  ## Citation
  You can use the following citation for this leaderboard and our upcoming work.
  ```bibtex
- @article{fajcik2024benczechmark,
+ @article{2024benczechmark,
  title = {{B}en{C}zech{M}ark: A Czech-centric Multitask and Multimetric Benchmark for Language Models with Duel Scoring Mechanism},
  author = {Martin Fajcik and Martin Docekal and Jan Dolezal and Karel Ondrej and Karel Benes and Jan Kapsa and Michal Hradis and Zuzana Neverilova and Ales Horak and Michal Stefanik and Adam Jirkovsky and David Adamczyk and Jan Hula and Jan Sedivy and Hynek Kydlicek},
  year = {2024},