Update content.py

content.py  CHANGED  (+15 -6)
@@ -7,8 +7,14 @@ HEADER_MARKDOWN = """
 Welcome to the leaderboard!
 Here you can compare models on tasks in the Czech language and/or submit your own model. We use our modified fork of [lm-evaluation-harness](https://github.com/DCGM/lm-evaluation-harness) to evaluate every model under the same protocol.
 
+
 - Head to the **Submission** page to learn about submission details.
-- See the **About** page for a brief description of our evaluation protocol & win score mechanism, citation information, and future directions for this benchmark.
+- See the **About** page for a brief description of our evaluation protocol & win score mechanism, citation information, and future directions for this benchmark.
+- __How scoring works__:
+  - On each task, the __Duel Win Score__ reports the proportion of won duels.
+  - Category scores are obtained by averaging across the tasks in a category.
+  - The __Average__ Duel Win Score is an average over category scores.
+- All public submissions are shared in the [CZLC/LLM_benchmark_data](https://huggingface.co/datasets/CZLC/LLM_benchmark_data) dataset.
 - On the Submission page, __you can obtain results on the leaderboard without publishing them__.
 - The first step is "pre-submission"; after this is done (significance tests can take up to an hour), the results can be submitted if you'd like to.
 
@@ -87,7 +93,6 @@ We use the following metrics for the following tasks:
 - Fixed-class Classification: average area under the curve (one-vs-all average)
 - Multichoice Classification: accuracy
 - Question Answering: exact match
-- Summarization: rouge-raw (2-gram)
 - Language Modeling: word-level perplexity
 
 On every task, for every metric we compute a test of statistical significance at α=0.05, i.e., the probability that the performance of model A is equal to the performance of model B is estimated to be less than 0.05.
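
The hunk above states that every task/metric pair is tested for statistical significance at α=0.05, and the next hunk notes that summarization and perplexity use bootstrapping. As a rough illustration of what such a duel decision can look like, here is a minimal paired-bootstrap sketch; the function name, score lists, and resample count are assumptions made for the example, not the actual lm-evaluation-harness code:

```python
import numpy as np

def paired_bootstrap_pvalue(scores_a, scores_b, n_resamples=10_000, seed=0):
    """Estimate how often model A fails to beat model B under resampling.

    `scores_a` / `scores_b` are per-example metric values (e.g. ROUGE or
    negative log-perplexity) for the same test examples; higher = better.
    """
    rng = np.random.default_rng(seed)
    diffs = np.asarray(scores_a, dtype=float) - np.asarray(scores_b, dtype=float)
    n = len(diffs)
    # Resample example indices with replacement; count resamples in which
    # the mean score difference does not favour model A.
    idx = rng.integers(0, n, size=(n_resamples, n))
    resampled_means = diffs[idx].mean(axis=1)
    return float((resampled_means <= 0.0).mean())

# Model A wins the duel on this task if the estimated probability of it
# not being better than model B falls below alpha = 0.05.
alpha = 0.05
a_wins_duel = paired_bootstrap_pvalue(
    [0.31, 0.28, 0.40, 0.35, 0.33],
    [0.27, 0.30, 0.33, 0.29, 0.30],
) < alpha
```

The other task types use different tests (with different statistical power), but the resulting estimate is compared against the same α=0.05 threshold.
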
@@ -97,8 +102,12 @@ We use the following tests, with varying statistical power:
 - summarization & perplexity: bootstrapping.
 
 ### Duel Scoring Mechanism, Win Score
-
-
+- We refer to each test of significance as a **duel**. Model A _wins_ the duel if it is significantly better than model B.
+- On each task, models are compared in duels with up to the top-50 currently submitted models.
+- For each model, the **Win Score** (WS) is calculated as the proportion of duels won by that model.
+- The **Category Win Score** (CWS) is the average of a model's WS across all tasks in a specific category.
+- The 🇨🇿 **BenCzechMark Win Score** is the overall score for a model, computed as the average of its CWS across all categories.
+
 The properties of this ranking mechanism include:
 - The ranking can change after every submission.
 - The across-task aggregation is interpretable: in words, it measures the average proportion of times the model is better.
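
Concretely, the Win Score aggregation added in the hunk above is plain averaging at three levels: per task, per category, and overall. A minimal sketch, assuming duel outcomes per task are already available as booleans; the data layout and names below are illustrative, not the leaderboard's actual code:

```python
from collections import defaultdict

def win_score(duel_outcomes):
    """Win Score (WS): proportion of duels won on one task."""
    return sum(duel_outcomes) / len(duel_outcomes)

def benczechmark_win_score(duels_by_task, task_to_category):
    """Aggregate one model's duel outcomes into WS, CWS, and the overall score.

    duels_by_task:    {task: [True, False, ...]}  duel results for this model
    task_to_category: {task: category}
    """
    # Task-level Win Score.
    ws = {task: win_score(outcomes) for task, outcomes in duels_by_task.items()}

    # Category Win Score (CWS): average WS over the tasks of each category.
    by_category = defaultdict(list)
    for task, score in ws.items():
        by_category[task_to_category[task]].append(score)
    cws = {cat: sum(s) / len(s) for cat, s in by_category.items()}

    # Overall BenCzechMark Win Score: average of the category scores.
    overall = sum(cws.values()) / len(cws)
    return ws, cws, overall
```

Because every level is an unweighted mean of proportions, the overall number keeps the interpretation noted in the hunk above: the average proportion of duels the model wins.
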
@@ -108,14 +117,14 @@ The properties of this ranking mechanism include:
 The models submitted to the leaderboard by the authors were evaluated in the following setup:
 - max input length: 2048 tokens
 - number of shown examples (few-shot mechanism): 3-shot
-- truncation: smart truncation
+- truncation: smart truncation (few-shot samples are truncated before the task description)
 - log-probability aggregation: average-pooling
 - chat templates: not used
 
 ## Citation
 You can use the following citation for this leaderboard and our upcoming work.
 ```bibtex
-@article{
+@article{2024benczechmark,
 title = {{B}en{C}zech{M}ark: A Czech-centric Multitask and Multimetric Benchmark for Language Models with Duel Scoring Mechanism},
 author = {Martin Fajcik and Martin Docekal and Jan Dolezal and Karel Ondrej and Karel Benes and Jan Kapsa and Michal Hradis and Zuzana Neverilova and Ales Horak and Michal Stefanik and Adam Jirkovsky and David Adamczyk and Jan Hula and Jan Sedivy and Hynek Kydlicek},
 year = {2024},
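
On "log-probability aggregation: average-pooling" from the evaluation setup above: for likelihood-scored tasks this means the per-token log-probabilities of a candidate answer are averaged rather than summed, so longer options are not penalised merely for having more tokens. A minimal sketch of the idea, with made-up numbers and function names that are not the harness's API:

```python
def average_logprob(token_logprobs):
    """Average-pooled (length-normalised) log-probability of one answer option."""
    return sum(token_logprobs) / len(token_logprobs)

def pick_option(options_token_logprobs):
    """Pick the option whose tokens are, on average, most probable."""
    scores = [average_logprob(lps) for lps in options_token_logprobs]
    return scores.index(max(scores))

# Toy example: summation would prefer option B (-0.6 > -0.8), while
# average pooling prefers option A (-0.2 > -0.3) despite its extra tokens.
option_a = [-0.2, -0.2, -0.2, -0.2]   # 4 tokens, sum -0.8, mean -0.2
option_b = [-0.3, -0.3]               # 2 tokens, sum -0.6, mean -0.3
assert pick_option([option_a, option_b]) == 0
```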