"""
This file contains the text content for the leaderboard client.
"""
HEADER_MARKDOWN = """
# 🇨🇿 BenCzechMark
Welcome to the leaderboard!
Here, you can compare models on tasks in the Czech language or submit your own model. We use a modified fork of [lm-evaluation-harness](https://github.com/DCGM/lm-evaluation-harness) to evaluate every model under the same protocol.
- Visit the **Submission** page to learn how to submit your model.
- Check out the **About** page for a brief overview of our evaluation protocol, win score mechanism, citation details, and future plans for this benchmark.
- __How scoring works__:
- For each task, the __Duel Win Score__ reflects the proportion of duels a model has won.
- Category scores are calculated by averaging scores across all tasks within that category. When viewing a specific category (other than Overall), the "Average" column displays the Category Duel Win Scores.
- The __Overall__ Duel Win Score is the average across all category scores. When selecting the Overall category, the "Average" column shows the Overall Duel Win Score.
- All public submissions are available in the [CZLC/LLM_benchmark_data](https://huggingface.co/datasets/CZLC/LLM_benchmark_data) dataset.
- On the submission page, __you can view your model's results on the leaderboard without publishing them__.
- The first step is "pre-submission." After this is complete (significance tests may take up to an hour), you can choose to submit the results if you wish.
- NEWS:
- 1.10.2024: Find out more about 🇨🇿 BenCzechMark in our [Hugging Face blog post](https://huggingface.co/blog/benczechmark)!
"""
LEADERBOARD_TAB_TITLE_MARKDOWN = """
"""
SUBMISSION_TAB_TITLE_MARKDOWN = """
## How to submit
1. Head over to our modified fork of [lm-evaluation-harness](https://github.com/DCGM/lm-evaluation-harness).
Follow the instructions there and evaluate your model on all 🇨🇿 BenCzechMark tasks, logging your lm harness outputs into a designated folder.
2. Use the script from our [benczechmark-leaderboard](https://github.com/MFajcik/benczechmark-leaderboard) repository to process the log files from your designated folder into a single compact submission file that contains everything we need.
Example usage:
- Download sample outputs for csmpt7b from [csmpt_logdir.zip](https://czechllm.fit.vutbr.cz/csmpt7b/sample_results/csmpt_logdir.zip).
- Unzip.
- Run the script from the leaderboard repository with Python (requires the `jsonlines` and `tqdm` libraries):
```bash
git clone https://github.com/MFajcik/benczechmark-leaderboard.git
cd benczechmark-leaderboard/
export PYTHONPATH=$(pwd)
python -m leaderboard.compile_log_files \
-i "<your_local_path_to_folder>/csmpt_logdir/csmpt/eval_csmpt7b*" \
-o "<your_local_path_to_outfolder>/sample_submission.json"
```
3. Upload your file and fill in the form below!
## Submission
To submit your model, please fill in the form below.
- *Team name:* The name of your team, as it will appear on the leaderboard
- *Model name:* The name of your model
- *Model type:* The type of your model (chat, pretrained, ensemble)
- *Parameters (B):* The number of parameters of your model in billions (10⁹)
- *Input length (# tokens):* The number of input tokens that led to the results
- *Precision:* The precision with which the results were obtained
- *Description:* Short description of your submission (optional)
- *Link to model:* Link to the model's repository or documentation
- *Upload your results:* Results json file to submit
After filling in the form, click the **Pre-submit model** button.
This will run a comparison of your model with the existing leaderboard models.
After the tournament is complete, you will be able to submit your model to the leaderboard.
"""
RANKING_AFTER_SUBMISSION_MARKDOWN = """
This is what the ranking will look like after your submission:
"""
SUBMISSION_DETAILS_MARKDOWN = """
Do you really want to submit a model? This action is irreversible.
"""
MORE_DETAILS_MARKDOWN = """
Here you can view how the selected model won or lost duels against all other models in the selected 🇨🇿 BenCzechMark category.
"""
MODAL_SUBMIT_MARKDOWN = """
Are you sure you want to submit your model?
"""
ABOUT_MARKDOWN = """
## Abstract
We present **B**en**C**zech**M**ark (BCM), the first multitask and multimetric Czech language benchmark for large language models with a unique scoring system that utilizes the theory of statistical significance. Our benchmark covers 54 challenging, mostly native Czech tasks spanning across 11 categories, including diverse domains such as historical Czech, pupil and language learner essays, and spoken word.
Furthermore, we collect and clean the [BUT-Large Czech Collection](https://huggingface.co/datasets/BUT-FIT/BUT-LCC), the largest publicly available clean Czech language corpus, and continuously pretrain the first Czech-centric 7B language model [CSMPT7B](https://huggingface.co/BUT-FIT/csmpt7b), with Czech-specific tokenization. We use our model as a baseline for comparison with publicly available multilingual models.
## Methodology
While we will reveal more details in our upcoming work, here is how the leaderboard ranking works in a nutshell.
### Prompting Mechanism
Each task (except for tasks from the language modelling category) is composed of 5 or more prompts. The performance of every model on a task is then max-pooled over that task's prompts (the best-performing prompt counts).
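A minimal sketch of this max-pooling, assuming that the best score among a task's prompts is the one that counts and that a higher score is better. The prompt names and scores are hypothetical and only illustrate the idea; they are not leaderboard results.

```python
# Hypothetical per-prompt scores of one model on one task (illustrative values).
prompt_scores = {
    "prompt_1": 0.61,
    "prompt_2": 0.58,
    "prompt_3": 0.66,
    "prompt_4": 0.63,
    "prompt_5": 0.60,
}


def max_pool_over_prompts(scores):
    # Return the best-performing prompt together with its score.
    best_prompt = max(scores, key=scores.get)
    return best_prompt, scores[best_prompt]


best_prompt, task_score = max_pool_over_prompts(prompt_scores)
print(best_prompt, task_score)
```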
### Metrics and Significance Testing
We use the following metrics for the following task types:
- Fixed-class Classification: average area under the curve (one-vs-all average)
- Multichoice Classification: accuracy
- Question Answering: exact match
- Language Modeling: word-level perplexity
On every task, for every metric, we compute a test of statistical significance at α=0.05, i.e., the probability that the performance of model A is equal to the performance of model B is estimated to be less than 0.05.
We use the following tests, with varying statistical power:
- accuracy and exact-match: one-tailed paired t-test,
- average area under the curve: Bayesian test inspired by [Goutte et al., 2005](https://link.springer.com/chapter/10.1007/978-3-540-31865-1_25),
- summarization & perplexity: bootstrapping.
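To illustrate the bootstrapping case, here is a minimal sketch of a one-sided paired bootstrap test. It is not the exact leaderboard procedure: the function names, the number of resamples, the fixed seed, and the assumption that a higher per-example score is better are all illustrative choices (for perplexity, where lower is better, the difference would be flipped).

```python
import random

ALPHA = 0.05  # significance level used by the leaderboard


def paired_bootstrap_p_value(scores_a, scores_b, n_resamples=2000, seed=0):
    # Estimate how often model A fails to beat model B when the paired
    # test examples are resampled with replacement.
    if len(scores_a) != len(scores_b):
        raise ValueError("paired scores must have the same length")
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    not_better = 0
    for _ in range(n_resamples):
        resample = [diffs[rng.randrange(n)] for _ in range(n)]
        if sum(resample) <= 0:
            not_better += 1
    return not_better / n_resamples


def wins_duel(scores_a, scores_b, alpha=ALPHA):
    # Model A wins only if it is significantly better than model B.
    return paired_bootstrap_p_value(scores_a, scores_b) < alpha
```

With identical per-example scores, every resampled difference is zero, so the estimated p-value is 1.0 and neither model wins.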
### Duel Scoring Mechanism, Win Score
- We refer to each test of significance as a **duel**. Model A _won_ the duel if it is significantly better than model B.
- On each task, models are compared in duels with up to the top-50 currently submitted models.
- For each model, the **Win Score** (WS) is calculated as the proportion of duels won by that model.
- The **Category Win Score** (CWS) is the average of a model's WS across all tasks in a specific category.
- The 🇨🇿 **BenCzechMark Win Score** is the overall score for a model, computed as the average of its CWS across all categories.
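The aggregation above can be sketched as follows. The category names, task names, and duel outcomes are hypothetical; only the averaging scheme (WS per task, CWS per category, overall score across categories) follows the description.

```python
from statistics import mean

# Hypothetical duel outcomes of one model: category -> task -> list of duels
# (True means the model won that duel).
duels = {
    "category_1": {
        "task_a": [True, True, False, True],
        "task_b": [True, False],
    },
    "category_2": {
        "task_c": [False, False, True, True],
    },
}


def win_score(outcomes):
    # WS: proportion of duels won on a single task.
    return sum(outcomes) / len(outcomes)


def category_win_score(tasks):
    # CWS: average WS across all tasks of one category.
    return mean(win_score(outcomes) for outcomes in tasks.values())


def overall_win_score(categories):
    # Overall score: average CWS across all categories.
    return mean(category_win_score(tasks) for tasks in categories.values())


print(category_win_score(duels["category_1"]))  # 0.625
print(overall_win_score(duels))  # 0.5625
```

Because the overall score averages category scores rather than task scores, a category with few tasks weighs as much as a category with many.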
The properties of this ranking mechanism include:
- Ranking can change after every submission.
- The across-task aggregation is interpretable: in words, it measures the average proportion of times the model is better.
- It allows utilizing a wide spectrum of existing resources, evaluated under different metrics.
## Baseline Setup
The models submitted to the leaderboard by the authors were evaluated in the following setup:
- max input length: 2048 tokens
- number of shown examples (few-shot mechanism): 3-shot
- truncation: smart truncation (few-shot samples are truncated before the task description)
- log-probability aggregation: average-pooling
- chat templates: not used
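As an illustration of the log-probability aggregation setting, here is a minimal sketch under one assumed reading of average-pooling: the per-token log-probabilities of each candidate answer are averaged before the candidates are compared, so longer answers are not penalized for their length. The function names and the numbers are hypothetical, and the evaluation harness may implement this differently.

```python
def average_log_prob(token_log_probs):
    # Average-pooling: mean of the per-token log-probabilities.
    return sum(token_log_probs) / len(token_log_probs)


def pick_answer(candidates, aggregate=average_log_prob):
    # candidates maps each answer option to its per-token log-probabilities.
    return max(candidates, key=lambda answer: aggregate(candidates[answer]))


# Hypothetical per-token log-probabilities for two answer options.
candidates = {
    "short": [-1.2],
    "long answer": [-0.5, -0.5, -0.5],
}

print(pick_answer(candidates))  # average-pooling prefers "long answer"
print(pick_answer(candidates, aggregate=sum))  # summing prefers "short"
```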
## Citation
You can use the following citation for this leaderboard and our upcoming work.
```bibtex
@article{2024benczechmark,
title = {{B}en{C}zech{M}ark: A Czech-centric Multitask and Multimetric Benchmark for Language Models with Duel Scoring Mechanism},
author = {Martin Fajcik and Martin Docekal and Jan Dolezal and Karel Ondrej and Karel Benes and Jan Kapsa and Michal Hradis and Zuzana Neverilova and Ales Horak and Michal Stefanik and Adam Jirkovsky and David Adamczyk and Jan Hula and Jan Sedivy and Hynek Kydlicek},
year = {2024},
url = {https://huggingface.co/spaces/CZLC/BenCzechMark},
institution = {Brno University of Technology, Masaryk University, Czech Technical University in Prague, Hugging Face},
}
```
## Authors & Correspondence
- **BenCzechMark Authors & Contributors:**
- **BUT FIT**
- Martin Fajčík
- Martin Dočekal
- Jan Doležal
- Karel Ondřej
- Karel Beneš
- Jan Kapsa
- Michal Hradiš
- **FI MUNI**
- Zuzana Nevěřilová
- Aleš Horák
- Michal Štefánik
- **CIIRC CTU**
- Adam Jirkovský
- David Adamczyk
- Jan Hůla
- Jan Šedivý
- **Hugging Face**
- Hynek Kydlíček
- **Leaderboard Authors & Contributors:**
- Jan Doležal - Coding and Troubleshooting
- Martin Fajčík - Management & Debugging
- Alexander Polok, Jakub Štetina - Leaderboard Version 0.1
**Correspondence to:**
- Martin Fajčík
- Brno University of Technology, Brno, Czech Republic
- Email: [[email protected]](mailto:[email protected])
"""