from dataclasses import dataclass
from enum import Enum


NUM_FEWSHOT = 0  # Number of few-shot examples; change to match your evaluation setting
# ---------------------------------------------------

TITLE = """<h1>🇹🇭 Thai LLM Leaderboard</h1>"""



INTRODUCTION_TEXT = """
The Thai LLM Leaderboard 🇹🇭 aims to standardize evaluation methods for large language models (LLMs) in the Thai language, building on <a href="https://github.com/SEACrowd">SEACrowd</a>.
As part of an open community project, we welcome you to submit new evaluation tasks or models.
This leaderboard is developed in collaboration with <a href="https://www.scb10x.com">SCB 10X</a>, <a href="https://www.vistec.ac.th/">Vistec</a>, and <a href="https://github.com/SEACrowd">SEACrowd</a>. Read more in the <a href="https://blog.opentyphoon.ai/introducing-the-thaillm-leaderboard-thaillm-evaluation-ecosystem-508e789d06bf">introduction blog post</a>.
"""

LLM_BENCHMARKS_TEXT = """
The leaderboard currently consists of the following benchmarks:
- <b>Exam</b>
  - <a href="https://huggingface.co/datasets/scb10x/thai_exam">ThaiExam</a>: ThaiExam is a Thai-language benchmark based on examinations for high-school students and investment professionals in Thailand.
  - <a href="https://arxiv.org/abs/2306.05179">M3Exam</a>: M3Exam is a novel benchmark sourced from authentic and official human exam questions for evaluating LLMs in a multilingual, multimodal, and multilevel context. This leaderboard uses the Thai subset of M3Exam.
- <b>LLM-as-a-Judge</b>
  - <a href="https://huggingface.co/datasets/ThaiLLM-Leaderboard/mt-bench-thai">Thai MT-Bench</a>: A Thai version of <a href="https://arxiv.org/abs/2306.05685">MT-Bench</a>, developed by VISTEC specifically to probe Thai generative skills using the LLM-as-a-judge method.
- <b>NLU</b>
  - <a href="https://huggingface.co/datasets/facebook/belebele">Belebele</a>: Belebele is a multiple-choice machine reading comprehension (MRC) dataset spanning 122 language variants; this leaderboard uses the Thai subset.
  - <a href="https://huggingface.co/datasets/facebook/xnli">XNLI</a>: XNLI is an evaluation corpus for language transfer and cross-lingual sentence classification in 15 languages. This leaderboard uses the Thai subset of this corpus.
  - <a href="https://huggingface.co/datasets/cambridgeltl/xcopa">XCOPA</a>: XCOPA is a translated and re-annotated version of the English COPA corpus covering 11 languages, designed to measure commonsense reasoning ability in non-English languages. This leaderboard uses the Thai subset of this corpus.
  - <a href="https://huggingface.co/datasets/pythainlp/wisesight_sentiment">Wisesight</a>: The Wisesight sentiment analysis corpus contains Thai-language social media messages annotated with sentiment labels.
- <b>NLG</b>
  - <a href="https://huggingface.co/datasets/csebuetnlp/xlsum">XLSum</a>: XLSum is a comprehensive and diverse dataset comprising 1.35 million professionally annotated article-summary pairs from the BBC, used to evaluate summarization performance in non-English languages. This leaderboard uses the Thai subset.
  - <a href="https://huggingface.co/datasets/SEACrowd/flores200">Flores200</a>: FLORES is a machine translation benchmark dataset used to evaluate translation quality between English and low-resource languages. This leaderboard uses the Thai subset of Flores200.
  - <a href="https://huggingface.co/datasets/iapp/iapp_wiki_qa_squad">iapp Wiki QA Squad</a>: iapp Wiki QA Squad is an extractive question-answering dataset derived from Thai Wikipedia articles. A sketch of loading these Thai subsets appears after this list.
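
Most of the datasets above can be pulled directly from the Hugging Face Hub. The snippet below is a minimal sketch, not the leaderboard's actual loading code; the config names (`onet`, `tha_Thai`, `th`, `thai`) are assumptions to verify against each dataset card.

```python
from datasets import load_dataset

# Illustrative only: config names are assumptions -- check each dataset card.
thai_exam = load_dataset("scb10x/thai_exam", "onet")         # one ThaiExam subset
belebele_th = load_dataset("facebook/belebele", "tha_Thai")  # Belebele, Thai variant
xnli_th = load_dataset("facebook/xnli", "th")                # XNLI, Thai subset
xcopa_th = load_dataset("cambridgeltl/xcopa", "th")          # XCOPA, Thai subset
xlsum_th = load_dataset("csebuetnlp/xlsum", "thai")          # XLSum, Thai subset
wisesight = load_dataset("pythainlp/wisesight_sentiment")    # Wisesight sentiment
```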


<b>Metric Implementation Details</b>:
- Multiple-choice accuracy is calculated using the <a href="https://github.com/SEACrowd/seacrowd-experiments/blob/048536fc0d4614734d479b298ea00a1f520da42b/evaluation/main_nlu_prompt_batch.py#L71">SEACrowd implementation</a> of logits comparison, similar to the method used by the <a href="https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard">Open LLM Leaderboard</a> (<a href="https://github.com/EleutherAI/lm-evaluation-harness">EleutherAI Harness</a>); see this <a href="https://huggingface.co/blog/open-llm-leaderboard-mmlu">explanation</a> for details.
- BLEU is calculated with sacreBLEU's flores200 tokenizer via the Hugging Face `evaluate` <a href="https://huggingface.co/spaces/evaluate-metric/sacrebleu">implementation</a>.
- ROUGE-L is calculated using the PyThaiNLP newmm tokenizer and the Hugging Face `evaluate` <a href="https://huggingface.co/spaces/evaluate-metric/rouge">implementation</a>.
- LLM-as-a-judge ratings are produced by OpenAI's gpt-4o-2024-05-13, using the prompt defined in <a href="https://github.com/lm-sys/FastChat/blob/main/fastchat/llm_judge/data/judge_prompts.jsonl">lmsys MT-Bench</a>. Simplified sketches of these metric calls follow this list.
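
As a rough illustration of the logits-comparison method (a minimal sketch, not the exact SEACrowd code): each candidate answer is scored by the total log-probability the model assigns to its tokens after the prompt, and the highest-scoring option becomes the prediction. The model name and prompt format below are placeholders.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def option_logprob(prompt: str, option: str) -> float:
    # Sum of log-probabilities of the option tokens, conditioned on the prompt.
    # Simplification: assumes tokenizing prompt + option splits at the prompt boundary.
    prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tok(prompt + option, return_tensors="pt").input_ids
    with torch.no_grad():
        log_probs = torch.log_softmax(model(full_ids).logits[0, :-1], dim=-1)
    # Logits at position i-1 predict the token at position i.
    option_ids = full_ids[0, prompt_len:]
    positions = range(prompt_len - 1, full_ids.shape[1] - 1)
    return sum(log_probs[p, t].item() for p, t in zip(positions, option_ids))

prompt = "Question: ... Answer:"
choices = [" A", " B", " C", " D"]
prediction = max(choices, key=lambda c: option_logprob(prompt, c))
```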

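The remaining metric calls can be sketched with the Hugging Face `evaluate` library and the OpenAI client. The strings, judge prompt, and parsing below are illustrative placeholders rather than the pipeline's exact code; the real judge prompts live in the FastChat file linked above.

```python
import evaluate
from pythainlp.tokenize import word_tokenize
from openai import OpenAI

# BLEU via sacreBLEU with its flores200 tokenizer.
sacrebleu = evaluate.load("sacrebleu")
bleu = sacrebleu.compute(
    predictions=["..."], references=[["..."]], tokenize="flores200"
)

# ROUGE-L with PyThaiNLP's newmm word tokenizer.
rouge = evaluate.load("rouge")
rouge_l = rouge.compute(
    predictions=["..."],
    references=["..."],
    tokenizer=lambda text: word_tokenize(text, engine="newmm"),
)["rougeL"]

# LLM-as-a-judge call with a hypothetical stand-in prompt.
client = OpenAI()  # assumes OPENAI_API_KEY is set
resp = client.chat.completions.create(
    model="gpt-4o-2024-05-13",
    messages=[{"role": "user", "content": "Rate the answer from 1 to 10 as [[rating]]. ..."}],
)
rating = float(resp.choices[0].message.content.split("[[")[1].split("]]")[0])
```
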
<b>Reproducibility</b>:

- To support reproducibility of the results, we have open-sourced the evaluation pipeline. Please check out the <a href="https://github.com/scb-10x/seacrowd-eval">seacrowd-eval</a> repository.

<b>Acknowledgements</b>:

- We are grateful to the open-source projects that released the datasets, tools, and knowledge this leaderboard builds on, and we thank community members for their task and model submissions. To contribute, please see the Submit tab.
"""

CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results"
CITATION_BUTTON_TEXT = r"""@misc{thaillm-leaderboard,
  author={SCB 10X and VISTEC and SEACrowd},
  title={Thai LLM Leaderboard},
  year={2024},
  publisher={Hugging Face},
  url={https://huggingface.co/spaces/ThaiLLM-Leaderboard/leaderboard}
}"""