from dataclasses import dataclass
from enum import Enum


NUM_FEWSHOT = 0  # Number of few-shot examples; change to match your evaluation setting
# ---------------------------------------------------

TITLE = """<h1>🇹🇭 Thai LLM Leaderboard</h1>"""



INTRODUCTION_TEXT = """
The Thai LLM Leaderboard 🇹🇭 aims to standardize evaluation methods for large language models (LLMs) in the Thai language, building on <a href="https://github.com/SEACrowd">SEACrowd</a>.
As part of an open community project, we welcome you to submit new evaluation tasks or models.
This leaderboard is developed in collaboration with <a href="https://www.scb10x.com">SCB 10X</a>, <a href="https://www.vistec.ac.th/">Vistec</a>, and <a href="https://github.com/SEACrowd">SEACrowd</a>. Read more in the <a href="https://blog.opentyphoon.ai/introducing-the-thaillm-leaderboard-thaillm-evaluation-ecosystem-508e789d06bf">introduction blog post</a>.
"""

LLM_BENCHMARKS_TEXT = """
The leaderboard currently consists of the following benchmarks:
- <b>Exam</b>
  - <a href="https://huggingface.co/datasets/scb10x/thai_exam">ThaiExam</a>: ThaiExam is a Thai-language benchmark based on examinations for high-school students and investment professionals in Thailand.
  - <a href="https://arxiv.org/abs/2306.05179">M3Exam</a>: M3Exam is a novel benchmark sourced from authentic and official human exam questions for evaluating LLMs in a multilingual, multimodal, and multilevel context. This leaderboard uses the Thai subset of M3Exam.
- <b>LLM-as-a-Judge</b>
  - <a href="https://huggingface.co/datasets/ThaiLLM-Leaderboard/mt-bench-thai">Thai MT-Bench</a>: A Thai version of <a href="https://arxiv.org/abs/2306.05685">MT-Bench</a>, developed by VISTEC specifically to probe Thai generative skills using the LLM-as-a-judge method.
- <b>NLU</b>
  - <a href="https://huggingface.co/datasets/facebook/belebele">Belebele</a>: Belebele is a multiple-choice machine reading comprehension (MRC) dataset spanning 122 language variants; this leaderboard uses the Thai subset.
  - <a href="https://huggingface.co/datasets/facebook/xnli">XNLI</a>: XNLI is an evaluation corpus for language transfer and cross-lingual sentence classification in 15 languages. This leaderboard uses the Thai subset of this corpus.
  - <a href="https://huggingface.co/datasets/cambridgeltl/xcopa">XCOPA</a>: XCOPA is a translated and re-annotated version of the English COPA corpus covering 11 languages, designed to measure commonsense reasoning ability in non-English languages. This leaderboard uses the Thai subset of this corpus.
  - <a href="https://huggingface.co/datasets/pythainlp/wisesight_sentiment">Wisesight</a>: The Wisesight sentiment analysis corpus contains Thai-language social media messages annotated with sentiment labels.
- <b>NLG</b>
  - <a href="https://huggingface.co/datasets/csebuetnlp/xlsum">XLSum</a>: XLSum is a comprehensive and diverse dataset comprising 1.35 million professionally annotated article-summary pairs from the BBC, used to evaluate summarization performance in non-English languages. This leaderboard uses the Thai subset.
  - <a href="https://huggingface.co/datasets/SEACrowd/flores200">Flores200</a>: FLORES is a machine translation benchmark dataset used to evaluate translation quality between English and low-resource languages. This leaderboard uses the Thai subset of Flores200.
  - <a href="https://huggingface.co/datasets/iapp/iapp_wiki_qa_squad">iapp Wiki QA Squad</a>: iapp Wiki QA Squad is an extractive question-answering dataset derived from Thai Wikipedia articles. A sketch of loading these Thai subsets appears after this list.
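
Most of the datasets above can be pulled directly from the Hugging Face Hub. The snippet below is a minimal sketch, not the leaderboard's actual loading code; the config names (`onet`, `tha_Thai`, `th`, `thai`) are assumptions to verify against each dataset card.

```python
from datasets import load_dataset

# Illustrative only: config names are assumptions -- check each dataset card.
thai_exam = load_dataset("scb10x/thai_exam", "onet")         # one ThaiExam subset
belebele_th = load_dataset("facebook/belebele", "tha_Thai")  # Belebele, Thai variant
xnli_th = load_dataset("facebook/xnli", "th")                # XNLI, Thai subset
xcopa_th = load_dataset("cambridgeltl/xcopa", "th")          # XCOPA, Thai subset
xlsum_th = load_dataset("csebuetnlp/xlsum", "thai")          # XLSum, Thai subset
wisesight = load_dataset("pythainlp/wisesight_sentiment")    # Wisesight sentiment
```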


<b>Metric Implementation Details</b>:
- Multiple-choice accuracy is calculated using the <a href="https://github.com/SEACrowd/seacrowd-experiments/blob/048536fc0d4614734d479b298ea00a1f520da42b/evaluation/main_nlu_prompt_batch.py#L71">SEACrowd implementation</a> of logits comparison, similar to the method used by the <a href="https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard">Open LLM Leaderboard</a> (<a href="https://github.com/EleutherAI/lm-evaluation-harness">EleutherAI Harness</a>); see this <a href="https://huggingface.co/blog/open-llm-leaderboard-mmlu">explanation</a> for details.
- BLEU is calculated with sacreBLEU's flores200 tokenizer via the Hugging Face `evaluate` <a href="https://huggingface.co/spaces/evaluate-metric/sacrebleu">implementation</a>.
- ROUGE-L is calculated using the PyThaiNLP newmm tokenizer and the Hugging Face `evaluate` <a href="https://huggingface.co/spaces/evaluate-metric/rouge">implementation</a>.
- LLM-as-a-judge ratings are produced by OpenAI's gpt-4o-2024-05-13, using the prompt defined in <a href="https://github.com/lm-sys/FastChat/blob/main/fastchat/llm_judge/data/judge_prompts.jsonl">lmsys MT-Bench</a>. Simplified sketches of these metric calls follow this list.
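
As a rough illustration of the logits-comparison method (a minimal sketch, not the exact SEACrowd code): each candidate answer is scored by the total log-probability the model assigns to its tokens after the prompt, and the highest-scoring option becomes the prediction. The model name and prompt format below are placeholders.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def option_logprob(prompt: str, option: str) -> float:
    # Sum of log-probabilities of the option tokens, conditioned on the prompt.
    # Simplification: assumes tokenizing prompt + option splits at the prompt boundary.
    prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tok(prompt + option, return_tensors="pt").input_ids
    with torch.no_grad():
        log_probs = torch.log_softmax(model(full_ids).logits[0, :-1], dim=-1)
    # Logits at position i-1 predict the token at position i.
    option_ids = full_ids[0, prompt_len:]
    positions = range(prompt_len - 1, full_ids.shape[1] - 1)
    return sum(log_probs[p, t].item() for p, t in zip(positions, option_ids))

prompt = "Question: ... Answer:"
choices = [" A", " B", " C", " D"]
prediction = max(choices, key=lambda c: option_logprob(prompt, c))
```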

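The remaining metric calls can be sketched with the Hugging Face `evaluate` library and the OpenAI client. The strings, judge prompt, and parsing below are illustrative placeholders rather than the pipeline's exact code; the real judge prompts live in the FastChat file linked above.

```python
import evaluate
from pythainlp.tokenize import word_tokenize
from openai import OpenAI

# BLEU via sacreBLEU with its flores200 tokenizer.
sacrebleu = evaluate.load("sacrebleu")
bleu = sacrebleu.compute(
    predictions=["..."], references=[["..."]], tokenize="flores200"
)

# ROUGE-L with PyThaiNLP's newmm word tokenizer.
rouge = evaluate.load("rouge")
rouge_l = rouge.compute(
    predictions=["..."],
    references=["..."],
    tokenizer=lambda text: word_tokenize(text, engine="newmm"),
)["rougeL"]

# LLM-as-a-judge call with a hypothetical stand-in prompt.
client = OpenAI()  # assumes OPENAI_API_KEY is set
resp = client.chat.completions.create(
    model="gpt-4o-2024-05-13",
    messages=[{"role": "user", "content": "Rate the answer from 1 to 10 as [[rating]]. ..."}],
)
rating = float(resp.choices[0].message.content.split("[[")[1].split("]]")[0])
```
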
<b>Reproducibility</b>:

- To support reproducibility of the results, we have open-sourced the evaluation pipeline. Please check out the <a href="https://github.com/scb-10x/seacrowd-eval">seacrowd-eval</a> repository.

<b>Acknowledgements</b>:

- We are grateful to the open-source projects that released the datasets, tools, and knowledge this leaderboard builds on, and we thank community members for their task and model submissions. To contribute, please see the Submit tab.
"""

CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results"
CITATION_BUTTON_TEXT = r"""@misc{thaillm-leaderboard,
  author={SCB 10X and VISTEC and SEACrowd},
  title={Thai LLM Leaderboard},
  year={2024},
  publisher={Hugging Face},
  url={https://huggingface.co/spaces/ThaiLLM-Leaderboard/leaderboard}
}"""