from dataclasses import dataclass
from enum import Enum

@dataclass
class Task:
    benchmark: str
    metric: str
    col_name: str


# Select your tasks here
# ---------------------------------------------------
class Tasks(Enum):
    # task_key in the json file, metric_key in the json file, name to display in the leaderboard
    # task0 = Task("anli_r1", "acc", "ANLI")
    # task1 = Task("logiqa", "acc_norm", "LogiQA")
    acva = Task("community|acva:_average|5", "acc_norm", "ACVA")
    alghafa = Task("community|alghafa:_average|5", "acc_norm", "AlGhafa")
    arabic_mmlu = Task("community|arabic_mmlu:_average|5", "acc_norm", "MMLU")
    arabic_exams = Task("community|arabic_exams|5", "acc_norm", "EXAMS")
    arc_challenge_okapi_ar = Task("community|arc_challenge_okapi_ar|5", "acc_norm", "ARC Challenge")
    arc_easy_ar = Task("community|arc_easy_ar|5", "acc_norm", "ARC Easy")
    boolq_ar = Task("community|boolq_ar|5", "acc_norm", "BOOLQ")
    copa_ext_ar = Task("community|copa_ext_ar|5", "acc_norm", "COPA")
    hellaswag_okapi_ar = Task("community|hellaswag_okapi_ar|5", "acc_norm", "HELLASWAG")
    openbook_qa_ext_ar = Task("community|openbook_qa_ext_ar|5", "acc_norm", "OPENBOOK QA")
    piqa_ar = Task("community|piqa_ar|5", "acc_norm", "PIQA")
    race_ar = Task("community|race_ar|5", "acc_norm", "RACE")
    sciq_ar = Task("community|sciq_ar|5", "acc_norm", "SCIQ")
    toxigen_ar = Task("community|toxigen_ar|5", "acc_norm", "TOXIGEN")
    
NUM_FEWSHOT = 0  # Change to your number of few-shot examples
# ---------------------------------------------------



# Your leaderboard name
TITLE = """<h1 align="center" id="space-title">Open Arabic LLM Leaderboard</h1>"""
# TITLE = """<img src="image.png" style="width:30%;display:block;margin-left:auto;margin-right:auto">"""

# What does your leaderboard evaluate?
INTRODUCTION_TEXT = """
🚀 The Open Arabic LLM Leaderboard objectively evaluates and compares the performance of Arabic Large Language Models (LLMs).


When you submit a model on the "Submit here!" page, it is automatically evaluated on a set of benchmarks. 

The GPU used for evaluation is operated with the support of __[Technology Innovation Institute (TII)](https://www.tii.ae/)__.

The evaluation datasets include native Arabic benchmarks, such as `AlGhafa` from [TII](https://www.tii.ae/) and `ACVA` from [FreedomIntelligence](https://huggingface.co/FreedomIntelligence), to assess reasoning, language understanding, commonsense, and more.

More details about the benchmarks and the evaluation process are provided on the “About” page.
"""

# Which evaluations are you running? how can people reproduce what you have?
LLM_BENCHMARKS_TEXT = f"""
# Context
While outstanding LLMs are being released at a rapid pace, most of them are centered on English and the English-speaking cultural sphere. We operate the Open Arabic LLM Leaderboard (OALL) to evaluate models that reflect the characteristics of the Arabic language, culture and heritage. Through this, we hope that users can conveniently use the leaderboard, participate, and contribute to the advancement of research in the Arab region 🔥.

## Icons & Model types

🟢 : `pretrained` or `continuously pretrained`

🔶 : `fine-tuned on domain-specific datasets`

💬 : `chat models (RLHF, DPO, ORPO, ...)`

🤝 : `base merges and moerges`


If the icon is "?", it indicates that there is insufficient information about the model.
Please provide information about the model through an issue! 🤩

Note: Some models may be flagged by the community as subjects of caution, meaning that users should be careful when using them.
(This includes, among others, models that were trained on the evaluation set to achieve a high leaderboard ranking.)

## How it works
📈 We evaluate models using [LightEval](https://github.com/huggingface/lighteval), a unified and straightforward framework from the Hugging Face Eval Team to test and assess causal language models on a large number of different evaluation tasks.
We have set up a benchmark using datasets, most of them translated to Arabic and validated by native Arabic speakers. We also added `AlGhafa`, a new benchmark prepared from scratch natively for Arabic, alongside the `ACVA` benchmark introduced in the [AceGPT](https://arxiv.org/abs/2309.12053) paper by [FreedomIntelligence](https://huggingface.co/FreedomIntelligence).

The native benchmarks are listed below:

- AlGhafa: Find more details [here](https://aclanthology.org/2023.arabicnlp-1.21.pdf) - (provided by [TII](https://www.tii.ae/))
- Arabic-Culture-Value-Alignment (ACVA): Find more details [here](https://arxiv.org/pdf/2309.12053) - (provided by [FreedomIntelligence](https://huggingface.co/FreedomIntelligence))


The translated benchmarks, provided by the language evaluation team at [Technology Innovation Institute](https://www.tii.ae/), are:

- `Arabic-MMLU`, `Arabic-EXAMS`, `Arabic-ARC-Challenge`, `Arabic-ARC-Easy`, `Arabic-BOOLQ`, `Arabic-COPA`, `Arabic-HELLASWAG`, `Arabic-OPENBOOK-QA`, `Arabic-PIQA`, `Arabic-RACE`, `Arabic-SCIQ`, `Arabic-TOXIGEN`. All are part of the extended version of the AlGhafa benchmark (AlGhafa-T version).

Please consider reaching out to us through the discussions tab if you are working on benchmarks for Arabic LLMs and would like to see them on this leaderboard as well. Your benchmark might change the whole game for Arabic models!

GPUs are provided by __[Technology Innovation Institute (TII)](https://www.tii.ae/)__ for the evaluations.

## Details and Logs
- Detailed numerical results in the `results` OALL dataset: https://huggingface.co/datasets/OALL/results
- Community queries and running status in the `requests` OALL dataset: https://huggingface.co/datasets/OALL/requests

## More resources
If you still have questions, you can check our FAQ [here](https://huggingface.co/spaces/OALL/leaderboard-test-2/discussions/1)!
"""

EVALUATION_QUEUE_TEXT = """
## Some good practices before submitting a model

### 1) Make sure you can load your model and tokenizer using AutoClasses:

```python
from transformers import AutoConfig, AutoModel, AutoTokenizer

revision = "main"  # branch name, tag, or commit hash of the version you submit
config = AutoConfig.from_pretrained("your model name", revision=revision)
model = AutoModel.from_pretrained("your model name", revision=revision)
tokenizer = AutoTokenizer.from_pretrained("your model name", revision=revision)
```
If this step fails, follow the error messages to debug your model before submitting it. It's likely your model has been improperly uploaded.

Note: make sure your model is public!
Note: if your model needs `trust_remote_code=True`, we do not support this option yet, but we are working on adding it. Stay tuned!

### 2) Convert your model weights to [safetensors](https://huggingface.co/docs/safetensors/index)
It's a new format for storing weights which is safer and faster to load and use. It will also allow us to add the number of parameters of your model to the `Extended Viewer`!
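
A minimal sketch of the conversion, assuming your checkpoint already loads with `transformers`. The helper names `save_as_safetensors` and `convert` are placeholders for illustration and are not part of the leaderboard; `save_pretrained(..., safe_serialization=True)` is the `transformers` call that writes `.safetensors` files.

```python
def save_as_safetensors(model, tokenizer, output_dir):
    # Hypothetical helper: re-save an already loaded model with safetensors weights.
    # safe_serialization=True makes save_pretrained write *.safetensors files.
    model.save_pretrained(output_dir, safe_serialization=True)
    # Save the tokenizer next to the weights so the folder loads on its own.
    tokenizer.save_pretrained(output_dir)
    return output_dir


def convert(repo_id, output_dir):
    # Imported lazily so the helper above has no hard dependency on transformers.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model = AutoModelForCausalLM.from_pretrained(repo_id)
    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    return save_as_safetensors(model, tokenizer, output_dir)
```

Calling `convert("your model name", "converted-model")` would write the converted folder locally, which you can then upload to your model repository.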

### 3) Make sure your model has an open license!
This is a leaderboard for Open LLMs, and we'd love for as many people as possible to know they can use your model 🤗

### 4) Fill in your model card
When we add extra information about models to the leaderboard, it will be automatically taken from the model card.
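
To check what your card currently declares, here is a rough sketch that reads the top-level entries of the YAML front matter at the start of a model card `README.md`. The helper `read_card_metadata` is hypothetical and is not how the leaderboard reads cards; it only handles simple `key: value` lines, so nested metadata would need a real YAML parser.

```python
def read_card_metadata(readme_text):
    # Hypothetical helper: collect top-level "key: value" pairs from the YAML
    # front matter (the block between the first two "---" lines) of a model card.
    lines = readme_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}  # no front matter at the top of the card
    metadata = {}
    for line in lines[1:]:
        if line.strip() == "---":
            break  # end of the front matter
        if line.startswith((" ", "-", "#")) or ":" not in line:
            continue  # skip nested items, list entries, and comments
        key, _, value = line.partition(":")
        metadata[key.strip()] = value.strip()
    return metadata
```

A missing or empty `license` entry in the returned dictionary is a sign that the card still needs to be filled in.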

## In case of model failure
If your model is displayed in the `FAILED` category, its execution stopped.
Make sure you have followed the above steps first.
If you have done all of that, check that you can launch the LightEval script on your model locally, using [this script](https://gist.github.com/alielfilali01/d486cfc962dca3ed4091b7c562a4377f).
"""

CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results"
CITATION_BUTTON_TEXT = r"""
@misc{OALL,
  author = {Elfilali, Ali and Alobeidli, Hamza and Fourrier, Clémentine and Cojocaru, Ruxandra and Habib, Nathan},
  title = {Open Arabic LLM Leaderboard},
  year = {2024},
  publisher = {OALL},
  howpublished = "\url{https://huggingface.co/spaces/OALL/leaderboard-test-2}"
}
@inproceedings{almazrouei-etal-2023-alghafa,
    title = "{A}l{G}hafa Evaluation Benchmark for {A}rabic Language Models",
    author = "Almazrouei, Ebtesam  and
      Cojocaru, Ruxandra  and
      Baldo, Michele  and
      Malartic, Quentin  and
      Alobeidli, Hamza  and
      Mazzotta, Daniele  and
      Penedo, Guilherme  and
      Campesan, Giulia  and
      Farooq, Mugariya  and
      Alhammadi, Maitha  and
      Launay, Julien  and
      Noune, Badreddine",
    editor = "Sawaf, Hassan  and
      El-Beltagy, Samhaa  and
      Zaghouani, Wajdi  and
      Magdy, Walid  and
      Abdelali, Ahmed  and
      Tomeh, Nadi  and
      Abu Farha, Ibrahim  and
      Habash, Nizar  and
      Khalifa, Salam  and
      Keleg, Amr  and
      Haddad, Hatem  and
      Zitouni, Imed  and
      Mrini, Khalil  and
      Almatham, Rawan",
    booktitle = "Proceedings of ArabicNLP 2023",
    month = dec,
    year = "2023",
    address = "Singapore (Hybrid)",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.arabicnlp-1.21",
    doi = "10.18653/v1/2023.arabicnlp-1.21",
    pages = "244--275",
    abstract = "Recent advances in the space of Arabic large language models have opened up a wealth of potential practical applications. From optimal training strategies, large scale data acquisition and continuously increasing NLP resources, the Arabic LLM landscape has improved in a very short span of time, despite being plagued by training data scarcity and limited evaluation resources compared to English. In line with contributing towards this ever-growing field, we introduce AlGhafa, a new multiple-choice evaluation benchmark for Arabic LLMs. For showcasing purposes, we train a new suite of models, including a 14 billion parameter model, the largest monolingual Arabic decoder-only model to date. We use a collection of publicly available datasets, as well as a newly introduced HandMade dataset consisting of 8 billion tokens. Finally, we explore the quantitative and qualitative toxicity of several Arabic models, comparing our models to existing public Arabic LLMs.",
}
@misc{huang2023acegpt,
      title={AceGPT, Localizing Large Language Models in Arabic}, 
      author={Huang Huang and Fei Yu and Jianqing Zhu and Xuening Sun and Hao Cheng and Dingjie Song and Zhihong Chen and Abdulmohsen Alharthi and Bang An and Ziche Liu and Zhiyi Zhang and Junying Chen and Jianquan Li and Benyou Wang and Lian Zhang and Ruoyu Sun and Xiang Wan and Haizhou Li and Jinchao Xu},
      year={2023},
      eprint={2309.12053},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
@misc{lighteval,
  author = {Fourrier, Clémentine and Habib, Nathan and Wolf, Thomas},
  title = {LightEval: A lightweight framework for LLM evaluation},
  year = {2024},
  publisher = {GitHub},
  journal = {GitHub repository},
  url = {https://github.com/huggingface/lighteval}
}
"""