Spaces: OALL / Open-Arabic-LLM-Leaderboard

Ali-C137 committed on Apr 25

Commit 249112c • 1 Parent(s): 866ba51

Update src/about.py

Files changed (1)

src/about.py +104 -5

@@ -35,26 +35,64 @@ NUM_FEWSHOT = 0 # Change with your few shot
 # Your leaderboard name
-TITLE = """<h1 align="center" id="space-title">Demo leaderboard</h1>"""
 # What does your leaderboard evaluate?
 INTRODUCTION_TEXT = """
-Intro text
 """
 # Which evaluations are you running? how can people reproduce what you have?
 LLM_BENCHMARKS_TEXT = f"""
 ## How it works
-## Reproducibility
-To reproduce our results, here is the commands you can run:
 """
 EVALUATION_QUEUE_TEXT = """
 ## Some good practices before submitting a model
 ### 1) Make sure you can load your model and tokenizer using AutoClasses:
 ```python
 from transformers import AutoConfig, AutoModel, AutoTokenizer
 config = AutoConfig.from_pretrained("your model name", revision=revision)
@@ -78,9 +116,70 @@ When we add extra information about models to the leaderboard, it will be automa
 ## In case of model failure
 If your model is displayed in the `FAILED` category, its execution stopped.
 Make sure you have followed the above steps first.
-If everything is done, check you can launch the EleutherAIHarness on your model locally, using the above command without modifications (you can add `--limit` to limit the number of examples per task).
 """
 CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results"
 CITATION_BUTTON_TEXT = r"""
 """

 # Your leaderboard name
+TITLE = """<h1 align="center" id="space-title">Open Arabic LLM Leaderboard</h1>"""
+# TITLE = """<img src="image.png" style="width:30%;display:block;margin-left:auto;margin-right:auto">"""
 # What does your leaderboard evaluate?
 INTRODUCTION_TEXT = """
+🚀 The Open Arabic LLM Leaderboard objectively evaluates and compares the performance of Arabic Large Language Models (LLMs).
+When you submit a model on the "Submit here!" page, it is automatically evaluated on a set of benchmarks. The GPU used for evaluation is operated with the support of __[Technology Innovation Institute (TII)](https://www.tii.ae/)__.
+The evaluation datasets include native Arabic benchmarks, such as the `AlGhafa` benchmark from [TII](https://www.tii.ae/) and the `ACVA` benchmark from [FreedomIntelligence](https://huggingface.co/FreedomIntelligence), to assess reasoning, language understanding, commonsense, and more.
+More details about the benchmarks and the evaluation process are provided on the "About" page.
 """
 # Which evaluations are you running? how can people reproduce what you have?
 LLM_BENCHMARKS_TEXT = f"""
+# Context
+While outstanding LLMs are being released at a rapid pace, most of them are centered on English and the English-speaking cultural sphere. We operate the Open Arabic LLM Leaderboard (OALL) to evaluate models that reflect the characteristics of the Arabic language, culture, and heritage. We hope users will find the leaderboard convenient to use, participate in it, and contribute to the advancement of research in the Arab region 🔥.
+## Icons
+{ModelType.PT.to_str(" : ")} model
+{ModelType.IFT.to_str(" : ")} model
+{ModelType.RL.to_str(" : ")} model
+If the icon is "?", there is insufficient information about the model.
+Please provide information about the model through an issue! 🤩
+Note: Some models may be flagged by the community as subjects of caution, meaning users should exercise care when using them.
+(For example, models that used the evaluation set for training to achieve a high leaderboard ranking are flagged as subjects of caution.)
 ## How it works
+📈 We evaluate models using [LightEval](https://github.com/huggingface/lighteval), a unified and straightforward framework from the Hugging Face Eval Team to test and assess causal language models on a large number of different evaluation tasks.
+We have set up a benchmark using datasets, most of them translated to Arabic and validated by native Arabic speakers. We also added `AlGhafa`, a new benchmark prepared from scratch natively for Arabic, alongside the `ACVA` benchmark introduced in the [AceGPT](https://arxiv.org/abs/2309.12053) paper by [FreedomIntelligence](https://huggingface.co/FreedomIntelligence).
+The native benchmarks are:
+- AlGhafa: find more details [here](https://aclanthology.org/2023.arabicnlp-1.21.pdf) (provided by [TII](https://www.tii.ae/))
+- Arabic-Culture-Value-Alignment (ACVA): find more details [here](https://arxiv.org/pdf/2309.12053) (provided by [FreedomIntelligence](https://huggingface.co/FreedomIntelligence))
+The translated benchmarks below are provided by the language evaluation team at [Technology Innovation Institute](https://www.tii.ae/):
+- `Arabic-MMLU`, `Arabic-EXAMS`, `Arabic-ARC-Challenge`, `Arabic-ARC-Easy`, `Arabic-BOOLQ`, `Arabic-COPA`, `Arabic-HELLASWAG`, `Arabic-OPENBOOK-QA`, `Arabic-PIQA`, `Arabic-RACE`, `Arabic-SCIQ`, `Arabic-TOXIGEN`. These are all part of the extended version of the AlGhafa benchmark (AlGhafa-T version).
+Please consider reaching out to us through the discussions tab if you are working on benchmarks for Arabic LLMs and would like to see them on this leaderboard as well. Your benchmark might change the whole game for Arabic models!
+GPUs are provided by __[Technology Innovation Institute (TII)](https://www.tii.ae/)__ for the evaluations.
+## Details and Logs
+- Detailed numerical results in the `results` OALL dataset: https://huggingface.co/datasets/OALL/results
+- Community queries and running status in the `requests` OALL dataset: https://huggingface.co/datasets/OALL/requests
+## More resources
+If you still have questions, you can check our FAQ [here](https://huggingface.co/spaces/OALL/leaderboard-test-2/discussions/1)!
 """
 EVALUATION_QUEUE_TEXT = """
 ## Some good practices before submitting a model
 ### 1) Make sure you can load your model and tokenizer using AutoClasses:
 ```python
 from transformers import AutoConfig, AutoModel, AutoTokenizer
 config = AutoConfig.from_pretrained("your model name", revision=revision)
 ## In case of model failure
 If your model is displayed in the `FAILED` category, its execution stopped.
 Make sure you have followed the above steps first.
+If you have done all of that, check that you can run LightEval on your model locally using [this script](https://gist.github.com/alielfilali01/d486cfc962dca3ed4091b7c562a4377f).
 """
 CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results"
 CITATION_BUTTON_TEXT = r"""
+@misc{OALL,
+  author = {Elfilali, Ali and Alobeidli, Hamza and Fourrier, Clémentine and Cojocaru, Ruxandra and Habib, Nathan},
+  title = {Open Arabic LLM Leaderboard},
+  year = {2024},
+  publisher = {OALL},
+  howpublished = "\url{https://huggingface.co/spaces/OALL/leaderboard-test-2}"
+}
+@inproceedings{almazrouei-etal-2023-alghafa,
+    title = "{A}l{G}hafa Evaluation Benchmark for {A}rabic Language Models",
+    author = "Almazrouei, Ebtesam  and
+      Cojocaru, Ruxandra  and
+      Baldo, Michele  and
+      Malartic, Quentin  and
+      Alobeidli, Hamza  and
+      Mazzotta, Daniele  and
+      Penedo, Guilherme  and
+      Campesan, Giulia  and
+      Farooq, Mugariya  and
+      Alhammadi, Maitha  and
+      Launay, Julien  and
+      Noune, Badreddine",
+    editor = "Sawaf, Hassan  and
+      El-Beltagy, Samhaa  and
+      Zaghouani, Wajdi  and
+      Magdy, Walid  and
+      Abdelali, Ahmed  and
+      Tomeh, Nadi  and
+      Abu Farha, Ibrahim  and
+      Habash, Nizar  and
+      Khalifa, Salam  and
+      Keleg, Amr  and
+      Haddad, Hatem  and
+      Zitouni, Imed  and
+      Mrini, Khalil  and
+      Almatham, Rawan",
+    booktitle = "Proceedings of ArabicNLP 2023",
+    month = dec,
+    year = "2023",
+    address = "Singapore (Hybrid)",
+    publisher = "Association for Computational Linguistics",
+    url = "https://aclanthology.org/2023.arabicnlp-1.21",
+    doi = "10.18653/v1/2023.arabicnlp-1.21",
+    pages = "244--275",
+    abstract = "Recent advances in the space of Arabic large language models have opened up a wealth of potential practical applications. From optimal training strategies, large scale data acquisition and continuously increasing NLP resources, the Arabic LLM landscape has improved in a very short span of time, despite being plagued by training data scarcity and limited evaluation resources compared to English. In line with contributing towards this ever-growing field, we introduce AlGhafa, a new multiple-choice evaluation benchmark for Arabic LLMs. For showcasing purposes, we train a new suite of models, including a 14 billion parameter model, the largest monolingual Arabic decoder-only model to date. We use a collection of publicly available datasets, as well as a newly introduced HandMade dataset consisting of 8 billion tokens. Finally, we explore the quantitative and qualitative toxicity of several Arabic models, comparing our models to existing public Arabic LLMs.",
+}
+@misc{huang2023acegpt,
+      title={AceGPT, Localizing Large Language Models in Arabic},
+      author={Huang Huang and Fei Yu and Jianqing Zhu and Xuening Sun and Hao Cheng and Dingjie Song and Zhihong Chen and Abdulmohsen Alharthi and Bang An and Ziche Liu and Zhiyi Zhang and Junying Chen and Jianquan Li and Benyou Wang and Lian Zhang and Ruoyu Sun and Xiang Wan and Haizhou Li and Jinchao Xu},
+      year={2023},
+      eprint={2309.12053},
+      archivePrefix={arXiv},
+      primaryClass={cs.CL}
+}
+@misc{lighteval,
+  author = {Fourrier, Clémentine and Habib, Nathan and Wolf, Thomas},
+  title = {LightEval: A lightweight framework for LLM evaluation},
+  year = {2024},
+  publisher = {GitHub},
+  journal = {GitHub repository},
+  url = {https://github.com/huggingface/lighteval}
+}
 """
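
A note on the `## Icons` lines added to `LLM_BENCHMARKS_TEXT`: they are f-string interpolations of `ModelType.<member>.to_str(" : ")`, but the definition of `ModelType` is not part of this diff. The sketch below assumes an enum shaped roughly like the one in the Hugging Face demo leaderboard template. The `ModelDetails` dataclass, the ASCII symbols, and the type names are placeholders chosen for illustration, not the values used by this Space. It only shows how those lines would render.

```python
from dataclasses import dataclass
from enum import Enum


@dataclass(frozen=True)
class ModelDetails:
    """Hypothetical container for a model type's display data."""
    name: str
    symbol: str = ""


class ModelType(Enum):
    # Placeholder symbols and names; the real ones live outside this diff.
    PT = ModelDetails(name="pretrained", symbol="(PT)")
    IFT = ModelDetails(name="instruction-tuned", symbol="(IFT)")
    RL = ModelDetails(name="RL-tuned", symbol="(RL)")
    Unknown = ModelDetails(name="", symbol="?")

    def to_str(self, separator: str = " ") -> str:
        # Join the symbol and the name with the given separator.
        return f"{self.value.symbol}{separator}{self.value.name}"


# Same interpolation pattern as the added lines in LLM_BENCHMARKS_TEXT.
ICONS_TEXT = f"""
## Icons
{ModelType.PT.to_str(" : ")} model
{ModelType.IFT.to_str(" : ")} model
{ModelType.RL.to_str(" : ")} model
"""

if __name__ == "__main__":
    print(ICONS_TEXT)
```

Because the text is an f-string, any literal `{` or `}` added to `LLM_BENCHMARKS_TEXT` later would need to be doubled (`{{`, `}}`). `CITATION_BUTTON_TEXT` avoids this by being a plain raw string.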