from src.display.utils import ModelType

TITLE = """<h1 align="center" id="space-title">🤗 Open Hallucinations Leaderboard</h1>"""

INTRODUCTION_TEXT = """
πŸ“ The πŸ€— Open Hallucinations Leaderboard aims to track, rank and evaluate hallucinations in LLMs and chatbots.

🤗 Submit a model for automated evaluation on the 🤗 GPU cluster on the "Submit" page!
The leaderboard's backend runs the great [Eleuther AI Language Model Evaluation Harness](https://github.com/EleutherAI/lm-evaluation-harness) - read more details in the "About" page!
"""

LLM_BENCHMARKS_TEXT = f"""
# Context
As large language models (LLMs) get better at generating believable text, addressing hallucinations in LLMs becomes increasingly important. With numerous LLMs being released every week, it can be challenging to identify the leading models, particularly in terms of their robustness against hallucination. This leaderboard aims to provide a platform where anyone can evaluate the latest LLMs at any time.

# How it works
📈 We evaluate the models on 11 hallucination benchmarks using the <a href="https://github.com/EleutherAI/lm-evaluation-harness" target="_blank"> Eleuther AI Language Model Evaluation Harness </a>, a unified framework to test generative language models on a large number of different evaluation tasks.
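- <a href="https://aclanthology.org/P19-1612/" target="_blank"> NQ Open </a> - an open-domain question answering benchmark derived from the Natural Questions dataset.
- <a href="https://aclanthology.org/P17-1147/" target="_blank"> TriviaQA </a> - a large-scale question answering dataset of question-answer pairs authored by trivia enthusiasts.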
- <a href="https://aclanthology.org/2022.acl-long.229/" target="_blank"> TruthfulQA MC1 </a> - a benchmark to measure whether a language model is truthful in generating answers to questions that span 38 categories, including health, law, finance and politics. Questions are crafted so that some humans would answer falsely due to a false belief or misconception. To perform well, models must avoid generating false answers learned from imitating human texts. **MC1 denotes that there is a single correct label**.
- <a href="https://aclanthology.org/2022.acl-long.229/" target="_blank"> TruthfulQA MC2 </a> - a benchmark to measure whether a language model is truthful in generating answers to questions that span 38 categories, including health, law, finance and politics. Questions are crafted so that some humans would answer falsely due to a false belief or misconception. To perform well, models must avoid generating false answers learned from imitating human texts. **MC2 denotes that there can be multiple correct labels**.
- <a href="https://aclanthology.org/2023.emnlp-main.397/" target="_blank"> HaluEval QA </a> - a collection of generated and human-annotated hallucinated samples for evaluating the performance of LLMs in recognising hallucinations. **QA denotes the question answering task**.
- <a href="https://aclanthology.org/2023.emnlp-main.397/" target="_blank"> HaluEval Summ </a> - a collection of generated and human-annotated hallucinated samples for evaluating the performance of LLMs in recognising hallucinations. **Summ denotes the summarisation task**.
- <a href="https://aclanthology.org/2023.emnlp-main.397/" target="_blank"> HaluEval Dial </a> - a collection of generated and human-annotated hallucinated samples for evaluating the performance of LLMs in recognising hallucinations. **Dial denotes the knowledge-grounded dialogue task**.
- <a href="https://aclanthology.org/2020.acl-main.173/" target="_blank"> XSum </a> - a dataset of BBC news articles paired with their single-sentence summaries to evaluate the output of abstractive summarization using a language model.
- <a href="https://arxiv.org/abs/1704.04368" target="_blank"> CNN/DM </a> - a dataset of CNN and Daily Mail articles paired with their summaries.
- <a href="https://github.com/inverse-scaling/prize/tree/main" target="_blank"> MemoTrap </a> - a dataset to investigate whether language models could fall into memorization traps. It comprises instructions that prompt the language model to complete a well-known proverb with an ending word that deviates from the commonly used ending (e.g., Write a quote that ends in the word “early”: Better late than ).
- <a href="https://arxiv.org/abs/2311.07911v1" target="_blank"> IFEval </a> - a dataset to evaluate the instruction-following ability of large language models. There are 500+ prompts with instructions such as "write an article with more than 800 words" or "wrap your response with double quotation marks".

For all these evaluations, a higher score is a better score.
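
To illustrate the difference between the MC1 and MC2 variants of TruthfulQA listed above, here is a minimal sketch of how the two scores can be computed for a single question. It assumes the model assigns a log-likelihood to each candidate answer; the function names and the example values are hypothetical, and the evaluation harness may differ in its details.

```python
import math


def mc1_score(true_logprobs, false_logprobs):
    # MC1: 1.0 when a correct answer receives the highest log-likelihood.
    # For MC1, true_logprobs is expected to hold a single correct answer.
    return 1.0 if max(true_logprobs) > max(false_logprobs) else 0.0


def mc2_score(true_logprobs, false_logprobs):
    # MC2: share of the total probability mass assigned to correct answers.
    true_mass = sum(math.exp(lp) for lp in true_logprobs)
    false_mass = sum(math.exp(lp) for lp in false_logprobs)
    return true_mass / (true_mass + false_mass)


# Hypothetical log-likelihoods for one question (not real model outputs).
true_lps = [math.log(0.2), math.log(0.3)]
false_lps = [math.log(0.4), math.log(0.1)]

print("MC1:", mc1_score(true_lps[:1], false_lps))
print("MC2:", mc2_score(true_lps, false_lps))
```

In this sketch MC1 is all-or-nothing per question, whereas MC2 gives partial credit in proportion to the probability mass placed on correct answers; both are then averaged over all questions.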

# Details and logs
You can find details on the inputs and outputs for each model in its `details`, which you can access by clicking the 📄 emoji after the model name.

# Reproducibility
Hyperparameters: XXX
Device(s): XXX
Metrics: XXX
"""

FAQ_TEXT = """
---------------------------
# FAQ
XXX
"""

EVALUATION_QUEUE_TEXT = """
XXX
"""

CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results"
CITATION_BUTTON_TEXT = r"""
@misc{hallucinations-leaderboard,
  author = {Pasquale Minervini},
  title = {Hallucinations Leaderboard},
  year = {2023},
  publisher = {Hugging Face},
  howpublished = "\url{https://huggingface.co/spaces/hallucinations-leaderboard/leaderboard}"
}
"""