mdouglas's Collections
Papers: Evaluation
Skill-Mix: a Flexible and Expandable Family of Evaluations for AI models (arXiv:2310.17567)
This is not a Dataset: A Large Negation Benchmark to Challenge Large Language Models (arXiv:2310.15941)
Holistic Evaluation of Language Models (arXiv:2211.09110)
INSTRUCTEVAL: Towards Holistic Evaluation of Instruction-Tuned Large Language Models (arXiv:2306.04757)
EleutherAI: Going Beyond "Open Science" to "Science in the Open" (arXiv:2210.06413)
Leveraging Word Guessing Games to Assess the Intelligence of Large Language Models (arXiv:2310.20499)
MEGAVERSE: Benchmarking Large Language Models Across Languages, Modalities, Models and Tasks (arXiv:2311.07463)
Judging LLM-as-a-judge with MT-Bench and Chatbot Arena (arXiv:2306.05685)
TofuEval: Evaluating Hallucinations of LLMs on Topic-Focused Dialogue Summarization (arXiv:2402.13249)