Evaluation
Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference • 2403.04132
Evaluating Very Long-Term Conversational Memory of LLM Agents • 2402.17753
The FinBen: An Holistic Financial Benchmark for Large Language Models • 2402.12659
TofuEval: Evaluating Hallucinations of LLMs on Topic-Focused Dialogue Summarization • 2402.13249
Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models • 2405.01535
To Believe or Not to Believe Your LLM • 2406.02543
Evaluating Open Language Models Across Task Types, Application Domains, and Reasoning Types: An In-Depth Experimental Analysis • 2406.11402
Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges • 2406.12624
• 2408.02666
Beyond One-Size-Fits-All: Inversion Learning for Highly Effective NLG Evaluation Prompts • 2504.21117
AutoLibra: Agent Metric Induction from Open-Ended Feedback • 2505.02820
Which Agent Causes Task Failures and When? On Automated Failure Attribution of LLM Multi-Agent Systems • 2505.00212
Auto-SLURP: A Benchmark Dataset for Evaluating Multi-Agent Frameworks in Smart Personal Assistant • 2504.18373
X-Reasoner: Towards Generalizable Reasoning Across Modalities and Domains • 2505.03981