Evaluation
Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference • 2403.04132
Evaluating Very Long-Term Conversational Memory of LLM Agents • 2402.17753
The FinBen: An Holistic Financial Benchmark for Large Language Models • 2402.12659
TofuEval: Evaluating Hallucinations of LLMs on Topic-Focused Dialogue Summarization • 2402.13249
Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models • 2405.01535
To Believe or Not to Believe Your LLM • 2406.02543
Evaluating Open Language Models Across Task Types, Application Domains, and Reasoning Types: An In-Depth Experimental Analysis • 2406.11402
Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges • 2406.12624
• 2408.02666
Beyond One-Size-Fits-All: Inversion Learning for Highly Effective NLG Evaluation Prompts • 2504.21117
AutoLibra: Agent Metric Induction from Open-Ended Feedback • 2505.02820
Which Agent Causes Task Failures and When? On Automated Failure Attribution of LLM Multi-Agent Systems • 2505.00212
Auto-SLURP: A Benchmark Dataset for Evaluating Multi-Agent Frameworks in Smart Personal Assistant • 2504.18373
X-Reasoner: Towards Generalizable Reasoning Across Modalities and Domains • 2505.03981