EvalLM: Interactive Evaluation of Large Language Model Prompts on User-Defined Criteria Paper • 2309.13633 • Published Sep 24, 2023
Prometheus: Inducing Fine-grained Evaluation Capability in Language Models Paper • 2310.08491 • Published Oct 12, 2023
Aligning Large Language Models through Synthetic Feedback Paper • 2305.13735 • Published May 23, 2023
The CoT Collection: Improving Zero-shot and Few-shot Learning of Language Models via Chain-of-Thought Fine-Tuning Paper • 2305.14045 • Published May 23, 2023
Who Wrote this Code? Watermarking for Code Generation Paper • 2305.15060 • Published May 24, 2023
Dialogue Summaries as Dialogue States (DS2), Template-Guided Summarization for Few-shot Dialogue State Tracking Paper • 2203.01552 • Published Mar 3, 2022
Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models Paper • 2405.01535 • Published May 2, 2024
The BiGGen Bench: A Principled Benchmark for Fine-grained Evaluation of Language Models with Language Models Paper • 2406.05761 • Published Jun 9, 2024
Scaling Evaluation-time Compute with Reasoning Models as Process Evaluators Paper • 2503.19877 • Published Mar 2025
LLM-as-an-Interviewer: Beyond Static Testing Through Dynamic LLM Evaluation Paper • 2412.10424 • Published Dec 10, 2024
Bridging the Data Provenance Gap Across Text, Speech and Video Paper • 2412.17847 • Published Dec 19, 2024
Evaluating Language Models as Synthetic Data Generators Paper • 2412.03679 • Published Dec 4, 2024
MM-Eval: A Multilingual Meta-Evaluation Benchmark for LLM-as-a-Judge and Reward Models Paper • 2410.17578 • Published Oct 23, 2024