Scaling Evaluation-time Compute with Reasoning Models as Process Evaluators Paper • 2503.19877 • Published Mar 2025
MM-Eval: A Multilingual Meta-Evaluation Benchmark for LLM-as-a-Judge and Reward Models Paper • 2410.17578 • Published Oct 23, 2024
The BiGGen Bench: A Principled Benchmark for Fine-grained Evaluation of Language Models with Language Models Paper • 2406.05761 • Published Jun 9, 2024
Knowledge Unlearning for Mitigating Privacy Risks in Language Models Paper • 2210.01504 • Published Oct 4, 2022
Gradient Ascent Post-training Enhances Language Model Generalization Paper • 2306.07052 • Published Jun 12, 2023
Sketch-of-Thought: Efficient LLM Reasoning with Adaptive Cognitive-Inspired Sketching Paper • 2503.05179 • Published Mar 7, 2025
Linguistic Generalizability of Test-Time Scaling in Mathematical Reasoning Paper • 2502.17407 • Published Feb 24, 2025
VideoRAG: Retrieval-Augmented Generation over Video Corpus Paper • 2501.05874 • Published Jan 10, 2025
LLM-as-an-Interviewer: Beyond Static Testing Through Dynamic LLM Evaluation Paper • 2412.10424 • Published Dec 10, 2024
Revisiting In-Context Learning with Long Context Language Models Paper • 2412.16926 • Published Dec 22, 2024
Bridging the Data Provenance Gap Across Text, Speech and Video Paper • 2412.17847 • Published Dec 19, 2024