MLGym: A New Framework and Benchmark for Advancing AI Research Agents Paper • 2502.14499 • Published 21 days ago • 178
PingPong: A Benchmark for Role-Playing Language Models with User Emulation and Multi-Model Evaluation Paper • 2409.06820 • Published Sep 10, 2024 • 66