Test-Time Scaling Makes Overtraining Compute-Optimal Paper • 2604.01411 • Published 8 days ago • 20
SlopCodeBench: Benchmarking How Coding Agents Degrade Over Long-Horizon Iterative Tasks Paper • 2603.24755 • Published 14 days ago • 28
RubiCap: Rubric-Guided Reinforcement Learning for Dense Image Captioning Paper • 2603.09160 • Published 30 days ago • 15
SkillOrchestra: Learning to Route Agents via Skill Transfer Paper • 2602.19672 • Published Feb 23 • 57
Revisiting Generalization Across Difficulty Levels: It's Not So Easy Paper • 2511.21692 • Published Nov 26, 2025 • 15