SlopCodeBench: Benchmarking How Coding Agents Degrade Over Long-Horizon Iterative Tasks Paper • 2603.24755 • Published 3 days ago • 22
LongCat-Flash-Prover: Advancing Native Formal Reasoning via Agentic Tool-Integrated Reinforcement Learning Paper • 2603.21065 • Published 7 days ago • 74
MetaClaw: Just Talk -- An Agent That Meta-Learns and Evolves in the Wild Paper • 2603.17187 • Published 11 days ago • 132
TRUST-SQL: Tool-Integrated Multi-Turn Reinforcement Learning for Text-to-SQL over Unknown Schemas Paper • 2603.16448 • Published 11 days ago • 58
InCoder-32B: Code Foundation Model for Industrial Scenarios Paper • 2603.16790 • Published 11 days ago • 301
MiroThinker-1.7 & H1: Towards Heavy-Duty Research Agents via Verification Paper • 2603.15726 • Published 12 days ago • 181
SWE-Skills-Bench: Do Agent Skills Actually Help in Real-World Software Engineering? Paper • 2603.15401 • Published 12 days ago • 18
In-Context Reinforcement Learning for Tool Use in Large Language Models Paper • 2603.08068 • Published 20 days ago • 41
Can Large Language Models Keep Up? Benchmarking Online Adaptation to Continual Knowledge Streams Paper • 2603.07392 • Published 21 days ago • 18