RL + Reasoning Models: a paper collection
RL + Transformer = A General-Purpose Problem Solver (arXiv:2501.14176 • 28 upvotes)
Towards General-Purpose Model-Free Reinforcement Learning (arXiv:2501.16142 • 31 upvotes)
SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training (arXiv:2501.17161 • 124 upvotes)
MaxInfoRL: Boosting exploration in reinforcement learning through information gain maximization (arXiv:2412.12098 • 4 upvotes)
RLDG: Robotic Generalist Policy Distillation via Reinforcement Learning (arXiv:2412.09858 • 2 upvotes)
Thoughts Are All Over the Place: On the Underthinking of o1-Like LLMs (arXiv:2501.18585 • 61 upvotes)
o3-mini vs DeepSeek-R1: Which One is Safer? (arXiv:2501.18438 • 23 upvotes)
s1: Simple test-time scaling (arXiv:2501.19393 • 124 upvotes)
Process Reinforcement through Implicit Rewards (arXiv:2502.01456 • 62 upvotes)
The Jumping Reasoning Curve? Tracking the Evolution of Reasoning Performance in GPT-[n] and o-[n] Models on Multimodal Puzzles (arXiv:2502.01081 • 13 upvotes)
Improving Transformer World Models for Data-Efficient RL (arXiv:2502.01591 • 9 upvotes)
Satori: Reinforcement Learning with Chain-of-Action-Thought Enhances LLM Reasoning via Autoregressive Search (arXiv:2502.02508 • 22 upvotes)
Demystifying Long Chain-of-Thought Reasoning in LLMs (arXiv:2502.03373 • 58 upvotes)
Boosting Multimodal Reasoning with MCTS-Automated Structured Thinking (arXiv:2502.02339 • 23 upvotes)
A Probabilistic Inference Approach to Inference-Time Scaling of LLMs using Particle-Based Monte Carlo Methods (arXiv:2502.01618 • 10 upvotes)
BOLT: Bootstrap Long Chain-of-Thought in Language Models without Distillation (arXiv:2502.03860 • 25 upvotes)
Step Back to Leap Forward: Self-Backtracking for Boosting Reasoning of Language Models (arXiv:2502.04404 • 25 upvotes)
Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time Scaling (arXiv:2502.06703 • 152 upvotes)
ReasonFlux: Hierarchical LLM Reasoning via Scaling Thought Templates (arXiv:2502.06772 • 21 upvotes)
LLMs Can Easily Learn to Reason from Demonstrations: Structure, not content, is what matters! (arXiv:2502.07374 • 40 upvotes)
Teaching Language Models to Critique via Reinforcement Learning (arXiv:2502.03492 • 24 upvotes)
Adapting Language-Specific LLMs to a Reasoning Model in One Day via Model Merging -- An Open Recipe (arXiv:2502.09056 • 31 upvotes)
Logical Reasoning in Large Language Models: A Survey (arXiv:2502.09100 • 24 upvotes)
The Danger of Overthinking: Examining the Reasoning-Action Dilemma in Agentic Tasks (arXiv:2502.08235 • 59 upvotes)
video-SALMONN-o1: Reasoning-enhanced Audio-visual Large Language Model (arXiv:2502.11775 • 9 upvotes)
Soundwave: Less is More for Speech-Text Alignment in LLMs (arXiv:2502.12900 • 86 upvotes)
Revisiting the Test-Time Scaling of o1-like Models: Do they Truly Possess Test-Time Scaling Capabilities? (arXiv:2502.12215 • 16 upvotes)
Small Models Struggle to Learn from Strong Reasoners (arXiv:2502.12143 • 39 upvotes)
Thinking Preference Optimization (arXiv:2502.13173 • 17 upvotes)
Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning (arXiv:2502.14768 • 47 upvotes)
LightThinker: Thinking Step-by-Step Compression (arXiv:2502.15589 • 31 upvotes)
The Relationship Between Reasoning and Performance in Large Language Models -- o3 (mini) Thinks Harder, Not Longer (arXiv:2502.15631 • 9 upvotes)
Multimodal Inconsistency Reasoning (MMIR): A New Benchmark for Multimodal Reasoning Models (arXiv:2502.16033 • 18 upvotes)
SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution (arXiv:2502.18449 • 75 upvotes)
Self-rewarding correction for mathematical reasoning (arXiv:2502.19613 • 82 upvotes)
MedVLM-R1: Incentivizing Medical Reasoning Capability of Vision-Language Models (VLMs) via Reinforcement Learning (arXiv:2502.19634 • 63 upvotes)
FINEREASON: Evaluating and Improving LLMs' Deliberate Reasoning through Reflective Puzzle Solving (arXiv:2502.20238 • 23 upvotes)
Visual-RFT: Visual Reinforcement Fine-Tuning (arXiv:2503.01785 • 86 upvotes)
SoS1: O1 and R1-Like Reasoning LLMs are Sum-of-Square Solvers (arXiv:2502.20545 • 22 upvotes)
Cognitive Behaviors that Enable Self-Improving Reasoners, or, Four Habits of Highly Effective STaRs (arXiv:2503.01307 • 38 upvotes)
Efficient Test-Time Scaling via Self-Calibration (arXiv:2503.00031 • 15 upvotes)
LADDER: Self-Improving LLMs Through Recursive Problem Decomposition (arXiv:2503.00735 • 23 upvotes)
START: Self-taught Reasoner with Tools (arXiv:2503.04625 • 113 upvotes)
Sketch-of-Thought: Efficient LLM Reasoning with Adaptive Cognitive-Inspired Sketching (arXiv:2503.05179 • 46 upvotes)
R1-Zero's "Aha Moment" in Visual Reasoning on a 2B Non-SFT Model (arXiv:2503.05132 • 57 upvotes)
R1-Searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning (arXiv:2503.05592 • 27 upvotes)
Learning from Failures in Multi-Attempt Reinforcement Learning (arXiv:2503.04808 • 18 upvotes)
An Empirical Study on Eliciting and Improving R1-like Reasoning Models (arXiv:2503.04548 • 9 upvotes)
MM-Eureka: Exploring Visual Aha Moment with Rule-based Large-scale Reinforcement Learning (arXiv:2503.07365 • 61 upvotes)
Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models (arXiv:2503.06749 • 31 upvotes)
LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL (arXiv:2503.07536 • 88 upvotes)
GTR: Guided Thought Reinforcement Prevents Thought Collapse in RL-based VLM Agent Training (arXiv:2503.08525 • 17 upvotes)
Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning (arXiv:2503.09516 • 38 upvotes)
GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing (arXiv:2503.10639 • 53 upvotes)
VisualPRM: An Effective Process Reward Model for Multimodal Reasoning (arXiv:2503.10291 • 36 upvotes)
Light-R1: Curriculum SFT, DPO and RL for Long COT from Scratch and Beyond (arXiv:2503.10460 • 30 upvotes)
R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization (arXiv:2503.10615 • 17 upvotes)
Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey (arXiv:2503.12605 • 35 upvotes)
R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization (arXiv:2503.12937 • 30 upvotes)
reWordBench: Benchmarking and Improving the Robustness of Reward Models with Transformed Inputs (arXiv:2503.11751 • 17 upvotes)
DAPO: An Open-Source LLM Reinforcement Learning System at Scale (arXiv:2503.14476 • 144 upvotes)
Temporal Consistency for LLM Reasoning Process Error Identification (arXiv:2503.14495 • 11 upvotes)
Towards Self-Improving Systematic Cognition for Next-Generation Foundation MLLMs (arXiv:2503.12303 • 7 upvotes)
SWEET-RL: Training Multi-Turn LLM Agents on Collaborative Reasoning Tasks (arXiv:2503.15478 • 13 upvotes)
Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models (arXiv:2503.16419 • 77 upvotes)
OpenVLThinker: An Early Exploration to Complex Vision-Language Reasoning via Iterative Self-Improvement (arXiv:2503.17352 • 24 upvotes)
I Have Covered All the Bases Here: Interpreting Reasoning Features in Large Language Models via Sparse Autoencoders (arXiv:2503.18878 • 119 upvotes)
SimpleRL-Zoo: Investigating and Taming Zero Reinforcement Learning for Open Base Models in the Wild (arXiv:2503.18892 • 31 upvotes)
Vision-R1: Evolving Human-Free Alignment in Large Vision-Language Models via Vision-Guided Reinforcement Learning (arXiv:2503.18013 • 20 upvotes)
Mind with Eyes: from Language Reasoning to Multimodal Reasoning (arXiv:2503.18071 • 3 upvotes)
Inference-Time Scaling for Flow Models via Stochastic Generation and Rollover Budget Forcing (arXiv:2503.19385 • 34 upvotes)
Think Twice: Enhancing LLM Reasoning by Scaling Multi-round Test-time Thinking (arXiv:2503.19855 • 29 upvotes)
ReSearch: Learning to Reason with Search for LLMs via Reinforcement Learning (arXiv:2503.19470 • 19 upvotes)
ViLBench: A Suite for Vision-Language Process Reward Modeling (arXiv:2503.20271 • 7 upvotes)
Video-R1: Reinforcing Video Reasoning in MLLMs (arXiv:2503.21776 • 79 upvotes)
OThink-MR1: Stimulating multimodal generalized reasoning capabilities via dynamic reinforcement learning (arXiv:2503.16081 • 28 upvotes)
A Survey of Efficient Reasoning for Large Reasoning Models: Language, Multimodality, and Beyond (arXiv:2503.21614 • 43 upvotes)
Exploring Data Scaling Trends and Effects in Reinforcement Learning from Human Feedback (arXiv:2503.22230 • 45 upvotes)
Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model (arXiv:2503.24290 • 62 upvotes)
What, How, Where, and How Well? A Survey on Test-Time Scaling in Large Language Models (arXiv:2503.24235 • 54 upvotes)
Efficient Inference for Large Reasoning Models: A Survey (arXiv:2503.23077 • 46 upvotes)
Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1 (arXiv:2503.24376 • 38 upvotes)
Z1: Efficient Test-time Scaling with Code (arXiv:2504.00810 • 26 upvotes)
Improved Visual-Spatial Reasoning via R1-Zero-Like Training (arXiv:2504.00883 • 67 upvotes)
Understanding R1-Zero-Like Training: A Critical Perspective (arXiv:2503.20783 • 59 upvotes)
Inference-Time Scaling for Generalist Reward Modeling (arXiv:2504.02495 • 58 upvotes)
Rethinking RL Scaling for Vision Language Models: A Transparent, From-Scratch Framework and Comprehensive Evaluation Scheme (arXiv:2504.02587 • 32 upvotes)
Rethinking Reflection in Pre-Training (arXiv:2504.04022 • 80 upvotes)
Quantization Hurts Reasoning? An Empirical Study on Quantized Reasoning Models (arXiv:2504.04823 • 31 upvotes)
VAPO: Efficient and Reliable Reinforcement Learning for Advanced Reasoning Tasks (arXiv:2504.05118 • 26 upvotes)
Why Reasoning Matters? A Survey of Advancements in Multimodal Reasoning (v1) (arXiv:2504.03151 • 15 upvotes)
VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning (arXiv:2504.06958 • 13 upvotes)
DeepSeek-R1 Thoughtology: Let's <think> about LLM Reasoning (arXiv:2504.07128 • 87 upvotes)
VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model (arXiv:2504.07615 • 35 upvotes)
VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning (arXiv:2504.08837 • 43 upvotes)
DUMP: Automated Distribution-Level Curriculum Learning for RL-based LLM Post-training (arXiv:2504.09710 • 19 upvotes)
TinyLLaVA-Video-R1: Towards Smaller LMMs for Video Reasoning (arXiv:2504.09641 • 16 upvotes)
Reasoning Models Can Be Effective Without Thinking (arXiv:2504.09858 • 12 upvotes)
xVerify: Efficient Answer Verifier for Reasoning Model Evaluations (arXiv:2504.10481 • 85 upvotes)
Efficient Reasoning Models: A Survey (arXiv:2504.10903 • 21 upvotes)
Efficient Process Reward Model Training via Active Learning (arXiv:2504.10559 • 13 upvotes)
A Minimalist Approach to LLM Reasoning: from Rejection Sampling to Reinforce (arXiv:2504.11343 • 19 upvotes)
ReTool: Reinforcement Learning for Strategic Tool Use in LLMs (arXiv:2504.11536 • 63 upvotes)
SFT or RL? An Early Investigation into Training R1-Like Reasoning Large Vision-Language Models (arXiv:2504.11468 • 30 upvotes)
NoisyRollout: Reinforcing Visual Reasoning with Data Augmentation (arXiv:2504.13055 • 19 upvotes)
Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model? (arXiv:2504.13837 • 139 upvotes)
Learning to Reason under Off-Policy Guidance (arXiv:2504.14945 • 88 upvotes)
FlowReasoner: Reinforcing Query-Level Meta-Agents (arXiv:2504.15257 • 47 upvotes)
ToolRL: Reward is All Tool Learning Needs (arXiv:2504.13958 • 49 upvotes)
OTC: Optimal Tool Calls via Reinforcement Learning (arXiv:2504.14870 • 35 upvotes)
TTRL: Test-Time Reinforcement Learning (arXiv:2504.16084 • 120 upvotes)
VisuLogic: A Benchmark for Evaluating Visual Reasoning in Multi-modal Large Language Models (arXiv:2504.15279 • 78 upvotes)
Process Reward Models That Think (arXiv:2504.16828 • 18 upvotes)
Skywork R1V2: Multimodal Hybrid Reinforcement Learning for Reasoning (arXiv:2504.16656 • 57 upvotes)
Benchmarking Multimodal Mathematical Reasoning with Explicit Visual Dependency (arXiv:2504.18589 • 13 upvotes)
Reinforcement Learning for Reasoning in Large Language Models with One Training Example (arXiv:2504.20571 • 98 upvotes)
WebThinker: Empowering Large Reasoning Models with Deep Research Capability (arXiv:2504.21776 • 59 upvotes)
Phi-4-Mini-Reasoning: Exploring the Limits of Small Reasoning Language Models in Math (arXiv:2504.21233 • 49 upvotes)
Phi-4-reasoning Technical Report (arXiv:2504.21318 • 54 upvotes)
100 Days After DeepSeek-R1: A Survey on Replication Studies and More Directions for Reasoning Language Models (arXiv:2505.00551 • 36 upvotes)
AdaR1: From Long-CoT to Hybrid-CoT via Bi-Level Adaptive Reasoning Optimization (arXiv:2504.21659 • 14 upvotes)
Llama-Nemotron: Efficient Reasoning Models (arXiv:2505.00949 • 41 upvotes)
RM-R1: Reward Modeling as Reasoning (arXiv:2505.02387 • 81 upvotes)
Optimizing Chain-of-Thought Reasoners via Gradient Variance Minimization in Rejection Sampling and RL (arXiv:2505.02391 • 25 upvotes)
R1-Reward: Training Multimodal Reward Model Through Stable Reinforcement Learning (arXiv:2505.02835 • 28 upvotes)
Unified Multimodal Chain-of-Thought Reward Model through Reinforcement Fine-Tuning (arXiv:2505.03318 • 92 upvotes)
Absolute Zero: Reinforced Self-play Reasoning with Zero Data (arXiv:2505.03335 • 189 upvotes)
ZeroSearch: Incentivize the Search Capability of LLMs without Searching (arXiv:2505.04588 • 65 upvotes)
Scalable Chain of Thoughts via Elastic Reasoning (arXiv:2505.05315 • 26 upvotes)
X-Reasoner: Towards Generalizable Reasoning Across Modalities and Domains (arXiv:2505.03981 • 15 upvotes)
MiMo: Unlocking the Reasoning Potential of Language Model -- From Pretraining to Posttraining (arXiv:2505.07608 • 82 upvotes)
DanceGRPO: Unleashing GRPO on Visual Generation (arXiv:2505.07818 • 32 upvotes)
Skywork-VL Reward: An Effective Reward Model for Multimodal Understanding and Reasoning (arXiv:2505.07263 • 30 upvotes)
AM-Thinking-v1: Advancing the Frontier of Reasoning at 32B Scale (arXiv:2505.08311 • 19 upvotes)
Bring Reason to Vision: Understanding Perception and Reasoning through Model Merging (arXiv:2505.05464 • 11 upvotes)
Omni-R1: Do You Really Need Audio to Fine-Tune Your Audio LLM? (arXiv:2505.09439 • 10 upvotes)
Beyond 'Aha!': Toward Systematic Meta-Abilities Alignment in Large Reasoning Models (arXiv:2505.10554 • 120 upvotes)
WorldPM: Scaling Human Preference Modeling (arXiv:2505.10527 • 34 upvotes)
AdaptThink: Reasoning Models Can Learn When to Think (arXiv:2505.13417 • 83 upvotes)
Thinkless: LLM Learns When to Think (arXiv:2505.13379 • 50 upvotes)
VisionReasoner: Unified Visual Perception and Reasoning via Reinforcement Learning (arXiv:2505.12081 • 18 upvotes)
ExTrans: Multilingual Deep Reasoning Translation via Exemplar-Enhanced Reinforcement Learning (arXiv:2505.12996 • 3 upvotes)
Optimizing Anytime Reasoning via Budget Relative Policy Optimization (arXiv:2505.13438 • 36 upvotes)
VisualQuality-R1: Reasoning-Induced Image Quality Assessment via Reinforcement Learning to Rank (arXiv:2505.14460 • 33 upvotes)
[title missing] (arXiv:2505.14674 • 37 upvotes)
Web-Shepherd: Advancing PRMs for Reinforcing Web Agents (arXiv:2505.15277 • 104 upvotes)
Tool-Star: Empowering LLM-Brained Multi-Tool Reasoner via Reinforcement Learning (arXiv:2505.16410 • 58 upvotes)
Pixel Reasoner: Incentivizing Pixel-Space Reasoning with Curiosity-Driven Reinforcement Learning (arXiv:2505.15966 • 53 upvotes)
GRIT: Teaching MLLMs to Think with Images (arXiv:2505.15879 • 13 upvotes)
VeriThinker: Learning to Verify Makes Reasoning Model Efficient (arXiv:2505.17941 • 25 upvotes)
Synthetic Data RL: Task Definition Is All You Need (arXiv:2505.17063 • 11 upvotes)
One RL to See Them All: Visual Triple Unified Reinforcement Learning (arXiv:2505.18129 • 62 upvotes)
G1: Bootstrapping Perception and Reasoning Abilities of Vision-Language Model via Reinforcement Learning (arXiv:2505.13426 • 13 upvotes)
Active-O3: Empowering Multimodal Large Language Models with Active Perception via GRPO (arXiv:2505.21457 • 16 upvotes)
The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models (arXiv:2505.22617 • 131 upvotes)
Unsupervised Post-Training for Multi-Modal LLM Reasoning via GRPO (arXiv:2505.22453 • 46 upvotes)
Sherlock: Self-Correcting Reasoning in Vision-Language Models (arXiv:2505.22651 • 48 upvotes)
Skywork Open Reasoner 1 Technical Report (arXiv:2505.22312 • 54 upvotes)
Advancing Multimodal Reasoning via Reinforcement Learning with Cold Start (arXiv:2505.22334 • 36 upvotes)
Universal Reasoner: A Single, Composable Plug-and-Play Reasoner for Frozen LLMs (arXiv:2505.19075 • 21 upvotes)
DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning (arXiv:2505.14362 • 4 upvotes)
Table-R1: Inference-Time Scaling for Table Reasoning (arXiv:2505.23621 • 93 upvotes)
GoT-R1: Unleashing Reasoning Capability of MLLM for Visual Generation with Reinforcement Learning (arXiv:2505.17022 • 27 upvotes)
LLMs for Engineering: Teaching Models to Design High Powered Rockets (arXiv:2504.19394 • 13 upvotes)
ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models (arXiv:2505.24864 • 143 upvotes)
DINO-R1: Incentivizing Reasoning Capability in Vision Foundation Models (arXiv:2505.24025 • 27 upvotes)
MoDoMoDo: Multi-Domain Data Mixtures for Multimodal LLM Reinforcement Learning (arXiv:2505.24871 • 23 upvotes)
Harnessing Negative Signals: Reinforcement Distillation from Teacher Data for LLM Reasoning (arXiv:2505.24850 • 8 upvotes)
Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning (arXiv:2506.01939 • 188 upvotes)
SRPO: Enhancing Multimodal LLM Reasoning via Reflection-Aware Reinforcement Learning (arXiv:2506.01713 • 48 upvotes)
AReaL: A Large-Scale Asynchronous Reinforcement Learning System for Language Reasoning (arXiv:2505.24298 • 28 upvotes)
Reflect, Retry, Reward: Self-Improving LLMs via Reinforcement Learning (arXiv:2505.24726 • 277 upvotes)
OThink-R1: Intrinsic Fast/Slow Thinking Mode Switching for Over-Reasoning Mitigation (arXiv:2506.02397 • 36 upvotes)
OpenThoughts: Data Recipes for Reasoning Models (arXiv:2506.04178 • 52 upvotes)
Critique-GRPO: Advancing LLM Reasoning with Natural Language and Numerical Feedback (arXiv:2506.03106 • 6 upvotes)
Reinforcement Pre-Training (arXiv:2506.08007 • 263 upvotes)
Confidence Is All You Need: Few-Shot RL Fine-Tuning of Language Models (arXiv:2506.06395 • 133 upvotes)
[title missing] (arXiv:2506.10910 • 66 upvotes)
MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention (arXiv:2506.13585 • 273 upvotes)
VGR: Visual Grounded Reasoning (arXiv:2506.11991 • 20 upvotes)
A Technical Study into Small Reasoning Language Models (arXiv:2506.13404 • 8 upvotes)
Reinforcement Learning with Verifiable Rewards Implicitly Incentivizes Correct Reasoning in Base LLMs (arXiv:2506.14245 • 45 upvotes)
Truncated Proximal Policy Optimization (arXiv:2506.15050 • 10 upvotes)
Revisiting Reinforcement Learning for LLM Reasoning from A Cross-Domain Perspective (arXiv:2506.14965 • 50 upvotes)
RLPR: Extrapolating RLVR to General Domains without Verifiers (arXiv:2506.18254 • 32 upvotes)
ReasonFlux-PRM: Trajectory-Aware PRMs for Long Chain-of-Thought Reasoning in LLMs (arXiv:2506.18896 • 29 upvotes)
GRPO-CARE: Consistency-Aware Reinforcement Learning for Multimodal Reasoning (arXiv:2506.16141 • 27 upvotes)
SRFT: A Single-Stage Method with Supervised and Reinforcement Fine-Tuning for Reasoning (arXiv:2506.19767 • 15 upvotes)
MiCo: Multi-image Contrast for Reinforcement Visual Reasoning (arXiv:2506.22434 • 10 upvotes)
Jan-nano Technical Report (arXiv:2506.22760 • 9 upvotes)
Listener-Rewarded Thinking in VLMs for Image Preferences (arXiv:2506.22832 • 23 upvotes)
Aha Moment Revisited: Are VLMs Truly Capable of Self Verification in Inference-time Scaling? (arXiv:2506.17417 • 11 upvotes)
ThinkSound: Chain-of-Thought Reasoning in Multimodal Large Language Models for Audio Generation and Editing (arXiv:2506.21448 • 8 upvotes)
GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning (arXiv:2507.01006 • 251 upvotes)
HumanOmniV2: From Understanding to Omni-Modal Reasoning with Context (arXiv:2506.21277 • 14 upvotes)
Thinking with Images for Multimodal Reasoning: Foundations, Methods, and Future Frontiers (arXiv:2506.23918 • 90 upvotes)
Skywork-Reward-V2: Scaling Preference Data Curation via Human-AI Synergy (arXiv:2507.01352 • 56 upvotes)
Energy-Based Transformers are Scalable Learners and Thinkers (arXiv:2507.02092 • 69 upvotes)
A Survey on Latent Reasoning (arXiv:2507.06203 • 93 upvotes)
High-Resolution Visual Reasoning via Multi-Turn Grounding-Based Reinforcement Learning (arXiv:2507.05920 • 12 upvotes)
Perception-Aware Policy Optimization for Multimodal Reasoning (arXiv:2507.06448 • 48 upvotes)
First Return, Entropy-Eliciting Explore (arXiv:2507.07017 • 24 upvotes)
Scaling RL to Long Videos (arXiv:2507.07966 • 160 upvotes)
PyVision: Agentic Vision with Dynamic Tooling (arXiv:2507.07998 • 33 upvotes)
Open Vision Reasoner: Transferring Linguistic Cognitive Behavior for Visual Reasoning (arXiv:2507.05255 • 75 upvotes)
Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities (arXiv:2507.06261 • 67 upvotes)
EXAONE 4.0: Unified Large Language Models Integrating Non-reasoning and Reasoning Modes (arXiv:2507.11407 • 60 upvotes)
The Invisible Leash: Why RLVR May Not Escape Its Origin (arXiv:2507.14843 • 85 upvotes)
Stabilizing Knowledge, Promoting Reasoning: Dual-Token Constraints for RLVR (arXiv:2507.15778 • 21 upvotes)
A Simple "Try Again" Can Elicit Multi-Turn LLM Reasoning (arXiv:2507.14295 • 14 upvotes)
Can One Domain Help Others? A Data-Centric Study on Multi-Domain Reasoning via Reinforcement Learning (arXiv:2507.17512 • 37 upvotes)
Group Sequence Policy Optimization (arXiv:2507.18071 • 317 upvotes)
Agentic Reinforced Policy Optimization (arXiv:2507.19849 • 158 upvotes)
UloRL: An Ultra-Long Output Reinforcement Learning Approach for Advancing Large Language Models' Reasoning Abilities (arXiv:2507.19766 • 15 upvotes)
Beyond Binary Rewards: Training LMs to Reason About Their Uncertainty (arXiv:2507.16806 • 7 upvotes)
TARS: MinMax Token-Adaptive Preference Strategy for Hallucination Reduction in MLLMs (arXiv:2507.21584 • 11 upvotes)
Beyond the Trade-off: Self-Supervised Reinforcement Learning for Reasoning Models' Instruction Following (arXiv:2508.02150 • 37 upvotes)
Is Chain-of-Thought Reasoning of LLMs a Mirage? A Data Distribution Lens (arXiv:2508.01191 • 238 upvotes)
IFDECORATOR: Wrapping Instruction Following Reinforcement Learning with Verifiable Rewards (arXiv:2508.04632 • 2 upvotes)
On the Generalization of SFT: A Reinforcement Learning Perspective with Reward Rectification (arXiv:2508.05629 • 183 upvotes)
R-Zero: Self-Evolving Reasoning LLM from Zero Data (arXiv:2508.05004 • 130 upvotes)
Don't Overthink It: A Survey of Efficient R1-style Large Reasoning Models (arXiv:2508.02120 • 20 upvotes)
Klear-Reasoner: Advancing Reasoning Capability via Gradient-Preserving Clipping Policy Optimization (arXiv:2508.07629 • 43 upvotes)
Reinforcement Learning in Vision: A Survey (arXiv:2508.08189 • 30 upvotes)
Part I: Tricks or Traps? A Deep Dive into RL for LLM Reasoning (arXiv:2508.08221 • 50 upvotes)
SSRL: Self-Search Reinforcement Learning (arXiv:2508.10874 • 97 upvotes)
Thyme: Think Beyond Images (arXiv:2508.11630 • 81 upvotes)
DuPO: Enabling Reliable LLM Self-Verification via Dual Preference Optimization (arXiv:2508.14460 • 85 upvotes)
On-Policy RL Meets Off-Policy Experts: Harmonizing Supervised Fine-Tuning and Reinforcement Learning via Dynamic Weighting (arXiv:2508.11408 • 8 upvotes)
Beyond Pass@1: Self-Play with Variational Problem Synthesis Sustains RLVR (arXiv:2508.14029 • 118 upvotes)
Hermes 4 Technical Report (arXiv:2508.18255 • 44 upvotes)
TreePO: Bridging the Gap of Policy Optimization and Efficacy and Inference Efficiency with Heuristic Tree-based Modeling (arXiv:2508.17445 • 80 upvotes)
Self-Rewarding Vision-Language Model via Reasoning Decomposition (arXiv:2508.19652 • 84 upvotes)
rStar2-Agent: Agentic Reasoning Technical Report (arXiv:2508.20722 • 117 upvotes)
SimpleTIR: End-to-End Reinforcement Learning for Multi-Turn Tool-Integrated Reasoning (arXiv:2509.02479 • 84 upvotes)
VerlTool: Towards Holistic Agentic Reinforcement Learning with Tool Use (arXiv:2509.01055 • 79 upvotes)
Implicit Actor Critic Coupling via a Supervised Learning Framework for RLVR (arXiv:2509.02522 • 26 upvotes)
Beyond Correctness: Harmonizing Process and Outcome Rewards through RL Training (arXiv:2509.03403 • 23 upvotes)
Loong: Synthesize Long Chain-of-Thoughts at Scale through Verifiers (arXiv:2509.03059 • 25 upvotes)
Sharing is Caring: Efficient LM Post-Training with Collective RL Experience Sharing (arXiv:2509.08721 • 662 upvotes)
Mini-o3: Scaling Up Reasoning Patterns and Interaction Turns for Visual Search (arXiv:2509.07969 • 59 upvotes)
A Survey of Reinforcement Learning for Large Reasoning Models (arXiv:2509.08827 • 190 upvotes)
RewardDance: Reward Scaling in Visual Generation (arXiv:2509.08826 • 73 upvotes)
CDE: Curiosity-Driven Exploration for Efficient Reinforcement Learning in Large Language Models (arXiv:2509.09675 • 28 upvotes)
The Majority is not always right: RL training for solution aggregation (arXiv:2509.06870 • 15 upvotes)
Single-stream Policy Optimization (arXiv:2509.13232 • 34 upvotes)
THOR: Tool-Integrated Hierarchical Optimization via RL for Mathematical Reasoning (arXiv:2509.13761 • 16 upvotes)
Improving Context Fidelity via Native Retrieval-Augmented Reasoning (arXiv:2509.13683 • 8 upvotes)
Reinforcement Learning on Pre-Training Data (arXiv:2509.19249 • 67 upvotes)
MAPO: Mixed Advantage Policy Optimization (arXiv:2509.18849 • 27 upvotes)
Knapsack RL: Unlocking Exploration of LLMs via Optimizing Budget Allocation (arXiv:2509.25849 • 48 upvotes)
Agentic Entropy-Balanced Policy Optimization (arXiv:2510.14545 • 106 upvotes)
Every Step Evolves: Scaling Reinforcement Learning for Trillion-Scale Thinking Model (arXiv:2510.18855 • 73 upvotes)
Tiny Model, Big Logic: Diversity-Driven Optimization Elicits Large-Model Reasoning Ability in VibeThinker-1.5B (arXiv:2511.06221 • 132 upvotes)
Stabilizing Reinforcement Learning with LLMs: Formulation and Practices (arXiv:2512.01374 • 105 upvotes)
DeepSeekMath-V2: Towards Self-Verifiable Mathematical Reasoning (arXiv:2511.22570 • 91 upvotes)
Robust-R1: Degradation-Aware Reasoning for Robust Visual Understanding (arXiv:2512.17532 • 67 upvotes)
Token-Budget-Aware LLM Reasoning (arXiv:2412.18547 • 46 upvotes)
GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization (arXiv:2601.05242 • 228 upvotes)