ORPO: Monolithic Preference Optimization without Reference Model Paper • 2403.07691 • Published Mar 12, 2024 • 64
Teaching Large Language Models to Reason with Reinforcement Learning Paper • 2403.04642 • Published Mar 7, 2024 • 47
Best Practices and Lessons Learned on Synthetic Data for Language Models Paper • 2404.07503 • Published Apr 11, 2024 • 30
Insights into Alignment: Evaluating DPO and its Variants Across Multiple Tasks Paper • 2404.14723 • Published Apr 23, 2024 • 10
OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework Paper • 2405.11143 • Published May 20, 2024 • 37
Mixtures of Experts Unlock Parameter Scaling for Deep RL Paper • 2402.08609 • Published Feb 13, 2024 • 36
Scaling Laws for Reward Model Overoptimization in Direct Alignment Algorithms Paper • 2406.02900 • Published Jun 5, 2024 • 12
Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs Paper • 2402.14740 • Published Feb 22, 2024 • 13
HelpSteer2: Open-source dataset for training top-performing reward models Paper • 2406.08673 • Published Jun 12, 2024 • 19
Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback Paper • 2406.09279 • Published Jun 13, 2024 • 2
Understanding the performance gap between online and offline alignment algorithms Paper • 2405.08448 • Published May 14, 2024 • 19
Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision Paper • 2312.09390 • Published Dec 14, 2023 • 33
Iterative Preference Learning from Human Feedback: Bridging Theory and Practice for RLHF under KL-Constraint Paper • 2312.11456 • Published Dec 18, 2023 • 1
RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment Paper • 2304.06767 • Published Apr 13, 2023 • 2
Self-Play Preference Optimization for Language Model Alignment Paper • 2405.00675 • Published May 1, 2024 • 27
Regularizing Hidden States Enables Learning Generalizable Reward Model for LLMs Paper • 2406.10216 • Published Jun 14, 2024 • 2
AgentInstruct: Toward Generative Teaching with Agentic Flows Paper • 2407.03502 • Published Jul 3, 2024 • 50
Online Merging Optimizers for Boosting Rewards and Mitigating Tax in Alignment Paper • 2405.17931 • Published May 28, 2024
Monte Carlo Tree Search Boosts Reasoning via Iterative Preference Learning Paper • 2405.00451 • Published May 1, 2024
Foundations of Reinforcement Learning and Interactive Decision Making Paper • 2312.16730 • Published Dec 27, 2023
Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents Paper • 2408.07199 • Published Aug 13, 2024 • 21
Disentangling Length from Quality in Direct Preference Optimization Paper • 2403.19159 • Published Mar 28, 2024
Imitating Language via Scalable Inverse Reinforcement Learning Paper • 2409.01369 • Published Sep 2, 2024
Contrastive Preference Learning: Learning from Human Feedback without RL Paper • 2310.13639 • Published Oct 20, 2023 • 25
D2PO: Discriminator-Guided DPO with Response Evaluation Models Paper • 2405.01511 • Published May 2, 2024
Anchored Preference Optimization and Contrastive Revisions: Addressing Underspecification in Alignment Paper • 2408.06266 • Published Aug 12, 2024 • 10
Training Language Models to Self-Correct via Reinforcement Learning Paper • 2409.12917 • Published Sep 19, 2024 • 137
The Perfect Blend: Redefining RLHF with Mixture of Judges Paper • 2409.20370 • Published Sep 30, 2024 • 5
HelpSteer2-Preference: Complementing Ratings with Preferences Paper • 2410.01257 • Published Oct 2, 2024 • 23
A Critical Evaluation of AI Feedback for Aligning Large Language Models Paper • 2402.12366 • Published Feb 19, 2024 • 3
Rewarding Progress: Scaling Automated Process Verifiers for LLM Reasoning Paper • 2410.08146 • Published Oct 10, 2024
RLEF: Grounding Code LLMs in Execution Feedback with Reinforcement Learning Paper • 2410.02089 • Published Oct 2, 2024 • 12
SALSA: Soup-based Alignment Learning for Stronger Adaptation in RLHF Paper • 2411.01798 • Published Nov 4, 2024 • 8
OpenRFT: Adapting Reasoning Foundation Model for Domain-specific Tasks with Reinforcement Fine-Tuning Paper • 2412.16849 • Published Dec 22, 2024 • 9
rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking Paper • 2501.04519 • Published Jan 8, 2025 • 258
Thoughts Are All Over the Place: On the Underthinking of o1-Like LLMs Paper • 2501.18585 • Published Jan 30, 2025 • 56
SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution Paper • 2502.18449 • Published Feb 25, 2025 • 34