ORPO: Monolithic Preference Optimization without Reference Model Paper • 2403.07691 • Published Mar 12, 2024 • 64
Teaching Large Language Models to Reason with Reinforcement Learning Paper • 2403.04642 • Published Mar 7, 2024 • 47
Best Practices and Lessons Learned on Synthetic Data for Language Models Paper • 2404.07503 • Published Apr 11, 2024 • 30
Insights into Alignment: Evaluating DPO and its Variants Across Multiple Tasks Paper • 2404.14723 • Published Apr 23, 2024 • 10
OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework Paper • 2405.11143 • Published May 20, 2024 • 37
Mixtures of Experts Unlock Parameter Scaling for Deep RL Paper • 2402.08609 • Published Feb 13, 2024 • 36
Scaling Laws for Reward Model Overoptimization in Direct Alignment Algorithms Paper • 2406.02900 • Published Jun 5, 2024 • 12
Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs Paper • 2402.14740 • Published Feb 22, 2024 • 13
HelpSteer2: Open-source dataset for training top-performing reward models Paper • 2406.08673 • Published Jun 12, 2024 • 19
Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback Paper • 2406.09279 • Published Jun 13, 2024 • 2
Understanding the performance gap between online and offline alignment algorithms Paper • 2405.08448 • Published May 14, 2024 • 19
Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision Paper • 2312.09390 • Published Dec 14, 2023 • 33
Iterative Preference Learning from Human Feedback: Bridging Theory and Practice for RLHF under KL-Constraint Paper • 2312.11456 • Published Dec 18, 2023 • 1
RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment Paper • 2304.06767 • Published Apr 13, 2023 • 2
Self-Play Preference Optimization for Language Model Alignment Paper • 2405.00675 • Published May 1, 2024 • 27
Regularizing Hidden States Enables Learning Generalizable Reward Model for LLMs Paper • 2406.10216 • Published Jun 14, 2024 • 2
AgentInstruct: Toward Generative Teaching with Agentic Flows Paper • 2407.03502 • Published Jul 3, 2024 • 50
Online Merging Optimizers for Boosting Rewards and Mitigating Tax in Alignment Paper • 2405.17931 • Published May 28, 2024
Monte Carlo Tree Search Boosts Reasoning via Iterative Preference Learning Paper • 2405.00451 • Published May 1, 2024
Foundations of Reinforcement Learning and Interactive Decision Making Paper • 2312.16730 • Published Dec 27, 2023
Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents Paper • 2408.07199 • Published Aug 13, 2024 • 21
Disentangling Length from Quality in Direct Preference Optimization Paper • 2403.19159 • Published Mar 28, 2024
Imitating Language via Scalable Inverse Reinforcement Learning Paper • 2409.01369 • Published Sep 2, 2024
Contrastive Preference Learning: Learning from Human Feedback without RL Paper • 2310.13639 • Published Oct 20, 2023 • 25
D2PO: Discriminator-Guided DPO with Response Evaluation Models Paper • 2405.01511 • Published May 2, 2024
Anchored Preference Optimization and Contrastive Revisions: Addressing Underspecification in Alignment Paper • 2408.06266 • Published Aug 12, 2024 • 10
Training Language Models to Self-Correct via Reinforcement Learning Paper • 2409.12917 • Published Sep 19, 2024 • 137
The Perfect Blend: Redefining RLHF with Mixture of Judges Paper • 2409.20370 • Published Sep 30, 2024 • 5
HelpSteer2-Preference: Complementing Ratings with Preferences Paper • 2410.01257 • Published Oct 2, 2024 • 23
A Critical Evaluation of AI Feedback for Aligning Large Language Models Paper • 2402.12366 • Published Feb 19, 2024 • 3
Rewarding Progress: Scaling Automated Process Verifiers for LLM Reasoning Paper • 2410.08146 • Published Oct 10, 2024
RLEF: Grounding Code LLMs in Execution Feedback with Reinforcement Learning Paper • 2410.02089 • Published Oct 2, 2024 • 12
SALSA: Soup-based Alignment Learning for Stronger Adaptation in RLHF Paper • 2411.01798 • Published Nov 4, 2024 • 8
OpenRFT: Adapting Reasoning Foundation Model for Domain-specific Tasks with Reinforcement Fine-Tuning Paper • 2412.16849 • Published Dec 22, 2024 • 9
rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking Paper • 2501.04519 • Published Jan 8, 2025 • 258
Thoughts Are All Over the Place: On the Underthinking of o1-Like LLMs Paper • 2501.18585 • Published Jan 30, 2025 • 56
SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution Paper • 2502.18449 • Published Feb 25, 2025 • 34