Paper • 2402.03570 • Published • 8

Iterative Data Smoothing: Mitigating Reward Overfitting and Overoptimization in RLHF
Paper • 2401.16335 • Published • 1

Towards Efficient and Exact Optimization of Language Model Alignment
Paper • 2402.00856 • Published • 1

ODIN: Disentangled Reward Mitigates Hacking in RLHF
Paper • 2402.07319 • Published • 14

Preference-free Alignment Learning with Regularized Relevance Reward
Paper • 2402.03469 • Published

Teaching Large Language Models to Reason with Reinforcement Learning
Paper • 2403.04642 • Published • 48

RewardBench: Evaluating Reward Models for Language Modeling
Paper • 2403.13787 • Published • 22

PERL: Parameter Efficient Reinforcement Learning from Human Feedback
Paper • 2403.10704 • Published • 60

Stop Regressing: Training Value Functions via Classification for Scalable Deep RL
Paper • 2403.03950 • Published • 15

In deep reinforcement learning, a pruned network is a good network
Paper • 2402.12479 • Published • 19

Direct Nash Optimization: Teaching Language Models to Self-Improve with General Preferences
Paper • 2404.03715 • Published • 62

Learn Your Reference Model for Real Good Alignment
Paper • 2404.09656 • Published • 90

Offline Regularised Reinforcement Learning for Large Language Models Alignment
Paper • 2405.19107 • Published • 15

Self-Improving Robust Preference Optimization
Paper • 2406.01660 • Published • 20

Mistral-C2F: Coarse to Fine Actor for Analytical and Reasoning Enhancement in RLHF and Effective-Merged LLMs
Paper • 2406.08657 • Published • 10

BPO: Supercharging Online Preference Learning by Adhering to the Proximity of Behavior LLM
Paper • 2406.12168 • Published • 7

THEANINE: Revisiting Memory Management in Long-term Conversations with Timeline-augmented Response Generation
Paper • 2406.10996 • Published • 35

WPO: Enhancing RLHF with Weighted Preference Optimization
Paper • 2406.11827 • Published • 17

Understanding and Diagnosing Deep Reinforcement Learning
Paper • 2406.16979 • Published • 10

Gradient Boosting Reinforcement Learning
Paper • 2407.08250 • Published • 13

Understanding Reference Policies in Direct Preference Optimization
Paper • 2407.13709 • Published • 17

Leveraging Skills from Unlabeled Prior Data for Efficient Online Exploration
Paper • 2410.18076 • Published • 4

Beyond Examples: High-level Automated Reasoning Paradigm in In-Context Learning via MCTS
Paper • 2411.18478 • Published • 37

A Simple and Provable Scaling Law for the Test-Time Compute of Large Language Models
Paper • 2411.19477 • Published • 6

A Minimalist Approach to LLM Reasoning: from Rejection Sampling to Reinforce
Paper • 2504.11343 • Published • 20