RRM: Robust Reward Model Training Mitigates Reward Hacking • Paper • arXiv:2409.13156 • Published Sep 20, 2024
Building Math Agents with Multi-Turn Iterative Preference Learning • Paper • arXiv:2409.02392 • Published Sep 4, 2024
LiPO: Listwise Preference Optimization through Learning-to-Rank • Paper • arXiv:2402.01878 • Published Feb 2, 2024