SPaR: Self-Play with Tree-Search Refinement to Improve Instruction-Following in Large Language Models Paper • 2412.11605 • Published 10 days ago • 15
TÜLU 3: Pushing Frontiers in Open Language Model Post-Training Paper • 2411.15124 • Published Nov 22 • 56
Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback Paper • 2406.09279 • Published Jun 13 • 2
Preference Fine-Tuning of LLMs Should Leverage Suboptimal, On-Policy Data Paper • 2404.14367 • Published Apr 22 • 1