Papers
arxiv:2604.14258

GFT: From Imitation to Reward Fine-Tuning with Unbiased Group Advantages and Dynamic Coefficient Rectification

Published on Apr 15 · Submitted by Wenqi Zhang on Apr 21

Abstract

Group Fine-Tuning addresses limitations in supervised fine-tuning by using diverse response groups and adaptive weight bounding to improve training stability and efficiency.

AI-generated summary

Large language models are typically post-trained using supervised fine-tuning (SFT) and reinforcement learning (RL), yet effectively unifying efficient knowledge injection with robust generalization remains challenging. In this work, we provide a training-dynamics analysis showing that SFT can be interpreted as a special case of policy gradient optimization with an extremely sparse implicit reward and unstable inverse-probability weighting, which together lead to single-path dependency, entropy collapse, and gradient explosion. Motivated by this diagnosis, we propose Group Fine-Tuning (GFT), a unified post-training framework that addresses these intrinsic limitations through two mechanisms: Group Advantage Learning, which constructs diverse response groups and derives normalized contrastive supervision to alleviate reward sparsity, and Dynamic Coefficient Rectification, which adaptively bounds inverse-probability weights to stabilize optimization while preserving efficient knowledge injection. Experiments demonstrate that GFT consistently surpasses SFT-based methods and yields policies that integrate more smoothly with subsequent RL training.
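The "SFT as a special case of policy gradient" claim can be made concrete with a short, standard derivation (our own notation, not necessarily the paper's): the gradient of the SFT loss on a single demonstration y* can be rewritten as an expectation over policy samples,

```latex
% SFT gradient on a demonstration y* given prompt x:
\nabla_\theta \mathcal{L}_{\text{SFT}}
  = -\nabla_\theta \log \pi_\theta(y^\ast \mid x)
% Rewritten as a policy-gradient expectation over samples y ~ pi_theta:
  = -\,\mathbb{E}_{y \sim \pi_\theta}\!\left[
      \frac{\mathbb{1}[y = y^\ast]}{\pi_\theta(y^\ast \mid x)}
      \,\nabla_\theta \log \pi_\theta(y \mid x)
    \right]
```

i.e., policy gradient with an implicit reward that is nonzero only on the single demonstrated path (hence extremely sparse) and an inverse-probability weight 1/π_θ(y*|x) that blows up whenever the demonstration is unlikely under the current policy — the two pathologies the abstract names.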

Community

Hi everyone, I'd like to share our lab's recent work, which has been accepted to ACL 2026 Findings: GFT: From Imitation to Reward Fine-Tuning with Unbiased Group Advantages and Dynamic Coefficient Rectification.

Large language models currently rely heavily on SFT and RL in the post-training phase, but how to integrate the two effectively remains a bottleneck. Although SFT can efficiently inject knowledge, it is limited by extremely sparse implicit rewards and unstable inverse-probability weighting, making it prone to "single-path dependency" and "gradient explosion". These flaws not only cause catastrophic forgetting but also severely compress the policy's exploration space, so the common "SFT + RL (e.g., GRPO)" pipeline faces a "Synergy Dilemma" that greatly diminishes the subsequent gains from RL.

To address this, we propose GFT (Group Fine-Tuning), a single-stage fine-tuning framework that views SFT as a special case of reinforcement learning and resolves its intrinsic deficiencies from a training-dynamics perspective. The framework has two key designs:
- Group Advantage Learning (GAL): integrates expert demonstrations, teacher distillation, and self-sampling into a hybrid response group, then uses normalized relative advantages as contrastive supervision. This breaks single-path dependency and preserves exploration diversity.
- Dynamic Coefficient Rectification (DCR): adaptively bounds the inverse-probability weights of extreme tokens, suppressing gradient explosion. This stabilizes optimization and mitigates catastrophic forgetting while still injecting new knowledge efficiently.
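The two mechanisms above can be sketched together in a few lines of PyTorch. This is an illustrative toy loss under our own assumptions (sequence-level log-probs, scalar rewards for each group member, a fixed clip threshold), not the paper's actual implementation; `gft_loss`, `clip_max`, and the reward convention are all hypothetical.

```python
import torch

def gft_loss(logps, rewards, clip_max=5.0, eps=1e-6):
    """Illustrative GFT-style loss (a sketch, not the paper's formulation).

    logps:   (G,) summed log-probs of G group responses under the policy
    rewards: (G,) scalar rewards for the hybrid group
             (e.g. expert demo, teacher distillation, self-samples)
    """
    # Group Advantage Learning (sketch): normalize rewards within the group
    # so supervision is contrastive rather than a single sparse target.
    adv = (rewards - rewards.mean()) / (rewards.std() + eps)

    # Dynamic Coefficient Rectification (sketch): bound the inverse-probability
    # weight 1 / pi_theta(y) so unlikely responses cannot explode the gradient.
    with torch.no_grad():
        inv_p = torch.exp(-logps).clamp(max=clip_max)

    # Policy-gradient-style objective: advantage-weighted log-likelihood.
    return -(inv_p * adv * logps).mean()
```

With unbounded `inv_p` a low-probability expert demonstration would dominate the batch gradient; the clamp is what keeps the update stable while the group-normalized advantage keeps the signal dense.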

We conducted extensive experiments across mainstream models (such as Qwen2.5 and Llama-3) on 11 mathematical reasoning benchmarks, including AMC23, MATH, and OlympiadBench. The results show that GFT is highly data-efficient (surpassing the 100k-example SFT baseline with only 10k examples), significantly reduces KL-divergence drift, and provides a much stronger cold-start policy for subsequent RL, substantially raising the model's performance ceiling.

The paper and code have both been released. We welcome everyone to discuss and exchange ideas, and wish you all a productive week of research! Likes and stars via the links below are greatly appreciated.

๐Ÿ“ Paper: https://arxiv.org/abs/2604.14258
โญ Github: https://github.com/ZJU-OmniAI/GFT/tree/main (Stars are highly appreciated! ๐Ÿฅณ)



