GFT: From Imitation to Reward Fine-Tuning with Unbiased Group Advantages and Dynamic Coefficient Rectification
Abstract
Group Fine-Tuning addresses limitations in supervised fine-tuning by using diverse response groups and adaptive weight bounding to improve training stability and efficiency.
Large language models are typically post-trained using supervised fine-tuning (SFT) and reinforcement learning (RL), yet effectively unifying efficient knowledge injection with robust generalization remains challenging. In this work, we provide a training-dynamics analysis showing that SFT can be interpreted as a special case of policy gradient optimization with an extremely sparse implicit reward and unstable inverse-probability weighting, which together lead to single-path dependency, entropy collapse, and gradient explosion. Motivated by this diagnosis, we propose Group Fine-Tuning (GFT), a unified post-training framework that addresses these intrinsic limitations through two mechanisms: Group Advantage Learning, which constructs diverse response groups and derives normalized contrastive supervision to alleviate reward sparsity, and Dynamic Coefficient Rectification, which adaptively bounds inverse-probability weights to stabilize optimization while preserving efficient knowledge injection. Experiments demonstrate that GFT consistently surpasses SFT-based methods and yields policies that integrate more smoothly with subsequent RL training.
Community
Hi everyone, I'd like to share our lab's recent work, accepted to ACL 2026 Findings: GFT: From Imitation to Reward Fine-Tuning with Unbiased Group Advantages and Dynamic Coefficient Rectification
Large language models currently rely heavily on SFT and RL during the post-training phase, but how to effectively integrate the two remains a bottleneck. Although SFT can efficiently inject knowledge, it is limited by extremely sparse implicit rewards and unstable inverse-probability weighting, making it prone to "single-path dependency" and "gradient explosion". These flaws not only lead to catastrophic forgetting but also severely compress the policy's exploration space, so the common "SFT + RL (e.g., GRPO)" pipeline faces a "Synergy Dilemma" that greatly diminishes the subsequent gains from RL.
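To see why the unstable inverse-probability weighting mentioned above matters: for a one-hot SFT target, the cross-entropy gradient with respect to the target token's probability carries a 1/p coefficient, which blows up as p approaches zero. A minimal numeric illustration of that coefficient (not the paper's code; the function name is ours):

```python
def sft_token_weight(p: float) -> float:
    """Magnitude of d(-log p)/dp is 1/p: low-probability expert
    tokens receive disproportionately large updates, which is the
    source of the gradient-explosion instability discussed above."""
    return 1.0 / p

# A confident token vs. a rare one: the weights differ by roughly
# four orders of magnitude, so a single low-probability expert token
# can dominate the batch gradient.
print(sft_token_weight(0.5))   # 2.0
print(sft_token_weight(5e-5))  # ~20000
```

This is why simply masking or down-weighting such tokens is not enough: they often carry the new knowledge being injected, motivating a bounded rather than discarded coefficient.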
To address this, we propose GFT (Group Fine-Tuning), a single-stage fine-tuning framework that views SFT as a special case of reinforcement learning and resolves its intrinsic deficiencies from a training-dynamics perspective. Our framework includes two key designs:
- Group Advantage Learning (GAL): integrates expert demonstrations, teacher distillation, and self-sampling to construct a hybrid response group, using normalized relative advantages for contrastive supervision. This effectively breaks single-path dependency and preserves exploration diversity.
- Dynamic Coefficient Rectification (DCR): adaptively bounds the inverse-probability weights of extreme tokens, precisely suppressing gradient explosion, significantly stabilizing optimization, and mitigating catastrophic forgetting while still injecting new knowledge efficiently.
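A rough sketch of how the two mechanisms could look, under our reading of the description above: GAL normalizes rewards within a hybrid response group, and DCR caps the inverse-probability coefficient. Function names, the normalization details, and the cap value are illustrative assumptions, not the paper's implementation:

```python
import statistics

def group_advantages(rewards):
    """GAL-style sketch: center and scale rewards within a hybrid
    response group, so supervision is contrastive across responses
    rather than tied to a single expert path (assumed normalization)."""
    mu = statistics.fmean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # guard degenerate groups
    return [(r - mu) / sigma for r in rewards]

def rectified_coefficient(p, cap=100.0):
    """DCR-style sketch: bound the 1/p weight of extreme tokens so a
    single rare token cannot dominate the gradient (cap is assumed)."""
    return min(1.0 / p, cap)

# Group of 4 responses, e.g. expert demo, teacher sample, two self-samples:
advs = group_advantages([1.0, 1.0, 0.0, 0.0])   # [1.0, 1.0, -1.0, -1.0]
# A p = 1e-6 token would get weight 1e6 under plain SFT; here it is
# capped at 100, while ordinary tokens keep their natural weight.
weights = [rectified_coefficient(p) for p in (0.4, 1e-6, 0.9)]
```

The advantages sum to zero within each group, which is what makes the supervision relative rather than absolute; the cap leaves well-calibrated tokens untouched and only rectifies the extremes.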
We conducted extensive experiments across various mainstream models (such as Qwen2.5 and Llama-3) on 11 mathematical reasoning benchmarks, including AMC23, MATH, and OlympiadBench. The results show that GFT is exceptionally data-efficient (surpassing the 100k-example SFT baseline with only 10k examples), significantly reduces KL-divergence drift, and provides a much stronger cold-start policy for subsequent RL, substantially raising the model's performance ceiling.
The paper and code have both been released. We welcome everyone to discuss and exchange ideas, and wish you all a productive week of research! Likes and stars on the links below are greatly appreciated.
Paper: https://arxiv.org/abs/2604.14258
GitHub: https://github.com/ZJU-OmniAI/GFT/tree/main