Papers
arxiv:2604.14258

GFT: From Imitation to Reward Fine-Tuning with Unbiased Group Advantages and Dynamic Coefficient Rectification

Published on Apr 15 · Submitted by Wenqi Zhang on Apr 21

Abstract

Group Fine-Tuning addresses limitations in supervised fine-tuning by using diverse response groups and adaptive weight bounding to improve training stability and efficiency.

AI-generated summary

Large language models are typically post-trained using supervised fine-tuning (SFT) and reinforcement learning (RL), yet effectively unifying efficient knowledge injection with robust generalization remains challenging. In this work, we provide a training-dynamics analysis showing that SFT can be interpreted as a special case of policy gradient optimization with an extremely sparse implicit reward and unstable inverse-probability weighting, which together lead to single-path dependency, entropy collapse, and gradient explosion. Motivated by this diagnosis, we propose Group Fine-Tuning (GFT), a unified post-training framework that addresses these intrinsic limitations through two mechanisms: Group Advantage Learning, which constructs diverse response groups and derives normalized contrastive supervision to alleviate reward sparsity, and Dynamic Coefficient Rectification, which adaptively bounds inverse-probability weights to stabilize optimization while preserving efficient knowledge injection. Experiments demonstrate that GFT consistently surpasses SFT-based methods and yields policies that integrate more smoothly with subsequent RL training.
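The "SFT as a special case of policy gradient" claim can be made concrete with a short, standard derivation (our own notation, not necessarily the paper's): the gradient of the SFT loss on a single demonstration y* can be rewritten as an expectation over policy samples,

```latex
% SFT gradient on a demonstration y* given prompt x:
\nabla_\theta \mathcal{L}_{\text{SFT}}
  = -\nabla_\theta \log \pi_\theta(y^\ast \mid x)
% Rewritten as a policy-gradient expectation over samples y ~ pi_theta:
  = -\,\mathbb{E}_{y \sim \pi_\theta}\!\left[
      \frac{\mathbb{1}[y = y^\ast]}{\pi_\theta(y^\ast \mid x)}
      \,\nabla_\theta \log \pi_\theta(y \mid x)
    \right]
```

i.e., policy gradient with an implicit reward that is nonzero only on the single demonstrated path (hence extremely sparse) and an inverse-probability weight 1/π_θ(y*|x) that blows up whenever the demonstration is unlikely under the current policy — the two pathologies the abstract names.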

Community

Hi everyone, I'd like to share our lab's recent work, which has been accepted to ACL 2026 Findings: GFT: From Imitation to Reward Fine-Tuning with Unbiased Group Advantages and Dynamic Coefficient Rectification.

Large language models currently rely heavily on SFT and RL in the post-training phase, but how to integrate the two effectively remains a bottleneck. Although SFT can efficiently inject knowledge, it is limited by extremely sparse implicit rewards and unstable inverse-probability weighting, making it prone to "single-path dependency" and "gradient explosion". These flaws not only cause catastrophic forgetting but also severely compress the policy's exploration space, so the common "SFT + RL (e.g., GRPO)" pipeline faces a "Synergy Dilemma" that greatly diminishes the subsequent gains from RL.

To address this, we propose GFT (Group Fine-Tuning), a single-stage fine-tuning framework that views SFT as a special case of reinforcement learning and resolves its intrinsic deficiencies from a training-dynamics perspective. The framework has two key designs:
- Group Advantage Learning (GAL): integrates expert demonstrations, teacher distillation, and self-sampling into a hybrid response group, then uses normalized relative advantages as contrastive supervision. This breaks single-path dependency and preserves exploration diversity.
- Dynamic Coefficient Rectification (DCR): adaptively bounds the inverse-probability weights of extreme tokens, suppressing gradient explosion. This stabilizes optimization and mitigates catastrophic forgetting while still injecting new knowledge efficiently.
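The two mechanisms above can be sketched together in a few lines of PyTorch. This is an illustrative toy loss under our own assumptions (sequence-level log-probs, scalar rewards for each group member, a fixed clip threshold), not the paper's actual implementation; `gft_loss`, `clip_max`, and the reward convention are all hypothetical.

```python
import torch

def gft_loss(logps, rewards, clip_max=5.0, eps=1e-6):
    """Illustrative GFT-style loss (a sketch, not the paper's formulation).

    logps:   (G,) summed log-probs of G group responses under the policy
    rewards: (G,) scalar rewards for the hybrid group
             (e.g. expert demo, teacher distillation, self-samples)
    """
    # Group Advantage Learning (sketch): normalize rewards within the group
    # so supervision is contrastive rather than a single sparse target.
    adv = (rewards - rewards.mean()) / (rewards.std() + eps)

    # Dynamic Coefficient Rectification (sketch): bound the inverse-probability
    # weight 1 / pi_theta(y) so unlikely responses cannot explode the gradient.
    with torch.no_grad():
        inv_p = torch.exp(-logps).clamp(max=clip_max)

    # Policy-gradient-style objective: advantage-weighted log-likelihood.
    return -(inv_p * adv * logps).mean()
```

With unbounded `inv_p` a low-probability expert demonstration would dominate the batch gradient; the clamp is what keeps the update stable while the group-normalized advantage keeps the signal dense.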

We conducted extensive experiments across mainstream models (such as Qwen2.5 and Llama-3) on 11 mathematical reasoning benchmarks, including AMC23, MATH, and OlympiadBench. The results show that GFT is highly data-efficient (surpassing the 100k-example SFT baseline with only 10k examples), significantly reduces KL-divergence drift, and provides a much stronger cold-start policy for subsequent RL, substantially raising the model's performance ceiling.

The paper and code have both been released. We welcome everyone to discuss and exchange ideas, and wish you all a productive week of research! Likes and stars via the links below are greatly appreciated.

๐Ÿ“ Paper: https://arxiv.org/abs/2604.14258
โญ Github: https://github.com/ZJU-OmniAI/GFT/tree/main (Stars are highly appreciated! ๐Ÿฅณ)



