MIG: Automatic Data Selection for Instruction Tuning by Maximizing Information Gain in Semantic Space
Abstract
Data quality and diversity are key to the construction of effective instruction-tuning datasets. With the increasing availability of open-source instruction-tuning datasets, it is advantageous to automatically select high-quality and diverse subsets from a vast amount of data. Existing methods typically prioritize instance quality and use heuristic rules to maintain diversity. However, this absence of a comprehensive view of the entire collection often leads to suboptimal results. Moreover, heuristic rules generally focus on distance or clustering within the embedding space, which fails to accurately capture the intent of complex instructions in the semantic space. To bridge this gap, we propose a unified method for quantifying the information content of datasets. This method models the semantic space by constructing a label graph and quantifies diversity based on the distribution of information within the graph. Based on such a measurement, we further introduce an efficient sampling method that selects data samples iteratively to Maximize the Information Gain (MIG) in semantic space. Experiments on various datasets and base models demonstrate that MIG consistently outperforms state-of-the-art methods. Notably, the model fine-tuned with 5% of the Tulu3 data sampled by MIG achieves performance comparable to the official SFT model trained on the full dataset, with improvements of +5.73% on AlpacaEval and +6.89% on Wildbench.
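The abstract describes iterative sampling that maximizes information gain over a label graph. As a rough illustration only (not the paper's implementation), the sketch below greedily picks samples whose labels add the most marginal utility under a concave per-label score; the `utility` function, the `(labels, quality)` sample format, and the omission of graph-based label propagation are all simplifying assumptions.

```python
import math

def utility(mass):
    # Concave per-label utility: diminishing returns push selection
    # toward under-covered labels. A stand-in for the paper's
    # information measure, whose exact form may differ.
    return math.log(1.0 + mass)

def marginal_gain(label_mass, sample):
    # Information gained by adding one sample, summed over its labels.
    labels, quality = sample
    gain = 0.0
    for lab in labels:
        m = label_mass.get(lab, 0.0)
        gain += utility(m + quality) - utility(m)
    return gain

def greedy_select(samples, k):
    """Greedy selection maximizing total gain (hypothetical sketch).

    samples: list of (labels, quality_score) tuples.
    Returns indices of the k selected samples, in selection order.
    """
    label_mass = {}
    remaining = set(range(len(samples)))
    chosen = []
    for _ in range(min(k, len(samples))):
        best = max(remaining,
                   key=lambda i: marginal_gain(label_mass, samples[i]))
        chosen.append(best)
        remaining.remove(best)
        labels, quality = samples[best]
        for lab in labels:
            label_mass[lab] = label_mass.get(lab, 0.0) + quality
    return chosen

# Example: a second "math" sample yields less gain than new topics,
# so the selection spreads across labels despite lower quality scores.
samples = [({"math"}, 1.0), ({"math"}, 1.0),
           ({"code"}, 0.9), ({"chat"}, 0.8)]
picked = greedy_select(samples, 3)
```

Because the concave utility makes repeated labels less valuable, the example selects one sample per topic rather than both high-quality "math" duplicates.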
Community
- Project page: https://yichengchen24.github.io/projects/mig
- Code: https://github.com/yichengchen24/MIG
"information gain metric" is great for in-context examplar selection too. Cool stuff.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- MergeIT: From Selection to Merging for Efficient Instruction Tuning (2025)
- Measuring Data Diversity for Instruction Tuning: A Systematic Analysis and A Reliable Metric (2025)
- Add-One-In: Incremental Sample Selection for Large Language Models via a Choice-Based Greedy Paradigm (2025)
- CrowdSelect: Synthetic Instruction Data Selection with Multi-LLM Wisdom (2025)
- D3: Diversity, Difficulty, and Dependability-Aware Data Selection for Sample-Efficient LLM Instruction Tuning (2025)
- MDIT: A Model-free Data Interpolation Method for Diverse Instruction Tuning (2025)
- Low-Confidence Gold: Refining Low-Confidence Samples for Efficient Instruction Tuning (2025)
Models citing this paper 1
Datasets citing this paper 2
Spaces citing this paper 0