Our datasets were designed to enhance the model's instruction-following, general conversation, mathematics, coding, and Chinese-language capabilities. We selected data from open-source community datasets, applying targeted filtering and preprocessing. Key datasets and filtering criteria included:

- **Instruction Following & General Conversation**: Sourced from [UltraFeedback](https://huggingface.co/datasets/openbmb/UltraFeedback), [Magpie-Pro-DPO-100K-v0.1](https://huggingface.co/datasets/Magpie-Align/Magpie-Pro-DPO-100K-v0.1), and [HelpSteer2](https://huggingface.co/datasets/nvidia/HelpSteer2), excluding code and math data.
- **Mathematics**: Selected from [OpenMathInstruct-2](https://huggingface.co/datasets/nvidia/OpenMathInstruct-2), with nearly 52,000 unique samples.
- **Coding**: Curated from [leetcode](https://huggingface.co/datasets/greengerong/leetcode) and [self-oss-instruct-sc2-exec-filter-50k](https://huggingface.co/datasets/bigcode/self-oss-instruct-sc2-exec-filter-50k), retaining prompts with test cases.
- **Chinese Language**: Integrated [alpaca_gpt4_zh](https://huggingface.co/datasets/llamafactory/alpaca_gpt4_zh) and [Magpie-Qwen2-Pro-200K-Chinese](https://huggingface.co/datasets/Magpie-Align/Magpie-Qwen2-Pro-200K-Chinese), filtering out code and math prompts to retain approximately 10,000 high-quality samples.
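Several of the steps above exclude code and math prompts from general-conversation and Chinese-language data. The exact criteria are not published; as an illustration only, a minimal keyword-heuristic filter (all pattern choices and helper names here are assumptions, not the released pipeline) might look like:

```python
import re

# Hypothetical heuristic: flag prompts that look like coding or math tasks.
# The real pipeline's filtering criteria are not published; these keywords
# are illustrative assumptions only.
CODE_MATH_PATTERN = re.compile(
    r"(```|\bdef |\bimport |\bclass |\bsolve for\b|\bequation\b"
    r"|\bintegral\b|\bderivative\b)",
    re.IGNORECASE,
)

def is_code_or_math(prompt: str) -> bool:
    """Return True if the prompt appears to be a coding or math task."""
    return bool(CODE_MATH_PATTERN.search(prompt))

def filter_general_conversation(samples: list[dict]) -> list[dict]:
    """Keep only samples whose 'prompt' field is neither code nor math."""
    return [s for s in samples if not is_code_or_math(s["prompt"])]

# Toy usage on an in-memory sample list:
samples = [
    {"prompt": "Write a short poem about autumn."},
    {"prompt": "def fib(n): implement this function in Python"},
    {"prompt": "Solve for x: 2x + 3 = 7"},
]
kept = filter_general_conversation(samples)  # only the poem prompt survives
```

In practice the same predicate could be passed to `datasets.Dataset.filter` after loading one of the corpora above, with keyword rules tuned per source.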