AALF committed (verified)
Commit 7251b1e · 1 Parent(s): c77e4a6

Update README.md

Files changed (1):
  1. README.md +1 -1
README.md CHANGED
@@ -37,7 +37,7 @@ Our IMF method follows a three-stage process aimed at effectively transferring c
 Our datasets were designed to enhance model's instruction following, general conversation, mathematics, coding, and Chinese-language capabilities. We selected data from open-source community datasets, applying targeted filtering and preprocessing. Key datasets and filtering criteria included:
 
 - **Instruction Following & General Conversation**: Sourced from [UltraFeedback](https://huggingface.co/datasets/openbmb/UltraFeedback), [Magpie-Pro-DPO-100K-v0.1](https://huggingface.co/datasets/Magpie-Align/Magpie-Pro-DPO-100K-v0.1), and [HelpSteer2](https://huggingface.co/datasets/nvidia/HelpSteer2), excluding code and math data.
-- **Mathematics**: Selected from [OpenMathInstruct-2](https://huggingface.co/datasets/nvidia/OpenMathInstruct-2), with nearly 60,000 unique samples.
+- **Mathematics**: Selected from [OpenMathInstruct-2](https://huggingface.co/datasets/nvidia/OpenMathInstruct-2), with nearly 52,000 unique samples.
 - **Coding**: Curated from [leetcode](https://huggingface.co/datasets/greengerong/leetcode) and [self-oss-instruct-sc2-exec-filter-50k](https://huggingface.co/datasets/bigcode/self-oss-instruct-sc2-exec-filter-50k), retaining prompts with test cases.
 - **Chinese Language**: Integrated [alpaca_gpt4_zh](https://huggingface.co/datasets/llamafactory/alpaca_gpt4_zh) and [Magpie-Qwen2-Pro-200K-Chinese](https://huggingface.co/datasets/Magpie-Align/Magpie-Qwen2-Pro-200K-Chinese), filtering out code and math prompts to retain approximately 10,000 high-quality samples.
 
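The commit above only revises the reported OpenMathInstruct-2 sample count (60,000 → 52,000). For readers who want to reproduce the kind of filtering the README describes (e.g. excluding code and math prompts from the conversation datasets), the sketch below shows one possible approach using the Hugging Face `datasets` library. The keyword heuristics and the `instruction` field name are assumptions made for illustration; the actual filtering criteria used for the release are not part of this commit.

```python
# Illustrative sketch only: approximate "excluding code and math data" with
# simple keyword heuristics. The "instruction" column name and the keyword
# lists are assumptions, not the repo's actual filtering pipeline.
from datasets import load_dataset

CODE_KEYWORDS = ("```", "def ", "class ", "import ", "#include")
MATH_KEYWORDS = ("solve for", "equation", "integral", "\\frac", "prove that")

def looks_like_code_or_math(text: str) -> bool:
    lowered = text.lower()
    return any(k in text for k in CODE_KEYWORDS) or any(k in lowered for k in MATH_KEYWORDS)

# UltraFeedback prompts are assumed to live in an "instruction" column.
ds = load_dataset("openbmb/UltraFeedback", split="train")
filtered = ds.filter(lambda ex: not looks_like_code_or_math(ex["instruction"]))
print(f"Kept {len(filtered)} of {len(ds)} examples after removing code/math prompts.")
```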