xinlai committed on
Commit f8c9733 · verified · 1 Parent(s): 0134ded

Update README.md

Files changed (1)
  1. README.md +3 -1
README.md CHANGED
@@ -6,7 +6,9 @@ license: apache-2.0

  🖥️[Code](https://github.com/dvlab-research/Step-DPO) | 🤗[Data](https://huggingface.co/datasets/xinlai/Math-Step-DPO-10K) | 📄[Paper](https://arxiv.org/pdf/2406.18629)

- This repo contains the **DeepSeekMath-RL-Step-DPO** model for our paper **Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs**. **Step-DPO** is a simple, effective, and data-efficient method for boosting the mathematical reasoning ability of LLMs. Notably, Step-DPO, when applied to Qwen2-72B-Instruct, achieves scores of **70.8%** and **94.0%** on the test sets of **MATH** and **GSM8K**, respectively, without bells and whistles, surpassing a series of closed-source models, including GPT-4-1106, Claude-3-Opus, and Gemini-1.5-Pro.
+ This repo contains the **DeepSeekMath-RL-Step-DPO** model. It is obtained by performing **Step-DPO** on [**DeepSeekMath-RL**](https://huggingface.co/deepseek-ai/deepseek-math-7b-rl).
+
+ **Step-DPO** is a simple, effective, and data-efficient method for boosting the mathematical reasoning ability of LLMs. Notably, Step-DPO, when applied to Qwen2-72B-Instruct, achieves scores of **70.8%** and **94.0%** on the test sets of **MATH** and **GSM8K**, respectively, without bells and whistles, surpassing a series of closed-source models, including GPT-4-1106, Claude-3-Opus, and Gemini-1.5-Pro.

  ## Contact