aashish1904 commited on
Commit
3c874b0
·
verified ·
1 Parent(s): f377b71

Upload README.md with huggingface_hub

Browse files
Files changed (1) hide show
  1. README.md +126 -0
README.md ADDED
@@ -0,0 +1,126 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+
2
+ ---
3
+
4
+ license: mit
5
+ library_name: transformers
6
+ datasets:
7
+ - AI-MO/NuminaMath-CoT
8
+ - KbsdJames/Omni-MATH
9
+ - RUC-AIBOX/STILL-3-Preview-RL-Data
10
+ - hendrycks/competition_math
11
+ language:
12
+ - en
13
+ base_model:
14
+ - deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
15
+
16
+ ---
17
+
18
+ [![QuantFactory Banner](https://lh7-rt.googleusercontent.com/docsz/AD_4nXeiuCm7c8lEwEJuRey9kiVZsRn2W-b4pWlu3-X534V3YmVuVc2ZL-NXg2RkzSOOS2JXGHutDuyyNAUtdJI65jGTo8jT9Y99tMi4H4MqL44Uc5QKG77B0d6-JfIkZHFaUA71-RtjyYZWVIhqsNZcx8-OMaA?key=xt3VSDoCbmTY7o-cwwOFwQ)](https://hf.co/QuantFactory)
19
+
20
+
21
+ # QuantFactory/DeepScaleR-1.5B-Preview-GGUF
22
+ This is quantized version of [agentica-org/DeepScaleR-1.5B-Preview](https://huggingface.co/agentica-org/DeepScaleR-1.5B-Preview) created using llama.cpp
23
+
24
+ # Original Model Card
25
+
26
+
27
+ <div align="center">
28
+ <span style="font-family: default; font-size: 1.5em;">DeepScaleR-1.5B-Preview</span>
29
+ <div>
30
+ 🚀 Democratizing Reinforcement Learning for LLMs 🌟
31
+ </div>
32
+ </div>
33
+ <br>
34
+ <div align="center" style="line-height: 1;">
35
+ <a href="https://github.com/agentica-project/deepscaler" style="margin: 2px;">
36
+ <img alt="Code" src="https://img.shields.io/badge/DeepScaleR-000000?style=for-the-badge&logo=github&logoColor=000&logoColor=white" style="display: inline-block; vertical-align: middle;"/>
37
+ </a>
38
+ <a href="https://pretty-radio-b75.notion.site/DeepScaleR-Surpassing-O1-Preview-with-a-1-5B-Model-by-Scaling-RL-19681902c1468005bed8ca303013a4e2" target="_blank" style="margin: 2px;">
39
+ <img alt="Blog" src="https://img.shields.io/badge/Notion-%23000000.svg?style=for-the-badge&logo=notion&logoColor=white" style="display: inline-block; vertical-align: middle;"/>
40
+ </a>
41
+ <a href="https://x.com/Agentica_/status/1889006266661617779" style="margin: 2px;">
42
+ <img alt="X.ai" src="https://img.shields.io/badge/Agentica-white?style=for-the-badge&logo=X&logoColor=000&color=000&labelColor=white" style="display: inline-block; vertical-align: middle;"/>
43
+ </a>
44
+ <a href="https://huggingface.co/agentica-org" style="margin: 2px;">
45
+ <img alt="Hugging Face" src="https://img.shields.io/badge/Agentica-fcd022?style=for-the-badge&logo=huggingface&logoColor=000&labelColor" style="display: inline-block; vertical-align: middle;"/>
46
+ </a>
47
+ </div>
48
+ </div>
49
+ </div>
50
+
51
+ ## DeepScaleR Overview
52
+ DeepScaleR-1.5B-Preview is a language model fine-tuned from DeepSeek-R1-Distilled-Qwen-1.5B using distributed reinforcement learning (RL) to scale up to long context lengths. The model achieves 43.1% Pass@1 accuracy on AIME 2024, representing a 15% improvement over the base model (28.8%) and surpassing OpenAI's O1-Preview performance with just 1.5B parameters.
53
+
54
+ ## Data
55
+ Our training dataset consists of approximately 40,000 unique problem-answer pairs compiled from:
56
+ - AIME problems (1984-2023)
57
+ - AMC problems (prior to 2023)
58
+ - Omni-MATH dataset
59
+ - Still dataset
60
+
61
+ ## Training Recipe
62
+ We employ Deepseek's Group Relative Policy Optimization (GRPO), a simplified RL algorithm that extends PPO by:
63
+ - Normalizing advantage function over all samples generated from the same prompt.
64
+ - Applying KL divergence regularization on top of PPO's surrogate loss to prevent significant policy drift.
65
+
66
+ **Reward Function**: Our reward function is simple but effective:
67
+ - 1 for correct answers passing LaTeX/Sympy checks
68
+ - 0 for incorrect or improperly formatted answers
69
+ - Note: No partial rewards (such as PRMs) or intermediate feedback.
70
+
71
+ **Iterative Context Lengthening**: A key challenge in scaling RL for reasoning is compute cost. Our approach trains models with progressively longer contexts as the model improves, thus saving monetary costs and end2end training time:
72
+ - Initial 8K Context (0-1040 steps):
73
+ - 22.9% -> 33% Pass@1 on AIME 2024
74
+ - Trained on 8 A100-80GB GPUs, BS= (Prompts) * (Samples/Prompt) = 128 * 8 = 1024
75
+ - Extended to 16K (steps 1040-1520):
76
+ - 33% -> 43% Pass@1 on AIME 2024
77
+ - Trained on 32 A100-80GB GPUs, BS= (Prompts) * (Samples/Prompt) = 128 * 16 = 2048
78
+ - Further extended to 24K (step 1520+):
79
+ - 38% -> 43% Pass@1 on AIME 2024
80
+ - Trained on 32 A100-80GB GPUs, BS= (Prompts) * (Samples/Prompt) = 128 * 16 = 2048
81
+ - Significant improvements within <200 steps
82
+
83
+ A more detailed description of the training recipe can be found in our [blog post](https://pretty-radio-b75.notion.site/DeepScaleR-Surpassing-O1-Preview-with-a-1-5B-Model-by-Scaling-RL-19681902c1468005bed8ca303013a4e2).
84
+
85
+ ## Evaluation
86
+ We report Pass@1 accuracy averaged over 16 samples for each problem.
87
+ | Model | AIME 2024 | MATH 500 | AMC 2023 | Minerva Math | OlympiadBench | Avg. |
88
+ |-------|-----------|-----------|-----------|--------------|---------------|------|
89
+ | 2.5-7B-Instruct | 13.3 | 79.8 | 50.6 | 34.6 | 40.7 | 43.8 |
90
+ | rStar-Math-7B | 26.7 | 78.4 | 47.5 | - | 47.1 | - |
91
+ | Eurus-2-7B-PRIME | 26.7 | 79.2 | 57.8 | 38.6 | 42.1 | 48.9 |
92
+ | Qwen2.5-7B-SimpleRL | 26.7 | 82.4 | 62.5 | <strong>39.7</strong> | 43.3 | 50.9 |
93
+ | DeepSeek-R1-Distill-Qwen-1.5B | 28.8 | 82.8 | 62.9 | 26.5 | 43.3 | 48.9 |
94
+ | Still-1.5B | 32.5 | 84.4 | 66.7 | 29.0 | 45.4 | 51.6 |
95
+ | <strong>DeepScaleR-1.5B-Preview</strong> | <strong>43.1</strong> | <strong>87.8</strong> | <strong>73.6</strong> | 30.2 | <strong>50.0</strong> | <strong>57.0</strong> |
96
+ | O1-Preview | 40.0 | 81.4 | - | - | - | - |
97
+
98
+ ## Serving DeepScaleR
99
+ Our model can be served using popular high-performance inference systems:
100
+ - vLLM
101
+ - Hugging Face Text Generation Inference (TGI)
102
+ - SGLang
103
+ - TensorRT-LLM
104
+
105
+ All these systems support the OpenAI Chat Completions API format.
106
+
107
+ ## License
108
+ This project is released under the MIT License, reflecting our commitment to open and accessible AI development.
109
+ We believe in democratizing AI technology by making our work freely available for anyone to use, modify, and build upon.
110
+ This permissive license ensures that researchers, developers, and enthusiasts worldwide can leverage and extend our work without restrictions, fostering innovation and collaboration in the AI community.
111
+
112
+ ## Acknowledgement
113
+ - Our training experiments are powered by our heavily modified fork of [Verl](https://github.com/agentica-project/verl), an open-source RLHF library.
114
+ - Our model is trained on top of [`DeepSeek-R1-Distill-Qwen-1.5B`](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B).
115
+ - Our work is done as part of [Berkeley Sky Computing Lab](https://skycomputing.berkeley.edu/) and [Berkeley AI Research](https://bair.berkeley.edu/).
116
+
117
+ ## Citation
118
+ ```bibtex
119
+ @misc{deepscaler2025,
120
+ title={DeepScaleR: Surpassing O1-Preview with a 1.5B Model by Scaling RL},
121
+ author={Michael Luo and Sijun Tan and Justin Wong and Xiaoxiang Shi and William Tang and Manan Roongta and Colin Cai and Jeffrey Luo and Tianjun Zhang and Erran Li and Raluca Ada Popa and Ion Stoica},
122
+ year={2025},
123
+ howpublished={\url{https://pretty-radio-b75.notion.site/DeepScaleR-Surpassing-O1-Preview-with-a-1-5B-Model-by-Scaling-RL-19681902c1468005bed8ca303013a4e2}},
124
+ note={Notion Blog}
125
+ year={2025}
126
+ }