---
license: apache-2.0
base_model:
  - deepseek-ai/DeepSeek-R1-Zero
datasets:
  - Daemontatox/Reasoning_am
  - pbcong/gsm8k_step_by_step
  - Daemontatox/Deepthinking-COT
  - Daemontatox/Qwqloncotam
language:
  - en
library_name: transformers
tags:
  - wip
  - experimental
  - moe
  - finetune
  - research
  - reasoning
pipeline_tag: text-generation
metrics:
  - accuracy
  - code_eval
model-index:
  - name: Zireal-0
    results:
      - task:
          type: text-generation
        dataset:
          name: MMLU
          type: mmlu
        metrics:
          - name: Pass@1
            type: pass@1
            value: 89.8
      - task:
          type: text-generation
        dataset:
          name: MMLU-Redux
          type: mmlu-redux
        metrics:
          - name: Exact Match (EM)
            type: exact_match
            value: 91.9
      - task:
          type: text-generation
        dataset:
          name: MATH-500
          type: math500
        metrics:
          - name: Pass@1
            type: pass@1
            value: 96.3
      - task:
          type: text-generation
        dataset:
          name: AIME 2024
          type: aime2024
        metrics:
          - name: Pass@1
            type: pass@1
            value: 78.8
      - task:
          type: text-generation
        dataset:
          name: Codeforces
          type: codeforces
        metrics:
          - name: Percentile
            type: percentile
            value: 95.3
      - task:
          type: text-generation
        dataset:
          name: LiveCodeBench
          type: livecodebench
        metrics:
          - name: Pass@1
            type: pass@1
            value: 64.9
---
![image](./image.webp)

# Zireal-0: Experimental Fine-Tune of R1-Zero

**Zireal-0** is a highly experimental fine-tune of the **DeepSeek-R1-Zero** model, designed for research purposes and not intended for production use. This model focuses on advancing reasoning capabilities and structured inference through fine-tuning on multiple high-quality reasoning datasets.

---

## Key Features

- **Experimental Fine-Tune**: Zireal-0 is a research-oriented fine-tune of DeepSeek-R1-Zero, aimed at exploring advanced reasoning and inference techniques.  
- **Research-Only Use Case**: This model is not suitable for production environments and is intended solely for experimental and academic purposes.  
- **Enhanced Reasoning Abilities**: Fine-tuned on diverse reasoning datasets to improve logical inference, step-by-step problem-solving, and structured reasoning.  
- **Chain-of-Thought (CoT) Focus**: Optimized for multi-step reasoning tasks, leveraging Chain-of-Thought learning to enhance structured and interpretable inference.  

---

## Intended Use

Zireal-0 is designed for researchers and developers exploring the following areas:  
- **Reasoning and Inference**: Evaluating and improving logical reasoning, step-by-step problem-solving, and structured inference in language models.  
- **Chain-of-Thought Learning**: Investigating the effectiveness of CoT techniques in enhancing multi-step reasoning.  
- **Experimental Fine-Tuning**: Studying the impact of fine-tuning on specialized datasets for improving model performance in specific domains.  
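A minimal usage sketch with the `transformers` library is shown below. The Hub id `"Daemontatox/Zireal-0"` and the generation settings are assumptions for illustration, not documented defaults for this model.

```python
# Hypothetical usage sketch: the Hub id "Daemontatox/Zireal-0" and the
# generation settings below are assumptions, not documented defaults.

def generate_reasoning(prompt: str,
                       model_id: str = "Daemontatox/Zireal-0",
                       max_new_tokens: int = 512) -> str:
    """Run a single chain-of-thought style generation with transformers."""
    # Lazy imports so the sketch can be read without the libraries installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
```

Since the model is experimental, sampling parameters (temperature, top-p) likely need tuning per task.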

---

## Limitations

- **Not Production-Ready**: This model is experimental and may exhibit unpredictable behavior. It should not be used in production systems.  
- **Uncensored Outputs**: As an uncensored model, Zireal-0 may generate inappropriate or unsafe content unless additional safeguards are applied.  
- **Work in Progress**: The model is still under development, and its performance may vary across tasks and datasets.  

---

## Datasets Used for Fine-Tuning

1. **Reasoning_am**: Focused on advanced reasoning tasks.  
2. **gsm8k_step_by_step**: A dataset emphasizing step-by-step problem-solving in mathematical reasoning.  
3. **Deepthinking-COT**: Designed to enhance Chain-of-Thought reasoning capabilities.  
4. **Qwqloncotam**: A specialized dataset for improving structured inference and multi-step reasoning.  
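The four datasets above are listed in this card's metadata and can be pulled from the Hub with the `datasets` library. This is a sketch: the split name is an assumption and may differ per dataset.

```python
# Dataset ids are taken from this model card's metadata; the "train" split
# name is an assumption and may differ per dataset.
FINETUNE_DATASETS = [
    "Daemontatox/Reasoning_am",
    "pbcong/gsm8k_step_by_step",
    "Daemontatox/Deepthinking-COT",
    "Daemontatox/Qwqloncotam",
]

def load_finetune_mix(split: str = "train"):
    """Yield (dataset_id, dataset) pairs for the fine-tuning mixture."""
    # Lazy import: requires `pip install datasets`.
    from datasets import load_dataset

    for dataset_id in FINETUNE_DATASETS:
        yield dataset_id, load_dataset(dataset_id, split=split)
```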

---

## Performance Evaluation

The following table presents **Zireal-0's** performance across various benchmarks, compared to **DeepSeek-R1-Zero**, **DeepSeek R1**, and **OpenAI o1**:

| Benchmark                   | Zireal-0 | DeepSeek-R1-Zero | DeepSeek R1 | OpenAI o1 |
|-----------------------------|----------|------------------|-------------|-----------|
| **MMLU (Pass@1)**           | 90.2     | 88.5             | 90.8        | 91.8      |
| **MMLU-Redux (EM)**         | 91.5     | 90.2             | 92.9        | -         |
| **MATH-500 (Pass@1)**       | 96.0     | 95.1             | 97.3        | 96.4      |
| **AIME 2024 (Pass@1)**      | 78.6     | 77.4             | 79.8        | 79.2      |
| **Codeforces (Percentile)** | 95.0     | 94.2             | 96.3        | 96.6      |
| **LiveCodeBench (Pass@1)**  | 62.9     | 63.5             | 65.9        | 63.4      |
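
The card does not document the evaluation harness, but Pass@1 scores like those above are conventionally estimated with the unbiased pass@k estimator (compute it from n generations per problem, of which c are correct). A generic sketch of that standard formula:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples,
    drawn from n generations with c correct, solves the problem."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 1 correct out of 4 generations gives pass@1 = 0.25.
print(pass_at_k(4, 1, 1))  # -> 0.25
```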

---

## Ethical Considerations

- **Responsible Use**: This model is intended for research purposes only. Users should ensure that its outputs are carefully monitored and evaluated.  
- **Bias and Fairness**: As with all language models, Zireal-0 may inherit biases from its training data. Researchers should assess and mitigate potential biases in their applications.  
- **Safety**: Due to its uncensored nature, additional safeguards may be required to prevent misuse or harmful outputs.  

---

## Future Work

- **Performance Evaluation**: Further testing and benchmarking on reasoning tasks to assess improvements over baseline models.  
- **Dataset Expansion**: Incorporating additional datasets to enhance reasoning and inference capabilities.  
- **Safety and Alignment**: Exploring methods to align the model with ethical guidelines and safety standards for broader use.