---
license: apache-2.0
---
<style>
table {
    border-collapse: collapse;
    width: 100%;
    margin-bottom: 20px;
}
th, td {
    border: 1px solid #ddd;
    padding: 8px;
    text-align: center;
}
.best {
    font-weight: bold;
    text-decoration: underline;
}
</style>

<div style="text-align: center; margin: 20px auto; padding: 20px; border: 3px solid #ddd; border-radius: 10px;">
  <h2 style="margin-bottom: 4px; margin-top: 0px;">OuteAI</h2>
  <a href="https://www.outeai.com/" target="_blank" style="margin-right: 10px;">🌎 OuteAI.com</a> 
  <a href="https://discord.gg/vyBM87kAmf" target="_blank" style="margin-right: 10px;">🤝 Join our Discord</a>
  <a href="https://x.com/OuteAI" target="_blank">𝕏 @OuteAI</a>
</div>

## Introduction
We're excited to introduce our latest model, the Lite Oute 2 Mamba2Attn 250M. <br>
This is our third-generation model, featuring the new Mamba2 architecture with attention layers. <br>
For more technical details covering the training process, architecture, and performance, <a href="https://outeai.com/blog/lite-oute-2-mamba2attn" target="_blank">read the full blog post here</a>.

## Model Variants
- [Lite-Oute-2-Mamba2Attn-250M-Instruct](https://huggingface.co/OuteAI/Lite-Oute-2-Mamba2Attn-250M-Instruct)
- [Lite-Oute-2-Mamba2Attn-250M-Base](https://huggingface.co/OuteAI/Lite-Oute-2-Mamba2Attn-250M-Base)

## Training Details
The model was pre-trained on 30 billion tokens using a balanced mixture of datasets:
- **50% dclm-baseline-1.0**
- **50% fineweb-edu**

Base model training was conducted on single NVIDIA 4090 and NVIDIA H100 GPUs, with the following key parameters:
- **Max learning rate:** 4e-4
- **Min learning rate:** 1e-4
- **Block size:** 4096
- **Token batches:** ~100k tokens
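
As a rough worked example of what these numbers imply, the sketch below derives the approximate number of optimizer steps and the number of sequences per batch from the figures listed above. The cosine decay between the maximum and minimum learning rates is an assumption made purely for illustration (the schedule shape is not stated here), and `cosine_lr` is a hypothetical helper, not part of the training code.

```python
import math

TOTAL_TOKENS = 30_000_000_000   # 30 billion pre-training tokens
TOKENS_PER_BATCH = 100_000      # "~100k tokens" per batch (approximate)
BLOCK_SIZE = 4096               # tokens per sequence
MAX_LR = 4e-4
MIN_LR = 1e-4

# Approximate number of optimizer steps and sequences per batch
total_steps = TOTAL_TOKENS // TOKENS_PER_BATCH
sequences_per_batch = TOKENS_PER_BATCH / BLOCK_SIZE


def cosine_lr(step: int, total: int = total_steps) -> float:
    """Hypothetical cosine decay from MAX_LR to MIN_LR (assumed schedule, no warmup)."""
    progress = min(max(step / total, 0.0), 1.0)
    return MIN_LR + 0.5 * (MAX_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))


print(total_steps)                    # 300000
print(round(sequences_per_batch, 1))  # 24.4
```

With these figures, a run of 30 billion tokens corresponds to roughly 300,000 batches of about 24 sequences each.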

For instruction training, we first applied supervised fine-tuning (SFT) and then further refined the model with Direct Preference Optimization (DPO).

## Benchmark Results
<table>
<tr>
    <th>Benchmark</th>
    <th>Lite-Oute-2-Mamba2Attn-250M-Instruct</th>
    <th>Lite-Oute-1-300M-Instruct</th>
    <th>Lite-Mistral-150M-v2-Instruct</th>
</tr>
<tr>
    <td>ARC-C (0-shot)</td>
    <td class="best">26.71</td>
    <td>26.02</td>
    <td>-</td>
</tr>
<tr>
    <td>ARC-E (0-shot)</td>
    <td class="best">53.70</td>
    <td>49.79</td>
    <td>39.52</td>
</tr>
<tr>
    <td>HellaSwag (0-shot)</td>
    <td class="best">38.19</td>
    <td>34.50</td>
    <td>31.01</td>
</tr>
<tr>
    <td>MMLU (0-shot)</td>
    <td>25.13</td>
    <td>24.00</td>
    <td class="best">25.28</td>
</tr>
<tr>
    <td>OpenBookQA (0-shot)</td>
    <td class="best">32.20</td>
    <td class="best">32.20</td>
    <td>28.40</td>
</tr>
<tr>
    <td>PIQA (0-shot)</td>
    <td class="best">66.59</td>
    <td>65.40</td>
    <td>60.23</td>
</tr>
<tr>
    <td>Winogrande (0-shot)</td>
    <td>53.28</td>
    <td class="best">53.75</td>
    <td>51.78</td>
</tr>

<tr>
    <td>ARC-C (5-shot)</td>
    <td class="best">27.82</td>
    <td>26.37</td>
    <td>-</td>
</tr>
<tr>
    <td>ARC-E (5-shot)</td>
    <td class="best">57.28</td>
    <td>51.43</td>
    <td>46.30</td>
</tr>
<tr>
    <td>HellaSwag (5-shot)</td>
    <td class="best">37.71</td>
    <td>34.93</td>
    <td>30.73</td>
</tr>
<tr>
    <td>MMLU (5-shot)</td>
    <td>25.81</td>
    <td class="best">25.87</td>
    <td>-</td>
</tr>
<tr>
    <td>OpenBookQA (5-shot)</td>
    <td class="best">31.80</td>
    <td>31.40</td>
    <td>28.00</td>
</tr>
<tr>
    <td>PIQA (5-shot)</td>
    <td class="best">66.38</td>
    <td>65.07</td>
    <td>62.68</td>
</tr>
<tr>
    <td>Winogrande (5-shot)</td>
    <td>51.14</td>
    <td class="best">52.01</td>
    <td>51.30</td>
</tr>
</table>
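
To make the comparison with the previous generation easier to read, the following sketch tallies wins, ties, and losses for Lite-Oute-2-Mamba2Attn-250M-Instruct against Lite-Oute-1-300M-Instruct. The scores are copied from the table above; the `tally` helper is only an illustration.

```python
# (benchmark, Lite-Oute-2-Mamba2Attn-250M-Instruct, Lite-Oute-1-300M-Instruct)
SCORES = [
    ("ARC-C (0-shot)", 26.71, 26.02),
    ("ARC-E (0-shot)", 53.70, 49.79),
    ("HellaSwag (0-shot)", 38.19, 34.50),
    ("MMLU (0-shot)", 25.13, 24.00),
    ("OpenBookQA (0-shot)", 32.20, 32.20),
    ("PIQA (0-shot)", 66.59, 65.40),
    ("Winogrande (0-shot)", 53.28, 53.75),
    ("ARC-C (5-shot)", 27.82, 26.37),
    ("ARC-E (5-shot)", 57.28, 51.43),
    ("HellaSwag (5-shot)", 37.71, 34.93),
    ("MMLU (5-shot)", 25.81, 25.87),
    ("OpenBookQA (5-shot)", 31.80, 31.40),
    ("PIQA (5-shot)", 66.38, 65.07),
    ("Winogrande (5-shot)", 51.14, 52.01),
]


def tally(rows):
    """Count rows where the new model scores higher, equal, or lower."""
    wins = sum(1 for _, new, old in rows if new > old)
    ties = sum(1 for _, new, old in rows if new == old)
    losses = sum(1 for _, new, old in rows if new < old)
    return wins, ties, losses


wins, ties, losses = tally(SCORES)
print(f"wins={wins} ties={ties} losses={losses}")  # wins=10 ties=1 losses=3
```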

## Interfacing with the Instruct Model

> [!IMPORTANT]
> To ensure optimal performance, please use the following template when interacting with the model:

```
<s>system\n
{System message}</s>
<s>user\n
{Your question or message}</s>
<s>assistant\n
{The model's response}</s>
```

> [!IMPORTANT]
> When prompting the model for a response, your input should be structured as follows:

```
<s>system\n
{System message}</s>
<s>user\n
{Your question or message}</s>
<s>assistant\n
```

This format ensures the best performance, as the model was fine-tuned using this specific template.
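
If you need to build the prompt string by hand, a minimal sketch could look like the one below. It assumes that `\n` in the template stands for a single newline after the role name and that turns are separated by a newline; `build_prompt` is a hypothetical helper, and the chat template bundled with the tokenizer remains the authoritative reference. The system message is treated as optional here.

```python
from typing import Optional


def build_prompt(user_message: str, system_message: Optional[str] = None) -> str:
    """Assemble a single-turn prompt that ends with the open assistant turn.

    Assumption: one newline after each role name and between turns.
    """
    parts = []
    if system_message is not None:
        parts.append(f"<s>system\n{system_message}</s>")
    parts.append(f"<s>user\n{user_message}</s>")
    # Leave the assistant turn open so the model generates the reply
    parts.append("<s>assistant\n")
    return "\n".join(parts)


prompt = build_prompt(
    'Who wrote the play "Romeo and Juliet"?',
    "You are an AI assistant. You will be given a task. You must generate a detailed answer.",
)
print(prompt)
```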

### Recommended Settings
- **Temperature:** 0.1 - 0.4
- **Repetition Penalty:** 1.1 - 1.14
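
One way to keep sampling settings inside these ranges is a small helper that warns when a value falls outside them. This is only a sketch: `generation_kwargs` is a hypothetical name, and the returned keys are meant to be passed to `model.generate` in Transformers.

```python
import warnings

TEMPERATURE_RANGE = (0.1, 0.4)
REPETITION_PENALTY_RANGE = (1.1, 1.14)


def generation_kwargs(temperature: float = 0.1, repetition_penalty: float = 1.12) -> dict:
    """Return sampling arguments, warning if they leave the recommended ranges."""
    if not TEMPERATURE_RANGE[0] <= temperature <= TEMPERATURE_RANGE[1]:
        warnings.warn(f"temperature {temperature} is outside the recommended range")
    if not REPETITION_PENALTY_RANGE[0] <= repetition_penalty <= REPETITION_PENALTY_RANGE[1]:
        warnings.warn(f"repetition_penalty {repetition_penalty} is outside the recommended range")
    return {
        "do_sample": True,  # temperature only has an effect when sampling
        "temperature": temperature,
        "repetition_penalty": repetition_penalty,
    }


settings = generation_kwargs()
print(settings)
```

The resulting dictionary can then be unpacked into the generation call, for example `model.generate(input_ids, **settings)`.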

If you're using a Hugging Face tokenizer, make sure to use `tokenizer.apply_chat_template` or adjust your interface to match the template above.

## Let's look at some real-world examples

```
<s>user
What are three key benefits of electric vehicles compared to traditional gasoline-powered cars?</s>
<s>assistant
Electric vehicles have several advantages over gasoline-powered cars. Here are three key benefits:
1. Efficient: Electric vehicles use less fuel and emit fewer emissions than gasoline-powered cars, which can save consumers money on their monthly bills.
2. Fuel economy: Electric vehicles produce more power per mile driven, which means they consume less fuel overall. This helps reduce the need for refueling stations and reduces carbon emissions from transportation.
3. Flexibility: Electric vehicles can be adapted to different driving conditions, such as snowy or rainy days, and can accommodate different types of terrain. This makes them easier to operate in urban areas with limited parking spaces.
Overall, electric vehicles offer a range of benefits that make them an attractive choice for many drivers. They provide a cost-effective alternative to gasoline-powered cars, reducing the need for frequent refueling stations, and making it easier to drive long distances without having to worry about running out of gas.</s>
```
    
```
<s>user
Who wrote the play "Romeo and Juliet"?</s>
<s>assistant
The play "Romeo and Juliet" by William Shakespeare is a classic example of a tragic love story. It tells the story of Romeo and Juliet, a young couple who fall in love with each other but ultimately lose their lives due to their respective flaws. The play was written during the Renaissance period and it has been performed around the world since then.</s>
```

## Usage with Hugging Face Transformers
Model weights were converted from the original Mamba2 implementation to be Hugging Face compatible. <br>
Due to the lack of official support for Mamba2 attention layers in Hugging Face Transformers, custom modeling files are included. <br>
The implementation of Mamba2 with attention in the modeling files comes from Pull Request #32027 in the Hugging Face Transformers repository: [https://github.com/huggingface/transformers/pull/32027](https://github.com/huggingface/transformers/pull/32027)

To speed up inference, we recommend installing mamba-ssm and flash attention 2.

mamba-ssm:
```bash
# Quote the requirement so the shell does not treat ">=" as an output redirect
pip install "causal-conv1d>=1.4.0"
pip install mamba-ssm
```

flash attention 2:
```bash
pip install flash-attn --no-build-isolation
```
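
Before enabling the faster code paths, it can be useful to confirm that the optional packages are importable in the current environment. The following is a minimal sketch using only the standard library; `check_optional_packages` is a hypothetical helper, and the import names (`mamba_ssm`, `causal_conv1d`, `flash_attn`) are the module names these packages are assumed to install under.

```python
import importlib.util

# pip package name -> importable module name (assumed)
OPTIONAL_PACKAGES = {
    "mamba-ssm": "mamba_ssm",
    "causal-conv1d": "causal_conv1d",
    "flash-attn": "flash_attn",
}


def check_optional_packages() -> dict:
    """Report which optional acceleration packages can be found, without importing them."""
    return {
        pip_name: importlib.util.find_spec(module_name) is not None
        for pip_name, module_name in OPTIONAL_PACKAGES.items()
    }


for pip_name, available in check_optional_packages().items():
    status = "found" if available else "not installed"
    print(f"{pip_name}: {status}")
```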

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = AutoModelForCausalLM.from_pretrained(
    "OuteAI/Lite-Oute-2-Mamba2Attn-250M-Instruct",
    # To allow custom modeling files
    trust_remote_code=True,

    # If you have installed flash attention 2
    # attn_implementation="flash_attention_2",
    # torch_dtype=torch.bfloat16,
)
model.to(device)
tokenizer = AutoTokenizer.from_pretrained("OuteAI/Lite-Oute-2-Mamba2Attn-250M-Instruct")

def generate_response(message: str, temperature: float = 0.1, repetition_penalty: float = 1.12) -> str:
    # Apply the chat template and convert to PyTorch tensors
    messages = [
        {"role": "system", "content": "You are an AI assistant. You will be given a task. You must generate a detailed answer."},
        {"role": "user", "content": message}
    ]
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(device)
    # Generate the response
    output = model.generate(
        input_ids,
        max_length=256,
        temperature=temperature,
        repetition_penalty=repetition_penalty,
        do_sample=True
    ) 
    # Decode the generated output
    generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
    return generated_text
message = "Who wrote the play Romeo and Juliet?"
response = generate_response(message)
print(response)
```

## Fine-Tuning for Specific Tasks
For optimal task-specific performance, it is recommended to fine-tune the model on datasets relevant to the intended use case. <br>
Fine-tuning can significantly improve the model's accuracy, relevance, and depth of knowledge in specialized domains or for particular types of tasks. <br>
This process allows the model to adapt its pre-trained knowledge to the nuances and requirements of specific applications, potentially mitigating some of the weaknesses observed in general use.

## Conclusion
The Lite-Oute-2-Mamba2Attn-250M-Instruct model shows significant improvements over previous versions, particularly in zero-shot and few-shot learning tasks. <br>
Despite being smaller than Lite-Oute-1-300M-Instruct, it scores higher on most of the benchmarks above, suggesting better common-sense reasoning, language understanding, and general knowledge. <br>
While it can provide coherent responses to various queries, it has limitations due to its small size (250 million parameters). <br>
Users should be aware that it may produce incorrect outputs and should not be relied upon for critical decisions without verification. <br>
It's best suited for lightweight applications or as a basis for further development.

## Disclaimer
By using this model, you acknowledge that you understand and assume the risks associated with its use. 
You are solely responsible for ensuring compliance with all applicable laws and regulations. 
We disclaim any liability for problems arising from the use of this open-source model, including but not limited to direct, indirect, incidental, consequential, or punitive damages. 
We make no warranties, express or implied, regarding the model's performance, accuracy, or fitness for a particular purpose. Your use of this model is at your own risk, and you agree to hold harmless and indemnify us, our affiliates, and our contributors from any claims, damages, or expenses arising from your use of the model.