---
license: apache-2.0
language:
- nb
- nn
- 'no'
- se
- sv
- da
- en
- is
- fo
base_model:
- mistralai/Mistral-Nemo-Base-2407
library_name: transformers
---

![](puffin.png)

NorMistral-11b-warm is a large Norwegian language model initialized from [Mistral-Nemo-Base-2407](https://huggingface.co/mistralai/Mistral-Nemo-Base-2407) and 
continually pretrained on a total of 260 billion subword tokens, using a mix of Scandinavian, Sámi, English and code data (with four repetitions of the open Norwegian texts).

*Disclaimer: This model is pretrained on raw (mostly web-based) textual data. It is not finetuned to follow instructions, and it can generate harmful completions after inappropriate user prompts. It is primarily intended for research purposes.*
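The model can be loaded with the standard `transformers` causal-LM interface. The sketch below is a minimal example; the repository ID `norallm/normistral-11b-warm` and the generation settings are assumptions, and since this is a base model, it should be prompted with text to complete rather than with instructions.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Assumed repository ID; adjust to the actual location of the model.
model_id = "norallm/normistral-11b-warm"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # half precision to fit the 11B model on one GPU
    device_map="auto",
)

# Plain causal completion: give the model a text prefix to continue.
prompt = "Hovedstaden i Norge er"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```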


## License

The model is released under the Apache 2.0 license.

## Tokenizer

This model uses a new tokenizer, specially trained on the target languages. Because it splits words into substantially fewer subwords, it also offers substantially faster inference than the original Mistral-Nemo-Base-2407 model. Here are the subword-to-word split ratios across different languages (lower means fewer tokens per word):

| Tokenizer  | Vocabulary size | Bokmål | Nynorsk | Sámi  | Danish | Swedish |
|------------|--------|--------|---------|-------|--------|---------|
| Mistral-Nemo-Base-2407    | 131072 | 1.79   | 1.87    | 2.63  | 1.82   | 2.00    |
| NorMistral-11b-warm | 51200 | 1.22   | 1.28    | 1.82  | 1.33   | 1.39    |
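
The split ratio is simply the number of subword tokens divided by the number of words. A minimal sketch of how such a ratio can be measured, assuming whitespace-based word counting and an illustrative sample sentence (real measurements should use a large corpus):

```python
from transformers import AutoTokenizer

def subword_to_word_ratio(tokenizer, text: str) -> float:
    """Average number of subword tokens per whitespace-separated word."""
    n_tokens = len(tokenizer.tokenize(text))
    n_words = len(text.split())
    return n_tokens / n_words

# Illustrative comparison on a Norwegian Bokmål sentence; the NorMistral
# repository ID is an assumption.
text = "Språkmodellen er trent på norsk, samisk, engelsk og kode."
for model_id in ["mistralai/Mistral-Nemo-Base-2407", "norallm/normistral-11b-warm"]:
    tok = AutoTokenizer.from_pretrained(model_id)
    print(f"{model_id}: {subword_to_word_ratio(tok, text):.2f}")
```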