---
license: apache-2.0
language:
- nb
- nn
- 'no'
- se
- sv
- da
- en
- is
- fo
base_model:
- mistralai/Mistral-Nemo-Base-2407
library_name: transformers
---

![](puffin.png)

NorMistral-11b-warm is a large Norwegian language model initialized from [Mistral-Nemo-Base-2407](https://huggingface.co/mistralai/Mistral-Nemo-Base-2407) and continuously pretrained on a total of 260 billion subword tokens, using a mix of Scandinavian, Sámi, English, and code data (with four repetitions of the open Norwegian texts).

*Disclaimer: This model is pretrained on raw (mostly web-based) textual data. It is not finetuned to follow instructions, and it can generate harmful completions when given inappropriate prompts. It is primarily intended for research purposes.*

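Since this is a plain causal language model, it works with the standard `transformers` generation API. The snippet below is a minimal, illustrative sketch; the prompt, precision, and generation settings are assumptions, not taken from the original evaluation setup:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the tokenizer and model; bfloat16 keeps the 11B model within a single modern GPU
tokenizer = AutoTokenizer.from_pretrained("norallm/normistral-11b-warm")
model = AutoModelForCausalLM.from_pretrained(
    "norallm/normistral-11b-warm",
    torch_dtype=torch.bfloat16,
).cuda().eval()

# An illustrative Norwegian prompt ("Oslo is the capital of")
prompt = "Oslo er hovedstaden i"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.cuda()

with torch.inference_mode():
    output = model.generate(input_ids, max_new_tokens=16, do_sample=False)

print(tokenizer.decode(output[0], skip_special_tokens=True))
```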

## License

This model is released under the [Apache 2.0 license](https://www.apache.org/licenses/LICENSE-2.0).

## Tokenizer

This model uses a new tokenizer that was specially trained on the target languages. Because it splits words into fewer subwords, it also offers substantially faster inference than the original Mistral-Nemo-Base-2407 model. The table below shows the subword-to-word split ratios (the average number of subword tokens per word) across different languages:

| Tokenizer  | Vocabulary size | Bokmål | Nynorsk | Sámi  | Danish | Swedish |
|:------------|:--------:|:--------:|:---------:|:-------:|:--------:|:---------:|
| Mistral-Nemo-Base-2407    | 131072 | 1.79   | 1.87    | 2.63  | 1.82   | 2.00    |
| NorMistral-11b-warm | 51200 | 1.22   | 1.28    | 1.82  | 1.33   | 1.39    |

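The ratios can be reproduced approximately by tokenizing a text and dividing the number of subword tokens by the number of whitespace-separated words. The sketch below is illustrative only; the sample sentence and the exact word-splitting convention are assumptions, so the result will not match the table exactly:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("norallm/normistral-11b-warm")

# An illustrative Bokmål sentence; a large corpus sample gives more stable numbers
text = "En søt lundefugl flyr over de vakre norske fjordene."

n_subwords = len(tokenizer(text, add_special_tokens=False).input_ids)
n_words = len(text.split())

print(f"{n_subwords / n_words:.2f} subword tokens per word")
```
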
## NorMistral-11b is also a bidirectional masked language model

Having been pretrained with a mixed causal and masked objective, this model is able to process text bidirectionally. You can therefore finetune it like a BERT-style encoder (or like any other prefix language model), and it can also be used directly for masked language modeling:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# First, load the tokenizer and the language model;
# the CausalLM class works fine here, there is no need for a separate MaskedLM class
tokenizer = AutoTokenizer.from_pretrained(
    "norallm/normistral-11b-warm"
)
model = AutoModelForCausalLM.from_pretrained(
    "norallm/normistral-11b-warm"
).cuda().eval()

# A partially-masked input text string
text = "En søt lundefugl flyr over de<mask>norske fjorder."
input_ids = tokenizer(text, return_tensors='pt').input_ids.cuda()

# An all-zero (additive) attention mask allows unconstrained bidirectional attention
attention_mask = torch.zeros(input_ids.size(0), 1, input_ids.size(1), input_ids.size(1), device=input_ids.device)

output_logits = model(
    input_ids=input_ids,
    attention_mask=attention_mask,
    return_dict=True
).logits
predictions = output_logits[0, :, :].argmax(dim=-1)

# Expected output:
# En søt lundefugl flyr over de<mask> norske fjorder. -> En søt lundefugl flyr over de vakre norske fjorder.
print(f"{tokenizer.decode(input_ids[0, 1:])} -> {tokenizer.decode(predictions[:-1])}")
```
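If you only need the prediction for the masked position, you can read it off directly from `predictions`. Keep in mind that a causal language model's output at position *i* predicts token *i + 1*, so the fill-in for `<mask>` sits one position earlier. The continuation below is a sketch that assumes the tokenizer registers `<mask>` as a single special token:

```python
# Continuation of the snippet above: decode only the token predicted for <mask>.
# Assumes "<mask>" is a single token in this tokenizer's vocabulary.
mask_token_id = tokenizer.convert_tokens_to_ids("<mask>")
mask_position = (input_ids[0] == mask_token_id).nonzero(as_tuple=True)[0].item()

# The logits at position i predict token i + 1, so look one step back
predicted_id = predictions[mask_position - 1].item()
print(tokenizer.decode([predicted_id]))  # expected to print something like " vakre"
```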