---
license: apache-2.0
language:
- nb
- nn
- 'no'
- se
- sv
- da
- en
- is
- fo
base_model:
- mistralai/Mistral-Nemo-Base-2407
library_name: transformers
pipeline_tag: text-generation
tags:
- norwegian
- sami
- bokmaal
- nynorsk
---
![](images/puffin_2.png)
**NorMistral-11b-warm** is a large Norwegian language model initialized from [Mistral-Nemo-Base-2407](https://huggingface.co/mistralai/Mistral-Nemo-Base-2407) and
continually pretrained on a total of 250 billion subword tokens – using a mix of Scandinavian, Sámi, English and code data (four repetitions of open Norwegian texts).
This model is a part of the NORA.LLM family developed by [the Language Technology Group at the University of Oslo (LTG)](https://huggingface.co/ltg).
*Disclaimer: This model is pretrained on raw (mostly web-based) textual data. It is not finetuned to follow instructions, and it can generate harmful completions after inappropriate user prompts. It is primarily intended for research purposes.*
## License
We release the model under Apache 2.0 license to indicate that we do not impose any additional constraints on the model weights.
However, we do not own the data in the training collection.
## Pretraining corpus
The model is pretrained on a combination of publicly available data and a custom web crawl for Sámi. The total training corpus consists of 250 billion tokens from the following sources:
1. Norwegian text (Bokmål and Nynorsk); this collection was created by the National Library of Norway and is a prerelease of an updated NCC (codenamed "Mímir core"). It consists of: a) the public part of the [Norwegian Colossal Corpus (NCC)](https://huggingface.co/datasets/NbAiLab/NCC) with permissible licenses (i.e. it does not include newspaper texts with the CC BY-NC 2.0 license); b) Bokmål and Nynorsk [CulturaX](https://huggingface.co/datasets/uonlp/CulturaX); and c) Bokmål and Nynorsk [HPLT corpus v1.2](https://hplt-project.org/datasets/v1.2).
2. Northern Sámi texts are sourced from a) [Glot500](https://huggingface.co/datasets/cis-lmu/Glot500); b) [the SIKOR North Saami free corpus](https://repo.clarino.uib.no/xmlui/handle/11509/100); and c) a custom web crawl (seeded from Sámi Wikipedia external links).
3. Additional languages for knowledge/language transfer: a) Danish, Swedish, Icelandic, and Faroese from CulturaX and Glot500; b) high-quality English from [FineWeb-edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu); and c) programming code from [The Stack v2 (the high-quality subset)](https://huggingface.co/datasets/bigcode/the-stack-v2-train-smol-ids).
The corpus is carefully balanced through strategic upsampling to handle the resource disparity between languages. Following data-constrained scaling laws, the corpus data for target languages is repeated multiple times (up to 16x for low-resource languages) to reach the optimal training budget while avoiding overfitting:
![](images/corpus.png)
## Tokenizer
This model uses a new tokenizer that was specially trained on the target languages; it therefore offers substantially faster inference than the original Mistral-Nemo-Base-2407 model. Here are the subword-to-word split ratios across different languages (lower values mean fewer subwords per word):
| Tokenizer | Vocabulary size | Bokmål | Nynorsk | Sámi | Danish | Swedish |
|:------------|:--------:|:--------:|:---------:|:-------:|:--------:|:---------:|
| Mistral-Nemo-Base-2407 | 131072 | 1.79 | 1.87 | 2.63 | 1.82 | 2.00 |
| NorMistral-11b-warm | 51200 | 1.22 | 1.28 | 1.82 | 1.33 | 1.39 |
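As a rough illustration of what these ratios mean in practice, you can compare the two tokenizers on a sample sentence yourself. The following is a minimal sketch; the exact numbers depend on the text you choose, and downloading the Mistral-Nemo tokenizer may require accepting its license on the Hub:

```python
from transformers import AutoTokenizer

sample = "En søt lundefugl flyr over de vakre norske fjordene."

for name in ["mistralai/Mistral-Nemo-Base-2407", "norallm/normistral-11b-warm"]:
    tokenizer = AutoTokenizer.from_pretrained(name)
    n_subwords = len(tokenizer(sample, add_special_tokens=False).input_ids)
    n_words = len(sample.split())  # rough whitespace-based word count
    print(f"{name}: {n_subwords / n_words:.2f} subwords per word")
```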
## Evaluation
More details about the evaluation setup and the new Norwegian benchmarks will be described in upcoming papers.
![](images/results.png)
## Model details
**Model Developers:** Language Technology Group at the University of Oslo in collaboration with NORA.LLM.
**Architecture:** NorMistral-11B uses the Mistral architecture based on an improved Llama design, featuring:
- Pre-normalization with RMSNorm
- SwiGLU activation function
- Rotary positional embeddings
- Grouped-query attention
- 40 transformer layers
- Hidden dimension: 5,120
- Intermediate dimension: 14,336
- 32 query heads and 8 key & value heads (dimension 128)
- Vocabulary size: 51,200 tokens
- Total parameters: 11.4 billion
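These architecture details can be checked directly against the released configuration; the following is a minimal sketch assuming the checkpoint exposes the standard Mistral configuration fields in `transformers`:

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("norallm/normistral-11b-warm")
print(config.num_hidden_layers)    # 40 transformer layers
print(config.hidden_size)          # 5,120
print(config.intermediate_size)    # 14,336
print(config.num_attention_heads)  # 32 query heads
print(config.num_key_value_heads)  # 8 key & value heads
print(config.vocab_size)           # 51,200 tokens
```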
**Training Details:**
- Training tokens: 250 billion
- Batch size: 1,024 × 4,096 tokens (# sequences × sequence length)
- Training steps: 60,000
- Peak learning rate: 1e-4
- Warm-up steps: 1,000
- Learning rate decay steps: 10,000
- Optimizer: AdamW (β₁=0.9, β₂=0.95, ε=1e-8)
- Weight decay: 0.1
- Training precision: bfloat16
- Hardware: 256 AMD MI250X GPUs (128 GB)
- Training time: 8.5 days
- Theoretical computation: 2.0e22 FLOP
- Model FLOPs utilization (MFU): 38%
**Unique Features:**
- Hybrid masked-causal training (90% causal LM, 10% masked next-token prediction)
- Can be used as both a causal generative model and a bidirectional encoder model
- Three-stage continual pretraining:
1. Tokenizer optimization for target languages
2. Embedding weight realignment
3. Full model training
**Base Model:** Initialized from Mistral-Nemo-Base-2407
**License:** Apache-2.0
## Example usage
### Basic Causal Language Model Usage
Here's how to use NorMistral-11B as a standard causal language model for translation:
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Import the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("norallm/normistral-11b-warm")
model = AutoModelForCausalLM.from_pretrained("norallm/normistral-11b-warm").cuda().eval()

# Define a zero-shot translation prompt template
prompt = """Engelsk: {0}
Bokmål:"""

# Generation function
@torch.no_grad()
def generate(text):
    text = prompt.format(text)
    input_ids = tokenizer(text, return_tensors='pt').input_ids.cuda()
    prediction = model.generate(
        input_ids,
        max_new_tokens=64,
        do_sample=False,
        eos_token_id=tokenizer('\n').input_ids  # stop at the first newline
    )
    return tokenizer.decode(prediction[0, input_ids.size(1):]).strip()

# Example usage
generate("I'm excited to try this new Norwegian language model!")
# > Expected output: 'Jeg er spent på å prøve denne nye norske språkmodellen!'
```
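Passing `eos_token_id=tokenizer('\n').input_ids` stops generation at the first line break, so the function returns a single-line Bokmål translation; the same zero-shot template can be adapted to other directions by changing the language names in the prompt.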
### Memory-Efficient Loading
For systems with limited VRAM, you can load the model with 8-bit or 4-bit quantization:
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("norallm/normistral-11b-warm")

# Load in 8-bit mode (requires ~12 GB of VRAM)
model = AutoModelForCausalLM.from_pretrained(
    "norallm/normistral-11b-warm",
    device_map='auto',
    load_in_8bit=True,
    torch_dtype=torch.bfloat16
)

# Or load in 4-bit mode (requires ~8 GB of VRAM)
model = AutoModelForCausalLM.from_pretrained(
    "norallm/normistral-11b-warm",
    device_map='auto',
    load_in_4bit=True,
    torch_dtype=torch.bfloat16
)
```
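Note that recent versions of `transformers` deprecate the boolean `load_in_8bit`/`load_in_4bit` arguments in favour of an explicit quantization config. A minimal sketch of the equivalent 4-bit setup, assuming `bitsandbytes` is installed:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Equivalent 4-bit loading through an explicit quantization config
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16
)
model = AutoModelForCausalLM.from_pretrained(
    "norallm/normistral-11b-warm",
    device_map='auto',
    quantization_config=quantization_config
)
```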
## NorMistral-11b is also a bidirectional masked language model
Having been pretrained with a mixed causal-masked objective, this model can also process text bidirectionally. You can therefore finetune it like a BERT-style encoder (or any other prefix language model), and it can be used directly for masked language modeling:
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# First, we will have to import the tokenizer and the language model
# (we can use CausalLM instead of MaskedLM just fine)
tokenizer = AutoTokenizer.from_pretrained(
    "norallm/normistral-11b-warm"
)
model = AutoModelForCausalLM.from_pretrained(
    "norallm/normistral-11b-warm"
).cuda().eval()

# A partially-masked input text string
text = "En søt lundefugl flyr over de<mask> norske fjorder."
input_ids = tokenizer(text, return_tensors='pt').input_ids.cuda()

# An all-zero 4D attention mask disables the causal mask and allows unconstrained bidirectional attention
attention_mask = torch.zeros(input_ids.size(0), 1, input_ids.size(1), input_ids.size(1), device=input_ids.device)

output_logits = model(
    input_ids=input_ids,
    attention_mask=attention_mask,
    return_dict=True
).logits
predictions = output_logits[0, :, :].argmax(dim=-1)

# Expected output:
# En søt lundefugl flyr over de<mask> norske fjorder. -> En søt lundefugl flyr over de vakre norske fjorder.
print(f"{tokenizer.decode(input_ids[0, 1:])} -> {tokenizer.decode(predictions[:-1])}")
```
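If you only need the filled-in token, you can read out the prediction at the mask position directly. Continuing from the snippet above, this is a minimal sketch that assumes the tokenizer exposes a `mask_token_id` and relies on the shifted alignment used above (the logits at the position *before* the mask predict the masked token):

```python
# Locate the single <mask> token in the input
mask_position = (input_ids[0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0].item()

# With (masked) next-token prediction, position m is predicted from the logits at position m - 1
filled_id = output_logits[0, mask_position - 1].argmax(dim=-1).item()
print(tokenizer.decode([filled_id]))  # e.g. ' vakre'
```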
## Contact
Please write [a community message](https://huggingface.co/norallm/normistral-11b-warm/discussions) or contact David Samuel ([email protected]) if you have any questions about this model.