davda54 committed on
Commit 1d0829a · verified · 1 Parent(s): e95cc30

Update README.md

Files changed (1): README.md +152 -3

README.md CHANGED
@@ -13,19 +13,45 @@ language:
  base_model:
  - mistralai/Mistral-Nemo-Base-2407
  library_name: transformers
+ pipeline_tag: text-generation
+ tags:
+ - norwegian
+ - sami
+ - bokmaal
+ - nynorsk
  ---

  ![](puffin.png)
 
- NorMistral-11b-warm is a large Norwegian language model initialized from [Mistral-Nemo-Base-2407](https://huggingface.co/mistralai/Mistral-Nemo-Base-2407) and
- continuously pretrained on a total of 260 billion subword tokens -- using a mix of Scandinavian, Sámi, English and code data (four repetitions of open Norwegian texts).
+ **NorMistral-11b-warm** is a large Norwegian language model initialized from [Mistral-Nemo-Base-2407](https://huggingface.co/mistralai/Mistral-Nemo-Base-2407) and
+ continuously pretrained on a total of 250 billion subword tokens -- using a mix of Scandinavian, Sámi, English and code data (four repetitions of open Norwegian texts).
+
+ This model is a part of the NORA.LLM family developed by [the Language Technology Group at the University of Oslo](https://huggingface.co/ltg).
 
  *Disclaimer: This model is pretrained on raw (mostly web-based) textual data. It is not finetuned to follow instructions, and it can generate harmful completions after inappropriate user prompts. It is primarily intended for research purposes.*


  ## License
 
- *Here, we should probably discuss our understanding of the license*
+ We release the model under the Apache 2.0 license to indicate that we do not impose any additional constraints on the model weights.
+ However, we do not own the data in the training collection.
+
+
+ ## Pretraining corpus
+
+ The model is pretrained on a combination of publicly available data and a custom web crawl for Sámi. The total training corpus consists of 250 billion tokens from the following sources:
+
+ 1. Norwegian text (Bokmål and Nynorsk); this collection was created by the National Library of Norway and it's a prerelease of an update of NCC (codenamed "Mímir core"). It consists of: a) the public part of [Norwegian Colossal Corpus (NCC)](https://huggingface.co/datasets/NbAiLab/NCC) with permissible licenses; b) Bokmål and Nynorsk [CulturaX](https://huggingface.co/datasets/uonlp/CulturaX); and c) Bokmål and Nynorsk [HPLT corpus v1.2](https://hplt-project.org/datasets/v1.2).
+
+ 2. Northern Sámi texts are sourced from a) [Glot500](https://huggingface.co/datasets/cis-lmu/Glot500); b) [the SIKOR North Saami free corpus](https://repo.clarino.uib.no/xmlui/handle/11509/100); and c) a custom web crawl (seeded from Sámi Wikipedia external links).
+
+ 3. Additional languages for knowledge/language transfer: a) Danish, Swedish, Icelandic, and Faroese from CulturaX and Glot500; b) high-quality English from [FineWeb-edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu); and c) programming code from [The Stack v2 (the high-quality subset)](https://huggingface.co/datasets/bigcode/the-stack-v2-train-smol-ids).
+
+ The corpus is carefully balanced through strategic upsampling to handle the resource disparity between languages. Following data-constrained scaling laws, the corpus data for target languages is repeated multiple times (up to 16x for low-resource languages) to reach the optimal training budget while avoiding overfitting:
+
+ ![](images/corpus.png)
+
+
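To make the repetition idea concrete, here is a small, purely illustrative sketch of how repetition factors could be derived from a token budget. All source sizes and sampling shares below are hypothetical placeholders, not the actual NORA.LLM data mixture; only the 250-billion-token budget and the 16x repetition cap come from the text above.

```python
# Illustrative only: toy source sizes, not the real NorMistral-11b data mixture.
# Each source receives a share of the token budget, but no source is repeated
# more than MAX_REPEATS times, following data-constrained scaling laws.
BUDGET = 250e9       # total training tokens (from the card)
MAX_REPEATS = 16     # repetition cap mentioned above

unique_tokens = {    # hypothetical unique-token counts per source
    "norwegian": 90e9,
    "northern_sami": 0.3e9,
    "other_scandinavian": 40e9,
    "english": 80e9,
    "code": 60e9,
}
share = {name: 1 / len(unique_tokens) for name in unique_tokens}  # hypothetical shares

for name, size in unique_tokens.items():
    repeats = min(BUDGET * share[name] / size, MAX_REPEATS)
    print(f"{name:>20}: {repeats:5.1f} repetitions = {repeats * size / 1e9:6.1f}B tokens")
```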
  ## Tokenizer

@@ -36,6 +62,125 @@ This model uses a new tokenizer, specially trained on the target languages. Ther
  | Mistral-Nemo-Base-2407 | 131072 | 1.79 | 1.87 | 2.63 | 1.82 | 2.00 |
  | NorMistral-11b-warm | 51200 | 1.22 | 1.28 | 1.82 | 1.33 | 1.39 |
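As a quick sanity check of the tokenizer comparison above, the subword-to-word ratio of any Hugging Face tokenizer can be estimated on your own text. This is a minimal sketch, not the evaluation behind the table (which presumably averages over whole corpora); the sample sentence is only illustrative, and it assumes the tabulated figures are average subword tokens per whitespace-separated word.

```python
# Minimal sketch: estimate subword tokens per word for the two tokenizers
# compared in the table above (single illustrative sentence, not a benchmark).
from transformers import AutoTokenizer

def subwords_per_word(model_id: str, text: str) -> float:
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    n_subwords = len(tokenizer(text, add_special_tokens=False).input_ids)
    return n_subwords / len(text.split())

sample = "En søt lundefugl flyr over de vakre norske fjordene."
for model_id in ("norallm/normistral-11b", "mistralai/Mistral-Nemo-Base-2407"):
    print(model_id, round(subwords_per_word(model_id, sample), 2))
```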
 
+
+
+ ## Evaluation
+
+ More details about the evaluation setup and the new Norwegian benchmarks will be described in upcoming papers.
+
+ ![](images/results.png)
+
+
+
+ ## Model details
+
+ **Model Developers:** Language Technology Group at the University of Oslo in collaboration with NORA.LLM.
+
+ **Architecture:** NorMistral-11B uses the Mistral architecture based on an improved Llama design, featuring:
+ - Pre-normalization with RMSNorm
+ - SwiGLU activation function
+ - Rotary positional embeddings
+ - Grouped-query attention
+ - 40 transformer layers
+ - Hidden dimension: 5,120
+ - Intermediate dimension: 14,336
+ - 32 query heads and 8 key & value heads (dimension 128)
+ - Vocabulary size: 51,200 tokens
+ - Total parameters: 11.4 billion
+
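For orientation, the architecture bullets above can be written out as a Hugging Face `MistralConfig`. This is a sketch, not the configuration file shipped with the model; it assumes a recent `transformers` release where `MistralConfig` accepts `head_dim`, and any field not listed above is left at the library default.

```python
# Sketch: the listed hyper-parameters as a MistralConfig (not the shipped config).
from transformers import MistralConfig

config = MistralConfig(
    vocab_size=51_200,         # new 51,200-token vocabulary
    hidden_size=5_120,         # hidden dimension
    intermediate_size=14_336,  # SwiGLU feed-forward dimension
    num_hidden_layers=40,      # transformer layers
    num_attention_heads=32,    # query heads
    num_key_value_heads=8,     # grouped-query attention
    head_dim=128,              # per-head dimension
)
print(config)
```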
+ **Training Details:**
+ - Training tokens: 250 billion
+ - Batch size: 1,024 × 4,096 tokens
+ - Training steps: 60,000
+ - Peak learning rate: 1e-4
+ - Warm-up steps: 1,000
+ - Learning rate decay steps: 10,000
+ - Optimizer: AdamW (β₁=0.9, β₂=0.95, ε=1e-8)
+ - Weight decay: 0.1
+ - Training precision: bfloat16
+ - Hardware: 256 AMD MI250X GPUs (128 GB)
+ - Training time: 8.5 days
+ - Theoretical computation: 1.7e22 FLOP
+ - Model FLOPs utilization (MFU): 38%
+
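As a rough illustration of the optimizer settings listed above, here is a minimal PyTorch sketch. The AdamW hyper-parameters are copied from the list; the linear warm-up shape is an assumption, since the card does not spell out how the warm-up and decay are combined.

```python
# Sketch of the listed optimizer settings; the schedule shape beyond the
# 1,000 warm-up steps is an assumption, not taken from the card.
import torch

params = [torch.nn.Parameter(torch.zeros(1))]  # placeholder parameters
optimizer = torch.optim.AdamW(
    params,
    lr=1e-4,             # peak learning rate
    betas=(0.9, 0.95),
    eps=1e-8,
    weight_decay=0.1,
)
warmup = torch.optim.lr_scheduler.LambdaLR(
    optimizer,
    lr_lambda=lambda step: min(1.0, (step + 1) / 1_000),  # linear warm-up
)
```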
+ **Unique Features:**
+ - Hybrid masked-causal training (90% causal LM, 10% masked next-token prediction)
+ - Can be used as both a causal generative model and a bidirectional encoder model
+ - Three-stage continual pretraining:
+   1. Tokenizer optimization for target languages
+   2. Embedding weight realignment
+   3. Full model training
+
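The hybrid masked-causal objective in the first bullet can be pictured with a small sketch. This is not the released training code: the mask token id, the corruption rate and the attention handling are assumptions; only the 90/10 split between the two objectives comes from the list above.

```python
# Hypothetical sketch of the hybrid objective: with probability 0.1 a batch is
# turned into masked next-token prediction (some tokens are replaced by a mask
# token and only their targets are scored); otherwise plain causal LM.
# For masked steps the attention mask would also switch from causal to
# bidirectional, which is not shown here.
import torch

MASK_ID = 4          # assumed id of the <mask> token
MASK_PROB = 0.15     # assumed corruption rate
IGNORE_INDEX = -100  # label value ignored by the cross-entropy loss

def make_batch(input_ids: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    labels = input_ids.clone()
    if torch.rand(()) < 0.1:  # masked next-token prediction step
        corrupt = torch.rand_like(input_ids, dtype=torch.float) < MASK_PROB
        labels[~corrupt] = IGNORE_INDEX                    # score masked tokens only
        input_ids = input_ids.masked_fill(corrupt, MASK_ID)
    return input_ids, labels  # the model shifts labels internally as usual
```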
+ **Base Model:** Initialized from Mistral-Nemo-Base-2407
+
+ **License:** Apache-2.0
+
+
+
+ ## Example usage
+
+ ### Basic Causal Language Model Usage
+
+ Here's how to use NorMistral-11B as a standard causal language model for translation:
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForCausalLM
+ import torch
+
+ # Import the tokenizer and model
+ tokenizer = AutoTokenizer.from_pretrained("norallm/normistral-11b")
+ model = AutoModelForCausalLM.from_pretrained("norallm/normistral-11b").cuda().eval()
+
+ # Define zero-shot translation prompt template
+ prompt = """Engelsk: {0}
+ Bokmål:"""
+
+ # Generation function
+ @torch.no_grad()
+ def generate(text):
+     text = prompt.format(text)
+     input_ids = tokenizer(text, return_tensors='pt').input_ids.cuda()
+     prediction = model.generate(
+         input_ids,
+         max_new_tokens=64,
+         do_sample=False,
+         eos_token_id=tokenizer('\n').input_ids
+     )
+     return tokenizer.decode(prediction[0, input_ids.size(1):]).strip()
+
+ # Example usage
+ generate("I'm excited to try this new Norwegian language model!")
+ # > Expected output: 'Jeg er spent på å prøve denne nye norske språkmodellen!'
+ ```
+
+ ### Memory-Efficient Loading
+
+ For systems with limited VRAM, you can load the model in 8-bit or 4-bit quantization:
+
+ ```python
+ import torch
+ from transformers import AutoTokenizer, AutoModelForCausalLM
+
+ tokenizer = AutoTokenizer.from_pretrained("norallm/normistral-11b")
+
+ # Load in 8-bit mode (requires ~12GB VRAM)
+ model = AutoModelForCausalLM.from_pretrained(
+     "norallm/normistral-11b",
+     device_map='auto',
+     load_in_8bit=True,
+     torch_dtype=torch.bfloat16
+ )
+
+ # Or load in 4-bit mode (requires ~8GB VRAM)
+ model = AutoModelForCausalLM.from_pretrained(
+     "norallm/normistral-11b",
+     device_map='auto',
+     load_in_4bit=True,
+     torch_dtype=torch.bfloat16
+ )
+ ```
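Note that recent `transformers` releases prefer an explicit `BitsAndBytesConfig` over the bare `load_in_8bit`/`load_in_4bit` flags; a minimal equivalent sketch (same model id as in the block above) would be:

```python
# Sketch: 4-bit loading via BitsAndBytesConfig instead of the bare flags.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "norallm/normistral-11b",
    device_map="auto",
    quantization_config=quantization_config,
)
```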
+
+
  ## NorMistral-11b is also a bidirectional masked language model

  Having been pretrained on a mixed causal-masked objective, this model knows how to process texts bidirectionally. You can thus finetune this model like any other BERT (or any other prefix language model). The model can also be used directly for masked language modeling:
@@ -70,3 +215,7 @@ predictions = output_logits[0, :, :].argmax(dim=-1)
  # En søt lundefugl flyr over de<mask> norske fjorder. -> En søt lundefugl flyr over de vakre norske fjorder.
  print(f"{tokenizer.decode(input_ids[0, 1:])} -> {tokenizer.decode(predictions[:-1])}")
  ```
+
+ ## Contact
+
+ Please write [a community message](https://huggingface.co/norallm/normistral-11b-warm/discussions) or contact David Samuel ([email protected]) if you have any questions about this model.