---
license: apache-2.0
language:
- nb
- nn
- 'no'
- se
- sv
- da
- en
- is
- fo
base_model:
- mistralai/Mistral-Nemo-Base-2407
library_name: transformers
---

![](puffin.png)

NorMistral-11b-warm is a large Norwegian language model initialized from [Mistral-Nemo-Base-2407](https://huggingface.co/mistralai/Mistral-Nemo-Base-2407) and continually pretrained on a total of 260 billion subword tokens, using a mix of Scandinavian, Sámi, English, and code data (with four repetitions of the open Norwegian texts).

*Disclaimer: This model is pretrained on raw (mostly web-based) textual data. It is not finetuned to follow instructions, and it can generate harmful completions after inappropriate user prompts. It is primarily intended for research purposes.*

## License

*Here, we should probably discuss our understanding of the license.*

## Tokenizer

This model uses a new tokenizer, specially trained on the target languages. It splits text into fewer subword tokens than the original Mistral-Nemo-Base-2407 tokenizer, which makes inference substantially faster. Here are the subword-to-word split ratios across different languages:

| Tokenizer | Vocabulary size | Bokmål | Nynorsk | Sámi | Danish | Swedish |
|:-----------------------|:--------|:-------|:--------|:-----|:-------|:--------|
| Mistral-Nemo-Base-2407 | 131,072 | 1.79 | 1.87 | 2.63 | 1.82 | 2.00 |
| NorMistral-11b-warm | 51,200 | 1.22 | 1.28 | 1.82 | 1.33 | 1.39 |
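The split ratios above are presumably the average number of subword tokens per whitespace-separated word on some evaluation corpus. A minimal sketch of how such a ratio can be computed (the `tokenize` argument is a stand-in for any tokenizer's split function, e.g. `tokenizer.tokenize` from `transformers`; the toy character-bigram tokenizer below is only for illustration):

```python
def split_ratio(texts, tokenize):
    """Average number of subword tokens per whitespace-separated word."""
    n_tokens = sum(len(tokenize(t)) for t in texts)
    n_words = sum(len(t.split()) for t in texts)
    return n_tokens / n_words

# Toy "tokenizer" that splits every word into character bigrams,
# just to show the metric in action:
def toy_tokenize(text):
    return [w[i:i + 2] for w in text.split() for i in range(0, len(w), 2)]

print(split_ratio(["norsk språkmodell"], toy_tokenize))  # 9 tokens / 2 words = 4.5
```

With a real tokenizer, lower values mean fewer tokens per text, and hence shorter sequences and faster generation.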