---
license: apache-2.0
language:
  - nb
  - nn
  - 'no'
  - se
  - sv
  - da
  - en
  - is
  - fo
base_model:
  - mistralai/Mistral-Nemo-Base-2407
library_name: transformers
---

# NorMistral-11b-warm

NorMistral-11b-warm is a large Norwegian language model initialized from Mistral-Nemo-Base-2407 and continuously pretrained on a total of 260 billion subword tokens, using a mix of Scandinavian, Sámi, English, and code data (with four repetitions of the open Norwegian texts).

**Disclaimer:** This model is pretrained on raw (mostly web-based) textual data. It is not finetuned to follow instructions, and it can generate harmful completions when given inappropriate prompts. It is primarily intended for research purposes.
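Since the card targets the Transformers library (`library_name: transformers`), the snippet below is a minimal generation sketch. The Hub repository id used here is an assumption and may differ from the model's actual upload path:

```python
# Minimal generation sketch. The Hub repository id below is an assumption;
# replace it with the actual path of this model on the Hugging Face Hub.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "norallm/normistral-11b-warm"  # assumed repo id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # halves memory vs. float32
    device_map="auto",           # place weights on available GPU(s)/CPU
)

# This is a raw (non-instruction-tuned) language model,
# so prompt it with text to be continued.
prompt = "Oslo er hovudstaden i"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```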

## License

The model is released under the Apache 2.0 license.

## Tokenizer

This model uses a new tokenizer that was specially trained on the target languages. It therefore offers substantially faster inference than the original Mistral-Nemo-Base-2407 model. Here are the subword-to-word split ratios across different languages (lower ratios mean fewer tokens per word, and thus faster inference):

| Tokenizer | Vocabulary size | Bokmål | Nynorsk | Sámi | Danish | Swedish |
|:----------|----------------:|-------:|--------:|-----:|-------:|--------:|
| Mistral-Nemo-Base-2407 | 131,072 | 1.79 | 1.87 | 2.63 | 1.82 | 2.00 |
| NorMistral-11b-warm | 51,200 | 1.22 | 1.28 | 1.82 | 1.33 | 1.39 |
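The exact evaluation corpora behind these ratios are not given here, but a subword-to-word split ratio is simply the subword count divided by the word count of a text. The sketch below illustrates one way to approximate it; the repository ids and the naive whitespace word counting are assumptions, so the numbers will only roughly match the table above:

```python
# Illustrative sketch: estimate a tokenizer's subword-to-word split ratio.
# Repo ids are assumed, and words are counted naively by whitespace splitting.
from transformers import AutoTokenizer

def split_ratio(tokenizer_id: str, text: str) -> float:
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_id)
    n_subwords = len(tokenizer(text, add_special_tokens=False)["input_ids"])
    n_words = len(text.split())
    return n_subwords / n_words

sample = "Dette er eit kort tekstutdrag på nynorsk for å samanlikne tokeniserarar."
for tokenizer_id in (
    "mistralai/Mistral-Nemo-Base-2407",
    "norallm/normistral-11b-warm",  # assumed repo id
):
    print(f"{tokenizer_id}: {split_ratio(tokenizer_id, sample):.2f}")
```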