ltgoslo commited on
Commit
250be49
1 Parent(s): d762609

Update README.md

Browse files
Files changed (1) hide show
  1. README.md +4 -5
README.md CHANGED
@@ -20,7 +20,8 @@ datasets:
20
 
21
  <img align="center" src="https://huggingface.co/ltg/norbert3-base/resolve/main/norbert.png" width=12.5%>
22
 
23
- NorMistral-7b-warm is a large Norwegian language model initialized from [Mistral-7b-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1) and continuously pretrained on a total of 260 billion tokens (using six repetitions of open Norwegian texts).
 
24
 
25
  This model is a part of the NORA-LLM family developed in collaboration between [the Language Technology Group at the University of Oslo](https://huggingface.co/ltg), [the High Performance Language Technologies (HPLT) project team](https://hplt-project.org/), [the National Library of Norway](https://huggingface.co/NbAiLab), and [the University of Turku](https://huggingface.co/TurkuNLP).
26
  All the models are pre-trained on the same dataset and with the same tokenizer.
@@ -40,11 +41,9 @@ _____
40
  ## Pretraining corpus
41
 
42
  The model is pretrained exclusively on publicly available data. We combine the resources from [the public part of the NCC corpus](https://huggingface.co/datasets/NbAiLab/NCC), from [the cleaned HPLT corpus](https://hplt-project.org/datasets/v1.2), and from [CulturaX](https://huggingface.co/datasets/uonlp/CulturaX).
43
- This resulted in over 34B tokens of Norwegian (Bokmål or Nynorsk) in total.
44
  We also augment the corpus with [Starcoder](https://huggingface.co/datasets/vikp/starcoder_filtered); 20% of the 260B tokens are sampled from this code corpus.
45
- The Norwegian data is repeated six times to get the pretraining budget of 260B tokens, in accordance with findings from [Muennighoff et al. (2023)](https://neurips.cc/virtual/2023/poster/70706).
46
-
47
-
48
 
49
  _____
50
  ## Model details
 
20
 
21
  <img align="center" src="https://huggingface.co/ltg/norbert3-base/resolve/main/norbert.png" width=12.5%>
22
 
23
+ NorMistral-7b-warm is a large Norwegian language model initialized from [Mistral-7b-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1) and
24
+ continuously pretrained on a total of 260 billion subword tokens (using six repetitions of open Norwegian texts).
25
 
26
  This model is a part of the NORA-LLM family developed in collaboration between [the Language Technology Group at the University of Oslo](https://huggingface.co/ltg), [the High Performance Language Technologies (HPLT) project team](https://hplt-project.org/), [the National Library of Norway](https://huggingface.co/NbAiLab), and [the University of Turku](https://huggingface.co/TurkuNLP).
27
  All the models are pre-trained on the same dataset and with the same tokenizer.
 
41
  ## Pretraining corpus
42
 
43
  The model is pretrained exclusively on publicly available data. We combine the resources from [the public part of the NCC corpus](https://huggingface.co/datasets/NbAiLab/NCC), from [the cleaned HPLT corpus](https://hplt-project.org/datasets/v1.2), and from [CulturaX](https://huggingface.co/datasets/uonlp/CulturaX).
44
+ This resulted in over 34B subword tokens of Norwegian (Bokmål or Nynorsk) in total, which amounts to about 26.7B whitespace-separated tokens
45
  We also augment the corpus with [Starcoder](https://huggingface.co/datasets/vikp/starcoder_filtered); 20% of the 260B tokens are sampled from this code corpus.
46
+ The natural language data is repeated six times to get the pretraining budget of 260B tokens, in accordance with findings from [Muennighoff et al. (2023)](https://neurips.cc/virtual/2023/poster/70706).
 
 
47
 
48
  _____
49
  ## Model details