---
language:
- is
- da
- sv
- 'no'
- fo
widget:
- text: Fina lilla<mask>, jag vill inte bliva stur.
- text: Nu ved jeg, at du frygter<mask> og end ikke vil nægte mig din eneste søn..
- text: Það er vorhret á<mask>, napur vindur sem hvín.
- text: Ja, Gud signi<mask>, mítt land.
- text: Alle dyrene i<mask> må være venner.
tags:
- roberta
- icelandic
- norwegian
- faroese
- danish
- swedish
- masked-lm
- pytorch
license: agpl-3.0
---
# ScandiBERT
Note: The model was updated on 2022-09-27.

The model was trained on the data shown in the table below. The batch size was 8.8k, and the model was trained for 72 epochs on 24 V100 GPUs for about two weeks.
| Language  | Data                                   | Size   |
|-----------|----------------------------------------|--------|
| Icelandic | See the IceBERT paper                  | 16 GB  |
| Danish    | Danish Gigaword Corpus (incl. Twitter) | 4.7 GB |
| Norwegian | NCC corpus                             | 42 GB  |
| Swedish   | Swedish Gigaword Corpus                | 3.4 GB |
| Faroese   | FC3 + Sosialurinn + Bible              | 69 MB  |
Note: An earlier, half-trained model was briefly published here; it has since been removed and replaced with the updated model.
This is a Scandinavian BERT model trained on a large collection of Danish, Faroese, Icelandic, Norwegian, and Swedish text. It is currently the highest-ranking model on the [ScandEval leaderboard](https://scandeval.github.io/pretrained/).
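
To try the model, the `fill-mask` pipeline from the `transformers` library can be used. Below is a minimal sketch; the model id `vesteinn/ScandiBERT` is an assumption, so use the id shown at the top of this model page:

```python
from transformers import pipeline

# Model id is assumed -- replace with the id shown on this model page.
fill_mask = pipeline("fill-mask", model="vesteinn/ScandiBERT")

# The model uses RoBERTa-style masking, so the mask token is <mask>.
# Any of the widget sentences above works as input, e.g. the Faroese one:
for prediction in fill_mask("Ja, Gud signi <mask>, mítt land."):
    print(f"{prediction['token_str']!r}  score={prediction['score']:.3f}")
```

The pipeline returns the top candidate fills for the masked token along with their scores; the Danish, Icelandic, Norwegian, and Swedish widget sentences work the same way.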