metadata
language:
  - is
  - da
  - sv
  - 'no'
  - fo
widget:
  - text: Fina lilla<mask>, jag vill inte bliva stur.
  - text: Nu ved jeg, at du frygter<mask> og end ikke vil nægte mig din eneste søn..
  - text: Það er vorhret á<mask>, napur vindur sem hvín.
  - text: Ja, Gud signi<mask>, mítt land.
  - text: Alle dyrene i<mask>  være venner.
tags:
  - roberta
  - icelandic
  - norwegian
  - faroese
  - danish
  - swedish
  - masked-lm
  - pytorch
license: agpl-3.0
datasets:
  - vesteinn/FC3
  - vesteinn/IC3
  - mideind/icelandic-common-crawl-corpus-IC3
  - NbAiLab/NCC
  - DDSC/partial-danish-gigaword-no-twitter

ScandiBERT

Note: The model was updated on 2022-09-27.

The model was trained on the data shown in the table below. It was trained with a batch size of 8.8k for 72 epochs on 24 V100 GPUs, which took about two weeks.

| Language | Data | Size |
|---|---|---|
| Icelandic | See IceBERT paper | 16 GB |
| Danish | Danish Gigaword Corpus (incl. Twitter) | 4.7 GB |
| Norwegian | NCC corpus | 42 GB |
| Swedish | Swedish Gigaword Corpus | 3.4 GB |
| Faroese | FC3 + Sosialurinn + Bible | 69 MB |
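
To make the imbalance between languages in the table above concrete, the following sketch computes each language's share of the total raw data size. It assumes decimal units (69 MB = 0.069 GB); the names `CORPUS_SIZES_GB` and `corpus_shares` are made up for this illustration and are not part of the model or its training code.

```python
# Corpus sizes as listed in the table above, converted to GB.
# Assumption: decimal units (1 GB = 1000 MB), so 69 MB = 0.069 GB.
CORPUS_SIZES_GB = {
    "Icelandic": 16.0,
    "Danish": 4.7,
    "Norwegian": 42.0,
    "Swedish": 3.4,
    "Faroese": 0.069,
}


def corpus_shares(sizes_gb):
    """Return each language's share of the total raw data size, in percent."""
    total = sum(sizes_gb.values())
    return {lang: 100.0 * size / total for lang, size in sizes_gb.items()}


if __name__ == "__main__":
    total_gb = sum(CORPUS_SIZES_GB.values())
    print(f"Total: {total_gb:.1f} GB")
    shares = corpus_shares(CORPUS_SIZES_GB)
    # Print languages from largest to smallest share.
    for lang, share in sorted(shares.items(), key=lambda kv: -kv[1]):
        print(f"{lang:<10} {share:6.2f} %")
```

Under these assumptions the total is roughly 66 GB, of which Norwegian makes up about 63% and Faroese about 0.1%. These figures describe raw data size only; the text above does not say whether any per-language resampling was applied during training.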

Note: A half-trained model was uploaded here at an earlier date. It has since been removed, and the model has been updated.

This is a Scandinavian BERT model trained on a large collection of Danish, Faroese, Icelandic, Norwegian and Swedish text. At the time of writing, it is the highest-ranking model on the ScandEval leaderboard: https://scandeval.github.io/pretrained/
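
Since this is a masked language model, a quick way to try it is a fill-mask query. The sketch below assumes the Hugging Face `transformers` library is installed and that the model is published under the repository id `vesteinn/ScandiBERT`; that id is inferred from this page rather than stated in the text, so adjust it if needed. The helper names `load_fill_mask`, `top_tokens` and `demo` are made up for this example, and the `<mask>` token follows the widget examples in the metadata.

```python
def load_fill_mask(model_id="vesteinn/ScandiBERT"):
    """Build a fill-mask pipeline. The default model id is an assumption."""
    # Imported lazily so top_tokens() can be used without transformers installed.
    from transformers import pipeline

    return pipeline("fill-mask", model=model_id)


def top_tokens(fill_mask, text, k=3):
    """Return the k highest-scoring (token, score) pairs for the masked position.

    `fill_mask` is any callable that behaves like a transformers fill-mask
    pipeline: it takes the text and `top_k`, and returns a list of dicts
    with at least the keys "token_str" and "score".
    """
    predictions = fill_mask(text, top_k=k)
    return [(p["token_str"].strip(), round(p["score"], 4)) for p in predictions]


def demo():
    """Download the model and print predictions (needs network access)."""
    fill_mask = load_fill_mask()
    # Faroese widget example from the metadata of this model card.
    for token, score in top_tokens(fill_mask, "Ja, Gud signi<mask>, mítt land."):
        print(token, score)
```

Calling `demo()` downloads the model weights on first use and prints the top candidates for the masked position. Any of the other widget sentences can be passed to `top_tokens` in the same way.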

If you find this model useful, please cite:

@inproceedings{snaebjarnarson-etal-2023-transfer,
    title = "{T}ransfer to a Low-Resource Language via Close Relatives: The Case Study on Faroese",
    author = "Snæbjarnarson, Vésteinn  and
      Simonsen, Annika  and
      Glavaš, Goran  and
      Vulić, Ivan",
    booktitle = "Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)",
    month = "may 22--24",
    year = "2023",
    address = "Tórshavn, Faroe Islands",
    publisher = {Link{\"o}ping University Electronic Press, Sweden},
}