IceBERT

IceBERT was trained with fairseq using the RoBERTa-base architecture. The training data is listed in the table below.

| Dataset | Size | Tokens |
| --- | --- | --- |
| Icelandic Gigaword Corpus v20.05 (IGC) | 8.2 GB | 1,388M |
| Icelandic Common Crawl Corpus (IC3) | 4.9 GB | 824M |
| Greynir News articles | 456 MB | 76M |
| Icelandic Sagas | 9 MB | 1.7M |
| Open Icelandic e-books (Rafbókavefurinn) | 14 MB | 2.6M |
| Data from the medical library of Landspitali | 33 MB | 5.2M |
| Student theses from Icelandic universities (Skemman) | 2.2 GB | 367M |
| Total | 15.8 GB | 2,664M |
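To make the composition of the training data easier to see, the short sketch below recomputes the token total and each dataset's share of it. The token counts are copied from the table; the names `TOKENS_M`, `REPORTED_TOTAL_M`, and `token_shares` are illustrative and are not part of any IceBERT tooling.

```python
# Token counts in millions, copied from the table above.
TOKENS_M = {
    "Icelandic Gigaword Corpus v20.05 (IGC)": 1388,
    "Icelandic Common Crawl Corpus (IC3)": 824,
    "Greynir News articles": 76,
    "Icelandic Sagas": 1.7,
    "Open Icelandic e-books (Rafbókavefurinn)": 2.6,
    "Data from the medical library of Landspitali": 5.2,
    "Student theses from Icelandic universities (Skemman)": 367,
}

# Total reported in the table, also in millions of tokens.
REPORTED_TOTAL_M = 2664


def token_shares(tokens):
    """Return each dataset's fraction of the summed token count."""
    total = sum(tokens.values())
    return {name: count / total for name, count in tokens.items()}


total_m = sum(TOKENS_M.values())
shares = token_shares(TOKENS_M)

print(f"Summed tokens: {total_m:.1f}M (table reports {REPORTED_TOTAL_M}M)")
# List datasets from largest to smallest share.
for name, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{share:6.1%}  {name}")
```

The per-dataset counts sum to the reported total up to rounding. IGC accounts for just over half of the tokens, and IGC, IC3, and Skemman together make up about 97% of the corpus.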

If you find this model useful, please cite the following paper:

@inproceedings{snaebjarnarson-etal-2022-warm,
    title = "A Warm Start and a Clean Crawled Corpus - A Recipe for Good Language Models",
    author = "Sn{\ae}bjarnarson, V{\'e}steinn  and
      S{\'\i}monarson, Haukur Barri  and
      Ragnarsson, P{\'e}tur Orri  and
      Ing{\'o}lfsd{\'o}ttir, Svanhv{\'\i}t Lilja  and
      J{\'o}nsson, Haukur  and
      Thorsteinsson, Vilhjalmur  and
      Einarsson, Hafsteinn",
    booktitle = "Proceedings of the Thirteenth Language Resources and Evaluation Conference",
    month = jun,
    year = "2022",
    address = "Marseille, France",
    publisher = "European Language Resources Association",
    url = "https://aclanthology.org/2022.lrec-1.464",
    pages = "4356--4366",
}
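As a usage illustration, here is a minimal sketch of querying the model for masked-token predictions. It assumes that the Hugging Face `transformers` library is installed and that the `vesteinn/IceBERT` checkpoint can be loaded through its `fill-mask` pipeline; this card only states that the model was trained with fairseq, so treat that as an assumption. The helper names `build_masked_text` and `top_predictions`, and the example sentence, are hypothetical.

```python
MODEL_ID = "vesteinn/IceBERT"  # repository name of this model


def build_masked_text(template, mask_token):
    """Insert the tokenizer's mask token into a template containing '{mask}'."""
    return template.format(mask=mask_token)


def top_predictions(template, k=5):
    """Return the k most likely fillers for the masked slot as (token, score) pairs.

    Assumes the checkpoint is compatible with the transformers fill-mask pipeline.
    The first call downloads the model weights.
    """
    # Imported here so that the module can be imported without transformers installed.
    from transformers import pipeline

    fill = pipeline("fill-mask", model=MODEL_ID)
    # Read the mask token from the tokenizer instead of hard-coding it.
    text = build_masked_text(template, fill.tokenizer.mask_token)
    return [(pred["token_str"], pred["score"]) for pred in fill(text, top_k=k)]
```

For example, `top_predictions("Reykjavík er {mask} Íslands.")` would return the five highest-scoring candidate tokens for the masked position along with their scores.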