HerBERT
HerBERT is a BERT-based language model trained on Polish corpora using only the MLM (masked language modeling) objective with dynamic whole-word masking. For more details, please refer to: KLEJ: Comprehensive Benchmark for Polish Language Understanding.
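To make "dynamic masking of whole words" concrete, here is a minimal sketch in plain Python. It is not the training code used for HerBERT. The `<mask>` token, the `</w>` end-of-word marker (as produced by XLM-style BPE), the masking rate, and the example subword split are all assumptions chosen for illustration.

```python
import random

MASK = "<mask>"        # assumed mask token, for illustration only
END_OF_WORD = "</w>"   # assumed word-final marker, as in XLM-style BPE output


def group_into_words(tokens):
    """Return lists of token indices, one list per whole word."""
    words, current = [], []
    for i, tok in enumerate(tokens):
        current.append(i)
        if tok.endswith(END_OF_WORD):
            words.append(current)
            current = []
    if current:  # trailing pieces without an end-of-word marker
        words.append(current)
    return words


def mask_whole_words(tokens, rate=0.15, rng=random):
    """Mask every subword of each selected word; return (masked, labels)."""
    masked = list(tokens)
    labels = [None] * len(tokens)  # None = position is not predicted
    for word in group_into_words(tokens):
        if rng.random() < rate:
            for i in word:
                labels[i] = tokens[i]
                masked[i] = MASK
    return masked, labels


# Hypothetical subword split; real tokenizer output may differ.
tokens = ["lep", "szą</w>", "sztu", "kę</w>", "ma</w>", "rząd</w>"]
rng = random.Random(0)
# Dynamic masking: a fresh mask is drawn each time the sequence is seen.
for epoch in range(3):
    print(mask_whole_words(tokens, rate=0.5, rng=rng)[0])
```

The key property is that a word is either masked in full or left untouched, so the model can never recover a masked piece from its unmasked neighbours within the same word.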
Dataset
The HerBERT training dataset is a combination of several publicly available corpora for the Polish language:
Corpus | Tokens | Texts |
---|---|---|
OSCAR | 6710M | 145M |
Open Subtitles | 1084M | 1.1M |
Wikipedia | 260M | 1.5M |
Wolne Lektury | 41M | 5.5k |
Allegro Articles | 18M | 33k |
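As a quick sanity check on the corpus mix, the snippet below sums the token counts from the table and prints each corpus's share. The numbers are copied from the table; the variable names are illustrative only.

```python
# Token counts (in millions) copied from the table above.
tokens_m = {
    "OSCAR": 6710,
    "Open Subtitles": 1084,
    "Wikipedia": 260,
    "Wolne Lektury": 41,
    "Allegro Articles": 18,
}

total_m = sum(tokens_m.values())
print(f"Total: {total_m}M tokens")
for name, count in tokens_m.items():
    print(f"{name:<17} {100 * count / total_m:5.1f}%")
```

The total comes to 8113M tokens, of which OSCAR alone contributes roughly 83%.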
Tokenizer
The training dataset was tokenized into subwords using the HerBERT Tokenizer, a character-level byte-pair encoding with a vocabulary size of 50k tokens. The tokenizer itself was trained on Wolne Lektury and a publicly available subset of the National Corpus of Polish using the fastBPE library.
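For readers unfamiliar with character-level byte-pair encoding, the following toy sketch shows how merges are learned: start from single characters and repeatedly merge the most frequent adjacent pair. It illustrates the general procedure only and is not fastBPE's implementation. The `</w>` end-of-word marker and the function name `learn_bpe` are assumptions, and input words are assumed to be non-empty.

```python
from collections import Counter


def learn_bpe(words, num_merges):
    """Learn up to num_merges BPE merges from a list of non-empty words."""
    # Start from characters, marking the last character of each word.
    vocab = Counter(tuple(w[:-1]) + (w[-1] + "</w>",) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word with the chosen pair merged into one symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges


print(learn_bpe(["lepszą", "lepszy", "sztukę"], 5))
```

In a real tokenizer the number of merges is chosen so that the final vocabulary reaches the target size, which is 50k tokens in HerBERT's case.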
The tokenizer uses the XLMTokenizer implementation; for that reason, one should load it as allegro/herbert-klej-cased-tokenizer-v1.
HerBERT models summary
Model | WWM | Cased | Tokenizer | Vocab Size | Batch Size | Train Steps |
---|---|---|---|---|---|---|
herbert-klej-cased-v1 | YES | YES | BPE | 50K | 570 | 180k |
Model evaluation
HerBERT was evaluated on the KLEJ benchmark, a publicly available set of nine evaluation tasks for Polish language understanding. It had the best average performance and obtained the best results on three of the tasks.
Model | Average | NKJP-NER | CDSC-E | CDSC-R | CBD | PolEmo2.0-IN | PolEmo2.0-OUT | DYK | PSC | AR |
---|---|---|---|---|---|---|---|---|---|---|
herbert-klej-cased-v1 | 80.5 | 92.7 | 92.5 | 91.9 | 50.3 | 89.2 | 76.3 | 52.1 | 95.3 | 84.5 |
The full leaderboard is available online.
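The Average column is consistent with an unweighted mean of the nine task scores. The short check below assumes that definition and uses the scores copied from the table above.

```python
# Per-task scores for herbert-klej-cased-v1, copied from the table above.
scores = {
    "NKJP-NER": 92.7, "CDSC-E": 92.5, "CDSC-R": 91.9, "CBD": 50.3,
    "PolEmo2.0-IN": 89.2, "PolEmo2.0-OUT": 76.3, "DYK": 52.1,
    "PSC": 95.3, "AR": 84.5,
}

# Assumption: the benchmark average is a plain arithmetic mean.
average = sum(scores.values()) / len(scores)
print(round(average, 1))  # 80.5
```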
HerBERT usage
Model training and experiments were conducted with transformers version 2.0.
Example code:
from transformers import XLMTokenizer, RobertaModel
# The tokenizer and the model are published as separate repositories.
tokenizer = XLMTokenizer.from_pretrained("allegro/herbert-klej-cased-tokenizer-v1")
model = RobertaModel.from_pretrained("allegro/herbert-klej-cased-v1")
# Encode a sentence into a tensor of token ids and run it through the model.
encoded_input = tokenizer.encode("Kto ma lepszą sztukę, ma lepszy rząd – to jasne.", return_tensors='pt')
outputs = model(encoded_input)
HerBERT can also be loaded using AutoTokenizer and AutoModel:
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("allegro/herbert-klej-cased-tokenizer-v1")
model = AutoModel.from_pretrained("allegro/herbert-klej-cased-v1")
License
CC BY-SA 4.0
Citation
If you use this model, please cite the following paper:
@inproceedings{rybak-etal-2020-klej,
title = "{KLEJ}: Comprehensive Benchmark for {P}olish Language Understanding",
author = "Rybak, Piotr and
Mroczkowski, Robert and
Tracz, Janusz and
Gawlik, Ireneusz",
booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
month = jul,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.acl-main.111",
doi = "10.18653/v1/2020.acl-main.111",
pages = "1191--1201",
}
Authors
The model was trained by the Allegro Machine Learning Research team.
You can contact us at: [email protected]