---
language:
- om
- am
- rw
- rn
- ha
- ig
- so
- sw
- ti
- yo
- pcm
- multilingual
license: mit
datasets:
- castorini/afriberta-corpus
---

# afriberta_large

## Model description

AfriBERTa large is a pretrained multilingual language model with around 126 million parameters.

The model has 10 layers, 6 attention heads, a hidden size of 768, and a feed-forward size of 3072.
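
As a rough sanity check on these architecture figures, the sketch below counts the weights in the encoder stack alone. It assumes standard BERT-style Transformer encoder layers with bias terms (four attention projections, a two-layer feed-forward block, and two layer norms per layer); that layout is an assumption for illustration, not something stated in this card, and the function name is hypothetical.

```python
def encoder_layer_params(hidden: int, feed_forward: int) -> int:
    """Parameter count of one BERT-style encoder layer (assumed layout)."""
    # Query, key, value, and output projections, each with a bias vector.
    attention = 4 * (hidden * hidden + hidden)
    # Feed-forward block: hidden -> feed_forward -> hidden, with biases.
    ffn = (hidden * feed_forward + feed_forward) + (feed_forward * hidden + hidden)
    # Two layer norms, each with a scale and a shift vector.
    layer_norms = 2 * (2 * hidden)
    return attention + ffn + layer_norms


# Figures taken from the model description.
NUM_LAYERS, NUM_HEADS, HIDDEN, FEED_FORWARD = 10, 6, 768, 3072

# The hidden size must split evenly across the attention heads.
assert HIDDEN % NUM_HEADS == 0
head_dim = HIDDEN // NUM_HEADS

per_layer = encoder_layer_params(HIDDEN, FEED_FORWARD)
encoder_total = NUM_LAYERS * per_layer

print(f"head dimension:       {head_dim}")
print(f"params per layer:     {per_layer:,}")
print(f"params in 10 layers:  {encoder_total:,}")
```

Under these assumptions the encoder stack accounts for about 71 million parameters. The remainder of the roughly 126 million total would then sit mostly in the embedding matrices, whose size depends on the vocabulary size, which is not stated here.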

The model was pretrained on 11 African languages: Afaan Oromoo (also called Oromo), Amharic, Gahuza (a mixed language containing Kinyarwanda and Kirundi), Hausa, Igbo, Nigerian Pidgin, Somali, Swahili, Tigrinya and Yorùbá.

The model has been shown to obtain competitive downstream performance on text classification and Named Entity Recognition in several African languages, including some it was not pretrained on.

## Intended uses & limitations

#### How to use

You can use this model with Transformers for any downstream task.

For example, to fine-tune this model on a token classification task, load the model and tokenizer as follows:

```python
>>> from transformers import AutoTokenizer, AutoModelForTokenClassification
>>> model = AutoModelForTokenClassification.from_pretrained("castorini/afriberta_large")
>>> tokenizer = AutoTokenizer.from_pretrained("castorini/afriberta_large")
>>> # Set the max length manually: the tokenizer is an imported SentencePiece
>>> # model, which Hugging Face does not properly support right now.
>>> tokenizer.model_max_length = 512
```
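
Fine-tuning for token classification also requires mapping word-level labels onto the subword pieces the tokenizer produces. The sketch below shows one common convention (label only the first piece of each word and mask the rest); it is an illustration, not the procedure used by the AfriBERTa authors. The function name `align_labels` and the example data are hypothetical.

```python
from typing import List, Optional

# PyTorch's cross-entropy loss skips targets equal to -100 by default.
IGNORE_INDEX = -100


def align_labels(word_ids: List[Optional[int]], word_labels: List[int]) -> List[int]:
    """Map word-level labels to subword tokens.

    word_ids holds, for each token, the index of the word it came from,
    or None for special tokens such as <s> and </s>.
    """
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            # Special token: excluded from the loss.
            aligned.append(IGNORE_INDEX)
        elif word_id != previous:
            # First piece of a word carries the word's label.
            aligned.append(word_labels[word_id])
        else:
            # Later pieces of the same word are masked out.
            aligned.append(IGNORE_INDEX)
        previous = word_id
    return aligned


# Hypothetical example: three words, the second split into two pieces.
example_word_ids = [None, 0, 1, 1, 2, None]
example_labels = [0, 3, 0]
aligned = align_labels(example_word_ids, example_labels)
print(aligned)  # [-100, 0, 3, -100, 0, -100]
```

Assuming a fast tokenizer is loaded, the `word_ids` list can be obtained from `tokenizer(words, is_split_into_words=True, truncation=True).word_ids()`.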

#### Limitations and bias

- This model may be limited by its training data, which comes mostly from news articles published within a specific span of time. It may therefore not generalize well.

- This model was trained on very little data (less than 1 GB), so it may not have seen enough text to learn very complex linguistic relations.

## Training data

The model was trained on an aggregation of datasets from the BBC news website and Common Crawl.

## Training procedure

For information on training procedures, please refer to the AfriBERTa [paper](https://aclanthology.org/2021.mrl-1.11) or [repository](https://github.com/keleog/afriberta).

### BibTeX entry and citation info

```
@inproceedings{ogueji-etal-2021-small,
    title = "Small Data? No Problem! Exploring the Viability of Pretrained Multilingual Language Models for Low-resourced Languages",
    author = "Ogueji, Kelechi and
      Zhu, Yuxin and
      Lin, Jimmy",
    booktitle = "Proceedings of the 1st Workshop on Multilingual Representation Learning",
    month = nov,
    year = "2021",
    address = "Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.mrl-1.11",
    pages = "116--126",
}
```