roberta-base-amharic

This model has the same architecture as xlm-roberta-base and was pretrained from scratch on the Amharic subsets of the oscar, mc4, and amharic-sentences-corpus datasets, totaling 290 million tokens. The tokenizer was trained from scratch on the same text corpus and has a vocabulary size of 32k.
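
As a quick reference, the tokenizer can be loaded and inspected with the standard transformers API (a minimal sketch; the 32k figure is the reported vocabulary size):

from transformers import AutoTokenizer

# Load the Amharic tokenizer that was trained from scratch on the pretraining corpus
tokenizer = AutoTokenizer.from_pretrained("rasyosef/roberta-base-amharic")

# Should be roughly 32k, as reported above
print(tokenizer.vocab_size)

# Tokenize a sample Amharic sentence into subword pieces
print(tokenizer.tokenize("ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ ዓመት ተቆጥሯል።"))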

The model was trained for 22 hours on an A100 40GB GPU.

It achieves the following results on the evaluation set:

  • Loss: 2.09
  • Perplexity: 8.08
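
The reported perplexity is simply the exponential of the evaluation cross-entropy loss, which is how these two numbers relate:

import math

# Perplexity is the exponentiated cross-entropy loss
eval_loss = 2.09
perplexity = math.exp(eval_loss)
print(round(perplexity, 2))  # 8.08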

With 110 million parameters, this is currently the best-performing Amharic encoder model, beating the 2.5x larger, 279-million-parameter multilingual xlm-roberta-base model on Amharic Sentiment Classification and Named Entity Recognition tasks.

How to use

You can use this model directly with a pipeline for masked language modeling:

>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='rasyosef/roberta-base-amharic')
>>> unmasker("ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ <mask> ተቆጥሯል።")

[{'score': 0.40162667632102966,
  'token': 137,
  'token_str': 'ዓመት',
  'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ ዓመት ተቆጥሯል።'},
 {'score': 0.24096301198005676,
  'token': 346,
  'token_str': 'አመት',
  'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ አመት ተቆጥሯል።'},
 {'score': 0.15971705317497253,
  'token': 217,
  'token_str': 'ዓመታት',
  'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ ዓመታት ተቆጥሯል።'},
 {'score': 0.13074122369289398,
  'token': 733,
  'token_str': 'አመታት',
  'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ አመታት ተቆጥሯል።'},
 {'score': 0.03847867250442505,
  'token': 194,
  'token_str': 'ዘመን',
  'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ ዘመን ተቆጥሯል።'}]
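
The checkpoint can also be loaded directly instead of through the pipeline, for example to extract contextual token embeddings (a minimal sketch; since the architecture matches xlm-roberta-base, the hidden size is 768):

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("rasyosef/roberta-base-amharic")
model = AutoModel.from_pretrained("rasyosef/roberta-base-amharic")

# Encode an Amharic sentence and take the last hidden states as token embeddings
inputs = tokenizer("ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ ዓመት ተቆጥሯል።", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

print(outputs.last_hidden_state.shape)  # torch.Size([1, sequence_length, 768])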

Finetuning

This model was finetuned and evaluated on the following Amharic NLP tasks: Sentiment Classification and Named Entity Recognition.
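
A minimal fine-tuning sketch for the sentiment classification task, using the standard transformers Trainer API, is shown below; the toy dataset, label count, and hyperparameters are placeholders and not the settings used to produce the reported results.

from datasets import Dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    Trainer,
    TrainingArguments,
)

model_id = "rasyosef/roberta-base-amharic"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Attach a fresh classification head on top of the pretrained encoder
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

# Tiny toy dataset (placeholder examples); a real Amharic sentiment dataset would be used here
data = Dataset.from_dict({"text": ["ጥሩ ፊልም ነው።", "በጣም መጥፎ ነበር።"], "label": [1, 0]})
data = data.map(lambda x: tokenizer(x["text"], truncation=True, padding="max_length", max_length=64))

args = TrainingArguments(
    output_dir="roberta-base-amharic-sentiment",
    num_train_epochs=1,
    per_device_train_batch_size=2,
)
trainer = Trainer(model=model, args=args, train_dataset=data)
trainer.train()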

Finetuned Model Performance

The reported F1 scores are macro averages.

| Model | Size (# params) | Perplexity | Sentiment (F1) | Named Entity Recognition (F1) |
|---|---|---|---|---|
| roberta-base-amharic | 110M | 8.08 | 0.88 | 0.78 |
| bert-medium-amharic | 40.5M | 13.74 | 0.83 | 0.68 |
| bert-small-amharic | 27.8M | 15.96 | 0.83 | 0.68 |
| bert-mini-amharic | 10.7M | 22.42 | 0.81 | 0.64 |
| bert-tiny-amharic | 4.18M | 71.52 | 0.79 | 0.54 |
| xlm-roberta-base | 279M | - | 0.83 | 0.73 |
| afro-xlmr-base | 278M | - | 0.83 | 0.75 |
| afro-xlmr-large | 560M | - | 0.86 | 0.76 |
| am-roberta | 443M | - | 0.82 | 0.69 |