PuoBERTaJW300: A curated Setswana Language Model (trained on PuoData + JW300 Setswana)


A RoBERTa-based language model designed specifically for Setswana, trained on the new PuoData dataset and the JW300 Setswana corpus.

NOTE: If you are looking for the model without JW300, go to https://huggingface.co/dsfsi/PuoBERTa

Model Details

Model Description

This is a masked language model trained on Setswana corpora, intended as a base for downstream applications ranging from translation to content creation. It is trained on the curated PuoData dataset together with the JW300 Setswana corpus, with the aim of accuracy and cultural relevance.

  • Developed by: Vukosi Marivate (@vukosi), Moseli Mots'Oehli (@MoseliMotsoehli), Valencia Wagner, Richard Lastrucci, and Isheanesu Dzingirai
  • Model type: RoBERTa Model
  • Language(s) (NLP): Setswana
  • License: CC BY 4.0

Usage

Use this model to fill in masked tokens, or fine-tune it for downstream tasks. The snippet below loads the base model and tokenizer:

from transformers import RobertaTokenizer, RobertaModel

# Load model and tokenizer
model = RobertaModel.from_pretrained('dsfsi/PuoBERTaJW300')
tokenizer = RobertaTokenizer.from_pretrained('dsfsi/PuoBERTaJW300')
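
To get actual mask predictions, a masked-language-modelling head is needed on top of the base encoder. The following is a minimal sketch, assuming the checkpoint works with the standard `fill-mask` pipeline from `transformers`; the helper names `insert_mask` and `predict_masked` and the `____` blank marker are illustrative and not part of the PuoBERTa release.

```python
def insert_mask(text, mask_token, blank="____"):
    """Swap a single blank marker for the tokenizer's mask token."""
    if text.count(blank) != 1:
        raise ValueError("text must contain exactly one blank marker")
    return text.replace(blank, mask_token)


def predict_masked(text, model_id="dsfsi/PuoBERTaJW300", top_k=5):
    """Return (token, score) pairs for the blank in `text`.

    Assumption: the checkpoint loads with the generic fill-mask pipeline.
    """
    # Imported here so that defining the helper does not require
    # transformers to be installed or the model to be downloaded.
    from transformers import pipeline

    fill_mask = pipeline("fill-mask", model=model_id)
    # Read the mask token from the tokenizer instead of hard-coding it.
    masked = insert_mask(text, fill_mask.tokenizer.mask_token)
    results = fill_mask(masked, top_k=top_k)
    return [(r["token_str"], r["score"]) for r in results]
```

Calling `predict_masked("Leina la me ke ____.")` (an illustrative sentence meaning "My name is ____.") downloads the model on first use and returns the highest-scoring candidate tokens for the blank.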

Downstream Use

Downstream Performance

Daily News Dikgang

Learn more about the dataset in the Dataset Folder

| Model | 5-fold Cross Validation F1 | Test F1 |
|---|---|---|
| Logistic Regression + TFIDF | 60.1 | 56.2 |
| NCHLT TSN RoBERTa | 64.7 | 60.3 |
| PuoBERTa | 63.8 | 62.9 |
| PuoBERTaJW300 | 66.2 | 65.4 |
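
The card does not state which F1 average is reported. The sketch below assumes macro-averaging and shows one way a "5-fold cross validation F1" figure could be computed from per-fold predictions; `macro_f1` and `cross_validated_f1` are illustrative names, not functions from the PuoBERTa code base.

```python
from collections import Counter
from statistics import mean


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted label gains a false positive
            fn[truth] += 1  # true label gains a false negative
    scores = []
    for label in sorted(set(y_true) | set(y_pred)):
        # F1 = 2TP / (2TP + FP + FN); the denominator is non-zero
        # because each label occurs at least once in y_true or y_pred.
        scores.append(2 * tp[label] / (2 * tp[label] + fp[label] + fn[label]))
    return mean(scores)


def cross_validated_f1(folds):
    """Average macro-F1 (as a percentage) over (y_true, y_pred) folds."""
    return 100 * mean(macro_f1(y_true, y_pred) for y_true, y_pred in folds)
```

Each fold contributes the predictions made on its held-out split; the test F1 column would instead come from a single call to `macro_f1` on the separate test set.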

Downstream News Categorisation model 🤗 https://huggingface.co/dsfsi/PuoBERTa-News

MasakhaPOS

Performance of models on the MasakhaPOS downstream task.

| Model | Test Performance |
|---|---|
| Multilingual Models | |
| AfroLM | 83.8 |
| AfriBERTa | 82.5 |
| AfroXLMR-base | 82.7 |
| AfroXLMR-large | 83.0 |
| Monolingual Models | |
| NCHLT TSN RoBERTa | 82.3 |
| PuoBERTa | 83.4 |
| PuoBERTa+JW300 | 84.1 |

Downstream POS model 🤗 https://huggingface.co/dsfsi/PuoBERTa-POS

MasakhaNER

Performance of models on the MasakhaNER downstream task.

| Model | Test Performance (F1 score) |
|---|---|
| Multilingual Models | |
| AfriBERTa | 83.2 |
| AfroXLMR-base | 87.7 |
| AfroXLMR-large | 89.4 |
| Monolingual Models | |
| NCHLT TSN RoBERTa | 74.2 |
| PuoBERTa | 78.2 |
| PuoBERTa+JW300 | 80.2 |

Downstream NER model 🤗 https://huggingface.co/dsfsi/PuoBERTa-NER

Pre-Training Dataset

We pre-trained the model on the PuoData dataset, a curated collection of Setswana text, together with the JW300 Setswana corpus.

Github, 🤗 https://huggingface.co/datasets/dsfsi/PuoData

Citation Information

Bibtex Reference

@inproceedings{marivate2023puoberta,
  title   = {PuoBERTa: Training and evaluation of a curated language model for Setswana},
  author  = {Vukosi Marivate and Moseli Mots'Oehli and Valencia Wagner and Richard Lastrucci and Isheanesu Dzingirai},
  year    = {2023},
  booktitle = {Artificial Intelligence Research. SACAIR 2023. Communications in Computer and Information Science},
  url     = {https://link.springer.com/chapter/10.1007/978-3-031-49002-6_17},
  keywords = {NLP},
  preprint_url = {https://arxiv.org/abs/2310.09141},
  dataset_url = {https://github.com/dsfsi/PuoBERTa},
  software_url = {https://huggingface.co/dsfsi/PuoBERTa}
}

Contributing

Your contributions are welcome! Feel free to improve the model.

Model Card Authors

Vukosi Marivate

Model Card Contact

For more details, reach out or check our website.

Email: [email protected]

Enjoy exploring Setswana through AI!
