byt5-dv

Pretrained from scratch on Dhivehi (the language of the Maldives) with ByT5, Google's byte-level tokenization strategy.
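
To illustrate what byte-level tokenization means for Dhivehi, here is a minimal sketch using the public google/byt5-small tokenizer from Hugging Face transformers (not this model's own checkpoint). Thaana text is encoded directly as UTF-8 bytes, so there is no out-of-vocabulary problem:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/byt5-small")

# ByT5 maps each UTF-8 byte to one token ID (byte value + 3, after the
# pad/eos/unk special tokens), then appends the </s> end-of-sequence token.
text = "ދިވެހި"  # "Dhivehi" in Thaana script
ids = tokenizer(text).input_ids

print(ids)                        # one ID per UTF-8 byte, plus the final </s>
print(len(text.encode("utf-8")))  # equals len(ids) - 1
```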

Corpus: dv.wikipedia.org as of March 2020 (TFDS)
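
For reference, a minimal sketch of loading that corpus through TFDS, assuming the 20200301.dv Wikipedia config (the March 2020 dump) is available in your TFDS version:

```python
import tensorflow_datasets as tfds

# Dhivehi Wikipedia, March 2020 snapshot.
ds = tfds.load("wikipedia/20200301.dv", split="train")

for example in ds.take(1):
    # Each example carries a "title" and a "text" field.
    print(example["title"].numpy().decode("utf-8"))
```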

Notebook - Pretraining on Wikipedia: https://colab.research.google.com/drive/19Afq7CI6cOi1DaTpnQhBbEbnBzLSFHbH

Demo

Notebook - Finetuning on Maldivian news classification task: https://colab.research.google.com/drive/11u5SafR4bKICmArgDl6KQ9vqfYtDpyWp
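
The notebook has the authoritative training setup; below is only a minimal sketch of how a T5-family model such as this one can be fine-tuned for classification in the usual text-to-text style. The article text and label string here are hypothetical placeholders, not the actual dataset schema:

```python
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("google/byt5-small")
model = T5ForConditionalGeneration.from_pretrained("google/byt5-small")

article = "..."    # a Dhivehi news article (placeholder)
label = "sports"   # a category name as plain text (placeholder)

inputs = tokenizer(article, return_tensors="pt",
                   truncation=True, max_length=512)
targets = tokenizer(label, return_tensors="pt").input_ids

# The model learns to generate the category string byte by byte.
loss = model(**inputs, labels=targets).loss
loss.backward()
```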

Current performance on the news classification task:

  • mBERT: 52%
  • byt5-dv: 81%
  • dv-wave (ELECTRA): 89%
  • dv-muril: 90.7%
  • dv-labse: 91.3-91.5%

Source of dataset: https://github.com/Sofwath/DhivehiDatasets

Work in progress - todos

The Wikipedia corpus is too small for this language. In the future I would like to add OSCAR and Sofwath's Maldivian corpus, if I can rewrite the pretraining script to accept them as one TFDS dataset.
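
One possible way to merge the corpora, sketched here with plain tf.data pipelines rather than a proper TFDS builder; the extra corpus path is a hypothetical placeholder:

```python
import tensorflow as tf
import tensorflow_datasets as tfds

# Wikipedia dump, reduced to a stream of raw text documents.
wiki = tfds.load("wikipedia/20200301.dv", split="train")
wiki_text = wiki.map(lambda ex: ex["text"])

# Any additional corpus exported as one document per line can be read
# the same way and appended to the stream.
extra = tf.data.TextLineDataset("dhivehi_corpus.txt")  # placeholder path

combined = wiki_text.concatenate(extra)
```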

This is based on ByT5-small; we should try a larger model.

The model also needs more pretraining time.
