Description
This tokenizer is designed for Moroccan Darija, a dialectal variety of Arabic (ISO 639-3 code: ary).
It was trained with the Byte Pair Encoding (BPE) algorithm on the atlasia/AL-Atlas-Moroccan-Darija-Pretraining-Dataset dataset.
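For readers unfamiliar with BPE, the following is a minimal pure-Python sketch of the general idea: start from individual characters and repeatedly merge the most frequent adjacent pair. It is not the training code used for this tokenizer. The toy corpus, the function names (`train_bpe`, `encode_word`, `apply_merge`), the number of merges, and the character-level (rather than byte-level) starting alphabet are all assumptions made for illustration.

```python
from collections import Counter


def apply_merge(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from whitespace-split words."""
    # Each word starts as a tuple of single characters, mapped to its frequency.
    words = Counter(tuple(w) for line in corpus for w in line.split())
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Most frequent pair wins; ties are broken by the pair itself for determinism.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        # Rewrite every word with the new merge applied.
        new_words = Counter()
        for symbols, freq in words.items():
            new_words[tuple(apply_merge(list(symbols), best))] += freq
        words = new_words
    return merges


def encode_word(word, merges):
    """Tokenize one word by replaying the learned merges in order."""
    symbols = list(word)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return symbols


# Hypothetical toy corpus, not taken from the training dataset.
toy_corpus = ["كيفاش داير", "كيفاش نديرو", "داير مزيان"]
merges = train_bpe(toy_corpus, num_merges=6)
tokens = encode_word("كيفاش", merges)
print(merges)
print(tokens)
```

Frequent character sequences end up as single tokens, while rare words remain split into smaller pieces. Production tokenizers typically add pre-tokenization, normalization, and special tokens on top of this loop; whether this tokenizer does so is not stated here.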
Features
- Tokenizes Moroccan Darija text efficiently (see the Moroccan Darija leaderboard).
- Handles dialectal variation and features specific to Moroccan Darija robustly.