A Robust Transformer Model for Arabic Dialect Identification (ADI) in Speech
We present an accurate and robust Transformer-based model for Arabic Dialect Identification (ADI) in Speech. We fine-tune the pre-trained MMS (Massively Multilingual Speech) model on diverse Arabic TV broadcast speech to identify Modern Standard Arabic (MSA) as well as four major Arabic dialects. You can interact with the model using this Hugging Face space.
The model can identify the following Arabic dialects/varieties:
- Modern Standard Arabic (MSA)
- Egyptian Arabic (Masri and Sudani)
- Gulf Arabic (Khleeji, Iraqi, and Yemeni)
- Levantine Arabic (Shami)
- Maghrebi Arabic (Dialects of Al-Maghreb Al-Arabi in North Africa)
Info
- Developed by: Badr M. Abdullah and Matthew Baas
- Model type: wav2vec 2.0 architecture
- Language: Arabic (and its varieties)
- License: Creative Commons Attribution 4.0 (CC BY 4.0)
- Finetuned from model: MMS-300m [https://huggingface.co/facebook/mms-300m]
Training Data
TV broadcast speech (news, interviews, discussions, TV shows, etc.)
Evaluation
The model has been tested and evaluated on different datasets that present challenges to dialect classification (e.g., background noise, channel mismatch, emotional tone in speech). The model performed very well in our evaluation, and we expect it to be robust to real-world speech samples.
Uses
The model can be used as a component in a large-scale speech data collection pipeline to create resources for different Arabic dialects. It can also be used to filter speech data for Modern Standard Arabic (MSA), which can then be used to develop text-to-speech (TTS) systems.
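The MSA filtering use case can be sketched as a small post-processing step over the classifier output. This is only an illustration: the helper names `is_confident_msa` and `filter_msa` and the 0.8 threshold are assumptions, not part of the released model. The input is assumed to be a list of `{"label", "score"}` dictionaries per clip, which is the shape printed in the Direct Use example.

```python
from typing import Dict, List

Prediction = Dict[str, object]  # e.g. {"label": "MSA", "score": 0.837}


def is_confident_msa(
    predictions: List[Prediction],
    threshold: float = 0.8,   # assumed cut-off, tune on your own data
    msa_label: str = "MSA",   # label string as printed in the Direct Use output
) -> bool:
    """Return True if the top-scoring label is MSA with score >= threshold."""
    if not predictions:
        return False
    top = max(predictions, key=lambda p: p["score"])
    return top["label"] == msa_label and top["score"] >= threshold


def filter_msa(
    clips: Dict[str, List[Prediction]], threshold: float = 0.8
) -> List[str]:
    """Keep the paths of clips whose predictions pass the MSA check.

    `clips` maps a clip path to the prediction list returned for that clip.
    """
    return [
        path
        for path, preds in clips.items()
        if is_confident_msa(preds, threshold)
    ]
```

A stricter threshold trades recall for purity of the resulting MSA subset; since confidence scores can be miscalibrated, spot-checking a sample of the kept clips is advisable.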
Direct Use
from transformers import pipeline
# Load the model
model_id = "badrex/mms-300m-arabic-dialect-identifier"
adi5_classifier = pipeline(
    "audio-classification",
    model=model_id,
    device='cpu'
)
# Predict dialect for an audio sample
audio_path = "./samples/arabic_audio_sample.mp3"
predictions = adi5_classifier(audio_path)
for pred in predictions:
    print(f"Dialect: {pred['label']:<10} Confidence: {pred['score']:.4f}")
# Dialect: MSA        Confidence: 0.8370
# Dialect: Levantine  Confidence: 0.1208
# Dialect: Egyptian   Confidence: 0.0406
# Dialect: Gulf       Confidence: 0.0011
# Dialect: Maghrebi   Confidence: 0.0004
Out-of-Scope Use
The model should not be used to:
- Assess fluency or nativeness of speech
- Determine whether the speaker uses a formal or informal register
- Make judgments about a speaker's origin, education level, or socioeconomic status
- Filter or discriminate against speakers based on dialect
Bias, Risks, and Limitations
Some Arabic varieties are not well-represented in the training data. The model may not work well for some dialects such as Yemeni Arabic, Iraqi Arabic, and Saharan Arabic.
Additional limitations include:
- Very short audio samples (< 2 seconds) may not provide enough information for accurate classification
- Code-switching between dialects (especially mixing with MSA) may result in less reliable classifications
- Speakers who have lived in multiple dialect regions may exhibit mixed features
- Speech from non-typical speakers such as children and people with speech disorders might be challenging for the model
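One way to act on the short-audio limitation is to check clip duration before classification. The sketch below uses only Python's standard `wave` module, so it assumes uncompressed PCM WAV input (an MP3 such as the one in the Direct Use example would need to be decoded first). The function names are hypothetical; the 2-second floor is taken from the limitation listed above.

```python
import wave

MIN_SECONDS = 2.0  # from the "< 2 seconds" limitation above


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def is_too_short(path: str, min_seconds: float = MIN_SECONDS) -> bool:
    """Return True if the clip is shorter than the minimum duration."""
    return wav_duration_seconds(path) < min_seconds
```

Clips flagged by `is_too_short` could be skipped or merged with neighbouring segments rather than classified on their own.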
Recommendations
- For optimal results, use audio segments of at least 5-10 seconds
- Confidence scores may not always be informative (e.g., the model can make a wrong prediction while still being highly confident)
- For critical applications, consider human verification of model predictions
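The recommendations above could be combined into a simple workflow: cut a long recording into windows of roughly 10 seconds, classify each window, average the scores, and route ambiguous results to a human reviewer. The sketch below covers only the bookkeeping and is not the authors' procedure. The function names, the 10-second window, the 5-second minimum for a trailing window, and the 0.2 margin are all assumptions, and each window's prediction list is assumed to cover the same set of labels.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def segment_bounds(
    num_samples: int,
    sample_rate: int,
    segment_seconds: float = 10.0,  # assumed window length
    min_seconds: float = 5.0,       # assumed minimum for a trailing window
) -> List[Tuple[int, int]]:
    """Return (start, end) sample indices of consecutive windows.

    A trailing window shorter than `min_seconds` is dropped, unless it is
    the only window (i.e. the whole clip is short).
    """
    seg_len = int(segment_seconds * sample_rate)
    min_len = int(min_seconds * sample_rate)
    bounds: List[Tuple[int, int]] = []
    for start in range(0, num_samples, seg_len):
        end = min(start + seg_len, num_samples)
        if end - start >= min_len or not bounds:
            bounds.append((start, end))
    return bounds


def average_scores(segment_predictions: List[List[dict]]) -> Dict[str, float]:
    """Average per-label scores over all windows of one recording."""
    if not segment_predictions:
        return {}
    totals: Dict[str, float] = defaultdict(float)
    for preds in segment_predictions:
        for pred in preds:
            totals[pred["label"]] += pred["score"]
    count = len(segment_predictions)
    return {label: total / count for label, total in totals.items()}


def needs_human_review(avg_scores: Dict[str, float], min_margin: float = 0.2) -> bool:
    """Flag a recording when the top two averaged scores are close."""
    ranked = sorted(avg_scores.values(), reverse=True)
    if not ranked:
        return True  # nothing to base a decision on
    runner_up = ranked[1] if len(ranked) > 1 else 0.0
    return ranked[0] - runner_up < min_margin
```

Because the model can be confidently wrong, a margin check like this catches only ambiguous cases; it does not replace human verification of a random sample in critical applications.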
Citation
BibTeX:
@misc{abdullah2025arabicadi,
author = {Abdullah, Badr M. and Baas, Matthew},
title = {A Robust Transformer Model for Arabic Dialect Identification in Speech},
year = {2025},
publisher = {Hugging Face},
howpublished = {\url{https://huggingface.co/badrex/mms-300m-arabic-dialect-identifier}}
}
APA: Abdullah, B. M., & Baas, M. (2025). A Robust Transformer Model for Arabic Dialect Identification in Speech. Retrieved from https://huggingface.co/badrex/mms-300m-arabic-dialect-identifier/
Model Card Contact
If you have any questions, please do not hesitate to write an email to badr dot nlp at gmail dot com