A Robust Transformer Model for Arabic Dialect Identification (ADI) in Speech

We present an accurate and robust Transformer-based model for Arabic Dialect Identification (ADI) in Speech. We fine-tune the pre-trained MMS (Massively Multilingual Speech) model on diverse Arabic TV broadcast speech to identify Modern Standard Arabic (MSA) as well as four major Arabic dialects. You can interact with the model using this Hugging Face space.

The model can identify the following Arabic dialects/varieties:

Modern Standard Arabic (MSA)
Egyptian Arabic (Masri and Sudani)
Gulf Arabic (Khaleeji, Iraqi, and Yemeni)
Levantine Arabic (Shami)
Maghrebi Arabic (Dialects of Al-Maghreb Al-Arabi in North Africa)

Info

Developed by: Badr M. Abdullah and Matthew Baas
Model type: wav2vec 2.0 architecture
Language: Arabic (and its varieties)
License: Creative Commons Attribution 4.0 (CC BY 4.0)
Finetuned from model: MMS-300m [https://huggingface.co/facebook/mms-300m]

Training Data

TV broadcast speech (news, interviews, discussions, TV shows, etc.)

Evaluation

The model has been tested and evaluated on different datasets that present challenges to dialect classification (e.g., background noise, channel mismatch, emotional tone in speech). The model performed very well in our evaluation, and we expect it to be robust to real-world speech samples.

Uses

The model can be used as a component in a large-scale speech data collection pipeline to create resources for different Arabic dialects. It can also be used to filter speech data for Modern Standard Arabic (MSA), and the filtered data can then be used to develop text-to-speech (TTS) systems.

Direct Use

from transformers import pipeline

# Load the model
model_id = "badrex/mms-300m-arabic-dialect-identifier"
adi5_classifier = pipeline(
    "audio-classification", 
    model=model_id,
    device='cpu'
)

# Predict dialect for an audio sample 
audio_path = "./samples/arabic_audio_sample.mp3"

predictions = adi5_classifier(audio_path)

for pred in predictions:
    print(f"Dialect: {pred['label']:<10} Confidence: {pred['score']:.4f}")

# Dialect: MSA        Confidence: 0.8370
# Dialect: Levantine  Confidence: 0.1208
# Dialect: Egyptian   Confidence: 0.0406
# Dialect: Gulf       Confidence: 0.0011
# Dialect: Maghrebi   Confidence: 0.0004
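
The MSA filtering use case mentioned under Uses could be built on top of the classifier above. The following is a minimal sketch, not part of the released model: the helper name filter_msa and the min_score threshold of 0.8 are assumptions made for illustration, and the threshold would need tuning on your own data. It relies only on the list-of-dicts output format shown in the example above.

```python
from typing import Callable, Dict, Iterable, List

# One prediction as returned by the audio-classification pipeline,
# e.g. {"label": "MSA", "score": 0.837}
Prediction = Dict[str, object]


def filter_msa(
    audio_paths: Iterable[str],
    classifier: Callable[[str], List[Prediction]],
    target_label: str = "MSA",
    min_score: float = 0.8,  # hypothetical threshold, tune on your data
) -> List[str]:
    """Return the paths whose top-scoring prediction is `target_label`
    with a score of at least `min_score`."""
    kept = []
    for path in audio_paths:
        predictions = classifier(path)
        # Pick the highest-scoring label for this file
        top = max(predictions, key=lambda p: p["score"])
        if top["label"] == target_label and top["score"] >= min_score:
            kept.append(path)
    return kept
```

With the pipeline loaded as above, a call such as filter_msa(["a.mp3", "b.mp3"], adi5_classifier) would return the subset of files classified as MSA. Passing the classifier as an argument keeps the helper easy to test with a stub.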

Out-of-Scope Use

The model should not be used to:

Assess fluency or nativeness of speech
Determine whether the speaker uses a formal or informal register
Make judgments about a speaker's origin, education level, or socioeconomic status
Filter or discriminate against speakers based on dialect

Bias, Risks, and Limitations

Some Arabic varieties are not well-represented in the training data. The model may not work well for some dialects such as Yemeni Arabic, Iraqi Arabic, and Saharan Arabic.

Additional limitations include:

Very short audio samples (< 2 seconds) may not provide enough information for accurate classification
Code-switching between dialects (especially mixing with MSA) may result in less reliable classifications
Speakers who have lived in multiple dialect regions may exhibit mixed features
Speech from non-typical speakers such as children and people with speech disorders might be challenging for the model

Recommendations

For optimal results, use audio segments of at least 5-10 seconds
Confidence scores may not always be informative (e.g., the model could make a wrong decision but still be very confident)
For critical applications, consider human verification of model predictions
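
These recommendations could be turned into a simple pre-screening step that decides which predictions to send for human verification. The sketch below is an illustration under stated assumptions: the function name review_flags, the 5-second minimum, and the 0.2 score margin are hypothetical choices, not values from our evaluation. Because confidence scores may be unreliable, an empty flag list does not mean the prediction is correct.

```python
from typing import Dict, List


def review_flags(
    predictions: List[Dict[str, object]],
    duration_s: float,
    min_duration_s: float = 5.0,  # assumed lower bound, from the 5-10 s guidance
    min_margin: float = 0.2,      # hypothetical gap between the top two scores
) -> List[str]:
    """Return reasons why a prediction might deserve human verification."""
    flags = []
    # Short clips may not carry enough information for the classifier
    if duration_s < min_duration_s:
        flags.append("short_audio")
    # A small gap between the two best labels suggests an ambiguous sample
    ranked = sorted(predictions, key=lambda p: p["score"], reverse=True)
    if len(ranked) >= 2 and ranked[0]["score"] - ranked[1]["score"] < min_margin:
        flags.append("small_margin")
    return flags
```

The clip duration has to be computed by the caller (for example, number of samples divided by the sampling rate), since the classifier output does not include it.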

Citation

BibTeX:

@misc{abdullah2025arabicadi,
  author = {Abdullah, Badr M. and Baas, Matthew},
  title = {A Robust Transformer Model for Arabic Dialect Identification in Speech},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/badrex/mms-300m-arabic-dialect-identifier}}
}

APA: Abdullah, B. M., & Baas, M. (2025). A Robust Transformer Model for Arabic Dialect Identification in Speech. Retrieved from https://huggingface.co/badrex/mms-300m-arabic-dialect-identifier/

Model Card Contact

If you have any questions, please do not hesitate to write an email to badr dot nlp at gmail dot com 😊
