Baligh: Dialect to MSA Translation Model Overview

Baligh is a project for translating Arabic dialects into Modern Standard Arabic (MSA) using state-of-the-art language models. Developed jointly by experts in Arabic linguistics and AI, the model aims to bridge the gap between the diverse dialects spoken across the Arab world and the standardized form of Arabic.

Model Description

This model, named Fasih, translates a range of Arabic dialects into Modern Standard Arabic (MSA). It is a fine-tuned version of AraT5v2, a text-to-text transformer model for Arabic, adapted to understand and translate more than 25 distinct Arabic dialects.
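
Because the model is a fine-tuned AraT5v2 checkpoint, it can presumably be loaded as a standard sequence-to-sequence model with the Hugging Face `transformers` library. The following is a minimal sketch under stated assumptions, not the authors' documented usage: the repository ID `MODEL_ID` is a hypothetical placeholder (this card does not state it), the generation settings are illustrative, and the dialect sentence is assumed to be passed as-is with no task prefix.

```python
from typing import List

# Hypothetical placeholder: this card does not state the Hub repository ID.
MODEL_ID = "your-org/baligh-dialect-to-msa"


def load_model(model_id: str = MODEL_ID):
    """Load the tokenizer and seq2seq model (transformers is imported lazily)."""
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
    return tokenizer, model


def translate_to_msa(
    sentences: List[str], tokenizer, model, max_new_tokens: int = 128
) -> List[str]:
    """Translate a batch of dialect sentences into MSA.

    Assumption: inputs need no task prefix; adjust if the checkpoint
    was trained with one.
    """
    # Tokenize the batch, padding to the longest sentence.
    inputs = tokenizer(
        sentences, return_tensors="pt", padding=True, truncation=True
    )
    # Beam search is an illustrative choice, not a documented setting.
    output_ids = model.generate(
        **inputs, max_new_tokens=max_new_tokens, num_beams=4
    )
    # Decode generated token IDs back into text.
    return tokenizer.batch_decode(output_ids, skip_special_tokens=True)
```

With a real repository ID in place of the placeholder, a call would look like `tokenizer, model = load_model()` followed by `translate_to_msa(["<dialect sentence>"], tokenizer, model)`.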

Fine-tuning Details

The fine-tuning process was designed to capture the nuances of each dialect. We used more than 100K samples from the MADAR Corpus to ensure a broad representation of dialects from across the Arab world, as shown in the table:

Region     | Sub-region   | Cities
Maghreb    | Morocco      | Rabat (RAB), Fes (FES)
           | Algeria      | Algiers (ALG)
           | Tunisia      | Tunis (TUN), Sfax (SFX)
           | Libya        | Tripoli (TRI), Benghazi (BEN)
Nile Basin | Egypt/Sudan  | Cairo (CAI), Alexandria (ALX), Aswan (ASW), Khartoum (KHA)
Levant     | South Levant | Jerusalem (JER), Amman (AMM), Salt (SAL)
           | North Levant | Beirut (BEI), Damascus (DAM), Aleppo (ALE)
Gulf       | Iraq         | Mosul (MOS), Baghdad (BAG), Basra (BAS)
           | Yemen        | Sana’a (SAN)
           | Gulf         | Doha (DOH), Muscat (MUS), Riyadh (RIY), Jeddah (JED)
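
To illustrate how parallel data of this kind could be turned into dialect-to-MSA training pairs, here is a minimal sketch. It is not the authors' preprocessing pipeline. It assumes each record is a `(sentence_id, code, text)` tuple and that the MSA version of a sentence carries the hypothetical code `"MSA"`; the real corpus files may be laid out differently. The city codes are the ones listed in the table above.

```python
from typing import Dict, Iterable, List, Tuple

# The 25 city codes listed in the table above.
CITY_CODES = {
    "RAB", "FES", "ALG", "TUN", "SFX", "TRI", "BEN",
    "CAI", "ALX", "ASW", "KHA",
    "JER", "AMM", "SAL", "BEI", "DAM", "ALE",
    "MOS", "BAG", "BAS", "SAN",
    "DOH", "MUS", "RIY", "JED",
}


def build_pairs(
    rows: Iterable[Tuple[str, str, str]], msa_code: str = "MSA"
) -> List[Dict[str, str]]:
    """Pair every dialect sentence with the MSA sentence sharing its ID.

    Assumed record layout (hypothetical): (sentence_id, code, text).
    Records with an unknown code or with no MSA counterpart are skipped.
    """
    rows = list(rows)
    # Index the MSA side by sentence ID.
    msa_by_id = {sid: text for sid, code, text in rows if code == msa_code}

    pairs = []
    for sid, code, text in rows:
        if code not in CITY_CODES or sid not in msa_by_id:
            continue
        pairs.append(
            {"dialect": code, "source": text, "target": msa_by_id[sid]}
        )
    return pairs
```

Each resulting dictionary holds one source sentence in a city dialect and its MSA target, which is the shape a sequence-to-sequence fine-tuning loop typically expects.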

Citations

The dataset used for training the Project Fasih model includes data from the following source:

Bouamor, H., Habash, N., Salameh, M., Zaghouani, W., Rambow, O., Abdulrahim, D., ... & Oflazer, K. (2018, May). The MADAR Arabic Dialect Corpus and Lexicon. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018).

This corpus has been instrumental in understanding and translating the nuances of over 25 Arabic dialects into Modern Standard Arabic (MSA), aiding significantly in the development and refinement of our model.

Acknowledgments

Special thanks to Prince Sultan University, particularly the Robotics and Internet of Things Lab.

Contact Information

For inquiries: [email protected].

Disclaimer for the Use of Baligh

We disclaim all responsibility for any inaccuracies or inappropriate content generated by the model. Users should apply the model's outputs at their own risk. Further improvements to enhance its performance are underway.

Model size: 368M params (Safetensors, tensor type F32)