Update readme.md
README.md
CHANGED
@@ -24,4 +24,58 @@ Baligh is dedicated to advancing the translation of Arabic dialects to Modern St
|
|
24 |
|
25 |
This model, named Fasih, represents a significant advancement in the field of Natural Language Processing (NLP) for the Arabic language, specifically in translating various Arabic dialects to Modern Standard Arabic (MSA). It is based on the fine-tuning of AraT5v2, a state-of-the-art transformer model, enhanced to understand and translate more than 25 distinct Arabic dialects.
|
26 |
|
27 |
-
<img src="https://i.ibb.co/kybsNkX/Unknown.png" width="
24 |
|
25 |
This model, named Fasih, represents a significant advancement in the field of Natural Language Processing (NLP) for the Arabic language, specifically in translating various Arabic dialects to Modern Standard Arabic (MSA). It is based on the fine-tuning of AraT5v2, a state-of-the-art transformer model, enhanced to understand and translate more than 25 distinct Arabic dialects.
|
26 |
|
27 |
+
<img src="https://i.ibb.co/kybsNkX/Unknown.png" width="500"/>
|
28 |
+
|
29 |
+
# Fine-tuning Details
|
30 |
+
|
31 |
+
The fine-tuning process was designed to capture the nuances of each dialect. Using the MADAR Corpus, which comprises over 100K samples, we ensured broad representation of dialects from across the Arab world, as shown in the table:
|
32 |
+
|
33 |
+
| Region | Sub-region | Cities |
|
34 |
+
|-----------|--------------|-----------------------------------------|
|
35 |
+
| Maghreb | Morocco | Rabat (RAB), Fes (FES) |
|
36 |
+
| | Algeria | Algiers (ALG) |
|
37 |
+
| | Tunisia | Tunis (TUN), Sfax (SFX) |
|
38 |
+
| | Libya | Tripoli (TRI), Benghazi (BEN) |
|
39 |
+
| Nile Basin| Egypt/Sudan | Cairo (CAI), Alexandria (ALX), Aswan (ASW), Khartoum (KHA) |
|
40 |
+
| Levant | South Levant | Jerusalem (JER), Amman (AMM), Salt (SAL)|
|
41 |
+
| | North Levant | Beirut (BEI), Damascus (DAM), Aleppo (ALE) |
|
42 |
+
| Gulf | Iraq | Mosul (MOS), Baghdad (BAG), Basra (BAS) |
|
43 |
+
| | Yemen | Sana’a (SAN) |
|
44 |
+
| | Gulf | Doha (DOH), Muscat (MUS), Riyadh (RIY), Jeddah (JED) |
|
45 |
+
|
46 |
+
## Citations
|
47 |
+
|
48 |
+
The dataset used for training the Project Fasih model includes data from the following source:
|
49 |
+
|
50 |
+
Bouamor, H., Habash, N., Salameh, M., Zaghouani, W., Rambow, O., Abdulrahim, D., ... & Oflazer, K. (2018, May). The MADAR Arabic Dialect Corpus and Lexicon. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018).
|
51 |
+
|
52 |
+
This corpus has been instrumental in understanding and translating the nuances of over 25 Arabic dialects into Modern Standard Arabic (MSA), aiding significantly in the development and refinement of our model.
|
53 |
+
|
54 |
+
## Performance and Evaluation
|
55 |
+
|
56 |
+
The **Baligh** model has undergone extensive testing to ensure its accuracy and reliability in translating Arabic dialects to Modern Standard Arabic (MSA). A key part of our evaluation involved testing the model on an unseen dataset to measure its translation quality and its ability to generalize.
|
57 |
+
|
58 |
+
### BLEU Score
|
59 |
+
|
60 |
+
One of the primary metrics used for this evaluation was the BLEU (Bilingual Evaluation Understudy) score, a standard metric that measures the n-gram overlap between a machine's output and human reference translations. A BLEU score closer to 100% indicates output closer to the human reference.
|
61 |
+
|
62 |
+
For the **Baligh** model, we are proud to report a BLEU score of 85% on this unseen dataset. This high score demonstrates the model's ability to understand and accurately translate the nuances of over 25 Arabic dialects into MSA. It reflects not only the robustness of our training dataset and methodology but also the model's potential for practical applications requiring high-quality translation.
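
To make the metric concrete, here is a minimal sketch of corpus-level BLEU in plain Python: clipped n-gram precision up to 4-grams combined with a brevity penalty. It is not the evaluation script behind the score above, which this README does not describe. Whitespace tokenization, a single reference per sentence, and the absence of smoothing are all assumptions of this sketch, and `corpus_bleu` is a name chosen here for illustration.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU on a 0-100 scale, one reference per hypothesis.

    Assumption: tokens are obtained by whitespace splitting. Published
    scores normally rely on a standardized tokenizer instead.
    """
    matches = [0] * max_n  # clipped n-gram matches, per n-gram order
    totals = [0] * max_n   # hypothesis n-gram counts, per n-gram order
    hyp_len = ref_len = 0

    for hyp, ref in zip(hypotheses, references):
        hyp_tok, ref_tok = hyp.split(), ref.split()
        hyp_len += len(hyp_tok)
        ref_len += len(ref_tok)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp_tok, n)
            ref_counts = ngrams(ref_tok, n)
            # Counter intersection clips each count by the reference count.
            matches[n - 1] += sum((hyp_counts & ref_counts).values())
            totals[n - 1] += sum(hyp_counts.values())

    # No smoothing: any n-gram order with zero matches yields a score of 0.
    if hyp_len == 0 or min(matches) == 0:
        return 0.0

    # Geometric mean of the n-gram precisions, computed in log space.
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # The brevity penalty lowers the score of outputs shorter than the references.
    brevity = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return 100.0 * brevity * math.exp(log_precision)
```

Calling `corpus_bleu(model_outputs, msa_references)` on two equal-length lists of strings returns a single score for the whole test set. Scores from a sketch like this are not directly comparable with scores from standard tools, since tokenization and smoothing choices change the result.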
|
63 |
+
|
64 |
+
### Implications
|
65 |
+
|
66 |
+
An 85% BLEU score is indicative of a highly effective translation model, suggesting that **Baligh** is capable of producing translations with a high degree of accuracy and fluency. This level of performance positions the model as a valuable tool for researchers, linguists, and practitioners working with Arabic dialects and MSA.
|
67 |
+
|
68 |
+
We continue to seek ways to enhance the model's accuracy further and expand its applicability across more dialects and translation tasks.
|
69 |
+
|
70 |
+
## Acknowledgments
|
71 |
+
|
72 |
+
Special thanks to Prince Sultan University, particularly the Robotics and Internet of Things Lab.
|
73 |
+
|
74 |
+
## Contact Information
|
75 |
+
|
76 |
+
For inquiries: [riotu@psu.edu.sa](mailto:riotu@psu.edu.sa).
|
77 |
+
|
78 |
+
## Disclaimer for the Use of Baligh
|
79 |
+
|
80 |
+
<p style="color: red;">We disclaim all responsibility for any inaccuracies or inappropriate content generated by the model. Users should apply the model's outputs at their own risk. Further improvements to enhance its performance are underway.</p>
|
81 |
+
|