File size: 5,158 Bytes
0aa82ad ff9419f ae7cac4 ff9419f 6d5e570 f21d7d5 6d5e570 98016fe 6d5e570 3ab43b3 98016fe 7a0d216 98016fe 65b3d25 98016fe |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 |
---
language:
- ar
pipeline_tag: text-classification
widget:
- text: "خليلي في مساج بريفي كيفاش الاتصال"
- text: "كيفك يا زلمة"
- text: "وين المحطة؟"
- text: "شنو رقم الرحلة؟"
- text: "ازيك يا جومانا وحشاني"
- text: "شحوالك يا جومانا توحشتك"
- text: "كيفك يا جومانا اشتقتلك"
- text: "كيفك جومانا اشتقتلك كتير"
- text: "كيفك حالك يا جمانه مشتاقلك"
---
# Model Card for NADI-2024-baseline
<!-- Provide a quick summary of what the model is/does. -->
A BERT-based model fine-tuned to perform single-label Arabic Dialect Identification (ADI). Instead of predicting the most probable dialect, the logits are used to generate multilabel predictions.
### Model Description
<!-- Provide a longer summary of what this model is. -->
<!-- - **Developed by:** Amr Keleg -->
- **Model type:** A Dialect Identification model fine-tuned on the training sets of: NADI2020,2021,2023 and MADAR 2018.
- **Language(s) (NLP):** Arabic.
<!--- **License:** [More Information Needed] -->
- **Finetuned from model :** [MarBERTv2](https://huggingface.co/UBC-NLP/MARBERTv2)
### Multilabel country-level Dialect Identification
#### Baseline I (Top 90%)
```
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
DIALECTS = ["Algeria",
"Bahrain",
"Egypt",
"Iraq",
"Jordan",
"Kuwait",
"Lebanon",
"Libya",
"Morocco",
"Oman",
"Palestine",
"Qatar",
"Saudi_Arabia",
"Sudan",
"Syria",
"Tunisia",
"UAE",
"Yemen",
]
assert len(DIALECTS) == 18
MODEL_NAME = "AMR-KELEG/NADI2024-baseline"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
def predict_top_p(text, P=0.9):
"""Predict the top dialects with an accumulative confidence of at least P."""
assert P <= 1 and P >= 0
logits = model(**tokenizer(text, return_tensors="pt")).logits
probabilities = torch.softmax(logits, dim=1).flatten().tolist()
topk_predictions = torch.topk(logits, 18).indices.flatten().tolist()
predictions = [0 for _ in range(18)]
total_prob = 0
for i in range(18):
total_prob += probabilities[topk_predictions[i]]
predictions[topk_predictions[i]] = 1
if total_prob >= P:
break
return [DIALECTS[i] for i, p in enumerate(predictions) if p == 1]
s1 = "كيفك يا زلمة"
s1_pred = predict_top_p(s1) # ['Jordan', 'Lebanon', 'Palestine', 'Syria']
print(s1, s1_pred)
s2 = "خليلي في مساج بريفي كيفاش الاتصال"
s2_pred = predict_top_p(s2) # ['Algeria', 'Tunisia']
print(s2, s2_pred)
```
### Citation
If you find the model useful, please cite the following [respective paper](https://aclanthology.org/2024.arabicnlp-1.79/):
```
@inproceedings{abdul-mageed-etal-2024-nadi,
title = "{NADI} 2024: The Fifth Nuanced {A}rabic Dialect Identification Shared Task",
author = "Abdul-Mageed, Muhammad and
Keleg, Amr and
Elmadany, AbdelRahim and
Zhang, Chiyu and
Hamed, Injy and
Magdy, Walid and
Bouamor, Houda and
Habash, Nizar",
editor = "Habash, Nizar and
Bouamor, Houda and
Eskander, Ramy and
Tomeh, Nadi and
Abu Farha, Ibrahim and
Abdelali, Ahmed and
Touileb, Samia and
Hamed, Injy and
Onaizan, Yaser and
Alhafni, Bashar and
Antoun, Wissam and
Khalifa, Salam and
Haddad, Hatem and
Zitouni, Imed and
AlKhamissi, Badr and
Almatham, Rawan and
Mrini, Khalil",
booktitle = "Proceedings of The Second Arabic Natural Language Processing Conference",
month = aug,
year = "2024",
address = "Bangkok, Thailand",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.arabicnlp-1.79",
pages = "709--728",
abstract = "We describe the findings of the fifth Nuanced Arabic Dialect Identification Shared Task (NADI 2024). NADI{'}s objective is to help advance SoTA Arabic NLP by providing guidance, datasets, modeling opportunities, and standardized evaluation conditions that allow researchers to collaboratively compete on prespecified tasks. NADI 2024 targeted both dialect identification cast as a multi-label task (Subtask 1), identification of the Arabic level of dialectness (Subtask 2), and dialect-to-MSA machine translation (Subtask 3). A total of 51 unique teams registered for the shared task, of whom 12 teams have participated (with 76 valid submissions during the test phase). Among these, three teams participated in Subtask 1, three in Subtask 2, and eight in Subtask 3. The winning teams achieved 50.57 F1 on Subtask 1, 0.1403 RMSE for Subtask 2, and 20.44 BLEU in Subtask 3, respectively. Results show that Arabic dialect processing tasks such as dialect identification and machine translation remain challenging. We describe the methods employed by the participating teams and briefly offer an outlook for NADI.",
}
``` |