Test Set Details
The test set used for evaluation is composed of 1,000 sentences geolocated to the 14 most-populated Arab countries (excluding Somalia, from which data was scarce). Each sample is annotated by native speakers recruited from 11 different Arab countries, namely: Algeria, Egypt, Iraq, Jordan, Morocco, Palestine, Saudi Arabia, Sudan, Syria, Tunisia, and Yemen.
Evaluation Metrics
We compute the precision, recall, and F1 scores for each of the 11 countries (treating each label as a binary classification problem).
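As a rough illustration of the per-country scoring described above, here is a minimal sketch in plain Python. It is not the leaderboard's actual evaluation code: the function name `per_country_scores`, the representation of each sample as a set of country names, and the convention of returning 0.0 when a denominator is zero are all assumptions made for this example.

```python
from typing import Dict, List, Set

# The 11 country labels listed in the test set description.
COUNTRIES = [
    "Algeria", "Egypt", "Iraq", "Jordan", "Morocco", "Palestine",
    "Saudi Arabia", "Sudan", "Syria", "Tunisia", "Yemen",
]


def per_country_scores(
    gold: List[Set[str]],
    pred: List[Set[str]],
    labels: List[str] = COUNTRIES,
) -> Dict[str, Dict[str, float]]:
    """Score each country as an independent binary classification problem.

    gold[i] and pred[i] are the sets of countries assigned to sentence i.
    Hypothetical helper; zero denominators yield 0.0 (an assumed convention).
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same number of samples")

    scores: Dict[str, Dict[str, float]] = {}
    for label in labels:
        # Count true positives, false positives, and false negatives for this label.
        tp = sum(1 for g, p in zip(gold, pred) if label in g and label in p)
        fp = sum(1 for g, p in zip(gold, pred) if label not in g and label in p)
        fn = sum(1 for g, p in zip(gold, pred) if label in g and label not in p)

        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        f1 = (
            2 * precision * recall / (precision + recall)
            if (precision + recall)
            else 0.0
        )
        scores[label] = {"precision": precision, "recall": recall, "f1": f1}
    return scores
```

For example, calling `per_country_scores(gold, pred)["Egypt"]` returns the precision, recall, and F1 computed only from whether "Egypt" appears in each gold and predicted label set, independently of the other ten countries.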
Data Access
If you need access to the single-label training sets and the multi-label development set, please fill out the following form: https://forms.gle/t3QTC6ZqyDJBzAau8
Further Notes
- The beta version of the leaderboard is running on limited resources, and is not able to evaluate models with a relatively large number of parameters.
- Please refer to the paper for more information about how the data was curated and annotated.
- We are planning to extend the annotations to include more country-level dialects. If you are interested in helping, please contact us; we are happy to discuss it further.