NER Model for Moroccan Dialect (Darija)
Model Description
This model is a Named Entity Recognition (NER) model fine-tuned on the DarNERcorp dataset. It is designed to recognize entities such as person names, locations, organizations, and miscellaneous entities in Moroccan Arabic (Darija) text. The model is based on the BERT architecture and is useful for tasks such as information extraction from social media or news articles.
Model Architecture
- Architecture: BERT-based model for token classification
- Pre-trained Model: aubmindlab/bert-base-arabertv02
- Fine-tuning Dataset: DarNERcorp
- Languages: Moroccan Arabic (Darija)
Intended Use
This model is designed for Named Entity Recognition tasks in Moroccan Arabic. It can identify and classify entities such as:
- PER: Person names (e.g., "محمد", "فاطمة")
- LOC: Locations (e.g., "الرباط", "طنجة")
- ORG: Organizations (e.g., "البنك المغربي", "جامعة الحسن الثاني")
- MISC: Miscellaneous entities
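Token-classification models usually predict these types through a tagging scheme rather than as bare labels. The sketch below builds an id/label mapping under the assumption that the model uses BIO tags (`B-PER`, `I-PER`, and so on); the card does not state the scheme, so check `model.config.id2label` on the loaded model for the actual labels. `build_label_maps` is an illustrative helper name.

```python
# Assumption: BIO tagging over the four entity types listed above.
ENTITY_TYPES = ["PER", "LOC", "ORG", "MISC"]


def build_label_maps(entity_types):
    """Return (label2id, id2label) for a BIO scheme with an 'O' tag."""
    labels = ["O"]
    for entity_type in entity_types:
        labels.append(f"B-{entity_type}")  # first token of an entity
        labels.append(f"I-{entity_type}")  # following tokens of the same entity
    label2id = {label: index for index, label in enumerate(labels)}
    id2label = {index: label for label, index in label2id.items()}
    return label2id, id2label


label2id, id2label = build_label_maps(ENTITY_TYPES)
print(label2id)
```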
Use Cases
- Social media analysis: Extracting entities from Moroccan Arabic posts and tweets.
- News summarization: Identifying important entities in news articles.
- Information extraction: Extracting named entities from informal or formal texts.
Evaluation Results
The model achieves the following results on the evaluation dataset:
- Precision: 74.04%
- Recall: 85.16%
- F1 Score: 78.61%
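For reference, entity-level precision, recall, and F1 are commonly computed by comparing predicted entity spans against gold spans. The sketch below assumes BIO-tagged sequences, exact span matching, and micro-averaging; the card does not state how its scores were computed, and the helper names are illustrative. Note that the harmonic mean of the reported precision and recall is about 79.2%, so the reported F1 of 78.61% may have been obtained with a different averaging scheme (for example, averaged per entity type).

```python
def extract_spans(tags):
    """Collect (entity_type, start, end) spans from one BIO tag sequence."""
    spans, start, current = set(), None, None
    for index, tag in enumerate(list(tags) + ["O"]):  # sentinel closes the last span
        inside = tag.startswith("I-") and tag[2:] == current
        if current is not None and not inside:
            spans.add((current, start, index))
            current = None
        if tag.startswith("B-") or (tag.startswith("I-") and current is None):
            start, current = index, tag[2:]
    return spans


def harmonic_f1(precision, recall):
    """F1 as the harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def precision_recall_f1(gold_sequences, predicted_sequences):
    """Micro-averaged scores over exact span matches."""
    true_positives = predicted_total = gold_total = 0
    for gold, predicted in zip(gold_sequences, predicted_sequences):
        gold_spans, predicted_spans = extract_spans(gold), extract_spans(predicted)
        true_positives += len(gold_spans & predicted_spans)
        predicted_total += len(predicted_spans)
        gold_total += len(gold_spans)
    precision = true_positives / predicted_total if predicted_total else 0.0
    recall = true_positives / gold_total if gold_total else 0.0
    return precision, recall, harmonic_f1(precision, recall)


# Toy data: the prediction adds one spurious ORG entity.
gold = [["B-PER", "O", "O", "B-LOC"]]
predicted = [["B-PER", "O", "B-ORG", "B-LOC"]]
print(precision_recall_f1(gold, predicted))
print(harmonic_f1(0.7404, 0.8516))
```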
How to Use
To use the model, you need to load it with the Hugging Face Transformers library. Here's an example:
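from transformers import pipeline
# Load the model. The hosting page lists the repository as
# mohannad-tazi/NER_Darija_MAR_FSBM; use that ID if this one does not resolve.
nlp = pipeline("ner", model="mohannad-tazi/ner-darija-darner")
# Run the model on a Darija sentence ("Mohammed was in Rabat.")
text = "محمد كان في الرباط."
result = nlp(text)
# Each item holds the predicted tag, its confidence score, the token, and character offsets
print(result)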
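By default the pipeline returns one prediction per token (word piece). A sketch of grouping pieces into whole entities and dropping low-confidence ones follows. `aggregation_strategy="simple"` is a standard option of the Transformers token-classification pipeline; the `0.5` threshold, the helper names, and the sample predictions are illustrative assumptions rather than values from the model card. The model ID is the one used in the example above.

```python
def filter_entities(entities, min_score=0.5):
    """Keep aggregated entities whose confidence reaches min_score (assumed threshold)."""
    return [
        (entity["word"], entity["entity_group"], round(float(entity["score"]), 3))
        for entity in entities
        if entity["score"] >= min_score
    ]


def extract_entities(text, model_id="mohannad-tazi/ner-darija-darner", min_score=0.5):
    """Run the NER pipeline with word-piece aggregation, then filter by score."""
    # Imported lazily so filter_entities can be used without transformers installed.
    from transformers import pipeline

    ner = pipeline("ner", model=model_id, aggregation_strategy="simple")
    return filter_entities(ner(text), min_score)


# Hypothetical aggregated output, shaped like what the pipeline returns.
sample = [
    {"entity_group": "PER", "score": 0.98, "word": "محمد", "start": 0, "end": 4},
    {"entity_group": "LOC", "score": 0.31, "word": "الرباط", "start": 12, "end": 18},
]
print(filter_entities(sample))
```

Calling `extract_entities(text)` downloads the model on first use, so it is left out of the runnable part of the sketch.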
Dataset
The model is trained on the DarNERcorp dataset, a corpus designed specifically for Named Entity Recognition in the Moroccan Arabic dialect. The dataset includes sentences labeled with named entity tags such as PER, LOC, ORG, and MISC.
Preprocessing Steps
- Tokenization using the BERT tokenizer.
- Alignment of labels with tokenized inputs (considering word-piece tokens).
- Padding and truncating sentences to a fixed length for uniformity.
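The label-alignment and padding steps can be sketched as follows. This assumes the common Hugging Face convention of labelling only the first word piece of each word and masking special tokens and continuation pieces with `-100` (the index that PyTorch's cross-entropy loss ignores by default); the card does not say which convention was used during fine-tuning. `word_ids` stands for the list returned by a fast tokenizer's `BatchEncoding.word_ids()`, and the function name, sample values, and `max_length` are illustrative.

```python
IGNORE_INDEX = -100  # assumed mask value; ignored by PyTorch's CrossEntropyLoss by default


def align_labels(word_ids, word_labels, max_length):
    """Map word-level label ids onto word pieces, then truncate or pad to max_length."""
    aligned, previous = [], None
    for word_id in word_ids:
        if word_id is None:            # special tokens such as [CLS] and [SEP]
            aligned.append(IGNORE_INDEX)
        elif word_id != previous:      # first piece of a word keeps the word's label
            aligned.append(word_labels[word_id])
        else:                          # continuation pieces are masked out
            aligned.append(IGNORE_INDEX)
        previous = word_id
    aligned = aligned[:max_length]
    return aligned + [IGNORE_INDEX] * (max_length - len(aligned))


# Three words; the first is split into two pieces. None marks [CLS] and [SEP].
word_ids = [None, 0, 0, 1, 2, None]
word_labels = [1, 0, 3]  # e.g. B-PER, O, B-LOC under an assumed label map
print(align_labels(word_ids, word_labels, max_length=8))
```

In practice the tokenizer itself truncates and pads the input ids; the labels only need to be brought to the same length.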
Limitations
- The model was trained on a single corpus (DarNERcorp) and may not generalize well to all Moroccan Arabic text.
- Performance may vary with the quality of the input text and with the consistency of the annotations in the dataset.