NER Model for Moroccan Dialect (Darija)

Model Description

This is a Named Entity Recognition (NER) model fine-tuned on the DarNERcorp dataset. It recognizes person names, locations, organizations, and miscellaneous entities in Moroccan Arabic (Darija) text. It is based on the BERT architecture and is intended for tasks such as information extraction from social media posts or news articles.

Model Architecture

- Architecture: BERT-based model for token classification
- Pre-trained Model: aubmindlab/bert-base-arabertv02
- Fine-tuning Dataset: DarNERcorp
- Languages: Moroccan Arabic (Darija)

Intended Use

This model is designed for Named Entity Recognition in Moroccan Arabic. It identifies and classifies the following entity types:

- PER: Person names (e.g., "محمد", "فاطمة")
- LOC: Locations (e.g., "الرباط", "طنجة")
- ORG: Organizations (e.g., "البنك المغربي", "جامعة الحسن الثاني")
- MISC: Miscellaneous entities
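
Token-classification models usually expose entity types like these through a per-token tagging scheme. The sketch below builds a BIO label map for the four types. The BIO scheme, the label order, and the names ENTITY_TYPES, build_bio_labels, label2id, and id2label are assumptions for illustration; this card does not state how the model's labels are encoded.

```python
# Entity types listed in this model card.
ENTITY_TYPES = ["PER", "LOC", "ORG", "MISC"]


def build_bio_labels(entity_types):
    """Return BIO tags: "O" plus a B-/I- pair for every entity type."""
    labels = ["O"]  # "O" marks tokens outside any entity
    for entity_type in entity_types:
        labels.append(f"B-{entity_type}")  # first token of an entity
        labels.append(f"I-{entity_type}")  # continuation of the same entity
    return labels


# Assumed label order; the real checkpoint may use a different one.
LABELS = build_bio_labels(ENTITY_TYPES)
label2id = {label: index for index, label in enumerate(LABELS)}
id2label = {index: label for label, index in label2id.items()}

print(LABELS)
```

The mapping actually shipped with a checkpoint can be read from the id2label field of its configuration, which is the safer source when decoding predictions.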

Use Cases

- Social media analysis: Extracting entities from Moroccan Arabic posts and tweets.
- News summarization: Identifying important entities in news articles.
- Information extraction: Extracting named entities from informal or formal texts.

Evaluation Results

The model achieves the following results on the evaluation dataset:

- Precision: 74.04%
- Recall: 85.16%
- F1 Score: 78.61%
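
As a sanity check on how these three numbers relate, the sketch below computes F1 as the harmonic mean of the reported precision and recall. It gives roughly 79.2%, slightly above the reported 78.61%. One possible explanation is that the reported F1 is averaged per entity type or per batch rather than derived from the overall precision and recall, but the card does not say which averaging was used, so treat that as an assumption. The function name f1_score is illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values reported in this model card, expressed as fractions.
precision, recall = 0.7404, 0.8516
f1 = f1_score(precision, recall)

print(f"F1 from reported precision and recall: {f1:.2%}")
```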

How to Use

Load the model through the pipeline API of the Hugging Face Transformers library, as in this example:

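from transformers import pipeline

# Load the fine-tuned checkpoint from the Hugging Face Hub as a token-classification pipeline
nlp = pipeline("ner", model="mohannad-tazi/ner-darija-darner")

# Tag a Darija sentence ("Mohamed was in Rabat."); the result is a list of dicts,
# one per token that received an entity label
text = "محمد كان في الرباط."
result = nlp(text)
print(result)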
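
When the pipeline is created with aggregation_strategy="simple", Transformers merges word pieces into whole entities and returns dictionaries with entity_group, score, word, start, and end keys. The following is a minimal sketch of post-processing output of that shape. The helper name entities_by_type, the min_score threshold, and the sample predictions (hand-written to show the structure, not real output of this model) are all illustrative assumptions.

```python
from collections import defaultdict


def entities_by_type(text, predictions, min_score=0.5):
    """Group aggregated NER predictions by entity type, dropping low-score ones."""
    grouped = defaultdict(list)
    for prediction in predictions:
        if prediction["score"] < min_score:
            continue  # skip predictions below the confidence threshold
        # Slice the original text by character offsets to recover the surface form.
        surface = text[prediction["start"]:prediction["end"]]
        grouped[prediction["entity_group"]].append(surface)
    return dict(grouped)


text = "محمد كان في الرباط."

# Hand-written placeholder predictions (NOT real model output); in practice they
# would come from pipeline("ner", model=..., aggregation_strategy="simple")(text).
sample_predictions = [
    {"entity_group": "PER", "score": 0.98, "word": "محمد", "start": 0, "end": 4},
    {"entity_group": "MISC", "score": 0.30, "word": "كان", "start": 5, "end": 8},
    {"entity_group": "LOC", "score": 0.95, "word": "الرباط", "start": 12, "end": 18},
]

grouped_entities = entities_by_type(text, sample_predictions)
print(grouped_entities)
```

Slicing the input by start and end offsets, rather than relying on the word field, keeps the original spelling even if the tokenizer normalizes the text.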

# Dataset
The model was trained on DarNERcorp, a corpus built specifically for Named Entity Recognition in the Moroccan Arabic dialect. Its sentences are labeled with the named entity tags PER, LOC, ORG, and MISC.

# Preprocessing Steps
- Tokenization using the BERT tokenizer.
- Alignment of labels with tokenized inputs (considering word-piece tokens).
- Padding and truncating sentences to a fixed length for uniformity.
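
The alignment and padding steps above can be sketched in plain Python. The sketch follows one common convention: only the first word piece of each word keeps the word's label, and every other position is set to -100, the index that PyTorch's cross-entropy loss ignores by default. Whether this model was trained that way, the function names, and the fixed length of 8 are assumptions not stated in this card. The word_ids list has the shape returned by the word_ids() method of a fast tokenizer's encoding; here it is written by hand.

```python
IGNORE_INDEX = -100  # ignored by PyTorch's CrossEntropyLoss by default


def align_labels(word_ids, word_labels, ignore_index=IGNORE_INDEX):
    """Expand word-level label ids to word-piece level.

    word_ids maps each token to its source word index (None for special tokens).
    """
    aligned = []
    previous_word_id = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(ignore_index)          # special token such as [CLS] or [SEP]
        elif word_id != previous_word_id:
            aligned.append(word_labels[word_id])  # first piece keeps the word's label
        else:
            aligned.append(ignore_index)          # later pieces are masked out
        previous_word_id = word_id
    return aligned


def pad_or_truncate(values, max_length, pad_value):
    """Cut or right-pad a list so that it has exactly max_length items."""
    return values[:max_length] + [pad_value] * max(0, max_length - len(values))


# Hand-written example: three words, the second split into two word pieces.
word_ids = [None, 0, 1, 1, 2, None]
word_labels = [1, 0, 3]  # hypothetical label ids, one per word

aligned = align_labels(word_ids, word_labels)
fixed = pad_or_truncate(aligned, 8, IGNORE_INDEX)
print(aligned)
print(fixed)
```

In a real pipeline the tokenizer would normally truncate and pad the inputs itself, and the label sequence only needs to be brought to the same length.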

# Limitations
- The model is trained on a specific corpus and may not generalize well to all Moroccan Arabic texts.
- Performance may vary depending on text quality and tagging consistency in the dataset.
