---
language:
- ar
metrics:
- precision
- accuracy
- recall
- f1
base_model:
- aubmindlab/bert-base-arabertv02
pipeline_tag: token-classification
library_name: transformers
datasets:
- DarNERcorp
tags:
- ner
- named-entity-recognition
- arabic
- darija
---
# NER Model for Moroccan Dialect (Darija)
## Model Description
This is a **Named Entity Recognition (NER)** model for Moroccan Arabic (Darija) text, fine-tuned on the **DarNERcorp** dataset. It recognizes **person names**, **locations**, **organizations**, and **miscellaneous entities**. The model builds on the **BERT architecture** (`aubmindlab/bert-base-arabertv02`) and is suited to tasks such as information extraction from social media posts or news articles.
### Model Architecture
- **Architecture**: BERT-based model for token classification
- **Pre-trained Model**: aubmindlab/bert-base-arabertv02
- **Fine-tuning Dataset**: DarNERcorp
- **Languages**: Moroccan Arabic (Darija)
## Intended Use
This model is designed for Named Entity Recognition tasks in Moroccan Arabic. It can identify and classify entities such as:
- **PER**: Person names (e.g., "محمد", "فاطمة")
- **LOC**: Locations (e.g., "الرباط", "طنجة")
- **ORG**: Organizations (e.g., "البنك المغربي", "جامعة الحسن الثاني")
- **MISC**: Miscellaneous entities
### Use Cases
- **Social media analysis**: Extracting entities from Moroccan Arabic posts and tweets.
- **News summarization**: Identifying important entities in news articles.
- **Information extraction**: Extracting named entities from informal or formal texts.
## Evaluation Results
The model achieves the following results on the evaluation dataset:
- **Precision**: 74.04%
- **Recall**: 85.16%
- **F1 Score**: 78.61%
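
For readers who want to relate these three numbers, the sketch below shows how precision, recall, and F1 follow from counts of correct and incorrect predictions. It is only an illustration: the card does not state whether scores are computed per entity span or per token, nor how they are averaged across entity types, so the function assumes counts pooled over all types (micro averaging), and the counts themselves are made up.

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Compute precision, recall, and F1 from pooled prediction counts.

    tp: predicted entities that match a gold entity
    fp: predicted entities with no matching gold entity
    fn: gold entities the model missed
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts, not taken from the DarNERcorp evaluation.
p, r, f1 = precision_recall_f1(tp=3, fp=1, fn=2)
print(f"precision={p:.2%} recall={r:.2%} f1={f1:.2%}")
```

When scores are instead averaged per entity type (macro averaging), the averaged F1 need not equal the harmonic mean of the averaged precision and recall.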
## How to Use
Load the model with the Hugging Face Transformers `pipeline` API:
```python
from transformers import pipeline

# Load the fine-tuned model as a token-classification (NER) pipeline
nlp = pipeline("ner", model="mohannad-tazi/ner-darija-darner")

# Run the model on a Darija sentence ("Mohammed was in Rabat.")
text = "محمد كان في الرباط."
result = nlp(text)

# Each item is a dict with the predicted tag, confidence score, and token span
print(result)
```
## Dataset
The model is trained on the DarNERcorp dataset, a corpus designed specifically for Named Entity Recognition in the Moroccan Arabic dialect. The dataset includes sentences labeled with named entity tags such as PER, LOC, ORG, and MISC.
## Preprocessing Steps
- Tokenization using the BERT tokenizer.
- Alignment of labels with tokenized inputs (considering word-piece tokens).
- Padding and truncating sentences to a fixed length for uniformity.
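
The label-alignment step above can be sketched as follows. This is a minimal illustration, not the training code used for this model: it assumes word-level integer labels and a `word_ids` list of the kind returned by `BatchEncoding.word_ids()` when a fast tokenizer is called with `is_split_into_words=True` (padding and truncation would be handled in that same tokenizer call). The function name `align_labels`, the `label_all_subwords` flag, and the example values are hypothetical.

```python
from typing import List, Optional

# Label value skipped by PyTorch's cross-entropy loss by default.
IGNORE_INDEX = -100


def align_labels(
    word_ids: List[Optional[int]],
    word_labels: List[int],
    label_all_subwords: bool = False,
) -> List[int]:
    """Map word-level label ids onto word-piece tokens.

    word_ids holds, for each token, the index of the word it came from,
    or None for special tokens such as [CLS], [SEP], and [PAD].
    """
    aligned: List[int] = []
    previous: Optional[int] = None
    for word_id in word_ids:
        if word_id is None:
            # Special and padding tokens carry no label.
            aligned.append(IGNORE_INDEX)
        elif word_id != previous:
            # First word piece of a word takes the word's label.
            aligned.append(word_labels[word_id])
        else:
            # Later word pieces are either ignored or repeat the label.
            aligned.append(word_labels[word_id] if label_all_subwords else IGNORE_INDEX)
        previous = word_id
    return aligned


# Hypothetical example: three words, the second split into two word pieces.
example_word_ids = [None, 0, 1, 1, 2, None]
example_labels = [1, 0, 2]  # made-up label ids, one per word
print(align_labels(example_word_ids, example_labels))
```

Ignoring the later word pieces means each word contributes exactly one label to the loss, which is a common choice; whether this model was trained that way is not stated here.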
## Limitations
- The model is trained on a specific corpus and may not generalize well to all Moroccan Arabic texts.
- Performance may vary depending on text quality and tagging consistency in the dataset.