---
language:
- ar
metrics:
- precision
- accuracy
- recall
- f1
base_model:
- aubmindlab/bert-base-arabertv02
pipeline_tag: token-classification
library_name: transformers
datasets:
- DarNERcorp
tags:
- ner
- named-entity-recognition
- arabic
- darija
---

# NER Model for Moroccan Dialect (Darija)

## Model Description

This is a **Named Entity Recognition (NER)** model fine-tuned on the **DarNERcorp** dataset. It recognizes **person names**, **locations**, **organizations**, and **miscellaneous entities** in Moroccan Arabic (Darija) text. The model is based on the **BERT architecture** (fine-tuned from `aubmindlab/bert-base-arabertv02`) and is suited to tasks such as information extraction from social media posts or news articles.

### Model Architecture

- **Architecture**: BERT-based model for token classification
- **Pre-trained Model**: aubmindlab/bert-base-arabertv02
- **Fine-tuning Dataset**: DarNERcorp
- **Languages**: Moroccan Arabic (Darija)

## Intended Use

This model is designed for Named Entity Recognition tasks in Moroccan Arabic. It can identify and classify entities such as:

- **PER**: Person names (e.g., "محمد", "فاطمة")
- **LOC**: Locations (e.g., "الرباط", "طنجة")
- **ORG**: Organizations (e.g., "البنك المغربي", "جامعة الحسن الثاني")
- **MISC**: Miscellaneous entities

### Use Cases

- **Social media analysis**: Extracting entities from Moroccan Arabic posts and tweets.
- **News summarization**: Identifying important entities in news articles.
- **Information extraction**: Extracting named entities from informal or formal texts.

## Evaluation Results

The model achieves the following results on the evaluation dataset:

- **Precision**: 74.04%
- **Recall**: 85.16%
- **F1 Score**: 78.61%
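
For readers who want to score their own predictions, here is a minimal sketch of entity-level precision, recall, and F1. It rests on assumptions: the card does not say how its figures were computed, so this is not necessarily the procedure behind them. The sketch assumes BIO-style tags (for example `B-PER`, `I-PER`, `O`), exact-match spans, and micro-averaging over all entity types. The function names `extract_spans` and `entity_scores` are illustrative, not part of any library.

```python
from typing import List, Set, Tuple

Span = Tuple[str, int, int]  # (entity type, start index, end index exclusive)


def extract_spans(tags: List[str]) -> Set[Span]:
    """Collect (type, start, end) spans from one BIO-tagged sentence."""
    spans: Set[Span] = set()
    start, current = None, None
    # A trailing "O" sentinel flushes an entity that ends the sentence.
    for i, tag in enumerate(tags + ["O"]):
        prefix, _, label = tag.partition("-")
        continues = prefix == "I" and label == current
        if current is not None and not continues:
            spans.add((current, start, i))
            start, current = None, None
        # "B-" always opens a span; a stray "I-" is treated leniently as an opener.
        if prefix == "B" or (prefix == "I" and current is None):
            start, current = i, label
    return spans


def entity_scores(gold: List[List[str]], pred: List[List[str]]) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall, and F1 over exact-match entity spans."""
    tp = n_gold = n_pred = 0
    for gold_tags, pred_tags in zip(gold, pred):
        g, p = extract_spans(gold_tags), extract_spans(pred_tags)
        tp += len(g & p)
        n_gold += len(g)
        n_pred += len(p)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Made-up tags for illustration: the predicted LOC span is one token too short.
gold = [["B-PER", "O", "O", "B-LOC", "I-LOC"]]
pred = [["B-PER", "O", "O", "B-LOC", "O"]]
print(entity_scores(gold, pred))
```

With micro-averaging, F1 is the harmonic mean of the reported precision and recall. Other schemes, such as averaging per-class scores, can give a different value.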

## How to Use

Load the model with the Hugging Face Transformers `pipeline` API:

```python
from transformers import pipeline

# Load the fine-tuned model into an NER (token-classification) pipeline
nlp = pipeline("ner", model="mohannad-tazi/ner-darija-darner")

# Run the model on a Darija sentence ("Mohammed was in Rabat.")
text = "محمد كان في الرباط."
result = nlp(text)

# result is a list of dicts, one per token the model tags as part of an entity
print(result)
```
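
The pipeline returns one entry per word-piece by default. Passing `aggregation_strategy="simple"` to `pipeline(...)` merges word-pieces into whole entities and returns dicts with the keys `entity_group`, `score`, `word`, `start`, and `end`. The sketch below shows one way to post-process that output. The helper `group_by_type`, the `min_score` threshold, and the sample scores are illustrative assumptions, not real output from this model.

```python
from collections import defaultdict
from typing import Dict, List


def group_by_type(entities: List[dict], min_score: float = 0.5) -> Dict[str, List[str]]:
    """Map each entity type to the entity strings found, dropping low-confidence hits."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for ent in entities:
        if ent["score"] >= min_score:
            grouped[ent["entity_group"]].append(ent["word"])
    return dict(grouped)


# Hand-written stand-in for the output of
#   pipeline("ner", model=..., aggregation_strategy="simple")(text)
# The scores are invented for illustration, not real model output.
sample = [
    {"entity_group": "PER", "score": 0.98, "word": "محمد", "start": 0, "end": 4},
    {"entity_group": "LOC", "score": 0.95, "word": "الرباط", "start": 12, "end": 18},
    {"entity_group": "ORG", "score": 0.31, "word": "كان", "start": 5, "end": 8},
]

print(group_by_type(sample))
```

Filtering on `score` trades recall for precision, so the threshold is worth tuning on held-out data rather than fixing at 0.5.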

## Dataset

The model is trained on the DarNERcorp dataset, a corpus designed specifically for Named Entity Recognition in the Moroccan Arabic dialect. The dataset includes sentences labeled with named entity tags such as PER, LOC, ORG, and MISC.

## Preprocessing Steps

- Tokenization with the BERT tokenizer.
- Alignment of labels with the tokenized inputs, accounting for word-piece tokens.
- Padding and truncation of sentences to a fixed length for uniformity.
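
The label-alignment step can be sketched as follows. This is an assumption about how alignment might be done, not the author's confirmed code: it gives the label to the first word-piece of each word and masks the remaining word-pieces, special tokens, and padding with `-100`, the default `ignore_index` of PyTorch's cross-entropy loss. The function `align_labels` is illustrative. Its `word_ids` input has the shape returned by `word_ids()` on the output of a fast Transformers tokenizer called with `is_split_into_words=True`, `padding="max_length"`, and `truncation=True`.

```python
from typing import List, Optional

IGNORE_INDEX = -100  # skipped by PyTorch's cross-entropy loss by default


def align_labels(word_ids: List[Optional[int]], word_labels: List[int]) -> List[int]:
    """Expand word-level label ids to token level.

    word_ids[i] is the index of the word that token i came from, or None
    for special and padding tokens.
    """
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(IGNORE_INDEX)          # [CLS], [SEP], [PAD]
        elif word_id != previous:
            aligned.append(word_labels[word_id])  # first word-piece keeps the label
        else:
            aligned.append(IGNORE_INDEX)          # later word-pieces are masked
        previous = word_id
    return aligned


# Hypothetical example: three words, the second split into two word-pieces,
# padded to a fixed length of 8 tokens. Label ids are made up.
word_ids = [None, 0, 1, 1, 2, None, None, None]
word_labels = [1, 0, 3]
print(align_labels(word_ids, word_labels))  # [-100, 1, 0, -100, 3, -100, -100, -100]
```

An alternative is to copy the word's label to every word-piece. The card does not say which variant was used.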

## Limitations

The model is trained on a single corpus (DarNERcorp) and may not generalize well to all Moroccan Arabic text. Performance may vary with the quality of the input text and with the consistency of the tagging in the dataset.