---
language:
- ar
metrics:
- precision
- accuracy
- recall
- f1
base_model:
- aubmindlab/bert-base-arabertv02
pipeline_tag: token-classification
library_name: transformers
datasets:
- DarNERcorp
tags:
- ner
- named-entity-recognition
- arabic
- darija
---

# NER Model for Moroccan Dialect (Darija)

## Model Description

This model is a **Named Entity Recognition (NER)** model fine-tuned on the **DarNERcorp** dataset. It is designed to recognize entities such as **person names**, **locations**, **organizations**, and **miscellaneous entities** in Moroccan Arabic (Darija) text. The model is based on the **BERT architecture** and is useful for tasks such as information extraction from social media or news articles.

### Model Architecture

- **Architecture**: BERT-based model for token classification
- **Pre-trained Model**: aubmindlab/bert-base-arabertv02
- **Fine-tuning Dataset**: DarNERcorp
- **Languages**: Moroccan Arabic (Darija)

## Intended Use

This model is designed for Named Entity Recognition tasks in Moroccan Arabic. It can identify and classify entities such as:

- **PER**: Person names (e.g., "محمد", "فاطمة")
- **LOC**: Locations (e.g., "الرباط", "طنجة")
- **ORG**: Organizations (e.g., "البنك المغربي", "جامعة الحسن الثاني")
- **MISC**: Miscellaneous entities

### Use Cases

- **Social media analysis**: Extracting entities from Moroccan Arabic posts and tweets.
- **News summarization**: Identifying important entities in news articles.
- **Information extraction**: Extracting named entities from informal or formal texts.

## Evaluation Results

The model achieves the following results on the evaluation dataset:

- **Precision**: 74.04%
- **Recall**: 85.16%
- **F1 Score**: 78.61%

## How to Use

To use the model, you need to load it with the Hugging Face Transformers library. Here's an example:

```python
from transformers import pipeline

# Load the model
nlp = pipeline("ner", model="mohannad-tazi/ner-darija-darner")

# Use the model
text = "محمد كان في الرباط."
result = nlp(text)
print(result)
```

## Dataset

The model is trained on the DarNERcorp dataset, a corpus designed specifically for Named Entity Recognition in the Moroccan Arabic dialect. The dataset includes sentences labeled with named entity tags such as PER, LOC, ORG, and MISC.

## Preprocessing Steps

- Tokenization using the BERT tokenizer.
- Alignment of labels with tokenized inputs (considering word-piece tokens).
- Padding and truncating sentences to a fixed length for uniformity.

## Limitations

The model is trained on a specific corpus and may not generalize well to all Moroccan Arabic texts. Performance may vary depending on text quality and tagging consistency in the dataset.
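
The label-alignment step listed under Preprocessing Steps can be illustrated with a minimal sketch. It assumes the common convention of labeling only the first word-piece of each word and masking the remaining pieces and special tokens with `-100` (the default ignored index of PyTorch's cross-entropy loss). This card does not state which convention was actually used, and the helper `align_labels`, the sample `word_ids`, and the label ids below are hypothetical.

```python
from typing import List, Optional

# -100 is the default ignore_index of torch.nn.CrossEntropyLoss.
IGNORE_INDEX = -100


def align_labels(
    word_ids: List[Optional[int]],
    word_labels: List[int],
    label_all_subwords: bool = False,
) -> List[int]:
    """Spread word-level labels over word-piece tokens.

    `word_ids` maps each token to the index of its source word (None for
    special tokens), in the shape returned by `BatchEncoding.word_ids()`
    on a fast tokenizer.
    """
    aligned = []
    previous = None
    for wid in word_ids:
        if wid is None:
            # Special tokens such as [CLS], [SEP], or padding.
            aligned.append(IGNORE_INDEX)
        elif wid != previous:
            # First word-piece of a word keeps the word's label.
            aligned.append(word_labels[wid])
        else:
            # Later word-pieces are masked unless explicitly labeled.
            aligned.append(word_labels[wid] if label_all_subwords else IGNORE_INDEX)
        previous = wid
    return aligned


# Hypothetical input: three words, the first and third split into two pieces.
word_ids = [None, 0, 0, 1, 2, 2, None]
word_labels = [1, 0, 5]  # hypothetical label ids, not the model's real mapping
print(align_labels(word_ids, word_labels))
# [-100, 1, -100, 0, 5, -100, -100]
```

With a real fast tokenizer, `word_ids` would come from calling the tokenizer with `is_split_into_words=True` and then `encoding.word_ids()`; passing `label_all_subwords=True` instead propagates each word's label to all of its pieces.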