---
language: ar
license: apache-2.0
datasets:
- ANERcorp
thumbnail: https://www.informatik.hu-berlin.de/en/forschung-en/gebiete/ml-en/resolveuid/a6f82e0d7fa446a59c902cac4cafa9cb/@@images/image/preview
tags:
- flair
- Text Classification
- token-classification
- sequence-tagger-model
metrics:
- f1
widget:
- text: "لائحة «الوطنية للصحافة».. خطوة جديدة في طريق «الحصار»"
---
# Arabic NER Model using Flair Embeddings
The model was trained for 94 epochs with a batch size of 32 and a linearly decaying learning rate that started at 0.225 and ended at 2e-05. It uses stacked GloVe and Flair forward and backward embeddings.
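
Taking the learning-rate figures quoted above as the start (0.225) and end (2e-05) of a linear decay over the 94 epochs, the schedule can be sketched as follows. This only illustrates that reading: the helper `linear_lr` and the one-step-per-epoch assumption are hypothetical and are not taken from the training script used for this model.

```python
# Hyperparameters quoted in the model card.
MAX_EPOCHS = 94
START_LR = 0.225
END_LR = 2e-05


def linear_lr(epoch: int, max_epochs: int = MAX_EPOCHS,
              start: float = START_LR, end: float = END_LR) -> float:
    """Linearly interpolate the learning rate for a 0-based epoch index.

    Hypothetical helper: assumes exactly one decay step per epoch.
    """
    if not 0 <= epoch < max_epochs:
        raise ValueError("epoch out of range")
    if max_epochs == 1:
        return start
    fraction = epoch / (max_epochs - 1)
    return start + (end - start) * fraction


if __name__ == "__main__":
    # Show the rate at the first, a middle, and the last epoch.
    for epoch in (0, 46, 93):
        print(f"epoch {epoch:2d}: lr = {linear_lr(epoch):.6f}")
```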
## Original Datasets:
- [AQMAR](http://www.cs.cmu.edu/~ark/ArabicNER/)
- [ANERcorp](http://curtis.ml.cmu.edu/w/courses/index.php/ANERcorp)
## Results:
- F1-score (micro) 0.8666
- F1-score (macro) 0.8488

| Class | True Positives | False Positives | False Negatives | Precision | Recall | class-F1 |
|------|-----|----|----|-----------|--------|----------|
| LOC | 539 | 51 | 68 | 0.9136 | 0.8880 | 0.9006 |
| MISC | 408 | 57 | 89 | 0.8774 | 0.8209 | 0.8482 |
| ORG | 167 | 43 | 64 | 0.7952 | 0.7229 | 0.7574 |
| PER | 501 | 65 | 60 | 0.8852 | 0.8930 | 0.8891 |
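
The per-class scores and both averages can be recomputed from the raw counts in the table. The following is a small sketch in plain Python; the `COUNTS` dictionary and the function names are illustrative and not part of Flair.

```python
from typing import Dict, Tuple

# (true positives, false positives, false negatives) per class, from the table.
COUNTS: Dict[str, Tuple[int, int, int]] = {
    "LOC": (539, 51, 68),
    "MISC": (408, 57, 89),
    "ORG": (167, 43, 64),
    "PER": (501, 65, 60),
}


def prf(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Return precision, recall and F1 for one set of counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def macro_f1(counts: Dict[str, Tuple[int, int, int]]) -> float:
    """Unweighted mean of the per-class F1 scores."""
    return sum(prf(*c)[2] for c in counts.values()) / len(counts)


def micro_f1(counts: Dict[str, Tuple[int, int, int]]) -> float:
    """F1 computed on the counts summed across all classes."""
    tp, fp, fn = (sum(column) for column in zip(*counts.values()))
    return prf(tp, fp, fn)[2]


if __name__ == "__main__":
    for label, c in COUNTS.items():
        p, r, f = prf(*c)
        print(f"{label:5s} P={p:.4f} R={r:.4f} F1={f:.4f}")
    print(f"micro F1 = {micro_f1(COUNTS):.4f}")
    print(f"macro F1 = {macro_f1(COUNTS):.4f}")
```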
---
# Usage
```python
from flair.data import Sentence
from flair.models import SequenceTagger
from icecream import ic

# load an English tagger for comparison and the Arabic tagger
tagger = SequenceTagger.load("julien-c/flair-ner")
arTagger = SequenceTagger.load('megantosh/flair-arabic-multi-ner')

sentence = Sentence('George Washington went to Washington .')
arSentence = Sentence('عمرو عادلي أستاذ للاقتصاد السياسي المساعد في الجامعة الأمريكية بالقاهرة .')

# predict NER tags
tagger.predict(sentence)
arTagger.predict(arSentence)

# print sentence with predicted tags (note the call parentheses)
ic(sentence.to_tagged_string())
ic(arSentence.to_tagged_string())

# print the predicted entity spans
for entity in sentence.get_spans('ner') + arSentence.get_spans('ner'):
    ic(entity)

# print both sentences as dictionaries
ic(arSentence.to_dict(tag_type='ner'))
ic(sentence.to_dict(tag_type='ner'))
```
# Example
Output recorded from a run on 2021-07-07. In that run `to_tagged_string` was passed to `ic` without being called, so the first two `ic` lines show the bound method together with the sentence representation.
```bash
2021-07-07 14:30:59,649 loading file /Users/mega/.flair/models/flair-ner/f22eb997f66ae2eacad974121069abaefca5fe85fce71b49e527420ff45b9283.941c7c30b38aef8d8a4eb5c1b6dd7fe8583ff723fef457382589ad6a4e859cfc
2021-07-07 14:31:04,654 loading file /Users/mega/.flair/models/flair-arabic-multi-ner/c7af7ddef4fdcc681fcbe1f37719348afd2862b12aa1cfd4f3b93bd2d77282c7.242d030cb106124f7f9f6a88fb9af8e390f581d42eeca013367a86d585ee6dd6
ic| sentence.to_tagged_string: <bound method Sentence.to_tagged_string of Sentence: "George Washington went to Washington ." [− Tokens: 6 − Token-Labels: "George <B-PER> Washington <E-PER> went to Washington <S-LOC> ."]>
ic| arSentence.to_tagged_string: <bound method Sentence.to_tagged_string of Sentence: "عمرو عادلي أستاذ للاقتصاد السياسي المساعد في الجامعة الأمريكية بالقاهرة ." [− Tokens: 11 − Token-Labels: "عمرو <B-PER> عادلي <I-PER> أستاذ للاقتصاد السياسي المساعد في الجامعة <B-ORG> الأمريكية <I-ORG> بالقاهرة <B-LOC> ."]>
ic| entity: <PER-span (1,2): "George Washington">
ic| entity: <LOC-span (5): "Washington">
ic| entity: <PER-span (1,2): "عمرو عادلي">
ic| entity: <ORG-span (8,9): "الجامعة الأمريكية">
ic| entity: <LOC-span (10): "بالقاهرة">
ic| sentence.to_dict(tag_type='ner'):
{"text":"عمرو عادلي أستاذ للاقتصاد السياسي المساعد في الجامعة الأمريكية بالقاهرة .",
 "labels":[],
 "entities":[
    {"text":"عمرو عادلي",
     "start_pos":0,
     "end_pos":10,
     "labels":[PER (0.9826)]},
    {"text":"الجامعة الأمريكية",
     "start_pos":45,
     "end_pos":62,
     "labels":[ORG (0.7679)]},
    {"text":"بالقاهرة",
     "start_pos":64,
     "end_pos":72,
     "labels":[LOC (0.8079)]}]}
{"text":"George Washington went to Washington .",
 "labels":[],
 "entities":[
    {"text":"George Washington",
     "start_pos":0,
     "end_pos":17,
     "labels":[PER (0.9968)]},
    {"text":"Washington",
     "start_pos":26,
     "end_pos":36,
     "labels":[LOC (0.9994)]}]}
```
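
The tagged strings in the log use two label schemes: the English tagger emits BIOES-style tags (`B-`, `E-`, `S-`), while the Arabic tagger emits BIO-style tags (`B-`, `I-`). As a rough illustration of how such a string maps to the entity spans listed in the log, the sketch below decodes either scheme in plain Python. `decode_tagged_string` is a hypothetical helper written for this example, not a Flair function; in practice `sentence.get_spans('ner')` returns the spans directly.

```python
import re
from typing import List, Optional, Tuple

# Matches tags such as <B-PER>, <I-ORG>, <E-PER>, <S-LOC>.
TAG = re.compile(r"^<([BIES])-(\w+)>$")


def decode_tagged_string(tagged: str) -> List[Tuple[str, str]]:
    """Turn 'token <B-PER> token <E-PER> ...' into (label, text) pairs."""
    # Attach each tag to the token in front of it; untagged tokens keep None.
    tokens: List[List[Optional[str]]] = []
    for piece in tagged.split():
        match = TAG.match(piece)
        if match and tokens and tokens[-1][1] is None:
            tokens[-1][1:] = match.groups()
        else:
            tokens.append([piece, None, None])

    spans: List[Tuple[str, str]] = []
    current_label: Optional[str] = None
    current_tokens: List[str] = []

    def flush() -> None:
        """Close the span that is currently open, if any."""
        nonlocal current_label, current_tokens
        if current_tokens:
            spans.append((current_label, " ".join(current_tokens)))
        current_label, current_tokens = None, []

    for token, prefix, label in tokens:
        # A span continues only on I-/E- tags that carry the same label.
        if not (prefix in ("I", "E") and label == current_label):
            flush()
        if prefix is None:
            continue
        current_label = label
        current_tokens.append(token)
        if prefix in ("E", "S"):
            flush()
    flush()
    return spans


if __name__ == "__main__":
    english = "George <B-PER> Washington <E-PER> went to Washington <S-LOC> ."
    for label, text in decode_tagged_string(english):
        print(label, text)
```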
# Model Configuration
```python
SequenceTagger(
  (embeddings): StackedEmbeddings(
    (list_embedding_0): WordEmbeddings('glove')
    (list_embedding_1): FlairEmbeddings(
      (lm): LanguageModel(
        (drop): Dropout(p=0.1, inplace=False)
        (encoder): Embedding(7125, 100)
        (rnn): LSTM(100, 2048)
        (decoder): Linear(in_features=2048, out_features=7125, bias=True)
      )
    )
    (list_embedding_2): FlairEmbeddings(
      (lm): LanguageModel(
        (drop): Dropout(p=0.1, inplace=False)
        (encoder): Embedding(7125, 100)
        (rnn): LSTM(100, 2048)
        (decoder): Linear(in_features=2048, out_features=7125, bias=True)
      )
    )
  )
  (word_dropout): WordDropout(p=0.05)
  (locked_dropout): LockedDropout(p=0.5)
  (embedding2nn): Linear(in_features=4196, out_features=4196, bias=True)
  (rnn): LSTM(4196, 256, batch_first=True, bidirectional=True)
  (linear): Linear(in_features=512, out_features=15, bias=True)
  (beta): 1.0
  (weights): None
  (weight_tensor) None
)
```
Due to formatting errors, your output might look like [this](https://ibb.co/ky20Lnq) (from an attempt on 2020-10-27 12:05:47,801).
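
The layer sizes in the model configuration printout are consistent with one another: the two 2048-dimensional Flair language-model states plus the word embeddings add up to the 4196 inputs of `embedding2nn`, and the bidirectional LSTM with 256 hidden units per direction yields the 512 features that the final linear layer expects. A quick arithmetic check follows; the variable names are chosen here for illustration, and attributing the remaining dimensions to the GloVe vectors is an inference from the printout.

```python
# Sizes read off the SequenceTagger printout.
FLAIR_HIDDEN = 2048      # LSTM(100, 2048) inside each FlairEmbeddings
EMBEDDING2NN_IN = 4196   # Linear(in_features=4196, ...)
RNN_HIDDEN = 256         # LSTM(4196, 256, bidirectional=True)
LINEAR_IN = 512          # Linear(in_features=512, out_features=15)

# Whatever remains after the forward and backward Flair states must come
# from the word embeddings (assumed here to be the GloVe vectors).
glove_dim = EMBEDDING2NN_IN - 2 * FLAIR_HIDDEN

# A bidirectional LSTM concatenates the hidden states of both directions.
bilstm_out = 2 * RNN_HIDDEN

print(f"Word-embedding dimension implied by the printout: {glove_dim}")
print(f"BiLSTM output size: {bilstm_out} (linear layer expects {LINEAR_IN})")
```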
# Citation
*If you use this model in your work, please consider citing:*
```latex
@unpublished{MMHU21,
  author = "M. Megahed",
  title = "Sequence Labeling Architectures in Diglossia",
  note = "In preparation",
}
```