---
tags:
- BERT
- token-classification
- sequence-tagger-model
language:
- ar
- en
license: mit
datasets:
- ACE2005
---
# Arabic NER Model
- [GitHub repo](https://github.com/edchengg/GigaBERT)
- NER BIO-tagging model based on [GigaBERTv4](https://huggingface.co/lanwuwei/GigaBERT-v4-Arabic-and-English).
- Training data: ACE2005 (English + Arabic)
- [NER tags](https://www.ldc.upenn.edu/sites/www.ldc.upenn.edu/files/english-entities-guidelines-v6.6.pdf): PER, VEH, GPE, WEA, ORG, LOC, FAC (see the sketch below for inspecting the full BIO label set)
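
Because the model is a BIO tagger, each tag above appears as `B-` and `I-` variants alongside the `O` label. A minimal sketch for inspecting the full label inventory from the hosted config; the mapping shown in the comment is illustrative, not verified:

```python
from transformers import AutoConfig

# Fetch only the config (no weights) and print the id-to-label mapping
config = AutoConfig.from_pretrained("ychenNLP/arabic-ner-ace")
print(config.id2label)  # e.g. {0: 'O', 1: 'B-PER', 2: 'I-PER', ...} (illustrative)
```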

## Hyperparameters
- learning_rate=2e-5
- num_train_epochs=10
- weight_decay=0.01
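
These values map directly onto `transformers` `TrainingArguments`. A hedged sketch of how the fine-tuning run might be configured; the output path is hypothetical, and batch size, scheduler, and ACE2005 preprocessing are omitted because the card does not specify them:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="arabic-ner-ace",  # hypothetical output path
    learning_rate=2e-5,           # from this card
    num_train_epochs=10,          # from this card
    weight_decay=0.01,            # from this card
)
```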

## ACE2005 Evaluation results (F1)
| Language | Arabic | English |
|:--------:|:------:|:-------:|
| F1       | 89.4   | 88.8    |

## How to use
```python
>>> from transformers import pipeline, AutoModelForTokenClassification, AutoTokenizer

>>> ner_model = AutoModelForTokenClassification.from_pretrained("ychenNLP/arabic-ner-ace")
>>> ner_tokenizer = AutoTokenizer.from_pretrained("ychenNLP/arabic-ner-ace")
>>> # aggregation_strategy="simple" merges word pieces into whole entities
>>> # (it replaces the deprecated grouped_entities=True)
>>> ner_pip = pipeline("ner", model=ner_model, tokenizer=ner_tokenizer, aggregation_strategy="simple")

>>> output = ner_pip('Protests break out across the US after Supreme Court overturns.')
>>> print(output)
[{'entity_group': 'GPE', 'score': 0.9979881, 'word': 'us', 'start': 30, 'end': 32}, {'entity_group': 'ORG', 'score': 0.99898684, 'word': 'supreme court', 'start': 39, 'end': 52}]

>>> output = ner_pip('ู‚ุงู„ ูˆุฒูŠุฑ ุงู„ุนุฏู„ ุงู„ุชุฑูƒูŠ ุจูƒูŠุฑ ุจูˆุฒุฏุงุบ ุฅู† ุฃู†ู‚ุฑุฉ ุชุฑูŠุฏ 12 ู…ุดุชุจู‡ุงู‹ ุจู‡ู… ู…ู† ูู†ู„ู†ุฏุง ูˆ 21 ู…ู† ุงู„ุณูˆูŠุฏ')
>>> print(output)
[{'entity_group': 'PER', 'score': 0.9996214, 'word': 'ูˆุฒูŠุฑ', 'start': 4, 'end': 8}, {'entity_group': 'ORG', 'score': 0.9952383, 'word': 'ุงู„ุนุฏู„', 'start': 9, 'end': 14}, {'entity_group': 'GPE', 'score': 0.9996675, 'word': 'ุงู„ุชุฑูƒูŠ', 'start': 15, 'end': 21}, {'entity_group': 'PER', 'score': 0.9978992, 'word': 'ุจูƒูŠุฑ ุจูˆุฒุฏุงุบ', 'start': 22, 'end': 33}, {'entity_group': 'GPE', 'score': 0.9997154, 'word': 'ุงู†ู‚ุฑุฉ', 'start': 37, 'end': 42}, {'entity_group': 'PER', 'score': 0.9946885, 'word': 'ู…ุดุชุจู‡ุง ุจู‡ู…', 'start': 51, 'end': 62}, {'entity_group': 'GPE', 'score': 0.99967396, 'word': 'ูู†ู„ู†ุฏุง', 'start': 66, 'end': 72}, {'entity_group': 'PER', 'score': 0.99694425, 'word': '21', 'start': 75, 'end': 77}, {'entity_group': 'GPE', 'score': 0.99963355, 'word': 'ุงู„ุณูˆูŠุฏ', 'start': 81, 'end': 87}]
```
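
The same pipeline can also be built in one step from the model id, which resolves both weights and tokenizer; a minimal equivalent sketch:

```python
from transformers import pipeline

# One-step construction: the hub id loads model and tokenizer together
ner_pip = pipeline("ner", model="ychenNLP/arabic-ner-ace", aggregation_strategy="simple")
print(ner_pip("Protests break out across the US after Supreme Court overturns."))
```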

### BibTeX entry and citation info

```bibtex
@inproceedings{lan2020gigabert,
  author    = {Lan, Wuwei and Chen, Yang and Xu, Wei and Ritter, Alan},
  title     = {Giga{BERT}: Zero-shot Transfer Learning from {E}nglish to {A}rabic},
  booktitle = {Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
  year      = {2020}
}
```