Migrate model card from transformers-repo
Read announcement at https://discuss.huggingface.co/t/announcement-all-model-cards-will-be-migrated-to-hf-co-model-repos/2755
Original file history: https://github.com/huggingface/transformers/commits/master/model_cards/qarib/bert-base-qarib60_1970k/README.md
README.md
ADDED
@@ -0,0 +1,96 @@
|
1 |
+
---
|
2 |
+
language: ar
|
3 |
+
tags:
|
4 |
+
- qarib
|
5 |
+
|
6 |
+
license: apache-2.0
|
7 |
+
datasets:
|
8 |
+
- Arabic GigaWord
|
9 |
+
- Abulkhair Arabic Corpus
|
10 |
+
- opus
|
11 |
+
- Twitter data
|
12 |
+
---
|
13 |
+
|
14 |
+
# QARiB: QCRI Arabic and Dialectal BERT
|
15 |
+
|
16 |
+
## About QARiB
|
17 |
+
The QCRI Arabic and Dialectal BERT (QARiB) model was trained on a collection of ~420 million tweets and ~180 million sentences of text.
|
18 |
+
The tweets were collected using the Twitter API with the language filter `lang:ar`. The text data is a combination of
|
19 |
+
[Arabic GigaWord](url), [Abulkhair Arabic Corpus]() and [OPUS](http://opus.nlpl.eu/).
|
20 |
+
|
21 |
+
### bert-base-qarib60_1970k
|
22 |
+
- Data size: 60 GB
|
23 |
+
- Number of Iterations: 1970k
|
24 |
+
- Loss: 1.5708898
|
25 |
+
|
26 |
+
## Training QARiB
|
27 |
+
The model was trained using Google’s original TensorFlow code on a Google Cloud TPU v2.
|
28 |
+
We used a Google Cloud Storage bucket for persistent storage of training data and models.
|
29 |
+
For more details, see [Training QARiB](../Training_QARiB.md).
|
30 |
+
|
31 |
+
## Using QARiB
|
32 |
+
|
33 |
+
You can use the raw model for either masked language modeling or next sentence prediction, but it is mostly intended to be fine-tuned on a downstream task. See the model hub for versions fine-tuned on a task that interests you. For more details, see [Using QARiB](../Using_QARiB.md).
|
34 |
+
|
35 |
+
### How to use
|
36 |
+
You can use this model directly with a pipeline for masked language modeling:
|
37 |
+
|
38 |
+
```python
|
39 |
+
>>> from transformers import pipeline
|
40 |
+
>>> fill_mask = pipeline("fill-mask", model="./models/data60gb_86k")
|
41 |
+
|
42 |
+
>>> fill_mask("شو عندكم يا [MASK]")
|
43 |
+
[{'sequence': '[CLS] شو عندكم يا عرب [SEP]', 'score': 0.0990147516131401, 'token': 2355, 'token_str': 'عرب'},
|
44 |
+
{'sequence': '[CLS] شو عندكم يا جماعة [SEP]', 'score': 0.051633741706609726, 'token': 2308, 'token_str': 'جماعة'},
|
45 |
+
{'sequence': '[CLS] شو عندكم يا شباب [SEP]', 'score': 0.046871256083250046, 'token': 939, 'token_str': 'شباب'},
|
46 |
+
{'sequence': '[CLS] شو عندكم يا رفاق [SEP]', 'score': 0.03598872944712639, 'token': 7664, 'token_str': 'رفاق'},
|
47 |
+
{'sequence': '[CLS] شو عندكم يا ناس [SEP]', 'score': 0.031996358186006546, 'token': 271, 'token_str': 'ناس'}]
|
48 |
+
|
49 |
+
>>> fill_mask("قللي وشفيييك يرحم [MASK]")
|
50 |
+
[{'sequence': '[CLS] قللي وشفيييك يرحم والديك [SEP]', 'score': 0.4152909517288208, 'token': 9650, 'token_str': 'والديك'},
|
51 |
+
{'sequence': '[CLS] قللي وشفيييك يرحملي [SEP]', 'score': 0.07663793861865997, 'token': 294, 'token_str': '##لي'},
|
52 |
+
{'sequence': '[CLS] قللي وشفيييك يرحم حالك [SEP]', 'score': 0.0453166700899601, 'token': 2663, 'token_str': 'حالك'},
|
53 |
+
{'sequence': '[CLS] قللي وشفيييك يرحم امك [SEP]', 'score': 0.04390475153923035, 'token': 1942, 'token_str': 'امك'},
|
54 |
+
{'sequence': '[CLS] قللي وشفيييك يرحمونك [SEP]', 'score': 0.027349254116415977, 'token': 3283, 'token_str': '##ونك'}]
|
55 |
+
|
56 |
+
>>> fill_mask("وقام المدير [MASK]")
|
57 |
+
[
|
58 |
+
{'sequence': '[CLS] وقام المدير بالعمل [SEP]', 'score': 0.0678194984793663, 'token': 4230, 'token_str': 'بالعمل'},
|
59 |
+
{'sequence': '[CLS] وقام المدير بذلك [SEP]', 'score': 0.05191086605191231, 'token': 984, 'token_str': 'بذلك'},
|
60 |
+
{'sequence': '[CLS] وقام المدير بالاتصال [SEP]', 'score': 0.045264165848493576, 'token': 26096, 'token_str': 'بالاتصال'},
|
61 |
+
{'sequence': '[CLS] وقام المدير بعمله [SEP]', 'score': 0.03732728958129883, 'token': 40486, 'token_str': 'بعمله'},
|
62 |
+
{'sequence': '[CLS] وقام المدير بالامر [SEP]', 'score': 0.0246378555893898, 'token': 29124, 'token_str': 'بالامر'}
|
63 |
+
]
|
64 |
+
>>> fill_mask("وقامت المديرة [MASK]")
|
65 |
+
|
66 |
+
[{'sequence': '[CLS] وقامت المديرة بذلك [SEP]', 'score': 0.23992691934108734, 'token': 984, 'token_str': 'بذلك'},
|
67 |
+
{'sequence': '[CLS] وقامت المديرة بالامر [SEP]', 'score': 0.108805812895298, 'token': 29124, 'token_str': 'بالامر'},
|
68 |
+
{'sequence': '[CLS] وقامت المديرة بالعمل [SEP]', 'score': 0.06639821827411652, 'token': 4230, 'token_str': 'بالعمل'},
|
69 |
+
{'sequence': '[CLS] وقامت المديرة بالاتصال [SEP]', 'score': 0.05613093823194504, 'token': 26096, 'token_str': 'بالاتصال'},
|
70 |
+
{'sequence': '[CLS] وقامت المديرة المديرة [SEP]', 'score': 0.021778125315904617, 'token': 41635, 'token_str': 'المديرة'}]
|
71 |
+
```
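Some of the predictions shown above are WordPiece continuations (their `token_str` starts with `##`), which attach to the preceding word rather than forming a new one. The following is a minimal sketch, assuming the list-of-dicts output format shown in the examples, of how such results could be filtered after the pipeline call. The helper name `whole_word_predictions` and its `min_score` parameter are hypothetical and not part of `transformers`; the sample data is copied from the second example with scores rounded.

```python
from typing import Dict, List


def whole_word_predictions(predictions: List[Dict], min_score: float = 0.0) -> List[Dict]:
    """Keep predictions that are whole words and score at least min_score.

    Hypothetical helper: a prediction counts as a whole word when its
    token_str does not start with the WordPiece continuation prefix '##'.
    """
    return [
        p for p in predictions
        if not p["token_str"].startswith("##") and p["score"] >= min_score
    ]


# Sample copied from the second fill_mask example (scores rounded).
sample = [
    {"sequence": "[CLS] قللي وشفيييك يرحم والديك [SEP]", "score": 0.4153, "token": 9650, "token_str": "والديك"},
    {"sequence": "[CLS] قللي وشفيييك يرحملي [SEP]", "score": 0.0766, "token": 294, "token_str": "##لي"},
    {"sequence": "[CLS] قللي وشفيييك يرحم حالك [SEP]", "score": 0.0453, "token": 2663, "token_str": "حالك"},
]

# Drop the '##' continuation and anything scoring below 0.04.
kept = whole_word_predictions(sample, min_score=0.04)

# Pick the highest-scoring remaining prediction.
best = max(kept, key=lambda p: p["score"])
print(best["token_str"])
```

In practice, `sample` would be replaced by the return value of `fill_mask(...)`. Whether to discard continuation pieces depends on the task: for dialectal text they can be valid completions.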
|
72 |
+
## Training procedure
|
73 |
+
|
74 |
+
The model was trained using Google’s original TensorFlow code on an eight-core Google Cloud TPU v2.
|
75 |
+
We used a Google Cloud Storage bucket for persistent storage of training data and models.
|
76 |
+
|
77 |
+
## Eval results
|
78 |
+
|
79 |
+
We evaluated QARiB models on five downstream NLP tasks:
|
80 |
+
- Sentiment Analysis
|
81 |
+
- Emotion Detection
|
82 |
+
- Named-Entity Recognition (NER)
|
83 |
+
- Offensive Language Detection
|
84 |
+
- Dialect Identification
|
85 |
+
|
86 |
+
On these tasks, QARiB models outperform multilingual BERT, AraBERT, and ArabicBERT.
|
87 |
+
|
88 |
+
|
89 |
+
## Model Weights and Vocab Download
|
90 |
+
TBD
|
91 |
+
|
92 |
+
## Contacts
|
93 |
+
|
94 |
+
Ahmed Abdelali, Sabit Hassan, Hamdy Mubarak, Kareem Darwish and Younes Samih
|
95 |
+
|
96 |
+
|