Migrate model card from transformers-repo
Read announcement at https://discuss.huggingface.co/t/announcement-all-model-cards-will-be-migrated-to-hf-co-model-repos/2755
Original file history: https://github.com/huggingface/transformers/commits/master/model_cards/asafaya/bert-large-arabic/README.md
README.md
---
language: ar
datasets:
- oscar
- wikipedia
---

# Arabic BERT Large Model

Pretrained BERT Large language model for Arabic

_If you use this model in your work, please cite this paper:_

```bibtex
@misc{safaya2020kuisail,
  title={KUISAIL at SemEval-2020 Task 12: BERT-CNN for Offensive Speech Identification in Social Media},
  author={Ali Safaya and Moutasem Abdullatif and Deniz Yuret},
  year={2020},
  eprint={2007.13184},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```

## Pretraining Corpus

The `arabic-bert-large` model was pretrained on ~8.2 billion words:

- Arabic version of [OSCAR](https://traces1.inria.fr/oscar/) - filtered from [Common Crawl](http://commoncrawl.org/)
- Recent dump of Arabic [Wikipedia](https://dumps.wikimedia.org/backup-index.html)

and other Arabic resources, which sum up to ~95GB of text.

__Notes on training data:__

- The final version of the corpus contains some non-Arabic words inline, which we did not remove from sentences, since removing them would affect some tasks like NER.
- Although non-Arabic characters were lowercased as a preprocessing step, Arabic script has no upper or lower case, so there are no separate cased and uncased versions of the model (see the tokenizer sketch after this list).
- The corpus and vocabulary set are not restricted to Modern Standard Arabic; they contain some dialectal Arabic too.
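
A quick way to see the lowercasing note in action is to tokenize a mixed Arabic/Latin sentence. This is a minimal sketch, assuming the published tokenizer mirrors the preprocessing described above; the example sentence is made up:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("asafaya/bert-large-arabic")

# Arabic script has no case; Latin-script words are expected to come out
# lowercased (assuming the tokenizer applies the same lowercasing as the
# preprocessing step described in the notes above).
print(tokenizer.tokenize("نموذج BERT للغة العربية"))
```

If lowercasing is enabled in the tokenizer, `BERT` should appear as `bert` (possibly split into subword pieces) while the Arabic tokens are unchanged.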

## Pretraining details

- This model was trained using Google BERT's GitHub [repository](https://github.com/google-research/bert) on a single TPU v3-8, provided for free by [TFRC](https://www.tensorflow.org/tfrc).
- Our pretraining procedure follows the original BERT training settings, with some changes: 3M training steps with a batch size of 128, instead of 1M steps with a batch size of 256.

## Load Pretrained Model

You can use this model by installing `torch` or `tensorflow` and the Hugging Face `transformers` library, then loading it directly:

```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("asafaya/bert-large-arabic")
model = AutoModel.from_pretrained("asafaya/bert-large-arabic")
```
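
Since this is a pretrained masked language model, a quick sanity check is the `fill-mask` pipeline. A minimal sketch; the example sentence is illustrative, not from the original card:

```python
from transformers import pipeline

# Load the model with its masked-LM head for masked-word prediction.
fill_mask = pipeline("fill-mask", model="asafaya/bert-large-arabic")

# "The capital of Syria is [MASK]." -- illustrative input sentence.
for prediction in fill_mask("عاصمة سوريا هي [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))
```

Each prediction carries the filled-in token and its score; for downstream tasks you would instead fine-tune the base model loaded above.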

## Results

For further details on the model's performance, or for any other queries, please refer to [Arabic-BERT](https://github.com/alisafaya/Arabic-BERT).

## Acknowledgement

Thanks to Google for providing a free TPU for the training process, and to Hugging Face for hosting this model on their servers 😊