ali safaya committed on
Commit 3530341
Parent(s): 43f84b0
transfer models from org to user account
Browse files:
- README.md +79 -3
- config.json +29 -0
- pytorch_model.bin +3 -0
- special_tokens_map.json +1 -0
- spiece.model +3 -0
- tf_model.h5 +3 -0
- tokenizer_config.json +1 -0
README.md
CHANGED
@@ -1,3 +1,79 @@
- ---
-
-
---
language: ar
datasets:
- oscar
- wikipedia
tags:
- ar
- masked-lm
---

# Arabic-ALBERT Xlarge

Arabic edition of the ALBERT Xlarge pretrained language model.

_If you use any of these models in your work, please cite this work as:_

```
@software{ali_safaya_2020_4718724,
  author    = {Ali Safaya},
  title     = {Arabic-ALBERT},
  month     = aug,
  year      = 2020,
  publisher = {Zenodo},
  version   = {1.0.0},
  doi       = {10.5281/zenodo.4718724},
  url       = {https://doi.org/10.5281/zenodo.4718724}
}
```

## Pretraining data

The models were pretrained on ~4.4 billion words:

- The Arabic portion of [OSCAR](https://oscar-corpus.com/) (the unshuffled version of the corpus), filtered from [Common Crawl](http://commoncrawl.org/)
- A recent dump of Arabic [Wikipedia](https://dumps.wikimedia.org/backup-index.html)

__Notes on training data:__

- Our final version of the corpus contains some inline non-Arabic words, which we did not remove from sentences, since removing them would affect tasks such as NER.
- Although non-Arabic characters were lowercased as a preprocessing step, there are no cased and uncased versions of the model, because Arabic characters have no upper or lower case.
- The corpus and vocabulary are not restricted to Modern Standard Arabic; they also contain some dialectal Arabic.

## Pretraining details

- These models were trained using Google ALBERT's GitHub [repository](https://github.com/google-research/albert) on a single TPU v3-8 provided for free by [TFRC](https://www.tensorflow.org/tfrc).
- Our pretraining procedure follows the training settings of BERT with some changes: we trained for 7M steps with a batch size of 64, instead of 125K steps with a batch size of 4096 (a rough comparison of the totals is sketched below).
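
The two schedules work out to a comparable number of training examples seen; a back-of-the-envelope check (ours, not part of the original card):

```python
# Illustrative arithmetic only: total examples seen under each schedule
ours = 7_000_000 * 64   # 7M steps at batch size 64
bert = 125_000 * 4096   # 125K steps at batch size 4096

print(f"ours:  {ours:,}")           # ours:  448,000,000
print(f"bert:  {bert:,}")           # bert:  512,000,000
print(f"ratio: {ours / bert:.2f}")  # ratio: 0.88
```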

## Models

|                 | albert-base | albert-large | albert-xlarge |
|:---------------:|:-----------:|:------------:|:-------------:|
| Hidden layers   | 12          | 24           | 24            |
| Attention heads | 12          | 16           | 32            |
| Hidden size     | 768         | 1024         | 2048          |
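
These dimensions can also be read back from the released configuration. A minimal check for this xlarge checkpoint (a sketch, using the repo id from the usage section below):

```python
from transformers import AutoConfig

# Fetch this checkpoint's configuration from the Hugging Face Hub
config = AutoConfig.from_pretrained("kuisailab/albert-xlarge-arabic")

# These values should match the albert-xlarge column of the table above
print(config.num_hidden_layers)    # 24
print(config.num_attention_heads)  # 32
print(config.hidden_size)          # 2048
```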

## Results

For further details on the models' performance, or for any other queries, please refer to [Arabic-ALBERT](https://github.com/KUIS-AI-Lab/Arabic-ALBERT/).

## How to use

You can use these models by installing `torch` or `tensorflow` together with the Hugging Face `transformers` library, and initializing them like this:

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

# loading the tokenizer
tokenizer = AutoTokenizer.from_pretrained("kuisailab/albert-xlarge-arabic")

# loading the masked-LM model
model = AutoModelForMaskedLM.from_pretrained("kuisailab/albert-xlarge-arabic")
```
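
For a quick end-to-end check, the model and tokenizer loaded above can be wrapped in a fill-mask pipeline. A minimal sketch (the example sentence is ours, not from the original card):

```python
from transformers import pipeline

# Reuse the model and tokenizer loaded above
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)

# Predict the masked token; the mask token for this model is [MASK]
for prediction in fill_mask("عاصمة لبنان هي [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))
```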

## Acknowledgement

Thanks to Google for providing a free TPU for the training process, and to Hugging Face for hosting these models on their servers 😊
config.json
ADDED
@@ -0,0 +1,29 @@
{
  "architectures": [
    "AlbertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0,
  "bos_token_id": 2,
  "classifier_dropout_prob": 0.1,
  "down_scale_factor": 1,
  "embedding_size": 128,
  "eos_token_id": 3,
  "gap_size": 0,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0,
  "hidden_size": 2048,
  "initializer_range": 0.02,
  "inner_group_num": 1,
  "intermediate_size": 8192,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "albert",
  "net_structure_type": 0,
  "num_attention_heads": 32,
  "num_hidden_groups": 1,
  "num_hidden_layers": 24,
  "num_memory_blocks": 0,
  "pad_token_id": 0,
  "type_vocab_size": 2,
  "vocab_size": 30000
}
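
A minimal sketch (ours) of how this configuration drives model construction in `transformers`, assuming the file above has been saved locally as `config.json`:

```python
from transformers import AlbertConfig, AlbertForMaskedLM

# Parse the configuration file shown above
config = AlbertConfig.from_json_file("config.json")

# Builds a randomly initialized model with this architecture; use
# from_pretrained(...) instead to also load the pretrained weights
model = AlbertForMaskedLM(config)
print(sum(p.numel() for p in model.parameters()))  # parameter count
```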
pytorch_model.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:417f617905c8c85cd6d4780d9b6d864acebfdae8237a51f0ee2b48901da9a27c
size 236076710
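
This is a Git LFS pointer, not the weights themselves; the actual binary is fetched by LFS. A small sketch (ours) for verifying a downloaded copy against the pointer's oid and size:

```python
import hashlib
import os

# Expected values copied from the LFS pointer above
EXPECTED_OID = "417f617905c8c85cd6d4780d9b6d864acebfdae8237a51f0ee2b48901da9a27c"
EXPECTED_SIZE = 236076710

# Hash the downloaded file in 1 MiB chunks to keep memory usage flat
sha256 = hashlib.sha256()
with open("pytorch_model.bin", "rb") as f:
    for chunk in iter(lambda: f.read(1 << 20), b""):
        sha256.update(chunk)

assert os.path.getsize("pytorch_model.bin") == EXPECTED_SIZE
assert sha256.hexdigest() == EXPECTED_OID
print("pytorch_model.bin matches its LFS pointer")
```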
special_tokens_map.json
ADDED
@@ -0,0 +1 @@
{"bos_token": "[CLS]", "eos_token": "[SEP]", "unk_token": "<unk>", "sep_token": "[SEP]", "pad_token": "<pad>", "cls_token": "[CLS]", "mask_token": "[MASK]"}
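
These entries surface as attributes on the loaded tokenizer; a quick check (ours):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("kuisailab/albert-xlarge-arabic")

# Special tokens from the map above, with their vocabulary ids
for name in ("cls_token", "sep_token", "mask_token", "pad_token", "unk_token"):
    token = getattr(tokenizer, name)
    print(name, token, tokenizer.convert_tokens_to_ids(token))
```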
spiece.model
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:40f25b5aae5c42a4089292e6616f37bca7b5a4f08608678b16ba5a55c9f050d7
size 860481
tf_model.h5
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:18e4a4d8cc6a8d506068b9c2f7f073eaa0e0f1777a19e99d83cd6e8bac234ad1
size 251868920
tokenizer_config.json
ADDED
@@ -0,0 +1 @@
{"special_tokens_map_file": null, "full_tokenizer_file": null}