ali safaya committed
Commit ee582a9
1 parent: fb5c569

transfer models from org to user account

README.md CHANGED
@@ -1,3 +1,79 @@
- ---
- license: mit
- ---
+ ---
+ language: ar
+ datasets:
+ - oscar
+ - wikipedia
+ tags:
+ - ar
+ - masked-lm
+ ---
+
+ # Arabic-ALBERT Base
+
+ Arabic edition of the ALBERT Base pretrained language model.
+
+ _If you use any of these models in your work, please cite this work as:_
+
+ ```
+ @software{ali_safaya_2020_4718724,
+   author    = {Ali Safaya},
+   title     = {Arabic-ALBERT},
+   month     = aug,
+   year      = 2020,
+   publisher = {Zenodo},
+   version   = {1.0.0},
+   doi       = {10.5281/zenodo.4718724},
+   url       = {https://doi.org/10.5281/zenodo.4718724}
+ }
+ ```
+
+ ## Pretraining data
+
+ The models were pretrained on ~4.4 billion words:
+
+ - Arabic version of [OSCAR](https://oscar-corpus.com/) (unshuffled version of the corpus), filtered from [Common Crawl](http://commoncrawl.org/)
+ - Recent dump of Arabic [Wikipedia](https://dumps.wikimedia.org/backup-index.html)
+
+ __Notes on training data:__
+
+ - Our final version of the corpus contains some non-Arabic words inline, which we did not remove from the sentences, since doing so would affect tasks such as NER.
+ - Non-Arabic characters were lowercased as a preprocessing step; since Arabic characters have no upper or lower case, there are no separate cased and uncased versions of the model (see the sketch after this list).
+ - The corpus and vocabulary set are not restricted to Modern Standard Arabic; they also contain some dialectal Arabic.
+
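A minimal illustration of the lowercasing note above (an editorial sketch, not the authors' preprocessing script): Python's built-in `str.lower()` only affects cased (non-Arabic) characters, so Arabic text passes through unchanged.

```python
text = "NASA أعلنت عن مهمة جديدة"  # mixed Latin/Arabic sentence

# only the Latin characters change; Arabic has no letter case
print(text.lower())  # -> "nasa أعلنت عن مهمة جديدة"
```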
+ ## Pretraining details
+
+ - These models were trained using Google ALBERT's GitHub [repository](https://github.com/google-research/albert) on a single TPU v3-8, provided for free by [TFRC](https://www.tensorflow.org/tfrc).
+ - Our pretraining procedure follows the training settings of BERT with some changes: we trained for 7M steps with a batch size of 64, instead of 125K steps with a batch size of 4096 (a rough comparison of the two schedules is sketched below).
+
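As a rough sanity check on that change (editorial note, not from the original card), the two schedules process a comparable number of training examples:

```python
# total examples seen = training steps * batch size
ours     = 7_000_000 * 64    # 448,000,000 examples
original = 125_000 * 4096    # 512,000,000 examples

print(ours / original)  # 0.875 -> about 7/8 of the original BERT-style schedule
```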
+ ## Models
+
+ |  | albert-base | albert-large | albert-xlarge |
+ |:---:|:---:|:---:|:---:|
+ | Hidden layers | 12 | 24 | 24 |
+ | Attention heads | 12 | 16 | 32 |
+ | Hidden size | 768 | 1024 | 2048 |
+
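The table values for the base model can be read back from the published configuration. A small sketch (editorial addition, not part of the original card) using `transformers`:

```python
from transformers import AutoConfig

# fetch the configuration of the base model and print the values shown in the table
config = AutoConfig.from_pretrained("kuisailab/albert-base-arabic")
print(config.num_hidden_layers, config.num_attention_heads, config.hidden_size)
# expected output: 12 12 768
```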
+ ## Results
+
+ For further details on the models' performance or any other queries, please refer to [Arabic-ALBERT](https://github.com/KUIS-AI-Lab/Arabic-ALBERT/).
+
+ ## How to use
+
+ You can use these models after installing `torch` or `tensorflow` together with the Hugging Face `transformers` library, and load them directly like this:
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForMaskedLM
+
+ # load the tokenizer
+ base_tokenizer = AutoTokenizer.from_pretrained("kuisailab/albert-base-arabic")
+
+ # load the model
+ base_model = AutoModelForMaskedLM.from_pretrained("kuisailab/albert-base-arabic")
+ ```
+
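As a quick end-to-end check, the same checkpoint can be tried through the `fill-mask` pipeline. This is an editorial sketch; the Arabic prompt is illustrative and not from the original card:

```python
from transformers import pipeline

# predict candidates for the masked token with the base model
fill_mask = pipeline("fill-mask", model="kuisailab/albert-base-arabic")

for pred in fill_mask("الهدف من الحياة هو [MASK] ."):
    print(pred["token_str"], round(pred["score"], 3))
```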
+ ## Acknowledgement
+
+ Thanks to Google for providing a free TPU for the training process, and to Hugging Face for hosting these models on their servers 😊
config.json ADDED
@@ -0,0 +1,29 @@
+ {
+   "architectures": [
+     "AlbertForMaskedLM"
+   ],
+   "attention_probs_dropout_prob": 0,
+   "bos_token_id": 2,
+   "classifier_dropout_prob": 0.1,
+   "down_scale_factor": 1,
+   "embedding_size": 128,
+   "eos_token_id": 3,
+   "gap_size": 0,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0,
+   "hidden_size": 768,
+   "initializer_range": 0.02,
+   "inner_group_num": 1,
+   "intermediate_size": 3072,
+   "layer_norm_eps": 1e-12,
+   "max_position_embeddings": 512,
+   "model_type": "albert",
+   "net_structure_type": 0,
+   "num_attention_heads": 12,
+   "num_hidden_groups": 1,
+   "num_hidden_layers": 12,
+   "num_memory_blocks": 0,
+   "pad_token_id": 0,
+   "type_vocab_size": 2,
+   "vocab_size": 30000
+ }
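As an illustrative sketch (not part of the original repository), this configuration can be used to rebuild a randomly initialised model of the same shape:

```python
from transformers import AlbertConfig, AlbertForMaskedLM

# load the config above from the Hub and build an untrained model with the same architecture
config = AlbertConfig.from_pretrained("kuisailab/albert-base-arabic")
model = AlbertForMaskedLM(config)
print(f"{model.num_parameters():,} parameters")
```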
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:5561ab780a507925c89d250baaf09c41d37d0489ce17710b60830bd479f31790
+ size 47256230
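This entry is a Git LFS pointer rather than the weights themselves. One way to fetch the actual checkpoint (an editorial sketch using `huggingface_hub`, not part of the commit):

```python
from huggingface_hub import hf_hub_download

# download the ~47 MB PyTorch checkpoint referenced by the LFS pointer above
path = hf_hub_download(repo_id="kuisailab/albert-base-arabic", filename="pytorch_model.bin")
print(path)
```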
special_tokens_map.json ADDED
@@ -0,0 +1 @@
+ {"bos_token": "[CLS]", "eos_token": "[SEP]", "unk_token": "<unk>", "sep_token": "[SEP]", "pad_token": "<pad>", "cls_token": "[CLS]", "mask_token": "[MASK]"}
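To confirm these mappings at runtime (an editorial sketch, not part of the commit):

```python
from transformers import AutoTokenizer

# the loaded tokenizer should expose the special tokens listed above
tok = AutoTokenizer.from_pretrained("kuisailab/albert-base-arabic")
print(tok.cls_token, tok.sep_token, tok.mask_token, tok.unk_token, tok.pad_token)
# expected: [CLS] [SEP] [MASK] <unk> <pad>
```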
spiece.model ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:40f25b5aae5c42a4089292e6616f37bca7b5a4f08608678b16ba5a55c9f050d7
+ size 860481
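`spiece.model` is the SentencePiece vocabulary used by the ALBERT tokenizer. A minimal sketch of inspecting it directly (editorial addition; assumes the file has been downloaded locally, e.g. with `hf_hub_download` as above):

```python
import sentencepiece as spm

# load the SentencePiece model and check its vocabulary size (30000, matching config.json)
sp = spm.SentencePieceProcessor(model_file="spiece.model")
print(sp.get_piece_size())
print(sp.encode("مرحبا بالعالم", out_type=str))  # subword pieces for "Hello, world"
```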
tf_model.h5 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:734d5e2809013b58ed2e8cc0cd629b4fa3608826b30688261a05a7e71e17ed1f
+ size 63048368
tokenizer_config.json ADDED
@@ -0,0 +1 @@
+ {"special_tokens_map_file": null, "full_tokenizer_file": null}