retarfi committed on
Commit
07526c2
·
1 Parent(s): 1513ca9
README.md ADDED
@@ -0,0 +1,68 @@
+ ---
+
+ language: ja
+
+ license: cc-by-sa-4.0
+
+ datasets:
+
+ - wikipedia
+
+ widget:
+
+ - text: 東京大学で[MASK]の研究をしています。
+
+ ---
+
+ # ELECTRA base Japanese generator
+
+ This is an [ELECTRA](https://github.com/google-research/electra) model pretrained on Japanese text.
+
+ The pretraining code is available at [retarfi/language-pretraining](https://github.com/retarfi/language-pretraining/tree/v1.0).
+
+ ## Model architecture
+
+ The model architecture is the same as ELECTRA base in the [original ELECTRA implementation](https://github.com/google-research/electra): 12 layers, 256-dimensional hidden states, and 4 attention heads.
+
+ ## Training Data
+
+ The models are trained on the Japanese version of Wikipedia.
+
+ The training corpus is generated from the Japanese version of Wikipedia, using the Wikipedia dump file as of June 1, 2021.
+
+ The corpus file is 2.9 GB and consists of approximately 20M sentences.
+
+ ## Tokenization
+
+ The texts are first tokenized by MeCab with the IPA dictionary and then split into subwords by the WordPiece algorithm.
+
+ The vocabulary size is 32768.
+
+ ## Training
+
+ The models are trained with the same configuration as ELECTRA base in the [original ELECTRA paper](https://arxiv.org/abs/2003.10555), except for size: 512 tokens per instance, 256 instances per batch, and 766k training steps.
+
+ The size of the generator is the same as that of the discriminator.
+
+ ## Citation
+
+ **Another paper on this pretrained model is forthcoming. Be sure to check here again before you cite.**
+
+ ```
+ @inproceedings{bert_electra_japanese,
+ title = {Construction and Validation of a Pre-Trained Language Model
+ Using Financial Documents},
+ author = {Masahiro Suzuki and Hiroki Sakaji and Masanori Hirano and Kiyoshi Izumi},
+ month = {oct},
+ year = {2021},
+ booktitle = {Proceedings of JSAI Special Interest Group on Financial Informatics (SIG-FIN) 27}
+ }
+ ```
+
+ ## Licenses
+
+ The pretrained models are distributed under the terms of the [Creative Commons Attribution-ShareAlike 4.0](https://creativecommons.org/licenses/by-sa/4.0/) license.
+
+ ## Acknowledgments
+
+ This work was supported by JSPS KAKENHI Grant Number JP21K12010.
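The Training section of the README quotes a sequence length, a batch size, and a step count. The sketch below only multiplies those three numbers to show the upper bound on tokens processed during pretraining. It assumes that "766k" means exactly 766,000 steps and that every instance is filled to 512 tokens; neither is stated in the commit, and the function names are illustrative.

```python
# Values quoted in the README's Training section.
TOKENS_PER_INSTANCE = 512    # maximum sequence length
INSTANCES_PER_BATCH = 256
TRAINING_STEPS = 766_000     # assumption: "766k" read as exactly 766,000


def tokens_per_batch(seq_len: int, batch_size: int) -> int:
    """Upper bound on tokens in one batch (every instance fully packed)."""
    return seq_len * batch_size


def token_budget(seq_len: int, batch_size: int, steps: int) -> int:
    """Upper bound on tokens processed over the whole run."""
    return tokens_per_batch(seq_len, batch_size) * steps


if __name__ == "__main__":
    per_batch = tokens_per_batch(TOKENS_PER_INSTANCE, INSTANCES_PER_BATCH)
    total = token_budget(TOKENS_PER_INSTANCE, INSTANCES_PER_BATCH, TRAINING_STEPS)
    print(f"tokens per batch: {per_batch:,}")
    print(f"token budget:     {total:,}")
```

Under these assumptions each batch holds at most 131,072 tokens, and the whole run processes at most about 100 billion tokens.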
config.json ADDED
@@ -0,0 +1,27 @@
+ {
+   "architectures": [
+     "ElectraForMaskedLM"
+   ],
+   "attention_probs_dropout_prob": 0.1,
+   "embedding_size": 768,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 256,
+   "initializer_range": 0.02,
+   "intermediate_size": 1024,
+   "layer_norm_eps": 1e-12,
+   "max_position_embeddings": 512,
+   "model_type": "electra",
+   "num_attention_heads": 4,
+   "num_hidden_layers": 12,
+   "pad_token_id": 0,
+   "tokenizer_class": "BertJapaneseTokenizer",
+   "position_embedding_type": "absolute",
+   "summary_activation": "gelu",
+   "summary_last_dropout": 0.1,
+   "summary_type": "first",
+   "summary_use_proj": true,
+   "transformers_version": "4.7.0",
+   "type_vocab_size": 2,
+   "vocab_size": 32768
+ }
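As a rough sanity check on the configuration above, the following sketch derives the per-head dimension and estimates the parameter count from the config values alone. The layer layout is an assumption: a BERT-style encoder with an embedding-to-hidden projection, plus a masked-LM head whose output weights are tied to the input embeddings. The function names are illustrative and nothing here is taken from the repository's code.

```python
# Subset of the config.json values shown above.
CONFIG = {
    "embedding_size": 768,
    "hidden_size": 256,
    "intermediate_size": 1024,
    "max_position_embeddings": 512,
    "num_attention_heads": 4,
    "num_hidden_layers": 12,
    "type_vocab_size": 2,
    "vocab_size": 32768,
}


def head_dim(cfg: dict) -> int:
    """Dimension of each attention head."""
    return cfg["hidden_size"] // cfg["num_attention_heads"]


def estimate_parameters(cfg: dict) -> int:
    """Rough parameter count under the assumed BERT-style layout."""
    e, h, i = cfg["embedding_size"], cfg["hidden_size"], cfg["intermediate_size"]
    vocab = cfg["vocab_size"]
    # Word, position and token-type embeddings plus one LayerNorm (weight + bias).
    embeddings = (vocab + cfg["max_position_embeddings"] + cfg["type_vocab_size"]) * e + 2 * e
    # Linear projection from embedding size to hidden size, needed only if they differ.
    projection = e * h + h if e != h else 0
    # Query, key, value and output projections plus one LayerNorm.
    attention = 4 * (h * h + h) + 2 * h
    # Two feed-forward linear layers plus one LayerNorm.
    feed_forward = (h * i + i) + (i * h + h) + 2 * h
    encoder = cfg["num_hidden_layers"] * (attention + feed_forward)
    # MLM head: dense hidden -> embedding, LayerNorm, output bias (weights assumed tied).
    mlm_head = (h * e + e) + 2 * e + vocab
    return embeddings + projection + encoder + mlm_head


if __name__ == "__main__":
    params = estimate_parameters(CONFIG)
    print(f"head dimension: {head_dim(CONFIG)}")
    print(f"estimated parameters: {params:,}")
    print(f"estimated float32 size: {params * 4:,} bytes")
```

Under these assumptions the estimate is about 35.5M parameters, or roughly 141.9 MB in float32, which is within 0.1% of the 141,960,100-byte `pytorch_model.bin` added in this commit. The embedding tables account for most of that total, since the encoder itself is only 256 dimensions wide.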
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:7c58bfa60829cdaa825662f71d565b1ff9fc964020fab94108cadebf82cf099b
+ size 141960100
special_tokens_map.json ADDED
@@ -0,0 +1 @@
+ {"unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]"}
tokenizer_config.json ADDED
@@ -0,0 +1 @@
+ {"unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]", "do_lower_case": false, "do_word_tokenize": true, "do_subword_tokenize": true, "word_tokenizer_type": "mecab", "subword_tokenizer_type": "wordpiece", "never_split": null, "mecab_kwargs": {"mecab_dic": "ipadic"}, "tokenize_chinese_chars": false}
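The tokenizer configuration describes a two-stage pipeline: word segmentation (`"word_tokenizer_type": "mecab"`) followed by subword splitting (`"subword_tokenizer_type": "wordpiece"`). The sketch below illustrates only the second stage, as a greedy longest-match-first WordPiece pass. The word list is a hand-written stand-in for MeCab output and the vocabulary is a toy one; both are hypothetical and not drawn from the repository's `vocab.txt`.

```python
# Toy vocabulary (hypothetical; the real vocab.txt has 32768 entries).
TOY_VOCAB = {"東京", "##大学", "で", "研究", "を", "し", "て", "い", "ます", "。"}

# Hand-written word segmentation standing in for MeCab output (assumption).
PRE_TOKENIZED = ["東京大学", "で", "研究", "を", "し", "て", "い", "ます", "。"]


def wordpiece(word, vocab, unk="[UNK]", prefix="##"):
    """Split one word into subwords by greedy longest-match-first lookup."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = prefix + candidate  # mark word-internal pieces
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece matches at this position: the whole word becomes unknown.
            return [unk]
        pieces.append(match)
        start = end
    return pieces


def tokenize(words, vocab):
    """Apply WordPiece to each pre-segmented word and flatten the result."""
    return [piece for word in words for piece in wordpiece(word, vocab)]


if __name__ == "__main__":
    print(tokenize(PRE_TOKENIZED, TOY_VOCAB))
```

With this toy vocabulary, `東京大学` is split into `東京` and `##大学`, while a word with no matching pieces, such as `金融`, maps to `[UNK]`.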
vocab.txt ADDED
The diff for this file is too large to render. See raw diff