Update README.md

---
language: Chinese
datasets: CLUECorpusSmall
widget:
- text: "北京是[MASK]国的首都。"
---

You can download the 24 Chinese RoBERTa miniatures either from the [UER-py Github page](https://github.com/dbiir/UER-py/), or via HuggingFace from the links below:

|          | H=128                     | H=256                     | H=512                       | H=768                       |
| -------- | :-----------------------: | :-----------------------: | :-------------------------: | :-------------------------: |
| **L=2**  | [**2/128 (Tiny)**][2_128] | [2/256]                   | [2/512]                     | [2/768]                     |
| **L=4**  | [4/128]                   | [**4/256 (Mini)**][4_256] | [**4/512 (Small)**][4_512]  | [4/768]                     |
| **L=6**  | [6/128]                   | [6/256]                   | [6/512]                     | [6/768]                     |
| **L=8**  | [8/128]                   | [8/256]                   | [**8/512 (Medium)**][8_512] | [8/768]                     |
| **L=10** | [10/128]                  | [10/256]                  | [10/512]                    | [10/768]                    |
| **L=12** | [12/128]                  | [12/256]                  | [12/512]                    | [**12/768 (Base)**][12_768] |
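
Each link above points to a standard HuggingFace model repository, so any of the miniatures can be loaded with the `transformers` library. Below is a minimal sketch (not the card's full usage section), taking the Medium model `uer/chinese_roberta_L-8_H-512` and the widget sentence above as an example:

```
from transformers import pipeline

# Load the 8-layer / 512-hidden "Medium" miniature from the HuggingFace Hub
# and fill the masked token in the widget example.
unmasker = pipeline("fill-mask", model="uer/chinese_roberta_L-8_H-512")
print(unmasker("北京是[MASK]国的首都。"))
```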

Here are the scores on the development set of six Chinese tasks:

| Model          | Score | douban | chnsenticorp | lcqmc | tnews(CLUE) | iflytek(CLUE) | ocnli(CLUE) |
| -------------- | :---: | :----: | :----------: | :---: | :---------: | :-----------: | :---------: |
| RoBERTa-Tiny   | 72.3  | 83.0   | 91.4         | 81.8  | 62.0        | 55.0          | 60.3        |
| RoBERTa-Mini   | 75.7  | 84.8   | 93.7         | 86.1  | 63.9        | 58.3          | 67.4        |
| RoBERTa-Small  | 76.8  | 86.5   | 93.4         | 86.5  | 65.1        | 59.4          | 69.7        |
| RoBERTa-Medium | 77.8  | 87.6   | 94.8         | 88.1  | 65.6        | 59.5          | 71.2        |
| RoBERTa-Base   | 79.5  | 89.1   | 95.2         | 89.2  | 67.0        | 60.9          | 75.5        |

For each task, we selected the best fine-tuning hyperparameters from the lists below (the search grid is sketched after the list), and trained with a sequence length of 128:

- epochs: 3, 5, 8
- batch sizes: 32, 64
- learning rates: 3e-5, 1e-4, 3e-4
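
Below is a small sketch that only enumerates that search space (18 combinations in total); the fine-tuning runs themselves are launched separately with UER-py and are not reproduced here:

```
from itertools import product

# The hyperparameter grid searched for each downstream task.
epochs = [3, 5, 8]
batch_sizes = [32, 64]
learning_rates = [3e-5, 1e-4, 3e-4]

for n_epochs, batch_size, lr in product(epochs, batch_sizes, learning_rates):
    print(f"epochs={n_epochs}, batch_size={batch_size}, learning_rate={lr}")
```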

## Training data

[CLUECorpusSmall](https://github.com/CLUEbenchmark/CLUECorpus2020/) is used as training data. We found that models pre-trained on CLUECorpusSmall outperform those pre-trained on CLUECorpus2020, although CLUECorpus2020 is much larger than CLUECorpusSmall.
103 |
|
104 |
## Training procedure
|
105 |
|
|
|
108 |
Taking the case of RoBERTa-Medium
|
109 |
|
110 |
Stage1:
|
111 |
+
|
112 |
```
|
113 |
python3 preprocess.py --corpus_path corpora/cluecorpussmall.txt \
|
114 |
--vocab_path models/google_zh_vocab.txt \
|
115 |
+
--dataset_path cluecorpussmall_seq128_dataset.pt \
|
116 |
+
--processes_num 32 --seq_length 128 \
|
117 |
+
--dynamic_masking --target mlm
|
118 |
```
|
119 |
+
|
120 |
```
|
121 |
python3 pretrain.py --dataset_path cluecorpussmall_seq128_dataset.pt \
|
122 |
--vocab_path models/google_zh_vocab.txt \
|
123 |
+
--config_path models/bert_medium_config.json \
|
124 |
+
--output_model_path models/cluecorpussmall_roberta_medium_seq128_model.bin \
|
125 |
+
--world_size 8 --gpu_ranks 0 1 2 3 4 5 6 7 \
|
126 |
+
--total_steps 1000000 --save_checkpoint_steps 100000 --report_steps 50000 \
|
127 |
+
--learning_rate 1e-4 --batch_size 64 \
|
128 |
+
--tie_weights --embedding word_pos_seg --encoder transformer --mask fully_visible --target mlm
|
129 |
```
|

Stage 2: continue pre-training for 250,000 more steps with a sequence length of 512, starting from the Stage 1 checkpoint.

```
python3 preprocess.py --corpus_path corpora/cluecorpussmall.txt \
                      --vocab_path models/google_zh_vocab.txt \
                      --dataset_path cluecorpussmall_seq512_dataset.pt \
                      --processes_num 32 --seq_length 512 \
                      --dynamic_masking --target mlm
```

```
python3 pretrain.py --dataset_path cluecorpussmall_seq512_dataset.pt \
                    --pretrained_model_path models/cluecorpussmall_roberta_medium_seq128_model.bin-1000000 \
                    --vocab_path models/google_zh_vocab.txt \
                    --config_path models/bert_medium_config.json \
                    --output_model_path models/cluecorpussmall_roberta_medium_seq512_model.bin \
                    --world_size 8 --gpu_ranks 0 1 2 3 4 5 6 7 \
                    --total_steps 250000 --save_checkpoint_steps 50000 --report_steps 10000 \
                    --learning_rate 5e-5 --batch_size 16 \
                    --tie_weights --embedding word_pos_seg --encoder transformer --mask fully_visible --target mlm
```

Finally, we convert the pre-trained model into HuggingFace's format:

```
python3 scripts/convert_bert_from_uer_to_huggingface.py --input_model_path models/cluecorpussmall_roberta_medium_seq512_model.bin-250000 \
                                                        --output_model_path pytorch_model.bin \
                                                        --layers_num 8 --target mlm
```
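
As a quick sanity check (a sketch only, assuming the converted checkpoint uses the standard HuggingFace BERT parameter naming), the converted `pytorch_model.bin` can be inspected to confirm it contains the eight encoder layers expected for the Medium configuration:

```
import torch

# Load the converted checkpoint on CPU and count the distinct encoder layers.
# The key layout "bert.encoder.layer.<idx>..." is the standard HuggingFace
# BERT naming; adjust the prefix if the converted file differs.
state_dict = torch.load("pytorch_model.bin", map_location="cpu")
layer_ids = {key.split(".")[3] for key in state_dict if key.startswith("bert.encoder.layer.")}
print(f"encoder layers found: {len(layer_ids)}")  # expect 8 for RoBERTa-Medium
```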

### BibTeX entry and citation info

[4_256]: https://huggingface.co/uer/chinese_roberta_L-4_H-256
[4_512]: https://huggingface.co/uer/chinese_roberta_L-4_H-512
[8_512]: https://huggingface.co/uer/chinese_roberta_L-8_H-512
[12_768]: https://huggingface.co/uer/chinese_roberta_L-12_H-768