Commit 8503c3f · committed by uer · 1 Parent(s): c2b1156

Create README.md

Files changed (1)
  1. README.md +100 -0
README.md ADDED
@@ -0,0 +1,100 @@
+ ---
+ language: Chinese
+ widget:
+ - text: "北京上个月召开了两会"
+
+ ---
+
+ # Chinese RoBERTa-Base Models for Text Classification
+
+ ## Model description
+
+ This is a set of five Chinese RoBERTa-Base classification models fine-tuned with [UER-py](https://arxiv.org/abs/1909.05658). You can download them either from the [UER-py Modelzoo page](https://github.com/dbiir/UER-py/wiki/Modelzoo) (in UER-py format) or from HuggingFace via the links below:
+
+ | Dataset | Link |
+ | :-----------: | :-------------------------------------------------------: |
+ | **JD full** | [**roberta-base-finetuned-jd-full-chinese**][jd_full] |
+ | **JD binary** | [**roberta-base-finetuned-jd-binary-chinese**][jd_binary] |
+ | **Dianping** | [**roberta-base-finetuned-dianping-chinese**][dianping] |
+ | **Ifeng** | [**roberta-base-finetuned-ifeng-chinese**][ifeng] |
+ | **Chinanews** | [**roberta-base-finetuned-chinanews-chinese**][chinanews] |
+
+ ## How to use
+
+ You can use these models directly with a text classification pipeline. The example uses roberta-base-finetuned-chinanews-chinese; the input sentence means "Beijing held the Two Sessions last month":
+
+ ```python
+ >>> from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline
+ >>> model = AutoModelForSequenceClassification.from_pretrained('uer/roberta-base-finetuned-chinanews-chinese')
+ >>> tokenizer = AutoTokenizer.from_pretrained('uer/roberta-base-finetuned-chinanews-chinese')
+ >>> # 'text-classification' is the general task name; 'sentiment-analysis' is an alias for it
+ >>> text_classification = pipeline('text-classification', model=model, tokenizer=tokenizer)
+ >>> text_classification("北京上个月召开了两会")
+ [{'label': 'mainland China politics', 'score': 0.7211663722991943}]
+ ```
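
The `score` in the output above is a probability. As a minimal sketch, assuming the pipeline's default behaviour for a multi-class model (a softmax over the classifier logits, then the top label), the post-processing can be reproduced in plain Python. The logits and every label name other than `mainland China politics` are hypothetical values chosen for illustration, not the model's real outputs.

```python
import math

def softmax(logits):
    """Convert raw classifier logits into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_label(logits, id2label):
    """Return the highest-probability label in the pipeline's output shape."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return [{"label": id2label[best], "score": probs[best]}]

# Hypothetical logits and label map (illustrative only).
id2label = {0: "mainland China politics", 1: "sports", 2: "finance"}
logits = [2.0, 0.5, -1.0]
print(top_label(logits, id2label))
```

With a real model, the logits would come from `model(**tokenizer(text, return_tensors='pt')).logits` and the label map from `model.config.id2label`.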
+
+ ## Training data
+
+ Five Chinese text classification datasets are used. The JD full, JD binary, and Dianping datasets consist of user reviews with different sentiment polarities. Ifeng and Chinanews consist of the first paragraphs of news articles from different topic classes. They were collected by the [Glyph](https://github.com/zhangxiangxiao/glyph) project, and more details are given in the corresponding [paper](https://arxiv.org/abs/1708.02657).
+
+ ## Training procedure
+
+ Models are fine-tuned with [UER-py](https://github.com/dbiir/UER-py/) on [Tencent Cloud](https://cloud.tencent.com/). We fine-tune for three epochs with a sequence length of 512, starting from the pre-trained model [chinese_roberta_L-12_H-768](https://huggingface.co/uer/chinese_roberta_L-12_H-768). At the end of each epoch, the model is saved if it achieves the best performance so far on the development set. We use the same hyper-parameters for all models.
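
The checkpointing rule described above (save only when the development-set score improves) can be sketched as follows. This is an illustrative outline, not UER-py's actual code; `train_one_epoch`, `evaluate`, and `save_model` are hypothetical callables supplied by the caller, and the accuracies are made up.

```python
def fine_tune(epochs_num, train_one_epoch, evaluate, save_model):
    """Run training and keep only the best checkpoint by dev-set score."""
    best_score = float("-inf")
    for epoch in range(1, epochs_num + 1):
        train_one_epoch(epoch)
        score = evaluate(epoch)   # e.g. accuracy on the development set
        if score > best_score:    # save only on improvement
            best_score = score
            save_model(epoch)
    return best_score

# Usage with stand-in callables and made-up dev accuracies.
dev_accuracy = {1: 0.86, 2: 0.89, 3: 0.88}
saved_epochs = []
best = fine_tune(
    epochs_num=3,
    train_one_epoch=lambda epoch: None,
    evaluate=lambda epoch: dev_accuracy[epoch],
    save_model=saved_epochs.append,
)
print(best, saved_epochs)  # prints: 0.89 [1, 2]
```

In this run the third epoch scores lower than the second, so no checkpoint is written for it and the saved model is the one from epoch 2.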
+
+ Taking roberta-base-finetuned-chinanews-chinese as an example:
+
+ ```
+ python3 run_classifier.py --pretrained_model_path models/cluecorpussmall_roberta_base_seq512_model.bin-250000 \
+                           --vocab_path models/google_zh_vocab.txt \
+                           --train_path datasets/glyph/chinanews/train.tsv \
+                           --dev_path datasets/glyph/chinanews/dev.tsv \
+                           --output_model_path models/chinanews_classifier_model.bin \
+                           --learning_rate 3e-5 --batch_size 32 --epochs_num 3 --seq_length 512 \
+                           --embedding word_pos_seg --encoder transformer --mask fully_visible
+ ```
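
Because the same hyper-parameters are used for every model, the command can be generated per dataset. The sketch below only builds and prints the command strings. It assumes the other four datasets sit in directories named `jd_full`, `jd_binary`, `dianping`, and `ifeng` under `datasets/glyph/`, and that output files follow the `<name>_classifier_model.bin` pattern; those names are assumptions modelled on the Chinanews example, not paths confirmed by the source.

```python
import shlex

# Directory names other than "chinanews" are assumed, not taken from the source.
DATASETS = ["jd_full", "jd_binary", "dianping", "ifeng", "chinanews"]

def build_command(dataset):
    """Return the run_classifier.py argument list for one dataset."""
    return [
        "python3", "run_classifier.py",
        "--pretrained_model_path",
        "models/cluecorpussmall_roberta_base_seq512_model.bin-250000",
        "--vocab_path", "models/google_zh_vocab.txt",
        "--train_path", f"datasets/glyph/{dataset}/train.tsv",
        "--dev_path", f"datasets/glyph/{dataset}/dev.tsv",
        "--output_model_path", f"models/{dataset}_classifier_model.bin",
        "--learning_rate", "3e-5", "--batch_size", "32",
        "--epochs_num", "3", "--seq_length", "512",
        "--embedding", "word_pos_seg", "--encoder", "transformer",
        "--mask", "fully_visible",
    ]

# Print one shell-quoted command line per dataset.
for name in DATASETS:
    print(shlex.join(build_command(name)))
```

To launch a run instead of printing it, the argument list can be passed to `subprocess.run(build_command(name), check=True)` from the UER-py repository root.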
+
+ Finally, we convert the fine-tuned model into Hugging Face's format:
+
+ ```
+ python3 scripts/convert_bert_text_classification_from_uer_to_huggingface.py --input_model_path models/chinanews_classifier_model.bin \
+                                                                             --output_model_path pytorch_model.bin \
+                                                                             --layers_num 12
+ ```
+
+ ### BibTeX entry and citation info
+
+ ```
+ @article{devlin2018bert,
+   title={BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding},
+   author={Devlin, Jacob and Chang, Ming-Wei and Lee, Kenton and Toutanova, Kristina},
+   journal={arXiv preprint arXiv:1810.04805},
+   year={2018}
+ }
+
+ @article{liu2019roberta,
+   title={RoBERTa: A Robustly Optimized BERT Pretraining Approach},
+   author={Liu, Yinhan and Ott, Myle and Goyal, Naman and Du, Jingfei and Joshi, Mandar and Chen, Danqi and Levy, Omer and Lewis, Mike and Zettlemoyer, Luke and Stoyanov, Veselin},
+   journal={arXiv preprint arXiv:1907.11692},
+   year={2019}
+ }
+
+ @article{zhang2017encoding,
+   title={Which Encoding is the Best for Text Classification in Chinese, English, Japanese and Korean?},
+   author={Zhang, Xiang and LeCun, Yann},
+   journal={arXiv preprint arXiv:1708.02657},
+   year={2017}
+ }
+
+ @article{zhao2019uer,
+   title={UER: An Open-Source Toolkit for Pre-training Models},
+   author={Zhao, Zhe and Chen, Hui and Zhang, Jinbin and Zhao, Xin and Liu, Tao and Lu, Wei and Chen, Xi and Deng, Haotang and Ju, Qi and Du, Xiaoyong},
+   journal={EMNLP-IJCNLP 2019},
+   pages={241},
+   year={2019}
+ }
+ ```
+
+ [jd_full]: https://huggingface.co/uer/roberta-base-finetuned-jd-full-chinese
+ [jd_binary]: https://huggingface.co/uer/roberta-base-finetuned-jd-binary-chinese
+ [dianping]: https://huggingface.co/uer/roberta-base-finetuned-dianping-chinese
+ [ifeng]: https://huggingface.co/uer/roberta-base-finetuned-ifeng-chinese
+ [chinanews]: https://huggingface.co/uer/roberta-base-finetuned-chinanews-chinese