julien-c HF staff committed on
Commit abdbae8 · 1 Parent(s): d83b316

Migrate model card from transformers-repo


Read announcement at https://discuss.huggingface.co/t/announcement-all-model-cards-will-be-migrated-to-hf-co-model-repos/2755
Original file history: https://github.com/huggingface/transformers/commits/master/model_cards/HooshvareLab/bert-fa-base-uncased/README.md

Files changed (1)
  1. README.md +147 -0
README.md ADDED
@@ -0,0 +1,147 @@
+ ---
+ language: fa
+ tags:
+ - bert-fa
+ - bert-persian
+ - persian-lm
+ license: apache-2.0
+ ---
+
+ # ParsBERT (v2.0)
+ A Transformer-based Model for Persian Language Understanding
+
+
+ We rebuilt the vocabulary and fine-tuned ParsBERT v1.1 on new Persian corpora so that ParsBERT can be applied in other domains.
+ Please follow the [ParsBERT](https://github.com/hooshvare/parsbert) repo for the latest information about previous and current models.
+
+ ## Introduction
+
+ ParsBERT is a monolingual language model based on Google’s BERT architecture. The model is pre-trained on large Persian corpora covering various writing styles and subjects (e.g., scientific, novels, news), totaling more than `3.9M` documents, `73M` sentences, and `1.3B` words.
+
+ Paper presenting ParsBERT: [arXiv:2005.12515](https://arxiv.org/abs/2005.12515)
+
+ ## Intended uses & limitations
+
+ You can use the raw model for either masked language modeling or next sentence prediction, but it's mostly intended to
+ be fine-tuned on a downstream task. See the [model hub](https://huggingface.co/models?search=bert-fa) to look for
+ fine-tuned versions on a task that interests you.
+
+
+ ### How to use
+
+ #### TensorFlow 2.0
+
+ ```python
+ from transformers import AutoConfig, AutoTokenizer, TFAutoModel
+
+ config = AutoConfig.from_pretrained("HooshvareLab/bert-fa-base-uncased")
+ tokenizer = AutoTokenizer.from_pretrained("HooshvareLab/bert-fa-base-uncased")
+ model = TFAutoModel.from_pretrained("HooshvareLab/bert-fa-base-uncased")
+
+ text = "ما در هوشواره معتقدیم با انتقال صحیح دانش و آگاهی، همه افراد میتوانند از ابزارهای هوشمند استفاده کنند. شعار ما هوش مصنوعی برای همه است."
+ print(tokenizer.tokenize(text))  # split the sentence into WordPiece tokens
+
+ # Output: ['ما', 'در', 'هوش', '##واره', 'معتقدیم', 'با', 'انتقال', 'صحیح', 'دانش', 'و', 'اگاهی', '،', 'همه', 'افراد', 'میتوانند', 'از', 'ابزارهای', 'هوشمند', 'استفاده', 'کنند', '.', 'شعار', 'ما', 'هوش', 'مصنوعی', 'برای', 'همه', 'است', '.']
+ ```
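The `##` prefix in the tokenizer output above marks a WordPiece continuation piece: `هوشواره` is not stored as a whole word, so it is split into `هوش` and `##واره`. The following is a minimal sketch of greedy longest-match-first WordPiece segmentation. It assumes a tiny hypothetical vocabulary (`TOY_VOCAB`) instead of the model's real one, and `wordpiece` is an illustrative helper, not a `transformers` function.

```python
# Toy vocabulary for illustration only (assumption): the real ParsBERT
# vocabulary is far larger and is loaded by AutoTokenizer.
TOY_VOCAB = {"ما", "در", "هوش", "##واره", "##مند", "[UNK]"}


def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first segmentation of a single word."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece matched: treat the whole word as unknown
        pieces.append(match)
        start = end
    return pieces


print(wordpiece("هوشواره", TOY_VOCAB))
print(wordpiece("هوشمند", TOY_VOCAB))
```

With this toy vocabulary the first call reproduces the `هوش` / `##واره` split seen in the tokenizer output. The real tokenizer also normalizes the text and splits punctuation before this step.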
+
+ #### PyTorch
+
+ ```python
+ from transformers import AutoConfig, AutoTokenizer, AutoModel
+
+ config = AutoConfig.from_pretrained("HooshvareLab/bert-fa-base-uncased")
+ tokenizer = AutoTokenizer.from_pretrained("HooshvareLab/bert-fa-base-uncased")
+ model = AutoModel.from_pretrained("HooshvareLab/bert-fa-base-uncased")
+ ```
+
+ ## Training
+
+ ParsBERT was trained on a large collection of public corpora ([Persian Wikidumps](https://dumps.wikimedia.org/fawiki/), [MirasText](https://github.com/miras-tech/MirasText)) and six other manually crawled text datasets from various types of websites ([BigBang Page](https://bigbangpage.com/) `scientific`, [Chetor](https://www.chetor.com/) `lifestyle`, [Eligasht](https://www.eligasht.com/Blog/) `itinerary`, [Digikala](https://www.digikala.com/mag/) `digital magazine`, [Ted Talks](https://www.ted.com/talks) `general conversational`, Books `novels, storybooks, short stories from old to the contemporary era`).
+
+ As part of the ParsBERT methodology, extensive pre-processing combining POS tagging and WordPiece segmentation was carried out to bring the corpora into a proper format.
+
+ ## Goals
+ The training objectives reached the following values after 300k steps.
+
+ ``` bash
+ ***** Eval results *****
+ global_step = 300000
+ loss = 1.4392426
+ masked_lm_accuracy = 0.6865794
+ masked_lm_loss = 1.4469004
+ next_sentence_accuracy = 1.0
+ next_sentence_loss = 6.534152e-05
+ ```
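For readers who prefer perplexity to cross-entropy, the losses in the log above can be converted with a short calculation. This sketch assumes the reported losses are mean cross-entropies in nats (natural logarithm). That is a common convention for BERT training scripts, but the log does not state it.

```python
import math

# Values copied from the eval log above.
masked_lm_loss = 1.4469004
next_sentence_loss = 6.534152e-05

# Assumption: both losses are mean cross-entropies in nats, so exp(loss)
# gives the corresponding perplexity.
masked_lm_perplexity = math.exp(masked_lm_loss)
next_sentence_perplexity = math.exp(next_sentence_loss)

print(f"masked LM perplexity: {masked_lm_perplexity:.2f}")
print(f"next sentence perplexity: {next_sentence_perplexity:.4f}")
```

Under that assumption the masked-LM perplexity comes out at roughly 4.25, meaning the model is on average about as uncertain as a uniform choice among four or so candidate tokens at each masked position.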
+
+
+ ## Derivative models
+
+ ### Base Config
+
+ #### ParsBERT v2.0 Model
+ - [HooshvareLab/bert-fa-base-uncased](https://huggingface.co/HooshvareLab/bert-fa-base-uncased)
+
+ #### ParsBERT v2.0 Sentiment Analysis
+ - [HooshvareLab/bert-fa-base-uncased-sentiment-digikala](https://huggingface.co/HooshvareLab/bert-fa-base-uncased-sentiment-digikala)
+ - [HooshvareLab/bert-fa-base-uncased-sentiment-snappfood](https://huggingface.co/HooshvareLab/bert-fa-base-uncased-sentiment-snappfood)
+ - [HooshvareLab/bert-fa-base-uncased-sentiment-deepsentipers-binary](https://huggingface.co/HooshvareLab/bert-fa-base-uncased-sentiment-deepsentipers-binary)
+ - [HooshvareLab/bert-fa-base-uncased-sentiment-deepsentipers-multi](https://huggingface.co/HooshvareLab/bert-fa-base-uncased-sentiment-deepsentipers-multi)
+
+ #### ParsBERT v2.0 Text Classification
+ - [HooshvareLab/bert-fa-base-uncased-clf-digimag](https://huggingface.co/HooshvareLab/bert-fa-base-uncased-clf-digimag)
+ - [HooshvareLab/bert-fa-base-uncased-clf-persiannews](https://huggingface.co/HooshvareLab/bert-fa-base-uncased-clf-persiannews)
+
+ #### ParsBERT v2.0 NER
+ - [HooshvareLab/bert-fa-base-uncased-ner-peyma](https://huggingface.co/HooshvareLab/bert-fa-base-uncased-ner-peyma)
+ - [HooshvareLab/bert-fa-base-uncased-ner-arman](https://huggingface.co/HooshvareLab/bert-fa-base-uncased-ner-arman)
+
+
+ ## Eval results
+
+ ParsBERT is evaluated on three NLP downstream tasks: Sentiment Analysis (SA), Text Classification, and Named Entity Recognition (NER). Because existing resources were insufficient, two large datasets for SA and two for text classification were manually composed for this purpose; they are available for public use and benchmarking. ParsBERT outperformed all other language models, including multilingual BERT and other hybrid deep learning models for all tasks, improving the state-of-the-art performance in Persian language modeling. In the tables below, an asterisk marks the highest score in each row.
+
+
+ ### Sentiment Analysis (SA) Task
+
+ | Dataset | ParsBERT v2 | ParsBERT v1 | mBERT | DeepSentiPers |
+ |:------------------------:|:-----------:|:-----------:|:-----:|:-------------:|
+ | Digikala User Comments | 81.72 | 81.74* | 80.74 | - |
+ | SnappFood User Comments | 87.98 | 88.12* | 87.87 | - |
+ | SentiPers (Multi Class) | 71.31* | 71.11 | - | 69.33 |
+ | SentiPers (Binary Class) | 92.42* | 92.13 | - | 91.98 |
+
+
+ ### Text Classification (TC) Task
+
+ | Dataset | ParsBERT v2 | ParsBERT v1 | mBERT |
+ |:-----------------:|:-----------:|:-----------:|:-----:|
+ | Digikala Magazine | 93.65* | 93.59 | 90.72 |
+ | Persian News | 97.44* | 97.19 | 95.79 |
+
+
+ ### Named Entity Recognition (NER) Task
+
+ | Dataset | ParsBERT v2 | ParsBERT v1 | mBERT | MorphoBERT | Beheshti-NER | LSTM-CRF | Rule-Based CRF | BiLSTM-CRF |
+ |:-------:|:-----------:|:-----------:|:-----:|:----------:|:------------:|:--------:|:--------------:|:----------:|
+ | PEYMA | 93.40* | 93.10 | 86.64 | - | 90.59 | - | 84.00 | - |
+ | ARMAN | 99.84* | 98.79 | 95.89 | 89.9 | 84.03 | 86.55 | - | 77.45 |
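To make the comparison with multilingual BERT concrete, the sketch below computes the margin of ParsBERT v2 over mBERT for every row of the three tables above where both scores are reported. The scores are copied from the tables; the `SCORES` dictionary and the `margin_over_mbert` helper are illustrative names introduced here, not part of the ParsBERT code.

```python
# (ParsBERT v2, mBERT) scores copied from the tables above. Rows without
# an mBERT score (the two SentiPers datasets) are left out.
SCORES = {
    "Digikala User Comments": (81.72, 80.74),
    "SnappFood User Comments": (87.98, 87.87),
    "Digikala Magazine": (93.65, 90.72),
    "Persian News": (97.44, 95.79),
    "PEYMA": (93.40, 86.64),
    "ARMAN": (99.84, 95.89),
}


def margin_over_mbert(scores):
    """Return {dataset: ParsBERT v2 score minus mBERT score}, in points."""
    return {name: round(v2 - mbert, 2) for name, (v2, mbert) in scores.items()}


margins = margin_over_mbert(SCORES)
# Print the datasets from the largest margin to the smallest.
for name, delta in sorted(margins.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<26} +{delta:.2f}")
```

The margins are positive on all six datasets, with the largest gap on the PEYMA NER dataset and the smallest on SnappFood User Comments.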
+
+
+
+
+ ### BibTeX entry and citation info
+
+ Please cite ParsBERT in publications as follows:
+
+ ```bibtex
+ @article{ParsBERT,
+   title={ParsBERT: Transformer-based Model for Persian Language Understanding},
+   author={Mehrdad Farahani and Mohammad Gharachorloo and Marzieh Farahani and Mohammad Manthouri},
+   journal={ArXiv},
+   year={2020},
+   volume={abs/2005.12515}
+ }
+ ```
+
+ ## Questions?
+ Post a GitHub issue on the [ParsBERT Issues](https://github.com/hooshvare/parsbert/issues) page.