MarcosDib committed on
Commit
e12ab8c
1 Parent(s): c795b59

Update README.md

Files changed (1)
  1. README.md +57 -130
README.md CHANGED
@@ -66,31 +66,28 @@ In this context, two approaches were considered:
66
  - Pre-training word embeddings using similar datasets for text classification;
67
  - Using transformers and attention mechanisms (Longformer) to create contextualized embeddings.
68
 
69
- The detailed release history can be found [here](https://huggingface.co/unb-lamfo-nlp-mcti) on the Hugging Face Hub.
70
 
71
- The models that use Word2Vec and Longformer also need to be loaded, and their weights are as follows:
72
 
73
- Longformer: 10.88 GB
74
 
75
- Word2Vec: 56.1 MB
76
-
77
- Table 1:
78
- | Model | #params | Language |
79
- |------------------------------|:-------:|:--------:|
80
- | [`mcti-base-uncased`] | 110M | English |
81
- | [`mcti-large-uncased`] | 340M | English |
82
- | [`mcti-base-cased`] | 110M | English |
83
- | [`mcti-large-cased`] | 110M | Chinese |
84
- | [`-base-multilingual-cased`] | 110M | Multiple |
85
-
86
- Table 2: Compatibility results (*base = labeled MCTI dataset entries)
87
- | Dataset | Compatibility |
88
- |--------------------------------------|:----------------------:|
89
- | Labeled MCTI | 100% |
90
- | Full MCTI | 100% |
91
- | BBC News Articles | 56.77% |
92
- | New unlabeled MCTI | 75.26% |
93
 
94
 
95
  ## Intended uses
96
 
@@ -107,40 +104,18 @@ generation you should look at a model like XXX.
107
  You can use this model directly with a pipeline for masked language modeling:
108
 
109
  ```python
110
- >>> from transformers import pipeline
111
- >>> unmasker = pipeline('fill-mask', model='bert-base-uncased')
112
- >>> unmasker("Hello I'm a [MASK] model.")
113
-
114
- [{'sequence': "[CLS] hello i'm a fashion model. [SEP]",
115
- 'score': 0.1073106899857521,
116
- 'token': 4827,
117
- 'token_str': 'fashion'},
118
- {'sequence': "[CLS] hello i'm a fine model. [SEP]",
119
- 'score': 0.027095865458250046,
120
- 'token': 2986,
121
- 'token_str': 'fine'}]
122
  ```
123
 
124
  Here is how to use this model to get the features of a given text in PyTorch:
125
 
126
  ```python
127
- from transformers import BertTokenizer, BertModel
128
- tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
129
- model = BertModel.from_pretrained("bert-base-uncased")
130
- text = "Replace me by any text you'd like."
131
- encoded_input = tokenizer(text, return_tensors='pt')
132
- output = model(**encoded_input)
133
  ```
134
 
135
  and in TensorFlow:
136
 
137
  ```python
138
- from transformers import BertTokenizer, TFBertModel
139
- tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
140
- model = TFBertModel.from_pretrained("bert-base-uncased")
141
- text = "Replace me by any text you'd like."
142
- encoded_input = tokenizer(text, return_tensors='tf')
143
- output = model(encoded_input)
144
  ```
145
 
146
  ### Limitations and bias
@@ -150,32 +125,8 @@ This model is uncased: it does not make a difference between english and English
150
  Even if the training data used for this model could be characterized as fairly neutral, this model can have biased
151
  predictions:
152
 
153
- ```python
154
- >>> from transformers import pipeline
155
- >>> unmasker = pipeline('fill-mask', model='bert-base-uncased')
156
- >>> unmasker("The man worked as a [MASK].")
157
-
158
- [{'sequence': '[CLS] the man worked as a carpenter. [SEP]',
159
- 'score': 0.09747550636529922,
160
- 'token': 10533,
161
- 'token_str': 'carpenter'},
162
- {'sequence': '[CLS] the man worked as a salesman. [SEP]',
163
- 'score': 0.037680890411138535,
164
- 'token': 18968,
165
- 'token_str': 'salesman'}]
166
-
167
- >>> unmasker("The woman worked as a [MASK].")
168
-
169
- [{'sequence': '[CLS] the woman worked as a nurse. [SEP]',
170
- 'score': 0.21981462836265564,
171
- 'token': 6821,
172
- 'token_str': 'nurse'},
173
- {'sequence': '[CLS] the woman worked as a cook. [SEP]',
174
- 'score': 0.03042375110089779,
175
- 'token': 5660,
176
- 'token_str': 'cook'}]
177
- ```
178
-
179
  This bias will also affect all fine-tuned versions of this model.
180
 
181
  ## Training data
@@ -186,6 +137,21 @@ the main text content (column u). Text content averages 800 tokens in length, bu
186
 
187
  ## Training procedure
188
 
 
189
  ### Preprocessing
190
 
191
  Pre-processing was used to standardize the texts for the English language, reduce the number of insignificant tokens and
@@ -202,7 +168,7 @@ to implement the [preprocessing code](https://github.com/mcti-sefip/mcti-sefip-p
202
 
203
  Several Python packages were used to develop the preprocessing code:
204
 
205
- Table 3: Python packages used
206
  | Objective | Package |
207
  |--------------------------------------------------------|--------------|
208
  | Resolve contractions and slang usage in text | [contractions](https://pypi.org/project/contractions) |
@@ -219,7 +185,7 @@ Table 3: Python packages used
219
  As detailed in the notebook on [GitHub](https://github.com/mcti-sefip/mcti-sefip-ppfcd2020/blob/pre-processamento/Pre_Processamento/MCTI_PPF_Pr%C3%A9_processamento), in the pre-processing, code was created to build and evaluate 8 (eight) different
220
  bases, derived from the base of goal 4, with the application of the methods shown in Figure 2.
221
 
222
- Table 4: Preprocessing methods evaluated
223
  | id | Experiments |
224
  |--------|------------------------------------------------------------------------|
225
  | Base | Original Texts |
@@ -240,9 +206,9 @@ stemming + stopwords removal (xp7), and Lemmatization + stopwords removal (xp8).
240
 
241
  All eight bases were evaluated to classify the eligibility of the opportunity, through the training of a shallow
242
  neural network (SNN – Shallow Neural Network). The metrics for the eight bases were evaluated. The results are
243
- shown in Table 5.
244
 
245
- Table 5: Results obtained in Preprocessing
246
| id | Experiment | accuracy | f1-score | recall | precision | Mean (s) | N_tokens | max_length |
247
  |--------|------------------------------------------------------------------------|----------|----------|--------|-----------|----------|----------|------------|
248
  | Base | Original Texts | 89,78% | 84,20% | 79,09% | 90,95% | 417,772 | 23788 | 5636 |
@@ -266,66 +232,27 @@ available on the project's GitHub with the inclusion of columns opo_pre (text) a
266
 
267
  ### Pretraining
268
 
269
- Since labeled data is scarce, word embeddings were trained in an unsupervised manner using other datasets that contain most of
270
- the words it needs to learn. The alternative was to use web scraping algorithms to acquire more unlabeled data from the same
271
- sources, which would give a higher chance of providing compatible texts. The original dataset had 357 entries, with 260 of
272
- them labeled.
273
-
274
- ## Evaluation results
275
-
276
- ### Model training with Word2Vec embeddings
277
-
278
- Now we have a pre-trained model of word2vec embeddings that has already learned relevant meanings for our classification problem.
279
- We can couple it to our classification models (Fig. 4), applying transfer learning, and then train the model with the labeled
280
- data in a supervised manner. The new coupled model can be seen in Figure 5 under word2vec model training. Table 6 shows the
281
- obtained results with related metrics. With this implementation, we achieved new levels of accuracy with 86% for the CNN
282
- architecture and 88% for the LSTM architecture.
283
-
284
- Table 6: Results from Pre-trained WE + ML models
285
- | ML Model | Accuracy | F1 Score | Precision | Recall |
286
- |:--------:|:---------:|:---------:|:---------:|:---------:|
287
- | NN | 0.8269 | 0.8545 | 0.8392 | 0.8712 |
288
- | DNN | 0.7115 | 0.7794 | 0.7255 | 0.8485 |
289
- | CNN | 0.8654 | 0.9083 | 0.8486 | 0.9773 |
290
- | LSTM | 0.8846 | 0.9139 | 0.9056 | 0.9318 |
291
-
292
- ### Transformer-based implementation
293
 
294
- Another way we used pre-trained vector representations was through a Longformer (Beltagy et al., 2020). We chose it because
295
- of the limitation of the first generation of transformers and BERT-based architectures involving the size of the sentences:
296
- a maximum of 512 tokens. The reason behind that limitation is that the self-attention mechanism scales quadratically with the
297
- input sequence length, O(n²) (Beltagy et al., 2020). The Longformer allowed processing sequences thousands of tokens long
298
- without facing the memory bottleneck of BERT-like architectures and achieved SOTA results on several benchmarks.
299
-
300
- For our text length distribution in Figure 3, if we used a BERT-based architecture with a maximum length of 512, 99 sentences
301
- would have to be truncated and would probably lose some critical information. By comparison, with the Longformer and a maximum
302
- length of 4096, only eight sentences would have their information shortened.
303
-
304
- To apply the Longformer, we used the pre-trained base (available on the link) that was previously trained on a combination
305
- of vast datasets, as shown in Figure 5 under Longformer model training. After coupling it to our classification
306
- models, we performed supervised training of the whole model. At this point, only transfer learning was applied since more
307
- computational power would be needed to fine-tune the weights. The results with related metrics can be viewed in Table 7.
308
- This approach achieved adequate accuracy scores, above 82% in all implementation architectures.
309
-
310
- Table 7: Results from Pre-trained Longformer + ML models
311
- | ML Model | Accuracy | F1 Score | Precision | Recall |
312
- |:--------:|:---------:|:---------:|:---------:|:---------:|
313
- | NN | 0.8269 | 0.8754 |0.7950 | 0.9773 |
314
- | DNN | 0.8462 | 0.8776 |0.8474 | 0.9123 |
315
- | CNN | 0.8462 | 0.8776 |0.8474 | 0.9123 |
316
- | LSTM | 0.8269 | 0.8801 |0.8571 | 0.9091 |
317
 
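A minimal sketch of the coupling described above, assuming the publicly available `allenai/longformer-base-4096` checkpoint; the classification head, layer sizes, and example text are illustrative only:

```python
# Sketch only: a frozen pre-trained Longformer produces a document representation
# that feeds a small classification head (transfer learning, no fine-tuning).
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("allenai/longformer-base-4096")
encoder = AutoModel.from_pretrained("allenai/longformer-base-4096")
for p in encoder.parameters():              # keep the Longformer weights fixed
    p.requires_grad = False

classifier = torch.nn.Sequential(           # illustrative head for eligibility
    torch.nn.Linear(encoder.config.hidden_size, 64),
    torch.nn.ReLU(),
    torch.nn.Linear(64, 1),
    torch.nn.Sigmoid(),
)

inputs = tokenizer("A research funding opportunity ...", return_tensors="pt",
                   truncation=True, max_length=4096)
with torch.no_grad():
    doc_embedding = encoder(**inputs).last_hidden_state[:, 0]   # first-token vector
score = classifier(doc_embedding)
```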
318
 
319
- ## Checkpoints
320
- - Examples
321
- - Implementation Notes
322
- - Usage Example
323
- - >>>
324
- - >>> ...
325
 
326
- ## Config
327
 
328
- ## Tokenizer
329
 
330
  ## Benchmarks
331
 
 
66
  - Pre-training word embeddings using similar datasets for text classification;
67
  - Using transformers and attention mechanisms (Longformer) to create contextualized embeddings.
68
 
69
+ The models using Word2Vec and Longformer also need to be loaded, and their weights are as follows:
70
 
71
+ Table 1: Models using Word2Vec and Longformer
72
+ | Model | Weights |
73
+ |------------------------------|:-------:|
74
+ | Longformer | 10.9 GB |
75
+ | Word2Vec | 56.1 MB |
76
 
 
77
 
78
 
79
+ | Keras Embedding + SNN | 92.47 | 88.46 | 79.66 | 100 | 0.2 | 0.7 | 1.8 |
80
+ | Keras Embedding + DNN | 89.78 | 84.41 | 77.81 | 92.57 | 1 | 1.4 | 7.6 |
81
+ | Keras Embedding + CNN | 93.01 | 89.91 | 85.18 | 95.69 | 0.4 | 1.1 | 3.2 |
82
+ | Keras Embedding + LSTM | 93.01 | 88.94 | 83.32 | 95.54 | 1.4 | 2 | 1.8 |
83
+ | Word2Vec + SNN | 89.25 | 83.82 | 74.15 | 97.10 | 1.4 | 1.2 | 9.6 |
84
+ | Word2Vec + DNN | 90.32 | 86.52 | 85.18 | 88.70 | 2 | 6.8 | 7.8 |
85
+ | Word2Vec + CNN | 92.47 | 88.42 | 80.85 | 98.72 | 1.9 | 3.4 | 4.7 |
86
+ | Word2Vec + LSTM | 89.78 | 84.36 | 75.36 | 95.81 | 2.6 | 14.3 | 1.2 |
87
+ | Longformer + SNN | 61.29 | 0 | 0 | 0 | 128 | 1.5 | 36.8 |
88
+ | Longformer + DNN | 91.93 | 87.62 | 80.37 | 97.62 | 81 | 8.4 | 12.7 |
89
+ | Longformer + CNN | 94.09 | 90.69 | 83.41 | 100 | 57 | 4.5 | 9.6 |
90
+ | Longformer + LSTM | 61.29 | 0 | 0 | 0 | 135 | 8.6 | 2.6 |
91
 
92
  ## Intended uses
93
 
 
104
  You can use this model directly with a pipeline for masked language modeling:
105
 
106
  ```python
107
+
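# Hypothetical usage sketch: this card does not state the exact checkpoint name,
# so the repository id below is a placeholder for the intended masked-language model.
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='unb-lamfo-nlp-mcti/<model-name>')
>>> unmasker("Hello I'm a [MASK] model.")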
108
  ```
109
 
110
  Here is how to use this model to get the features of a given text in PyTorch:
111
 
112
  ```python
113
+
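# Minimal sketch, assuming a BERT-compatible checkpoint; the repository id is a
# placeholder since this card does not state the exact model name.
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained('unb-lamfo-nlp-mcti/<model-name>')
model = AutoModel.from_pretrained('unb-lamfo-nlp-mcti/<model-name>')
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)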
114
  ```
115
 
116
  and in TensorFlow:
117
 
118
119
  ```
120
 
121
  ### Limitations and bias
 
125
  Even if the training data used for this model could be characterized as fairly neutral, this model can have biased
126
  predictions:
127
 
128
+ -
129
+ -
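A minimal sketch of how such biases can be probed, assuming a fill-mask checkpoint (the repository id below is a placeholder): compare the completions returned for gendered prompts.

```python
# Placeholder repository id; substitute the intended masked-language-model checkpoint.
from transformers import pipeline

unmasker = pipeline('fill-mask', model='unb-lamfo-nlp-mcti/<model-name>')
print(unmasker("The man worked as a [MASK]."))
print(unmasker("The woman worked as a [MASK]."))
```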
130
  This bias will also affect all fine-tuned versions of this model.
131
 
132
  ## Training data
 
137
 
138
  ## Training procedure
139
 
140
+ ### Model training with Word2Vec embeddings
141
+
142
+ After the pre-trained model of word2vec embeddings had already learned meanings relevant to the classification problem,
143
+ it was coupled to the classification model to train it with the labeled data in a supervised way. Table 6 shows the results
144
+ obtained with related metrics. With this implementation, was reached new levels of accuracy with 86% for CNN architecture
145
+ and 88% for the LSTM architecture.
146
+
147
+ Table 6: Results from Pre-trained WE + ML models
148
+ | ML Model | Accuracy | F1 Score | Precision | Recall |
149
+ |:--------:|:---------:|:---------:|:---------:|:---------:|
150
+ | NN | 0.8269 | 0.8545 | 0.8392 | 0.8712 |
151
+ | DNN | 0.7115 | 0.7794 | 0.7255 | 0.8485 |
152
+ | CNN | 0.8654 | 0.9083 | 0.8486 | 0.9773 |
153
+ | LSTM | 0.8846 | 0.9139 | 0.9056 | 0.9318 |
154
+
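A minimal sketch of this coupling, assuming the word2vec vectors were trained with gensim and saved to a placeholder path; the layer sizes and hyperparameters are illustrative, not the project's actual configuration:

```python
# Sketch only: couple frozen pre-trained word2vec vectors to an LSTM classifier.
from gensim.models import Word2Vec
from tensorflow.keras import layers, models

w2v = Word2Vec.load("word2vec.model")        # placeholder path for the trained embeddings
vocab_size, embed_dim = w2v.wv.vectors.shape

model = models.Sequential([
    layers.Embedding(vocab_size, embed_dim, weights=[w2v.wv.vectors],
                     trainable=False),       # transfer learning: embeddings stay fixed
    layers.LSTM(64),
    layers.Dense(1, activation="sigmoid"),   # eligible / not eligible
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```

Texts are first mapped to integer indices with the same vocabulary used to train the embeddings before being fed to the model.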
155
  ### Preprocessing
156
 
157
  Pre-processing was used to standardize the texts for the English language, reduce the number of insignificant tokens and
 
168
 
169
  Several Python packages were used to develop the preprocessing code:
170
 
171
+ Table 2: Python packages used
172
  | Objective | Package |
173
  |--------------------------------------------------------|--------------|
174
  | Resolve contractions and slang usage in text | [contractions](https://pypi.org/project/contractions) |
 
185
  As detailed in the notebook on [GitHub](https://github.com/mcti-sefip/mcti-sefip-ppfcd2020/blob/pre-processamento/Pre_Processamento/MCTI_PPF_Pr%C3%A9_processamento), in the pre-processing, code was created to build and evaluate 8 (eight) different
186
  bases, derived from the base of goal 4, with the application of the methods shown in Figure 2.
187
 
188
+ Table 3: Preprocessing methods evaluated
189
  | id | Experiments |
190
  |--------|------------------------------------------------------------------------|
191
  | Base | Original Texts |
 
206
 
207
  All eight bases were evaluated to classify the eligibility of the opportunity, through the training of a shallow
208
  neural network (SNN – Shallow Neural Network). The metrics for the eight bases were evaluated. The results are
209
+ shown in Table 4.
210
 
211
+ Table 4: Results obtained in Preprocessing
212
| id | Experiment | accuracy | f1-score | recall | precision | Mean (s) | N_tokens | max_length |
213
  |--------|------------------------------------------------------------------------|----------|----------|--------|-----------|----------|----------|------------|
214
  | Base | Original Texts | 89,78% | 84,20% | 79,09% | 90,95% | 417,772 | 23788 | 5636 |
 
232
 
233
  ### Pretraining
234
 
235
+ Since labeled data is scarce, word embeddings were trained in an unsupervised manner using other datasets that
236
+ contain most of the words the model needs to learn. The idea was to introduce progressively better-trained
237
+ word embeddings into the model. For an additional dataset to improve word-embedding training, it must be
238
+ compatible with the dataset used to train the classifier. We searched Kaggle, a platform with
239
+ over a thousand available NLP datasets, and the closest match we found was the BBC News Articles dataset, which achieved
240
+ only 56% compatibility.
241
 
242
+ The alternative was to use web scraping algorithms to acquire more unlabeled data from the same sources, thus ensuring
243
+ compatibility. The original dataset had 260 labeled entries.
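A minimal sketch of this unsupervised embedding-training step, assuming the scraped texts are already tokenized; the parameters and output path are illustrative:

```python
# Sketch only: train word2vec embeddings on the unlabeled, scraped texts.
from gensim.models import Word2Vec

corpus = [
    ["research", "funding", "opportunity", "for", "international", "teams"],
    ["call", "for", "proposals", "in", "renewable", "energy"],
]  # in practice, one tokenized document per entry

w2v = Word2Vec(sentences=corpus, vector_size=100, window=5, min_count=1, workers=4)
w2v.save("word2vec.model")                   # placeholder path reused by the classifier
```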
244
 
245
+ Table 5: Compatibility results (*base = labeled MCTI dataset entries)
246
+ | Dataset | Compatibility |
247
+ |--------------------------------------|:----------------------:|
248
+ | Labeled MCTI | 100% |
249
+ | Full MCTI | 100% |
250
+ | BBC News Articles | 56.77% |
251
+ | New unlabeled MCTI | 75.26% |
252
 
253
+ ## Evaluation results
254
 
 
255
 
 
256
 
257
  ## Benchmarks
258