In this context, two approaches were considered:
- Pre-training word embeddings using similar datasets for text classification;
- Using transformers and attention mechanisms (Longformer) to create contextualized embeddings.
The models using Word2Vec and Longformer also need their pre-trained weights to be loaded; the weight files are as follows (a loading sketch follows Table 1):
Table 1: Word2Vec and Longformer model weights
| Model | Weights |
|------------------------------|:-------:|
| Longformer | 10.9 GB |
| Word2Vec | 56.1 MB |
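
A minimal loading sketch (the file and checkpoint names are hypothetical stand-ins for the weight files in Table 1; substitute the files shipped with this repository):

```python
from gensim.models import KeyedVectors
from transformers import LongformerModel, LongformerTokenizer

# Hypothetical names standing in for the weights listed in Table 1.
w2v = KeyedVectors.load("word2vec_mcti.kv")                         # ~56.1 MB
longformer = LongformerModel.from_pretrained("./longformer-mcti")   # ~10.9 GB
tokenizer = LongformerTokenizer.from_pretrained("./longformer-mcti")
```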

The results for every embedding and classifier combination are summarized below (Accuracy, F1 Score, Precision, and Recall in %):

| Model | Accuracy | F1 Score | Precision | Recall |   |   |   |
|:----------------------|:--------:|:--------:|:---------:|:------:|:---:|:---:|:---:|
| Keras Embedding + SNN | 92.47 | 88.46 | 79.66 | 100 | 0.2 | 0.7 | 1.8 |
| Keras Embedding + DNN  | 89.78 | 84.41 | 77.81 | 92.57 | 1   | 1.4  | 7.6  |
| Keras Embedding + CNN  | 93.01 | 89.91 | 85.18 | 95.69 | 0.4 | 1.1  | 3.2  |
| Keras Embedding + LSTM | 93.01 | 88.94 | 83.32 | 95.54 | 1.4 | 2    | 1.8  |
| Word2Vec + SNN         | 89.25 | 83.82 | 74.15 | 97.10 | 1.4 | 1.2  | 9.6  |
| Word2Vec + DNN         | 90.32 | 86.52 | 85.18 | 88.70 | 2   | 6.8  | 7.8  |
| Word2Vec + CNN         | 92.47 | 88.42 | 80.85 | 98.72 | 1.9 | 3.4  | 4.7  |
| Word2Vec + LSTM        | 89.78 | 84.36 | 75.36 | 95.81 | 2.6 | 14.3 | 1.2  |
| Longformer + SNN       | 61.29 | 0     | 0     | 0     | 128 | 1.5  | 36.8 |
| Longformer + DNN       | 91.93 | 87.62 | 80.37 | 97.62 | 81  | 8.4  | 12.7 |
| Longformer + CNN       | 94.09 | 90.69 | 83.41 | 100   | 57  | 4.5  | 9.6  |
| Longformer + LSTM      | 61.29 | 0     | 0     | 0     | 135 | 8.6  | 2.6  |
## Intended uses
You can use this model directly with a pipeline for masked language modeling:
```python
>>> from transformers import pipeline
>>> # NOTE: 'bert-base-uncased' is a placeholder checkpoint; substitute this
>>> # repository's checkpoint.
>>> unmasker = pipeline('fill-mask', model='bert-base-uncased')
>>> unmasker("Hello I'm a [MASK] model.")

[{'sequence': "[CLS] hello i'm a fashion model. [SEP]",
  'score': 0.1073106899857521,
  'token': 4827,
  'token_str': 'fashion'},
 {'sequence': "[CLS] hello i'm a fine model. [SEP]",
  'score': 0.027095865458250046,
  'token': 2986,
  'token_str': 'fine'}]
```
Here is how to use this model to get the features of a given text in PyTorch:
```python
from transformers import BertTokenizer, BertModel

# NOTE: 'bert-base-uncased' is a placeholder checkpoint; substitute this
# repository's checkpoint.
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained("bert-base-uncased")
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
```
and in TensorFlow:
```python
from transformers import BertTokenizer, TFBertModel

# NOTE: 'bert-base-uncased' is a placeholder checkpoint; substitute this
# repository's checkpoint.
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = TFBertModel.from_pretrained("bert-base-uncased")
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)
```
### Limitations and bias
Even if the training data used for this model could be characterized as fairly neutral, this model can have biased
predictions:
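
For instance, probing the fill-mask pipeline (again with the placeholder `bert-base-uncased` checkpoint) surfaces gendered associations:

```python
>>> unmasker = pipeline('fill-mask', model='bert-base-uncased')
>>> unmasker("The man worked as a [MASK].")

[{'sequence': '[CLS] the man worked as a carpenter. [SEP]',
  'score': 0.09747550636529922,
  'token': 10533,
  'token_str': 'carpenter'},
 {'sequence': '[CLS] the man worked as a salesman. [SEP]',
  'score': 0.037680890411138535,
  'token': 18968,
  'token_str': 'salesman'}]

>>> unmasker("The woman worked as a [MASK].")

[{'sequence': '[CLS] the woman worked as a nurse. [SEP]',
  'score': 0.21981462836265564,
  'token': 6821,
  'token_str': 'nurse'},
 {'sequence': '[CLS] the woman worked as a cook. [SEP]',
  'score': 0.03042375110089779,
  'token': 5660,
  'token_str': 'cook'}]
```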
This bias will also affect all fine-tuned versions of this model.
## Training data
## Training procedure
### Model training with Word2Vec embeddings

After the pre-trained word2vec embeddings had learned meanings relevant to the classification problem, they were
coupled to the classification models and the combined model was trained on the labeled data in a supervised way.
Table 6 shows the results with the related metrics. With this implementation, new levels of accuracy were reached:
86% for the CNN architecture and 88% for the LSTM architecture. A sketch of the coupling follows Table 6.
Table 6: Results from Pre-trained WE + ML models

| ML Model | Accuracy | F1 Score | Precision | Recall |
|:--------:|:--------:|:--------:|:---------:|:------:|
| NN       | 0.8269   | 0.8545   | 0.8392    | 0.8712 |
| DNN      | 0.7115   | 0.7794   | 0.7255    | 0.8485 |
| CNN      | 0.8654   | 0.9083   | 0.8486    | 0.9773 |
| LSTM     | 0.8846   | 0.9139   | 0.9056    | 0.9318 |
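
A minimal sketch of this coupling, assuming the word2vec vectors are loaded with gensim into a frozen Keras `Embedding` layer feeding an LSTM head (the file name and layer sizes are illustrative):

```python
import numpy as np
from gensim.models import KeyedVectors
from tensorflow import keras

# Hypothetical file name: the Word2Vec weights listed in Table 1.
w2v = KeyedVectors.load("word2vec_mcti.kv")

# Row i of the matrix holds the pre-trained vector of vocabulary word i.
embedding_matrix = np.array([w2v[word] for word in w2v.index_to_key])

model = keras.Sequential([
    # Frozen pre-trained embeddings: the transfer-learning coupling.
    keras.layers.Embedding(
        input_dim=embedding_matrix.shape[0],
        output_dim=embedding_matrix.shape[1],
        embeddings_initializer=keras.initializers.Constant(embedding_matrix),
        trainable=False,
    ),
    keras.layers.LSTM(64),
    keras.layers.Dense(1, activation="sigmoid"),  # eligible / not eligible
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```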
### Preprocessing
Pre-processing was used to standardize the texts for the English language, reduce the number of insignificant tokens, and improve the quality of the training bases.
Several Python packages were used to develop the preprocessing code (a brief usage sketch follows Table 2):
Table 2: Python packages used
| Objective | Package |
|--------------------------------------------------------|--------------|
| Resolve contractions and slang usage in text | [contractions](https://pypi.org/project/contractions) |
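
A sketch of how such packages combine in a preprocessing function (only the `contractions` step comes from the table above; the lowercasing and cleanup steps are illustrative assumptions):

```python
import re
import contractions

def preprocess(text: str) -> str:
    """Sketch of the standardization pipeline for English text."""
    text = contractions.fix(text)          # "don't" -> "do not"
    text = text.lower()                    # standardize casing
    text = re.sub(r"[^a-z\s]", " ", text)  # drop digits/punctuation (illustrative)
    return re.sub(r"\s+", " ", text).strip()

print(preprocess("We're looking for 3 PhD candidates!"))
# -> "we are looking for phd candidates"
```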
As detailed in the notebook on [GitHub](https://github.com/mcti-sefip/mcti-sefip-ppfcd2020/blob/pre-processamento/Pre_Processamento/MCTI_PPF_Pr%C3%A9_processamento), code was created during pre-processing to build and evaluate 8 (eight) different bases, derived from the base of goal 4, by applying the methods shown in Figure 2.
Table 3: Preprocessing methods evaluated
| id | Experiments |
|--------|------------------------------------------------------------------------|
| Base | Original Texts |
All eight bases were evaluated on classifying the eligibility of the opportunity, by training a shallow neural
network (SNN) on each. The resulting metrics are shown in Table 4; a sketch of such a network follows the table.
Table 4: Results obtained in Preprocessing

| id   | Experiment     | accuracy | f1-score | recall | precision | Mean (s) | N_tokens | max_length |
|------|----------------|----------|----------|--------|-----------|----------|----------|------------|
| Base | Original Texts | 89.78%   | 84.20%   | 79.09% | 90.95%    | 417.772  | 23788    | 5636       |
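
A minimal sketch of such a shallow classifier (the `TextVectorization` front end and layer sizes are illustrative assumptions; `max_tokens` mirrors the N_tokens column above):

```python
from tensorflow import keras

# Shallow neural network: a single small hidden layer over a multi-hot encoding.
vectorize = keras.layers.TextVectorization(max_tokens=23788, output_mode="multi_hot")
# vectorize.adapt(train_texts)  # fit the vocabulary on the preprocessed base first

snn = keras.Sequential([
    vectorize,
    keras.layers.Dense(16, activation="relu"),    # one hidden layer = "shallow"
    keras.layers.Dense(1, activation="sigmoid"),  # eligible / not eligible
])
snn.compile(optimizer="adam", loss="binary_crossentropy",
            metrics=["accuracy", keras.metrics.Precision(), keras.metrics.Recall()])
```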
### Pretraining

Since labeled data is scarce, the word embeddings were trained in an unsupervised manner using other datasets that
contain most of the words the model needs to learn. The idea was to introduce progressively better-trained word
embeddings into the model. For an additional dataset to improve word-embedding training, it must be compatible with
the dataset used to train the classifier. We searched Kaggle, a platform with over a thousand available NLP datasets,
and the closest match we found was the BBC News Articles dataset, which achieved only 56% compatibility.
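
A minimal sketch of this unsupervised training with gensim (the corpus and hyperparameters are illustrative):

```python
from gensim.models import Word2Vec

# `corpus` stands in for the tokenized unlabeled texts of the compatible datasets.
corpus = [["research", "funding", "opportunity"], ["phd", "scholarship", "deadline"]]

model = Word2Vec(sentences=corpus, vector_size=100, window=5, min_count=1, workers=4)
model.wv.save("word2vec_mcti.kv")  # hypothetical file name, reloaded by the classifier
```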
The alternative was to use web-scraping algorithms to acquire more unlabeled data from the same sources, thus ensuring
compatibility. The original dataset had 260 labeled entries. One way such a compatibility score can be computed is
sketched after Table 5.
Table 5: Compatibility results (*base = labeled MCTI dataset entries)

| Dataset            | Compatibility |
|--------------------|:-------------:|
| Labeled MCTI       | 100%          |
| Full MCTI          | 100%          |
| BBC News Articles  | 56.77%        |
| New unlabeled MCTI | 75.26%        |
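
A sketch of one way such a compatibility score could be computed, assuming it measures the share of the classifier dataset's vocabulary covered by the candidate dataset (this definition is an assumption, not stated in the card):

```python
def compatibility(classifier_texts, candidate_texts):
    """Share of the classifier dataset's vocabulary found in the candidate dataset."""
    classifier_vocab = {tok for text in classifier_texts for tok in text.split()}
    candidate_vocab = {tok for text in candidate_texts for tok in text.split()}
    return len(classifier_vocab & candidate_vocab) / len(classifier_vocab)

# Toy check: the candidate corpus covers 3 of the 4 classifier tokens -> 0.75
print(compatibility(["research funding phd deadline"],
                    ["phd research deadline news"]))
```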
## Evaluation results
## Benchmarks