Update README.md
README.md
CHANGED
@@ -85,7 +85,7 @@ The detailed release history can be found on the [here](https://huggingface.co/u

| Model                        | #params | Language |
|------------------------------|:-------:|:--------:|
| [`mcti-base-uncased`]        | 110M    | English  |
| [`mcti-large-uncased`]       | 340M    | English  |
| [`mcti-base-cased`]          | 110M    | English  |
| [`mcti-large-cased`]         | 110M    | Chinese  |
| [`-base-multilingual-cased`] | 110M    | Multiple |
@@ -223,17 +223,35 @@ Several Python packages were used to develop the preprocessing code:
@@ -253,8 +271,7 @@ obtained results with related metrics. With this implementation, we achieved new
@@ -281,8 +298,7 @@ computational power was needed to realize the fine-tuning of the weights. The re
As detailed in the notebook on [GitHub](https://github.com/mcti-sefip/mcti-sefip-ppfcd2020/blob/pre-processamento/Pre_Processamento/MCTI_PPF_Pr%C3%A9_processamento), in the pre-processing stage, code was created to build and evaluate eight different bases, derived from the goal 4 base, by applying the methods shown in Figure 2.

Table 4: Preprocessing methods evaluated

| id   | Experiments                                                          |
|------|----------------------------------------------------------------------|
| Base | Original Texts                                                       |
| xp1  | Expand Contractions                                                  |
| xp2  | Expand Contractions + Convert text to lowercase                      |
| xp3  | Expand Contractions + Remove Punctuation                             |
| xp4  | Expand Contractions + Remove Punctuation + Convert text to lowercase |
| xp5  | xp4 + Stemming                                                       |
| xp6  | xp4 + Lemmatization                                                  |
| xp7  | xp4 + Stemming + Stopwords Removal                                   |
| xp8  | xp4 + Lemmatization + Stopwords Removal                              |

Table 5: Results obtained in preprocessing

| id   | Experiment                                                           | Accuracy | F1-score | Recall | Precision | Mean (s) | N_tokens | max_length |
|------|----------------------------------------------------------------------|----------|----------|--------|-----------|----------|----------|------------|
| Base | Original Texts                                                       | 89.78%   | 84.20%   | 79.09% | 90.95%    | 417.772  | 23788    | 5636       |
| xp1  | Expand Contractions                                                  | 88.71%   | 81.59%   | 71.54% | 97.33%    | 414.715  | 23768    | 5636       |
| xp2  | Expand Contractions + Convert text to lowercase                      | 90.32%   | 85.64%   | 77.19% | 97.44%    | 368.375  | 20322    | 5629       |
| xp3  | Expand Contractions + Remove Punctuation                             | 91.94%   | 87.73%   | 79.66% | 98.72%    | 386.650  | 22121    | 4950       |
| xp4  | Expand Contractions + Remove Punctuation + Convert text to lowercase | 90.86%   | 86.61%   | 80.85% | 94.25%    | 326.830  | 18616    | 4950       |
| xp5  | xp4 + Stemming                                                       | 91.94%   | 87.68%   | 78.47% | 100.00%   | 257.960  | 14319    | 4950       |
| xp6  | xp4 + Lemmatization                                                  | 89.78%   | 85.06%   | 79.66% | 91.87%    | 282.645  | 16194    | 4950       |
| xp7  | xp4 + Stemming + Stopwords Removal                                   | 92.47%   | 88.46%   | 79.66% | 100.00%   | 210.320  | 14212    | 2817       |
| xp8  | xp4 + Lemmatization + Stopwords Removal                              | 92.47%   | 88.46%   | 79.66% | 100.00%   | 225.580  | 16081    | 2726       |
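The combinations in Table 4 can be reproduced with ordinary Python string handling. Below is a minimal sketch of the xp4 steps (expand contractions, remove punctuation, convert to lowercase); the contraction map here is a small hand-written example, not the project's actual dictionary.

```python
import re
import string

# Illustrative contraction map; the project's real dictionary is larger (assumption).
CONTRACTIONS = {"don't": "do not", "it's": "it is", "we're": "we are"}

def expand_contractions(text: str) -> str:
    # Replace each known contraction, case-insensitively.
    for contraction, expansion in CONTRACTIONS.items():
        text = re.sub(re.escape(contraction), expansion, text, flags=re.IGNORECASE)
    return text

def remove_punctuation(text: str) -> str:
    # Strip every ASCII punctuation character.
    return text.translate(str.maketrans("", "", string.punctuation))

def xp4(text: str) -> str:
    # xp4 = Expand Contractions + Remove Punctuation + Convert text to lowercase
    return remove_punctuation(expand_contractions(text)).lower()

print(xp4("It's simple, don't you think?"))  # it is simple do not you think
```

Variants xp5–xp8 add stemming or lemmatization and stopword removal on top of `xp4`, typically via a library such as NLTK.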
### Pretraining
architecture and 88% for the LSTM architecture.

Table 6: Results from Pre-trained WE + ML models

| ML Model | Accuracy | F1 Score | Precision | Recall |
|:--------:|:--------:|:--------:|:---------:|:------:|
|    NN    |  0.8269  |  0.8545  |  0.8392   | 0.8712 |
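A common way to feed pre-trained word embeddings (WE) into classical ML models, as in the setup behind Table 6, is to average a document's word vectors into one fixed-size feature vector. A minimal sketch, assuming a toy 3-dimensional embedding table (the project's actual embeddings and classifiers live in the linked notebook):

```python
# Toy embedding table for illustration; real pre-trained WE are typically 100-300d.
EMBEDDINGS = {
    "research": [0.2, 0.1, 0.5],
    "funding":  [0.4, 0.3, 0.1],
    "open":     [0.1, 0.6, 0.2],
}

def doc_vector(tokens, embeddings, dim=3):
    """Average the embedding vectors of known tokens into one document vector."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return [0.0] * dim  # document with only out-of-vocabulary tokens
    return [sum(col) / len(known) for col in zip(*known)]

vec = doc_vector(["open", "research", "funding"], EMBEDDINGS)
# vec is a fixed-size vector that any downstream classifier (NN, SVM, ...) can consume
```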
This approach achieved adequate accuracy scores, above 82% in all implementation architectures.

Table 7: Results from Pre-trained Longformer + ML models.

| ML Model | Accuracy | F1 Score | Precision | Recall |
|:--------:|:--------:|:--------:|:---------:|:------:|
|    NN    |  0.8269  |  0.8754  |  0.7950   | 0.9773 |
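The Accuracy, F1 Score, Precision, and Recall columns in Tables 6 and 7 follow the standard binary-classification definitions. A small self-contained sketch of how the four metrics relate (illustrative only, not the notebook's evaluation code):

```python
def classification_metrics(y_true, y_pred):
    """Compute (accuracy, precision, recall, f1) for binary 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

# Example: 8 samples, one false negative and one false positive.
acc, prec, rec, f1 = classification_metrics(
    [1, 1, 1, 1, 0, 0, 0, 0],
    [1, 1, 1, 0, 1, 0, 0, 0],
)
```

Note how a high recall with lower precision (as in the Longformer NN row) pulls F1 between the two, which is why F1 can exceed accuracy.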