MarcosDib committed
Commit 62ab1ca
1 Parent(s): 2522575

Update README.md

Files changed (1):
  1. README.md +32 -16
README.md CHANGED
@@ -85,7 +85,7 @@ The detailed release history can be found on the [here](https://huggingface.co/u
  | Model                        | #params | Language |
  |------------------------------|:-------:|:--------:|
  | [`mcti-base-uncased`]        | 110M    | English  |
- | [`mcti-large-uncased`]       | 340M    | English  | sub
+ | [`mcti-large-uncased`]       | 340M    | English  |
  | [`mcti-base-cased`]          | 110M    | English  |
  | [`mcti-large-cased`]         | 110M    | Chinese  |
  | [`-base-multilingual-cased`] | 110M    | Multiple |
@@ -223,17 +223,35 @@ Several Python packages were used to develop the preprocessing code:
  As detailed in the notebook on [GitHub](https://github.com/mcti-sefip/mcti-sefip-ppfcd2020/blob/pre-processamento/Pre_Processamento/MCTI_PPF_Pr%C3%A9_processamento), in the pre-processing, code was created to build and evaluate 8 (eight) different
  bases, derived from the base of goal 4, with the application of the methods shown in Figure 2.
 
- | Base   | Textos originais                                             |
- |--------|--------------------------------------------------------------|
- | xp1    | Expandir Contrações                                          |
- | xp2    | Expandir Contrações + Transformar texto em minúsculo         |
- | xp3    | Expandir Contrações + Remover Pontuação                      |
- | xp4    | Expandir Contrações + Remover Pontuação + Transformar Texto  |
- | xp5    | xp4 + Stemização                                             |
- | xp6    | xp4 + Lematização                                            |
- | xp7    | xp4 + Stemização + Remoção de StopWords                      |
- | xp8    | ap4 + Lematização + Remoção de StopWords                     |
- Table 2 Pre-processing methods evaluated
+ Table 4: Preprocessing methods evaluated
+ | id   | Experiment                                                            |
+ |------|------------------------------------------------------------------------|
+ | Base | Original Texts                                                         |
+ | xp1  | Expand Contractions                                                    |
+ | xp2  | Expand Contractions + Convert text to lowercase                        |
+ | xp3  | Expand Contractions + Remove Punctuation                               |
+ | xp4  | Expand Contractions + Remove Punctuation + Convert text to lowercase   |
+ | xp5  | xp4 + Stemming                                                         |
+ | xp6  | xp4 + Lemmatization                                                    |
+ | xp7  | xp4 + Stemming + Stopword Removal                                      |
+ | xp8  | xp4 + Lemmatization + Stopword Removal                                 |
+
+ Table 5: Results obtained in preprocessing
+ | id   | Experiment                                                             | Accuracy | F1-score | Recall | Precision | Mean (s) | N_tokens | max_length |
+ |------|-------------------------------------------------------------------------|----------|----------|--------|-----------|----------|----------|------------|
+ | Base | Original Texts                                                          | 89.78%   | 84.20%   | 79.09% | 90.95%    | 417.772  | 23788    | 5636       |
+ | xp1  | Expand Contractions                                                     | 88.71%   | 81.59%   | 71.54% | 97.33%    | 414.715  | 23768    | 5636       |
+ | xp2  | Expand Contractions + Convert text to lowercase                         | 90.32%   | 85.64%   | 77.19% | 97.44%    | 368.375  | 20322    | 5629       |
+ | xp3  | Expand Contractions + Remove Punctuation                                | 91.94%   | 87.73%   | 79.66% | 98.72%    | 386.650  | 22121    | 4950       |
+ | xp4  | Expand Contractions + Remove Punctuation + Convert text to lowercase    | 90.86%   | 86.61%   | 80.85% | 94.25%    | 326.830  | 18616    | 4950       |
+ | xp5  | xp4 + Stemming                                                          | 91.94%   | 87.68%   | 78.47% | 100.00%   | 257.960  | 14319    | 4950       |
+ | xp6  | xp4 + Lemmatization                                                     | 89.78%   | 85.06%   | 79.66% | 91.87%    | 282.645  | 16194    | 4950       |
+ | xp7  | xp4 + Stemming + Stopword Removal                                       | 92.47%   | 88.46%   | 79.66% | 100.00%   | 210.320  | 14212    | 2817       |
+ | xp8  | xp4 + Lemmatization + Stopword Removal                                  | 92.47%   | 88.46%   | 79.66% | 100.00%   | 225.580  | 16081    | 2726       |
+
 
  ### Pretraining
 
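As a side note on the preprocessing experiments added above (Table 4), here is a minimal sketch of how the xp1–xp8 combinations could be composed in Python. It assumes the `contractions` package and NLTK's stemmer, lemmatizer, and stopword list; the function and variable names are illustrative only and are not taken from the project's notebook.

```python
# Illustrative sketch of the xp1-xp8 preprocessing steps evaluated in Table 4
# (not the project's actual code). Assumes: pip install contractions nltk
import string

import contractions
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))


def preprocess(text, lowercase=False, remove_punct=False,
               stem=False, lemmatize=False, remove_stopwords=False):
    """Apply a chosen combination of the Table 4 steps to one document."""
    text = contractions.fix(text)                      # xp1 onward: expand contractions
    if remove_punct:                                   # xp3, xp4 onward: remove punctuation
        text = text.translate(str.maketrans("", "", string.punctuation))
    if lowercase:                                      # xp2, xp4 onward: convert to lowercase
        text = text.lower()
    tokens = text.split()
    if remove_stopwords:                               # xp7, xp8: stopword removal
        tokens = [t for t in tokens if t.lower() not in stop_words]
    if stem:                                           # xp5, xp7: stemming
        tokens = [stemmer.stem(t) for t in tokens]
    elif lemmatize:                                    # xp6, xp8: lemmatization
        tokens = [lemmatizer.lemmatize(t) for t in tokens]
    return " ".join(tokens)


# Example: the xp7 configuration (xp4 + stemming + stopword removal)
print(preprocess("They're developing photonic quantum computers.",
                 lowercase=True, remove_punct=True, stem=True, remove_stopwords=True))
```

The xp5–xp8 variants reuse the xp4 output, so the flags compose in the same order the table describes.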
@@ -253,8 +271,7 @@ obtained results with related metrics. With this implementation, we achieved new
  architecture and 88% for the LSTM architecture.
 
 
- Table 1: Results from Pre-trained WE + ML models.
-
+ Table 6: Results from Pre-trained WE + ML models
  | ML Model | Accuracy | F1 Score | Precision | Recall |
  |:--------:|:--------:|:--------:|:---------:|:------:|
  | NN       | 0.8269   | 0.8545   | 0.8392    | 0.8712 |
@@ -281,8 +298,7 @@ computational power was needed to realize the fine-tuning of the weights. The re
  This approach achieved adequate accuracy scores, above 82% in all implementation architectures.
 
 
- Table 2: Results from Pre-trained Longformer + ML models.
-
+ Table 7: Results from Pre-trained Longformer + ML models
  | ML Model | Accuracy | F1 Score | Precision | Recall |
  |:--------:|:--------:|:--------:|:---------:|:------:|
  | NN       | 0.8269   | 0.8754   | 0.7950    | 0.9773 |
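For reference on the metrics reported in the renumbered tables (Accuracy, F1, Precision, Recall), a minimal sketch of how they can be computed with scikit-learn; the label arrays are placeholders, not the project's data.

```python
# Minimal sketch: computing the four reported metrics with scikit-learn.
# y_true / y_pred are placeholder arrays, not the project's actual predictions.
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]   # ground-truth labels (placeholder)
y_pred = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1]   # model predictions (placeholder)

print(f"Accuracy : {accuracy_score(y_true, y_pred):.4f}")
print(f"F1 Score : {f1_score(y_true, y_pred):.4f}")
print(f"Precision: {precision_score(y_true, y_pred):.4f}")
print(f"Recall   : {recall_score(y_true, y_pred):.4f}")
```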
 