Update README.md
README.md
CHANGED
@@ -85,7 +85,7 @@ The detailed release history can be found on the [here](https://huggingface.co/u

| Model                        | #params | Language |
|------------------------------|:-------:|:--------:|
| [`mcti-base-uncased`]        | 110M    | English  |
| [`mcti-large-uncased`]       | 340M    | English  |
| [`mcti-base-cased`]          | 110M    | English  |
| [`mcti-large-cased`]         | 110M    | Chinese  |
| [`-base-multilingual-cased`] | 110M    | Multiple |
@@ -223,17 +223,35 @@ Several Python packages were used to develop the preprocessing code:
@@ -253,8 +271,7 @@ obtained results with related metrics. With this implementation, we achieved new
@@ -281,8 +298,7 @@ computational power was needed to realize the fine-tuning of the weights. The re
As detailed in the notebook on [GitHub](https://github.com/mcti-sefip/mcti-sefip-ppfcd2020/blob/pre-processamento/Pre_Processamento/MCTI_PPF_Pr%C3%A9_processamento), in the pre-processing stage, code was created to build and evaluate eight different bases, derived from the goal 4 base, by applying the methods shown in Figure 2.

Table 4: Preprocessing methods evaluated

| id   | Experiments                                                          |
|------|----------------------------------------------------------------------|
| Base | Original Texts                                                       |
| xp1  | Expand Contractions                                                  |
| xp2  | Expand Contractions + Convert text to lowercase                      |
| xp3  | Expand Contractions + Remove Punctuation                             |
| xp4  | Expand Contractions + Remove Punctuation + Convert text to lowercase |
| xp5  | xp4 + Stemming                                                       |
| xp6  | xp4 + Lemmatization                                                  |
| xp7  | xp4 + Stemming + Stopwords Removal                                   |
| xp8  | xp4 + Lemmatization + Stopwords Removal                              |

Table 5: Results obtained in preprocessing

| id   | Experiment                                                           | Accuracy | F1-score | Recall | Precision | Mean (s) | N_tokens | max_length |
|------|----------------------------------------------------------------------|----------|----------|--------|-----------|----------|----------|------------|
| Base | Original Texts                                                       | 89.78%   | 84.20%   | 79.09% | 90.95%    | 417.772  | 23788    | 5636       |
| xp1  | Expand Contractions                                                  | 88.71%   | 81.59%   | 71.54% | 97.33%    | 414.715  | 23768    | 5636       |
| xp2  | Expand Contractions + Convert text to lowercase                      | 90.32%   | 85.64%   | 77.19% | 97.44%    | 368.375  | 20322    | 5629       |
| xp3  | Expand Contractions + Remove Punctuation                             | 91.94%   | 87.73%   | 79.66% | 98.72%    | 386.650  | 22121    | 4950       |
| xp4  | Expand Contractions + Remove Punctuation + Convert text to lowercase | 90.86%   | 86.61%   | 80.85% | 94.25%    | 326.830  | 18616    | 4950       |
| xp5  | xp4 + Stemming                                                       | 91.94%   | 87.68%   | 78.47% | 100.00%   | 257.960  | 14319    | 4950       |
| xp6  | xp4 + Lemmatization                                                  | 89.78%   | 85.06%   | 79.66% | 91.87%    | 282.645  | 16194    | 4950       |
| xp7  | xp4 + Stemming + Stopwords Removal                                   | 92.47%   | 88.46%   | 79.66% | 100.00%   | 210.320  | 14212    | 2817       |
| xp8  | xp4 + Lemmatization + Stopwords Removal                              | 92.47%   | 88.46%   | 79.66% | 100.00%   | 225.580  | 16081    | 2726       |
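The combinations in Table 4 can be reproduced with ordinary Python string handling. Below is a minimal sketch of the xp4 steps (expand contractions, remove punctuation, convert to lowercase); the contraction map here is a small hand-written example, not the project's actual dictionary.

```python
import re
import string

# Illustrative contraction map; the project's real dictionary is larger (assumption).
CONTRACTIONS = {"don't": "do not", "it's": "it is", "we're": "we are"}

def expand_contractions(text: str) -> str:
    # Replace each known contraction, case-insensitively.
    for contraction, expansion in CONTRACTIONS.items():
        text = re.sub(re.escape(contraction), expansion, text, flags=re.IGNORECASE)
    return text

def remove_punctuation(text: str) -> str:
    # Strip every ASCII punctuation character.
    return text.translate(str.maketrans("", "", string.punctuation))

def xp4(text: str) -> str:
    # xp4 = Expand Contractions + Remove Punctuation + Convert text to lowercase
    return remove_punctuation(expand_contractions(text)).lower()

print(xp4("It's simple, don't you think?"))  # it is simple do not you think
```

Variants xp5–xp8 add stemming or lemmatization and stopword removal on top of `xp4`, typically via a library such as NLTK.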
### Pretraining
architecture and 88% for the LSTM architecture.

Table 6: Results from Pre-trained WE + ML models

| ML Model | Accuracy | F1 Score | Precision | Recall |
|:--------:|:--------:|:--------:|:---------:|:------:|
|    NN    |  0.8269  |  0.8545  |  0.8392   | 0.8712 |
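A common way to feed pre-trained word embeddings (WE) into classical ML models, as in the setup behind Table 6, is to average a document's word vectors into one fixed-size feature vector. A minimal sketch, assuming a toy 3-dimensional embedding table (the project's actual embeddings and classifiers live in the linked notebook):

```python
# Toy embedding table for illustration; real pre-trained WE are typically 100-300d.
EMBEDDINGS = {
    "research": [0.2, 0.1, 0.5],
    "funding":  [0.4, 0.3, 0.1],
    "open":     [0.1, 0.6, 0.2],
}

def doc_vector(tokens, embeddings, dim=3):
    """Average the embedding vectors of known tokens into one document vector."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return [0.0] * dim  # document with only out-of-vocabulary tokens
    return [sum(col) / len(known) for col in zip(*known)]

vec = doc_vector(["open", "research", "funding"], EMBEDDINGS)
# vec is a fixed-size vector that any downstream classifier (NN, SVM, ...) can consume
```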
This approach achieved adequate accuracy scores, above 82% in all implementation architectures.

Table 7: Results from Pre-trained Longformer + ML models.

| ML Model | Accuracy | F1 Score | Precision | Recall |
|:--------:|:--------:|:--------:|:---------:|:------:|
|    NN    |  0.8269  |  0.8754  |  0.7950   | 0.9773 |
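The Accuracy, F1 Score, Precision, and Recall columns in Tables 6 and 7 follow the standard binary-classification definitions. A small self-contained sketch of how the four metrics relate (illustrative only, not the notebook's evaluation code):

```python
def classification_metrics(y_true, y_pred):
    """Compute (accuracy, precision, recall, f1) for binary 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

# Example: 8 samples, one false negative and one false positive.
acc, prec, rec, f1 = classification_metrics(
    [1, 1, 1, 1, 0, 0, 0, 0],
    [1, 1, 1, 0, 1, 0, 0, 0],
)
```

Note how a high recall with lower precision (as in the Longformer NN row) pulls F1 between the two, which is why F1 can exceed accuracy.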