| xp7 | xp4 + Stemming + Stopwords Removal |
| xp8 | xp4 + Lemmatization + Stopwords Removal |

First, the treatment of punctuation and capitalization was evaluated. This phase resulted in the construction and evaluation of the first four bases (xp1, xp2, xp3, xp4).

Then, content simplification was evaluated from the xp4 base, considering stemming (xp5), lemmatization (xp6), stemming + stopwords removal (xp7), and lemmatization + stopwords removal (xp8).
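
The four content-simplification variants can be sketched as below. This section does not name the libraries the project used, so a hand-rolled suffix stripper, lemma map, and stopword list stand in for the real ones; only the combination pattern (stem or lemmatize, with or without stopword removal) mirrors xp5–xp8.

```python
# Toy sketch of the content-simplification variants (xp5-xp8). The
# stemmer, lemma map, and stopword list below are illustrative
# stand-ins, not the project's actual tools.

STOPWORDS = {"the", "of", "a", "an", "and", "for"}
LEMMA_MAP = {"opportunities": "opportunity"}

def stem(token: str) -> str:
    # Naive suffix stripping; a stand-in for a real stemmer.
    for suffix in ("ing", "ies", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def lemmatize(token: str) -> str:
    # Dictionary lookup; a stand-in for a real lemmatizer.
    return LEMMA_MAP.get(token, token)

def simplify(tokens, normalizer, remove_stopwords=False):
    # xp7/xp8 additionally drop stopwords before normalizing
    if remove_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    return [normalizer(t) for t in tokens]

doc = "funding opportunities for the research".split()
print(simplify(doc, stem, remove_stopwords=True))       # xp7-style
print(simplify(doc, lemmatize, remove_stopwords=True))  # xp8-style
```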

All eight bases were evaluated on classifying the eligibility of an opportunity by training a shallow neural network (SNN). The same metrics were computed for all eight bases; the results are shown in Table 5.

#### Table 5: Results obtained in Preprocessing
| id | Experiment | Accuracy | F1-score | Recall | Precision | Mean time (s) | N_tokens | max_length |
|:--:|:-----------|:--------:|:--------:|:------:|:---------:|:-------------:|:--------:|:----------:|
| xp7 | xp4 + Stemming + Stopwords Removal | 92.47% | 88.46% | 79.66% | 100.00% | 210.320 | 14212 | 2817 |
| xp8 | xp4 + Lemmatization + Stopwords Removal | 92.47% | 88.46% | 79.66% | 100.00% | 225.580 | 16081 | 2726 |
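
A shallow network of the kind used for these comparisons can be sketched as follows. Layer sizes, hyperparameters, and the toy data are placeholders, not the project's real features; only the architecture (one hidden layer, sigmoid output, gradient descent on binary cross-entropy) matches the SNN described above.

```python
# Minimal shallow-neural-network (SNN) binary classifier sketch:
# one tanh hidden layer, sigmoid output, plain gradient descent.
# All sizes and data here are illustrative placeholders.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_snn(X, y, hidden=8, lr=0.5, epochs=2000):
    n, d = X.shape
    W1 = rng.normal(0.0, 0.5, (d, hidden)); b1 = np.zeros(hidden)
    W2 = rng.normal(0.0, 0.5, (hidden, 1)); b2 = np.zeros(1)
    for _ in range(epochs):
        h = np.tanh(X @ W1 + b1)             # hidden activations
        p = sigmoid(h @ W2 + b2).ravel()     # P(eligible)
        g = ((p - y) / n)[:, None]           # dBCE/dlogit per sample
        gh = (g @ W2.T) * (1.0 - h ** 2)     # backprop through tanh
        W2 -= lr * (h.T @ g); b2 -= lr * g.sum(axis=0)
        W1 -= lr * (X.T @ gh); b1 -= lr * gh.sum(axis=0)
    return lambda Xq: sigmoid(np.tanh(Xq @ W1 + b1) @ W2 + b2).ravel() > 0.5

# Toy, linearly separable stand-in for document feature vectors
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
predict = train_snn(X, y)
```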

Even so, a choice had to be made between these two strong options: xp7 has a shorter training time and fewer unique tokens, while xp8 has a smaller maximum document length. The deciding criterion was the computational cost of training the vector-representation models (word, sentence, and document embeddings); the training times are so close that they carried little weight in the analysis.
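
The two statistics that drove this comparison can be recomputed from any tokenized corpus. A sketch, assuming N_tokens in Table 5 means the number of unique tokens (vocabulary size) and max_length the longest document in tokens; the documents below are toy stand-ins:

```python
# Recompute the Table 5 base statistics from a tokenized corpus.
# Assumption: N_tokens = vocabulary size, max_length = longest
# document in tokens.

def base_stats(tokenized_docs):
    vocab = {t for doc in tokenized_docs for t in doc}
    return {"N_tokens": len(vocab),
            "max_length": max(len(doc) for doc in tokenized_docs)}

docs = [["fund", "research", "grant"],
        ["grant", "deadline"],
        ["fund", "grant", "open", "call", "deadline"]]
print(base_stats(docs))
```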

As a last step, a spreadsheet was generated for the chosen base (xp8) with the fields opo_pre and opo_pre_tkn, containing the preprocessed text as sentences and as tokens, respectively. This [database](https://github.com/mcti-sefip/mcti-sefip-ppfcd2020/blob/pre-processamento/Pre_Processamento/oportunidades_final_pre_processado.xlsx) was made available on the project's GitHub with the added columns opo_pre (text) and opo_pre_tkn (tokenized).
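
The layout of those two columns can be illustrated as below. The published artifact is an .xlsx file; stdlib csv is used here only to show the opo_pre / opo_pre_tkn shape, and the row content is invented for illustration.

```python
# Illustrative layout of the published opo_pre / opo_pre_tkn columns.
# The real file is .xlsx; csv is used here just to show the shape.
import csv
import io
import json

rows = [{"opo_pre": "fund research grant",
         "opo_pre_tkn": json.dumps(["fund", "research", "grant"])}]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["opo_pre", "opo_pre_tkn"])
writer.writeheader()
writer.writerows(rows)
print(buf.getvalue())
```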

### Pretraining

computational power was needed to fine-tune the weights. The results, with the related metrics, can be viewed in Table 7. This approach achieved adequate accuracy scores, above 82% in all implemented architectures.
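
The "pre-trained Longformer + ML model" pattern feeds document embeddings from a frozen encoder into a small downstream classifier. A sketch of the pooling step is below; the encoder calls are outlined only in comments because they require downloading model weights, and the model name shown is the public Longformer checkpoint, not necessarily the one this project used.

```python
# Sketch of the Longformer-embeddings-to-classifier pattern: pool the
# encoder's token vectors into one document vector, then hand it to a
# downstream ML model. Only the pooling step is executable here.
import numpy as np

def mean_pool(hidden: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Average token vectors, ignoring padding positions."""
    # hidden: (tokens, dim); mask: (tokens,) with 1 for real tokens
    w = mask[:, None].astype(float)
    return (hidden * w).sum(axis=0) / w.sum()

# Outline with Hugging Face transformers (not run here):
#   tok = AutoTokenizer.from_pretrained("allenai/longformer-base-4096")
#   enc = AutoModel.from_pretrained("allenai/longformer-base-4096")
#   batch = tok(text, return_tensors="pt", truncation=True)
#   hidden = enc(**batch).last_hidden_state[0].detach().numpy()
#   doc_vec = mean_pool(hidden, batch["attention_mask"][0].numpy())
# doc_vec then feeds the downstream model (e.g. the NN in Table 7).
```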

#### Table 7: Results from Pre-trained Longformer + ML models

| ML Model | Accuracy | F1 Score | Precision | Recall |
|:--------:|:--------:|:--------:|:---------:|:------:|
| NN | 0.8269 | 0.8754 | 0.7950 | 0.9773 |
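
As a quick sanity check, the F1 score in the NN row can be re-derived from the reported precision and recall, since F1 is their harmonic mean; the small discrepancy comes from the inputs themselves being rounded to four decimal places.

```python
# Re-derive the Table 7 NN F1 score from its precision and recall.
precision, recall = 0.7950, 0.9773
f1 = 2 * precision * recall / (precision + recall)
# Matches the reported 0.8754 only approximately, because the
# precision and recall inputs are already rounded values.
print(f"recomputed F1 = {f1:.4f}")
```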