MarcosDib committed a34b084 (1 parent: 6c974c9)

Update README.md

Files changed (1): README.md (+16 −2)
 
| xp7 | xp4 + Stemming + Stopwords Removal |
| xp8 | xp4 + Lemmatization + Stopwords Removal |

First, the treatment of punctuation and capitalization was evaluated. This phase resulted in the construction and evaluation of the first four bases (xp1, xp2, xp3, xp4).

Then, content simplification was evaluated starting from the xp4 base, considering stemming (xp5), lemmatization (xp6), stemming + stopwords removal (xp7), and lemmatization + stopwords removal (xp8).
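The two preprocessing phases above can be sketched in plain Python. This is a minimal illustration under assumptions, not the project's code: the helper names (`normalize`, `crude_stem`, `simplify`) and the tiny stopword list are hypothetical, and a real pipeline would use a proper stemmer or lemmatizer (e.g. from NLTK) and a full stopword list.

```python
import re
import string

# Hypothetical mini stopword list; the real pipeline would use a full list.
STOPWORDS = {"the", "of", "a", "to", "and"}

def normalize(text: str) -> str:
    """Punctuation and capitalization treatment (xp1-xp4 phase, sketch)."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()

def crude_stem(token: str) -> str:
    """Toy suffix stripper standing in for a real stemmer (xp5/xp7 phase)."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def simplify(text: str) -> list:
    """Content simplification: stopword removal plus stemming (sketch)."""
    tokens = normalize(text).split()
    return [crude_stem(t) for t in tokens if t not in STOPWORDS]

print(simplify("Funding of the Research Projects!"))  # → ['fund', 'research', 'project']
```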

All eight bases were evaluated on the task of classifying the eligibility of the opportunity, by training a shallow neural network (SNN). The metrics for the eight bases are shown in Table 5.

#### Table 5: Results obtained in Preprocessing
| id | Experiment | accuracy | f1-score | recall | precision | mean time (s) | N_tokens | max_length |
|:--:|:-----------|:--------:|:--------:|:------:|:---------:|:-------------:|:--------:|:----------:|
| xp7 | xp4 + Stemming + Stopwords Removal | 92.47% | 88.46% | 79.66% | 100.00% | 210.320 | 14212 | 2817 |
| xp8 | xp4 + Lemmatization + Stopwords Removal | 92.47% | 88.46% | 79.66% | 100.00% | 225.580 | 16081 | 2726 |

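The shallow neural network behind Table 5 is not specified in detail in this README. The sketch below only illustrates the structure of a single-hidden-layer classifier's forward pass in plain Python; the layer sizes are hypothetical and the weights are random and untrained (a real SNN would learn them by backpropagation).

```python
import math
import random

random.seed(0)  # deterministic weights for the sketch

N_FEATURES, N_HIDDEN = 8, 4  # hypothetical sizes; the project's are not given

# Random, untrained weights standing in for a trained model.
w1 = [[random.uniform(-1, 1) for _ in range(N_FEATURES)] for _ in range(N_HIDDEN)]
w2 = [random.uniform(-1, 1) for _ in range(N_HIDDEN)]

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def predict_eligibility(x):
    """One hidden layer + sigmoid output: P(opportunity is eligible)."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in w1]
    return sigmoid(sum(w * h for w, h in zip(w2, hidden)))

score = predict_eligibility([0.5] * N_FEATURES)
print(score)
```

The sigmoid output is always a probability-like score in (0, 1), which is what the accuracy/precision/recall thresholds in Table 5 would be computed from.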
Even so, a choice had to be made between these two strong options: xp7 has a shorter training time and fewer unique tokens, while xp8 has a smaller maximum sequence length. The criterion used for the choice was the computational cost required to train the vector-representation models (word embeddings, sentence embeddings, document embeddings); the training times are so close that they carried little weight in the analysis.

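The selection criterion just described can be restated concretely with the Table 5 figures. The dictionary layout below is only an illustration; the numbers are taken from the table, and the criterion (smallest maximum sequence length, ignoring the small training-time gap) follows the text.

```python
# Tradeoff figures from Table 5 for the two finalist bases.
bases = {
    "xp7": {"time_s": 210.320, "n_tokens": 14212, "max_length": 2817},
    "xp8": {"time_s": 225.580, "n_tokens": 16081, "max_length": 2726},
}

# Criterion from the text: cost of training the embedding models, which is
# driven by the maximum sequence length; training time is effectively a tie.
chosen = min(bases, key=lambda b: bases[b]["max_length"])
print(chosen)  # → xp8
```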
As a last step, a spreadsheet was generated for the chosen base (xp8) with the fields opo_pre and opo_pre_tkn, containing the preprocessed text as sentences and as tokens, respectively. This [database](https://github.com/mcti-sefip/mcti-sefip-ppfcd2020/blob/pre-processamento/Pre_Processamento/oportunidades_final_pre_processado.xlsx) was made available on the project's GitHub with the inclusion of the columns opo_pre (text) and opo_pre_tkn (tokenized).

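The two-column export can be sketched with the standard library. This is only an illustration of the opo_pre / opo_pre_tkn layout: the row content is hypothetical, and the real artifact is an .xlsx file (which would typically be written with pandas/openpyxl rather than the csv module used here).

```python
import csv
import io

# Hypothetical rows: opo_pre holds the preprocessed sentence,
# opo_pre_tkn its token list serialized as text.
rows = [
    {"opo_pre": "fund research project",
     "opo_pre_tkn": "['fund', 'research', 'project']"},
]

buf = io.StringIO()  # a real export would target the .xlsx file instead
writer = csv.DictWriter(buf, fieldnames=["opo_pre", "opo_pre_tkn"])
writer.writeheader()
writer.writerows(rows)
exported = buf.getvalue()
print(exported)
```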
### Pretraining

 
We realized supervised training of the whole model; at this point, only computational power was needed to fine-tune the weights. The results with the related metrics can be viewed in Table 7. This approach achieved adequate accuracy scores, above 82% across all implemented architectures.

#### Table 7: Results from Pre-trained Longformer + ML models

| ML Model | Accuracy | F1 Score | Precision | Recall |
|:--------:|:--------:|:--------:|:---------:|:------:|
| NN | 0.8269 | 0.8754 | 0.7950 | 0.9773 |
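The metrics in Tables 5 and 7 follow the standard confusion-matrix definitions. The sketch below computes them on a handful of hypothetical labels, purely to make the formulas explicit; it is not the project's evaluation code or data.

```python
# Hypothetical labels, only to illustrate the metric definitions.
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 1, 0, 0, 1, 1]

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # true positives
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false positives
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # false negatives
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))  # true negatives

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(precision, recall, f1)  # → 0.75 0.75 0.75
```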