| xp7 | xp4 + Stemming + Stopwords Removal |
| xp8 | xp4 + Lemmatization + Stopwords Removal |

First, the treatment of punctuation and capitalization was evaluated. This phase resulted in the construction and evaluation of the first four bases (xp1, xp2, xp3, xp4).

Then, content simplification was evaluated from the xp4 base, considering stemming (xp5), lemmatization (xp6), stemming + stopwords removal (xp7), and lemmatization + stopwords removal (xp8).
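
The four content-simplification variants can be sketched as below. This section does not name the libraries the project used, so a hand-rolled suffix stripper, lemma map, and stopword list stand in for the real ones; only the combination pattern (stem or lemmatize, with or without stopword removal) mirrors xp5–xp8.

```python
# Toy sketch of the content-simplification variants (xp5-xp8). The
# stemmer, lemma map, and stopword list below are illustrative
# stand-ins, not the project's actual tools.

STOPWORDS = {"the", "of", "a", "an", "and", "for"}
LEMMA_MAP = {"opportunities": "opportunity"}

def stem(token: str) -> str:
    # Naive suffix stripping; a stand-in for a real stemmer.
    for suffix in ("ing", "ies", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def lemmatize(token: str) -> str:
    # Dictionary lookup; a stand-in for a real lemmatizer.
    return LEMMA_MAP.get(token, token)

def simplify(tokens, normalizer, remove_stopwords=False):
    # xp7/xp8 additionally drop stopwords before normalizing
    if remove_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    return [normalizer(t) for t in tokens]

doc = "funding opportunities for the research".split()
print(simplify(doc, stem, remove_stopwords=True))       # xp7-style
print(simplify(doc, lemmatize, remove_stopwords=True))  # xp8-style
```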

All eight bases were evaluated on classifying the eligibility of an opportunity by training a shallow neural network (SNN). The same metrics were computed for all eight bases; the results are shown in Table 5.

#### Table 5: Results obtained in Preprocessing
| id | Experiment | Accuracy | F1-score | Recall | Precision | Mean time (s) | N_tokens | max_length |
|:--:|:-----------|:--------:|:--------:|:------:|:---------:|:-------------:|:--------:|:----------:|
| xp7 | xp4 + Stemming + Stopwords Removal | 92.47% | 88.46% | 79.66% | 100.00% | 210.320 | 14212 | 2817 |
| xp8 | xp4 + Lemmatization + Stopwords Removal | 92.47% | 88.46% | 79.66% | 100.00% | 225.580 | 16081 | 2726 |
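
A shallow network of the kind used for these comparisons can be sketched as follows. Layer sizes, hyperparameters, and the toy data are placeholders, not the project's real features; only the architecture (one hidden layer, sigmoid output, gradient descent on binary cross-entropy) matches the SNN described above.

```python
# Minimal shallow-neural-network (SNN) binary classifier sketch:
# one tanh hidden layer, sigmoid output, plain gradient descent.
# All sizes and data here are illustrative placeholders.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_snn(X, y, hidden=8, lr=0.5, epochs=2000):
    n, d = X.shape
    W1 = rng.normal(0.0, 0.5, (d, hidden)); b1 = np.zeros(hidden)
    W2 = rng.normal(0.0, 0.5, (hidden, 1)); b2 = np.zeros(1)
    for _ in range(epochs):
        h = np.tanh(X @ W1 + b1)             # hidden activations
        p = sigmoid(h @ W2 + b2).ravel()     # P(eligible)
        g = ((p - y) / n)[:, None]           # dBCE/dlogit per sample
        gh = (g @ W2.T) * (1.0 - h ** 2)     # backprop through tanh
        W2 -= lr * (h.T @ g); b2 -= lr * g.sum(axis=0)
        W1 -= lr * (X.T @ gh); b1 -= lr * gh.sum(axis=0)
    return lambda Xq: sigmoid(np.tanh(Xq @ W1 + b1) @ W2 + b2).ravel() > 0.5

# Toy, linearly separable stand-in for document feature vectors
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
predict = train_snn(X, y)
```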

Even so, a choice had to be made between these two strong options: xp7 has a shorter training time and fewer unique tokens, while xp8 has a smaller maximum document length. The deciding criterion was the computational cost of training the vector-representation models (word, sentence, and document embeddings); the training times are so close that they carried little weight in the analysis.
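
The two statistics that drove this comparison can be recomputed from any tokenized corpus. A sketch, assuming N_tokens in Table 5 means the number of unique tokens (vocabulary size) and max_length the longest document in tokens; the documents below are toy stand-ins:

```python
# Recompute the Table 5 base statistics from a tokenized corpus.
# Assumption: N_tokens = vocabulary size, max_length = longest
# document in tokens.

def base_stats(tokenized_docs):
    vocab = {t for doc in tokenized_docs for t in doc}
    return {"N_tokens": len(vocab),
            "max_length": max(len(doc) for doc in tokenized_docs)}

docs = [["fund", "research", "grant"],
        ["grant", "deadline"],
        ["fund", "grant", "open", "call", "deadline"]]
print(base_stats(docs))
```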

As a last step, a spreadsheet was generated for the chosen base (xp8) with the fields opo_pre and opo_pre_tkn, containing the preprocessed text as sentences and as tokens, respectively. This [database](https://github.com/mcti-sefip/mcti-sefip-ppfcd2020/blob/pre-processamento/Pre_Processamento/oportunidades_final_pre_processado.xlsx) was made available on the project's GitHub with the added columns opo_pre (text) and opo_pre_tkn (tokenized).
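
The layout of those two columns can be illustrated as below. The published artifact is an .xlsx file; stdlib csv is used here only to show the opo_pre / opo_pre_tkn shape, and the row content is invented for illustration.

```python
# Illustrative layout of the published opo_pre / opo_pre_tkn columns.
# The real file is .xlsx; csv is used here just to show the shape.
import csv
import io
import json

rows = [{"opo_pre": "fund research grant",
         "opo_pre_tkn": json.dumps(["fund", "research", "grant"])}]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["opo_pre", "opo_pre_tkn"])
writer.writeheader()
writer.writerows(rows)
print(buf.getvalue())
```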

### Pretraining

computational power was needed to fine-tune the weights. The results, with the related metrics, can be viewed in Table 7. This approach achieved adequate accuracy scores, above 82% in all implemented architectures.
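
The "pre-trained Longformer + ML model" pattern feeds document embeddings from a frozen encoder into a small downstream classifier. A sketch of the pooling step is below; the encoder calls are outlined only in comments because they require downloading model weights, and the model name shown is the public Longformer checkpoint, not necessarily the one this project used.

```python
# Sketch of the Longformer-embeddings-to-classifier pattern: pool the
# encoder's token vectors into one document vector, then hand it to a
# downstream ML model. Only the pooling step is executable here.
import numpy as np

def mean_pool(hidden: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Average token vectors, ignoring padding positions."""
    # hidden: (tokens, dim); mask: (tokens,) with 1 for real tokens
    w = mask[:, None].astype(float)
    return (hidden * w).sum(axis=0) / w.sum()

# Outline with Hugging Face transformers (not run here):
#   tok = AutoTokenizer.from_pretrained("allenai/longformer-base-4096")
#   enc = AutoModel.from_pretrained("allenai/longformer-base-4096")
#   batch = tok(text, return_tensors="pt", truncation=True)
#   hidden = enc(**batch).last_hidden_state[0].detach().numpy()
#   doc_vec = mean_pool(hidden, batch["attention_mask"][0].numpy())
# doc_vec then feeds the downstream model (e.g. the NN in Table 7).
```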

#### Table 7: Results from Pre-trained Longformer + ML models

| ML Model | Accuracy | F1 Score | Precision | Recall |
|:--------:|:--------:|:--------:|:---------:|:------:|
| NN | 0.8269 | 0.8754 | 0.7950 | 0.9773 |
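
As a quick sanity check, the F1 score in the NN row can be re-derived from the reported precision and recall, since F1 is their harmonic mean; the small discrepancy comes from the inputs themselves being rounded to four decimal places.

```python
# Re-derive the Table 7 NN F1 score from its precision and recall.
precision, recall = 0.7950, 0.9773
f1 = 2 * precision * recall / (precision + recall)
# Matches the reported 0.8754 only approximately, because the
# precision and recall inputs are already rounded values.
print(f"recomputed F1 = {f1:.4f}")
```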