As a last step, a spreadsheet was generated for the model (xp8) with the fields opo_pre and opo_pre_tkn, containing the preprocessed text in sentence format and tokens, respectively. This [database](https://github.com/mcti-sefip/mcti-sefip-ppfcd2020/blob/pre-processamento/Pre_Processamento/oportunidades_final_pre_processado.xlsx) was made available on the project's GitHub with the inclusion of the columns opo_pre (text) and opo_pre_tkn (tokenized).

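The relationship between the two columns can be illustrated in a few lines; a minimal sketch, assuming simple whitespace tokenization and hypothetical sample rows (the real data lives in the spreadsheet linked above, and the project's actual tokenizer may differ):

```python
# Sketch: derive the opo_pre_tkn (token list) column from opo_pre (sentence text).
# The rows below are hypothetical stand-ins for the spreadsheet's contents.
rows = [
    {"opo_pre": "funding opportunity for research projects"},
    {"opo_pre": "call for proposals in applied science"},
]

for row in rows:
    # Assumption: tokens come from whitespace splitting of the preprocessed text.
    row["opo_pre_tkn"] = row["opo_pre"].split()

print(rows[0]["opo_pre_tkn"])  # ['funding', 'opportunity', 'for', 'research', 'projects']
```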
### Pretraining

Since labeled data is scarce, the word embeddings were trained in an unsupervised manner on other datasets that contain most of the words the model needs to learn. The alternative was to use web-scraping algorithms to acquire more unlabeled data from the same sources, which would give a higher chance of providing compatible texts. The original dataset had 357 entries, 260 of them labeled.

## Evaluation results

We can couple the pretrained word2vec model to our classification models (Fig. 4), realizing transfer learning, and then train the model with the labeled data in a supervised manner. The new coupled model can be seen in Figure 5 under word2vec model training. Table 3 shows the obtained results with related metrics. With this implementation, we achieved new levels of accuracy: 86% for the CNN architecture and 88% for the LSTM architecture.

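The coupling idea (frozen pretrained vectors feeding a classifier head that is trained on the labeled data) can be sketched without the deep-learning stack. A toy version follows: the vocabulary, dimensions, and labels are all illustrative, and the project uses CNN/LSTM heads rather than this logistic-regression head:

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen pretrained embeddings (hypothetical 4-word vocabulary, dim 8).
# In the project, these rows would come from the pretrained word2vec model.
vocab = {"funding": 0, "research": 1, "call": 2, "projects": 3}
embeddings = rng.normal(size=(len(vocab), 8))  # stand-in for word2vec vectors

def featurize(tokens):
    # Transfer-learning step: look up frozen vectors and mean-pool them,
    # so only the classifier head is trained with the labeled data.
    idx = [vocab[t] for t in tokens if t in vocab]
    return embeddings[idx].mean(axis=0)

X = np.stack([featurize(["funding", "research"]), featurize(["call", "projects"])])
y = np.array([1, 0])

# Minimal logistic-regression head trained in a supervised manner.
w = np.zeros(8)
b = 0.0
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    grad = p - y
    w -= 0.5 * X.T @ grad / len(y)
    b -= 0.5 * grad.mean()

preds = (1.0 / (1.0 + np.exp(-(X @ w + b))) > 0.5).astype(int)
print(preds)  # should recover the training labels on this separable toy data
```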
Table 6: Results from Pre-trained WE + ML models
| ML Model | Accuracy | F1 Score | Precision | Recall |
|:--------:|:---------:|:---------:|:---------:|:---------:|
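The four metrics reported in the table all derive from confusion-matrix counts; a quick sketch with hypothetical counts (not the project's actual numbers):

```python
# Hypothetical confusion-matrix counts: true/false positives and negatives.
tp, fp, fn, tn = 40, 6, 8, 46

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 2), round(precision, 3), round(recall, 3), round(f1, 3))
```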