MarcosDib committed
Commit: 6d1e759 (parent: 98b88df)

Update README.md

Files changed (1): README.md (+10 -5)
README.md CHANGED
@@ -262,14 +262,14 @@ document-embedding). The training time is so close that it did not have such a l
 
 As a last step, a spreadsheet was generated for the model (xp8) with the fields opo_pre and opo_pre_tkn, containing the
 preprocessed text in sentence format and tokens, respectively. This [database](https://github.com/mcti-sefip/mcti-sefip-ppfcd2020/blob/pre-processamento/Pre_Processamento/oportunidades_final_pre_processado.xlsx) was made
-available on the project's GitHub with the inclusion of columns \\(\bf opo_pre)\\ (text) and \textbf{opo_pre_tkn} (tokenized)\\.
+available on the project's GitHub with the inclusion of columns opo_pre (text) and opo_pre_tkn (tokenized).
 
 ### Pretraining
 
-The model was trained on 4 cloud TPUs in Pod configuration (16 TPU chips total) for one million steps with a batch size
-of 256. The sequence length was limited to 128 tokens for 90% of the steps and 512 for the remaining 10%. The optimizer
-used is Adam with a learning rate of 1e-4, \\(\beta_{1} = 0.9\\) and \\(\beta_{2} = 0.999\\), a weight decay of 0.01,
-learning rate warmup for 10,000 steps and linear decay of the learning rate after.
+Since labeled data is scarce, the word embeddings were trained in an unsupervised manner on other datasets that contain most
+of the words the model needs to learn. The alternative was to use web-scraping algorithms to acquire more unlabeled data from
+the same sources, which gives a higher chance of obtaining compatible texts. The original dataset had 357 entries, with 260 of
+them labeled.
 
 ## Evaluation results
 
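To ground the hunk above, the sketch below reproduces the two steps it describes: deriving the opo_pre_tkn token column from opo_pre, then training word2vec without labels on the full corpus. It is a minimal illustration with pandas and gensim (4.x assumed); the file name comes from the linked spreadsheet, and every hyperparameter is an assumption rather than the project's published configuration.

```python
# Minimal sketch of the described pipeline, not the project's exact script.
import pandas as pd
from gensim.models import Word2Vec

# Spreadsheet published on the project's GitHub (assumed downloaded locally).
df = pd.read_excel("oportunidades_final_pre_processado.xlsx")

# opo_pre holds preprocessed sentences; opo_pre_tkn is its tokenized form.
df["opo_pre_tkn"] = df["opo_pre"].astype(str).str.split()

# Unsupervised word2vec training over every entry, labeled or not;
# vector_size/window/min_count/epochs are illustrative assumptions.
w2v = Word2Vec(
    sentences=df["opo_pre_tkn"].tolist(),
    vector_size=100,
    window=5,
    min_count=2,
    epochs=10,
)
w2v.save("word2vec_opo.model")
```

Training on the whole corpus, including scraped unlabeled texts, is the point of the unsupervised step: the embeddings see far more words than the 260 labeled entries alone could provide.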
@@ -281,6 +281,11 @@ data in a supervised manner. The new coupled model can be seen in Figure 5 under
 obtained results with related metrics. With this implementation, we achieved new levels of accuracy with 86% for the CNN
 architecture and 88% for the LSTM architecture.
 
+We can couple it to our classification models (Fig. 4), realizing transfer learning, and then train the model with the labeled
+data in a supervised manner. The new coupled model can be seen in Figure 5 under word2vec model training. Table 3 shows the
+obtained results with related metrics. With this implementation, we achieved new levels of accuracy with 86% for the CNN
+architecture and 88% for the LSTM architecture.
+
 Table 6: Results from Pre-trained WE + ML models
 | ML Model | Accuracy | F1 Score | Precision | Recall |
 |:--------:|:---------:|:---------:|:---------:|:---------:|
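The coupling that the added paragraph describes can be sketched as follows: the pretrained word2vec vectors seed a Keras Embedding layer that feeds an LSTM classifier, which is then fine-tuned on the labeled data. This is an illustration under assumptions (LSTM width, binary sigmoid output, hyperparameters); the actual architectures are the ones shown in Figures 4 and 5 of the README.

```python
# Sketch of coupling pretrained embeddings to an LSTM classifier;
# layer sizes and the binary output are illustrative assumptions.
import numpy as np
from gensim.models import Word2Vec
from tensorflow import keras

w2v = Word2Vec.load("word2vec_opo.model")  # model from the previous sketch
vocab_size, dim = len(w2v.wv), w2v.wv.vector_size

# Row i of the matrix holds the pretrained vector for word i.
embedding_matrix = np.zeros((vocab_size, dim))
for i, word in enumerate(w2v.wv.index_to_key):
    embedding_matrix[i] = w2v.wv[word]

model = keras.Sequential([
    # Embedding seeded with the unsupervised vectors and left trainable,
    # so the supervised pass over the labeled entries can adjust it.
    keras.layers.Embedding(
        vocab_size,
        dim,
        embeddings_initializer=keras.initializers.Constant(embedding_matrix),
        trainable=True,
    ),
    keras.layers.LSTM(64),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(X_tokens, y_labels, ...)  # supervised fine-tuning step
```

Setting trainable=True is what makes this transfer learning rather than a frozen feature extractor: the embeddings keep adapting while the classifier trains.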
 
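For reference, the four columns of Table 6 are the standard classification metrics from scikit-learn; a minimal sketch of how they are computed (the label vectors are placeholders, not project data):

```python
# Computing the Table 6 columns with scikit-learn; inputs are placeholders.
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

y_true = [1, 0, 1, 1, 0]  # gold labels (placeholder)
y_pred = [1, 0, 1, 0, 0]  # model predictions (placeholder)

print("Accuracy :", accuracy_score(y_true, y_pred))
print("F1 Score :", f1_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
```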