MarcosDib committed
Commit 42b2aa7
1 Parent(s): bdba148

Update README.md

Files changed (1)
  1. README.md +21 -21
README.md CHANGED
@@ -59,23 +59,24 @@ bibendum cursus. Nunc volutpat vitae neque ut bibendum.
59
 
60
  ## Model variations
61
 
62
- With the motivation to increase accuracy obtained with baseline implementation, we implemented a transfer learning
63
 strategy under the assumption that the small amount of data available for training was insufficient for adequate embedding training.
64
- In this context, we considered two approaches:
65
 
66
- i) pre-training wordembeddings using similar datasets for text classification;
67
  ii) using transformers and attention mechanisms (Longformer) to create contextualized embeddings.
68
 
69
- XXXX has originally been released in base and large variations, for cased and uncased input text. The uncased models
70
- also strips out an accent markers. Chinese and multilingual uncased and cased versions followed shortly after.
71
- Modified preprocessing with whole word masking has replaced subpiece masking in a following work, with the release of
72
- two models.
73
-
74
 Other 24 smaller models were released afterward.
75
 
76
 The detailed release history can be found [here](https://huggingface.co/unb-lamfo-nlp-mcti) on the Hugging Face Hub.
77
 
78
- #### Table 1:
79
  | Model | #params | Language |
80
  |------------------------------|:-------:|:--------:|
81
  | [`mcti-base-uncased`] | 110M | English |
@@ -84,8 +85,8 @@ The detailed release history can be found on the [here](https://huggingface.co/u
84
  | [`mcti-large-cased`] | 110M | Chinese |
85
  | [`-base-multilingual-cased`] | 110M | Multiple |
86
 
87
- #### Table 2:
88
- | Dataset | Compatibility to base* |
89
  |--------------------------------------|:----------------------:|
90
  | Labeled MCTI | 100% |
91
  | Full MCTI | 100% |
@@ -146,8 +147,7 @@ output = model(encoded_input)
146
 
147
  ### Limitations and bias
148
 
149
- This model is uncased: it does not make a difference between english
150
- and English.
151
 
152
  Even if the training data used for this model could be characterized as fairly neutral, this model can have biased
153
  predictions:
@@ -182,9 +182,9 @@ This bias will also affect all fine-tuned versions of this model.
182
 
183
  ## Training data
184
 
185
- The BERT model was pretrained on [BookCorpus](https://yknzhu.wixsite.com/mbweb), a dataset consisting of 11,038
186
- unpublished books and [English Wikipedia](https://en.wikipedia.org/wiki/English_Wikipedia) (excluding lists, tables and
187
- headers).
188
 
189
  ## Training procedure
190
 
@@ -204,7 +204,7 @@ to implement the [pre-processing code](https://github.com/mcti-sefip/mcti-sefip-
204
 
205
  Several Python packages were used to develop the preprocessing code:
206
 
207
- #### Table 3: Python packages used
208
  | Objective | Package |
209
  |--------------------------------------------------------|--------------|
210
  | Resolve contractions and slang usage in text | [contractions](https://pypi.org/project/contractions) |
@@ -221,7 +221,7 @@ Several Python packages were used to develop the preprocessing code:
221
 As detailed in the notebook on [GitHub](https://github.com/mcti-sefip/mcti-sefip-ppfcd2020/blob/pre-processamento/Pre_Processamento/MCTI_PPF_Pr%C3%A9_processamento), during pre-processing, code was created to build and evaluate 8 (eight) different
222
 bases, derived from the base of goal 4, by applying the methods shown in Figure 2.
223
 
224
- #### Table 4: Preprocessing methods evaluated
225
  | id | Experiments |
226
  |--------|------------------------------------------------------------------------|
227
  | Base | Original Texts |
@@ -244,7 +244,7 @@ All eight bases were evaluated to classify the eligibility of the opportunity,
244
  neural network (SNN – Shallow Neural Network). The metrics for the eight bases were evaluated. The results are
245
  shown in Table 5.
246
 
247
- #### Table 5: Results obtained in Preprocessing
248
 | id | Experiment | Accuracy | F1-score | Recall | Precision | Mean (s) | N_tokens | max_length |
249
  |--------|------------------------------------------------------------------------|----------|----------|--------|-----------|----------|----------|------------|
250
  | Base | Original Texts | 89,78% | 84,20% | 79,09% | 90,95% | 417,772 | 23788 | 5636 |
@@ -282,7 +282,7 @@ data in a supervised manner. The new coupled model can be seen in Figure 5 under
282
  obtained results with related metrics. With this implementation, we achieved new levels of accuracy with 86% for the CNN
283
  architecture and 88% for the LSTM architecture.
284
 
285
- #### Table 6: Results from Pre-trained WE + ML models
286
  | ML Model | Accuracy | F1 Score | Precision | Recall |
287
  |:--------:|:---------:|:---------:|:---------:|:---------:|
288
  | NN | 0.8269 | 0.8545 | 0.8392 | 0.8712 |
@@ -308,7 +308,7 @@ models, we realized supervised training of the whole model. At this point, only
308
 computational power was needed to perform the fine-tuning of the weights. The results with related metrics can be viewed in Table 7.
309
  This approach achieved adequate accuracy scores, above 82% in all implementation architectures.
310
 
311
- #### Table 7: Results from Pre-trained Longformer + ML models
312
  | ML Model | Accuracy | F1 Score | Precision | Recall |
313
  |:--------:|:---------:|:---------:|:---------:|:---------:|
314
 | NN | 0.8269 | 0.8754 | 0.7950 | 0.9773 |
 
59
 
60
  ## Model variations
61
 
62
+ With the motivation to increase the accuracy obtained with the baseline implementation, we implemented a transfer learning
63
 strategy under the assumption that the small amount of data available for training was insufficient for adequate embedding training.
64
+ In this context, we considered two approaches, sketched in the example below:
65
 
66
+ i) pre-training word embeddings using similar datasets for text classification;
67
  ii) using transformers and attention mechanisms (Longformer) to create contextualized embeddings.
68
 
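The snippet below is a minimal sketch of these two strategies; the corpora, hyperparameters and the `allenai/longformer-base-4096` checkpoint are illustrative assumptions, not necessarily the artifacts used by the project.

```python
# Minimal sketch of the two embedding strategies; corpora, hyperparameters
# and checkpoint names here are illustrative assumptions.
from gensim.models import Word2Vec
from transformers import LongformerModel, LongformerTokenizer

# i) Pre-train Word2Vec embeddings on similar text-classification corpora.
corpus = [["call", "for", "research", "proposals"], ["funding", "opportunity"]]
w2v = Word2Vec(sentences=corpus, vector_size=300, window=5, min_count=1)
print(w2v.wv["funding"].shape)  # (300,)

# ii) Produce contextualized embeddings with a Longformer encoder.
tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
encoder = LongformerModel.from_pretrained("allenai/longformer-base-4096")
inputs = tokenizer("Call for proposals on renewable energy research.",
                   return_tensors="pt")
embeddings = encoder(**inputs).last_hidden_state
print(embeddings.shape)  # (1, sequence_length, 768)
```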
69
 Other 24 smaller models were released afterward.
70
 
71
 The detailed release history can be found [here](https://huggingface.co/unb-lamfo-nlp-mcti) on the Hugging Face Hub.
72
 
73
+ The models that use Word2Vec and Longformer also need to be loaded, and their weights are as follows:
74
+
75
+ Longformer: 10.88 GB
76
+
77
+ Word2Vec: 56.1 MB
78
+
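
Given the sizes above, a rough loading sketch (the file and checkpoint paths below are placeholders, not the exact artifacts shipped with this repository):

```python
# Placeholder paths; point them at the downloaded weight files.
from gensim.models import KeyedVectors
from transformers import LongformerModel

word2vec = KeyedVectors.load("weights/word2vec_mcti.kv")                  # ~56.1 MB
longformer = LongformerModel.from_pretrained("weights/longformer-mcti")   # ~10.88 GB
```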
79
+ Table 1: Model variations
80
  | Model | #params | Language |
81
  |------------------------------|:-------:|:--------:|
82
  | [`mcti-base-uncased`] | 110M | English |
 
85
  | [`mcti-large-cased`] | 110M | Chinese |
86
  | [`-base-multilingual-cased`] | 110M | Multiple |
87
 
88
+ Table 2: Compatibility results (*base = labeled MCTI dataset entries)
89
+ | Dataset | Compatibility to base* |
90
  |--------------------------------------|:----------------------:|
91
  | Labeled MCTI | 100% |
92
  | Full MCTI | 100% |
 
147
 
148
  ### Limitations and bias
149
 
150
+ This model is uncased: it does not make a difference between english and English.
 
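As an illustration (the generic `bert-base-uncased` tokenizer is used here purely as an example of uncased behavior):

```python
from transformers import AutoTokenizer

# Generic uncased tokenizer, used here purely for illustration.
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tok.tokenize("english"))  # ['english']
print(tok.tokenize("English"))  # ['english'] -- the case distinction is lost
```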
151
 
152
  Even if the training data used for this model could be characterized as fairly neutral, this model can have biased
153
  predictions:
 
182
 
183
  ## Training data
184
 
185
+ The [training](https://github.com/chap0lin/PPF-MCTI/tree/master/Datasets) data was obtained with scraping techniques from over 30 different platforms, e.g. The Royal Society and
186
+ the Annenberg Foundation, and contains 928 labeled entries (928 rows x 21 columns). Of the data gathered, only the
187
+ main text content (column u) was used. Text content averages 800 tokens in length, but with high variance, reaching up to 5,000 tokens.
188
 
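A minimal sketch of loading and inspecting the labeled dataset with pandas; the file name below is a placeholder for whichever file you download from the linked Datasets folder.

```python
import pandas as pd

# Placeholder file name; use the spreadsheet downloaded from the Datasets folder.
df = pd.read_excel("mcti_labeled_opportunities.xlsx")

print(df.shape)                      # expected: (928, 21)
texts = df["u"]                      # column "u" holds the main text content
token_counts = texts.str.split().str.len()
print(token_counts.mean(), token_counts.max())   # roughly 800 on average, up to ~5,000
```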
189
  ## Training procedure
190
 
 
204
 
205
  Several Python packages were used to develop the preprocessing code:
206
 
207
+ Table 3: Python packages used
208
  | Objective | Package |
209
  |--------------------------------------------------------|--------------|
210
  | Resolve contractions and slang usage in text | [contractions](https://pypi.org/project/contractions) |
 
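For example, the contractions package listed in Table 3 can be applied to each document before the remaining preprocessing steps (a minimal illustration):

```python
import contractions

raw = "We're funding projects that don't have industry partners yet."
print(contractions.fix(raw))
# -> roughly: "We are funding projects that do not have industry partners yet."
```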
221
 As detailed in the notebook on [GitHub](https://github.com/mcti-sefip/mcti-sefip-ppfcd2020/blob/pre-processamento/Pre_Processamento/MCTI_PPF_Pr%C3%A9_processamento), during pre-processing, code was created to build and evaluate 8 (eight) different
222
 bases, derived from the base of goal 4, by applying the methods shown in Figure 2.
223
 
224
+ Table 4: Preprocessing methods evaluated
225
  | id | Experiments |
226
  |--------|------------------------------------------------------------------------|
227
  | Base | Original Texts |
 
244
  neural network (SNN – Shallow Neural Network). The metrics for the eight bases were evaluated. The results are
245
  shown in Table 5.
246
 
247
+ Table 5: Results obtained in Preprocessing
248
 | id | Experiment | Accuracy | F1-score | Recall | Precision | Mean (s) | N_tokens | max_length |
249
  |--------|------------------------------------------------------------------------|----------|----------|--------|-----------|----------|----------|------------|
250
  | Base | Original Texts | 89,78% | 84,20% | 79,09% | 90,95% | 417,772 | 23788 | 5636 |
 
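A minimal sketch of the kind of shallow neural network (SNN) used to score each base; the TF-IDF feature extraction and the layer sizes are assumptions made for illustration only.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from tensorflow import keras

# Toy data standing in for one of the preprocessed bases.
texts = ["call for proposals on health research", "internal administrative notice"]
labels = np.array([1, 0])            # 1 = eligible opportunity, 0 = not eligible

X = TfidfVectorizer().fit_transform(texts).toarray()

snn = keras.Sequential([
    keras.layers.Input(shape=(X.shape[1],)),
    keras.layers.Dense(64, activation="relu"),      # single hidden layer -> "shallow"
    keras.layers.Dense(1, activation="sigmoid"),
])
snn.compile(optimizer="adam", loss="binary_crossentropy",
            metrics=["accuracy", keras.metrics.Precision(), keras.metrics.Recall()])
snn.fit(X, labels, epochs=3, verbose=0)
```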
282
  obtained results with related metrics. With this implementation, we achieved new levels of accuracy with 86% for the CNN
283
  architecture and 88% for the LSTM architecture.
284
 
285
+ Table 6: Results from Pre-trained WE + ML models
286
  | ML Model | Accuracy | F1 Score | Precision | Recall |
287
  |:--------:|:---------:|:---------:|:---------:|:---------:|
288
  | NN | 0.8269 | 0.8545 | 0.8392 | 0.8712 |
 
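A minimal sketch of coupling a pre-trained embedding matrix with an LSTM classifier, as described above; the vocabulary size, sequence length and layer sizes are illustrative, and a random matrix stands in for the Word2Vec weights.

```python
import numpy as np
from tensorflow import keras

vocab_size, embed_dim, max_len = 20000, 300, 500            # illustrative values
embedding_matrix = np.random.rand(vocab_size, embed_dim)    # stand-in for Word2Vec weights

embedding = keras.layers.Embedding(vocab_size, embed_dim, trainable=False)
model = keras.Sequential([
    keras.layers.Input(shape=(max_len,)),
    embedding,                                   # frozen, pre-trained word embeddings
    keras.layers.LSTM(128),
    keras.layers.Dense(1, activation="sigmoid"),
])
embedding.set_weights([embedding_matrix])        # copy the pre-trained vectors in
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```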
308
 computational power was needed to perform the fine-tuning of the weights. The results with related metrics can be viewed in Table 7.
309
  This approach achieved adequate accuracy scores, above 82% in all implementation architectures.
310
 
311
+ Table 7: Results from Pre-trained Longformer + ML models
312
  | ML Model | Accuracy | F1 Score | Precision | Recall |
313
  |:--------:|:---------:|:---------:|:---------:|:---------:|
314
 | NN | 0.8269 | 0.8754 | 0.7950 | 0.9773 |
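
A minimal sketch of the Longformer-as-feature-extractor setup described above, assuming the public `allenai/longformer-base-4096` checkpoint and a small dense classification head; all names and sizes are illustrative.

```python
import torch
from transformers import LongformerModel, LongformerTokenizer

tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
encoder = LongformerModel.from_pretrained("allenai/longformer-base-4096")
encoder.eval()                       # frozen feature extractor; only the head is trained

# Small dense classification head on top of the Longformer features.
head = torch.nn.Sequential(
    torch.nn.Linear(encoder.config.hidden_size, 64),
    torch.nn.ReLU(),
    torch.nn.Linear(64, 1),
    torch.nn.Sigmoid(),
)

text = "Funding opportunity for collaborative research projects."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=4096)
with torch.no_grad():
    features = encoder(**inputs).last_hidden_state[:, 0, :]   # embedding of the <s> token
print(head(features))                # eligibility score in [0, 1]
```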