Update README.md
## Model variations

With the motivation to increase the accuracy obtained with the baseline implementation, a transfer learning
strategy was implemented, under the assumption that the small amount of data available for training was insufficient for adequate embedding training.
In this context, two approaches were considered:

i) pre-training word embeddings using similar datasets for text classification;
ii) using transformers and attention mechanisms (Longformer) to create contextualized embeddings.

Another 24 smaller models were released afterward.

The detailed release history can be found [here](https://huggingface.co/unb-lamfo-nlp-mcti).

The models that use Word2Vec and Longformer also need to be loaded, and their weights are the following:

Longformer: 10.88 GB

Word2Vec: 56.1 MB

Table 1:

| Model                        | #params | Language |
|------------------------------|:-------:|:--------:|
| [`mcti-base-uncased`]        | 110M    | English  |
| [`mcti-large-cased`]         | 110M    | Chinese  |
| [`-base-multilingual-cased`] | 110M    | Multiple |

Table 2: Compatibility results (*base = labeled MCTI dataset entries)

| Dataset      | Compatibility |
|--------------|:-------------:|
| Labeled MCTI | 100%          |
| Full MCTI    | 100%          |
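The difference between the two approaches can be sketched with toy vectors (a minimal illustration with made-up numbers, not the repository's Word2Vec or Longformer code): a static embedding assigns one fixed vector per word, while a contextualized embedding lets the vector depend on the surrounding words.

```python
# Toy illustration: static vs. contextualized embeddings.
# The vectors below are made up; real models learn them from data.
static_emb = {"bank": [0.2, 0.7], "river": [0.1, 0.9], "money": [0.8, 0.1]}

def embed_static(tokens):
    """Approach (i): one fixed vector per word, regardless of context."""
    return [static_emb[t] for t in tokens]

def embed_contextual(tokens):
    """Approach (ii), crudely: mix each word's vector with its context average."""
    out = []
    for t in tokens:
        ctx = [static_emb[n] for n in tokens if n != t] or [static_emb[t]]
        avg = [sum(v[d] for v in ctx) / len(ctx) for d in range(2)]
        out.append([(a + b) / 2 for a, b in zip(static_emb[t], avg)])
    return out

# "bank" gets identical static vectors in both sentences...
print(embed_static(["river", "bank"])[1] == embed_static(["money", "bank"])[1])   # True
# ...but different contextualized vectors, since the context differs.
print(embed_contextual(["river", "bank"])[1] == embed_contextual(["money", "bank"])[1])  # False
```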
### Limitations and bias

This model is uncased: it does not make a difference between english and English.

Even if the training data used for this model could be characterized as fairly neutral, this model can have biased
predictions:
## Training data

The [input training data](https://github.com/chap0lin/PPF-MCTI/tree/master/Datasets) was obtained with web-scraping techniques from over 30 different platforms, e.g. The Royal Society and
the Annenberg Foundation, and contained 928 labeled entries (928 rows x 21 columns). Of the data gathered, only
the main text content (column u) was used. Text content averages 800 tokens in length, but with high variance, reaching up to 5,000 tokens.
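Because texts reach up to 5,000 tokens, many exceed the 512-token window of standard BERT-style models, and some exceed even the 4,096-token window commonly used by Longformer checkpoints. A rough length check can be sketched as follows (whitespace splitting is a crude proxy for real subword tokenization, and the texts are made up):

```python
# Sketch: flag documents longer than a model's input window.
# Whitespace splitting approximates tokenization; example texts are made up.
texts = [
    "short funding call " * 100,           # ~300 whitespace tokens
    "very long opportunity text " * 1300,  # ~5200 whitespace tokens
]
MAX_LEN = 4096  # typical Longformer window (vs. 512 for BERT)
lengths = [len(t.split()) for t in texts]
print([n for n in lengths if n > MAX_LEN])  # only the second text overflows
```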
## Training procedure

Several Python packages were used to develop the preprocessing code:

Table 3: Python packages used

| Objective                                    | Package |
|----------------------------------------------|---------|
| Resolve contractions and slang usage in text | [contractions](https://pypi.org/project/contractions) |
As detailed in the notebook on [GitHub](https://github.com/mcti-sefip/mcti-sefip-ppfcd2020/blob/pre-processamento/Pre_Processamento/MCTI_PPF_Pr%C3%A9_processamento), in the pre-processing step, code was created to build and evaluate 8 (eight) different
bases, derived from the base of goal 4, with the application of the methods shown in Figure 2.

Table 4: Preprocessing methods evaluated

| id   | Experiments    |
|------|----------------|
| Base | Original Texts |
neural network (SNN – Shallow Neural Network). The metrics for the eight bases were evaluated. The results are
shown in Table 5.

Table 5: Results obtained in preprocessing

| id   | Experiment     | accuracy | f1-score | recall | precision | Mean (s) | N_tokens | max_length |
|------|----------------|----------|----------|--------|-----------|----------|----------|------------|
| Base | Original Texts | 89.78%   | 84.20%   | 79.09% | 90.95%    | 417.772  | 23788    | 5636       |
obtained results with related metrics. With this implementation, we achieved new levels of accuracy, with 86% for the CNN
architecture and 88% for the LSTM architecture.

Table 6: Results from pre-trained WE + ML models

| ML Model | Accuracy | F1 Score | Precision | Recall |
|:--------:|:--------:|:--------:|:---------:|:------:|
| NN       | 0.8269   | 0.8545   | 0.8392    | 0.8712 |
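The F1 scores in these tables are the harmonic mean of precision and recall, which can be checked against the NN row (small discrepancies in the last digit are expected, since the published precision and recall are themselves rounded):

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# NN row above: precision 0.8392, recall 0.8712 -> reported F1 0.8545
print(round(f1(0.8392, 0.8712), 4))
```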
computational power was needed to realize the fine-tuning of the weights. The results with related metrics can be viewed in Table 7.
This approach achieved adequate accuracy scores, above 82% in all implementation architectures.

Table 7: Results from pre-trained Longformer + ML models

| ML Model | Accuracy | F1 Score | Precision | Recall |
|:--------:|:--------:|:--------:|:---------:|:------:|
| NN       | 0.8269   | 0.8754   | 0.7950    | 0.9773 |
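All four columns derive from a single confusion matrix. A toy calculation with made-up counts (chosen only to mirror this row's high-recall, lower-precision pattern, not the MCTI evaluation data) shows how they relate:

```python
# Illustrative confusion-matrix counts (made up, not the MCTI data):
# many false positives, almost no false negatives -> high recall, lower precision.
tp, fp, fn, tn = 86, 22, 2, 90

accuracy  = (tp + tn) / (tp + fp + fn + tn)   # 0.88
precision = tp / (tp + fp)                    # ~0.796
recall    = tp / (tp + fn)                    # ~0.977
f1        = 2 * precision * recall / (precision + recall)
print(round(accuracy, 2), round(precision, 3), round(recall, 3), round(f1, 3))
```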
|