Update README.md
Disclaimer: The Brazilian Ministry of Science, Technology, and Innovation (MCTI) has partially supported this project.

The model [NLP MCTI Classification Multi](https://huggingface.co/spaces/unb-lamfo-nlp-mcti/NLP-W2V-CNN-Multi) is part of the project [Research Financing Product Portfolio (FPP)](https://huggingface.co/unb-lamfo-nlp-mcti) and focuses on the Text Classification task, exploring different machine learning strategies to classify a small amount of long, unstructured, and uneven data and find a method with good performance. Pre-training and word-embedding solutions were used to learn word relationships from other datasets with considerable similarity and a larger scale. Then, using the acquired resources, based on the dataset available in the MCTI, transfer learning plus deep learning models were applied to improve the understanding of each sentence.
little insight and valuable relationships to work with their best features. In this way, learning is facilitated and gradient descent converges more quickly.

The first layer of the model is a pre-trained Word2Vec embedding layer, used as a method of extracting features from the data that replaces one-hot encoding with dimensional reduction. The pre-training of this model is explained further in this document.

After the embedding layer comes the CNN classification model. The architecture of the CNN network is composed of a 50% dropout layer followed by two 1D convolution layers associated with a MaxPooling layer. After max pooling, a dense layer of size 128 is added, connected to a 50% dropout, which finally connects to a flatten layer and the final classification dense layer. The dropout layers help to avoid overfitting by masking part of the data, so that the network learns to create redundancies in the analysis of the inputs.
## Model variations

Table 1 below presents the results of several implementations with different architectures, highlighting the accuracy, f1-score, recall, and precision obtained in training each network.

Table 1: Results of experiments
### How to use

This model is available on Hugging Face Spaces, where it can be applied to Excel files containing scraped opportunity data:

- [NLP MCTI Classification Multi](https://huggingface.co/spaces/unb-lamfo-nlp-mcti/NLP-W2V-CNN-Multi)

You can also find the training and evaluation notebooks in the GitHub repository:

- [PPF-MCTI Repository](https://github.com/chap0lin/PPF-MCTI)
### Limitations and bias

This model is uncased: it does not make a difference between english and English.

This model depends on high-quality scraped data. Since the model understands a finite number of words, the input needs to have little to no encoding errors or stray markup, so that the preprocessing can remove them and correctly identify the words.

Even if the training data used for this model could be characterized as fairly neutral, this model can have biased predictions:
and it has a delicate problem for replication in production. This detail is pending further study to define whether it is possible to use one of these models.

This bias will also affect all fine-tuned versions of this model.

## Training data

Several Python packages were used to develop the preprocessing code:

Table 3: Python packages used

| Objective                                    | Package |
|----------------------------------------------|---------|
| Resolve contractions and slang usage in text | [contractions](https://pypi.org/project/contractions) |
As detailed in the notebook on [GitHub](https://github.com/mcti-sefip/mcti-sefip-ppfcd2020/blob/pre-processamento/Pre_Processamento/MCTI_PPF_Pr%C3%A9_processamento), in the pre-processing, code was created to build and evaluate 8 (eight) different bases, derived from the base of goal 4, with the application of the methods shown in Table 4.

Table 4: Preprocessing methods evaluated

| id   | Experiments    |
|------|----------------|
| Base | Original Texts |
| xp5  | xp4 + Stemming                          |
| xp6  | xp4 + Lemmatization                     |
| xp7  | xp4 + Stemming + Stopwords Removal      |
| xp8  | xp4 + Lemmatization + Stopwords Removal |

First, the treatment of punctuation and capitalization was evaluated. This phase resulted in the construction and evaluation of the first four bases (xp1, xp2, xp3, xp4).
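The experiments in Table 4 combine a handful of simple text transformations. A stdlib-only sketch of how such bases can be derived from an original text (the tiny stopword list and the crude suffix-stripping "stemmer" below are illustrative stand-ins for real NLP tooling, and the exact xp1-xp4 combinations are assumed for illustration):

```python
import re
import string

STOPWORDS = {"the", "a", "an", "of", "and", "to", "in"}  # tiny illustrative list

def lowercase(text):
    return text.lower()

def strip_punctuation(text):
    return text.translate(str.maketrans("", "", string.punctuation))

def remove_stopwords(text):
    return " ".join(w for w in text.split() if w not in STOPWORDS)

def stem(text):
    # Crude suffix stripping, standing in for a real stemmer.
    return " ".join(re.sub(r"(ing|ed|s)$", "", w) for w in text.split())

base = "The Funding of Research Projects."
xp1 = lowercase(base)                      # one capitalization treatment
xp4 = strip_punctuation(xp1)               # capitalization + punctuation treatment
xp7 = stem(remove_stopwords(xp4))          # xp4 + stemming + stopwords removal
```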
All eight bases were evaluated to classify the eligibility of the opportunity, through the training of a shallow neural network (SNN – Shallow Neural Network). The metrics for the eight bases were evaluated. The results are shown in Table 5.

Table 5: Results obtained in preprocessing

| id   | Experiment     | accuracy | f1-score | recall | precision | Mean (s) | N_tokens | max_length |
|------|----------------|----------|----------|--------|-----------|----------|----------|------------|
| Base | Original Texts | 89.78%   | 84.20%   | 79.09% | 90.95%    | 417.772  | 23788    | 5636       |
The alternative was to use web scraping algorithms to acquire more unlabeled data from the same sources, thus ensuring compatibility. The original dataset had 260 labeled entries.

Table 6: Compatibility results (*base = labeled MCTI dataset entries)

| Dataset            | Compatibility |
|--------------------|:-------------:|
| Labeled MCTI       |     100%      |
| BBC News Articles  |    56.77%     |
| New unlabeled MCTI |    75.26%     |
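The card does not spell out how the compatibility percentages were computed; one plausible metric (an assumption, for illustration only) is the share of a candidate corpus's vocabulary that also occurs in the labeled base:

```python
def vocabulary(texts):
    """Set of unique lowercase tokens across a corpus."""
    return {w for t in texts for w in t.lower().split()}

def compatibility(candidate_texts, base_texts):
    """Assumed metric: fraction of candidate vocabulary covered by the base vocabulary."""
    cand, base = vocabulary(candidate_texts), vocabulary(base_texts)
    return len(cand & base) / len(cand)

# Toy corpora standing in for the real datasets.
base = ["research funding opportunity", "grant call for proposals"]
candidate = ["funding call deadline", "research grant news"]
print(f"{compatibility(candidate, base):.2%}")  # → 66.67%
```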
Table 7: Results from pre-trained WE + ML models

| ML Model | Accuracy | F1 Score | Precision | Recall |
|:--------:|:--------:|:--------:|:---------:|:------:|
|    NN    |  0.8269  |  0.8545  |  0.8392   | 0.8712 |
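A pre-trained word embedding (WE) is typically handed to downstream ML models as an embedding matrix indexed by the tokenizer vocabulary. A minimal sketch with toy, hand-written vectors (the real Word2Vec vectors are learned from a large corpus):

```python
import numpy as np

# Toy stand-ins for pre-trained Word2Vec vectors.
word_vectors = {
    "research": np.array([0.1, 0.2]),
    "funding":  np.array([0.3, 0.4]),
}
embed_dim = 2

# Tokenizer vocabulary: index 0 reserved for padding; out-of-vocabulary
# words keep a zero vector.
vocab = {"<pad>": 0, "research": 1, "funding": 2, "blockchain": 3}

embedding_matrix = np.zeros((len(vocab), embed_dim))
for word, idx in vocab.items():
    if word in word_vectors:
        embedding_matrix[idx] = word_vectors[word]

# Row i holds the vector for the word with index i, so a sequence of
# token ids maps to feature vectors by simple indexing:
sequence = [1, 2, 0]                    # "research funding <pad>"
features = embedding_matrix[sequence]   # shape (3, embed_dim)
```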
In addition, the necessary times for training each epoch, the data validation execution time, and the weight of the deep learning model associated with each implementation were added.

Table 8: Results of experiments

| Model                 | Accuracy | F1-score | Recall | Precision | Training time epoch(s) | Validation time (s) | Weight(MB) |
|-----------------------|----------|----------|--------|-----------|------------------------|---------------------|------------|
| Keras Embedding + SNN | 92.47    | 88.46    | 79.66  | 100.00    | 0.2                    | 0.7                 | 1.8        |
The Word2Vec and Longformer models also need to be loaded, and their weights are as follows:

Table 9: Weights of the Word2Vec and Longformer models

| Model      | Weight |
|------------|--------|
| Longformer | 10.9GB |
| Word2Vec   | 56.1MB |