chap0lin committed on
Commit
69370ac
1 Parent(s): b192b15

Update README.md

Files changed (1)
  1. README.md +26 -35
README.md CHANGED
@@ -23,10 +23,10 @@ thumbnail: https://github.com/Marcosdib/S2Query/Classification_Architecture_mode
23
 
24
  Disclaimer: The Brazilian Ministry of Science, Technology, and Innovation (MCTI) has partially supported this project.
25
 
26
- The model [NLP MCTI Classification Multi](https://huggingface.co/spaces/unb-lamfo-nlp-mcti/NLP-W2V-CNN-Multi) is part of the project [Research Financing Product Portfolio (FPP)](https://huggingface.co/unb-lamfo-nlp-mcti) focuses
27
- on the task of Text Classification and explores different machine learning strategies to classify a small amount
28
  of long, unstructured, and uneven data to find a proper method with good performance. Pre-training and word embedding
29
- solutions were used to learn word relationships from other datasets with considerable similarity and larger scale.
30
  Then, using the acquired resources, based on the dataset available in the MCTI, transfer learning plus deep learning
31
  models were applied to improve the understanding of each sentence.
32
 
@@ -47,10 +47,10 @@ As the input data is compose of unstructured and nonuniform texts it is essentia
47
 little insights and valuable relationships to work with their best features. In this way, learning is
48
 facilitated and allows gradient descent to converge more quickly.
49
 
50
- The first layer of the model is an embedding layer as a method of extracting features from the data that can
51
- replace one-hot coding with dimensional reduction.
52
 
53
- The architecture of the CNN network is composed of a 50% dropout layer followed by two 1D convolution layers
54
 associated with a MaxPooling layer. After max pooling, a dense layer of size 128 is added, connected to
55
 a 50% dropout which finally connects to a flatten layer and the final classification dense layer. The dropout layers
56
 helped to avoid network overfitting by masking part of the data so that the network learned to create
@@ -60,7 +60,7 @@ redundancies in the analysis of the inputs.
60
 
61
  ## Model variations
62
 
63
- Table x below presents the results of several implementations with different architectures, highlighting the
64
  accuracy, f1-score, recall and precision results obtained in the training of each network.
65
 
66
  Table 1: Results of experiments
@@ -116,27 +116,20 @@ required for training since the GPU performs matrix operations faster then a CPU
116
 
117
  ### How to use
118
 
119
- You can use this model directly with a pipeline for masked language modeling:
 
120
 
121
- ```python
 
122
 
123
- ```
124
-
125
- Here is how to use this model to get the features of a given text in PyTorch:
126
-
127
- ```python
128
-
129
- ```
130
-
131
- and in TensorFlow:
132
-
133
- ```python
134
- ```
135
 
136
  ### Limitations and bias
137
 
138
  This model is uncased: it does not make a difference between english and English.
139
 
140
  Even if the training data used for this model could be characterized as fairly neutral, this model can have biased
141
  predictions:
142
 
@@ -149,8 +142,6 @@ Replicability limitation: Due to the simplicity of the keras embedding model, we
149
 and it is delicate to replicate in production. This point is pending further study to determine
150
 whether it is possible to use one of these models.
151
 
152
- -
153
- -
154
  This bias will also affect all fine-tuned versions of this model.
155
 
156
  ## Training data
@@ -184,7 +175,7 @@ to implement the [preprocessing code](https://github.com/mcti-sefip/mcti-sefip-p
184
 
185
  Several Python packages were used to develop the preprocessing code:
186
 
187
- Table 2: Python packages used
188
  | Objective | Package |
189
  |--------------------------------------------------------|--------------|
190
  | Resolve contractions and slang usage in text | [contractions](https://pypi.org/project/contractions) |
@@ -199,9 +190,9 @@ Table 2: Python packages used
199
 
200
 
201
  As detailed in the notebook on [GitHub](https://github.com/mcti-sefip/mcti-sefip-ppfcd2020/blob/pre-processamento/Pre_Processamento/MCTI_PPF_Pr%C3%A9_processamento), in the pre-processing, code was created to build and evaluate 8 (eight) different
202
- bases, derived from the base of goal 4, with the application of the methods shown in Figure 2.
203
 
204
- Table 3: Preprocessing methods evaluated
205
  | id | Experiments |
206
  |--------|------------------------------------------------------------------------|
207
  | Base | Original Texts |
@@ -212,7 +203,7 @@ Table 3: Preprocessing methods evaluated
212
  | xp5 | xp4 + Stemming |
213
  | xp6 | xp4 + Lemmatization |
214
  | xp7 | xp4 + Stemming + Stopwords Removal |
215
- | xp8 | ap4 + Lemmatization + Stopwords Removal |
216
 
217
  First, the treatment of punctuation and capitalization was evaluated. This phase resulted in the construction and
218
  evaluation of the first four bases (xp1, xp2, xp3, xp4).
@@ -222,9 +213,9 @@ stemming + stopwords removal (xp7), and Lemmatization + stopwords removal (xp8).
222
 
223
  All eight bases were evaluated to classify the eligibility of the opportunity, through the training of a shallow
224
  neural network (SNN – Shallow Neural Network). The metrics for the eight bases were evaluated. The results are
225
- shown in Table 4.
226
 
227
- Table 4: Results obtained in Preprocessing
228
 | id | Experiment | accuracy | f1-score | recall | precision | mean time (s) | N_tokens | max_length |
229
  |--------|------------------------------------------------------------------------|----------|----------|--------|-----------|----------|----------|------------|
230
 | Base | Original Texts | 89.78% | 84.20% | 79.09% | 90.95% | 417.772 | 23788 | 5636 |
@@ -258,7 +249,7 @@ only 56% compatibility.
258
  The alternative was to use web scraping algorithms to acquire more unlabeled data from the same sources, thus ensuring
259
  compatibility. The original dataset had 260 labeled entries.
260
 
261
- Table 5: Compatibility results (*base = labeled MCTI dataset entries)
262
 | Dataset | Compatibility |
263
  |--------------------------------------|:----------------------:|
264
  | Labeled MCTI | 100% |
@@ -266,7 +257,7 @@ Table 5: Compatibility results (*base = labeled MCTI dataset entries)
266
  | BBC News Articles | 56.77% |
267
  | New unlabeled MCTI | 75.26% |
268
 
269
- Table 6: Results from Pre-trained WE + ML models
270
  | ML Model | Accuracy | F1 Score | Precision | Recall |
271
  |:--------:|:---------:|:---------:|:---------:|:---------:|
272
  | NN | 0.8269 | 0.8545 | 0.8392 | 0.8712 |
@@ -280,7 +271,7 @@ The table below presents the results of accuracy, f1-score, recall and precision
280
  In addition, the necessary times for training each epoch, the data validation execution time and the weight of the deep
281
  learning model associated with each implementation were added.
282
 
283
- Table 7: Results of experiments
284
  | Model | Accuracy | F1-score | Recall | Precision | Training time epoch(s) | Validation time (s) | Weight(MB) |
285
  |------------------------|----------|----------|--------|-----------|------------------------|---------------------|------------|
286
  | Keras Embedding + SNN | 92.47 | 88.46 | 79.66 | 100.00 | 0.2 | 0.7 | 1.8 |
@@ -309,9 +300,9 @@ In this context, was considered two approaches:
309
 
310
 The Word2Vec and Longformer models also need to be loaded, and the sizes of their weights are as follows:
311
 
312
- Table 1: Templates using Word2Vec and Longformer
313
- | Tamplates | weights |
314
- |+----------------------------+|:-------:|
315
  | Longformer | 10.9GB |
316
  | Word2Vec | 56.1MB |
317
 
 
23
 
24
  Disclaimer: The Brazilian Ministry of Science, Technology, and Innovation (MCTI) has partially supported this project.
25
 
26
+ The model [NLP MCTI Classification Multi](https://huggingface.co/spaces/unb-lamfo-nlp-mcti/NLP-W2V-CNN-Multi) is part of the project [Research Financing Product Portfolio (FPP)](https://huggingface.co/unb-lamfo-nlp-mcti) and focuses
27
+ on the Text Classification task, exploring different machine learning strategies to classify a small amount
28
  of long, unstructured, and uneven data to find a proper method with good performance. Pre-training and word embedding
29
+ solutions were used to learn word relationships from other datasets with considerable similarity and a larger scale.
30
  Then, using the acquired resources, based on the dataset available in the MCTI, transfer learning plus deep learning
31
  models were applied to improve the understanding of each sentence.
32
 
 
47
 little insights and valuable relationships to work with their best features. In this way, learning is
48
 facilitated and allows gradient descent to converge more quickly.
49
 
50
+ The first layer of the model is a pre-trained Word2Vec embedding layer, a feature-extraction method that can
51
+ replace one-hot encoding with a lower-dimensional representation. The pre-training of this embedding is explained further in this document.
52
 
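The dimensionality-reduction claim can be made concrete: looking up a row of the embedding matrix is mathematically identical to multiplying a one-hot vector by that matrix, without materializing the sparse vector. A minimal numpy sketch (the vocabulary size and embedding dimension below are illustrative, not the model's real values):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 1000, 50                 # illustrative sizes only
E = rng.normal(size=(vocab_size, dim))     # embedding matrix

token_ids = np.array([3, 17, 204])         # a short tokenized text

# One-hot route: a (3, vocab_size) matrix times E ...
one_hot = np.zeros((len(token_ids), vocab_size))
one_hot[np.arange(len(token_ids)), token_ids] = 1.0
dense_via_onehot = one_hot @ E

# ... is exactly an index lookup into E, at a fraction of the size and cost.
dense_via_lookup = E[token_ids]

assert np.allclose(dense_via_onehot, dense_via_lookup)
```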
53
+ After the embedding layer comes the CNN classification model. Its architecture is composed of a 50% dropout layer followed by two 1D convolution layers
54
 associated with a MaxPooling layer. After max pooling, a dense layer of size 128 is added, connected to
55
 a 50% dropout which finally connects to a flatten layer and the final classification dense layer. The dropout layers
56
 helped to avoid network overfitting by masking part of the data so that the network learned to create
 
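The layer stack described above (dropout, two 1D convolutions, max pooling, a dense layer of 128, dropout, flatten, final dense) can be sketched as a plain numpy forward pass. Kernel width, embedding size, sequence length, and class count below are illustrative assumptions, and the train-time-only dropout layers are omitted at inference:

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1d(x, k):
    # x: (steps, ch_in); k: (width, ch_in, ch_out); valid padding + ReLU
    w = k.shape[0]
    out = np.stack([np.tensordot(x[t:t + w], k, axes=([0, 1], [0, 1]))
                    for t in range(x.shape[0] - w + 1)])
    return np.maximum(out, 0.0)

def max_pool1d(x, size=2):
    # keep the max of each non-overlapping window of `size` steps
    steps = x.shape[0] // size
    return x[:steps * size].reshape(steps, size, -1).max(axis=1)

x = rng.normal(size=(100, 300))               # 100 tokens, 300-dim embeddings (assumed)
h = conv1d(x, rng.normal(size=(5, 300, 64)))  # first 1D convolution  -> (96, 64)
h = conv1d(h, rng.normal(size=(5, 64, 64)))   # second 1D convolution -> (92, 64)
h = max_pool1d(h)                             # MaxPooling            -> (46, 64)
h = np.maximum(h @ rng.normal(size=(64, 128)), 0.0)  # dense layer of size 128
flat = h.reshape(-1)                          # flatten               -> (46 * 128,)
logits = flat @ rng.normal(size=(flat.size, 2))      # final classification dense layer
```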
60
 
61
  ## Model variations
62
 
63
+ Table 1 below presents the results of several implementations with different architectures, highlighting the
64
  accuracy, f1-score, recall and precision results obtained in the training of each network.
65
 
66
  Table 1: Results of experiments
 
116
 
117
  ### How to use
118
 
119
+ This model is available on Hugging Face Spaces, where it can be applied to Excel files containing scraped opportunity data.
120
+ - [NLP MCTI Classification Multi](https://huggingface.co/spaces/unb-lamfo-nlp-mcti/NLP-W2V-CNN-Multi)
121
 
122
+ You can also find the training and evaluation notebooks in the GitHub repository:
123
+ - [PPF-MCTI Repository](https://github.com/chap0lin/PPF-MCTI)
124
 
125
 
126
  ### Limitations and bias
127
 
128
  This model is uncased: it does not make a difference between english and English.
129
 
130
+ This model depends on high-quality scraped data. Since the model understands a finite number of words, the input needs to
131
+ be mostly free of encoding errors and stray markup so that preprocessing can remove them and correctly identify the words.
132
+
133
  Even if the training data used for this model could be characterized as fairly neutral, this model can have biased
134
  predictions:
135
 
 
142
 and it is delicate to replicate in production. This point is pending further study to determine
143
 whether it is possible to use one of these models.
144
 
145
  This bias will also affect all fine-tuned versions of this model.
146
 
147
  ## Training data
 
175
 
176
  Several Python packages were used to develop the preprocessing code:
177
 
178
+ Table 3: Python packages used
179
  | Objective | Package |
180
  |--------------------------------------------------------|--------------|
181
  | Resolve contractions and slang usage in text | [contractions](https://pypi.org/project/contractions) |
 
190
 
191
 
192
  As detailed in the notebook on [GitHub](https://github.com/mcti-sefip/mcti-sefip-ppfcd2020/blob/pre-processamento/Pre_Processamento/MCTI_PPF_Pr%C3%A9_processamento), in the pre-processing, code was created to build and evaluate 8 (eight) different
193
+ bases, derived from the base of goal 4, with the application of the methods shown in Table 4.
194
 
195
+ Table 4: Preprocessing methods evaluated
196
  | id | Experiments |
197
  |--------|------------------------------------------------------------------------|
198
  | Base | Original Texts |
 
203
  | xp5 | xp4 + Stemming |
204
  | xp6 | xp4 + Lemmatization |
205
  | xp7 | xp4 + Stemming + Stopwords Removal |
206
+ | xp8 | xp4 + Lemmatization + Stopwords Removal |
207
 
208
  First, the treatment of punctuation and capitalization was evaluated. This phase resulted in the construction and
209
  evaluation of the first four bases (xp1, xp2, xp3, xp4).
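A simplified stand-in for these treatments, using only the Python standard library (the exact composition of each experimental base follows the table above; the stopword list here is a tiny illustrative set, not the real one):

```python
import string

STOPWORDS = {"the", "of", "and", "to", "in", "a", "is", "for"}  # tiny illustrative set

def lowercase(text):
    return text.lower()

def strip_punct(text):
    # remove all punctuation characters
    return text.translate(str.maketrans("", "", string.punctuation))

def drop_stopwords(text):
    return " ".join(w for w in text.split() if w not in STOPWORDS)

raw = "The Call for Proposals, published in 2020, is open to ALL researchers."
xp_lower = lowercase(raw)                 # capitalization treatment
xp_clean = strip_punct(xp_lower)          # + punctuation treatment
xp_nostop = drop_stopwords(xp_clean)      # + stopwords removal
```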
 
213
 
214
  All eight bases were evaluated to classify the eligibility of the opportunity, through the training of a shallow
215
  neural network (SNN – Shallow Neural Network). The metrics for the eight bases were evaluated. The results are
216
+ shown in Table 5.
217
 
218
+ Table 5: Results obtained in Preprocessing
219
 | id | Experiment | accuracy | f1-score | recall | precision | mean time (s) | N_tokens | max_length |
220
  |--------|------------------------------------------------------------------------|----------|----------|--------|-----------|----------|----------|------------|
221
 | Base | Original Texts | 89.78% | 84.20% | 79.09% | 90.95% | 417.772 | 23788 | 5636 |
 
249
  The alternative was to use web scraping algorithms to acquire more unlabeled data from the same sources, thus ensuring
250
  compatibility. The original dataset had 260 labeled entries.
251
 
252
+ Table 6: Compatibility results (*base = labeled MCTI dataset entries)
253
 | Dataset | Compatibility |
254
  |--------------------------------------|:----------------------:|
255
  | Labeled MCTI | 100% |
 
257
  | BBC News Articles | 56.77% |
258
  | New unlabeled MCTI | 75.26% |
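One way to read these figures (an assumed definition, since the card does not spell out the metric): compatibility as the fraction of the labeled-MCTI vocabulary that also occurs in a candidate corpus. A minimal sketch:

```python
# Illustrative sketch: "compatibility" here is assumed to mean the share of the
# base (labeled MCTI) vocabulary that also appears in a candidate corpus.
def vocabulary(texts):
    """Lowercased word set of a list of documents."""
    return {word for text in texts for word in text.lower().split()}

def compatibility(base_texts, candidate_texts):
    """Fraction of the base vocabulary covered by the candidate corpus."""
    base = vocabulary(base_texts)
    return len(base & vocabulary(candidate_texts)) / len(base)

# Toy corpora: 3 of the 5 base-vocabulary words appear in the candidate.
score = compatibility(["funding call open", "research grant"],
                      ["open research positions", "call for papers"])
```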
259
 
260
+ Table 7: Results from Pre-trained WE + ML models
261
  | ML Model | Accuracy | F1 Score | Precision | Recall |
262
  |:--------:|:---------:|:---------:|:---------:|:---------:|
263
  | NN | 0.8269 | 0.8545 | 0.8392 | 0.8712 |
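The accuracy, F1 score, precision, and recall columns reported throughout these tables follow the standard definitions; for a binary task they can be computed from confusion-matrix counts:

```python
def classification_metrics(tp, fp, fn, tn):
    """Standard binary-classification metrics from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return {"accuracy": accuracy, "f1": f1, "precision": precision, "recall": recall}

# e.g. a model that gets 8 of 10 positives and 8 of 10 negatives right:
m = classification_metrics(tp=8, fp=2, fn=2, tn=8)
```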
 
271
  In addition, the necessary times for training each epoch, the data validation execution time and the weight of the deep
272
  learning model associated with each implementation were added.
273
 
274
+ Table 8: Results of experiments
275
  | Model | Accuracy | F1-score | Recall | Precision | Training time epoch(s) | Validation time (s) | Weight(MB) |
276
  |------------------------|----------|----------|--------|-----------|------------------------|---------------------|------------|
277
  | Keras Embedding + SNN | 92.47 | 88.46 | 79.66 | 100.00 | 0.2 | 0.7 | 1.8 |
 
300
 
301
 The Word2Vec and Longformer models also need to be loaded, and the sizes of their weights are as follows:
302
 
303
+ Table 9: Word2Vec and Longformer model weights
304
+ | Model | Weights |
305
+ |------------------------------|---------|
306
  | Longformer | 10.9GB |
307
  | Word2Vec | 56.1MB |
308