chap0lin committed
Commit 9400cf1 (1 parent: 69370ac)

Update README.md

Files changed (1)
  1. README.md +37 -37
README.md CHANGED
@@ -33,35 +33,35 @@ models were applied to improve the understanding of each sentence.
33
  ## According to the abstract,
34
 
35
  "The research results serve as a successful case of artificial intelligence in a federal government application".
36
- More details about the project, architecture model, training model and classifications process can be found in the article
37
  ["Using transfer learning to classify long unstructured texts with small amounts of labeled data"](https://www.scitepress.org/Link.aspx?doi=10.5220/0011527700003318).
38
 
39
  ## Model description
40
 
41
  The work consists of a machine learning model with word embedding and Convolutional Neural Network (CNN).
42
  For the project, a Convolutional Neural Network (CNN) was chosen, as it presents better accuracy in empirical
43
- comparison with 3 other different architectures: Neural Network (NN), Deep Neural Network (DNN) and Long-Term
44
  Memory (LSTM).
45
 
46
- As the input data is compose of unstructured and nonuniform texts it is essential normalize the data to study
47
- little insights and valuable relationships to work with the best features of them. In this way, learning is
48
  facilitated and allows the gradient descent to converge more quickly.
49
 
50
  The first layer of the model is a pre-trained Word2Vec embedding layer as a method of extracting features from the data that can
51
  replace one-hot coding with dimensional reduction. The pre-training of this model is explained further in this document.
52
 
53
  After the embedding layer, there is the CNN classification model. The architecture of the CNN network is composed of a 50% dropout layer followed by two 1D convolution layers
54
- associated with a MaxPooling layer. After maximum grouping, a dense layer of size 128 is added connected to
55
  a 50% dropout which finally connects to a flattened layer and the final sort dense layer. The dropout layers
56
- helped to avoid network overfitting by masking part of the data so that the network learned to create
57
- redundancies in the analysis of the inputs.
58
 
59
  ![CNN Classification Model](https://raw.githubusercontent.com/chap0lin/WEBIST2022/master/Assets/cnn_model.png)
60
 
61
  ## Model variations
62
 
63
  Table 1 below presents the results of several implementations with different architectures, highlighting the
64
- accuracy, f1-score, recall and precision results obtained in the training of each network.
65
 
66
  Table 1: Results of experiments
67
  | Model | Accuracy | F1-score | Recall | Precision |
@@ -79,10 +79,10 @@ Table 1: Results of experiments
79
  | Longformer + CNN | 94.09 | 90.69 | 83.41 | 100.00 |
80
  | Longformer + LSTM | 61.29 | 0 | 0 | 0 |
81
 
82
- Table 2 bellow shows the times required for training each epoch, the data validation execution time and the weight of the deep learning
83
  model associated with each implementation.
84
 
85
- Table 2: Results of Training time epoch, Validation time and Weight
86
  | Model | Training time epoch(s) | Validation time (s) | Weight(MB) |
87
  |------------------------|:-----------------------:|:-------------------:|:----------:|
88
  | Keras Embedding + SNN | 100.00 | 0.2 | 0.7 | 1.8 |
@@ -99,15 +99,15 @@ Table 2: Results of Training time epoch, Validation time and Weight
99
  | Longformer + LSTM | 0 | 13.0 | 8.6 | 2.6 |
100
 
101
  In addition, it is possible to notice that the model of Longformer + SNN and Longformer + LSTM were not able
102
- to learn. Perhaps the models need some adjustments, however, each training attempt takes between 5 and 8 hours,
103
- which made the attempt to adjust unfeasible in view of other models already showing promising results.
104
 
105
- With Longformer the problems caused by the size of the model became more visible. First, it was necessary to
106
  actively deallocate unused chunks of memory right after use so that the next steps could be loaded. Then, it
107
- was necessary to use a CPU environment for training the networks because the weight of the model exceeded the
108
  16GB of video memory available on the P100 board, available in Colab during training. In this case, the high
109
- RAM environment was used, which delivers 25GB of memory for use with the CPU, and this means a longer time
110
- required for training since the GPU performs matrix operations faster then a CPU. These models were trained
111
  5x with 100 training epochs each.
112
 
113
  ## Intended uses
@@ -116,10 +116,10 @@ required for training since the GPU performs matrix operations faster then a CPU
116
 
117
  ### How to use
118
 
119
- This model is available in huggingface spaces to be applied to excel files containing scrapped oportunity data.
120
  - [NLP MCTI Classification Multi](https://huggingface.co/spaces/unb-lamfo-nlp-mcti/NLP-W2V-CNN-Multi)
121
 
122
- You can also find the training and evaluation notebooks in the github repository:
123
  - [PPF-MCTI Repository](https://github.com/chap0lin/PPF-MCTI)
124
 
125
 
@@ -133,12 +133,12 @@ have little to no wrong encodings and abstract markdowns so that the preprocessi
133
  Even if the training data used for this model could be characterized as fairly neutral, this model can have biased
134
  predictions:
135
 
136
- Performance limiting: Loading the longformer model in memory means needing 11Gb available only for the model,
137
- without considering the weight of the deep learning network. For training this means we need a 20+ Gb GPU to
138
  perform the training. Here this was resolved using the high RAM environment of google Colab Pro and training
139
- using CPU which justifies the longer training time per season.
140
 
141
- Replicability limitation: Due to the simplicity of the keras embedding model, we are using one hot encoding,
142
  and it has a delicate problem for replication in production. This detail is pending further study to define
143
  whether it is possible to use one of these models.
144
 
@@ -146,7 +146,7 @@ This bias will also affect all fine-tuned versions of this model.
146
 
147
  ## Training data
148
 
149
- The [inputted training data](https://github.com/chap0lin/PPF-MCTI/tree/master/Datasets) was obtained from scrapping techniques, over 30 different platforms e.g. The Royal Society,
150
  Annenberg foundation, and contained 928 labeled entries (928 rows x 21 columns). Of the data gathered, was used only
151
  the main text content (column u). Text content averages 800 tokens in length, but with high variance, up to 5,000 tokens.
152
 
@@ -156,7 +156,7 @@ the main text content (column u). Text content averages 800 tokens in length, bu
156
 
157
  After the pre-trained model of word2vec embeddings had already learned meanings relevant to the classification problem,
158
  it was coupled to the classification model to train it with the labeled data in a supervised way. Table 6 shows the results
159
- obtained with related metrics. With this implementation, was reached new levels of accuracy with 86% for CNN architecture
160
  and 88% for the LSTM architecture.
161
 
162
  ### Preprocessing
@@ -180,7 +180,7 @@ Table 3: Python packages used
180
  |--------------------------------------------------------|--------------|
181
  | Resolve contractions and slang usage in text | [contractions](https://pypi.org/project/contractions) |
182
  | Natural Language Processing | [nltk](https://pypi.org/project/nltk) |
183
- | Others data manipulations and calculations included in Python 3.10: io, json, math, re (regular expressions), shutil, time, unicodedata; | [numpy](https://pypi.org/project/numpy) |
184
  | Data manipulation and analysis | [pandas](https://pypi.org/project/pandas) |
185
  | http library | [requests](https://pypi.org/project/requests) |
186
  | Training model | [scikit-learn](https://pypi.org/project/scikit-learn) |
@@ -211,7 +211,7 @@ evaluation of the first four bases (xp1, xp2, xp3, xp4).
211
  Then, the content simplification was evaluated, from the xp4 base, considering stemming (xp5), Lemmatization (xp6),
212
  stemming + stopwords removal (xp7), and Lemmatization + stopwords removal (xp8).
213
 
214
- All eight bases were evaluated to classify the eligibility of the opportunity, through the training of a shallow
215
  neural network (SNN – Shallow Neural Network). The metrics for the eight bases were evaluated. The results are
216
  shown in Table 5.
217
 
@@ -229,20 +229,20 @@ Table 5: Results obtained in Preprocessing
229
  | xp8 | ap4 + Lemmatization + Stopwords Removal | 92,47% | 88,46% | 79,66% | 100,00% | 225,580 | 16081 | 2726 |
230
 
231
  Even so, between these two excellent options, one can judge which one to choose. XP7: It has less training time,
232
- less number of unique tokens. XP8: It has smaller maximum sizes. In this case, the criterion used for the choice
233
  was the computational cost required to train the vector representation models (word-embedding, sentence-embeddings,
234
  document-embedding). The training time is so close that it did not have such a large weight for the analysis.
235
 
236
- As a last step, a spreadsheet was generated for the model (xp8) with the fields opo_pre and opo_pre_tkn, containing the
237
  preprocessed text in sentence format and tokens, respectively. This [database](https://github.com/mcti-sefip/mcti-sefip-ppfcd2020/blob/pre-processamento/Pre_Processamento/oportunidades_final_pre_processado.xlsx) was made
238
  available on the project's GitHub with the inclusion of columns opo_pre (text) and opo_pre_tkn (tokenized).
239
 
240
  ### Pretraining
241
 
242
- Since labeled data is scarce, word-embeddings was trained in an unsupervised manner using other datasets that
243
  contain most of the words it needs to learn. The idea implemented was based on introducing better and better-trained
244
  word embeddings in the model. For an additional dataset to be applied to improve word-embedding training, it must be
245
- compatible with the dataset used to train the classifier. Was searched for datasets from the Kaggle, a platform with
246
  over a thousand available NLP datasets, and the closest we found was the BBC News Articles dataset, which achieved
247
  only 56% compatibility.
248
 
@@ -267,8 +267,8 @@ Table 7: Results from Pre-trained WE + ML models
267
 
268
  ## Evaluation results
269
 
270
- The table below presents the results of accuracy, f1-score, recall and precision obtained in the training of each network.
271
- In addition, the necessary times for training each epoch, the data validation execution time and the weight of the deep
272
  learning model associated with each implementation were added.
273
 
274
  Table 8: Results of experiments
@@ -293,7 +293,7 @@ that used the CNN network for deep learning.
293
 
294
  With the motivation to increase accuracy obtained with baseline implementation, was implemented a transfer learning
295
  strategy under the assumption that small data available for training was insufficient for adequate embedding training.
296
- In this context, was considered two approaches:
297
 
298
  - Pre-training word embeddings using similar datasets for text classification;
299
  - Using transformers and attention mechanisms (Longformer) to create contextualized embeddings.
@@ -314,15 +314,15 @@ Above the results obtained, it is also necessary to highlight two limitations fo
314
  training of networks:
315
 
316
 
317
- These 10Gb of the model exceed the Github limit and did not go to the repository, so to run the system we need
318
  to download the pre-trained network in the notebook and run the encoder-decoder with the data to create the model.
319
- It is advisable to do this in a GPU environment and save the file on the drive. After that change the environment to
320
- CPU to perform the training. Trying to generate the model in CPU will take more than 3 hours of processing.
321
 
322
 
323
  The best model that does not have any limitations is Word2Vec + CNN. However, we need to study the limitations to
324
  understand whether it is possible to introduce a new model with better accuracy and indicators. These adjustments
325
- will be worked on during goals 13 and 14 where the main objective will be to encapsulate the solution in the most
326
  suitable way for use in production.
327
 
328
  ## Benchmarks
 
33
  ## According to the abstract,
34
 
35
  "The research results serve as a successful case of artificial intelligence in a federal government application".
36
+ More details about the project, model architecture, training, and classification process can be found in the article
37
  ["Using transfer learning to classify long unstructured texts with small amounts of labeled data"](https://www.scitepress.org/Link.aspx?doi=10.5220/0011527700003318).
38
 
39
  ## Model description
40
 
41
 The work consists of a machine learning model with word embeddings and a Convolutional Neural Network (CNN).
42
  For the project, a Convolutional Neural Network (CNN) was chosen, as it presents better accuracy in empirical
43
+ comparison with three other architectures: Neural Network (NN), Deep Neural Network (DNN), and Long Short-Term
44
  Memory (LSTM).
45
 
46
+ As the input data is composed of unstructured and nonuniform texts, it is essential to normalize the data to draw out
47
+ subtle insights and valuable relationships and to work with their best features. In this way, learning is
48
 facilitated, allowing gradient descent to converge more quickly.
49
 
50
  The first layer of the model is a pre-trained Word2Vec embedding layer as a method of extracting features from the data that can
51
 replace one-hot encoding with dimensionality reduction. The pre-training of this model is explained further in this document.
52
 
53
  After the embedding layer, there is the CNN classification model. The architecture of the CNN network is composed of a 50% dropout layer followed by two 1D convolution layers
54
+ associated with a MaxPooling layer. After max pooling, a dense layer of size 128 is added and connected to
55
 a 50% dropout, which finally connects to a flatten layer and the final classification dense layer. The dropout layers
56
+ helped avoid network overfitting by masking part of the data so that the network learned to create
57
+ redundancies in analyzing the inputs.
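
For readers who want to see the described stack end to end, the following is a minimal Keras sketch of that architecture. The dropout rate (50%) and the dense layer size (128) come from the text; the filter counts, kernel sizes, optimizer, and loss are illustrative assumptions, not the project's published settings.

```python
import tensorflow as tf
from tensorflow.keras import initializers, layers, models

def build_w2v_cnn(embedding_matrix, max_len, n_classes):
    """Sketch of the stack described above; hyperparameters not stated in the text are assumed."""
    vocab_size, embed_dim = embedding_matrix.shape
    model = models.Sequential([
        tf.keras.Input(shape=(max_len,)),
        # Embedding layer initialized with the pre-trained Word2Vec vectors.
        layers.Embedding(vocab_size, embed_dim,
                         embeddings_initializer=initializers.Constant(embedding_matrix),
                         trainable=False),
        layers.Dropout(0.5),
        layers.Conv1D(128, 5, activation="relu"),
        layers.Conv1D(128, 5, activation="relu"),
        layers.MaxPooling1D(pool_size=2),
        layers.Dense(128, activation="relu"),
        layers.Dropout(0.5),
        layers.Flatten(),
        layers.Dense(n_classes, activation="softmax"),  # final classification layer
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

Freezing the embedding layer keeps the Word2Vec vectors fixed during supervised training; whether the project fine-tunes them is not stated, so `trainable=False` is a choice made here only for illustration.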
58
 
59
  ![CNN Classification Model](https://raw.githubusercontent.com/chap0lin/WEBIST2022/master/Assets/cnn_model.png)
60
 
61
  ## Model variations
62
 
63
  Table 1 below presents the results of several implementations with different architectures, highlighting the
64
+ accuracy, f1-score, recall, and precision results obtained in the training of each network.
65
 
66
  Table 1: Results of experiments
67
  | Model | Accuracy | F1-score | Recall | Precision |
 
79
  | Longformer + CNN | 94.09 | 90.69 | 83.41 | 100.00 |
80
  | Longformer + LSTM | 61.29 | 0 | 0 | 0 |
81
 
82
+ Table 2 below shows the times required for training each epoch, the data validation execution time, and the weight of the deep learning
83
  model associated with each implementation.
84
 
85
+ Table 2: Results of Training time per epoch, Validation time, and Weight
86
  | Model | Training time epoch(s) | Validation time (s) | Weight(MB) |
87
  |------------------------|:-----------------------:|:-------------------:|:----------:|
88
  | Keras Embedding + SNN | 100.00 | 0.2 | 0.7 | 1.8 |
 
99
  | Longformer + LSTM | 0 | 13.0 | 8.6 | 2.6 |
100
 
101
 In addition, it is possible to notice that the Longformer + SNN and Longformer + LSTM models were not able
102
+ to learn. Perhaps the models need some adjustments; however, each training attempt takes between 5 and 8 hours,
103
+ which made further adjustment attempts unfeasible, given that other models were already showing promising results.
104
 
105
+ With Longformer, the problems caused by the model's size became more visible. First, it was necessary to
106
  actively deallocate unused chunks of memory right after use so that the next steps could be loaded. Then, it
107
+ was necessary to use a CPU environment for training the networks because the model's weight exceeded the
108
 16 GB of video memory on the P100 board available in Colab during training. In this case, the high
109
+ RAM environment was used, which delivers 25 GB of memory for use with the CPU, and this means a longer time is
110
+ required for training since the GPU performs matrix operations faster than the CPU. These models were trained
111
  5x with 100 training epochs each.
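
The memory-management pattern described above (deallocating intermediates right after use) can be sketched as follows; the checkpoint, the mean-pooling choice, and the variable handling are assumptions rather than the project's actual code.

```python
import gc
import torch
from transformers import LongformerModel, LongformerTokenizer

tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
model = LongformerModel.from_pretrained("allenai/longformer-base-4096")
model.eval()

def encode_and_release(texts):
    embeddings = []
    for text in texts:
        inputs = tokenizer(text, truncation=True, max_length=4096, return_tensors="pt")
        with torch.no_grad():
            pooled = model(**inputs).last_hidden_state.mean(dim=1)
        embeddings.append(pooled.squeeze(0).cpu())
        # Free the intermediates immediately so the next document fits in memory.
        del inputs, pooled
        gc.collect()
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
    return torch.stack(embeddings)
```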
112
 
113
  ## Intended uses
 
116
 
117
  ### How to use
118
 
119
+ This model is available in Hugging Face Spaces to be applied to Excel files containing scraped opportunity data (a usage sketch follows the links below).
120
  - [NLP MCTI Classification Multi](https://huggingface.co/spaces/unb-lamfo-nlp-mcti/NLP-W2V-CNN-Multi)
121
 
122
+ The training and evaluation notebooks can be found in the GitHub repository:
123
  - [PPF-MCTI Repository](https://github.com/chap0lin/PPF-MCTI)
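
A hypothetical local-usage sketch is shown below; the file names, text column, sequence length, and tokenization helper are placeholders, not the artifacts actually published with the Space or the repository.

```python
import pandas as pd
from tensorflow.keras.models import load_model
from tensorflow.keras.preprocessing.sequence import pad_sequences

MAX_LEN = 500                                       # padding length (assumed)
model = load_model("w2v_cnn_classifier.h5")         # trained Keras model (assumed file name)
word_index = {}                                     # vocabulary saved by the training notebooks (stub)

def texts_to_padded(texts, max_len=MAX_LEN):
    """Stand-in for the tokenization actually used in the notebooks."""
    seqs = [[word_index.get(w, 0) for w in str(t).lower().split()] for t in texts]
    return pad_sequences(seqs, maxlen=max_len)

df = pd.read_excel("scraped_opportunities.xlsx")    # spreadsheet of scraped opportunity data
X = texts_to_padded(df["text"])                     # text column name assumed
df["predicted_class"] = model.predict(X).argmax(axis=1)
df.to_excel("classified_opportunities.xlsx", index=False)
```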
124
 
125
 
 
133
  Even if the training data used for this model could be characterized as fairly neutral, this model can have biased
134
  predictions:
135
 
136
+ Performance limiting: Loading the Longformer model into memory requires 11 GB for the model alone,
137
+ without considering the weight of the deep learning network. For training, this means we need a 20+ GB GPU to
138
 perform the training. Here this was resolved using the high-RAM environment of Google Colab Pro and training
139
+ using the CPU, which explains the longer training time per epoch.
140
 
141
+ Replicability limitation: Due to the simplicity of the Keras embedding model, we are using one-hot encoding,
142
  and it has a delicate problem for replication in production. This detail is pending further study to define
143
  whether it is possible to use one of these models.
144
 
 
146
 
147
  ## Training data
148
 
149
+ The [input training data](https://github.com/chap0lin/PPF-MCTI/tree/master/Datasets) was obtained through scraping techniques over 30 different platforms, e.g., The Royal Society,
150
 Annenberg Foundation, and contained 928 labeled entries (928 rows x 21 columns). Of the data gathered, only
151
 the main text content (column u) was used. Text content averages 800 tokens in length, but with high variance, up to 5,000 tokens.
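
As a quick illustration of how the raw spreadsheet can be inspected (the file name below is assumed; the actual files live in the repository's Datasets folder):

```python
import pandas as pd

df = pd.read_excel("Datasets/opportunities.xlsx")   # assumed file name
texts = df.iloc[:, 20].astype(str)                  # main text content ("column u", the 21st column)
print(df.shape)                                     # expected (928, 21)
print(texts.str.split().str.len().describe())       # rough token-length distribution
```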
152
 
 
156
 
157
  After the pre-trained model of word2vec embeddings had already learned meanings relevant to the classification problem,
158
  it was coupled to the classification model to train it with the labeled data in a supervised way. Table 6 shows the results
159
+ obtained with related metrics. This implementation reached new levels of accuracy, with 86% for the CNN architecture
160
  and 88% for the LSTM architecture.
161
 
162
  ### Preprocessing
 
180
  |--------------------------------------------------------|--------------|
181
  | Resolve contractions and slang usage in text | [contractions](https://pypi.org/project/contractions) |
182
  | Natural Language Processing | [nltk](https://pypi.org/project/nltk) |
183
+ | Other data manipulations and calculations included in Python 3.10: io, json, math, re (regular expressions), shutil, time, unicodedata | [numpy](https://pypi.org/project/numpy) |
184
  | Data manipulation and analysis | [pandas](https://pypi.org/project/pandas) |
185
  | http library | [requests](https://pypi.org/project/requests) |
186
  | Training model | [scikit-learn](https://pypi.org/project/scikit-learn) |
 
211
  Then, the content simplification was evaluated, from the xp4 base, considering stemming (xp5), Lemmatization (xp6),
212
  stemming + stopwords removal (xp7), and Lemmatization + stopwords removal (xp8).
213
 
214
+ All eight bases were evaluated to classify the eligibility of the opportunity through the training of a shallow
215
  neural network (SNN – Shallow Neural Network). The metrics for the eight bases were evaluated. The results are
216
  shown in Table 5.
217
 
 
229
 | xp8 | xp4 + Lemmatization + Stopwords Removal | 92,47% | 88,46% | 79,66% | 100,00% | 225,580 | 16081 | 2726 |
230
 
231
 Even so, between these two excellent options, one must judge which one to choose. XP7 has a shorter training time
232
+ and fewer unique tokens; XP8 has smaller maximum sizes. In this case, the criterion used for the choice
233
  was the computational cost required to train the vector representation models (word-embedding, sentence-embeddings,
234
 document-embedding). The training times are so close that they did not carry much weight in the analysis.
235
 
236
+ As the last step, a spreadsheet was generated for the model (xp8) with the fields opo_pre and opo_pre_tkn, containing the
237
  preprocessed text in sentence format and tokens, respectively. This [database](https://github.com/mcti-sefip/mcti-sefip-ppfcd2020/blob/pre-processamento/Pre_Processamento/oportunidades_final_pre_processado.xlsx) was made
238
  available on the project's GitHub with the inclusion of columns opo_pre (text) and opo_pre_tkn (tokenized).
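
A minimal sketch of an xp8-style normalization step (lemmatization plus stopword removal) that produces the opo_pre and opo_pre_tkn columns is shown below; the input file and source-column names are assumptions, and the earlier cleaning steps from xp1 to xp4 are omitted for brevity.

```python
import nltk
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))

def xp8_normalize(text):
    """Lemmatization + stopword removal (the xp8 recipe)."""
    tokens = word_tokenize(str(text).lower())
    return [lemmatizer.lemmatize(t) for t in tokens if t.isalpha() and t not in stop_words]

df = pd.read_excel("opportunities.xlsx")               # input file name assumed
df["opo_pre_tkn"] = df["text"].apply(xp8_normalize)    # source column name assumed
df["opo_pre"] = df["opo_pre_tkn"].str.join(" ")
df.to_excel("oportunidades_final_pre_processado.xlsx", index=False)
```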
239
 
240
  ### Pretraining
241
 
242
+ Since labeled data is scarce, word-embeddings were trained in an unsupervised manner using other datasets that
243
 contain most of the words the model needs to learn. The idea implemented was based on introducing progressively better-trained
244
  word embeddings in the model. For an additional dataset to be applied to improve word-embedding training, it must be
245
+ compatible with the dataset used to train the classifier. We searched for datasets on Kaggle, a platform with
246
  over a thousand available NLP datasets, and the closest we found was the BBC News Articles dataset, which achieved
247
  only 56% compatibility.
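
A hedged sketch of this unsupervised pre-training stage with gensim is given below; the hyperparameters and the way "compatibility" is computed (read here as vocabulary overlap) are assumptions, not the project's published settings.

```python
import numpy as np
from gensim.models import Word2Vec

def pretrain_embeddings(extra_sentences, vector_size=100):
    """extra_sentences: list of token lists from an auxiliary corpus (e.g. BBC News).
    Hyperparameters are assumptions (gensim >= 4.0 API)."""
    return Word2Vec(sentences=extra_sentences, vector_size=vector_size,
                    window=5, min_count=2, workers=4, epochs=10)

def vocabulary_compatibility(classifier_vocab, w2v_model):
    """Share of the classifier's vocabulary covered by the pre-trained vectors:
    one plausible reading of the 56% compatibility figure quoted above."""
    known = sum(1 for word in classifier_vocab if word in w2v_model.wv)
    return known / max(len(classifier_vocab), 1)

def build_embedding_matrix(word_index, w2v_model):
    """Embedding matrix for the classifier; words missing from the pre-trained
    vectors keep zero rows."""
    matrix = np.zeros((len(word_index) + 1, w2v_model.vector_size))
    for word, idx in word_index.items():
        if word in w2v_model.wv:
            matrix[idx] = w2v_model.wv[word]
    return matrix
```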
248
 
 
267
 
268
  ## Evaluation results
269
 
270
+ The table below presents the results of accuracy, f1-score, recall, and precision obtained in the training of each network.
271
+ In addition, the necessary times for training each epoch, the data validation execution time, and the weight of the deep
272
  learning model associated with each implementation were added.
273
 
274
  Table 8: Results of experiments
 
293
 
294
 With the motivation to increase the accuracy obtained with the baseline implementation, a transfer learning
295
 strategy was implemented under the assumption that the small amount of data available for training was insufficient for adequate embedding training.
296
+ In this context, two approaches were considered:
297
 
298
  - Pre-training word embeddings using similar datasets for text classification;
299
  - Using transformers and attention mechanisms (Longformer) to create contextualized embeddings.
 
314
  training of networks:
315
 
316
 
317
+ These 10 GB of the model exceeded the GitHub limit and did not go to the repository, so to run the system, we need
318
  to download the pre-trained network in the notebook and run the encoder-decoder with the data to create the model.
319
+ It is advisable to do this in a GPU environment and save the file on the drive. After that, change the environment to
320
+ CPU to perform the training. Trying to generate the model using the CPU will take more than 3 hours of processing.
321
 
322
 
323
  The best model that does not have any limitations is Word2Vec + CNN. However, we need to study the limitations to
324
  understand whether it is possible to introduce a new model with better accuracy and indicators. These adjustments
325
+ will be worked on during goals 13 and 14, where the main objective will be to encapsulate the solution in the most
326
  suitable way for use in production.
327
 
328
  ## Benchmarks