unb-lamfo-nlp-mcti/NLP-Classification-MCTI

English

Classification

science


MarcosDib committed on Dec 13, 2022

Commit f7c57c4 · 1 Parent(s): e12ab8c

Update README.md


README.md +79 -46


@@ -19,7 +19,7 @@ thumbnail: https://github.com/Marcosdib/S2Query/Classification_Architecture_mode
 ![MCTIimg](https://antigo.mctic.gov.br/mctic/export/sites/institucional/institucional/entidadesVinculadas/conselhos/pag-old/RODAPE_MCTI.png)
-# MCTI Text Classification Task (uncased) DRAFT
 Disclaimer: The Brazilian Ministry of Science, Technology, and Innovation (MCTI) has partially supported this project.
@@ -38,24 +38,28 @@ Transformer-based approach, the Word2Vec-based approach improved the accuracy ra
 ## Model description
-Nullam congue hendrerit turpis et facilisis. Cras accumsan ante mi, eu hendrerit nulla finibus at. Donec imperdiet,
-nisi nec pulvinar suscipit, dolor nulla sagittis massa, et vehicula ante felis quis nibh. Lorem ipsum dolor sit amet,
-consectetur adipiscing elit. Maecenas viverra tempus risus non ornare. Donec in vehicula est. Pellentesque vulputate
-bibendum cursus. Nunc volutpat vitae neque ut bibendum:
-- Nullam congue hendrerit turpis et facilisis. Cras accumsan ante mi, eu hendrerit nulla finibus at. Donec imperdiet,
- nisi nec pulvinar suscipit, dolor nulla sagittis massa, et vehicula ante felis quis nibh. Lorem ipsum dolor sit amet,
- consectetur adipiscing elit.
-- Nullam congue hendrerit turpis et facilisis. Cras accumsan ante mi, eu hendrerit nulla finibus at. Donec imperdiet,
- nisi nec pulvinar suscipit, dolor nulla sagittis massa, et vehicula ante felis quis nibh. Lorem ipsum dolor sit amet,
- consectetur adipiscing elit.
-Nullam congue hendrerit turpis et facilisis. Cras accumsan ante mi, eu hendrerit nulla finibus at. Donec imperdiet,
-nisi nec pulvinar suscipit, dolor nulla sagittis massa, et vehicula ante felis quis nibh. Lorem ipsum dolor sit amet,
-consectetur adipiscing elit. Maecenas viverra tempus risus non ornare. Donec in vehicula est. Pellentesque vulputate
-bibendum cursus. Nunc volutpat vitae neque ut bibendum.
-![architeru](https://github.com/marcosdib/S2Query/Classification_Architecture_model.png)
 ## Model variations
@@ -74,30 +78,9 @@ Table 1: Templates using Word2Vec and Longformer
 | Longformer | 10.9GB |
 | Word2Vec | 56.1MB |
-| Keras Embedding + SNN | 92.47 | 88.46 | 79.66 | 100 | 0.2 | 0.7 | 1.8 |
-| Keras Embedding + DNN | 89.78 | 84.41 | 77.81 | 92.57 | 1 | 1.4 | 7.6 |
-| Keras Embedding + CNN | 93.01 | 89.91 | 85.18 | 95.69 | 0.4 | 1.1 | 3.2 |
-| Keras Embedding + LSTM| 93.01 | 88.94 | 83.32 | 95.54 | 1.4 | 2 | 1.8 |
-| Word2Vec + SNN | 89.25 | 83.82 | 74.15 | 97.10 | 1.4 | 1.2 | 9.6 |
-| Word2Vec + DNN | 90.32 | 86.52 | 85.18 | 88.70 | 2 | 6.8 | 7.8 |
-| Word2Vec + CNN | 92.47 | 88.42 | 80.85 | 98.72 | 1.9 | 3.4 | 4.7 |
-| Word2Vec + LSTM | 89.78 | 84.36 | 75.36 | 95.81 | 2.6 | 14.3 | 1.2 |
-| Longformer + SNN | 61.29 | 0 | 0 | 0 | 128 | 1.5 | 36.8 |
-| Longformer + DNN | 91.93 | 87.62 | 80.37 | 97.62 | 81 | 8.4 | 12.7 |
-| Longformer + CNN | 94.09 | 90.69 | 83.41 | 100 | 57 | 4.5 | 9.6 |
-| Longformer + LSTM | 61.29 | 0 | 0 | 0 | 135 | 8.6 | 2.6 |
 ## Intended uses
-You can use the raw model for either masked language modeling or next sentence prediction, but it's mostly intended to
-be fine-tuned on a downstream task. See the [model hub](https://www.google.com) to look for
-fine-tuned versions of a task that interests you.
-Note that this model is primarily aimed at being fine-tuned on tasks that use the whole sentence (potentially masked)
-to make decisions, such as sequence classification, token classification or question answering. For tasks such as text
-generation you should look at model like XXX.
 ### How to use
@@ -125,6 +108,15 @@ This model is uncased: it does not make a difference between english and English
 Even if the training data used for this model could be characterized as fairly neutral, this model can have biased
 predictions:
 -
 -
 This bias will also affect all fine-tuned versions of this model.
@@ -144,14 +136,6 @@ it was coupled to the classification model to train it with the labeled data in
 obtained with related metrics. With this implementation, new levels of accuracy were reached: 86% for the CNN architecture
 and 88% for the LSTM architecture.
-Table 6: Results from Pre-trained WE + ML models
-| ML Model | Accuracy | F1 Score | Precision | Recall |
-|:--------:|:---------:|:---------:|:---------:|:---------:|
-| NN | 0.8269 | 0.8545 | 0.8392 | 0.8712 |
-| DNN | 0.7115 | 0.7794 | 0.7255 | 0.8485 |
-| CNN | 0.8654 | 0.9083 | 0.8486 | 0.9773 |
-| LSTM | 0.8846 | 0.9139 | 0.9056 | 0.9318 |
 ### Preprocessing
 Pre-processing was used to standardize the texts for the English language, reduce the number of insignificant tokens and
@@ -250,9 +234,58 @@ Table 5: Compatibility results (*base = labeled MCTI dataset entries)
 | BBC News Articles | 56.77% |
 | New unlabeled MCTI | 75.26% |
-## Evaluation results
 ## Benchmarks

 ![MCTIimg](https://antigo.mctic.gov.br/mctic/export/sites/institucional/institucional/entidadesVinculadas/conselhos/pag-old/RODAPE_MCTI.png)
+# MCTI Text Classification Task (uncased)
 Disclaimer: The Brazilian Ministry of Science, Technology, and Innovation (MCTI) has partially supported this project.
 ## Model description
+After the embedding step, which is essentially data preprocessing, the project needs a model that analyzes
+the input text and classifies whether or not it is a valid research funding opportunity for Brazilians.
+For the project, the best option is chosen empirically by comparing the results of four distinct architectures:
+Neural Network (NN), Deep Neural Network (DNN), Long Short-Term Memory (LSTM), and Convolutional Neural Network (CNN).
+Figure 4 shows the structure of the models.
+A neural network (NN) here is a simple feedforward neural network with only a single hidden layer, usually called
+"shallow". Shallow NNs are often limited in the complexity of the problems they can be trained to solve well.
+Our CNN model uses a dropout layer feeding into a couple of Conv1D layers and then a MaxPooling layer. After that,
+we use a hidden layer composed of a dense layer of size 128, followed by another dropout layer, and finally
+the Flatten layer and the final dense classification layer.
+Figure 4: Classification models
+The CNN architecture used is composed of a 50% dropout layer followed by two 1D convolution
+layers and a MaxPooling layer. After max pooling, a dense layer of size 128 is added, connected
+to a 50% dropout layer, which finally connects to a flatten layer and the final dense classification layer. Dropout
+layers help to avoid overfitting by masking part of the data, so that the network learns to create
+redundancies in the analysis of the inputs.
+![CNN Classification Model](https://raw.githubusercontent.com/chap0lin/WEBIST2022/master/Assets/cnn_model.png)
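The CNN described above could be sketched in Keras roughly as follows. This is a minimal illustration, not the project's actual code: the function name `build_cnn_classifier`, the input shape, the filter count, the kernel size, the pooling size, the activations, and the optimizer are all assumptions. Only the layer order, the 50% dropout rates, and the dense layer of size 128 come from the description.

```python
from tensorflow.keras import layers, models


def build_cnn_classifier(seq_len=200, embed_dim=100, filters=64, kernel_size=3):
    """Hypothetical sketch of the described CNN for binary text classification.

    The input is assumed to be a sequence of pre-computed embedding vectors
    with shape (seq_len, embed_dim).
    """
    model = models.Sequential([
        layers.Input(shape=(seq_len, embed_dim)),
        layers.Dropout(0.5),                                     # 50% dropout on the input
        layers.Conv1D(filters, kernel_size, activation="relu"),  # first 1D convolution
        layers.Conv1D(filters, kernel_size, activation="relu"),  # second 1D convolution
        layers.MaxPooling1D(pool_size=2),                        # max pooling
        layers.Dense(128, activation="relu"),                    # hidden dense layer of size 128
        layers.Dropout(0.5),                                     # second 50% dropout
        layers.Flatten(),
        layers.Dense(1, activation="sigmoid"),                   # final classification layer
    ])
    # Optimizer and loss are assumptions suited to a binary target.
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model


model = build_cnn_classifier()
```

Calling `model.summary()` prints the layer stack, which makes it easy to compare against the figure.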
 ## Model variations
 | Longformer | 10.9GB |
 | Word2Vec | 56.1MB |
 ## Intended uses
 ### How to use
 Even if the training data used for this model could be characterized as fairly neutral, this model can have biased
 predictions:
+Performance limitation: Loading the Longformer model in memory requires 11 GB for the model alone,
+without considering the weight of the deep learning network. For training, this means we need a GPU with
+20+ GB of memory. Here this was resolved by using the high-RAM environment of Google Colab Pro and training
+on CPU, which explains the longer training time per epoch.
+Replicability limitation: Due to the simplicity of the Keras embedding model, we are using one-hot encoding,
+which is delicate to replicate in production. This detail is pending further study to define
+whether it is possible to use one of these models.
 -
 -
 This bias will also affect all fine-tuned versions of this model.
 obtained with related metrics. With this implementation, new levels of accuracy were reached: 86% for the CNN architecture
 and 88% for the LSTM architecture.
 ### Preprocessing
 Pre-processing was used to standardize the texts for the English language, reduce the number of insignificant tokens and
 | BBC News Articles | 56.77% |
 | New unlabeled MCTI | 75.26% |
+Table 6: Results from Pre-trained WE + ML models
+| ML Model | Accuracy | F1 Score | Precision | Recall |
+|:--------:|:---------:|:---------:|:---------:|:---------:|
+| NN | 0.8269 | 0.8545 | 0.8392 | 0.8712 |
+| DNN | 0.7115 | 0.7794 | 0.7255 | 0.8485 |
+| CNN | 0.8654 | 0.9083 | 0.8486 | 0.9773 |
+| LSTM | 0.8846 | 0.9139 | 0.9056 | 0.9318 |
+## Evaluation results
+The table below presents the accuracy, F1-score, recall, and precision obtained in the training of each network.
+It also reports the training time per epoch, the validation execution time, and the weight of the deep
+learning model associated with each implementation.
+Table 7: Results of experiments
+| Model | Accuracy | F1-score | Recall | Precision | Training time per epoch (s) | Validation time (s) | Weight (MB) |
+|------------------------|----------|----------|--------|-----------|-----------------------------|---------------------|-------------|
+| Keras Embedding + SNN | 92.47 | 88.46 | 79.66 | 100.00 | 0.2 | 0.7 | 1.8 |
+| Keras Embedding + DNN | 89.78 | 84.41 | 77.81 | 92.57 | 1.0 | 1.4 | 7.6 |
+| Keras Embedding + CNN | 93.01 | 89.91 | 85.18 | 95.69 | 0.4 | 1.1 | 3.2 |
+| Keras Embedding + LSTM | 93.01 | 88.94 | 83.32 | 95.54 | 1.4 | 2.0 | 1.8 |
+| Word2Vec + SNN | 89.25 | 83.82 | 74.15 | 97.10 | 1.4 | 1.2 | 9.6 |
+| Word2Vec + DNN | 90.32 | 86.52 | 85.18 | 88.70 | 2.0 | 6.8 | 7.8 |
+| Word2Vec + CNN | 92.47 | 88.42 | 80.85 | 98.72 | 1.9 | 3.4 | 4.7 |
+| Word2Vec + LSTM | 89.78 | 84.36 | 75.36 | 95.81 | 2.6 | 14.3 | 1.2 |
+| Longformer + SNN | 61.29 | 0 | 0 | 0 | 128.0 | 1.5 | 36.8 |
+| Longformer + DNN | 91.93 | 87.62 | 80.37 | 97.62 | 81.0 | 8.4 | 12.7 |
+| Longformer + CNN | 94.09 | 90.69 | 83.41 | 100.00 | 57.0 | 4.5 | 9.6 |
+| Longformer + LSTM | 61.29 | 0 | 0 | 0 | 135.0 | 8.6 | 2.6 |
+The results surpassed those achieved in goal 6 and goal 9, with a best accuracy of 94.09%
+for the Longformer + CNN model. We can also observe that the models that achieved the best results were those
+that used the CNN network for deep learning.
+In addition, the Longformer + SNN and Longformer + LSTM models were not able
+to learn. These models may need some adjustments, but each training attempt took between 5 and 8 hours, which
+made further tuning impractical when other models were already showing promising results.
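An accuracy of 61.29% alongside zero F1-score, recall, and precision is what a classifier produces when it assigns every sample to the negative class. The sketch below illustrates this in plain Python. The split of 114 negative and 72 positive samples is an assumption chosen only because it reproduces the 61.29% figure; the actual composition of the validation set is not stated here, and `binary_metrics` is a hypothetical helper.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for binary labels (0 or 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    # Report 0.0 when a denominator is zero (no positive predictions or labels).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Assumed split (illustrative only): 114 negatives and 72 positives.
y_true = [0] * 114 + [1] * 72
# A model that "did not learn" and always predicts the negative class.
y_pred = [0] * len(y_true)

metrics = binary_metrics(y_true, y_pred)
print({name: round(value * 100, 2) for name, value in metrics.items()})
```

Under this assumption, accuracy alone hides the failure, which is why the F1-score, recall, and precision columns are needed to spot it.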
+Beyond the results obtained, it is also necessary to highlight two limitations found for the replication and
+training of the networks:
+The 10 GB model exceeds the GitHub limit and was not pushed to the repository, so to run the system we need
+to download the pre-trained network in the notebook and run the encoder-decoder with the data to create the model.
+It is advisable to do this in a GPU environment and save the file on the drive. After that, change the environment to
+CPU to perform the training. Generating the model on CPU takes more than 3 hours of processing.
+The best model without any of these limitations is Word2Vec + CNN. However, we need to study the limitations to
+understand whether it is possible to adopt a model with better accuracy and indicators. These adjustments
+will be worked on during goals 13 and 14, where the main objective will be to encapsulate the solution in the most
+suitable way for use in production.
 ## Benchmarks