Update README.md
Disclaimer: The Brazilian Ministry of Science, Technology, and Innovation (MCTI) has partially supported this project.

The model [NLP MCTI Classification Multi](https://huggingface.co/spaces/unb-lamfo-nlp-mcti/NLP-W2V-CNN-Multi) is part of the project [Research Financing Product Portfolio (FPP)](https://huggingface.co/unb-lamfo-nlp-mcti) and focuses on the Text Classification task, exploring different machine learning strategies to classify a small amount of long, unstructured, and uneven data and find a method with good performance. Pre-training and word-embedding solutions were used to learn word relationships from other datasets with considerable similarity and a larger scale. Then, using the acquired resources, based on the dataset available in the MCTI, transfer learning plus deep learning models were applied to improve the understanding of each sentence.
little insight and valuable relationships to work with their best features. In this way, learning is facilitated and gradient descent converges more quickly.

The first layer of the model is a pre-trained Word2Vec embedding layer, used as a method of extracting features from the data that replaces one-hot encoding with dimensional reduction. The pre-training of this model is explained further in this document.

After the embedding layer comes the CNN classification model. The architecture of the CNN network is composed of a 50% dropout layer followed by two 1D convolution layers associated with a MaxPooling layer. After max pooling, a dense layer of size 128 is added, connected to a 50% dropout, which finally connects to a flatten layer and the final classification dense layer. The dropout layers help to avoid overfitting by masking part of the data, so that the network learns to create redundancies in the analysis of the inputs.
## Model variations

Table 1 below presents the results of several implementations with different architectures, highlighting the accuracy, f1-score, recall, and precision obtained in training each network.

Table 1: Results of experiments
### How to use

This model is available on Hugging Face Spaces, where it can be applied to Excel files containing scraped opportunity data:

- [NLP MCTI Classification Multi](https://huggingface.co/spaces/unb-lamfo-nlp-mcti/NLP-W2V-CNN-Multi)

You can also find the training and evaluation notebooks in the GitHub repository:

- [PPF-MCTI Repository](https://github.com/chap0lin/PPF-MCTI)
### Limitations and bias

This model is uncased: it does not make a difference between english and English.

This model depends on high-quality scraped data. Since the model understands a finite number of words, the input needs to have little to no encoding errors or stray markup, so that the preprocessing can remove them and correctly identify the words.

Even if the training data used for this model could be characterized as fairly neutral, this model can have biased predictions:
and it has a delicate problem for replication in production. This detail is pending further study to define whether it is possible to use one of these models.

This bias will also affect all fine-tuned versions of this model.

## Training data

Several Python packages were used to develop the preprocessing code:

Table 3: Python packages used

| Objective                                    | Package |
|----------------------------------------------|---------|
| Resolve contractions and slang usage in text | [contractions](https://pypi.org/project/contractions) |
As detailed in the notebook on [GitHub](https://github.com/mcti-sefip/mcti-sefip-ppfcd2020/blob/pre-processamento/Pre_Processamento/MCTI_PPF_Pr%C3%A9_processamento), in the pre-processing, code was created to build and evaluate 8 (eight) different bases, derived from the base of goal 4, with the application of the methods shown in Table 4.

Table 4: Preprocessing methods evaluated

| id   | Experiments    |
|------|----------------|
| Base | Original Texts |
| xp5  | xp4 + Stemming                          |
| xp6  | xp4 + Lemmatization                     |
| xp7  | xp4 + Stemming + Stopwords Removal      |
| xp8  | xp4 + Lemmatization + Stopwords Removal |

First, the treatment of punctuation and capitalization was evaluated. This phase resulted in the construction and evaluation of the first four bases (xp1, xp2, xp3, xp4).
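The experiments in Table 4 combine a handful of simple text transformations. A stdlib-only sketch of how such bases can be derived from an original text (the tiny stopword list and the crude suffix-stripping "stemmer" below are illustrative stand-ins for real NLP tooling, and the exact xp1-xp4 combinations are assumed for illustration):

```python
import re
import string

STOPWORDS = {"the", "a", "an", "of", "and", "to", "in"}  # tiny illustrative list

def lowercase(text):
    return text.lower()

def strip_punctuation(text):
    return text.translate(str.maketrans("", "", string.punctuation))

def remove_stopwords(text):
    return " ".join(w for w in text.split() if w not in STOPWORDS)

def stem(text):
    # Crude suffix stripping, standing in for a real stemmer.
    return " ".join(re.sub(r"(ing|ed|s)$", "", w) for w in text.split())

base = "The Funding of Research Projects."
xp1 = lowercase(base)                      # one capitalization treatment
xp4 = strip_punctuation(xp1)               # capitalization + punctuation treatment
xp7 = stem(remove_stopwords(xp4))          # xp4 + stemming + stopwords removal
```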
All eight bases were evaluated to classify the eligibility of the opportunity, through the training of a shallow neural network (SNN – Shallow Neural Network). The metrics for the eight bases were evaluated. The results are shown in Table 5.

Table 5: Results obtained in preprocessing

| id   | Experiment     | accuracy | f1-score | recall | precision | Mean (s) | N_tokens | max_length |
|------|----------------|----------|----------|--------|-----------|----------|----------|------------|
| Base | Original Texts | 89.78%   | 84.20%   | 79.09% | 90.95%    | 417.772  | 23788    | 5636       |
The alternative was to use web scraping algorithms to acquire more unlabeled data from the same sources, thus ensuring compatibility. The original dataset had 260 labeled entries.

Table 6: Compatibility results (*base = labeled MCTI dataset entries)

| Dataset            | Compatibility |
|--------------------|:-------------:|
| Labeled MCTI       |     100%      |
| BBC News Articles  |    56.77%     |
| New unlabeled MCTI |    75.26%     |
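The card does not spell out how the compatibility percentages were computed; one plausible metric (an assumption, for illustration only) is the share of a candidate corpus's vocabulary that also occurs in the labeled base:

```python
def vocabulary(texts):
    """Set of unique lowercase tokens across a corpus."""
    return {w for t in texts for w in t.lower().split()}

def compatibility(candidate_texts, base_texts):
    """Assumed metric: fraction of candidate vocabulary covered by the base vocabulary."""
    cand, base = vocabulary(candidate_texts), vocabulary(base_texts)
    return len(cand & base) / len(cand)

# Toy corpora standing in for the real datasets.
base = ["research funding opportunity", "grant call for proposals"]
candidate = ["funding call deadline", "research grant news"]
print(f"{compatibility(candidate, base):.2%}")  # → 66.67%
```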
Table 7: Results from pre-trained WE + ML models

| ML Model | Accuracy | F1 Score | Precision | Recall |
|:--------:|:--------:|:--------:|:---------:|:------:|
|    NN    |  0.8269  |  0.8545  |  0.8392   | 0.8712 |
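A pre-trained word embedding (WE) is typically handed to downstream ML models as an embedding matrix indexed by the tokenizer vocabulary. A minimal sketch with toy, hand-written vectors (the real Word2Vec vectors are learned from a large corpus):

```python
import numpy as np

# Toy stand-ins for pre-trained Word2Vec vectors.
word_vectors = {
    "research": np.array([0.1, 0.2]),
    "funding":  np.array([0.3, 0.4]),
}
embed_dim = 2

# Tokenizer vocabulary: index 0 reserved for padding; out-of-vocabulary
# words keep a zero vector.
vocab = {"<pad>": 0, "research": 1, "funding": 2, "blockchain": 3}

embedding_matrix = np.zeros((len(vocab), embed_dim))
for word, idx in vocab.items():
    if word in word_vectors:
        embedding_matrix[idx] = word_vectors[word]

# Row i holds the vector for the word with index i, so a sequence of
# token ids maps to feature vectors by simple indexing:
sequence = [1, 2, 0]                    # "research funding <pad>"
features = embedding_matrix[sequence]   # shape (3, embed_dim)
```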
In addition, the necessary times for training each epoch, the data validation execution time, and the weight of the deep learning model associated with each implementation were added.

Table 8: Results of experiments

| Model                 | Accuracy | F1-score | Recall | Precision | Training time epoch(s) | Validation time (s) | Weight(MB) |
|-----------------------|----------|----------|--------|-----------|------------------------|---------------------|------------|
| Keras Embedding + SNN | 92.47    | 88.46    | 79.66  | 100.00    | 0.2                    | 0.7                 | 1.8        |
The Word2Vec and Longformer models also need to be loaded, and their weights are as follows:

Table 9: Weights of the Word2Vec and Longformer models

| Model      | Weight |
|------------|--------|
| Longformer | 10.9GB |
| Word2Vec   | 56.1MB |