In this context, two approaches were considered:
- Pre-training word embeddings using similar datasets for text classification;
- Using transformers and attention mechanisms (Longformer) to create contextualized embeddings.
The models using Word2Vec and Longformer also need their pre-trained weights to be loaded; the weight files are as follows (a loading sketch follows Table 1):
Table 1: Word2Vec and Longformer model weights
| Model | Weights |
|------------------------------|:-------:|
| Longformer | 10.9 GB |
| Word2Vec | 56.1 MB |
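
A minimal loading sketch (the file and checkpoint names are hypothetical stand-ins for the weight files in Table 1; substitute the files shipped with this repository):

```python
from gensim.models import KeyedVectors
from transformers import LongformerModel, LongformerTokenizer

# Hypothetical names standing in for the weights listed in Table 1.
w2v = KeyedVectors.load("word2vec_mcti.kv")                         # ~56.1 MB
longformer = LongformerModel.from_pretrained("./longformer-mcti")   # ~10.9 GB
tokenizer = LongformerTokenizer.from_pretrained("./longformer-mcti")
```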

The results for every embedding and classifier combination are summarized below (Accuracy, F1 Score, Precision, and Recall in %):

| Model | Accuracy | F1 Score | Precision | Recall |   |   |   |
|:----------------------|:--------:|:--------:|:---------:|:------:|:---:|:---:|:---:|
| Keras Embedding + SNN | 92.47 | 88.46 | 79.66 | 100 | 0.2 | 0.7 | 1.8 |
| Keras Embedding + DNN  | 89.78 | 84.41 | 77.81 | 92.57 | 1   | 1.4  | 7.6  |
| Keras Embedding + CNN  | 93.01 | 89.91 | 85.18 | 95.69 | 0.4 | 1.1  | 3.2  |
| Keras Embedding + LSTM | 93.01 | 88.94 | 83.32 | 95.54 | 1.4 | 2    | 1.8  |
| Word2Vec + SNN         | 89.25 | 83.82 | 74.15 | 97.10 | 1.4 | 1.2  | 9.6  |
| Word2Vec + DNN         | 90.32 | 86.52 | 85.18 | 88.70 | 2   | 6.8  | 7.8  |
| Word2Vec + CNN         | 92.47 | 88.42 | 80.85 | 98.72 | 1.9 | 3.4  | 4.7  |
| Word2Vec + LSTM        | 89.78 | 84.36 | 75.36 | 95.81 | 2.6 | 14.3 | 1.2  |
| Longformer + SNN       | 61.29 | 0     | 0     | 0     | 128 | 1.5  | 36.8 |
| Longformer + DNN       | 91.93 | 87.62 | 80.37 | 97.62 | 81  | 8.4  | 12.7 |
| Longformer + CNN       | 94.09 | 90.69 | 83.41 | 100   | 57  | 4.5  | 9.6  |
| Longformer + LSTM      | 61.29 | 0     | 0     | 0     | 135 | 8.6  | 2.6  |
## Intended uses
You can use this model directly with a pipeline for masked language modeling:
```python
>>> from transformers import pipeline
>>> # NOTE: 'bert-base-uncased' is a placeholder checkpoint; substitute this
>>> # repository's checkpoint.
>>> unmasker = pipeline('fill-mask', model='bert-base-uncased')
>>> unmasker("Hello I'm a [MASK] model.")

[{'sequence': "[CLS] hello i'm a fashion model. [SEP]",
  'score': 0.1073106899857521,
  'token': 4827,
  'token_str': 'fashion'},
 {'sequence': "[CLS] hello i'm a fine model. [SEP]",
  'score': 0.027095865458250046,
  'token': 2986,
  'token_str': 'fine'}]
```
Here is how to use this model to get the features of a given text in PyTorch:
```python
from transformers import BertTokenizer, BertModel

# NOTE: 'bert-base-uncased' is a placeholder checkpoint; substitute this
# repository's checkpoint.
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained("bert-base-uncased")
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
```
and in TensorFlow:
```python
from transformers import BertTokenizer, TFBertModel

# NOTE: 'bert-base-uncased' is a placeholder checkpoint; substitute this
# repository's checkpoint.
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = TFBertModel.from_pretrained("bert-base-uncased")
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)
```
### Limitations and bias
Even if the training data used for this model could be characterized as fairly neutral, this model can have biased
predictions:
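
For instance, probing the fill-mask pipeline (again with the placeholder `bert-base-uncased` checkpoint) surfaces gendered associations:

```python
>>> unmasker = pipeline('fill-mask', model='bert-base-uncased')
>>> unmasker("The man worked as a [MASK].")

[{'sequence': '[CLS] the man worked as a carpenter. [SEP]',
  'score': 0.09747550636529922,
  'token': 10533,
  'token_str': 'carpenter'},
 {'sequence': '[CLS] the man worked as a salesman. [SEP]',
  'score': 0.037680890411138535,
  'token': 18968,
  'token_str': 'salesman'}]

>>> unmasker("The woman worked as a [MASK].")

[{'sequence': '[CLS] the woman worked as a nurse. [SEP]',
  'score': 0.21981462836265564,
  'token': 6821,
  'token_str': 'nurse'},
 {'sequence': '[CLS] the woman worked as a cook. [SEP]',
  'score': 0.03042375110089779,
  'token': 5660,
  'token_str': 'cook'}]
```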
This bias will also affect all fine-tuned versions of this model.
## Training data
## Training procedure
### Model training with Word2Vec embeddings

After the pre-trained word2vec embeddings had learned meanings relevant to the classification problem, they were
coupled to the classification models and the combined model was trained on the labeled data in a supervised way.
Table 6 shows the results with the related metrics. With this implementation, new levels of accuracy were reached:
86% for the CNN architecture and 88% for the LSTM architecture. A sketch of the coupling follows Table 6.
Table 6: Results from Pre-trained WE + ML models

| ML Model | Accuracy | F1 Score | Precision | Recall |
|:--------:|:--------:|:--------:|:---------:|:------:|
| NN       | 0.8269   | 0.8545   | 0.8392    | 0.8712 |
| DNN      | 0.7115   | 0.7794   | 0.7255    | 0.8485 |
| CNN      | 0.8654   | 0.9083   | 0.8486    | 0.9773 |
| LSTM     | 0.8846   | 0.9139   | 0.9056    | 0.9318 |
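
A minimal sketch of this coupling, assuming the word2vec vectors are loaded with gensim into a frozen Keras `Embedding` layer feeding an LSTM head (the file name and layer sizes are illustrative):

```python
import numpy as np
from gensim.models import KeyedVectors
from tensorflow import keras

# Hypothetical file name: the Word2Vec weights listed in Table 1.
w2v = KeyedVectors.load("word2vec_mcti.kv")

# Row i of the matrix holds the pre-trained vector of vocabulary word i.
embedding_matrix = np.array([w2v[word] for word in w2v.index_to_key])

model = keras.Sequential([
    # Frozen pre-trained embeddings: the transfer-learning coupling.
    keras.layers.Embedding(
        input_dim=embedding_matrix.shape[0],
        output_dim=embedding_matrix.shape[1],
        embeddings_initializer=keras.initializers.Constant(embedding_matrix),
        trainable=False,
    ),
    keras.layers.LSTM(64),
    keras.layers.Dense(1, activation="sigmoid"),  # eligible / not eligible
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```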
### Preprocessing
Pre-processing was used to standardize the texts for the English language, reduce the number of insignificant tokens, and improve the quality of the training bases.
Several Python packages were used to develop the preprocessing code (a brief usage sketch follows Table 2):
Table 2: Python packages used
| Objective | Package |
|--------------------------------------------------------|--------------|
| Resolve contractions and slang usage in text | [contractions](https://pypi.org/project/contractions) |
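
A sketch of how such packages combine in a preprocessing function (only the `contractions` step comes from the table above; the lowercasing and cleanup steps are illustrative assumptions):

```python
import re
import contractions

def preprocess(text: str) -> str:
    """Sketch of the standardization pipeline for English text."""
    text = contractions.fix(text)          # "don't" -> "do not"
    text = text.lower()                    # standardize casing
    text = re.sub(r"[^a-z\s]", " ", text)  # drop digits/punctuation (illustrative)
    return re.sub(r"\s+", " ", text).strip()

print(preprocess("We're looking for 3 PhD candidates!"))
# -> "we are looking for phd candidates"
```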
As detailed in the notebook on [GitHub](https://github.com/mcti-sefip/mcti-sefip-ppfcd2020/blob/pre-processamento/Pre_Processamento/MCTI_PPF_Pr%C3%A9_processamento), code was created during pre-processing to build and evaluate 8 (eight) different bases, derived from the base of goal 4, by applying the methods shown in Figure 2.
Table 3: Preprocessing methods evaluated
| id | Experiments |
|--------|------------------------------------------------------------------------|
| Base | Original Texts |
All eight bases were evaluated on classifying the eligibility of the opportunity, by training a shallow neural
network (SNN) on each. The resulting metrics are shown in Table 4; a sketch of such a network follows the table.
Table 4: Results obtained in Preprocessing

| id   | Experiment     | accuracy | f1-score | recall | precision | Mean (s) | N_tokens | max_length |
|------|----------------|----------|----------|--------|-----------|----------|----------|------------|
| Base | Original Texts | 89.78%   | 84.20%   | 79.09% | 90.95%    | 417.772  | 23788    | 5636       |
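
A minimal sketch of such a shallow classifier (the `TextVectorization` front end and layer sizes are illustrative assumptions; `max_tokens` mirrors the N_tokens column above):

```python
from tensorflow import keras

# Shallow neural network: a single small hidden layer over a multi-hot encoding.
vectorize = keras.layers.TextVectorization(max_tokens=23788, output_mode="multi_hot")
# vectorize.adapt(train_texts)  # fit the vocabulary on the preprocessed base first

snn = keras.Sequential([
    vectorize,
    keras.layers.Dense(16, activation="relu"),    # one hidden layer = "shallow"
    keras.layers.Dense(1, activation="sigmoid"),  # eligible / not eligible
])
snn.compile(optimizer="adam", loss="binary_crossentropy",
            metrics=["accuracy", keras.metrics.Precision(), keras.metrics.Recall()])
```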
### Pretraining

Since labeled data is scarce, the word embeddings were trained in an unsupervised manner using other datasets that
contain most of the words the model needs to learn. The idea was to introduce progressively better-trained word
embeddings into the model. For an additional dataset to improve word-embedding training, it must be compatible with
the dataset used to train the classifier. We searched Kaggle, a platform with over a thousand available NLP datasets,
and the closest match we found was the BBC News Articles dataset, which achieved only 56% compatibility.
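
A minimal sketch of this unsupervised training with gensim (the corpus and hyperparameters are illustrative):

```python
from gensim.models import Word2Vec

# `corpus` stands in for the tokenized unlabeled texts of the compatible datasets.
corpus = [["research", "funding", "opportunity"], ["phd", "scholarship", "deadline"]]

model = Word2Vec(sentences=corpus, vector_size=100, window=5, min_count=1, workers=4)
model.wv.save("word2vec_mcti.kv")  # hypothetical file name, reloaded by the classifier
```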
The alternative was to use web-scraping algorithms to acquire more unlabeled data from the same sources, thus ensuring
compatibility. The original dataset had 260 labeled entries. One way such a compatibility score can be computed is
sketched after Table 5.
Table 5: Compatibility results (*base = labeled MCTI dataset entries)

| Dataset            | Compatibility |
|--------------------|:-------------:|
| Labeled MCTI       | 100%          |
| Full MCTI          | 100%          |
| BBC News Articles  | 56.77%        |
| New unlabeled MCTI | 75.26%        |
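
A sketch of one way such a compatibility score could be computed, assuming it measures the share of the classifier dataset's vocabulary covered by the candidate dataset (this definition is an assumption, not stated in the card):

```python
def compatibility(classifier_texts, candidate_texts):
    """Share of the classifier dataset's vocabulary found in the candidate dataset."""
    classifier_vocab = {tok for text in classifier_texts for tok in text.split()}
    candidate_vocab = {tok for text in candidate_texts for tok in text.split()}
    return len(classifier_vocab & candidate_vocab) / len(classifier_vocab)

# Toy check: the candidate corpus covers 3 of the 4 classifier tokens -> 0.75
print(compatibility(["research funding phd deadline"],
                    ["phd research deadline news"]))
```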
## Evaluation results
## Benchmarks