Update README.md
## Model variations

With the motivation to increase the accuracy obtained with the baseline implementation, a transfer learning
strategy was implemented, under the assumption that the small amount of data available for training was insufficient for adequate embedding training.
In this context, two approaches were considered:

i) pre-training word embeddings using similar datasets for text classification;
ii) using transformers and attention mechanisms (Longformer) to create contextualized embeddings.

Another 24 smaller models were released afterward.

The detailed release history can be found [here](https://huggingface.co/unb-lamfo-nlp-mcti).

The models that use Word2Vec and Longformer also need to be loaded, and their weights are the following:

Longformer: 10.88 GB

Word2Vec: 56.1 MB

Table 1:

| Model                        | #params | Language |
|------------------------------|:-------:|:--------:|
| [`mcti-base-uncased`]        | 110M    | English  |
| [`mcti-large-cased`]         | 110M    | Chinese  |
| [`-base-multilingual-cased`] | 110M    | Multiple |

Table 2: Compatibility results (*base = labeled MCTI dataset entries)

| Dataset      | Compatibility |
|--------------|:-------------:|
| Labeled MCTI | 100%          |
| Full MCTI    | 100%          |
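The difference between the two approaches can be sketched with toy vectors (a minimal illustration with made-up numbers, not the repository's Word2Vec or Longformer code): a static embedding assigns one fixed vector per word, while a contextualized embedding lets the vector depend on the surrounding words.

```python
# Toy illustration: static vs. contextualized embeddings.
# The vectors below are made up; real models learn them from data.
static_emb = {"bank": [0.2, 0.7], "river": [0.1, 0.9], "money": [0.8, 0.1]}

def embed_static(tokens):
    """Approach (i): one fixed vector per word, regardless of context."""
    return [static_emb[t] for t in tokens]

def embed_contextual(tokens):
    """Approach (ii), crudely: mix each word's vector with its context average."""
    out = []
    for t in tokens:
        ctx = [static_emb[n] for n in tokens if n != t] or [static_emb[t]]
        avg = [sum(v[d] for v in ctx) / len(ctx) for d in range(2)]
        out.append([(a + b) / 2 for a, b in zip(static_emb[t], avg)])
    return out

# "bank" gets identical static vectors in both sentences...
print(embed_static(["river", "bank"])[1] == embed_static(["money", "bank"])[1])   # True
# ...but different contextualized vectors, since the context differs.
print(embed_contextual(["river", "bank"])[1] == embed_contextual(["money", "bank"])[1])  # False
```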
### Limitations and bias

This model is uncased: it does not make a difference between english and English.

Even if the training data used for this model could be characterized as fairly neutral, this model can have biased
predictions:
## Training data

The [input training data](https://github.com/chap0lin/PPF-MCTI/tree/master/Datasets) was obtained with web-scraping techniques from over 30 different platforms, e.g. The Royal Society and
the Annenberg Foundation, and contained 928 labeled entries (928 rows x 21 columns). Of the data gathered, only
the main text content (column u) was used. Text content averages 800 tokens in length, but with high variance, reaching up to 5,000 tokens.
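Because texts reach up to 5,000 tokens, many exceed the 512-token window of standard BERT-style models, and some exceed even the 4,096-token window commonly used by Longformer checkpoints. A rough length check can be sketched as follows (whitespace splitting is a crude proxy for real subword tokenization, and the texts are made up):

```python
# Sketch: flag documents longer than a model's input window.
# Whitespace splitting approximates tokenization; example texts are made up.
texts = [
    "short funding call " * 100,           # ~300 whitespace tokens
    "very long opportunity text " * 1300,  # ~5200 whitespace tokens
]
MAX_LEN = 4096  # typical Longformer window (vs. 512 for BERT)
lengths = [len(t.split()) for t in texts]
print([n for n in lengths if n > MAX_LEN])  # only the second text overflows
```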
## Training procedure

Several Python packages were used to develop the preprocessing code:

Table 3: Python packages used

| Objective                                    | Package |
|----------------------------------------------|---------|
| Resolve contractions and slang usage in text | [contractions](https://pypi.org/project/contractions) |
As detailed in the notebook on [GitHub](https://github.com/mcti-sefip/mcti-sefip-ppfcd2020/blob/pre-processamento/Pre_Processamento/MCTI_PPF_Pr%C3%A9_processamento), in the pre-processing step, code was created to build and evaluate 8 (eight) different
bases, derived from the base of goal 4, with the application of the methods shown in Figure 2.

Table 4: Preprocessing methods evaluated

| id   | Experiments    |
|------|----------------|
| Base | Original Texts |
neural network (SNN – Shallow Neural Network). The metrics for the eight bases were evaluated. The results are
shown in Table 5.

Table 5: Results obtained in preprocessing

| id   | Experiment     | accuracy | f1-score | recall | precision | Mean (s) | N_tokens | max_length |
|------|----------------|----------|----------|--------|-----------|----------|----------|------------|
| Base | Original Texts | 89.78%   | 84.20%   | 79.09% | 90.95%    | 417.772  | 23788    | 5636       |
obtained results with related metrics. With this implementation, we achieved new levels of accuracy, with 86% for the CNN
architecture and 88% for the LSTM architecture.

Table 6: Results from pre-trained WE + ML models

| ML Model | Accuracy | F1 Score | Precision | Recall |
|:--------:|:--------:|:--------:|:---------:|:------:|
| NN       | 0.8269   | 0.8545   | 0.8392    | 0.8712 |
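The F1 scores in these tables are the harmonic mean of precision and recall, which can be checked against the NN row (small discrepancies in the last digit are expected, since the published precision and recall are themselves rounded):

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# NN row above: precision 0.8392, recall 0.8712 -> reported F1 0.8545
print(round(f1(0.8392, 0.8712), 4))
```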
computational power was needed to realize the fine-tuning of the weights. The results with related metrics can be viewed in Table 7.
This approach achieved adequate accuracy scores, above 82% in all implementation architectures.

Table 7: Results from pre-trained Longformer + ML models

| ML Model | Accuracy | F1 Score | Precision | Recall |
|:--------:|:--------:|:--------:|:---------:|:------:|
| NN       | 0.8269   | 0.8754   | 0.7950    | 0.9773 |
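All four columns derive from a single confusion matrix. A toy calculation with made-up counts (chosen only to mirror this row's high-recall, lower-precision pattern, not the MCTI evaluation data) shows how they relate:

```python
# Illustrative confusion-matrix counts (made up, not the MCTI data):
# many false positives, almost no false negatives -> high recall, lower precision.
tp, fp, fn, tn = 86, 22, 2, 90

accuracy  = (tp + tn) / (tp + fp + fn + tn)   # 0.88
precision = tp / (tp + fp)                    # ~0.796
recall    = tp / (tp + fn)                    # ~0.977
f1        = 2 * precision * recall / (precision + recall)
print(round(accuracy, 2), round(precision, 3), round(recall, 3), round(f1, 3))
```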
|