Update README.md
README.md
CHANGED
@@ -90,12 +90,12 @@ The detailed release history can be found [here](https://huggingface.co/u

 | [`mcti-large-cased`] | 110M | Chinese |
 | [`-base-multilingual-cased`] | 110M | Multiple |

-
-
-
-
-
-

 ## Intended uses

@@ -229,48 +229,51 @@ headers).

 ### Preprocessing

-
-

 | Objective | Package |
 |--------------------------------------------------------|--------------|
 | Resolve contractions and slang usage in text | [contractions](https://pypi.org/project/contractions) |
 | Natural Language Processing | [nltk](https://pypi.org/project/nltk) |
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-the other cases, it's another random sentence in the corpus. Note that what is considered a sentence here is a
-consecutive span of text, usually longer than a single sentence. The only constraint is that the result with the two
-"sentences" has a combined length of less than 512 tokens.
-
-The details of the masking procedure for each sentence are the following:
-- 15% of the tokens are masked.
-- In 80% of the cases, the masked tokens are replaced by `[MASK]`.
-- In 10% of the cases, the masked tokens are replaced by a random token, different from the one they replace.
-- In the remaining 10% of cases, the masked tokens are left as is.
274 |
|
275 |
### Pretraining
|
276 |
|
 | [`mcti-large-cased`] | 110M | Chinese |
 | [`-base-multilingual-cased`] | 110M | Multiple |

+| Dataset | Compatibility to base* |
+|------------------------------|------------------------|
+| Labeled MCTI | 100% |
+| Full MCTI | 100% |
+| BBC News Articles | 56.77% |
+| New unlabeled MCTI | 75.26% |
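The README does not define "compatibility to base". Assuming it means the share of a dataset's token occurrences already covered by the base vocabulary, the percentages above could be computed with a sketch like this (the function name and definition are an assumption, not the project's actual metric):

```python
def vocab_coverage(dataset_tokens, base_vocab):
    """Percentage of token occurrences in `dataset_tokens` that are
    present in `base_vocab` (illustrative definition of compatibility)."""
    if not dataset_tokens:
        return 0.0
    covered = sum(1 for tok in dataset_tokens if tok in base_vocab)
    return 100.0 * covered / len(dataset_tokens)
```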

 ## Intended uses

 ### Preprocessing

+Pre-processing was used to standardize the texts for the English language, reduce the number of insignificant tokens, and
+optimize the training of the models.
+
+The following assumptions were considered:
+- The data entry base is obtained from the result of Goal 4.
+- Labeling (Goal 4) is considered true for accuracy-measurement purposes.
+- Preprocessing experiments compare accuracy in a shallow neural network (SNN).
+- Pre-processing was investigated for the classification goal.
+
+From the database obtained in Goal 4, stored on the project's [GitHub](https://github.com/mcti-sefip/mcti-sefip-ppfcd2020/blob/scraps-desenvolvimento/Rotulagem/db_PPF_validacao_para%20UNB_%20FINAL.xlsx), a notebook was developed in [Google Colab](https://colab.research.google.com)
+to implement the [pre-processing code](https://github.com/mcti-sefip/mcti-sefip-ppfcd2020/blob/pre-processamento/Pre_Processamento/MCTI_PPF_Pr%C3%A9_processamento.ipynb), which can also be found on the project's GitHub.
+
+Several Python packages were used to develop the preprocessing code:

 | Objective | Package |
 |--------------------------------------------------------|--------------|
 | Resolve contractions and slang usage in text | [contractions](https://pypi.org/project/contractions) |
 | Natural Language Processing | [nltk](https://pypi.org/project/nltk) |
+| Other data manipulations and calculations, together with the Python 3.10 standard modules io, json, math, re (regular expressions), shutil, time, unicodedata | [numpy](https://pypi.org/project/numpy) |
+| Data manipulation and analysis | [pandas](https://pypi.org/project/pandas) |
+| HTTP library | [requests](https://pypi.org/project/requests) |
+| Training model | [scikit-learn](https://pypi.org/project/scikit-learn) |
+| Machine learning | [tensorflow](https://pypi.org/project/tensorflow) |
+| Machine learning | [keras](https://keras.io) |
+| Translation from multiple languages to English | [translators](https://pypi.org/project/translators) |
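For illustration, the first objective in the table (resolving contractions) amounts to a lookup-and-replace over an expansion table. A dependency-free sketch, where the tiny `EXPANSIONS` map is a stand-in for the much larger table shipped with the `contractions` package:

```python
import re

# Tiny stand-in for the expansion table of the `contractions` package.
EXPANSIONS = {
    "don't": "do not",
    "can't": "cannot",
    "won't": "will not",
    "it's": "it is",
}

def expand_contractions(text: str) -> str:
    """Replace each known contraction with its expanded form."""
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(c) for c in EXPANSIONS) + r")\b",
        flags=re.IGNORECASE,
    )
    return pattern.sub(lambda m: EXPANSIONS[m.group(0).lower()], text)
```

A real run would simply call `contractions.fix(text)` and get slang handling for free; the sketch only shows the mechanism.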
+
+As detailed in the notebook on [GitHub](https://github.com/mcti-sefip/mcti-sefip-ppfcd2020/blob/pre-processamento/Pre_Processamento/MCTI_PPF_Pr%C3%A9_processamento.ipynb), the pre-processing code builds and evaluates eight different
+bases, derived from the base of Goal 4, by applying the methods shown in Figure 2.
+
+| Base | Original texts |
+|--------|--------------------------------------------------------------|
+| xp1 | Expand contractions |
+| xp2 | Expand contractions + Convert text to lowercase |
+| xp3 | Expand contractions + Remove punctuation |
+| xp4 | Expand contractions + Remove punctuation + Convert text to lowercase |
+| xp5 | xp4 + Stemming |
+| xp6 | xp4 + Lemmatization |
+| xp7 | xp4 + Stemming + Stop-word removal |
+| xp8 | xp4 + Lemmatization + Stop-word removal |
+
+Table 2 – Pre-processing methods evaluated
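A minimal sketch of how the first four bases in Table 2 could be derived with the standard library only. The `expand` helper is a crude stand-in for `contractions.fix`, and xp5–xp8 are omitted because stemming, lemmatization, and stop-word removal would come from the `nltk` package listed above:

```python
import string

def expand(text: str) -> str:
    # Crude stand-in for contractions.fix(); a real run uses the package.
    return text.replace("n't", " not")

def build_bases(text: str) -> dict:
    """Derive the xp1-xp4 text variants, following Table 2."""
    xp1 = expand(text)                            # expand contractions
    xp2 = xp1.lower()                             # + lowercase
    xp3 = xp1.translate(                          # + remove punctuation
        str.maketrans("", "", string.punctuation))
    xp4 = xp3.lower()                             # + punctuation removal + lowercase
    return {"xp1": xp1, "xp2": xp2, "xp3": xp3, "xp4": xp4}
```

Each variant feeds the same shallow neural network so that the accuracy differences can be attributed to the preprocessing alone, which is the comparison the README describes.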

 ### Pretraining
