MarcosDib committed on
Commit
a7aa062
1 Parent(s): 85009f1

Update README.md

Files changed (1)
  1. README.md +46 -43
README.md CHANGED
@@ -90,12 +90,12 @@ The detailed release history can be found [here](https://huggingface.co/u
  | [`mcti-large-cased`] | 110M | Chinese |
  | [`-base-multilingual-cased`] | 110M | Multiple |

- | Dataset | Compatibility to base* |
- |----------------------------|------------------------|
- | Labeled MCTI | 100% |
- | Full MCTI | 100% |
- | BBC News Articles | 56.77% |
- | New unlabeled MCTI | 75.26% |


  ## Intended uses
@@ -229,48 +229,51 @@ headers).

  ### Preprocessing

- | Model | #params | Language |
- |------------------------------|--------------------|-------|

  | Objective | Package |
  |--------------------------------------------------------|--------------|
  | Resolve contractions and slang usage in text | [contractions](https://pypi.org/project/contractions) |
  | Natural Language Processing | [nltk](https://pypi.org/project/nltk) |
-
-
- | Objective | Package |
- |+------------------------------------------------------+|+------------+|
- | Resolve contractions and slang usage in text | contractions(https://pypi.org/project/contractions/) |
- | Natural Language Processing | nltk(https://pypi.org/project/nltk/) |
- | Others data manipulations and calculations included | numpy(https://pypi.org/project/numpy/) |
- | in Python 3.10: io, json, math, re (regular | |
- | expressions), shutil, time, unicodedata; | |
- | Data manipulation and analysis | pandas(https://pypi.org/project/pandas/) |
- | http library | requests(https://pypi.org/project/requests/) |
- | Training model | scikit-learn(https://pypi.org/project/scikit-learn/) |
- | Machine learning | tensorflow(https://pypi.org/project/tensorflow/) |
- | Machine learning | keras(https://keras.io/) |
- | Translation from multiple languages to English | translators(https://pypi.org/project/translators/) |
-
-
-
- The texts are lowercased and tokenized using WordPiece and a vocabulary size of 30,000. The inputs of the model are
- then of the form:
-
- ```
- [CLS] Sentence A [SEP] Sentence B [SEP]
- ```
-
- With probability 0.5, sentence A and sentence B correspond to two consecutive sentences in the original corpus, and in
- the other cases, it's another random sentence in the corpus. Note that what is considered a sentence here is a
- consecutive span of text usually longer than a single sentence. The only constrain is that the result with the two
- "sentences" has a combined length of less than 512 tokens.
-
- The details of the masking procedure for each sentence are the following:
- - 15% of the tokens are masked.
- - In 80% of the cases, the masked tokens are replaced by `[MASK]`.
- - In 10% of the cases, the masked tokens are replaced by a random token (different) from the one they replace.
- - In the 10% remaining cases, the masked tokens are left as is.

  ### Pretraining

  | [`mcti-large-cased`] | 110M | Chinese |
  | [`-base-multilingual-cased`] | 110M | Multiple |

+ | Dataset | Compatibility to base* |
+ |------------------------------|------------------------|
+ | Labeled MCTI | 100% |
+ | Full MCTI | 100% |
+ | BBC News Articles | 56.77% |
+ | New unlabeled MCTI | 75.26% |


  ## Intended uses
 

  ### Preprocessing

+ Pre-processing was used to standardize the texts in English, reduce the number of insignificant tokens, and
+ optimize the training of the models.
+
+ The following assumptions were considered:
+ - The data entry base is obtained from the result of Goal 4;
+ - Labeling (Goal 4) is considered true for accuracy measurement purposes;
+ - Preprocessing experiments compare accuracy in a shallow neural network (SNN);
+ - Pre-processing was investigated for the classification goal.
+
+ From the database obtained in Goal 4, stored in the project's [GitHub](https://github.com/mcti-sefip/mcti-sefip-ppfcd2020/blob/scraps-desenvolvimento/Rotulagem/db_PPF_validacao_para%20UNB_%20FINAL.xlsx), a notebook was developed in [Google Colab](https://colab.research.google.com)
+ to implement the [pre-processing code](https://github.com/mcti-sefip/mcti-sefip-ppfcd2020/blob/pre-processamento/Pre_Processamento/MCTI_PPF_Pr%C3%A9_processamento.ipynb), which can also be found on the project's GitHub.
+
+ Several Python packages were used to develop the preprocessing code:

  | Objective | Package |
  |--------------------------------------------------------|--------------|
  | Resolve contractions and slang usage in text | [contractions](https://pypi.org/project/contractions) |
  | Natural Language Processing | [nltk](https://pypi.org/project/nltk) |
+ | Other data manipulations and calculations included in Python 3.10: io, json, math, re (regular expressions), shutil, time, unicodedata | [numpy](https://pypi.org/project/numpy) |
+ | Data manipulation and analysis | [pandas](https://pypi.org/project/pandas) |
+ | HTTP library | [requests](https://pypi.org/project/requests) |
+ | Model training | [scikit-learn](https://pypi.org/project/scikit-learn) |
+ | Machine learning | [tensorflow](https://pypi.org/project/tensorflow) |
+ | Machine learning | [keras](https://keras.io) |
+ | Translation from multiple languages to English | [translators](https://pypi.org/project/translators) |
+
+
+ As detailed in the notebook on [GitHub](https://github.com/mcti-sefip/mcti-sefip-ppfcd2020/blob/pre-processamento/Pre_Processamento/MCTI_PPF_Pr%C3%A9_processamento.ipynb), the pre-processing code builds and evaluates 8 (eight) different
+ bases, derived from the base of Goal 4, by applying the methods shown in Figure 2.
+
+ | Base | Transformations applied to the original texts |
+ |--------|--------------------------------------------------------------|
+ | xp1 | Expand Contractions |
+ | xp2 | Expand Contractions + Lowercase Text |
+ | xp3 | Expand Contractions + Remove Punctuation |
+ | xp4 | Expand Contractions + Remove Punctuation + Lowercase Text |
+ | xp5 | xp4 + Stemming |
+ | xp6 | xp4 + Lemmatization |
+ | xp7 | xp4 + Stemming + StopWord Removal |
+ | xp8 | xp4 + Lemmatization + StopWord Removal |
+
+ Table 2: Pre-processing methods evaluated
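
The eight variants combine a handful of simple transformations. As an illustrative sketch only — the actual notebook uses the `contractions` and `nltk` packages listed above, while the tiny contraction map, stopword set, and suffix-stripping "stemmer" here are stand-ins so the snippet runs without external downloads — the xp7 pipeline could look like:

```python
import re
import string

# Stand-ins for the `contractions` and `nltk` resources used in the notebook
# (hypothetical, minimal versions so this sketch is self-contained).
CONTRACTIONS = {"don't": "do not", "it's": "it is", "can't": "cannot"}
STOPWORDS = {"the", "a", "an", "is", "of", "to", "do", "not", "it"}

def expand_contractions(text):
    # xp1: replace contractions with their expanded forms
    for short, full in CONTRACTIONS.items():
        text = re.sub(re.escape(short), full, text, flags=re.IGNORECASE)
    return text

def lowercase(text):
    # xp2 component: lowercase the whole text
    return text.lower()

def remove_punctuation(text):
    # xp3 component: strip all ASCII punctuation
    return text.translate(str.maketrans("", "", string.punctuation))

def stem(token):
    # Crude suffix stripper standing in for an nltk stemmer
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def remove_stopwords(tokens):
    return [t for t in tokens if t not in STOPWORDS]

def xp7(text):
    """xp4 (expand + remove punctuation + lowercase) + stemming + stopword removal."""
    text = remove_punctuation(lowercase(expand_contractions(text)))
    return remove_stopwords([stem(t) for t in text.split()])

print(xp7("It's testing the preprocessing"))  # → ['test', 'preprocess']
```

Swapping `stem` for a lemmatizer gives xp6/xp8, and dropping the stemming and stopword steps recovers xp1–xp4.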

  ### Pretraining