Update README.md
README.md
CHANGED
@@ -90,12 +90,12 @@ The detailed release history can be found [here](https://huggingface.co/u

 | [`mcti-large-cased`] | 110M | Chinese |
 | [`-base-multilingual-cased`] | 110M | Multiple |

-
-
-
-
-
-

 ## Intended uses

@@ -229,48 +229,51 @@ headers).

 ### Preprocessing

-
-

 | Objective | Package |
 |--------------------------------------------------------|--------------|
 | Resolve contractions and slang usage in text | [contractions](https://pypi.org/project/contractions) |
 | Natural Language Processing | [nltk](https://pypi.org/project/nltk) |
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-the other cases, it's another random sentence in the corpus. Note that what is considered a sentence here is a
-consecutive span of text, usually longer than a single sentence. The only constraint is that the result with the two
-"sentences" has a combined length of less than 512 tokens.
-
-The details of the masking procedure for each sentence are the following:
-- 15% of the tokens are masked.
-- In 80% of the cases, the masked tokens are replaced by `[MASK]`.
-- In 10% of the cases, the masked tokens are replaced by a random token, different from the one they replace.
-- In the remaining 10% of cases, the masked tokens are left as is.
274 |
|
275 |
### Pretraining
|
276 |
|
 | [`mcti-large-cased`] | 110M | Chinese |
 | [`-base-multilingual-cased`] | 110M | Multiple |

+| Dataset | Compatibility to base* |
+|------------------------------|------------------------|
+| Labeled MCTI | 100% |
+| Full MCTI | 100% |
+| BBC News Articles | 56.77% |
+| New unlabeled MCTI | 75.26% |
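The README does not define "compatibility to base". Assuming it means the share of a dataset's token occurrences already covered by the base vocabulary, the percentages above could be computed with a sketch like this (the function name and definition are an assumption, not the project's actual metric):

```python
def vocab_coverage(dataset_tokens, base_vocab):
    """Percentage of token occurrences in `dataset_tokens` that are
    present in `base_vocab` (illustrative definition of compatibility)."""
    if not dataset_tokens:
        return 0.0
    covered = sum(1 for tok in dataset_tokens if tok in base_vocab)
    return 100.0 * covered / len(dataset_tokens)
```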

 ## Intended uses

 ### Preprocessing

+Pre-processing was used to standardize the texts for the English language, reduce the number of insignificant tokens, and
+optimize the training of the models.
+
+The following assumptions were considered:
+- The data entry base is obtained from the result of Goal 4.
+- Labeling (Goal 4) is considered true for accuracy-measurement purposes.
+- Preprocessing experiments compare accuracy in a shallow neural network (SNN).
+- Pre-processing was investigated for the classification goal.
+
+From the database obtained in Goal 4, stored on the project's [GitHub](https://github.com/mcti-sefip/mcti-sefip-ppfcd2020/blob/scraps-desenvolvimento/Rotulagem/db_PPF_validacao_para%20UNB_%20FINAL.xlsx), a notebook was developed in [Google Colab](https://colab.research.google.com)
+to implement the [pre-processing code](https://github.com/mcti-sefip/mcti-sefip-ppfcd2020/blob/pre-processamento/Pre_Processamento/MCTI_PPF_Pr%C3%A9_processamento.ipynb), which can also be found on the project's GitHub.
+
+Several Python packages were used to develop the preprocessing code:

 | Objective | Package |
 |--------------------------------------------------------|--------------|
 | Resolve contractions and slang usage in text | [contractions](https://pypi.org/project/contractions) |
 | Natural Language Processing | [nltk](https://pypi.org/project/nltk) |
+| Other data manipulations and calculations, together with the Python 3.10 standard modules io, json, math, re (regular expressions), shutil, time, unicodedata | [numpy](https://pypi.org/project/numpy) |
+| Data manipulation and analysis | [pandas](https://pypi.org/project/pandas) |
+| HTTP library | [requests](https://pypi.org/project/requests) |
+| Training model | [scikit-learn](https://pypi.org/project/scikit-learn) |
+| Machine learning | [tensorflow](https://pypi.org/project/tensorflow) |
+| Machine learning | [keras](https://keras.io) |
+| Translation from multiple languages to English | [translators](https://pypi.org/project/translators) |
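For illustration, the first objective in the table (resolving contractions) amounts to a lookup-and-replace over an expansion table. A dependency-free sketch, where the tiny `EXPANSIONS` map is a stand-in for the much larger table shipped with the `contractions` package:

```python
import re

# Tiny stand-in for the expansion table of the `contractions` package.
EXPANSIONS = {
    "don't": "do not",
    "can't": "cannot",
    "won't": "will not",
    "it's": "it is",
}

def expand_contractions(text: str) -> str:
    """Replace each known contraction with its expanded form."""
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(c) for c in EXPANSIONS) + r")\b",
        flags=re.IGNORECASE,
    )
    return pattern.sub(lambda m: EXPANSIONS[m.group(0).lower()], text)
```

A real run would simply call `contractions.fix(text)` and get slang handling for free; the sketch only shows the mechanism.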
+
+As detailed in the notebook on [GitHub](https://github.com/mcti-sefip/mcti-sefip-ppfcd2020/blob/pre-processamento/Pre_Processamento/MCTI_PPF_Pr%C3%A9_processamento.ipynb), the pre-processing code builds and evaluates eight different
+bases, derived from the base of Goal 4, by applying the methods shown in Figure 2.
+
+| Base | Original texts |
+|--------|--------------------------------------------------------------|
+| xp1 | Expand contractions |
+| xp2 | Expand contractions + Convert text to lowercase |
+| xp3 | Expand contractions + Remove punctuation |
+| xp4 | Expand contractions + Remove punctuation + Convert text to lowercase |
+| xp5 | xp4 + Stemming |
+| xp6 | xp4 + Lemmatization |
+| xp7 | xp4 + Stemming + Stop-word removal |
+| xp8 | xp4 + Lemmatization + Stop-word removal |
+
+Table 2 – Pre-processing methods evaluated
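A minimal sketch of how the first four bases in Table 2 could be derived with the standard library only. The `expand` helper is a crude stand-in for `contractions.fix`, and xp5–xp8 are omitted because stemming, lemmatization, and stop-word removal would come from the `nltk` package listed above:

```python
import string

def expand(text: str) -> str:
    # Crude stand-in for contractions.fix(); a real run uses the package.
    return text.replace("n't", " not")

def build_bases(text: str) -> dict:
    """Derive the xp1-xp4 text variants, following Table 2."""
    xp1 = expand(text)                            # expand contractions
    xp2 = xp1.lower()                             # + lowercase
    xp3 = xp1.translate(                          # + remove punctuation
        str.maketrans("", "", string.punctuation))
    xp4 = xp3.lower()                             # + punctuation removal + lowercase
    return {"xp1": xp1, "xp2": xp2, "xp3": xp3, "xp4": xp4}
```

Each variant feeds the same shallow neural network so that the accuracy differences can be attributed to the preprocessing alone, which is the comparison the README describes.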

 ### Pretraining
