Update README.md
README.md CHANGED
```diff
@@ -90,12 +90,12 @@ The detailed release history can be found on the [here](https://huggingface.co/u
 | [`mcti-large-cased`] | 110M | Chinese |
 | [`-base-multilingual-cased`] | 110M | Multiple |
 
-| Dataset
-
-| Labeled MCTI
-| Full MCTI
-| BBC News Articles
-| New unlabeled MCTI
 
 
 ## Intended uses
```
```diff
@@ -121,18 +121,6 @@ You can use this model directly with a pipeline for masked language modeling:
   'score': 0.1073106899857521,
   'token': 4827,
   'token_str': 'fashion'},
- {'sequence': "[CLS] hello i'm a role model. [SEP]",
-  'score': 0.08774490654468536,
-  'token': 2535,
-  'token_str': 'role'},
- {'sequence': "[CLS] hello i'm a new model. [SEP]",
-  'score': 0.05338378623127937,
-  'token': 2047,
-  'token_str': 'new'},
- {'sequence': "[CLS] hello i'm a super model. [SEP]",
-  'score': 0.04667217284440994,
-  'token': 3565,
-  'token_str': 'super'},
  {'sequence': "[CLS] hello i'm a fine model. [SEP]",
   'score': 0.027095865458250046,
   'token': 2986,
```
```diff
@@ -175,18 +163,6 @@ predictions:
   'score': 0.09747550636529922,
   'token': 10533,
   'token_str': 'carpenter'},
- {'sequence': '[CLS] the man worked as a waiter. [SEP]',
-  'score': 0.0523831807076931,
-  'token': 15610,
-  'token_str': 'waiter'},
- {'sequence': '[CLS] the man worked as a barber. [SEP]',
-  'score': 0.04962705448269844,
-  'token': 13362,
-  'token_str': 'barber'},
- {'sequence': '[CLS] the man worked as a mechanic. [SEP]',
-  'score': 0.03788609802722931,
-  'token': 15893,
-  'token_str': 'mechanic'},
  {'sequence': '[CLS] the man worked as a salesman. [SEP]',
   'score': 0.037680890411138535,
   'token': 18968,
```
```diff
@@ -198,18 +174,6 @@ predictions:
   'score': 0.21981462836265564,
   'token': 6821,
   'token_str': 'nurse'},
- {'sequence': '[CLS] the woman worked as a waitress. [SEP]',
-  'score': 0.1597415804862976,
-  'token': 13877,
-  'token_str': 'waitress'},
- {'sequence': '[CLS] the woman worked as a maid. [SEP]',
-  'score': 0.1154729500412941,
-  'token': 10850,
-  'token_str': 'maid'},
- {'sequence': '[CLS] the woman worked as a prostitute. [SEP]',
-  'score': 0.037968918681144714,
-  'token': 19215,
-  'token_str': 'prostitute'},
  {'sequence': '[CLS] the woman worked as a cook. [SEP]',
   'score': 0.03042375110089779,
   'token': 5660,
```
```diff
@@ -233,14 +197,14 @@ Pre-processing was used to standardize the texts for the English language, reduc
 optimize the training of the models.
 
 The following assumptions were considered:
-
-
-
-
 
-From the Database obtained in Meta 4, stored in the project's [GitHub](github.com/mcti-sefip/mcti-sefip-ppfcd2020/blob/scraps-
 to implement the [pre-processing code](github.com/mcti-sefip/mcti-sefip-ppfcd2020/blob/pre-
-processamento/Pre_Processamento/MCTI_PPF_Pr%C3%A9_processamento
 
 Several Python packages were used to develop the preprocessing code:
```
```diff
@@ -257,8 +221,7 @@ Several Python packages were used to develop the preprocessing code:
 | Translation from multiple languages to English | [translators](https://pypi.org/project/translators) |
 
 
-As detailed in the notebook on [GitHub](https://github.com/mcti-sefip/mcti-sefip-ppfcd2020/blob/pre-
-processamento/Pre_Processamento/MCTI_PPF_Pr%C3%A9_processamento.ipynb), in the pre-processing, code was created to build and evaluate 8 (eight) different
 bases, derived from the base of goal 4, with the application of the methods shown in Figure 2.
 
 | Base | Original texts |
```
```diff
 | [`mcti-large-cased`] | 110M | Chinese |
 | [`-base-multilingual-cased`] | 110M | Multiple |
 
+| Dataset            | Compatibility to base* |
+|--------------------|------------------------|
+| Labeled MCTI       | 100%                   |
+| Full MCTI          | 100%                   |
+| BBC News Articles  | 56.77%                 |
+| New unlabeled MCTI | 75.26%                 |
 
 
 ## Intended uses
```
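The excerpt does not expand the asterisk on "Compatibility to base*". Assuming the metric is the share of a corpus's tokens that appear in the base dataset's vocabulary (an assumption on this sketch's part, not stated in the source), it could be computed along these lines, with entirely made-up data:

```python
def vocab_coverage(corpus_tokens, base_vocab):
    """Percentage of corpus tokens found in the base vocabulary."""
    if not corpus_tokens:
        return 0.0
    hits = sum(1 for tok in corpus_tokens if tok in base_vocab)
    return 100.0 * hits / len(corpus_tokens)

# Made-up vocabulary and corpus, purely for illustration.
base_vocab = {"research", "funding", "innovation", "technology"}
corpus = ["research", "grant", "funding", "technology"]
print(f"{vocab_coverage(corpus, base_vocab):.2f}%")  # -> 75.00%
```

Whether the reported percentages were computed at the word, subword, or document level is not stated in this excerpt.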
```diff
   'score': 0.1073106899857521,
   'token': 4827,
   'token_str': 'fashion'},
  {'sequence': "[CLS] hello i'm a fine model. [SEP]",
   'score': 0.027095865458250046,
   'token': 2986,
```
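The hunks above trim the pipeline output down to its highest- and lowest-scoring predictions. Each prediction is a plain Python dict, so re-ranking or truncating such a list is a one-liner; a small sketch (scores copied from the output above, not recomputed by running the model):

```python
# Fill-mask pipelines return one dict per candidate token. The scores
# below are copied from the README output above, not produced here.
predictions = [
    {"score": 0.027095865458250046, "token": 2986, "token_str": "fine"},
    {"score": 0.1073106899857521, "token": 4827, "token_str": "fashion"},
]

def top_k(preds, k):
    """Return the k highest-scoring predictions, best first."""
    return sorted(preds, key=lambda p: p["score"], reverse=True)[:k]

print(top_k(predictions, 1)[0]["token_str"])  # -> fashion
```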
```diff
   'score': 0.09747550636529922,
   'token': 10533,
   'token_str': 'carpenter'},
  {'sequence': '[CLS] the man worked as a salesman. [SEP]',
   'score': 0.037680890411138535,
   'token': 18968,
```
```diff
   'score': 0.21981462836265564,
   'token': 6821,
   'token_str': 'nurse'},
  {'sequence': '[CLS] the woman worked as a cook. [SEP]',
   'score': 0.03042375110089779,
   'token': 5660,
```
```diff
 optimize the training of the models.
 
 The following assumptions were considered:
+- The Data Entry base is obtained from the result of goal 4;
+- Labeling (goal 4) is considered true for accuracy measurement purposes;
+- Preprocessing experiments compare accuracy in a shallow neural network (SNN);
+- Pre-processing was investigated for the classification goal.
 
+From the Database obtained in Meta 4, stored in the project's [GitHub](github.com/mcti-sefip/mcti-sefip-ppfcd2020/blob/scraps-desenvolvimento/Rotulagem/db_PPF_validacao_para%20UNB_%20FINAL), a Notebook was developed in [Google Colab](colab.research.google.com)
 to implement the [pre-processing code](github.com/mcti-sefip/mcti-sefip-ppfcd2020/blob/pre-
+processamento/Pre_Processamento/MCTI_PPF_Pr%C3%A9_processamento), which can also be found on the project's GitHub.
 
 Several Python packages were used to develop the preprocessing code:
```
```diff
 | Translation from multiple languages to English | [translators](https://pypi.org/project/translators) |
 
 
+As detailed in the notebook on [GitHub](https://github.com/mcti-sefip/mcti-sefip-ppfcd2020/blob/pre-processamento/Pre_Processamento/MCTI_PPF_Pr%C3%A9_processamento), code was created in the pre-processing step to build and evaluate 8 (eight) different
 bases, derived from the base of goal 4, with the application of the methods shown in Figure 2.
```
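The project's actual pre-processing code lives in the notebook linked above. As a rough, hypothetical illustration of the standardization steps the text describes (lowercasing, punctuation stripping, stopword removal — the stopword list below is a toy subset, not the project's):

```python
import re

# Toy stopword list for illustration only; the project's pipeline uses
# the packages listed in the table above (e.g. NLTK-style stopwords).
TOY_STOPWORDS = {"the", "a", "an", "of", "to", "and", "in"}

def standardize(text: str) -> str:
    """Lowercase, strip punctuation, and drop stopwords."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # replace punctuation with spaces
    tokens = [t for t in text.split() if t not in TOY_STOPWORDS]
    return " ".join(tokens)

print(standardize("The training of the Models, in English!"))
# -> training models english
```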
| Base | Original texts |