MarcosDib committed on
Commit
0d2996b
1 Parent(s): 628a769

Update README.md

Files changed (1)
  1. README.md +13 -50
README.md CHANGED
@@ -90,12 +90,12 @@ The detailed release history can be found on the [here](https://huggingface.co/u
 | [`mcti-large-cased`] | 110M | Chinese |
 | [`-base-multilingual-cased`] | 110M | Multiple |
 
-| Dataset | Compatibility to base* |
-|------------------------------|------------------------|
-| Labeled MCTI | 100% |
-| Full MCTI | 100% |
-| BBC News Articles | 56.77% |
-| New unlabeled MCTI | 75.26% |
+| Dataset | Compatibility to base* |
+|--------------------------------------|------------------------|
+| Labeled MCTI | 100% |
+| Full MCTI | 100% |
+| BBC News Articles | 56.77% |
+| New unlabeled MCTI | 75.26% |
 
 
 ## Intended uses
@@ -121,18 +121,6 @@ You can use this model directly with a pipeline for masked language modeling:
  'score': 0.1073106899857521,
  'token': 4827,
  'token_str': 'fashion'},
-{'sequence': "[CLS] hello i'm a role model. [SEP]",
- 'score': 0.08774490654468536,
- 'token': 2535,
- 'token_str': 'role'},
-{'sequence': "[CLS] hello i'm a new model. [SEP]",
- 'score': 0.05338378623127937,
- 'token': 2047,
- 'token_str': 'new'},
-{'sequence': "[CLS] hello i'm a super model. [SEP]",
- 'score': 0.04667217284440994,
- 'token': 3565,
- 'token_str': 'super'},
 {'sequence': "[CLS] hello i'm a fine model. [SEP]",
  'score': 0.027095865458250046,
  'token': 2986,
@@ -175,18 +163,6 @@ predictions:
  'score': 0.09747550636529922,
  'token': 10533,
  'token_str': 'carpenter'},
-{'sequence': '[CLS] the man worked as a waiter. [SEP]',
- 'score': 0.0523831807076931,
- 'token': 15610,
- 'token_str': 'waiter'},
-{'sequence': '[CLS] the man worked as a barber. [SEP]',
- 'score': 0.04962705448269844,
- 'token': 13362,
- 'token_str': 'barber'},
-{'sequence': '[CLS] the man worked as a mechanic. [SEP]',
- 'score': 0.03788609802722931,
- 'token': 15893,
- 'token_str': 'mechanic'},
 {'sequence': '[CLS] the man worked as a salesman. [SEP]',
  'score': 0.037680890411138535,
  'token': 18968,
@@ -198,18 +174,6 @@ predictions:
  'score': 0.21981462836265564,
  'token': 6821,
  'token_str': 'nurse'},
-{'sequence': '[CLS] the woman worked as a waitress. [SEP]',
- 'score': 0.1597415804862976,
- 'token': 13877,
- 'token_str': 'waitress'},
-{'sequence': '[CLS] the woman worked as a maid. [SEP]',
- 'score': 0.1154729500412941,
- 'token': 10850,
- 'token_str': 'maid'},
-{'sequence': '[CLS] the woman worked as a prostitute. [SEP]',
- 'score': 0.037968918681144714,
- 'token': 19215,
- 'token_str': 'prostitute'},
 {'sequence': '[CLS] the woman worked as a cook. [SEP]',
  'score': 0.03042375110089779,
  'token': 5660,
@@ -233,14 +197,14 @@ Pre-processing was used to standardize the texts for the English language, reduc
 optimize the training of the models.
 
 The following assumptions were considered:
-The Data Entry base is obtained from the result of goal 4.
-Labeling (Goal 4) is considered true for accuracy measurement purposes;
-Preprocessing experiments compare accuracy in a shallow neural network (SNN);
-Pre-processing was investigated for the classification goal.
+- The Data Entry base is obtained from the result of goal 4.
+- Labeling (Goal 4) is considered true for accuracy measurement purposes;
+- Preprocessing experiments compare accuracy in a shallow neural network (SNN);
+- Pre-processing was investigated for the classification goal.
 
-From the Database obtained in Meta 4, stored in the project's [GitHub](github.com/mcti-sefip/mcti-sefip-ppfcd2020/blob/scraps- desenvolvimento/Rotulagem/db_PPF_validacao_para%20UNB_%20FINAL.xlsx), a Notebook was developed in [Google Colab](colab.research.google.com)
+From the Database obtained in Meta 4, stored in the project's [GitHub](github.com/mcti-sefip/mcti-sefip-ppfcd2020/blob/scraps-desenvolvimento/Rotulagem/db_PPF_validacao_para%20UNB_%20FINAL), a Notebook was developed in [Google Colab](colab.research.google.com)
 to implement the [pre-processing code](github.com/mcti-sefip/mcti-sefip-ppfcd2020/blob/pre-
-processamento/Pre_Processamento/MCTI_PPF_Pr%C3%A9_processamento.ipynb), which also can be found on the project's GitHub.
+processamento/Pre_Processamento/MCTI_PPF_Pr%C3%A9_processamento), which also can be found on the project's GitHub.
 
 Several Python packages were used to develop the preprocessing code:
 
@@ -257,8 +221,7 @@ Several Python packages were used to develop the preprocessing code:
 | Translation from multiple languages to English | [translators](https://pypi.org/project/translators) |
 
 
-As detailed in the notebook on [GitHub](https://github.com/mcti-sefip/mcti-sefip-ppfcd2020/blob/pre-
-processamento/Pre_Processamento/MCTI_PPF_Pr%C3%A9_processamento.ipynb), in the pre-processing, code was created to build and evaluate 8 (eight) different
+As detailed in the notebook on [GitHub](https://github.com/mcti-sefip/mcti-sefip-ppfcd2020/blob/pre-processamento/Pre_Processamento/MCTI_PPF_Pr%C3%A9_processamento), in the pre-processing, code was created to build and evaluate 8 (eight) different
 bases, derived from the base of goal 4, with the application of the methods shown in Figure 2.
 
 | Base | Textos originais |
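
The prediction lists edited in the diff above are output from the `transformers` fill-mask pipeline. A minimal sketch of how such predictions are produced; `bert-base-uncased` is an assumed stand-in checkpoint, since the diff does not show this model's actual Hub id:

```python
from transformers import pipeline

# Fill-mask pipeline; "bert-base-uncased" is a placeholder checkpoint --
# substitute the model's real Hugging Face Hub id.
unmasker = pipeline("fill-mask", model="bert-base-uncased")

# Each prediction is a dict with 'sequence', 'score', 'token', and
# 'token_str' keys, matching the lists shown in the README diff.
for pred in unmasker("Hello I'm a [MASK] model."):
    print(pred["token_str"], round(pred["score"], 4))
```

The pipeline returns the top candidates ranked by score, which is why trimming the README's example output to fewer entries (as this commit does) keeps only the highest- and lowest-scoring predictions shown.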