Commit 67c3ab6
Parent(s): 2592313
Upload README.md

README.md ADDED
@@ -0,0 +1,186 @@
---
widget:
- text: "El dólar se dispara tras la reunión de la Fed"
---

# Spanish News Classification Headlines

SNCH: this model was developed by [M47Labs](https://www.m47labs.com/es/) for text classification of Spanish news headlines. The base model is [BETO](https://huggingface.co/dccuchile/bert-base-spanish-wwm-cased), fine-tuned on a dataset of 1,000 examples.

## Dataset Sample

Dataset size: 1000

Columns: idTask, task content 1, idTag, tag.

|idTask|task content 1|idTag|tag|
|------|------|------|------|
|3637d9ac-119c-4a8f-899c-339cf5b42ae0|Alcalá de Guadaíra celebra la IV Semana de la Diversidad Sexual con acciones de sensibilización|81b36360-6cbf-4ffa-b558-9ef95c136714|sociedad|
|d56bab52-0029-45dd-ad90-5c17d4ed4c88|El Archipiélago Chinijo Graciplus se impone en el Trofeo Centro Comercial Rubicón|ed198b6d-a5b9-4557-91ff-c0be51707dec|deportes|
|dec70bc5-4932-4fa2-aeac-31a52377be02|Un total de 39 personas padecen ELA actualmente en la provincia|81b36360-6cbf-4ffa-b558-9ef95c136714|sociedad|
|fb396ba9-fbf1-4495-84d9-5314eb731405|Eurocopa 2021 : Italia vence a Gales y pasa a octavos con su candidatura reforzada|ed198b6d-a5b9-4557-91ff-c0be51707dec|deportes|
|bc5a36ca-4e0a-422e-9167-766b41008c01|Resolución de 10 de junio de 2021, del Ayuntamiento de Tarazona de La Mancha (Albacete), referente a la convocatoria para proveer una plaza.|81b36360-6cbf-4ffa-b558-9ef95c136714|sociedad|
|a87f8703-ce34-47a5-9c1b-e992c7fe60f6|El primer ministro sueco pierde una moción de censura|209ae89e-55b4-41fd-aac0-5400feab479e|politica|
|d80bdaad-0ad5-43a0-850e-c473fd612526|El dólar se dispara tras la reunión de la Fed|11925830-148e-4890-a2bc-da9dc059dc17|economia|

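As a minimal sketch of working with this column layout, the snippet below reads a few of the sample rows and counts examples per tag. It assumes the dataset is available as comma-separated text with the four column names listed above; the CSV form and the `load_examples` helper are illustrative assumptions, not part of the released dataset.

```python
import csv
import io
from collections import Counter

# Three rows copied from the sample table, written as CSV.
# The CSV layout is an assumption; only the column names come from this README.
SAMPLE_CSV = """idTask,task content 1,idTag,tag
a87f8703-ce34-47a5-9c1b-e992c7fe60f6,El primer ministro sueco pierde una moción de censura,209ae89e-55b4-41fd-aac0-5400feab479e,politica
d80bdaad-0ad5-43a0-850e-c473fd612526,El dólar se dispara tras la reunión de la Fed,11925830-148e-4890-a2bc-da9dc059dc17,economia
dec70bc5-4932-4fa2-aeac-31a52377be02,Un total de 39 personas padecen ELA actualmente en la provincia,81b36360-6cbf-4ffa-b558-9ef95c136714,sociedad
"""


def load_examples(csv_text):
    """Return (headline, tag) pairs from CSV text that uses the README's column names."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [(row["task content 1"], row["tag"]) for row in reader]


examples = load_examples(SAMPLE_CSV)

# Count how many headlines carry each tag.
tag_counts = Counter(tag for _, tag in examples)
print(tag_counts)
```

Only the headline (`task content 1`) and `tag` columns are needed for classification; the two ID columns are carried through the file but ignored here.
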
## Labels

* ciencia_tecnologia
* clickbait
* cultura
* deportes
* economia
* educacion
* medio_ambiente
* opinion
* politica
* sociedad

## Example of Use

### Pipeline

```python
from transformers import AutoTokenizer, BertForSequenceClassification, TextClassificationPipeline

# Headline to classify (Spanish input, as expected by the model)
review_text = 'los vehiculos que esten esperando pasajaeros deberan estar apagados para reducir emisiones'
path = "M47Labs/spanish_news_classification_headlines"

# Load the fine-tuned tokenizer and classifier from the Hub
tokenizer = AutoTokenizer.from_pretrained(path)
model = BertForSequenceClassification.from_pretrained(path)

# Wrap both in a text-classification pipeline
nlp = TextClassificationPipeline(task="text-classification",
                                 model=model,
                                 tokenizer=tokenizer)

print(nlp(review_text))
```

```[{'label': 'medio_ambiente', 'score': 0.5648820996284485}]```

### PyTorch

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = 'M47Labs/spanish_news_classification_headlines'
MAX_LEN = 32

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

texto = "las emisiones estan bajando, debido a las medidas ambientales tomadas por el gobierno"

# Tokenize, padding or truncating the input to MAX_LEN tokens
encoded_review = tokenizer(
    texto,
    max_length=MAX_LEN,
    add_special_tokens=True,
    padding='max_length',  # replaces the deprecated pad_to_max_length=True
    truncation=True,
    return_attention_mask=True,
    return_tensors='pt',
)

input_ids = encoded_review['input_ids']
attention_mask = encoded_review['attention_mask']

# Inference only, so no gradients are needed
with torch.no_grad():
    output = model(input_ids=input_ids, attention_mask=attention_mask)

# The index of the highest logit is the predicted class
_, prediction = torch.max(output['logits'], dim=1)
print(f'Review text: {texto}')
# The printed value is a news category, despite the "Sentiment" wording
print(f'Sentiment : {model.config.id2label[prediction.item()]}')
```

```Review text: las emisiones estan bajando, debido a las medidas ambientales tomadas por el gobierno```

```Sentiment : medio_ambiente```

A more in-depth example of how to use the model can be found in this Colab notebook: https://colab.research.google.com/drive/1XsKea6oMyEckye2FePW_XN7Rf8v41Cw_?usp=sharing

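The two examples report the same prediction in different forms: the pipeline returns a `score`, while the PyTorch version takes the index of the largest logit. For a single-label classifier with several classes, `TextClassificationPipeline` applies a softmax to the logits by default, so the score is the softmax probability of the top class. The sketch below shows that relationship in plain Python. The logits are made-up values, and the label order (alphabetical, as listed in this README) is an assumption; the authoritative mapping is `model.config.id2label`.

```python
import math

# Assumed label order (alphabetical, as listed in this README).
# The authoritative mapping is model.config.id2label.
LABELS = [
    "ciencia_tecnologia", "clickbait", "cultura", "deportes", "economia",
    "educacion", "medio_ambiente", "opinion", "politica", "sociedad",
]


def softmax(logits):
    """Convert raw logits to probabilities, subtracting the max for numerical stability."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_label(logits, labels=LABELS):
    """Return the highest-probability label in the same shape the pipeline prints."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": labels[best], "score": probs[best]}


# Hypothetical logits for illustration only, not real model output.
example_logits = [0.1, -0.3, 0.2, 0.0, 0.4, -0.1, 2.0, 0.3, 0.5, 0.6]
result = top_label(example_logits)
print(result)
```

Because softmax preserves the ordering of the logits, the argmax over logits and the argmax over probabilities always pick the same label.
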
## Finetune Hyperparameters

* MAX_LEN = 32
* TRAIN_BATCH_SIZE = 8
* VALID_BATCH_SIZE = 4
* EPOCHS = 5
* LEARNING_RATE = 1e-05

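As a rough sketch, these values can be grouped in one configuration object and used to estimate the length of a run. The `FinetuneConfig` class and `total_training_steps` helper are hypothetical names introduced here, and the calculation assumes one optimizer step per batch (no gradient accumulation). The size of the training split is not stated in this README, so the figure passed in below is purely illustrative.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class FinetuneConfig:
    """Hyperparameters listed in this README, gathered in one place (hypothetical class)."""
    max_len: int = 32
    train_batch_size: int = 8
    valid_batch_size: int = 4
    epochs: int = 5
    learning_rate: float = 1e-05


def total_training_steps(n_train_examples, config):
    """Optimizer steps for a full run, assuming one step per batch."""
    steps_per_epoch = math.ceil(n_train_examples / config.train_batch_size)
    return steps_per_epoch * config.epochs


config = FinetuneConfig()

# 800 is a hypothetical training-split size, chosen only for illustration.
print(total_training_steps(800, config))  # 100 steps per epoch * 5 epochs = 500
```
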
## Train Results

|n_examples|epoch|loss|accuracy (%)|
|------|------|------|------|
|100|0|2.286327266693115|12.5|
|100|1|2.018876111507416|40.0|
|100|2|1.8016730904579163|43.75|
|100|3|1.6121837735176086|46.25|
|100|4|1.41565443277359|68.75|

|n_examples|epoch|loss|accuracy (%)|
|------|------|------|------|
|500|0|2.0770938420295715|24.5|
|500|1|1.6953029704093934|50.25|
|500|2|1.258900796175003|64.25|
|500|3|0.8342628020048142|78.25|
|500|4|0.5135736921429634|90.25|

|n_examples|epoch|loss|accuracy (%)|
|------|------|------|------|
|1000|0|1.916002897115854|36.1997226074896|
|1000|1|1.2941598492664295|62.2746185852982|
|1000|2|0.8201534710415117|76.97642163661581|
|1000|3|0.524806430051615|86.9625520110957|
|1000|4|0.30662027455784463|92.64909847434119|

## Validation Results

|n_examples|100|
|------|------|
|Accuracy Score|0.35|
|Precision (Macro)|0.35|
|Recall (Macro)|0.16|

|n_examples|500|
|------|------|
|Accuracy Score|0.62|
|Precision (Macro)|0.60|
|Recall (Macro)|0.47|

|n_examples|1000|
|------|------|
|Accuracy Score|0.68|
|Precision (Macro)|0.68|
|Recall (Macro)|0.64|
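
For readers unfamiliar with the macro-averaged metrics above, the following sketch shows how accuracy, macro precision, and macro recall are commonly computed. This README does not say which tool produced the reported numbers, so this is an illustration of the definitions rather than the authors' evaluation code. The `macro_metrics` helper and the toy labels are hypothetical, and a class with no predictions is assumed to contribute a precision of 0.

```python
from collections import Counter


def macro_metrics(y_true, y_pred):
    """Return (accuracy, macro precision, macro recall) for two label sequences."""
    labels = sorted(set(y_true) | set(y_pred))
    true_count = Counter(y_true)   # examples per actual class
    pred_count = Counter(y_pred)   # predictions per class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    accuracy = sum(correct.values()) / len(y_true)
    # Per-class scores; a class with an empty denominator scores 0 (an assumption).
    precisions = [correct[l] / pred_count[l] if pred_count[l] else 0.0 for l in labels]
    recalls = [correct[l] / true_count[l] if true_count[l] else 0.0 for l in labels]
    # Macro averaging: unweighted mean over classes.
    return accuracy, sum(precisions) / len(labels), sum(recalls) / len(labels)


# Toy labels for illustration only, not taken from the validation set.
y_true = ["deportes", "deportes", "economia", "politica"]
y_pred = ["deportes", "economia", "economia", "economia"]

accuracy, precision, recall = macro_metrics(y_true, y_pred)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```

Macro averaging weights every class equally, so a class the model rarely predicts correctly pulls the average down even when overall accuracy is higher. That is one possible reading of the gap between accuracy and macro recall in the smaller runs.
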