Adding images, explanatory notebooks, and the data

#1
.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
36
+ data/imdb_reviews.csv filter=lfs diff=lfs merge=lfs -text
data/imdb_reviews.csv ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:6f1314f123ac922d7d0f2bd5bd17f1734e167d90b2256c34963228bc63f6a4cb
3
+ size 66262310
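The three added lines above are a Git LFS pointer rather than the CSV itself: a spec version, the SHA-256 digest of the real file, and its size in bytes. A minimal sketch of how such a pointer could be derived from a file's contents, assuming the standard three-line pointer layout; `lfs_pointer` is a hypothetical helper for illustration, not part of the Git LFS tooling.

```python
import hashlib


def lfs_pointer(content: bytes) -> str:
    """Build the text of a Git LFS pointer for the given file contents.

    Hypothetical helper: the real pointer is written by the git-lfs
    clean filter, not by user code.
    """
    oid = hashlib.sha256(content).hexdigest()  # digest of the real file
    return (
        "version https://git-lfs.github.com/spec/v1\n"
        f"oid sha256:{oid}\n"
        f"size {len(content)}\n"  # size of the real file in bytes
    )


# Illustrative input only; the actual dataset is about 66 MB.
pointer = lfs_pointer(b"review,sentiment\n")
print(pointer)
```

Only this small pointer is stored in the Git history; the CSV it describes is fetched from LFS storage on checkout.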
imagens/BERT_TDIDF.png ADDED
imagens/Simbolico_WordCloud_Wordnet.png ADDED
notebooks_explicativos/Estatistico.ipynb ADDED
@@ -0,0 +1,765 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "markdown",
5
+ "metadata": {
6
+ "id": "lawNHLqffR_m"
7
+ },
8
+ "source": [
9
+ "# SCC0633/SCC5908 - Natural Language Processing\n",
10
+ "> **Instructor:** Thiago Alexandre Salgueiro Pardo \\\n",
11
+ "> **PAE Teaching Assistant:** Germano Antonio Zani Jorge\n",
12
+ "\n",
13
+ "\n",
14
+ "# Group Members: GPTrouxas\n",
15
+ "> André Guarnier De Mitri - 11395579 \\\n",
16
+ "> Daniel Carvalho - 10685702 \\\n",
17
+ "> Fernando - 11795342 \\\n",
18
+ "> Lucas Henrique Sant'Anna - 10748521 \\\n",
19
+ "> Magaly L Fujimoto - 4890582"
20
+ ]
21
+ },
22
+ {
23
+ "cell_type": "markdown",
24
+ "metadata": {
25
+ "id": "pV6WGoBln8id"
26
+ },
27
+ "source": [
28
+ "# New Section"
29
+ ]
30
+ },
31
+ {
32
+ "cell_type": "markdown",
33
+ "metadata": {},
34
+ "source": [
35
+ "# Statistical Approach\n",
36
+ "The architecture of the statistical/neural solution involves two approaches,\n",
37
+ "both described in this document. The first approach uses\n",
38
+ "TF-IDF and Naive Bayes. The second approach uses Word2Vec and a\n",
39
+ "pre-trained transformer model from the BERT family, fine-tuned for this\n",
40
+ "task.\n",
41
+ "\n",
42
+ "In the first approach, we use TF-IDF, which weights each term by how often it\n",
43
+ "occurs in a review and how rare it is across the corpus, producing one vector per\n",
44
+ "review. These vectors are passed to Naive Bayes, which classifies each review as\n",
45
+ "positive or negative.\n",
46
+ "\n",
47
+ "\n",
48
+ "In the second approach, we use Word2Vec to vectorize the reviews.\n",
49
+ "After splitting the data into training and test sets, we fine-tune a BERT-type model\n",
50
+ "on our specific problem and dataset. With the adapted BERT, we\n",
51
+ "classify our texts and measure its performance with F1 score and\n",
52
+ "accuracy.\n",
53
+ "\n",
54
+ "![alt text](../imagens/BERT_TDIDF.png)"
55
+ ]
56
+ },
57
+ {
58
+ "cell_type": "markdown",
59
+ "metadata": {
60
+ "id": "vfP54aryxZBg"
61
+ },
62
+ "source": [
63
+ "\n",
64
+ "## Steps of the Statistical Approach\n",
65
+ "\n",
66
+ "1. **Libraries**: We import the required libraries: pandas for data manipulation, `train_test_split` to split the dataset into training and test sets, `TfidfVectorizer` for TF-IDF text vectorization, `MultinomialNB` for the Multinomial Naive Bayes classifier, and a few evaluation metrics.\n",
67
+ "\n",
68
+ "2. **Dataset**: We load the dataset and store it in a DataFrame using pandas.\n",
69
+ "\n",
70
+ "3. **Splitting the dataset**: We use `train_test_split` to split the DataFrame into training and test sets.\n",
71
+ "\n",
72
+ "4. **TF-IDF**: We use `TfidfVectorizer` to convert the review texts into numeric vectors with TF-IDF. The vectorizer is fitted on the training set only and then used to transform both the training and test sets.\n",
73
+ "\n",
74
+ "5. **Naive Bayes**: We train a Multinomial Naive Bayes classifier and use the trained model to predict the sentiments of the test set with `predict`.\n",
75
+ "\n",
76
+ "6. **Evaluation and Results**: We store the results in a new DataFrame, `results_df`, containing the test-set reviews, the original sentiments, and the sentiments predicted by the model. We also evaluate the model with several metrics and the confusion matrix.\n",
77
+ "\n"
78
+ ]
79
+ },
80
+ {
81
+ "cell_type": "markdown",
82
+ "metadata": {
83
+ "id": "TbLraa4UhWDJ"
84
+ },
85
+ "source": [
86
+ "\n",
87
+ "## Downloading and Loading the Data, and Preprocessing\n",
88
+ "\n",
89
+ "1. Convert all text to lowercase\n",
90
+ "2. Remove special characters\n",
91
+ "3. Remove stop words\n",
92
+ "4. Lemmatization\n",
93
+ "5. Tokenization"
94
+ ]
95
+ },
96
+ {
97
+ "cell_type": "code",
98
+ "execution_count": 1,
99
+ "metadata": {
100
+ "id": "bIWmIe0qfTbE"
101
+ },
102
+ "outputs": [],
103
+ "source": [
104
+ "import pandas as pd"
105
+ ]
106
+ },
107
+ {
108
+ "cell_type": "code",
109
+ "execution_count": 2,
110
+ "metadata": {
111
+ "colab": {
112
+ "base_uri": "https://localhost:8080/",
113
+ "height": 206
114
+ },
115
+ "id": "Wf0n2yPdAn4C",
116
+ "outputId": "37eb3c4d-40c1-41a0-9b1a-d93ed6e272f3"
117
+ },
118
+ "outputs": [
119
+ {
120
+ "data": {
121
+ "text/html": [
122
+ "<div>\n",
123
+ "<style scoped>\n",
124
+ " .dataframe tbody tr th:only-of-type {\n",
125
+ " vertical-align: middle;\n",
126
+ " }\n",
127
+ "\n",
128
+ " .dataframe tbody tr th {\n",
129
+ " vertical-align: top;\n",
130
+ " }\n",
131
+ "\n",
132
+ " .dataframe thead th {\n",
133
+ " text-align: right;\n",
134
+ " }\n",
135
+ "</style>\n",
136
+ "<table border=\"1\" class=\"dataframe\">\n",
137
+ " <thead>\n",
138
+ " <tr style=\"text-align: right;\">\n",
139
+ " <th></th>\n",
140
+ " <th>review</th>\n",
141
+ " <th>sentiment</th>\n",
142
+ " </tr>\n",
143
+ " </thead>\n",
144
+ " <tbody>\n",
145
+ " <tr>\n",
146
+ " <th>0</th>\n",
147
+ " <td>One of the other reviewers has mentioned that ...</td>\n",
148
+ " <td>positive</td>\n",
149
+ " </tr>\n",
150
+ " <tr>\n",
151
+ " <th>1</th>\n",
152
+ " <td>A wonderful little production. &lt;br /&gt;&lt;br /&gt;The...</td>\n",
153
+ " <td>positive</td>\n",
154
+ " </tr>\n",
155
+ " <tr>\n",
156
+ " <th>2</th>\n",
157
+ " <td>I thought this was a wonderful way to spend ti...</td>\n",
158
+ " <td>positive</td>\n",
159
+ " </tr>\n",
160
+ " <tr>\n",
161
+ " <th>3</th>\n",
162
+ " <td>Basically there's a family where a little boy ...</td>\n",
163
+ " <td>negative</td>\n",
164
+ " </tr>\n",
165
+ " <tr>\n",
166
+ " <th>4</th>\n",
167
+ " <td>Petter Mattei's \"Love in the Time of Money\" is...</td>\n",
168
+ " <td>positive</td>\n",
169
+ " </tr>\n",
170
+ " </tbody>\n",
171
+ "</table>\n",
172
+ "</div>"
173
+ ],
174
+ "text/plain": [
175
+ " review sentiment\n",
176
+ "0 One of the other reviewers has mentioned that ... positive\n",
177
+ "1 A wonderful little production. <br /><br />The... positive\n",
178
+ "2 I thought this was a wonderful way to spend ti... positive\n",
179
+ "3 Basically there's a family where a little boy ... negative\n",
180
+ "4 Petter Mattei's \"Love in the Time of Money\" is... positive"
181
+ ]
182
+ },
183
+ "execution_count": 2,
184
+ "metadata": {},
185
+ "output_type": "execute_result"
186
+ }
187
+ ],
188
+ "source": [
189
+ "db = pd.read_csv('../data/imdb_reviews.csv')\n",
190
+ "db.head(5)"
191
+ ]
192
+ },
193
+ {
194
+ "cell_type": "code",
195
+ "execution_count": 3,
196
+ "metadata": {
197
+ "colab": {
198
+ "base_uri": "https://localhost:8080/"
199
+ },
200
+ "id": "6PlfPScGMF1_",
201
+ "outputId": "2a0bd4a1-e22a-429d-82a4-5984eeab7b9d"
202
+ },
203
+ "outputs": [
204
+ {
205
+ "data": {
206
+ "text/plain": [
207
+ "sentiment\n",
208
+ "positive 25000\n",
209
+ "negative 25000\n",
210
+ "Name: count, dtype: int64"
211
+ ]
212
+ },
213
+ "execution_count": 3,
214
+ "metadata": {},
215
+ "output_type": "execute_result"
216
+ }
217
+ ],
218
+ "source": [
219
+ "db['sentiment'].value_counts()"
220
+ ]
221
+ },
222
+ {
223
+ "cell_type": "code",
224
+ "execution_count": 4,
225
+ "metadata": {
226
+ "colab": {
227
+ "base_uri": "https://localhost:8080/"
228
+ },
229
+ "id": "Kev0EaSmMa4N",
230
+ "outputId": "eab73a61-ba36-4d72-e4f2-82236f9f2880"
231
+ },
232
+ "outputs": [
233
+ {
234
+ "name": "stdout",
235
+ "output_type": "stream",
236
+ "text": [
237
+ "Number of missing values for each variable in the dataset:\n",
238
+ "review 0\n",
239
+ "sentiment 0\n",
240
+ "dtype: int64\n"
241
+ ]
242
+ }
243
+ ],
244
+ "source": [
245
+ "valores_ausentes = db.isnull().sum(axis=0)\n",
246
+ "print('Number of missing values for each variable in the dataset:')\n",
247
+ "print(valores_ausentes)"
248
+ ]
249
+ },
250
+ {
251
+ "cell_type": "code",
252
+ "execution_count": 5,
253
+ "metadata": {
254
+ "colab": {
255
+ "base_uri": "https://localhost:8080/",
256
+ "height": 276
257
+ },
258
+ "id": "1AI3rN0KMuUq",
259
+ "outputId": "7ea5c91b-362e-49eb-82a7-6e8535f0e591"
260
+ },
261
+ "outputs": [
262
+ {
263
+ "name": "stderr",
264
+ "output_type": "stream",
265
+ "text": [
266
+ "[nltk_data] Downloading package stopwords to\n",
267
+ "[nltk_data] C:\\Users\\andre\\AppData\\Roaming\\nltk_data...\n",
268
+ "[nltk_data] Package stopwords is already up-to-date!\n",
269
+ "[nltk_data] Downloading package wordnet to\n",
270
+ "[nltk_data] C:\\Users\\andre\\AppData\\Roaming\\nltk_data...\n",
271
+ "[nltk_data] Package wordnet is already up-to-date!\n"
272
+ ]
273
+ },
274
+ {
275
+ "data": {
276
+ "text/html": [
277
+ "<div>\n",
278
+ "<style scoped>\n",
279
+ " .dataframe tbody tr th:only-of-type {\n",
280
+ " vertical-align: middle;\n",
281
+ " }\n",
282
+ "\n",
283
+ " .dataframe tbody tr th {\n",
284
+ " vertical-align: top;\n",
285
+ " }\n",
286
+ "\n",
287
+ " .dataframe thead th {\n",
288
+ " text-align: right;\n",
289
+ " }\n",
290
+ "</style>\n",
291
+ "<table border=\"1\" class=\"dataframe\">\n",
292
+ " <thead>\n",
293
+ " <tr style=\"text-align: right;\">\n",
294
+ " <th></th>\n",
295
+ " <th>review</th>\n",
296
+ " <th>sentiment</th>\n",
297
+ " </tr>\n",
298
+ " </thead>\n",
299
+ " <tbody>\n",
300
+ " <tr>\n",
301
+ " <th>0</th>\n",
302
+ " <td>one reviewer mentioned watching 1 oz episode h...</td>\n",
303
+ " <td>positive</td>\n",
304
+ " </tr>\n",
305
+ " <tr>\n",
306
+ " <th>1</th>\n",
307
+ " <td>wonderful little production filming technique ...</td>\n",
308
+ " <td>positive</td>\n",
309
+ " </tr>\n",
310
+ " <tr>\n",
311
+ " <th>2</th>\n",
312
+ " <td>thought wonderful way spend time hot summer we...</td>\n",
313
+ " <td>positive</td>\n",
314
+ " </tr>\n",
315
+ " <tr>\n",
316
+ " <th>3</th>\n",
317
+ " <td>basically family little boy jake think zombie ...</td>\n",
318
+ " <td>negative</td>\n",
319
+ " </tr>\n",
320
+ " <tr>\n",
321
+ " <th>4</th>\n",
322
+ " <td>petter mattei love time money visually stunnin...</td>\n",
323
+ " <td>positive</td>\n",
324
+ " </tr>\n",
325
+ " </tbody>\n",
326
+ "</table>\n",
327
+ "</div>"
328
+ ],
329
+ "text/plain": [
330
+ " review sentiment\n",
331
+ "0 one reviewer mentioned watching 1 oz episode h... positive\n",
332
+ "1 wonderful little production filming technique ... positive\n",
333
+ "2 thought wonderful way spend time hot summer we... positive\n",
334
+ "3 basically family little boy jake think zombie ... negative\n",
335
+ "4 petter mattei love time money visually stunnin... positive"
336
+ ]
337
+ },
338
+ "execution_count": 5,
339
+ "metadata": {},
340
+ "output_type": "execute_result"
341
+ }
342
+ ],
343
+ "source": [
344
+ "import re\n",
345
+ "import nltk\n",
346
+ "from nltk.corpus import stopwords\n",
347
+ "from nltk.stem import PorterStemmer\n",
348
+ "from nltk.stem import WordNetLemmatizer\n",
349
+ "\n",
350
+ "def lowercase_text(text):\n",
351
+ "    return text.lower()\n",
352
+ "\n",
353
+ "def remove_html(text):\n",
354
+ "    return re.sub(r'<[^<]+?>', '', text)\n",
355
+ "\n",
356
+ "def remove_url(text):\n",
357
+ "    return re.sub(r'http[s]?://\\S+|www\\.\\S+', '', text)\n",
358
+ "\n",
359
+ "def remove_punctuations(text):\n",
360
+ "    tokens_list = '!\"#$%&\\'()*+,-./:;<=>?@[\\\\]^_`{|}~'\n",
361
+ "    for char in tokens_list:  # loop over the fixed symbol set, not the whole text\n",
362
+ "        if char in text:\n",
363
+ "            text = text.replace(char, ' ')\n",
364
+ "\n",
365
+ "    return text\n",
366
+ "\n",
367
+ "def remove_emojis(text):\n",
368
+ "    emojis = re.compile(\"[\"\n",
369
+ "        u\"\\U0001F600-\\U0001F64F\"\n",
370
+ "        u\"\\U0001F300-\\U0001F5FF\"\n",
371
+ "        u\"\\U0001F680-\\U0001F6FF\"\n",
372
+ "        u\"\\U0001F1E0-\\U0001F1FF\"\n",
373
+ "        u\"\\U00002500-\\U00002BEF\"\n",
374
+ "        u\"\\U00002702-\\U000027B0\"\n",
375
+ "        u\"\\U00002702-\\U000027B0\"\n",
376
+ "        u\"\\U000024C2-\\U0001F251\"\n",
377
+ "        u\"\\U0001f926-\\U0001f937\"\n",
378
+ "        u\"\\U00010000-\\U0010ffff\"\n",
379
+ "        u\"\\u2640-\\u2642\"\n",
380
+ "        u\"\\u2600-\\u2B55\"\n",
381
+ "        u\"\\u200d\"\n",
382
+ "        u\"\\u23cf\"\n",
383
+ "        u\"\\u23e9\"\n",
384
+ "        u\"\\u231a\"\n",
385
+ "        u\"\\ufe0f\"\n",
386
+ "        u\"\\u3030\"\n",
387
+ "        \"]+\", re.UNICODE)\n",
388
+ "\n",
389
+ "    text = re.sub(emojis, '', text)\n",
390
+ "    return text\n",
391
+ "\n",
392
+ "def remove_stop_words(text):\n",
393
+ "    stop_words = set(stopwords.words('english'))  # set gives O(1) lookups\n",
394
+ "    kept = []\n",
395
+ "    for word in text.split():\n",
396
+ "        if word not in stop_words:\n",
397
+ "            kept.append(word)\n",
398
+ "\n",
399
+ "    return ' '.join(kept)\n",
400
+ "\n",
401
+ "def lem_words(text):\n",
402
+ "    lemma = WordNetLemmatizer()\n",
403
+ "    lemmas = []\n",
404
+ "    for word in text.split():\n",
405
+ "        lemmas.append(lemma.lemmatize(word))\n",
406
+ "\n",
407
+ "    return ' '.join(lemmas)  # no trailing space\n",
408
+ "\n",
409
+ "def preprocess_text(text):\n",
410
+ "    text = lowercase_text(text)\n",
411
+ "    text = remove_html(text)\n",
412
+ "    text = remove_url(text)\n",
413
+ "    text = remove_punctuations(text)\n",
414
+ "    text = remove_emojis(text)\n",
415
+ "    text = remove_stop_words(text)\n",
416
+ "    text = lem_words(text)\n",
417
+ "\n",
418
+ "    return text\n",
419
+ "\n",
420
+ "nltk.download('stopwords')\n",
421
+ "nltk.download('wordnet')\n",
422
+ "db['review'] = db['review'].apply(preprocess_text)\n",
423
+ "db.head()"
424
+ ]
425
+ },
426
+ {
427
+ "cell_type": "markdown",
428
+ "metadata": {
429
+ "id": "QgufZpgHnPa4"
430
+ },
431
+ "source": [
432
+ "# **Training and Test Sets**"
433
+ ]
434
+ },
435
+ {
436
+ "cell_type": "code",
437
+ "execution_count": 6,
438
+ "metadata": {
439
+ "id": "s0lJ6Q0tnPka"
440
+ },
441
+ "outputs": [],
442
+ "source": [
443
+ "from sklearn.model_selection import train_test_split\n",
444
+ "\n",
445
+ "X= db['review']\n",
446
+ "y= db['sentiment']\n",
447
+ "\n",
448
+ "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= 0.2, random_state= 12)"
449
+ ]
450
+ },
451
+ {
452
+ "cell_type": "code",
453
+ "execution_count": 7,
454
+ "metadata": {
455
+ "colab": {
456
+ "base_uri": "https://localhost:8080/"
457
+ },
458
+ "id": "nz4erCEJuD4-",
459
+ "outputId": "88d57536-66e7-4d9b-e016-bf40183d4c45"
460
+ },
461
+ "outputs": [
462
+ {
463
+ "data": {
464
+ "text/plain": [
465
+ "35235 disagree people saying lousy horror film good ...\n",
466
+ "36936 husband wife doctor team carole nile nelson mo...\n",
467
+ "46486 like cast pretty much however story sort unfol...\n",
468
+ "27160 movie awful bad bear expend anything word avoi...\n",
469
+ "19490 purchased blood castle dvd ebay buck knowing s...\n",
470
+ " ... \n",
471
+ "36482 strange thing see film scene work rather weakl...\n",
472
+ "40177 saw cheap dvd release title entity force since...\n",
473
+ "19709 one peculiar oft used romance movie plot one s...\n",
474
+ "38555 nothing positive say meandering nonsense huffi...\n",
475
+ "14155 low moment life bewildered depressed sitting r...\n",
476
+ "Name: review, Length: 40000, dtype: object"
477
+ ]
478
+ },
479
+ "execution_count": 7,
480
+ "metadata": {},
481
+ "output_type": "execute_result"
482
+ }
483
+ ],
484
+ "source": [
485
+ "X_train"
486
+ ]
487
+ },
488
+ {
489
+ "cell_type": "markdown",
490
+ "metadata": {
491
+ "id": "6LX-6e-QlioJ"
492
+ },
493
+ "source": [
494
+ "# **TF-IDF and Naive Bayes**"
495
+ ]
496
+ },
497
+ {
498
+ "cell_type": "code",
499
+ "execution_count": 8,
500
+ "metadata": {
501
+ "id": "gscB9-obNusA"
502
+ },
503
+ "outputs": [],
504
+ "source": [
505
+ "from sklearn.metrics import confusion_matrix,classification_report\n",
506
+ "from sklearn.feature_extraction.text import TfidfVectorizer\n",
507
+ "from sklearn.preprocessing import StandardScaler as encoder\n",
508
+ "from sklearn.metrics import (\n",
509
+ " accuracy_score,\n",
510
+ " confusion_matrix,\n",
511
+ " ConfusionMatrixDisplay,\n",
512
+ " f1_score,\n",
513
+ ")\n",
514
+ "\n",
515
+ "\n",
516
+ "tfidf = TfidfVectorizer()\n",
517
+ "tfidf_train = tfidf.fit_transform(X_train)\n",
518
+ "tfidf_test = tfidf.transform(X_test)\n",
519
+ "\n",
520
+ "from sklearn.naive_bayes import MultinomialNB\n",
521
+ "\n",
522
+ "naive_bayes = MultinomialNB()\n",
523
+ "\n",
524
+ "naive_bayes.fit(tfidf_train, y_train)\n",
525
+ "y_pred = naive_bayes.predict(tfidf_test)\n",
526
+ "\n",
527
+ "\n"
528
+ ]
529
+ },
530
+ {
531
+ "cell_type": "code",
532
+ "execution_count": 9,
533
+ "metadata": {
534
+ "colab": {
535
+ "base_uri": "https://localhost:8080/",
536
+ "height": 206
537
+ },
538
+ "id": "RfJ7AHMZvAb8",
539
+ "outputId": "685701e1-b1e8-47fb-9dc5-1bc04dd3894b"
540
+ },
541
+ "outputs": [
542
+ {
543
+ "data": {
544
+ "text/html": [
545
+ "<div>\n",
546
+ "<style scoped>\n",
547
+ " .dataframe tbody tr th:only-of-type {\n",
548
+ " vertical-align: middle;\n",
549
+ " }\n",
550
+ "\n",
551
+ " .dataframe tbody tr th {\n",
552
+ " vertical-align: top;\n",
553
+ " }\n",
554
+ "\n",
555
+ " .dataframe thead th {\n",
556
+ " text-align: right;\n",
557
+ " }\n",
558
+ "</style>\n",
559
+ "<table border=\"1\" class=\"dataframe\">\n",
560
+ " <thead>\n",
561
+ " <tr style=\"text-align: right;\">\n",
562
+ " <th></th>\n",
563
+ " <th>review</th>\n",
564
+ " <th>original sentiment</th>\n",
565
+ " <th>predicted sentiment</th>\n",
566
+ " </tr>\n",
567
+ " </thead>\n",
568
+ " <tbody>\n",
569
+ " <tr>\n",
570
+ " <th>34622</th>\n",
571
+ " <td>hard tell noonan marshall trying ape abbott co...</td>\n",
572
+ " <td>negative</td>\n",
573
+ " <td>negative</td>\n",
574
+ " </tr>\n",
575
+ " <tr>\n",
576
+ " <th>1163</th>\n",
577
+ " <td>well start one reviewer said know real treat s...</td>\n",
578
+ " <td>positive</td>\n",
579
+ " <td>positive</td>\n",
580
+ " </tr>\n",
581
+ " <tr>\n",
582
+ " <th>7637</th>\n",
583
+ " <td>wife kid opinion absolute abc classic seen eve...</td>\n",
584
+ " <td>positive</td>\n",
585
+ " <td>positive</td>\n",
586
+ " </tr>\n",
587
+ " <tr>\n",
588
+ " <th>7045</th>\n",
589
+ " <td>surprise basic copycat comedy classic nutty pr...</td>\n",
590
+ " <td>positive</td>\n",
591
+ " <td>positive</td>\n",
592
+ " </tr>\n",
593
+ " <tr>\n",
594
+ " <th>43847</th>\n",
595
+ " <td>josef von sternberg directs magnificent silent...</td>\n",
596
+ " <td>positive</td>\n",
597
+ " <td>positive</td>\n",
598
+ " </tr>\n",
599
+ " </tbody>\n",
600
+ "</table>\n",
601
+ "</div>"
602
+ ],
603
+ "text/plain": [
604
+ " review original sentiment \\\n",
605
+ "34622 hard tell noonan marshall trying ape abbott co... negative \n",
606
+ "1163 well start one reviewer said know real treat s... positive \n",
607
+ "7637 wife kid opinion absolute abc classic seen eve... positive \n",
608
+ "7045 surprise basic copycat comedy classic nutty pr... positive \n",
609
+ "43847 josef von sternberg directs magnificent silent... positive \n",
610
+ "\n",
611
+ " predicted sentiment \n",
612
+ "34622 negative \n",
613
+ "1163 positive \n",
614
+ "7637 positive \n",
615
+ "7045 positive \n",
616
+ "43847 positive "
617
+ ]
618
+ },
619
+ "execution_count": 9,
620
+ "metadata": {},
621
+ "output_type": "execute_result"
622
+ }
623
+ ],
624
+ "source": [
625
+ "# Build a DataFrame with the results\n",
626
+ "results_df = pd.DataFrame({'review': X_test, 'original sentiment': y_test, 'predicted sentiment': y_pred})\n",
627
+ "results_df.head()"
628
+ ]
629
+ },
630
+ {
631
+ "cell_type": "markdown",
632
+ "metadata": {
633
+ "id": "8Xq2ABXYtsjk"
634
+ },
635
+ "source": [
636
+ "## Evaluation"
637
+ ]
638
+ },
639
+ {
640
+ "cell_type": "code",
641
+ "execution_count": 10,
642
+ "metadata": {
643
+ "id": "3lXqDNhSrhsZ"
644
+ },
645
+ "outputs": [],
646
+ "source": [
647
+ "from sklearn.metrics import confusion_matrix, classification_report\n",
648
+ "import seaborn as sns\n",
649
+ "import matplotlib.pyplot as plt\n",
650
+ "\n",
651
+ "def plot_confusion_matrix(y_true, y_pred, labels, model_name):\n",
652
+ "    cm = confusion_matrix(y_true, y_pred, labels=labels)\n",
653
+ "    plt.figure(figsize=(8, 6))\n",
654
+ "    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=labels, yticklabels=labels)\n",
655
+ "    plt.xlabel('Predicted Labels')\n",
656
+ "    plt.ylabel('True Labels')\n",
657
+ "    plt.title(f'Confusion Matrix {model_name}')\n",
658
+ "    plt.show()\n",
659
+ "\n",
660
+ "# Compute and print the evaluation metrics\n",
661
+ "def print_evaluation_metrics(y_true, y_pred, model_name):\n",
662
+ "    print(f\"Classification Report {model_name}:\")\n",
663
+ "    print(classification_report(y_true, y_pred))\n"
664
+ ]
665
+ },
666
+ {
667
+ "cell_type": "code",
668
+ "execution_count": 11,
669
+ "metadata": {
670
+ "colab": {
671
+ "base_uri": "https://localhost:8080/",
672
+ "height": 564
673
+ },
674
+ "id": "ybfb_GKDuqmb",
675
+ "outputId": "3e4c3a98-8962-4ce8-9856-2252f769a1b8"
676
+ },
677
+ "outputs": [
678
+ {
679
+ "data": {
680
+ "image/png": "",
681
+ "text/plain": [
682
+ "<Figure size 800x600 with 2 Axes>"
683
+ ]
684
+ },
685
+ "metadata": {},
686
+ "output_type": "display_data"
687
+ }
688
+ ],
689
+ "source": [
690
+ "plot_confusion_matrix(y_test, y_pred, ['positive', 'negative'], 'NB')"
691
+ ]
692
+ },
693
+ {
694
+ "cell_type": "code",
695
+ "execution_count": 12,
696
+ "metadata": {
697
+ "colab": {
698
+ "base_uri": "https://localhost:8080/"
699
+ },
700
+ "id": "2580FJCGs_oQ",
701
+ "outputId": "118f79e2-6b57-4cc0-a631-c2ef8a7e317e"
702
+ },
703
+ "outputs": [
704
+ {
705
+ "name": "stdout",
706
+ "output_type": "stream",
707
+ "text": [
708
+ "Classification Report NB:\n",
709
+ " precision recall f1-score support\n",
710
+ "\n",
711
+ " negative 0.86 0.87 0.86 5017\n",
712
+ " positive 0.87 0.86 0.86 4983\n",
713
+ "\n",
714
+ " accuracy 0.86 10000\n",
715
+ " macro avg 0.86 0.86 0.86 10000\n",
716
+ "weighted avg 0.86 0.86 0.86 10000\n",
717
+ "\n"
718
+ ]
719
+ }
720
+ ],
721
+ "source": [
722
+ "# Print the evaluation metrics\n",
723
+ "print_evaluation_metrics(y_test, y_pred, 'NB')"
724
+ ]
725
+ },
726
+ {
727
+ "cell_type": "markdown",
728
+ "metadata": {
729
+ "id": "x0JBy6nXvdjC"
730
+ },
731
+ "source": [
732
+ "# Conclusion\n",
733
+ "\n",
734
+ "The classification report shows precision and recall between 86% and 87% for both classes. The **F1-score**, which combines precision and recall, is approximately 86%, indicating a good balance between the two. The model's overall **accuracy** is 86%, meaning it correctly classified approximately 86% of the examples in the test set.\n",
735
+ "\n",
736
+ "The Naive Bayes model with TF-IDF vectorization achieved well-balanced precision, recall, and F1-score for both classes, with an overall accuracy of 86%. The model therefore predicts the sentiment of the reviews reasonably accurately, and the statistical model performs considerably better than the symbolic approach.\n"
737
+ ]
738
+ }
739
+ ],
740
+ "metadata": {
741
+ "accelerator": "GPU",
742
+ "colab": {
743
+ "gpuType": "T4",
744
+ "provenance": []
745
+ },
746
+ "kernelspec": {
747
+ "display_name": "Python 3",
748
+ "name": "python3"
749
+ },
750
+ "language_info": {
751
+ "codemirror_mode": {
752
+ "name": "ipython",
753
+ "version": 3
754
+ },
755
+ "file_extension": ".py",
756
+ "mimetype": "text/x-python",
757
+ "name": "python",
758
+ "nbconvert_exporter": "python",
759
+ "pygments_lexer": "ipython3",
760
+ "version": "3.11.7"
761
+ }
762
+ },
763
+ "nbformat": 4,
764
+ "nbformat_minor": 0
765
+ }
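The notebook above relies on `TfidfVectorizer` to turn each review into a weighted vector. As a rough illustration of what that weighting does, here is a minimal pure-Python sketch, assuming smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization, which is my understanding of the scikit-learn defaults. The function `tfidf_vectors`, the whitespace tokenization, and the three toy documents are assumptions made for this example and are not taken from the notebook.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, vectors) using smoothed IDF and L2-normalized rows.

    Illustrative helper only; tokenization is a plain whitespace split.
    """
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in vocab}
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)  # raw term counts within this document
        row = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in row)) or 1.0
        vectors.append([w / norm for w in row])
    return vocab, vectors


# Toy corpus in the style of the preprocessed reviews (hypothetical data).
docs = ["wonderful little production", "movie awful bad", "wonderful movie"]
vocab, vectors = tfidf_vectors(docs)
for term, weight in zip(vocab, vectors[0]):
    print(f"{term:12s} {weight:.3f}")
```

In the first document, "little" ends up with a larger weight than "wonderful" because "wonderful" also appears in another document. This is the property that lets the classifier lean on terms that are distinctive for a review rather than on terms that are common everywhere.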
notebooks_explicativos/Neural_Bert.ipynb ADDED
@@ -0,0 +1,1291 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "markdown",
5
+ "metadata": {},
6
+ "source": [
7
+ "# SCC0633/SCC5908 - Natural Language Processing\n",
8
+ "> **Instructor:** Thiago Alexandre Salgueiro Pardo \\\n",
9
+ "> **PAE Teaching Assistant:** Germano Antonio Zani Jorge\n",
10
+ "\n",
11
+ "\n",
12
+ "# Group Members: GPTrouxas\n",
13
+ "> André Guarnier De Mitri - 11395579 \\\n",
14
+ "> Daniel Carvalho - 10685702 \\\n",
15
+ "> Fernando - 11795342 \\\n",
16
+ "> Lucas Henrique Sant'Anna - 10748521 \\\n",
17
+ "> Magaly L Fujimoto - 4890582\n"
18
+ ]
19
+ },
20
+ {
21
+ "cell_type": "markdown",
22
+ "metadata": {},
23
+ "source": [
24
+ "# Neural Approach Using BERT\n",
25
+ "![alt text](../imagens/BERT_TDIDF.png)"
26
+ ]
27
+ },
28
+ {
29
+ "cell_type": "markdown",
30
+ "metadata": {},
31
+ "source": [
32
+ "###"
33
+ ]
34
+ },
35
+ {
36
+ "cell_type": "markdown",
37
+ "metadata": {
38
+ "id": "6yecpJR0feeQ"
39
+ },
40
+ "source": [
41
+ "## Importing Libraries"
42
+ ]
43
+ },
44
+ {
45
+ "cell_type": "code",
46
+ "execution_count": 1,
47
+ "metadata": {
48
+ "id": "FAIvyZwodEtm"
49
+ },
50
+ "outputs": [],
51
+ "source": [
52
+ "import torch\n",
53
+ "import numpy as np\n",
54
+ "import matplotlib.pyplot as plt\n",
55
+ "import math\n",
56
+ "from tqdm.notebook import tqdm\n",
57
+ "import pandas as pd"
58
+ ]
59
+ },
60
+ {
61
+ "cell_type": "code",
62
+ "execution_count": 3,
63
+ "metadata": {},
64
+ "outputs": [],
65
+ "source": [
66
+ "#!pip install transformers seaborn nltk"
67
+ ]
68
+ },
69
+ {
70
+ "cell_type": "markdown",
71
+ "metadata": {},
72
+ "source": [
73
+ "## Loading the Data"
74
+ ]
75
+ },
76
+ {
77
+ "cell_type": "code",
78
+ "execution_count": 3,
79
+ "metadata": {
80
+ "colab": {
81
+ "base_uri": "https://localhost:8080/",
82
+ "height": 206
83
+ },
84
+ "id": "LYgXl3RIfgfo",
85
+ "outputId": "eb496faf-7826-44f7-fa88-3b21fb6e7cbf"
86
+ },
87
+ "outputs": [
88
+ {
89
+ "data": {
90
+ "text/html": [
91
+ "<div>\n",
92
+ "<style scoped>\n",
93
+ " .dataframe tbody tr th:only-of-type {\n",
94
+ " vertical-align: middle;\n",
95
+ " }\n",
96
+ "\n",
97
+ " .dataframe tbody tr th {\n",
98
+ " vertical-align: top;\n",
99
+ " }\n",
100
+ "\n",
101
+ " .dataframe thead th {\n",
102
+ " text-align: right;\n",
103
+ " }\n",
104
+ "</style>\n",
105
+ "<table border=\"1\" class=\"dataframe\">\n",
106
+ " <thead>\n",
107
+ " <tr style=\"text-align: right;\">\n",
108
+ " <th></th>\n",
109
+ " <th>review</th>\n",
110
+ " <th>sentiment</th>\n",
111
+ " </tr>\n",
112
+ " </thead>\n",
113
+ " <tbody>\n",
114
+ " <tr>\n",
115
+ " <th>0</th>\n",
116
+ " <td>One of the other reviewers has mentioned that ...</td>\n",
117
+ " <td>positive</td>\n",
118
+ " </tr>\n",
119
+ " <tr>\n",
120
+ " <th>1</th>\n",
121
+ " <td>A wonderful little production. &lt;br /&gt;&lt;br /&gt;The...</td>\n",
122
+ " <td>positive</td>\n",
123
+ " </tr>\n",
124
+ " <tr>\n",
125
+ " <th>2</th>\n",
126
+ " <td>I thought this was a wonderful way to spend ti...</td>\n",
127
+ " <td>positive</td>\n",
128
+ " </tr>\n",
129
+ " <tr>\n",
130
+ " <th>3</th>\n",
131
+ " <td>Basically there's a family where a little boy ...</td>\n",
132
+ " <td>negative</td>\n",
133
+ " </tr>\n",
134
+ " <tr>\n",
135
+ " <th>4</th>\n",
136
+ " <td>Petter Mattei's \"Love in the Time of Money\" is...</td>\n",
137
+ " <td>positive</td>\n",
138
+ " </tr>\n",
139
+ " </tbody>\n",
140
+ "</table>\n",
141
+ "</div>"
142
+ ],
143
+ "text/plain": [
144
+ " review sentiment\n",
145
+ "0 One of the other reviewers has mentioned that ... positive\n",
146
+ "1 A wonderful little production. <br /><br />The... positive\n",
147
+ "2 I thought this was a wonderful way to spend ti... positive\n",
148
+ "3 Basically there's a family where a little boy ... negative\n",
149
+ "4 Petter Mattei's \"Love in the Time of Money\" is... positive"
150
+ ]
151
+ },
152
+ "execution_count": 3,
153
+ "metadata": {},
154
+ "output_type": "execute_result"
155
+ }
156
+ ],
157
+ "source": [
158
+ "df_reviews = pd.read_csv('imdb_reviews.csv')\n",
159
+ "df_reviews.head()"
160
+ ]
161
+ },
162
+ {
163
+ "cell_type": "markdown",
164
+ "metadata": {},
165
+ "source": [
166
+ "## Mapping the classes\n",
167
+ "- Positive sentiment receives label 1\n",
168
+ "- Negative sentiment receives label 0"
169
+ ]
170
+ },
171
+ {
172
+ "cell_type": "code",
173
+ "execution_count": 4,
174
+ "metadata": {
175
+ "colab": {
176
+ "base_uri": "https://localhost:8080/",
177
+ "height": 206
178
+ },
179
+ "id": "D-5n8XzJbWOO",
180
+ "outputId": "cef630cc-b0cc-4598-c53f-d32636bfcd86"
181
+ },
182
+ "outputs": [
183
+ {
184
+ "data": {
185
+ "text/html": [
186
+ "<div>\n",
187
+ "<style scoped>\n",
188
+ " .dataframe tbody tr th:only-of-type {\n",
189
+ " vertical-align: middle;\n",
190
+ " }\n",
191
+ "\n",
192
+ " .dataframe tbody tr th {\n",
193
+ " vertical-align: top;\n",
194
+ " }\n",
195
+ "\n",
196
+ " .dataframe thead th {\n",
197
+ " text-align: right;\n",
198
+ " }\n",
199
+ "</style>\n",
200
+ "<table border=\"1\" class=\"dataframe\">\n",
201
+ " <thead>\n",
202
+ " <tr style=\"text-align: right;\">\n",
203
+ " <th></th>\n",
204
+ " <th>review</th>\n",
205
+ " <th>sentiment</th>\n",
206
+ " </tr>\n",
207
+ " </thead>\n",
208
+ " <tbody>\n",
209
+ " <tr>\n",
210
+ " <th>0</th>\n",
211
+ " <td>One of the other reviewers has mentioned that ...</td>\n",
212
+ " <td>1</td>\n",
213
+ " </tr>\n",
214
+ " <tr>\n",
215
+ " <th>1</th>\n",
216
+ " <td>A wonderful little production. &lt;br /&gt;&lt;br /&gt;The...</td>\n",
217
+ " <td>1</td>\n",
218
+ " </tr>\n",
219
+ " <tr>\n",
220
+ " <th>2</th>\n",
221
+ " <td>I thought this was a wonderful way to spend ti...</td>\n",
222
+ " <td>1</td>\n",
223
+ " </tr>\n",
224
+ " <tr>\n",
225
+ " <th>3</th>\n",
226
+ " <td>Basically there's a family where a little boy ...</td>\n",
227
+ " <td>0</td>\n",
228
+ " </tr>\n",
229
+ " <tr>\n",
230
+ " <th>4</th>\n",
231
+ " <td>Petter Mattei's \"Love in the Time of Money\" is...</td>\n",
232
+ " <td>1</td>\n",
233
+ " </tr>\n",
234
+ " </tbody>\n",
235
+ "</table>\n",
236
+ "</div>"
237
+ ],
238
+ "text/plain": [
239
+ " review sentiment\n",
240
+ "0 One of the other reviewers has mentioned that ... 1\n",
241
+ "1 A wonderful little production. <br /><br />The... 1\n",
242
+ "2 I thought this was a wonderful way to spend ti... 1\n",
243
+ "3 Basically there's a family where a little boy ... 0\n",
244
+ "4 Petter Mattei's \"Love in the Time of Money\" is... 1"
245
+ ]
246
+ },
247
+ "execution_count": 4,
248
+ "metadata": {},
249
+ "output_type": "execute_result"
250
+ }
251
+ ],
252
+ "source": [
253
+ "def map_sentiments(sentiment):\n",
254
+ "    if sentiment == 'positive':\n",
255
+ "        return 1\n",
256
+ "    return 0\n",
257
+ "\n",
258
+ "df_reviews['sentiment'] = df_reviews['sentiment'].apply(map_sentiments)\n",
259
+ "df_reviews.head()"
260
+ ]
261
+ },
262
+ {
263
+ "cell_type": "markdown",
264
+ "metadata": {},
265
+ "source": [
266
+ "# Text-cleaning functions\n",
267
+ "**lowercase_text(text)** Converts the text to lowercase so that casing is uniform.\n",
268
+ "\n",
269
+ "\n",
270
+ "**remove_html(text)** Removes any HTML tags from the text, cleaning data that comes from HTML sources.\n",
271
+ "\n",
272
+ "\n",
273
+ "**remove_url(text)** Removes URLs from the text, since links are rarely relevant to text analysis.\n",
274
+ "\n",
275
+ "\n",
276
+ "**remove_punctuations(text)** Replaces punctuation with spaces to simplify the text, leaving only words.\n",
277
+ "\n",
278
+ "**remove_emojis(text)** Removes emojis from the text so that non-verbal characters do not interfere with the analysis.\n",
279
+ "\n",
280
+ "**remove_stop_words(text)** Removes English stop words (common words such as \"the\", \"and\", \"of\") that usually add little value to text analysis.\n",
281
+ "\n",
282
+ "**stem_words(text)** Applies stemming, reducing each word to its stem (for example, \"running\" becomes \"run\") to normalize word variations.\n",
283
+ "\n",
284
+ "**preprocess_text(text)** Applies all of the functions above in sequence, making the text better suited for analysis or modeling.\n",
285
+ "\n",
286
+ "\n",
287
+ "\n"
288
+ ]
289
+ },
290
+ {
291
+ "cell_type": "code",
292
+ "execution_count": 5,
293
+ "metadata": {
294
+ "colab": {
295
+ "base_uri": "https://localhost:8080/",
296
+ "height": 241
297
+ },
298
+ "id": "PnFHO62rnWn-",
299
+ "outputId": "17fb6619-fab9-4395-de5d-4c5199e7e45e"
300
+ },
301
+ "outputs": [
302
+ {
303
+ "name": "stderr",
304
+ "output_type": "stream",
305
+ "text": [
306
+ "[nltk_data] Downloading package stopwords to\n",
307
+ "[nltk_data] C:\\Users\\andre\\AppData\\Roaming\\nltk_data...\n",
308
+ "[nltk_data] Package stopwords is already up-to-date!\n"
309
+ ]
310
+ },
311
+ {
312
+ "data": {
313
+ "text/html": [
314
+ "<div>\n",
315
+ "<style scoped>\n",
316
+ " .dataframe tbody tr th:only-of-type {\n",
317
+ " vertical-align: middle;\n",
318
+ " }\n",
319
+ "\n",
320
+ " .dataframe tbody tr th {\n",
321
+ " vertical-align: top;\n",
322
+ " }\n",
323
+ "\n",
324
+ " .dataframe thead th {\n",
325
+ " text-align: right;\n",
326
+ " }\n",
327
+ "</style>\n",
328
+ "<table border=\"1\" class=\"dataframe\">\n",
329
+ " <thead>\n",
330
+ " <tr style=\"text-align: right;\">\n",
331
+ " <th></th>\n",
332
+ " <th>review</th>\n",
333
+ " <th>sentiment</th>\n",
334
+ " </tr>\n",
335
+ " </thead>\n",
336
+ " <tbody>\n",
337
+ " <tr>\n",
338
+ " <th>0</th>\n",
339
+ " <td>one review mention watch 1 oz episod hook righ...</td>\n",
340
+ " <td>1</td>\n",
341
+ " </tr>\n",
342
+ " <tr>\n",
343
+ " <th>1</th>\n",
344
+ " <td>wonder littl product film techniqu unassum old...</td>\n",
345
+ " <td>1</td>\n",
346
+ " </tr>\n",
347
+ " <tr>\n",
348
+ " <th>2</th>\n",
349
+ " <td>thought wonder way spend time hot summer weeke...</td>\n",
350
+ " <td>1</td>\n",
351
+ " </tr>\n",
352
+ " <tr>\n",
353
+ " <th>3</th>\n",
354
+ " <td>basic famili littl boy jake think zombi closet...</td>\n",
355
+ " <td>0</td>\n",
356
+ " </tr>\n",
357
+ " <tr>\n",
358
+ " <th>4</th>\n",
359
+ " <td>petter mattei love time money visual stun film...</td>\n",
360
+ " <td>1</td>\n",
361
+ " </tr>\n",
362
+ " </tbody>\n",
363
+ "</table>\n",
364
+ "</div>"
365
+ ],
366
+ "text/plain": [
367
+ " review sentiment\n",
368
+ "0 one review mention watch 1 oz episod hook righ... 1\n",
369
+ "1 wonder littl product film techniqu unassum old... 1\n",
370
+ "2 thought wonder way spend time hot summer weeke... 1\n",
371
+ "3 basic famili littl boy jake think zombi closet... 0\n",
372
+ "4 petter mattei love time money visual stun film... 1"
373
+ ]
374
+ },
375
+ "execution_count": 5,
376
+ "metadata": {},
377
+ "output_type": "execute_result"
378
+ }
379
+ ],
380
+ "source": [
381
+ "import re\n",
382
+ "import nltk\n",
383
+ "from nltk.corpus import stopwords\n",
384
+ "from nltk.stem import PorterStemmer\n",
385
+ "\n",
386
+ "\n",
387
+ "def lowercase_text(text):\n",
388
+ " return text.lower()\n",
389
+ "\n",
390
+ "def remove_html(text):\n",
391
+ " return re.sub(r'<[^<]+?>', '', text)\n",
392
+ "\n",
393
+ "def remove_url(text):\n",
394
+ " return re.sub(r'http[s]?://\\S+|www\\.\\S+', '', text)\n",
395
+ "\n",
396
+ "def remove_punctuations(text):\n",
396
+ "    tokens_list = '!\"#$%&\\'()*+,-./:;<=>?@[\\\\]^_`{|}~'\n",
397
+ "    # Replace each punctuation character with a space\n",
398
+ "    for char in tokens_list:\n",
399
+ "        text = text.replace(char, ' ')\n",
400
+ "\n",
401
+ "    return text\n",
403
+ "\n",
404
+ "def remove_emojis(text):\n",
405
+ " emojis = re.compile(\"[\"\n",
406
+ " u\"\\U0001F600-\\U0001F64F\"\n",
407
+ " u\"\\U0001F300-\\U0001F5FF\"\n",
408
+ " u\"\\U0001F680-\\U0001F6FF\"\n",
409
+ " u\"\\U0001F1E0-\\U0001F1FF\"\n",
410
+ " u\"\\U00002500-\\U00002BEF\"\n",
411
+ " u\"\\U00002702-\\U000027B0\"\n",
413
+ " u\"\\U000024C2-\\U0001F251\"\n",
414
+ " u\"\\U0001f926-\\U0001f937\"\n",
415
+ " u\"\\U00010000-\\U0010ffff\"\n",
416
+ " u\"\\u2640-\\u2642\"\n",
417
+ " u\"\\u2600-\\u2B55\"\n",
418
+ " u\"\\u200d\"\n",
419
+ " u\"\\u23cf\"\n",
420
+ " u\"\\u23e9\"\n",
421
+ " u\"\\u231a\"\n",
422
+ " u\"\\ufe0f\"\n",
423
+ " u\"\\u3030\"\n",
424
+ " \"]+\", re.UNICODE)\n",
425
+ "\n",
426
+ " text = re.sub(emojis, '', text)\n",
427
+ " return text\n",
428
+ "\n",
429
+ "def remove_stop_words(text):\n",
429
+ "    stop_words = set(stopwords.words('english'))  # set for fast lookups\n",
430
+ "    kept_words = []\n",
431
+ "    for word in text.split():\n",
432
+ "        if word not in stop_words:\n",
433
+ "            kept_words.append(word)\n",
434
+ "\n",
435
+ "    return ' '.join(kept_words)\n",
437
+ "\n",
438
+ "def stem_words(text):\n",
438
+ "    stemmer = PorterStemmer()\n",
439
+ "    stemmed_words = []\n",
440
+ "    for word in text.split():\n",
441
+ "        stemmed_words.append(stemmer.stem(word))\n",
442
+ "\n",
443
+ "    return ' '.join(stemmed_words)\n",
445
+ "\n",
446
+ "def preprocess_text(text):\n",
447
+ " text = lowercase_text(text)\n",
448
+ " text = remove_html(text)\n",
449
+ " text = remove_url(text)\n",
450
+ " text = remove_punctuations(text)\n",
451
+ " text = remove_emojis(text)\n",
452
+ " text = remove_stop_words(text)\n",
453
+ " text = stem_words(text)\n",
454
+ "\n",
455
+ " return text\n",
456
+ "\n",
457
+ "nltk.download('stopwords')\n",
458
+ "df_reviews['review'] = df_reviews['review'].apply(preprocess_text)\n",
459
+ "df_reviews.head()"
460
+ ]
461
+ },
462
+ {
463
+ "cell_type": "markdown",
464
+ "metadata": {},
465
+ "source": [
466
+ "### Visualizing the class balance"
467
+ ]
468
+ },
469
+ {
470
+ "cell_type": "code",
471
+ "execution_count": 6,
472
+ "metadata": {
473
+ "colab": {
474
+ "base_uri": "https://localhost:8080/",
475
+ "height": 452
476
+ },
477
+ "id": "Gdi_L0HWfntv",
478
+ "outputId": "bce77594-f662-4b3f-c8eb-27d8a188b4f2"
479
+ },
480
+ "outputs": [
481
+ {
482
+ "data": {
483
+ "image/png": "",
484
+ "text/plain": [
485
+ "<Figure size 640x480 with 1 Axes>"
486
+ ]
487
+ },
488
+ "metadata": {},
489
+ "output_type": "display_data"
490
+ }
491
+ ],
492
+ "source": [
493
+ "plt.title('Target value distribution')\n",
494
+ "plt.hist(df_reviews['sentiment'])\n",
495
+ "plt.show()"
496
+ ]
497
+ },
498
+ {
499
+ "cell_type": "markdown",
500
+ "metadata": {},
501
+ "source": [
502
+ "# BERT model"
503
+ ]
504
+ },
505
+ {
506
+ "cell_type": "markdown",
507
+ "metadata": {
508
+ "id": "EDkjlPDakskM"
509
+ },
510
+ "source": [
511
+ "## Installing libraries"
512
+ ]
513
+ },
514
+ {
515
+ "cell_type": "code",
516
+ "execution_count": 4,
517
+ "metadata": {
518
+ "colab": {
519
+ "base_uri": "https://localhost:8080/"
520
+ },
521
+ "id": "lk7m_1xvmWvz",
522
+ "outputId": "ce842053-b261-4768-d9d7-fe9c65c9f6aa"
523
+ },
524
+ "outputs": [],
525
+ "source": [
526
+ "#pip install transformers\n",
527
+ "#pip install accelerate -U\n",
528
+ "#pip install transformers[torch]\n",
529
+ "#pip install datasets evaluate"
530
+ ]
531
+ },
532
+ {
533
+ "cell_type": "markdown",
534
+ "metadata": {},
535
+ "source": [
536
+ "## Loading the pre-trained model and tokenizer"
537
+ ]
538
+ },
539
+ {
540
+ "cell_type": "code",
541
+ "execution_count": 10,
542
+ "metadata": {
543
+ "colab": {
544
+ "base_uri": "https://localhost:8080/"
545
+ },
546
+ "id": "GlyrkK52zMcc",
547
+ "outputId": "a938653b-92c3-4b4e-802c-eacc3f1b6ecf"
548
+ },
549
+ "outputs": [
550
+ {
551
+ "name": "stderr",
552
+ "output_type": "stream",
553
+ "text": [
554
+ "c:\\Users\\andre\\1JUPYTER\\dt_labs\\.venv\\Lib\\site-packages\\tqdm\\auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
555
+ " from .autonotebook import tqdm as notebook_tqdm\n"
556
+ ]
557
+ }
558
+ ],
559
+ "source": [
560
+ "from transformers import AutoTokenizer\n",
561
+ "from transformers import BertForSequenceClassification\n",
562
+ "\n",
563
+ "pre_trained_base = \"bert-base-uncased\"\n",
564
+ "tokenizer = AutoTokenizer.from_pretrained(pre_trained_base)\n",
565
+ "model = BertForSequenceClassification.from_pretrained(pre_trained_base, num_labels = 2, output_attentions=False, output_hidden_states=False)"
566
+ ]
567
+ },
568
+ {
569
+ "cell_type": "markdown",
570
+ "metadata": {},
571
+ "source": [
572
+ "### Tokenizing the sentences and computing token lengths"
573
+ ]
574
+ },
575
+ {
576
+ "cell_type": "code",
577
+ "execution_count": 13,
578
+ "metadata": {
579
+ "id": "LKEjDZCHpk4e"
580
+ },
581
+ "outputs": [],
582
+ "source": [
583
+ "token_lens = []\n",
584
+ "\n",
585
+ "for sentence in df_reviews['review']:\n",
586
+ " tokens = tokenizer.encode(sentence, max_length=200, truncation=True)\n",
587
+ " token_lens.append(len(tokens))"
588
+ ]
589
+ },
590
+ {
591
+ "cell_type": "markdown",
592
+ "metadata": {},
593
+ "source": [
594
+ "### Splitting the data into training and validation sets"
595
+ ]
596
+ },
597
+ {
598
+ "cell_type": "code",
599
+ "execution_count": 15,
600
+ "metadata": {
601
+ "id": "H7PfXaVVp2uQ"
602
+ },
603
+ "outputs": [],
604
+ "source": [
605
+ "SEED=42\n",
606
+ "MAX_LEN = 200\n",
607
+ "from sklearn.model_selection import train_test_split\n",
608
+ "df_train, df_val = train_test_split(df_reviews, test_size=0.2, random_state=SEED)"
609
+ ]
610
+ },
611
+ {
612
+ "cell_type": "markdown",
613
+ "metadata": {},
614
+ "source": [
615
+ "### Processing the data\n",
616
+ "The process_data function receives a dataframe row containing a review text and its sentiment label. It first extracts the review text and collapses any extra whitespace. It then tokenizes the text with the BERT tokenizer, applying padding and truncation so that every sequence has the fixed length given by MAX_LEN. Finally, it adds the sentiment label and the cleaned text to the encodings and returns them, so the result holds the token encodings, the label, and the cleaned text."
617
+ ]
618
+ },
619
+ {
620
+ "cell_type": "code",
621
+ "execution_count": 16,
622
+ "metadata": {
623
+ "id": "v7EZ6wd-qDfd"
624
+ },
625
+ "outputs": [],
626
+ "source": [
627
+ "def process_data(row):\n",
628
+ "\n",
629
+ " text = row['review']\n",
630
+ " text = str(text)\n",
631
+ " text = ' '.join(text.split())\n",
632
+ "\n",
633
+ " encodings = tokenizer(text, padding=\"max_length\", truncation=True, max_length=MAX_LEN)\n",
634
+ "\n",
635
+ " encodings['label'] = row['sentiment']\n",
636
+ " encodings['text'] = text\n",
637
+ "\n",
638
+ " return encodings"
639
+ ]
640
+ },
641
+ {
642
+ "cell_type": "code",
643
+ "execution_count": 17,
644
+ "metadata": {
645
+ "id": "d9VgrXNSqIYL"
646
+ },
647
+ "outputs": [],
648
+ "source": [
649
+ "# Training set\n",
650
+ "processed_data_tr = []\n",
651
+ "for i in range(df_train.shape[0]):\n",
652
+ " processed_data_tr.append(process_data(df_train.iloc[i]))"
653
+ ]
654
+ },
655
+ {
656
+ "cell_type": "code",
657
+ "execution_count": 18,
658
+ "metadata": {
659
+ "id": "p0NLQxoKqJ_k"
660
+ },
661
+ "outputs": [],
662
+ "source": [
663
+ "# Validation set\n",
664
+ "processed_data_val = []\n",
665
+ "for i in range(df_val.shape[0]):\n",
666
+ " processed_data_val.append(process_data(df_val.iloc[i]))"
667
+ ]
668
+ },
669
+ {
670
+ "cell_type": "code",
671
+ "execution_count": 19,
672
+ "metadata": {
673
+ "id": "ac76Rb6fqP_G"
674
+ },
675
+ "outputs": [],
676
+ "source": [
677
+ "# Training and validation DataFrames\n",
678
+ "df_train = pd.DataFrame(processed_data_tr)\n",
679
+ "df_val = pd.DataFrame(processed_data_val)"
680
+ ]
681
+ },
682
+ {
683
+ "cell_type": "code",
684
+ "execution_count": 20,
685
+ "metadata": {
686
+ "colab": {
687
+ "base_uri": "https://localhost:8080/",
688
+ "height": 206
689
+ },
690
+ "id": "RdbHaVy_fd64",
691
+ "outputId": "a9aed834-81b7-4223-da42-6289799c2e1e"
692
+ },
693
+ "outputs": [
694
+ {
695
+ "data": {
696
+ "text/html": [
697
+ "<div>\n",
698
+ "<style scoped>\n",
699
+ " .dataframe tbody tr th:only-of-type {\n",
700
+ " vertical-align: middle;\n",
701
+ " }\n",
702
+ "\n",
703
+ " .dataframe tbody tr th {\n",
704
+ " vertical-align: top;\n",
705
+ " }\n",
706
+ "\n",
707
+ " .dataframe thead th {\n",
708
+ " text-align: right;\n",
709
+ " }\n",
710
+ "</style>\n",
711
+ "<table border=\"1\" class=\"dataframe\">\n",
712
+ " <thead>\n",
713
+ " <tr style=\"text-align: right;\">\n",
714
+ " <th></th>\n",
715
+ " <th>attention_mask</th>\n",
716
+ " <th>input_ids</th>\n",
717
+ " <th>label</th>\n",
718
+ " <th>text</th>\n",
719
+ " <th>token_type_ids</th>\n",
720
+ " </tr>\n",
721
+ " </thead>\n",
722
+ " <tbody>\n",
723
+ " <tr>\n",
724
+ " <th>0</th>\n",
725
+ " <td>[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...</td>\n",
726
+ " <td>[101, 2921, 3198, 23624, 2954, 6978, 2674, 841...</td>\n",
727
+ " <td>0</td>\n",
728
+ " <td>kept ask mani fight scream match swear gener m...</td>\n",
729
+ " <td>[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...</td>\n",
730
+ " </tr>\n",
731
+ " <tr>\n",
732
+ " <th>1</th>\n",
733
+ " <td>[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...</td>\n",
734
+ " <td>[101, 3422, 4372, 3775, 2099, 9587, 5737, 2071...</td>\n",
735
+ " <td>0</td>\n",
736
+ " <td>watch entir movi could watch entir movi stop d...</td>\n",
737
+ " <td>[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...</td>\n",
738
+ " </tr>\n",
739
+ " <tr>\n",
740
+ " <th>2</th>\n",
741
+ " <td>[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...</td>\n",
742
+ " <td>[101, 3543, 2293, 2358, 10050, 2128, 25300, 11...</td>\n",
743
+ " <td>1</td>\n",
744
+ " <td>touch love stori reminisc ‘in mood love draw h...</td>\n",
745
+ " <td>[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...</td>\n",
746
+ " </tr>\n",
747
+ " <tr>\n",
748
+ " <th>3</th>\n",
749
+ " <td>[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...</td>\n",
750
+ " <td>[101, 3732, 2154, 11865, 15472, 2072, 8040, 73...</td>\n",
751
+ " <td>0</td>\n",
752
+ " <td>latter day fulci schlocker total abysm concoct...</td>\n",
753
+ " <td>[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...</td>\n",
754
+ " </tr>\n",
755
+ " <tr>\n",
756
+ " <th>4</th>\n",
757
+ " <td>[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...</td>\n",
758
+ " <td>[101, 2034, 3813, 3669, 19337, 2666, 2615, 504...</td>\n",
759
+ " <td>0</td>\n",
760
+ " <td>first firmli believ norwegian movi continu get...</td>\n",
761
+ " <td>[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...</td>\n",
762
+ " </tr>\n",
763
+ " </tbody>\n",
764
+ "</table>\n",
765
+ "</div>"
766
+ ],
767
+ "text/plain": [
768
+ " attention_mask \\\n",
769
+ "0 [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... \n",
770
+ "1 [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... \n",
771
+ "2 [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... \n",
772
+ "3 [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... \n",
773
+ "4 [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... \n",
774
+ "\n",
775
+ " input_ids label \\\n",
776
+ "0 [101, 2921, 3198, 23624, 2954, 6978, 2674, 841... 0 \n",
777
+ "1 [101, 3422, 4372, 3775, 2099, 9587, 5737, 2071... 0 \n",
778
+ "2 [101, 3543, 2293, 2358, 10050, 2128, 25300, 11... 1 \n",
779
+ "3 [101, 3732, 2154, 11865, 15472, 2072, 8040, 73... 0 \n",
780
+ "4 [101, 2034, 3813, 3669, 19337, 2666, 2615, 504... 0 \n",
781
+ "\n",
782
+ " text \\\n",
783
+ "0 kept ask mani fight scream match swear gener m... \n",
784
+ "1 watch entir movi could watch entir movi stop d... \n",
785
+ "2 touch love stori reminisc ‘in mood love draw h... \n",
786
+ "3 latter day fulci schlocker total abysm concoct... \n",
787
+ "4 first firmli believ norwegian movi continu get... \n",
788
+ "\n",
789
+ " token_type_ids \n",
790
+ "0 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... \n",
791
+ "1 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... \n",
792
+ "2 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... \n",
793
+ "3 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... \n",
794
+ "4 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... "
795
+ ]
796
+ },
797
+ "execution_count": 20,
798
+ "metadata": {},
799
+ "output_type": "execute_result"
800
+ }
801
+ ],
802
+ "source": [
803
+ "df_train.head()"
804
+ ]
805
+ },
806
+ {
807
+ "cell_type": "markdown",
808
+ "metadata": {
809
+ "id": "0lTWT8JwkRic"
810
+ },
811
+ "source": [
812
+ "## Fine-tuning the model\n",
813
+ "Fine-tuning BERT for the specific task of sentiment classification on the IMDB dataset"
814
+ ]
815
+ },
816
+ {
817
+ "cell_type": "code",
818
+ "execution_count": null,
819
+ "metadata": {},
820
+ "outputs": [],
821
+ "source": [
822
+ "import torch\n",
823
+ "import pyarrow as pa\n",
824
+ "from datasets import Dataset\n",
825
+ "import evaluate\n",
826
+ "import numpy as np"
827
+ ]
828
+ },
829
+ {
830
+ "cell_type": "code",
831
+ "execution_count": 21,
832
+ "metadata": {
833
+ "colab": {
834
+ "base_uri": "https://localhost:8080/"
835
+ },
836
+ "id": "kW53p7VQqUDD",
837
+ "outputId": "8231f3ba-37d5-4546-c4d0-6b4ff317ecf3"
838
+ },
839
+ "outputs": [
840
+ {
841
+ "data": {
842
+ "text/plain": [
843
+ "device(type='cuda', index=0)"
844
+ ]
845
+ },
846
+ "execution_count": 21,
847
+ "metadata": {},
848
+ "output_type": "execute_result"
849
+ }
850
+ ],
851
+ "source": [
852
+ "device = torch.device(\"cuda:0\" if torch.cuda.is_available() else \"cpu\")\n",
853
+ "device"
854
+ ]
855
+ },
856
+ {
857
+ "cell_type": "code",
858
+ "execution_count": 24,
859
+ "metadata": {
860
+ "id": "68OdbTv5rLrm"
861
+ },
862
+ "outputs": [],
863
+ "source": [
864
+ "train_hg = Dataset(pa.Table.from_pandas(df_train))\n",
865
+ "valid_hg = Dataset(pa.Table.from_pandas(df_val))"
866
+ ]
867
+ },
868
+ {
869
+ "cell_type": "markdown",
870
+ "metadata": {},
871
+ "source": [
872
+ "## Evaluation metrics: F1 score and accuracy"
873
+ ]
874
+ },
875
+ {
876
+ "cell_type": "markdown",
877
+ "metadata": {},
878
+ "source": [
879
+ "`compute_metrics` computes both accuracy and F1 score to evaluate a classification model. The accuracy and F1 metrics are first loaded with evaluate.load. The function then receives eval_pred, a pair of arrays holding the model's raw predictions (logits) and the true labels, and takes the argmax of the logits to obtain the predicted classes. Accuracy is computed by comparing those classes with the labels, and the F1 score is computed with \"weighted\" averaging. Both results are combined into a single dictionary and returned."
880
+ ]
881
+ },
882
+ {
883
+ "cell_type": "code",
884
+ "execution_count": 25,
885
+ "metadata": {
886
+ "id": "lUNhDPs0ry4m"
887
+ },
888
+ "outputs": [],
889
+ "source": [
890
+ "\n",
891
+ "# Load both accuracy and f1 metrics\n",
892
+ "accuracy_metric = evaluate.load(\"accuracy\")\n",
893
+ "f1_metric = evaluate.load(\"f1\")\n",
894
+ "\n",
895
+ "# Metric helper method\n",
896
+ "def compute_metrics(eval_pred):\n",
897
+ " predictions, labels = eval_pred\n",
898
+ " predictions = np.argmax(predictions, axis=1)\n",
899
+ "\n",
900
+ " # Compute accuracy\n",
901
+ " accuracy = accuracy_metric.compute(predictions=predictions, references=labels)\n",
902
+ "\n",
903
+ " # Compute F1 score\n",
904
+ " f1 = f1_metric.compute(predictions=predictions, references=labels, average=\"weighted\")\n",
905
+ "\n",
906
+ " # Combine the metrics into a single dictionary\n",
907
+ " combined_metrics = {\n",
908
+ " 'accuracy': accuracy['accuracy'],\n",
909
+ " 'f1': f1['f1']\n",
910
+ " }\n",
911
+ "\n",
912
+ " return combined_metrics"
913
+ ]
914
+ },
915
+ {
916
+ "cell_type": "code",
917
+ "execution_count": 26,
918
+ "metadata": {
919
+ "colab": {
920
+ "base_uri": "https://localhost:8080/"
921
+ },
922
+ "id": "9jJYTWsHjnEc",
923
+ "outputId": "fe45691a-4476-4978-89b8-15f36465c37c"
924
+ },
925
+ "outputs": [
926
+ {
927
+ "name": "stdout",
928
+ "output_type": "stream",
929
+ "text": [
930
+ "Name: accelerateNote: you may need to restart the kernel to use updated packages.\n",
931
+ "\n",
932
+ "Version: 0.31.0\n",
933
+ "Summary: Accelerate\n",
934
+ "Home-page: https://github.com/huggingface/accelerate\n",
935
+ "Author: The HuggingFace team\n",
936
+ "Author-email: [email protected]\n",
937
+ "License: Apache\n",
938
+ "Location: c:\\Users\\andre\\1JUPYTER\\dt_labs\\.venv\\Lib\\site-packages\n",
939
+ "Requires: huggingface-hub, numpy, packaging, psutil, pyyaml, safetensors, torch\n",
940
+ "Required-by: \n",
941
+ "---\n",
942
+ "Name: transformers\n",
943
+ "Version: 4.41.2\n",
944
+ "Summary: State-of-the-art Machine Learning for JAX, PyTorch and TensorFlow\n",
945
+ "Home-page: https://github.com/huggingface/transformers\n",
946
+ "Author: The Hugging Face team (past and future) with the help of all our contributors (https://github.com/huggingface/transformers/graphs/contributors)\n",
947
+ "Author-email: [email protected]\n",
948
+ "License: Apache 2.0 License\n",
949
+ "Location: c:\\Users\\andre\\1JUPYTER\\dt_labs\\.venv\\Lib\\site-packages\n",
950
+ "Requires: filelock, huggingface-hub, numpy, packaging, pyyaml, regex, requests, safetensors, tokenizers, tqdm\n",
951
+ "Required-by: \n"
952
+ ]
953
+ }
954
+ ],
955
+ "source": [
956
+ "pip show accelerate transformers"
957
+ ]
958
+ },
959
+ {
960
+ "cell_type": "markdown",
961
+ "metadata": {},
962
+ "source": [
963
+ "## Training the model"
964
+ ]
965
+ },
966
+ {
967
+ "cell_type": "code",
968
+ "execution_count": 27,
969
+ "metadata": {
970
+ "colab": {
971
+ "base_uri": "https://localhost:8080/"
972
+ },
973
+ "id": "QlaLCwf7rLtp",
974
+ "outputId": "7e10e82a-8bc7-478b-851e-c7b628b46c41"
975
+ },
976
+ "outputs": [
977
+ {
978
+ "name": "stderr",
979
+ "output_type": "stream",
980
+ "text": [
981
+ "c:\\Users\\andre\\1JUPYTER\\dt_labs\\.venv\\Lib\\site-packages\\transformers\\training_args.py:1474: FutureWarning: `evaluation_strategy` is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use `eval_strategy` instead\n",
982
+ " warnings.warn(\n"
983
+ ]
984
+ }
985
+ ],
986
+ "source": [
987
+ "from transformers import TrainingArguments, Trainer\n",
988
+ "\n",
989
+ "EPOCHS = 1\n",
990
+ "\n",
991
+ "training_args = TrainingArguments(output_dir=\"./result\",\n",
992
+ " evaluation_strategy=\"epoch\",\n",
993
+ " num_train_epochs= EPOCHS,\n",
994
+ " per_device_train_batch_size=16,\n",
995
+ " per_device_eval_batch_size=8\n",
996
+ " )\n",
997
+ "\n",
998
+ "trainer = Trainer(\n",
999
+ " model=model,\n",
1000
+ " args=training_args,\n",
1001
+ " train_dataset=train_hg,\n",
1002
+ " eval_dataset=valid_hg,\n",
1003
+ " tokenizer=tokenizer,\n",
1004
+ " compute_metrics=compute_metrics\n",
1005
+ ")"
1006
+ ]
1007
+ },
1008
+ {
1009
+ "cell_type": "code",
1010
+ "execution_count": 28,
1011
+ "metadata": {},
1012
+ "outputs": [
1013
+ {
1014
+ "name": "stdout",
1015
+ "output_type": "stream",
1016
+ "text": [
1017
+ "CUDA available: True\n",
1018
+ "CUDA version: 12.1\n"
1019
+ ]
1020
+ }
1021
+ ],
1022
+ "source": [
1023
+ "print(\"CUDA available: \", torch.cuda.is_available())\n",
1024
+ "print(\"CUDA version: \", torch.version.cuda)"
1025
+ ]
1026
+ },
1027
+ {
1028
+ "cell_type": "code",
1029
+ "execution_count": 29,
1030
+ "metadata": {
1031
+ "colab": {
1032
+ "base_uri": "https://localhost:8080/",
1033
+ "height": 141
1034
+ },
1035
+ "id": "3s6lVFz_rLwO",
1036
+ "outputId": "ee64e8e9-9c8c-42a8-c355-f51410cc33df"
1037
+ },
1038
+ "outputs": [
1039
+ {
1040
+ "name": "stderr",
1041
+ "output_type": "stream",
1042
+ "text": [
1043
+ " 0%| | 0/2500 [00:00<?, ?it/s]c:\\Users\\andre\\1JUPYTER\\dt_labs\\.venv\\Lib\\site-packages\\transformers\\models\\bert\\modeling_bert.py:435: UserWarning: 1Torch was not compiled with flash attention. (Triggered internally at ..\\aten\\src\\ATen\\native\\transformers\\cuda\\sdp_utils.cpp:263.)\n",
1044
+ " attn_output = torch.nn.functional.scaled_dot_product_attention(\n",
1045
+ " 20%|██ | 500/2500 [05:35<22:22, 1.49it/s]"
1046
+ ]
1047
+ },
1048
+ {
1049
+ "name": "stdout",
1050
+ "output_type": "stream",
1051
+ "text": [
1052
+ "{'loss': 0.4994, 'grad_norm': 12.613661766052246, 'learning_rate': 4e-05, 'epoch': 0.2}\n"
1053
+ ]
1054
+ },
1055
+ {
1056
+ "name": "stderr",
1057
+ "output_type": "stream",
1058
+ "text": [
1059
+ " 40%|████ | 1000/2500 [11:13<16:46, 1.49it/s]"
1060
+ ]
1061
+ },
1062
+ {
1063
+ "name": "stdout",
1064
+ "output_type": "stream",
1065
+ "text": [
1066
+ "{'loss': 0.3898, 'grad_norm': 4.661791801452637, 'learning_rate': 3e-05, 'epoch': 0.4}\n"
1067
+ ]
1068
+ },
1069
+ {
1070
+ "name": "stderr",
1071
+ "output_type": "stream",
1072
+ "text": [
1073
+ " 60%|██████ | 1500/2500 [16:47<11:02, 1.51it/s]"
1074
+ ]
1075
+ },
1076
+ {
1077
+ "name": "stdout",
1078
+ "output_type": "stream",
1079
+ "text": [
1080
+ "{'loss': 0.3516, 'grad_norm': 1.5203113555908203, 'learning_rate': 2e-05, 'epoch': 0.6}\n"
1081
+ ]
1082
+ },
1083
+ {
1084
+ "name": "stderr",
1085
+ "output_type": "stream",
1086
+ "text": [
1087
+ " 80%|████████ | 2000/2500 [22:25<05:33, 1.50it/s]"
1088
+ ]
1089
+ },
1090
+ {
1091
+ "name": "stdout",
1092
+ "output_type": "stream",
1093
+ "text": [
1094
+ "{'loss': 0.3121, 'grad_norm': 8.331348419189453, 'learning_rate': 1e-05, 'epoch': 0.8}\n"
1095
+ ]
1096
+ },
1097
+ {
1098
+ "name": "stderr",
1099
+ "output_type": "stream",
1100
+ "text": [
1101
+ "100%|██████████| 2500/2500 [28:04<00:00, 1.50it/s]"
1102
+ ]
1103
+ },
1104
+ {
1105
+ "name": "stdout",
1106
+ "output_type": "stream",
1107
+ "text": [
1108
+ "{'loss': 0.2882, 'grad_norm': 6.287994861602783, 'learning_rate': 0.0, 'epoch': 1.0}\n"
1109
+ ]
1110
+ },
1111
+ {
1112
+ "name": "stderr",
1113
+ "output_type": "stream",
1114
+ "text": [
1115
+ " \n",
1116
+ "100%|██████████| 2500/2500 [30:45<00:00, 1.35it/s]"
1117
+ ]
1118
+ },
1119
+ {
1120
+ "name": "stdout",
1121
+ "output_type": "stream",
1122
+ "text": [
1123
+ "{'eval_loss': 0.283893883228302, 'eval_accuracy': 0.883, 'eval_f1': 0.8829425082505502, 'eval_runtime': 159.717, 'eval_samples_per_second': 62.611, 'eval_steps_per_second': 7.826, 'epoch': 1.0}\n",
1124
+ "{'train_runtime': 1845.2907, 'train_samples_per_second': 21.677, 'train_steps_per_second': 1.355, 'train_loss': 0.3682089477539062, 'epoch': 1.0}\n"
1125
+ ]
1126
+ },
1127
+ {
1128
+ "name": "stderr",
1129
+ "output_type": "stream",
1130
+ "text": [
1131
+ "\n"
1132
+ ]
1133
+ },
1134
+ {
1135
+ "data": {
1136
+ "text/plain": [
1137
+ "TrainOutput(global_step=2500, training_loss=0.3682089477539062, metrics={'train_runtime': 1845.2907, 'train_samples_per_second': 21.677, 'train_steps_per_second': 1.355, 'total_flos': 4111110240000000.0, 'train_loss': 0.3682089477539062, 'epoch': 1.0})"
1138
+ ]
1139
+ },
1140
+ "execution_count": 29,
1141
+ "metadata": {},
1142
+ "output_type": "execute_result"
1143
+ }
1144
+ ],
1145
+ "source": [
1146
+ "trainer.train()"
1147
+ ]
1148
+ },
1149
+ {
1150
+ "cell_type": "markdown",
1151
+ "metadata": {},
1152
+ "source": [
1153
+ "## Saving the model"
1154
+ ]
1155
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 38,
+ "metadata": {
+ "id": "8eO6WDiOBAhg"
+ },
+ "outputs": [],
+ "source": [
+ "# Save only the trained weights (state_dict) to model.pth\n",
+ "torch.save(model.state_dict(), 'model.pth')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "FtVZztSa40b3"
+ },
+ "source": [
+ "## Testing individual predictions"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 34,
+ "metadata": {
+ "id": "lOHVSyfJJ8zK"
+ },
+ "outputs": [],
+ "source": [
+ "from transformers import AutoTokenizer\n",
+ "\n",
+ "new_tokenizer = AutoTokenizer.from_pretrained(pre_trained_base)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 35,
+ "metadata": {
+ "id": "t-T7hDZ2J1Qk"
+ },
+ "outputs": [],
+ "source": [
+ "def get_prediction(text):\n",
+ "    # Tokenize a single review and move the tensors to the model's device.\n",
+ "    encoding = new_tokenizer(text, return_tensors=\"pt\", padding=\"max_length\", truncation=True, max_length=MAX_LEN)\n",
+ "    encoding = {k: v.to(trainer.model.device) for k, v in encoding.items()}\n",
+ "\n",
+ "    # Inference only: switch off dropout and gradient tracking.\n",
+ "    model.eval()\n",
+ "    with torch.no_grad():\n",
+ "        logits = model(**encoding).logits\n",
+ "\n",
+ "    # Single-label classification: softmax over the classes, then pick the most likely one.\n",
+ "    probs = torch.softmax(logits.squeeze(0), dim=-1).cpu().numpy()\n",
+ "    label = np.argmax(probs, axis=-1)\n",
+ "\n",
+ "    return label"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 36,
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "y4dxQ4oYJ5C1",
+ "outputId": "d0d77c2d-aff6-412b-e22a-0b721f5b097e"
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "0"
+ ]
+ },
+ "execution_count": 36,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "get_prediction(\"This movie is horrible!\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 37,
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "JXAyOu_6AqoO",
+ "outputId": "ffcd019e-4c0c-45eb-f538-d2860c53a0e0"
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "1"
+ ]
+ },
+ "execution_count": 37,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "get_prediction(\"This movie is awesome!\")"
+ ]
+ }
+ ],
+ "metadata": {
+ "accelerator": "GPU",
+ "colab": {
+ "provenance": []
+ },
+ "gpuClass": "standard",
+ "kernelspec": {
+ "display_name": "Python 3",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.10.11"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 0
+ }
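A note on the `get_prediction` cell in the notebook above: for a two-class head, the predicted label is just the argmax of the logits. A per-logit sigmoid or a softmax across the logits are both order-preserving, so they leave that argmax unchanged, but only softmax yields probabilities that sum to 1. The sketch below illustrates this in plain Python. The logit values are made up for illustration and are not outputs of the notebook's model. The class order (0 = negative, 1 = positive) is inferred from the two sample predictions.

```python
import math


def sigmoid(x):
    """Logistic function applied to a single logit."""
    return 1.0 / (1.0 + math.exp(-x))


def softmax(logits):
    """Softmax over a list of logits, shifted by the max for numerical stability."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    """Index of the largest element."""
    return max(range(len(values)), key=lambda i: values[i])


# Hypothetical logits for (negative, positive); not taken from the notebook.
logits = [-1.3, 2.1]

sig = [sigmoid(v) for v in logits]   # independent per-class scores
soft = softmax(logits)               # a proper distribution over the two classes

label_from_logits = argmax(logits)
label_from_sigmoid = argmax(sig)
label_from_softmax = argmax(soft)

print(label_from_logits, label_from_sigmoid, label_from_softmax)
print(sum(sig), sum(soft))
```

All three labels agree, so the choice between sigmoid and softmax only matters when the scores themselves are reported as class probabilities.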
notebooks_explicativos/Simbolico.ipynb ADDED