AndreMitri committed on
Commit
c1f801a
•
1 Parent(s): 1ba6bc3

Explanatory notebooks, images, and data

Adds
- Explanatory notebooks
- Images
- Data

.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
 
 
36
+ data/imdb_reviews.csv filter=lfs diff=lfs merge=lfs -text
data/imdb_reviews.csv ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:6f1314f123ac922d7d0f2bd5bd17f1734e167d90b2256c34963228bc63f6a4cb
3
+ size 66262310
imagens/BERT_TDIDF.png ADDED
imagens/Simbolico_WordCloud_Wordnet.png ADDED
notebooks_explicativos/Estatistico.ipynb ADDED
The diff for this file is too large to render. See raw diff
 
notebooks_explicativos/Neural_Bert.ipynb ADDED
@@ -0,0 +1,1291 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "markdown",
5
+ "metadata": {},
6
+ "source": [
7
+ "# SCC0633/SCC5908 - Natural Language Processing\n",
8
+ "> **Instructor:** Thiago Alexandre Salgueiro Pardo \\\\\n",
9
+ "> **PAE teaching intern:** Germano Antonio Zani Jorge\n",
10
+ "\n",
11
+ "\n",
12
+ "# Group members: GPTrouxas\n",
13
+ "> André Guarnier De Mitri - 11395579 \\\\\n",
14
+ "> Daniel Carvalho - 10685702 \\\\\n",
15
+ "> Fernando - 11795342 \\\\\n",
16
+ "> Lucas Henrique Sant'Anna - 10748521 \\\\\n",
17
+ "> Magaly L Fujimoto - 4890582 \\\\\n"
18
+ ]
19
+ },
20
+ {
21
+ "cell_type": "markdown",
22
+ "metadata": {},
23
+ "source": [
24
+ "# Neural approach using BERT\n",
25
+ "![alt text](../imagens/BERT_TDIDF.png)"
26
+ ]
27
+ },
28
+ {
29
+ "cell_type": "markdown",
30
+ "metadata": {},
31
+ "source": [
32
+ "###"
33
+ ]
34
+ },
35
+ {
36
+ "cell_type": "markdown",
37
+ "metadata": {
38
+ "id": "6yecpJR0feeQ"
39
+ },
40
+ "source": [
41
+ "## Importing libraries"
42
+ ]
43
+ },
44
+ {
45
+ "cell_type": "code",
46
+ "execution_count": 1,
47
+ "metadata": {
48
+ "id": "FAIvyZwodEtm"
49
+ },
50
+ "outputs": [],
51
+ "source": [
52
+ "import torch\n",
53
+ "import numpy as np\n",
54
+ "import matplotlib.pyplot as plt\n",
55
+ "import math\n",
56
+ "from tqdm.notebook import tqdm\n",
57
+ "import pandas as pd"
58
+ ]
59
+ },
60
+ {
61
+ "cell_type": "code",
62
+ "execution_count": 3,
63
+ "metadata": {},
64
+ "outputs": [],
65
+ "source": [
66
+ "#!pip install transformers seaborn nltk"
67
+ ]
68
+ },
69
+ {
70
+ "cell_type": "markdown",
71
+ "metadata": {},
72
+ "source": [
73
+ "## Loading the data"
74
+ ]
75
+ },
76
+ {
77
+ "cell_type": "code",
78
+ "execution_count": 3,
79
+ "metadata": {
80
+ "colab": {
81
+ "base_uri": "https://localhost:8080/",
82
+ "height": 206
83
+ },
84
+ "id": "LYgXl3RIfgfo",
85
+ "outputId": "eb496faf-7826-44f7-fa88-3b21fb6e7cbf"
86
+ },
87
+ "outputs": [
88
+ {
89
+ "data": {
90
+ "text/html": [
91
+ "<div>\n",
92
+ "<style scoped>\n",
93
+ " .dataframe tbody tr th:only-of-type {\n",
94
+ " vertical-align: middle;\n",
95
+ " }\n",
96
+ "\n",
97
+ " .dataframe tbody tr th {\n",
98
+ " vertical-align: top;\n",
99
+ " }\n",
100
+ "\n",
101
+ " .dataframe thead th {\n",
102
+ " text-align: right;\n",
103
+ " }\n",
104
+ "</style>\n",
105
+ "<table border=\"1\" class=\"dataframe\">\n",
106
+ " <thead>\n",
107
+ " <tr style=\"text-align: right;\">\n",
108
+ " <th></th>\n",
109
+ " <th>review</th>\n",
110
+ " <th>sentiment</th>\n",
111
+ " </tr>\n",
112
+ " </thead>\n",
113
+ " <tbody>\n",
114
+ " <tr>\n",
115
+ " <th>0</th>\n",
116
+ " <td>One of the other reviewers has mentioned that ...</td>\n",
117
+ " <td>positive</td>\n",
118
+ " </tr>\n",
119
+ " <tr>\n",
120
+ " <th>1</th>\n",
121
+ " <td>A wonderful little production. &lt;br /&gt;&lt;br /&gt;The...</td>\n",
122
+ " <td>positive</td>\n",
123
+ " </tr>\n",
124
+ " <tr>\n",
125
+ " <th>2</th>\n",
126
+ " <td>I thought this was a wonderful way to spend ti...</td>\n",
127
+ " <td>positive</td>\n",
128
+ " </tr>\n",
129
+ " <tr>\n",
130
+ " <th>3</th>\n",
131
+ " <td>Basically there's a family where a little boy ...</td>\n",
132
+ " <td>negative</td>\n",
133
+ " </tr>\n",
134
+ " <tr>\n",
135
+ " <th>4</th>\n",
136
+ " <td>Petter Mattei's \"Love in the Time of Money\" is...</td>\n",
137
+ " <td>positive</td>\n",
138
+ " </tr>\n",
139
+ " </tbody>\n",
140
+ "</table>\n",
141
+ "</div>"
142
+ ],
143
+ "text/plain": [
144
+ " review sentiment\n",
145
+ "0 One of the other reviewers has mentioned that ... positive\n",
146
+ "1 A wonderful little production. <br /><br />The... positive\n",
147
+ "2 I thought this was a wonderful way to spend ti... positive\n",
148
+ "3 Basically there's a family where a little boy ... negative\n",
149
+ "4 Petter Mattei's \"Love in the Time of Money\" is... positive"
150
+ ]
151
+ },
152
+ "execution_count": 3,
153
+ "metadata": {},
154
+ "output_type": "execute_result"
155
+ }
156
+ ],
157
+ "source": [
158
+ "df_reviews = pd.read_csv('imdb_reviews.csv')\n",
159
+ "df_reviews.head()"
160
+ ]
161
+ },
162
+ {
163
+ "cell_type": "markdown",
164
+ "metadata": {},
165
+ "source": [
166
+ "## Mapping the classes\n",
167
+ "- Positive sentiment is assigned label 1\n",
168
+ "- Negative sentiment is assigned label 0"
169
+ ]
170
+ },
171
+ {
172
+ "cell_type": "code",
173
+ "execution_count": 4,
174
+ "metadata": {
175
+ "colab": {
176
+ "base_uri": "https://localhost:8080/",
177
+ "height": 206
178
+ },
179
+ "id": "D-5n8XzJbWOO",
180
+ "outputId": "cef630cc-b0cc-4598-c53f-d32636bfcd86"
181
+ },
182
+ "outputs": [
183
+ {
184
+ "data": {
185
+ "text/html": [
186
+ "<div>\n",
187
+ "<style scoped>\n",
188
+ " .dataframe tbody tr th:only-of-type {\n",
189
+ " vertical-align: middle;\n",
190
+ " }\n",
191
+ "\n",
192
+ " .dataframe tbody tr th {\n",
193
+ " vertical-align: top;\n",
194
+ " }\n",
195
+ "\n",
196
+ " .dataframe thead th {\n",
197
+ " text-align: right;\n",
198
+ " }\n",
199
+ "</style>\n",
200
+ "<table border=\"1\" class=\"dataframe\">\n",
201
+ " <thead>\n",
202
+ " <tr style=\"text-align: right;\">\n",
203
+ " <th></th>\n",
204
+ " <th>review</th>\n",
205
+ " <th>sentiment</th>\n",
206
+ " </tr>\n",
207
+ " </thead>\n",
208
+ " <tbody>\n",
209
+ " <tr>\n",
210
+ " <th>0</th>\n",
211
+ " <td>One of the other reviewers has mentioned that ...</td>\n",
212
+ " <td>1</td>\n",
213
+ " </tr>\n",
214
+ " <tr>\n",
215
+ " <th>1</th>\n",
216
+ " <td>A wonderful little production. &lt;br /&gt;&lt;br /&gt;The...</td>\n",
217
+ " <td>1</td>\n",
218
+ " </tr>\n",
219
+ " <tr>\n",
220
+ " <th>2</th>\n",
221
+ " <td>I thought this was a wonderful way to spend ti...</td>\n",
222
+ " <td>1</td>\n",
223
+ " </tr>\n",
224
+ " <tr>\n",
225
+ " <th>3</th>\n",
226
+ " <td>Basically there's a family where a little boy ...</td>\n",
227
+ " <td>0</td>\n",
228
+ " </tr>\n",
229
+ " <tr>\n",
230
+ " <th>4</th>\n",
231
+ " <td>Petter Mattei's \"Love in the Time of Money\" is...</td>\n",
232
+ " <td>1</td>\n",
233
+ " </tr>\n",
234
+ " </tbody>\n",
235
+ "</table>\n",
236
+ "</div>"
237
+ ],
238
+ "text/plain": [
239
+ " review sentiment\n",
240
+ "0 One of the other reviewers has mentioned that ... 1\n",
241
+ "1 A wonderful little production. <br /><br />The... 1\n",
242
+ "2 I thought this was a wonderful way to spend ti... 1\n",
243
+ "3 Basically there's a family where a little boy ... 0\n",
244
+ "4 Petter Mattei's \"Love in the Time of Money\" is... 1"
245
+ ]
246
+ },
247
+ "execution_count": 4,
248
+ "metadata": {},
249
+ "output_type": "execute_result"
250
+ }
251
+ ],
252
+ "source": [
253
+ "def map_sentiments(sentiment):\n",
254
+ "    if sentiment == 'positive':\n",
255
+ "        return 1\n",
256
+ "    return 0\n",
257
+ "\n",
258
+ "df_reviews['sentiment'] = df_reviews['sentiment'].apply(map_sentiments)\n",
259
+ "df_reviews.head()"
260
+ ]
261
+ },
262
+ {
263
+ "cell_type": "markdown",
264
+ "metadata": {},
265
+ "source": [
266
+ "# Text-cleaning functions\n",
267
+ "**lowercase_text(text)** Converts the text to lowercase so that casing is uniform.\n",
268
+ "\n",
269
+ "\n",
270
+ "**remove_html(text)** Removes any HTML tags from the text, cleaning data that came from HTML sources.\n",
271
+ "\n",
272
+ "\n",
273
+ "**remove_url(text)** Removes URLs from the text to drop links that may not be relevant to the text analysis.\n",
274
+ "\n",
275
+ "\n",
276
+ "**remove_punctuations(text)** Removes punctuation from the text (each mark is replaced by a space) to simplify its structure, keeping only words.\n",
277
+ "\n",
278
+ "**remove_emojis(text)** Removes emojis from the text to avoid non-verbal characters that can interfere with the text analysis.\n",
279
+ "\n",
280
+ "**remove_stop_words(text)** Removes stop words (common words such as \"and\", \"of\", \"the\") that usually add little value to text analysis.\n",
281
+ "\n",
282
+ "**stem_words(text)** Applies stemming to the words in the text, reducing each to its root (for example, \"running\" becomes \"run\") to normalize word variations.\n",
283
+ "\n",
284
+ "**preprocess_text(text)** Applies all of the functions above in sequence to fully preprocess the text, making it better suited to text analysis or modeling.\n",
285
+ "\n",
286
+ "\n",
287
+ "\n"
288
+ ]
289
+ },
290
+ {
291
+ "cell_type": "code",
292
+ "execution_count": 5,
293
+ "metadata": {
294
+ "colab": {
295
+ "base_uri": "https://localhost:8080/",
296
+ "height": 241
297
+ },
298
+ "id": "PnFHO62rnWn-",
299
+ "outputId": "17fb6619-fab9-4395-de5d-4c5199e7e45e"
300
+ },
301
+ "outputs": [
302
+ {
303
+ "name": "stderr",
304
+ "output_type": "stream",
305
+ "text": [
306
+ "[nltk_data] Downloading package stopwords to\n",
307
+ "[nltk_data] C:\\Users\\andre\\AppData\\Roaming\\nltk_data...\n",
308
+ "[nltk_data] Package stopwords is already up-to-date!\n"
309
+ ]
310
+ },
311
+ {
312
+ "data": {
313
+ "text/html": [
314
+ "<div>\n",
315
+ "<style scoped>\n",
316
+ " .dataframe tbody tr th:only-of-type {\n",
317
+ " vertical-align: middle;\n",
318
+ " }\n",
319
+ "\n",
320
+ " .dataframe tbody tr th {\n",
321
+ " vertical-align: top;\n",
322
+ " }\n",
323
+ "\n",
324
+ " .dataframe thead th {\n",
325
+ " text-align: right;\n",
326
+ " }\n",
327
+ "</style>\n",
328
+ "<table border=\"1\" class=\"dataframe\">\n",
329
+ " <thead>\n",
330
+ " <tr style=\"text-align: right;\">\n",
331
+ " <th></th>\n",
332
+ " <th>review</th>\n",
333
+ " <th>sentiment</th>\n",
334
+ " </tr>\n",
335
+ " </thead>\n",
336
+ " <tbody>\n",
337
+ " <tr>\n",
338
+ " <th>0</th>\n",
339
+ " <td>one review mention watch 1 oz episod hook righ...</td>\n",
340
+ " <td>1</td>\n",
341
+ " </tr>\n",
342
+ " <tr>\n",
343
+ " <th>1</th>\n",
344
+ " <td>wonder littl product film techniqu unassum old...</td>\n",
345
+ " <td>1</td>\n",
346
+ " </tr>\n",
347
+ " <tr>\n",
348
+ " <th>2</th>\n",
349
+ " <td>thought wonder way spend time hot summer weeke...</td>\n",
350
+ " <td>1</td>\n",
351
+ " </tr>\n",
352
+ " <tr>\n",
353
+ " <th>3</th>\n",
354
+ " <td>basic famili littl boy jake think zombi closet...</td>\n",
355
+ " <td>0</td>\n",
356
+ " </tr>\n",
357
+ " <tr>\n",
358
+ " <th>4</th>\n",
359
+ " <td>petter mattei love time money visual stun film...</td>\n",
360
+ " <td>1</td>\n",
361
+ " </tr>\n",
362
+ " </tbody>\n",
363
+ "</table>\n",
364
+ "</div>"
365
+ ],
366
+ "text/plain": [
367
+ " review sentiment\n",
368
+ "0 one review mention watch 1 oz episod hook righ... 1\n",
369
+ "1 wonder littl product film techniqu unassum old... 1\n",
370
+ "2 thought wonder way spend time hot summer weeke... 1\n",
371
+ "3 basic famili littl boy jake think zombi closet... 0\n",
372
+ "4 petter mattei love time money visual stun film... 1"
373
+ ]
374
+ },
375
+ "execution_count": 5,
376
+ "metadata": {},
377
+ "output_type": "execute_result"
378
+ }
379
+ ],
380
+ "source": [
381
+ "import re\n",
382
+ "import nltk\n",
383
+ "from nltk.corpus import stopwords\n",
384
+ "from nltk.stem import PorterStemmer\n",
385
+ "\n",
386
+ "\n",
387
+ "def lowercase_text(text):\n",
388
+ "    return text.lower()\n",
389
+ "\n",
390
+ "def remove_html(text):\n",
391
+ "    return re.sub(r'<[^<]+?>', '', text)\n",
392
+ "\n",
393
+ "def remove_url(text):\n",
394
+ "    return re.sub(r'http[s]?://\\S+|www\\.\\S+', '', text)\n",
395
+ "\n",
396
+ "def remove_punctuations(text):\n",
397
+ "    tokens_list = '!\"#$%&\\'()*+,-./:;<=>?@[\\\\]^_`{|}~'\n",
398
+ "    for char in text:\n",
399
+ "        if char in tokens_list:\n",
400
+ "            text = text.replace(char, ' ')\n",
401
+ "\n",
402
+ "    return text\n",
403
+ "\n",
404
+ "def remove_emojis(text):\n",
405
+ "    emojis = re.compile(\"[\"\n",
406
+ "        u\"\\U0001F600-\\U0001F64F\"\n",
407
+ "        u\"\\U0001F300-\\U0001F5FF\"\n",
408
+ "        u\"\\U0001F680-\\U0001F6FF\"\n",
409
+ "        u\"\\U0001F1E0-\\U0001F1FF\"\n",
410
+ "        u\"\\U00002500-\\U00002BEF\"\n",
411
+ "        u\"\\U00002702-\\U000027B0\"\n",
412
+ "        u\"\\U00002702-\\U000027B0\"\n",
413
+ "        u\"\\U000024C2-\\U0001F251\"\n",
414
+ "        u\"\\U0001f926-\\U0001f937\"\n",
415
+ "        u\"\\U00010000-\\U0010ffff\"\n",
416
+ "        u\"\\u2640-\\u2642\"\n",
417
+ "        u\"\\u2600-\\u2B55\"\n",
418
+ "        u\"\\u200d\"\n",
419
+ "        u\"\\u23cf\"\n",
420
+ "        u\"\\u23e9\"\n",
421
+ "        u\"\\u231a\"\n",
422
+ "        u\"\\ufe0f\"\n",
423
+ "        u\"\\u3030\"\n",
424
+ "        \"]+\", re.UNICODE)\n",
425
+ "\n",
426
+ "    text = re.sub(emojis, '', text)\n",
427
+ "    return text\n",
428
+ "\n",
429
+ "def remove_stop_words(text):\n",
430
+ "    stop_words = stopwords.words('english')\n",
431
+ "    new_text = ''\n",
432
+ "    for word in text.split():\n",
433
+ "        if word not in stop_words:\n",
434
+ "            new_text += ''.join(f'{word} ')\n",
435
+ "\n",
436
+ "    return new_text.strip()\n",
437
+ "\n",
438
+ "def stem_words(text):\n",
439
+ "    stemmer = PorterStemmer()\n",
440
+ "    new_text = ''\n",
441
+ "    for word in text.split():\n",
442
+ "        new_text += ''.join(f'{stemmer.stem(word)} ')\n",
443
+ "\n",
444
+ "    return new_text\n",
445
+ "\n",
446
+ "def preprocess_text(text):\n",
447
+ "    text = lowercase_text(text)\n",
448
+ "    text = remove_html(text)\n",
449
+ "    text = remove_url(text)\n",
450
+ "    text = remove_punctuations(text)\n",
451
+ "    text = remove_emojis(text)\n",
452
+ "    text = remove_stop_words(text)\n",
453
+ "    text = stem_words(text)\n",
454
+ "\n",
455
+ "    return text\n",
456
+ "\n",
457
+ "nltk.download('stopwords')\n",
458
+ "df_reviews['review'] = df_reviews['review'].apply(preprocess_text)\n",
459
+ "df_reviews.head()"
460
+ ]
461
+ },
462
+ {
463
+ "cell_type": "markdown",
464
+ "metadata": {},
465
+ "source": [
466
+ "### Visualizing the class balance"
467
+ ]
468
+ },
469
+ {
470
+ "cell_type": "code",
471
+ "execution_count": 6,
472
+ "metadata": {
473
+ "colab": {
474
+ "base_uri": "https://localhost:8080/",
475
+ "height": 452
476
+ },
477
+ "id": "Gdi_L0HWfntv",
478
+ "outputId": "bce77594-f662-4b3f-c8eb-27d8a188b4f2"
479
+ },
480
+ "outputs": [
481
+ {
482
+ "data": {
483
+ "image/png": "",
484
+ "text/plain": [
485
+ "<Figure size 640x480 with 1 Axes>"
486
+ ]
487
+ },
488
+ "metadata": {},
489
+ "output_type": "display_data"
490
+ }
491
+ ],
492
+ "source": [
493
+ "plt.title('Target value distribution')\n",
494
+ "plt.hist(df_reviews['sentiment'])\n",
495
+ "plt.show()"
496
+ ]
497
+ },
498
+ {
499
+ "cell_type": "markdown",
500
+ "metadata": {},
501
+ "source": [
502
+ "# BERT model"
503
+ ]
504
+ },
505
+ {
506
+ "cell_type": "markdown",
507
+ "metadata": {
508
+ "id": "EDkjlPDakskM"
509
+ },
510
+ "source": [
511
+ "## Installing libraries"
512
+ ]
513
+ },
514
+ {
515
+ "cell_type": "code",
516
+ "execution_count": 4,
517
+ "metadata": {
518
+ "colab": {
519
+ "base_uri": "https://localhost:8080/"
520
+ },
521
+ "id": "lk7m_1xvmWvz",
522
+ "outputId": "ce842053-b261-4768-d9d7-fe9c65c9f6aa"
523
+ },
524
+ "outputs": [],
525
+ "source": [
526
+ "#pip install transformers\n",
527
+ "#pip install accelerate -U\n",
528
+ "#pip install transformers[torch]\n",
529
+ "#pip install datasets evaluate"
530
+ ]
531
+ },
532
+ {
533
+ "cell_type": "markdown",
534
+ "metadata": {},
535
+ "source": [
536
+ "## Loading the pretrained model and tokenizer"
537
+ ]
538
+ },
539
+ {
540
+ "cell_type": "code",
541
+ "execution_count": 10,
542
+ "metadata": {
543
+ "colab": {
544
+ "base_uri": "https://localhost:8080/"
545
+ },
546
+ "id": "GlyrkK52zMcc",
547
+ "outputId": "a938653b-92c3-4b4e-802c-eacc3f1b6ecf"
548
+ },
549
+ "outputs": [
550
+ {
551
+ "name": "stderr",
552
+ "output_type": "stream",
553
+ "text": [
554
+ "c:\\Users\\andre\\1JUPYTER\\dt_labs\\.venv\\Lib\\site-packages\\tqdm\\auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
555
+ " from .autonotebook import tqdm as notebook_tqdm\n"
556
+ ]
557
+ }
558
+ ],
559
+ "source": [
560
+ "from transformers import AutoTokenizer\n",
561
+ "from transformers import BertForSequenceClassification\n",
562
+ "\n",
563
+ "pre_trained_base = \"bert-base-uncased\"\n",
564
+ "tokenizer = AutoTokenizer.from_pretrained(pre_trained_base)\n",
565
+ "model = BertForSequenceClassification.from_pretrained(pre_trained_base, num_labels = 2, output_attentions=False, output_hidden_states=False)"
566
+ ]
567
+ },
568
+ {
569
+ "cell_type": "markdown",
570
+ "metadata": {},
571
+ "source": [
572
+ "### Tokenizing the sentences and computing token lengths"
573
+ ]
574
+ },
575
+ {
576
+ "cell_type": "code",
577
+ "execution_count": 13,
578
+ "metadata": {
579
+ "id": "LKEjDZCHpk4e"
580
+ },
581
+ "outputs": [],
582
+ "source": [
583
+ "token_lens = []\n",
584
+ "\n",
585
+ "for sentence in df_reviews['review']:\n",
586
+ "    tokens = tokenizer.encode(sentence, max_length=200, truncation=True)\n",
587
+ "    token_lens.append(len(tokens))"
588
+ ]
589
+ },
590
+ {
591
+ "cell_type": "markdown",
592
+ "metadata": {},
593
+ "source": [
594
+ "### Splitting the data into training and validation sets"
595
+ ]
596
+ },
597
+ {
598
+ "cell_type": "code",
599
+ "execution_count": 15,
600
+ "metadata": {
601
+ "id": "H7PfXaVVp2uQ"
602
+ },
603
+ "outputs": [],
604
+ "source": [
605
+ "SEED=42\n",
606
+ "MAX_LEN = 200\n",
607
+ "from sklearn.model_selection import train_test_split\n",
608
+ "df_train, df_val = train_test_split(df_reviews, test_size=0.2, random_state=SEED)"
609
+ ]
610
+ },
611
+ {
612
+ "cell_type": "markdown",
613
+ "metadata": {},
614
+ "source": [
615
+ "### Processing the data\n",
616
+ "The `process_data` function takes one row of a dataframe containing a review text and its sentiment label. It first extracts the review text and cleans it by collapsing any extra whitespace. It then tokenizes the text with the BERT tokenizer, applying padding and truncation so that every sequence has the fixed length set by `MAX_LEN`. Finally, it adds the original sentiment label and the cleaned text to the generated encodings and returns a dictionary containing the text's tokens, the sentiment label, and the cleaned text."
617
+ ]
618
+ },
619
+ {
620
+ "cell_type": "code",
621
+ "execution_count": 16,
622
+ "metadata": {
623
+ "id": "v7EZ6wd-qDfd"
624
+ },
625
+ "outputs": [],
626
+ "source": [
627
+ "def process_data(row):\n",
628
+ "\n",
629
+ "    text = row['review']\n",
630
+ "    text = str(text)\n",
631
+ "    text = ' '.join(text.split())\n",
632
+ "\n",
633
+ "    encodings = tokenizer(text, padding=\"max_length\", truncation=True, max_length=MAX_LEN)\n",
634
+ "\n",
635
+ "    encodings['label'] = row['sentiment']\n",
636
+ "    encodings['text'] = text\n",
637
+ "\n",
638
+ "    return encodings"
639
+ ]
640
+ },
641
+ {
642
+ "cell_type": "code",
643
+ "execution_count": 17,
644
+ "metadata": {
645
+ "id": "d9VgrXNSqIYL"
646
+ },
647
+ "outputs": [],
648
+ "source": [
649
+ "# Training\n",
650
+ "processed_data_tr = []\n",
651
+ "for i in range(df_train.shape[0]):\n",
652
+ "    processed_data_tr.append(process_data(df_train.iloc[i]))"
653
+ ]
654
+ },
655
+ {
656
+ "cell_type": "code",
657
+ "execution_count": 18,
658
+ "metadata": {
659
+ "id": "p0NLQxoKqJ_k"
660
+ },
661
+ "outputs": [],
662
+ "source": [
663
+ "# Validation\n",
664
+ "processed_data_val = []\n",
665
+ "for i in range(df_val.shape[0]):\n",
666
+ "    processed_data_val.append(process_data(df_val.iloc[i]))"
667
+ ]
668
+ },
669
+ {
670
+ "cell_type": "code",
671
+ "execution_count": 19,
672
+ "metadata": {
673
+ "id": "ac76Rb6fqP_G"
674
+ },
675
+ "outputs": [],
676
+ "source": [
677
+ "# Training and validation DataFrames\n",
678
+ "df_train = pd.DataFrame(processed_data_tr)\n",
679
+ "df_val = pd.DataFrame(processed_data_val)"
680
+ ]
681
+ },
682
+ {
683
+ "cell_type": "code",
684
+ "execution_count": 20,
685
+ "metadata": {
686
+ "colab": {
687
+ "base_uri": "https://localhost:8080/",
688
+ "height": 206
689
+ },
690
+ "id": "RdbHaVy_fd64",
691
+ "outputId": "a9aed834-81b7-4223-da42-6289799c2e1e"
692
+ },
693
+ "outputs": [
694
+ {
695
+ "data": {
696
+ "text/html": [
697
+ "<div>\n",
698
+ "<style scoped>\n",
699
+ " .dataframe tbody tr th:only-of-type {\n",
700
+ " vertical-align: middle;\n",
701
+ " }\n",
702
+ "\n",
703
+ " .dataframe tbody tr th {\n",
704
+ " vertical-align: top;\n",
705
+ " }\n",
706
+ "\n",
707
+ " .dataframe thead th {\n",
708
+ " text-align: right;\n",
709
+ " }\n",
710
+ "</style>\n",
711
+ "<table border=\"1\" class=\"dataframe\">\n",
712
+ " <thead>\n",
713
+ " <tr style=\"text-align: right;\">\n",
714
+ " <th></th>\n",
715
+ " <th>attention_mask</th>\n",
716
+ " <th>input_ids</th>\n",
717
+ " <th>label</th>\n",
718
+ " <th>text</th>\n",
719
+ " <th>token_type_ids</th>\n",
720
+ " </tr>\n",
721
+ " </thead>\n",
722
+ " <tbody>\n",
723
+ " <tr>\n",
724
+ " <th>0</th>\n",
725
+ " <td>[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...</td>\n",
726
+ " <td>[101, 2921, 3198, 23624, 2954, 6978, 2674, 841...</td>\n",
727
+ " <td>0</td>\n",
728
+ " <td>kept ask mani fight scream match swear gener m...</td>\n",
729
+ " <td>[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...</td>\n",
730
+ " </tr>\n",
731
+ " <tr>\n",
732
+ " <th>1</th>\n",
733
+ " <td>[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...</td>\n",
734
+ " <td>[101, 3422, 4372, 3775, 2099, 9587, 5737, 2071...</td>\n",
735
+ " <td>0</td>\n",
736
+ " <td>watch entir movi could watch entir movi stop d...</td>\n",
737
+ " <td>[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...</td>\n",
738
+ " </tr>\n",
739
+ " <tr>\n",
740
+ " <th>2</th>\n",
741
+ " <td>[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...</td>\n",
742
+ " <td>[101, 3543, 2293, 2358, 10050, 2128, 25300, 11...</td>\n",
743
+ " <td>1</td>\n",
744
+ " <td>touch love stori reminisc ¡in mood love draw h...</td>\n",
745
+ " <td>[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...</td>\n",
746
+ " </tr>\n",
747
+ " <tr>\n",
748
+ " <th>3</th>\n",
749
+ " <td>[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...</td>\n",
750
+ " <td>[101, 3732, 2154, 11865, 15472, 2072, 8040, 73...</td>\n",
751
+ " <td>0</td>\n",
752
+ " <td>latter day fulci schlocker total abysm concoct...</td>\n",
753
+ " <td>[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...</td>\n",
754
+ " </tr>\n",
755
+ " <tr>\n",
756
+ " <th>4</th>\n",
757
+ " <td>[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...</td>\n",
758
+ " <td>[101, 2034, 3813, 3669, 19337, 2666, 2615, 504...</td>\n",
759
+ " <td>0</td>\n",
760
+ " <td>first firmli believ norwegian movi continu get...</td>\n",
761
+ " <td>[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...</td>\n",
762
+ " </tr>\n",
763
+ " </tbody>\n",
764
+ "</table>\n",
765
+ "</div>"
766
+ ],
767
+ "text/plain": [
768
+ " attention_mask \\\n",
769
+ "0 [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... \n",
770
+ "1 [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... \n",
771
+ "2 [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... \n",
772
+ "3 [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... \n",
773
+ "4 [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... \n",
774
+ "\n",
775
+ " input_ids label \\\n",
776
+ "0 [101, 2921, 3198, 23624, 2954, 6978, 2674, 841... 0 \n",
777
+ "1 [101, 3422, 4372, 3775, 2099, 9587, 5737, 2071... 0 \n",
778
+ "2 [101, 3543, 2293, 2358, 10050, 2128, 25300, 11... 1 \n",
779
+ "3 [101, 3732, 2154, 11865, 15472, 2072, 8040, 73... 0 \n",
780
+ "4 [101, 2034, 3813, 3669, 19337, 2666, 2615, 504... 0 \n",
781
+ "\n",
782
+ " text \\\n",
783
+ "0 kept ask mani fight scream match swear gener m... \n",
784
+ "1 watch entir movi could watch entir movi stop d... \n",
785
+ "2 touch love stori reminisc ¡in mood love draw h... \n",
786
+ "3 latter day fulci schlocker total abysm concoct... \n",
787
+ "4 first firmli believ norwegian movi continu get... \n",
788
+ "\n",
789
+ " token_type_ids \n",
790
+ "0 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... \n",
791
+ "1 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... \n",
792
+ "2 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... \n",
793
+ "3 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... \n",
794
+ "4 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... "
795
+ ]
796
+ },
797
+ "execution_count": 20,
798
+ "metadata": {},
799
+ "output_type": "execute_result"
800
+ }
801
+ ],
802
+ "source": [
803
+ "df_train.head()"
804
+ ]
805
+ },
806
+ {
807
+ "cell_type": "markdown",
808
+ "metadata": {
809
+ "id": "0lTWT8JwkRic"
810
+ },
811
+ "source": [
812
+ "## Fine-tuning the model\n",
813
+ "Fine-tuning BERT for the specific task of sentiment classification on the IMDB dataset"
814
+ ]
815
+ },
816
+ {
817
+ "cell_type": "code",
818
+ "execution_count": null,
819
+ "metadata": {},
820
+ "outputs": [],
821
+ "source": [
822
+ "import torch\n",
823
+ "import pyarrow as pa\n",
824
+ "from datasets import Dataset\n",
825
+ "import evaluate\n",
826
+ "import numpy as np"
827
+ ]
828
+ },
829
+ {
830
+ "cell_type": "code",
831
+ "execution_count": 21,
832
+ "metadata": {
833
+ "colab": {
834
+ "base_uri": "https://localhost:8080/"
835
+ },
836
+ "id": "kW53p7VQqUDD",
837
+ "outputId": "8231f3ba-37d5-4546-c4d0-6b4ff317ecf3"
838
+ },
839
+ "outputs": [
840
+ {
841
+ "data": {
842
+ "text/plain": [
843
+ "device(type='cuda', index=0)"
844
+ ]
845
+ },
846
+ "execution_count": 21,
847
+ "metadata": {},
848
+ "output_type": "execute_result"
849
+ }
850
+ ],
851
+ "source": [
852
+ "device = torch.device(\"cuda:0\" if torch.cuda.is_available() else \"cpu\")\n",
853
+ "device"
854
+ ]
855
+ },
856
+ {
857
+ "cell_type": "code",
858
+ "execution_count": 24,
859
+ "metadata": {
860
+ "id": "68OdbTv5rLrm"
861
+ },
862
+ "outputs": [],
863
+ "source": [
864
+ "train_hg = Dataset(pa.Table.from_pandas(df_train))\n",
865
+ "valid_hg = Dataset(pa.Table.from_pandas(df_val))"
866
+ ]
867
+ },
868
+ {
869
+ "cell_type": "markdown",
870
+ "metadata": {},
871
+ "source": [
872
+ "## Evaluation metrics: F1 score and accuracy"
873
+ ]
874
+ },
875
+ {
876
+ "cell_type": "markdown",
877
+ "metadata": {},
878
+ "source": [
879
+ "`compute_metrics` computes both accuracy and the F1 score to evaluate a classification model. First, the accuracy and F1 metrics are loaded with `evaluate.load`. The function then receives `eval_pred`, a pair of arrays holding the model's predictions and the true labels. It takes the `argmax` of the predictions to obtain class indices, computes accuracy by comparing them with the labels using the previously loaded accuracy metric, and computes the F1 score using the previously loaded F1 metric with `average=\"weighted\"`. The two results are combined into a single dictionary and returned."
880
+ ]
881
+ },
882
+ {
883
+ "cell_type": "code",
884
+ "execution_count": 25,
885
+ "metadata": {
886
+ "id": "lUNhDPs0ry4m"
887
+ },
888
+ "outputs": [],
889
+ "source": [
890
+ "\n",
891
+ "# Load both accuracy and f1 metrics\n",
892
+ "accuracy_metric = evaluate.load(\"accuracy\")\n",
893
+ "f1_metric = evaluate.load(\"f1\")\n",
894
+ "\n",
895
+ "# Metric helper method\n",
896
+ "def compute_metrics(eval_pred):\n",
897
+ "    predictions, labels = eval_pred\n",
898
+ "    predictions = np.argmax(predictions, axis=1)\n",
899
+ "\n",
900
+ "    # Compute accuracy\n",
901
+ "    accuracy = accuracy_metric.compute(predictions=predictions, references=labels)\n",
902
+ "\n",
903
+ "    # Compute F1 score\n",
904
+ "    f1 = f1_metric.compute(predictions=predictions, references=labels, average=\"weighted\")\n",
905
+ "\n",
906
+ "    # Combine the metrics into a single dictionary\n",
907
+ "    combined_metrics = {\n",
908
+ "        'accuracy': accuracy['accuracy'],\n",
909
+ "        'f1': f1['f1']\n",
910
+ "    }\n",
911
+ "\n",
912
+ "    return combined_metrics"
913
+ ]
914
+ },
915
+ {
916
+ "cell_type": "code",
917
+ "execution_count": 26,
918
+ "metadata": {
919
+ "colab": {
920
+ "base_uri": "https://localhost:8080/"
921
+ },
922
+ "id": "9jJYTWsHjnEc",
923
+ "outputId": "fe45691a-4476-4978-89b8-15f36465c37c"
924
+ },
925
+ "outputs": [
926
+ {
927
+ "name": "stdout",
928
+ "output_type": "stream",
929
+ "text": [
930
+ "Name: accelerateNote: you may need to restart the kernel to use updated packages.\n",
931
+ "\n",
932
+ "Version: 0.31.0\n",
933
+ "Summary: Accelerate\n",
934
+ "Home-page: https://github.com/huggingface/accelerate\n",
935
+ "Author: The HuggingFace team\n",
936
+ "Author-email: [email protected]\n",
937
+ "License: Apache\n",
938
+ "Location: c:\\Users\\andre\\1JUPYTER\\dt_labs\\.venv\\Lib\\site-packages\n",
939
+ "Requires: huggingface-hub, numpy, packaging, psutil, pyyaml, safetensors, torch\n",
940
+ "Required-by: \n",
941
+ "---\n",
942
+ "Name: transformers\n",
943
+ "Version: 4.41.2\n",
944
+ "Summary: State-of-the-art Machine Learning for JAX, PyTorch and TensorFlow\n",
945
+ "Home-page: https://github.com/huggingface/transformers\n",
946
+ "Author: The Hugging Face team (past and future) with the help of all our contributors (https://github.com/huggingface/transformers/graphs/contributors)\n",
947
+ "Author-email: [email protected]\n",
948
+ "License: Apache 2.0 License\n",
949
+ "Location: c:\\Users\\andre\\1JUPYTER\\dt_labs\\.venv\\Lib\\site-packages\n",
950
+ "Requires: filelock, huggingface-hub, numpy, packaging, pyyaml, regex, requests, safetensors, tokenizers, tqdm\n",
951
+ "Required-by: \n"
952
+ ]
953
+ }
954
+ ],
955
+ "source": [
956
+ "%pip show accelerate transformers"
957
+ ]
958
+ },
959
+ {
960
+ "cell_type": "markdown",
961
+ "metadata": {},
962
+ "source": [
963
+ "## Model training"
964
+ ]
965
+ },
966
+ {
967
+ "cell_type": "code",
968
+ "execution_count": 27,
969
+ "metadata": {
970
+ "colab": {
971
+ "base_uri": "https://localhost:8080/"
972
+ },
973
+ "id": "QlaLCwf7rLtp",
974
+ "outputId": "7e10e82a-8bc7-478b-851e-c7b628b46c41"
975
+ },
976
+ "outputs": [
977
+ {
978
+ "name": "stderr",
979
+ "output_type": "stream",
980
+ "text": [
981
+ "c:\\Users\\andre\\1JUPYTER\\dt_labs\\.venv\\Lib\\site-packages\\transformers\\training_args.py:1474: FutureWarning: `evaluation_strategy` is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use `eval_strategy` instead\n",
982
+ " warnings.warn(\n"
983
+ ]
984
+ }
985
+ ],
986
+ "source": [
987
+ "from transformers import TrainingArguments, Trainer\n",
988
+ "\n",
989
+ "EPOCHS = 1\n",
990
+ "\n",
991
+ "training_args = TrainingArguments(output_dir=\"./result\",\n",
992
+ " evaluation_strategy=\"epoch\",\n",
993
+ " num_train_epochs=EPOCHS,\n",
994
+ " per_device_train_batch_size=16,\n",
995
+ " per_device_eval_batch_size=8\n",
996
+ " )\n",
997
+ "\n",
998
+ "trainer = Trainer(\n",
999
+ " model=model,\n",
1000
+ " args=training_args,\n",
1001
+ " train_dataset=train_hg,\n",
1002
+ " eval_dataset=valid_hg,\n",
1003
+ " tokenizer=tokenizer,\n",
1004
+ " compute_metrics=compute_metrics\n",
1005
+ ")"
1006
+ ]
1007
+ },
1008
+ {
1009
+ "cell_type": "code",
1010
+ "execution_count": 28,
1011
+ "metadata": {},
1012
+ "outputs": [
1013
+ {
1014
+ "name": "stdout",
1015
+ "output_type": "stream",
1016
+ "text": [
1017
+ "CUDA available: True\n",
1018
+ "CUDA version: 12.1\n"
1019
+ ]
1020
+ }
1021
+ ],
1022
+ "source": [
1023
+ "print(\"CUDA available: \", torch.cuda.is_available())\n",
1024
+ "print(\"CUDA version: \", torch.version.cuda)"
1025
+ ]
1026
+ },
1027
+ {
1028
+ "cell_type": "code",
1029
+ "execution_count": 29,
1030
+ "metadata": {
1031
+ "colab": {
1032
+ "base_uri": "https://localhost:8080/",
1033
+ "height": 141
1034
+ },
1035
+ "id": "3s6lVFz_rLwO",
1036
+ "outputId": "ee64e8e9-9c8c-42a8-c355-f51410cc33df"
1037
+ },
1038
+ "outputs": [
1039
+ {
1040
+ "name": "stderr",
1041
+ "output_type": "stream",
1042
+ "text": [
1043
+ " 0%| | 0/2500 [00:00<?, ?it/s]c:\\Users\\andre\\1JUPYTER\\dt_labs\\.venv\\Lib\\site-packages\\transformers\\models\\bert\\modeling_bert.py:435: UserWarning: 1Torch was not compiled with flash attention. (Triggered internally at ..\\aten\\src\\ATen\\native\\transformers\\cuda\\sdp_utils.cpp:263.)\n",
1044
+ " attn_output = torch.nn.functional.scaled_dot_product_attention(\n",
1045
+ " 20%|██ | 500/2500 [05:35<22:22, 1.49it/s]"
1046
+ ]
1047
+ },
1048
+ {
1049
+ "name": "stdout",
1050
+ "output_type": "stream",
1051
+ "text": [
1052
+ "{'loss': 0.4994, 'grad_norm': 12.613661766052246, 'learning_rate': 4e-05, 'epoch': 0.2}\n"
1053
+ ]
1054
+ },
1055
+ {
1056
+ "name": "stderr",
1057
+ "output_type": "stream",
1058
+ "text": [
1059
+ " 40%|████ | 1000/2500 [11:13<16:46, 1.49it/s]"
1060
+ ]
1061
+ },
1062
+ {
1063
+ "name": "stdout",
1064
+ "output_type": "stream",
1065
+ "text": [
1066
+ "{'loss': 0.3898, 'grad_norm': 4.661791801452637, 'learning_rate': 3e-05, 'epoch': 0.4}\n"
1067
+ ]
1068
+ },
1069
+ {
1070
+ "name": "stderr",
1071
+ "output_type": "stream",
1072
+ "text": [
1073
+ " 60%|██████ | 1500/2500 [16:47<11:02, 1.51it/s]"
1074
+ ]
1075
+ },
1076
+ {
1077
+ "name": "stdout",
1078
+ "output_type": "stream",
1079
+ "text": [
1080
+ "{'loss': 0.3516, 'grad_norm': 1.5203113555908203, 'learning_rate': 2e-05, 'epoch': 0.6}\n"
1081
+ ]
1082
+ },
1083
+ {
1084
+ "name": "stderr",
1085
+ "output_type": "stream",
1086
+ "text": [
1087
+ " 80%|████████ | 2000/2500 [22:25<05:33, 1.50it/s]"
1088
+ ]
1089
+ },
1090
+ {
1091
+ "name": "stdout",
1092
+ "output_type": "stream",
1093
+ "text": [
1094
+ "{'loss': 0.3121, 'grad_norm': 8.331348419189453, 'learning_rate': 1e-05, 'epoch': 0.8}\n"
1095
+ ]
1096
+ },
1097
+ {
1098
+ "name": "stderr",
1099
+ "output_type": "stream",
1100
+ "text": [
1101
+ "100%|██████████| 2500/2500 [28:04<00:00, 1.50it/s]"
1102
+ ]
1103
+ },
1104
+ {
1105
+ "name": "stdout",
1106
+ "output_type": "stream",
1107
+ "text": [
1108
+ "{'loss': 0.2882, 'grad_norm': 6.287994861602783, 'learning_rate': 0.0, 'epoch': 1.0}\n"
1109
+ ]
1110
+ },
1111
+ {
1112
+ "name": "stderr",
1113
+ "output_type": "stream",
1114
+ "text": [
1115
+ " \n",
1116
+ "100%|██████████| 2500/2500 [30:45<00:00, 1.35it/s]"
1117
+ ]
1118
+ },
1119
+ {
1120
+ "name": "stdout",
1121
+ "output_type": "stream",
1122
+ "text": [
1123
+ "{'eval_loss': 0.283893883228302, 'eval_accuracy': 0.883, 'eval_f1': 0.8829425082505502, 'eval_runtime': 159.717, 'eval_samples_per_second': 62.611, 'eval_steps_per_second': 7.826, 'epoch': 1.0}\n",
1124
+ "{'train_runtime': 1845.2907, 'train_samples_per_second': 21.677, 'train_steps_per_second': 1.355, 'train_loss': 0.3682089477539062, 'epoch': 1.0}\n"
1125
+ ]
1126
+ },
1127
+ {
1128
+ "name": "stderr",
1129
+ "output_type": "stream",
1130
+ "text": [
1131
+ "\n"
1132
+ ]
1133
+ },
1134
+ {
1135
+ "data": {
1136
+ "text/plain": [
1137
+ "TrainOutput(global_step=2500, training_loss=0.3682089477539062, metrics={'train_runtime': 1845.2907, 'train_samples_per_second': 21.677, 'train_steps_per_second': 1.355, 'total_flos': 4111110240000000.0, 'train_loss': 0.3682089477539062, 'epoch': 1.0})"
1138
+ ]
1139
+ },
1140
+ "execution_count": 29,
1141
+ "metadata": {},
1142
+ "output_type": "execute_result"
1143
+ }
1144
+ ],
1145
+ "source": [
1146
+ "trainer.train()"
1147
+ ]
1148
+ },
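Reviewer note: the logged `learning_rate` values (4e-05, 3e-05, 2e-05, 1e-05, 0.0) follow a straight line, which is what a linear decay without warmup would produce. The sketch below reproduces that arithmetic in plain Python. It assumes the run used a base learning rate of 5e-5 with a linear scheduler and no warmup, since the cell does not set these explicitly; `linear_lr` is a hypothetical helper, not a Transformers function.

```python
BASE_LR = 5e-5        # assumption: learning rate was left at its default
TOTAL_STEPS = 2500    # taken from the progress bar in the training log
BATCH_SIZE = 16       # per_device_train_batch_size in the cell above


def linear_lr(step, base_lr=BASE_LR, total_steps=TOTAL_STEPS):
    """Linearly decay the learning rate from base_lr to 0, with no warmup."""
    remaining = max(0.0, 1.0 - step / total_steps)
    return base_lr * remaining


# Learning rate at each logging step reported by the Trainer
for step in (500, 1000, 1500, 2000, 2500):
    print(f"step {step}: lr = {linear_lr(step):.0e}")

# Examples seen in one epoch, assuming one device and no gradient accumulation
examples_per_epoch = TOTAL_STEPS * BATCH_SIZE
print("examples per epoch:", examples_per_epoch)
```

Under the same assumptions, the example count agrees with the reported throughput: 21.677 samples per second over a `train_runtime` of about 1845 seconds is roughly 40,000 samples.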
1149
+ {
1150
+ "cell_type": "markdown",
1151
+ "metadata": {},
1152
+ "source": [
1153
+ "## Saving the model"
1154
+ ]
1155
+ },
1156
+ {
1157
+ "cell_type": "code",
1158
+ "execution_count": 38,
1159
+ "metadata": {
1160
+ "id": "8eO6WDiOBAhg"
1161
+ },
1162
+ "outputs": [],
1163
+ "source": [
1164
+ "torch.save(model.state_dict(), 'model.pth')"
1165
+ ]
1166
+ },
1167
+ {
1168
+ "cell_type": "markdown",
1169
+ "metadata": {
1170
+ "id": "FtVZztSa40b3"
1171
+ },
1172
+ "source": [
1173
+ "## Testing individual predictions"
1174
+ ]
1175
+ },
1176
+ {
1177
+ "cell_type": "code",
1178
+ "execution_count": 34,
1179
+ "metadata": {
1180
+ "id": "lOHVSyfJJ8zK"
1181
+ },
1182
+ "outputs": [],
1183
+ "source": [
1184
+ "from transformers import AutoTokenizer\n",
1185
+ "\n",
1186
+ "new_tokenizer = AutoTokenizer.from_pretrained(pre_trained_base)"
1187
+ ]
1188
+ },
1189
+ {
1190
+ "cell_type": "code",
1191
+ "execution_count": 35,
1192
+ "metadata": {
1193
+ "id": "t-T7hDZ2J1Qk"
1194
+ },
1195
+ "outputs": [],
1196
+ "source": [
1197
+ "@torch.no_grad()  # inference only: no autograd graph is needed\n",
+ "def get_prediction(text):\n",
1198
+ " encoding = new_tokenizer(text, return_tensors=\"pt\", padding=\"max_length\", truncation=True, max_length=MAX_LEN)\n",
1199
+ " encoding = {k: v.to(trainer.model.device) for k,v in encoding.items()}\n",
1200
+ "\n",
1201
+ " outputs = model(**encoding)\n",
1202
+ "\n",
1203
+ " logits = outputs.logits\n",
1204
+ "\n",
1205
+ " # Single-label classification: softmax turns the class logits into probabilities\n",
1205
+ " probs = torch.softmax(logits.squeeze().cpu(), dim=-1)\n",
1207
+ " probs = probs.detach().numpy()\n",
1208
+ " label = np.argmax(probs, axis=-1)\n",
1209
+ "\n",
1210
+ " return label"
1211
+ ]
1212
+ },
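Reviewer note: `get_prediction` picks the label with an argmax over the class scores. The sketch below, in plain Python with made-up logits, illustrates why the predicted label does not depend on whether the logits are first passed through a sigmoid or a softmax. Both are monotonic, so the ordering is preserved, but only softmax yields values that sum to 1 and can be read as class probabilities. The helper names and the example logits are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


# Hypothetical logits for the classes [negative, positive]
logits = [-1.2, 2.3]

soft = softmax(logits)
sig = [sigmoid(x) for x in logits]

print("softmax:", soft, "sum =", sum(soft))
print("sigmoid:", sig, "sum =", sum(sig))
print("labels:", argmax(logits), argmax(soft), argmax(sig))
```

If only the label is needed, taking the argmax of the raw logits is enough; the squashing function matters only when the scores themselves are reported.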
1213
+ {
1214
+ "cell_type": "code",
1215
+ "execution_count": 36,
1216
+ "metadata": {
1217
+ "colab": {
1218
+ "base_uri": "https://localhost:8080/"
1219
+ },
1220
+ "id": "y4dxQ4oYJ5C1",
1221
+ "outputId": "d0d77c2d-aff6-412b-e22a-0b721f5b097e"
1222
+ },
1223
+ "outputs": [
1224
+ {
1225
+ "data": {
1226
+ "text/plain": [
1227
+ "0"
1228
+ ]
1229
+ },
1230
+ "execution_count": 36,
1231
+ "metadata": {},
1232
+ "output_type": "execute_result"
1233
+ }
1234
+ ],
1235
+ "source": [
1236
+ "get_prediction(\"This movie is horrible!\")"
1237
+ ]
1238
+ },
1239
+ {
1240
+ "cell_type": "code",
1241
+ "execution_count": 37,
1242
+ "metadata": {
1243
+ "colab": {
1244
+ "base_uri": "https://localhost:8080/"
1245
+ },
1246
+ "id": "JXAyOu_6AqoO",
1247
+ "outputId": "ffcd019e-4c0c-45eb-f538-d2860c53a0e0"
1248
+ },
1249
+ "outputs": [
1250
+ {
1251
+ "data": {
1252
+ "text/plain": [
1253
+ "1"
1254
+ ]
1255
+ },
1256
+ "execution_count": 37,
1257
+ "metadata": {},
1258
+ "output_type": "execute_result"
1259
+ }
1260
+ ],
1261
+ "source": [
1262
+ "get_prediction(\"This movie is awesome!\")"
1263
+ ]
1264
+ }
1265
+ ],
1266
+ "metadata": {
1267
+ "accelerator": "GPU",
1268
+ "colab": {
1269
+ "provenance": []
1270
+ },
1271
+ "gpuClass": "standard",
1272
+ "kernelspec": {
1273
+ "display_name": "Python 3",
1274
+ "name": "python3"
1275
+ },
1276
+ "language_info": {
1277
+ "codemirror_mode": {
1278
+ "name": "ipython",
1279
+ "version": 3
1280
+ },
1281
+ "file_extension": ".py",
1282
+ "mimetype": "text/x-python",
1283
+ "name": "python",
1284
+ "nbconvert_exporter": "python",
1285
+ "pygments_lexer": "ipython3",
1286
+ "version": "3.10.11"
1287
+ }
1288
+ },
1289
+ "nbformat": 4,
1290
+ "nbformat_minor": 0
1291
+ }
notebooks_explicativos/Simbolico.ipynb ADDED
The diff for this file is too large to render. See raw diff