{ "nbformat": 4, "nbformat_minor": 0, "metadata": { "colab": { "provenance": [] }, "kernelspec": { "name": "python3", "display_name": "Python 3" }, "language_info": { "name": "python" }, "gpuClass": "standard" }, "cells": [ { "cell_type": "markdown", "source": [ "# Implementation of text classification with BERT\n", "\n", "\n", "This notebook is still a work in progress.\n", "\n", "This notebook is based on this TensorFlow tutorial: [Classify text with BERT](https://www.tensorflow.org/tutorials/text/classify_text_wibert)\n", "\n", "BERT [(article link)](https://arxiv.org/abs/1810.04805) and other Transformer encoder architectures have been wildly successful on a variety of tasks in NLP (natural language processing). They compute vector-space representations of natural language that are suitable for use in deep learning models.\n", "\n", "![](http://www.d2l.ai/_images/nlp-map-pretrain.svg)\n", "\n", "Source: http://www.d2l.ai/chapter_natural-language-processing-pretraining/index.html\n", "\n", "BERT models are usually pre-trained on a large corpus of text, then fine-tuned for specific tasks.\n", "\n", "In this notebook, I use a pretrained BERT model to compute vector-space representations of a hate speech dataset and feed them to two different downstream architectures (CNN and MLP).\n", "\n", "## Sentiment Analysis\n", "\n", "This notebook trains a sentiment analysis model to classify the tweets in the [Hate Speech and Offensive Language Dataset](https://www.kaggle.com/mrmorj/hate-speech-and-offensive-language-dataset) into three classes:\n", "\n", "* 0 - hate speech\n", "* 1 - offensive language\n", "* 2 - neither" ], "metadata": { "id": "Jh_WkIs1iJDs" } }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Ltc_HOzjX87s" }, "outputs": [], "source": [ "!pip install -q tensorflow-text > /dev/null\n", "!pip install -q tf-models-official > /dev/null" ] }, { "cell_type": "code", "source": [ "from official.nlp.optimization import AdamWeightDecay, WarmUp\n", 
"import tensorflow as tf\n", "import tensorflow_hub as hub\n", "import tensorflow_text as text\n", "import numpy as np\n", "np.set_printoptions(suppress=True)\n", "\n", "\n", "# Carga el modelo\n", "with tf.keras.utils.custom_object_scope({'AdamWeightDecay': AdamWeightDecay, 'WarmUp': WarmUp}):\n", " classifier_model = tf.keras.models.load_model('classifier_model.h5', \n", " custom_objects={'KerasLayer': hub.KerasLayer})" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "EQlIdYjKYAnn", "outputId": "efb1f87f-8b45-4201-ac14-2abdc74b8cfd" }, "execution_count": null, "outputs": [ { "output_type": "stream", "name": "stderr", "text": [ "WARNING:tensorflow:Please fix your imports. Module tensorflow.python.training.tracking.data_structures has been moved to tensorflow.python.trackable.data_structures. The old module will be deleted in version 2.11.\n", "WARNING:tensorflow:From /usr/local/lib/python3.9/dist-packages/tensorflow/python/autograph/pyct/static_analysis/liveness.py:83: Analyzer.lamba_check (from tensorflow.python.autograph.pyct.static_analysis.liveness) is deprecated and will be removed after 2023-09-23.\n", "Instructions for updating:\n", "Lambda fuctions will be no more assumed to be used in the statement where they are used, or at least in the same block. https://github.com/tensorflow/tensorflow/issues/56089\n", "WARNING:tensorflow:Error in loading the saved optimizer state. As a result, your model is starting with a freshly initialized optimizer.\n" ] } ] }, { "cell_type": "code", "source": [ "classifier_model.predict([\"LEETSSS GOOO Get those ... 
outta here!!!!!!\"])" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "6Ma3P-7iYEbA", "outputId": "2e6fbc37-6ab8-4035-c706-f8d6d3d8c7ba" }, "execution_count": null, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "1/1 [==============================] - 1s 1s/step\n" ] }, { "output_type": "execute_result", "data": { "text/plain": [ "array([[0.99998355, 0.00001638, 0.00000017]], dtype=float32)" ] }, "metadata": {}, "execution_count": 4 } ] } ] }