{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "nQGxT74Sdis4" }, "source": [ "Sequence-based models with Word2Vec\n", "===================================" ] }, { "cell_type": "markdown", "metadata": { "id": "GkwnMQ-DXuNk" }, "source": [ "Introduction\n", "------------" ] }, { "cell_type": "markdown", "metadata": { "id": "vjuHJnQOdis6" }, "source": [ "Sequence-based models have the strength of taking a sequence of tokens (in a given order) and generating an output that depends on the type of problem at hand:\n", " - Seq2Class: take a sequence of tokens and produce a class.\n", " - Seq2Seq: take a sequence of tokens and produce another sequence of tokens.\n", "\n", "We saw that when applying Topic Modeling techniques we try to reduce the number of dimensions of our word representations and then use a classifier to solve the task at hand. However, the basic assumption of that kind of model is that a text is nothing more than a distribution of words (bag of words). We know, though, that a text is a sequence of words in which order matters. To capture this kind of property we can use sequence-based models."
] }, { "cell_type": "markdown", "metadata": { "id": "Dcyc_TQ6dis7" }, "source": [ "### To run this notebook" ] }, { "cell_type": "markdown", "metadata": { "id": "SVZuIbB-XuNn" }, "source": [ "To run this notebook, download the dataset and helper modules and install the following libraries:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "syRzC-FPXuNn" }, "outputs": [], "source": [ "!wget https://raw.githubusercontent.com/santiagxf/M72109/master/NLP/Datasets/mascorpus/tweets_marketing.csv \\\n", " --quiet --no-clobber --directory-prefix ./Datasets/mascorpus/\n", "!wget https://raw.githubusercontent.com/santiagxf/M72109/master/m72109/nlp/normalization.py \\\n", " --quiet --no-clobber --directory-prefix ./m72109/nlp/\n", "!wget https://raw.githubusercontent.com/santiagxf/M72109/master/m72109/nlp/transformation.py \\\n", " --quiet --no-clobber --directory-prefix ./m72109/nlp/\n", "\n", "!wget https://raw.githubusercontent.com/santiagxf/M72109/master/docs/nlp/neural/sequences-word2vec.txt \\\n", " --quiet --no-clobber\n", "!pip install -r sequences-word2vec.txt --quiet" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "id": "Ntcs1AlpfckX" }, "outputs": [], "source": [ "import warnings\n", "warnings.filterwarnings('ignore')" ] }, { "cell_type": "markdown", "metadata": { "id": "0rRpBRahditI" }, "source": [ "We also install the Spanish spaCy model:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "id": "UgVf9-V7ditI" }, "outputs": [], "source": [ "!python -m spacy download es_core_news_sm 1> /dev/null" ] }, { "cell_type": "markdown", "metadata": { "id": "CC7vGpjqditL" }, "source": [ "We load the dataset:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "id": "pmJpenUkditM" }, "outputs": [], "source": [ "import pandas as pd\n", "\n", "tweets = pd.read_csv('Datasets/mascorpus/tweets_marketing.csv')" ] }, { "cell_type": "markdown", "metadata": { "id": "kwRwMqoSditT" }, "source": [ "## Text preprocessing" ] }, { "cell_type": "markdown", "metadata": { "id": "_qrYYZ0FditT" }, "source": [ "As with Topic Modeling, our first step is to preprocess the text. To keep the focus on Word2Vec in this module, I prepared a TweetTextNormalizer class that does all the preprocessing for us. You can explore the parameters its constructor accepts to see which options can be configured, such as stemming, lemmatization, etc.\n", "\n", "In particular, we are creating a TweetTextNormalizer that:\n", " - Applies a Twitter-specific tokenizer\n", " - Removes stop words\n", " - Applies lemmatization\n", " - Removes URLs\n", " - Removes accents\n", " - Converts text to lowercase\n", "\n", "Additionally, the parameter return_tokens=True indicates that the output of this process will be lists of tokens rather than sentences." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "id": "Jnuzl8qbditU" }, "outputs": [], "source": [ "from m72109.nlp.normalization import TweetTextNormalizer" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "id": "kFYzMdiCditX" }, "outputs": [], "source": [ "normalizer = TweetTextNormalizer(preserve_case=False, return_tokens=True)" ] }, { "cell_type": "markdown", "metadata": { "id": "6sD5jK6ndita" }, "source": [ "Let's transform the text:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "ldKc0udIditb", "outputId": "6486a4bb-2b72-4575-dfc7-a1c1fe9d62ad" }, "outputs": [ { "output_type": "stream", "name": "stderr", "text": [ "100%|██████████| 3763/3763 [03:13<00:00, 19.44it/s]\n" ] } ], "source": [ "tweets_text = normalizer.transform(tweets['TEXTO'])" ] }, { "cell_type": "markdown", "metadata": { "id": "QdoLyHg9ditd" }, "source": [ "## Word vectorization" ] }, { "cell_type": "markdown", "metadata": { "id": "a9iuTrwXditd" }, "source": [ "In the previous activities we always used a TF-IDF vectorizer to generate the vectors. This time we will use Word2Vec, with a model pre-trained for Spanish." ] }, { "cell_type": "markdown", "metadata": { "id": "Gl18LLmkeqjx" }, "source": [ "We download our Spanish word2vec vectors:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "id": "LJVkiH2lesN1" }, "outputs": [], "source": [ "!mkdir -p ./Models/Word2Vec\n", "!wget https://santiagxf.blob.core.windows.net/public/Word2Vec/model-es.bin \\\n", " --quiet --no-clobber" ] }, { "cell_type": "markdown", "source": [ "Note that the vectorizer below has the parameter sequence_to_idx set to True. This means that we do not want the Word2Vec vectors themselves as output, but rather the index that corresponds to each word in an index-to-vector matrix." ], "metadata": { "id": "xspUdeJCTP5n" } }, { "cell_type": "code", "execution_count": 8, "metadata": { "id": "jE-SlBWddite" }, "outputs": [], "source": [ "from m72109.nlp.transformation import Word2VecVectorizer" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "id": "vkWxVLPYditg" }, "outputs": [], "source": [ "w2v = Word2VecVectorizer(model='/content/model-es.bin', sequence_to_idx=True)" ] }, { "cell_type": "code", "source": [ "tweets_text = w2v.transform(tweets_text)" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "b3LWu-54d1Oy", "outputId": "1946c83d-012e-49ce-e363-a2061e7f92ee" }, "execution_count": 10, "outputs": [ { "output_type": "stream", "name": "stderr", "text": [ "100%|██████████| 3763/3763 [00:00<00:00, 49876.49it/s]\n" ] } ] }, { "cell_type": "markdown", "metadata": { "id": "uBNY37iIditm" }, "source": [ "## Building a sequence-based model" ] }, { "cell_type": "markdown", "metadata": { "id": "vhfhArdbdit5" }, "source": [ "### Adjusting the sequence length" ] }, { "cell_type": "markdown", "metadata": { "id": "_WpivJzsdit6" }, "source": [ "Sequence-based models can adapt to sequences of any length; however, the inputs to our neural network must have a fixed shape. To that end we define a maximum length for the sequences we will analyze. We can either use a specific value or use the maximum number of tokens found in our corpus." ] }, { "cell_type": "markdown", "metadata": { "id": "5Q2zBOXUdit6" }, "source": [ "The following PadSequenceTransformer class is a module I prepared to simplify this processing. It adjusts any sequence so that it has exactly max_seq_len elements. When a sequence is shorter, it is padded with zeros." ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "id": "NeVilfLMdit7" }, "outputs": [], "source": [ "from m72109.nlp.transformation import PadSequenceTransformer" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "id": "u-2T5o8Rdit9" }, "outputs": [], "source": [ "max_seq_len = 50" ] }, { "cell_type": "code", "source": [ "seq2seq = PadSequenceTransformer(max_len=max_seq_len)" ], "metadata": { "id": "HggcN8KcV9qZ" }, "execution_count": 13, "outputs": [] }, { "cell_type": "code", "source": [ "tweets_text = seq2seq.transform(tweets_text)" ], "metadata": { "id": "Ja1tQUEhhDJ4" }, "execution_count": 14, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "TPbwWFXUXuNv" }, "source": [ "### Building the model" ] }, { "cell_type": "markdown", "metadata": { "id": "lnhHx7Pddito" }, "source": [ "To build our model we will use TensorFlow. In particular, we will use the Keras API, which lets us compose neural network models as a sequence of steps or layers connected in one direction.\n", "\n", "We will use the following types of layers:\n", "\n", " * **Embedding:** This layer transforms integer indices into the rows of a matrix, producing dense vector representations. In this case it performs the lookup of the vector representations of our words.\n", " * **SpatialDropout1D:** This type of layer helps promote independence between filters (feature maps). It works analogously to Dropout, but instead of dropping individual elements it drops entire feature maps.\n", " * **LSTM:** Long Short-Term Memory layer (Hochreiter & Schmidhuber, 1997).\n", " * **Dense:** A typical fully connected neural network layer.\n", "\n", "Some details to note:\n", "\n", " * `loss='sparse_categorical_crossentropy'`: This is a classification problem (crossentropy) with more than two classes (categorical). The `sparse` variant is used because our labels are integer class indices rather than one-hot encoded vectors, while the model outputs a probability for each of the 7 possible classes.\n", " * `metrics=['accuracy']`: Keras reports overall accuracy, that is, the fraction of samples classified correctly across all classes. With integer labels, Keras resolves `'accuracy'` to sparse categorical accuracy." ] }, { "cell_type": "code", "source": [ "embedding_weights = w2v.get_weights()" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "7wphj0eNl43g", "outputId": "3a6b6242-c22c-482b-b8ec-1e86a94d0382" }, "execution_count": 15, "outputs": [ { "output_type": "stream", "name": "stderr", "text": [ "100%|██████████| 2656058/2656058 [00:06<00:00, 411299.01it/s]\n" ] } ] }, { "cell_type": "markdown", "source": [ "> The `get_weights()` method builds the index-to-vector matrix that is later used to look up the vector of each word. This matrix has dimensions m x n, where m is the number of words in the vocabulary and n is the dimension of the word2vec vectors. In this case we work with 100-dimensional vectors."
], "metadata": { "id": "3FSKPr-al-8W" } }, { "cell_type": "code", "execution_count": 16, "metadata": { "id": "LQ2C980jditp" }, "outputs": [], "source": [ "import tensorflow as tf\n", "import tensorflow.keras as keras\n", "from tensorflow.keras.models import Sequential, Model\n", "from tensorflow.keras.layers import Embedding, LSTM, Dense, Input, SpatialDropout1D" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "id": "nNrawZnAditr" }, "outputs": [], "source": [ "model = Sequential([\n", " Embedding(w2v.vocab_size, w2v.emdedding_size,\n", " weights=[embedding_weights],\n", " trainable=False,\n", " mask_zero=True),\n", " SpatialDropout1D(0.2),\n", " LSTM(w2v.emdedding_size),\n", " Dense(7, activation='softmax')\n", "])\n", "\n", "model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])" ] }, { "cell_type": "markdown", "metadata": { "id": "XWc2N0O-ditw" }, "source": [ "We can inspect the model:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 282 }, "id": "vmk-mnMzditw", "outputId": "e7aab425-c6ac-4146-8f54-62b2f53fa9a8" }, "outputs": [ { "output_type": "display_data", "data": { "text/plain": [ "\u001b[1mModel: \"sequential\"\u001b[0m\n" ], "text/html": [ "
Model: \"sequential\"\n",
              "
\n" ] }, "metadata": {} }, { "output_type": "display_data", "data": { "text/plain": [ "┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓\n", "┃\u001b[1m \u001b[0m\u001b[1mLayer (type) \u001b[0m\u001b[1m \u001b[0m┃\u001b[1m \u001b[0m\u001b[1mOutput Shape \u001b[0m\u001b[1m \u001b[0m┃\u001b[1m \u001b[0m\u001b[1m Param #\u001b[0m\u001b[1m \u001b[0m┃\n", "┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩\n", "│ embedding (\u001b[38;5;33mEmbedding\u001b[0m) │ ? │ \u001b[38;5;34m265,605,800\u001b[0m │\n", "├─────────────────────────────────┼────────────────────────┼───────────────┤\n", "│ spatial_dropout1d │ ? │ \u001b[38;5;34m0\u001b[0m │\n", "│ (\u001b[38;5;33mSpatialDropout1D\u001b[0m) │ │ │\n", "├─────────────────────────────────┼────────────────────────┼───────────────┤\n", "│ lstm (\u001b[38;5;33mLSTM\u001b[0m) │ ? │ \u001b[38;5;34m0\u001b[0m (unbuilt) │\n", "├─────────────────────────────────┼────────────────────────┼───────────────┤\n", "│ dense (\u001b[38;5;33mDense\u001b[0m) │ ? │ \u001b[38;5;34m0\u001b[0m (unbuilt) │\n", "└─────────────────────────────────┴────────────────────────┴───────────────┘\n" ], "text/html": [ "
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓\n",
              "┃ Layer (type)                     Output Shape                  Param # ┃\n",
              "┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩\n",
              "│ embedding (Embedding)           │ ?                      │   265,605,800 │\n",
              "├─────────────────────────────────┼────────────────────────┼───────────────┤\n",
              "│ spatial_dropout1d               │ ?                      │             0 │\n",
              "│ (SpatialDropout1D)              │                        │               │\n",
              "├─────────────────────────────────┼────────────────────────┼───────────────┤\n",
              "│ lstm (LSTM)                     │ ?                      │   0 (unbuilt) │\n",
              "├─────────────────────────────────┼────────────────────────┼───────────────┤\n",
              "│ dense (Dense)                   │ ?                      │   0 (unbuilt) │\n",
              "└─────────────────────────────────┴────────────────────────┴───────────────┘\n",
              "
\n" ] }, "metadata": {} }, { "output_type": "display_data", "data": { "text/plain": [ "\u001b[1m Total params: \u001b[0m\u001b[38;5;34m265,605,800\u001b[0m (1013.21 MB)\n" ], "text/html": [ "
 Total params: 265,605,800 (1013.21 MB)\n",
              "
\n" ] }, "metadata": {} }, { "output_type": "display_data", "data": { "text/plain": [ "\u001b[1m Trainable params: \u001b[0m\u001b[38;5;34m0\u001b[0m (0.00 B)\n" ], "text/html": [ "
 Trainable params: 0 (0.00 B)\n",
              "
\n" ] }, "metadata": {} }, { "output_type": "display_data", "data": { "text/plain": [ "\u001b[1m Non-trainable params: \u001b[0m\u001b[38;5;34m265,605,800\u001b[0m (1013.21 MB)\n" ], "text/html": [ "
 Non-trainable params: 265,605,800 (1013.21 MB)\n",
              "
\n" ] }, "metadata": {} } ], "source": [ "model.summary()" ] }, { "cell_type": "markdown", "source": [ "Before continuing, let's split our dataset into training and test sets and encode the target variable:" ], "metadata": { "id": "G_WubiNOiiyH" } }, { "cell_type": "code", "source": [ "from sklearn.preprocessing import LabelEncoder\n", "\n", "encoder = LabelEncoder()\n", "tweets_sector = encoder.fit_transform(tweets['SECTOR'])" ], "metadata": { "id": "mcZVsSjhiOBu" }, "execution_count": 19, "outputs": [] }, { "cell_type": "code", "execution_count": 20, "metadata": { "id": "RKrEadrGditO" }, "outputs": [], "source": [ "from sklearn.model_selection import train_test_split\n", "\n", "X_train, X_test, y_train, y_test = train_test_split(tweets_text, tweets_sector,\n", " test_size=0.33,\n", " stratify=tweets_sector)" ] }, { "cell_type": "markdown", "metadata": { "id": "webM_rFCdiuD" }, "source": [ "We train our model:" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "sJYPslBwdiuE", "outputId": "a29365ab-fd06-42f1-8411-f02551a22dce" }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Epoch 1/50\n", "\u001b[1m79/79\u001b[0m \u001b[32m━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[37m\u001b[0m \u001b[1m7s\u001b[0m 48ms/step - accuracy: 0.4105 - loss: 1.6565\n", "Epoch 2/50\n", "\u001b[1m79/79\u001b[0m \u001b[32m━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[37m\u001b[0m \u001b[1m5s\u001b[0m 44ms/step - accuracy: 0.7568 - loss: 0.8116\n", "Epoch 3/50\n", "\u001b[1m79/79\u001b[0m \u001b[32m━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[37m\u001b[0m \u001b[1m6s\u001b[0m 57ms/step - accuracy: 0.8233 - loss: 0.5701\n", "Epoch 4/50\n", "\u001b[1m79/79\u001b[0m \u001b[32m━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[37m\u001b[0m \u001b[1m4s\u001b[0m 44ms/step - accuracy: 0.8422 - loss: 0.5003\n", "Epoch 5/50\n", "\u001b[1m79/79\u001b[0m \u001b[32m━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[37m\u001b[0m 
\u001b[1m3s\u001b[0m 44ms/step - accuracy: 0.8665 - loss: 0.4105\n", "Epoch 6/50\n", "\u001b[1m79/79\u001b[0m \u001b[32m━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[37m\u001b[0m \u001b[1m5s\u001b[0m 68ms/step - accuracy: 0.8870 - loss: 0.3595\n", "Epoch 7/50\n", "\u001b[1m79/79\u001b[0m \u001b[32m━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[37m\u001b[0m \u001b[1m8s\u001b[0m 44ms/step - accuracy: 0.8790 - loss: 0.3679\n", "Epoch 8/50\n", "\u001b[1m79/79\u001b[0m \u001b[32m━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[37m\u001b[0m \u001b[1m4s\u001b[0m 56ms/step - accuracy: 0.8821 - loss: 0.3417\n", "Epoch 9/50\n", "\u001b[1m79/79\u001b[0m \u001b[32m━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[37m\u001b[0m \u001b[1m5s\u001b[0m 49ms/step - accuracy: 0.8865 - loss: 0.3083\n", "Epoch 10/50\n", "\u001b[1m79/79\u001b[0m \u001b[32m━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[37m\u001b[0m \u001b[1m5s\u001b[0m 45ms/step - accuracy: 0.9041 - loss: 0.2818\n", "Epoch 11/50\n", "\u001b[1m79/79\u001b[0m \u001b[32m━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[37m\u001b[0m \u001b[1m4s\u001b[0m 56ms/step - accuracy: 0.9031 - loss: 0.2601\n", "Epoch 12/50\n", "\u001b[1m79/79\u001b[0m \u001b[32m━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[37m\u001b[0m \u001b[1m3s\u001b[0m 44ms/step - accuracy: 0.9106 - loss: 0.2574\n", "Epoch 13/50\n", "\u001b[1m79/79\u001b[0m \u001b[32m━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[37m\u001b[0m \u001b[1m5s\u001b[0m 48ms/step - accuracy: 0.9046 - loss: 0.2618\n", "Epoch 14/50\n", "\u001b[1m79/79\u001b[0m \u001b[32m━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[37m\u001b[0m \u001b[1m5s\u001b[0m 44ms/step - accuracy: 0.9203 - loss: 0.2319\n", "Epoch 15/50\n", "\u001b[1m79/79\u001b[0m \u001b[32m━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[37m\u001b[0m \u001b[1m5s\u001b[0m 44ms/step - accuracy: 0.9156 - loss: 0.2395\n", "Epoch 16/50\n", "\u001b[1m79/79\u001b[0m \u001b[32m━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[37m\u001b[0m \u001b[1m6s\u001b[0m 54ms/step - accuracy: 0.9276 - loss: 0.2093\n", "Epoch 17/50\n", "\u001b[1m79/79\u001b[0m 
\u001b[32m━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[37m\u001b[0m \u001b[1m3s\u001b[0m 44ms/step - accuracy: 0.9381 - loss: 0.1913\n", "Epoch 18/50\n", "\u001b[1m79/79\u001b[0m \u001b[32m━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[37m\u001b[0m \u001b[1m7s\u001b[0m 66ms/step - accuracy: 0.9354 - loss: 0.2026\n", "Epoch 19/50\n", "\u001b[1m79/79\u001b[0m \u001b[32m━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[37m\u001b[0m \u001b[1m9s\u001b[0m 44ms/step - accuracy: 0.9395 - loss: 0.1827\n", "Epoch 20/50\n", "\u001b[1m79/79\u001b[0m \u001b[32m━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[37m\u001b[0m \u001b[1m6s\u001b[0m 56ms/step - accuracy: 0.9335 - loss: 0.1873\n", "Epoch 21/50\n", "\u001b[1m79/79\u001b[0m \u001b[32m━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[37m\u001b[0m \u001b[1m4s\u001b[0m 44ms/step - accuracy: 0.9352 - loss: 0.1824\n", "Epoch 22/50\n", "\u001b[1m79/79\u001b[0m \u001b[32m━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[37m\u001b[0m \u001b[1m6s\u001b[0m 50ms/step - accuracy: 0.9427 - loss: 0.1732\n", "Epoch 23/50\n", "\u001b[1m79/79\u001b[0m \u001b[32m━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[37m\u001b[0m \u001b[1m4s\u001b[0m 50ms/step - accuracy: 0.9359 - loss: 0.1733\n", "Epoch 24/50\n", "\u001b[1m79/79\u001b[0m \u001b[32m━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[37m\u001b[0m \u001b[1m4s\u001b[0m 45ms/step - accuracy: 0.9433 - loss: 0.1654\n", "Epoch 25/50\n", "\u001b[1m79/79\u001b[0m \u001b[32m━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[37m\u001b[0m \u001b[1m6s\u001b[0m 53ms/step - accuracy: 0.9488 - loss: 0.1536\n", "Epoch 26/50\n", "\u001b[1m79/79\u001b[0m \u001b[32m━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[37m\u001b[0m \u001b[1m4s\u001b[0m 44ms/step - accuracy: 0.9513 - loss: 0.1338\n", "Epoch 27/50\n", "\u001b[1m79/79\u001b[0m \u001b[32m━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[37m\u001b[0m \u001b[1m3s\u001b[0m 44ms/step - accuracy: 0.9606 - loss: 0.1313\n", "Epoch 28/50\n", "\u001b[1m79/79\u001b[0m \u001b[32m━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[37m\u001b[0m \u001b[1m4s\u001b[0m 52ms/step - accuracy: 0.9545 - loss: 0.1292\n", 
"Epoch 29/50\n", "\u001b[1m79/79\u001b[0m \u001b[32m━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[37m\u001b[0m \u001b[1m4s\u001b[0m 44ms/step - accuracy: 0.9696 - loss: 0.1005\n", "Epoch 30/50\n", "\u001b[1m79/79\u001b[0m \u001b[32m━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[37m\u001b[0m \u001b[1m4s\u001b[0m 44ms/step - accuracy: 0.9675 - loss: 0.1018\n", "Epoch 31/50\n", "\u001b[1m79/79\u001b[0m \u001b[32m━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[37m\u001b[0m \u001b[1m4s\u001b[0m 51ms/step - accuracy: 0.9705 - loss: 0.1082\n", "Epoch 32/50\n", "\u001b[1m79/79\u001b[0m \u001b[32m━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[37m\u001b[0m \u001b[1m4s\u001b[0m 50ms/step - accuracy: 0.9624 - loss: 0.1085\n", "Epoch 33/50\n", "\u001b[1m79/79\u001b[0m \u001b[32m━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[37m\u001b[0m \u001b[1m5s\u001b[0m 45ms/step - accuracy: 0.9674 - loss: 0.0999\n", "Epoch 34/50\n", "\u001b[1m79/79\u001b[0m \u001b[32m━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[37m\u001b[0m \u001b[1m4s\u001b[0m 50ms/step - accuracy: 0.9658 - loss: 0.0990\n", "Epoch 35/50\n", "\u001b[1m79/79\u001b[0m \u001b[32m━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[37m\u001b[0m \u001b[1m4s\u001b[0m 50ms/step - accuracy: 0.9623 - loss: 0.1044\n", "Epoch 36/50\n", "\u001b[1m79/79\u001b[0m \u001b[32m━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[37m\u001b[0m \u001b[1m5s\u001b[0m 44ms/step - accuracy: 0.9787 - loss: 0.0766\n", "Epoch 37/50\n", "\u001b[1m79/79\u001b[0m \u001b[32m━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[37m\u001b[0m \u001b[1m6s\u001b[0m 58ms/step - accuracy: 0.9723 - loss: 0.0851\n", "Epoch 38/50\n", "\u001b[1m79/79\u001b[0m \u001b[32m━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[37m\u001b[0m \u001b[1m4s\u001b[0m 44ms/step - accuracy: 0.9811 - loss: 0.0759\n", "Epoch 39/50\n", "\u001b[1m79/79\u001b[0m \u001b[32m━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[37m\u001b[0m \u001b[1m5s\u001b[0m 44ms/step - accuracy: 0.9820 - loss: 0.0585\n", "Epoch 40/50\n", "\u001b[1m79/79\u001b[0m \u001b[32m━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[37m\u001b[0m \u001b[1m5s\u001b[0m 
46ms/step - accuracy: 0.9763 - loss: 0.0763\n", "Epoch 41/50\n", "\u001b[1m79/79\u001b[0m \u001b[32m━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[37m\u001b[0m \u001b[1m5s\u001b[0m 44ms/step - accuracy: 0.9719 - loss: 0.0863\n", "Epoch 42/50\n", "\u001b[1m79/79\u001b[0m \u001b[32m━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[37m\u001b[0m \u001b[1m5s\u001b[0m 58ms/step - accuracy: 0.9826 - loss: 0.0669\n", "Epoch 43/50\n", "\u001b[1m79/79\u001b[0m \u001b[32m━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[37m\u001b[0m \u001b[1m4s\u001b[0m 44ms/step - accuracy: 0.9841 - loss: 0.0537\n", "Epoch 44/50\n", "\u001b[1m79/79\u001b[0m \u001b[32m━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[37m\u001b[0m \u001b[1m5s\u001b[0m 44ms/step - accuracy: 0.9768 - loss: 0.0753\n", "Epoch 45/50\n", "\u001b[1m79/79\u001b[0m \u001b[32m━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[37m\u001b[0m \u001b[1m6s\u001b[0m 51ms/step - accuracy: 0.9818 - loss: 0.0550\n", "Epoch 46/50\n", "\u001b[1m79/79\u001b[0m \u001b[32m━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[37m\u001b[0m \u001b[1m3s\u001b[0m 44ms/step - accuracy: 0.9821 - loss: 0.0616\n", "Epoch 47/50\n", "\u001b[1m79/79\u001b[0m \u001b[32m━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[37m\u001b[0m \u001b[1m6s\u001b[0m 53ms/step - accuracy: 0.9831 - loss: 0.0579\n", "Epoch 48/50\n", "\u001b[1m79/79\u001b[0m \u001b[32m━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[37m\u001b[0m \u001b[1m4s\u001b[0m 47ms/step - accuracy: 0.9885 - loss: 0.0466\n", "Epoch 49/50\n", "\u001b[1m79/79\u001b[0m \u001b[32m━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[37m\u001b[0m \u001b[1m5s\u001b[0m 44ms/step - accuracy: 0.9784 - loss: 0.0563\n", "Epoch 50/50\n", "\u001b[1m79/79\u001b[0m \u001b[32m━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[37m\u001b[0m \u001b[1m6s\u001b[0m 71ms/step - accuracy: 0.9814 - loss: 0.0593\n" ] } ], "source": [ "history = model.fit(X_train, y_train, epochs=50)" ] }, { "cell_type": "markdown", "metadata": { "id": "D3DtoSNTdiuJ" }, "source": [ "## Evaluating the results" ] }, { "cell_type": "markdown", "metadata": { "id": "aFmZCv5KdiuK" 
}, "source": [ "We evaluate its performance on the test set:" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "kqM4G2BVdiuL", "outputId": "0a3d130a-72d4-403a-c7e9-535c6360824d" }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "\u001b[1m39/39\u001b[0m \u001b[32m━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[37m\u001b[0m \u001b[1m1s\u001b[0m 20ms/step\n" ] } ], "source": [ "predictions = model.predict(X_test)" ] }, { "cell_type": "code", "source": [ "import numpy as np\n", "\n", "predictions = np.argmax(predictions, axis=1)" ], "metadata": { "id": "afB-SEKdlKFV" }, "execution_count": 23, "outputs": [] }, { "cell_type": "markdown", "source": [ "Let's look at the report:" ], "metadata": { "id": "FXuC61Dplufm" } }, { "cell_type": "code", "execution_count": 24, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "XSIlnu1zdiuN", "outputId": "a1bcf738-a032-43c7-c5f4-bfd15009fbce" }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ " precision recall f1-score support\n", "\n", "ALIMENTACION 0.99 0.92 0.95 110\n", " AUTOMOCION 0.90 0.89 0.89 148\n", " BANCA 0.86 0.87 0.86 198\n", " BEBIDAS 0.92 0.88 0.90 223\n", " DEPORTES 0.94 0.89 0.92 216\n", " RETAIL 0.80 0.87 0.83 268\n", " TELCO 0.89 0.91 0.90 79\n", "\n", " accuracy 0.89 1242\n", " macro avg 0.90 0.89 0.89 1242\n", "weighted avg 0.89 0.89 0.89 1242\n", "\n" ] } ], "source": [ "from sklearn.metrics import classification_report\n", "\n", "print(classification_report(y_test, predictions, target_names=encoder.classes_))" ] } ], "metadata": { "colab": { "name": "Word2Vec - Secuencias.ipynb", "provenance": [] }, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": 
"ipython3", "version": "3.8.11" } }, "nbformat": 4, "nbformat_minor": 0 }