{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "iKENeWEEwKIb" }, "source": [ "\n", "BERT for a topic modeling problem\n", "=================================" ] }, { "cell_type": "markdown", "metadata": { "id": "LV5DTexIybKw" }, "source": [ "Introduction\n", "------------" ] }, { "cell_type": "markdown", "metadata": { "id": "sT_t9OxYwKIc" }, "source": [ "Transformer-based models can help us solve many kinds of problems, from classification and regression to more complex tasks such as text summarization or conditional language generation. In this notebook we apply topic modeling techniques, using a BERT-based model, to our usual tweet classification problem." ] }, { "cell_type": "markdown", "metadata": { "id": "Dcyc_TQ6dis7" }, "source": [ "### Running this notebook" ] }, { "cell_type": "markdown", "metadata": { "id": "YWGfRkiHybK1" }, "source": [ "To run this notebook, download the required files and install the following libraries:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "YXE4ZBrlybK1", "outputId": "b991dd62-125c-4a7c-bc63-217fa7fbb4d5" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. 
This behaviour is the source of the following dependency conflicts.\n", "en-core-web-sm 3.4.0 requires spacy<3.5.0,>=3.4.0, but you have spacy 2.3.5 which is incompatible.\n", "confection 0.0.2 requires srsly<3.0.0,>=2.4.0, but you have srsly 1.0.5 which is incompatible.\n", " Installing build dependencies ... done\n", " Getting requirements to build wheel ... done\n", " Preparing wheel metadata ... done\n", " Building wheel for hdbscan (PEP 517) ... 
done\n", " Building wheel for sentence-transformers (setup.py) ... done\n", " Building wheel for umap-learn (setup.py) ... done\n", " Building wheel for pynndescent (setup.py) ... done\n", " Building wheel for sacremoses (setup.py) ... done\n", "ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.\n", "torchtext 0.13.1 requires torch==1.12.1, but you have torch 1.9.0 which is incompatible.\n", "torchaudio 0.12.1+cu113 requires torch==1.12.1, but you have torch 1.9.0 which is incompatible.\n" ] } ], "source": [ "!wget https://raw.githubusercontent.com/santiagxf/M72109/master/NLP/Datasets/mascorpus/tweets_marketing.csv \\\n", " --quiet --no-clobber --directory-prefix ./Datasets/mascorpus/\n", "\n", "!wget https://raw.githubusercontent.com/santiagxf/M72109/master/m72109/nlp/normalization.py \\\n", " --quiet --no-clobber --directory-prefix ./m72109/nlp/\n", "\n", "!wget https://raw.githubusercontent.com/santiagxf/M72109/master/docs/nlp/neural/bertopic.txt \\\n", " --quiet --no-clobber\n", "\n", "!wget https://raw.githubusercontent.com/santiagxf/M72109/master/docs/nlp/preprocessing/Normalization.txt \\\n", " --quiet --no-clobber\n", "\n", "!pip install -r Normalization.txt --quiet\n", "!pip install -r bertopic.txt --quiet" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Ntcs1AlpfckX" }, "outputs": [], "source": [ "import warnings\n", "warnings.filterwarnings('ignore')" ] }, { "cell_type": "markdown", "metadata": { "id": "_gBXNzwYwKIu" }, "source": [ "Load the dataset:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "I8vqJD9JwKIv" }, "outputs": [], "source": [ "import pandas as pd\n", "\n", "tweets = pd.read_csv('Datasets/mascorpus/tweets_marketing.csv')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "osRGJ85Lx_3S" }, "outputs": [], "source": [ "from m72109.nlp.normalization import TweetTextNormalizer" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "NxJjAT-RyLKV" }, "outputs": [], "source": [ "normalizer = TweetTextNormalizer(lemmatize=False, stem=False, reduce_len=True, strip_handles=True, strip_stopwords=False, strip_urls=True, strip_accents=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "tlwyfrwaMDAy" }, "outputs": [], "source": [ "docs = normalizer.transform(tweets['TEXTO'])" ] }, { "cell_type": "markdown", "metadata": { "id": "ExV73PU4Ak1I" }, "source": [ "### Checking the available hardware" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "gather": { "logged": 1604083146665 }, "id": "ULqO3R6_Am2I", "outputId": "bd12fc79-7572-4592-83f1-fc6b6a1b490c" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "This notebook is running on cpu\n" ] } ], "source": [ "import torch\n", "device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')\n", "\n", "print(\"This notebook is running on\", device)" ] }, { "cell_type": "markdown", "metadata": { "id": "Hqf5VvQKVko_" }, "source": [ "## Clustering\n", "\n", "Models such as BERT can generate contextualized representations, or embeddings, for the text sequences they receive. In one of the previous examples we trained a tweet classification model, which consisted of a transformer-based architecture plus a classifier.\n", "\n", "In this example we will not use the classification layer; we keep only the embeddings. Note that, although we do not use the classifier (MLP), the BERT embeddings were fine-tuned specifically for the tweet classification problem. This makes our clustering better suited to the dataset.\n", "\n", "To show how it works, let's load the tweet classification model. It is published on HuggingFace under this course's account:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "w9JNa2s5wKI5", "outputId": "bc554fcc-5960-4832-bc27-4a15989e370d" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Some weights of the model checkpoint at fce-m72109/mascorpus-bert-classifier were not used when initializing BertModel: ['classifier.bias', 'classifier.weight']\n", "- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).\n", "- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).\n" ] } ], "source": [ "from transformers import pipeline\n", "\n", "embedding_model = pipeline(\"feature-extraction\", model=\"fce-m72109/mascorpus-bert-classifier\")" ] }, { "cell_type": "markdown", "metadata": { "id": "pA8jWrPvWVjP" }, "source": [ "> Note that the task we specify in the pipeline is `feature-extraction`. This loads only the base `BertModel`, which is why the warning above reports that the classifier weights were not used." 
] }, { "cell_type": "markdown", "metadata": { "id": "A90unty6WzCc" }, "source": [ "### Clusters by words" ] }, { "cell_type": "markdown", "metadata": { "id": "Nx4FkysvWa_o" }, "source": [ "Using the `BERTopic` library, we can compute the embeddings and cluster them for visualization:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "s1f5MNLUcqFJ" }, "outputs": [], "source": [ "from bertopic import BERTopic\n", "\n", "topic_model = BERTopic(language='spanish', embedding_model=embedding_model)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "earA7Mc1cqv_" }, "outputs": [], "source": [ "topics, probs = topic_model.fit_transform(docs)" ] }, { "cell_type": "markdown", "metadata": { "id": "uRFsU9R5WjU6" }, "source": [ "We can see that the library detected 28 different topics (labeled 0 to 27); the label -1 is reserved by BERTopic for outlier documents that were not assigned to any topic:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "l7aecwz_jxP8", "outputId": "d7398585-3033-466f-989d-f4e8a4b2f0ce" }, "outputs": [ { "data": { "text/plain": [ "array([-1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,\n", " 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27])" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import numpy as np\n", "\n", "np.unique(np.asarray(topics))" ] }, { "cell_type": "markdown", "metadata": { "id": "lDfLvAPbWp3c" }, "source": [ "Let's see how these topics group together:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 667 }, "id": "HNsU9kokj9hM", "outputId": "95adf71b-3136-4660-dfef-3e087adf3472" }, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "\n", "