{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "3RcCek8lsFze" }, "source": [ "Dimensionality reduction techniques\n", "===================================" ] }, { "cell_type": "markdown", "metadata": { "id": "u8Irja1O_J9X" }, "source": [ "## Introduction\n", "\n", "Topic modeling is an unsupervised machine learning technique in which we try to discover topics that are abstracted from the text but can describe a collection of documents. It is important to note that these \"topics\" are not necessarily equivalent to the colloquial notion of a topic; rather, they correspond to patterns that emerge from the words in the documents.\n", "\n", "The basic assumption of topic modeling is that each document is represented by a mixture of topics, and each topic consists of a collection of words." ] }, { "cell_type": "markdown", "metadata": { "id": "nJKJtlkCsFzn" }, "source": [ "### To run this notebook" ] }, { "cell_type": "markdown", "metadata": { "id": "zjntkNZ5sFzn" }, "source": [ "To run this notebook, we will install the following libraries:\n", "\n", "```\n", "nltk\n", "pandas\n", "numpy\n", "matplotlib\n", "tqdm\n", "spacy==2.3.5\n", "unidecode\n", "scikit-learn\n", "importlib-metadata==4.13.0\n", "pyLDAvis\n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "9zClL8PJsFzo" }, "outputs": [], "source": [ "!wget https://raw.githubusercontent.com/santiagxf/M72109/master/NLP/Datasets/mascorpus/tweets_marketing.csv \\\n", " --quiet --no-clobber --directory-prefix ./Datasets/mascorpus/\n", "\n", "!wget https://raw.githubusercontent.com/santiagxf/M72109/master/docs/nlp/classic/topic-modeling.txt \\\n", " --quiet --no-clobber\n", "!pip install -r topic-modeling.txt --quiet" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "id": "KgoST-ebsFzq" }, "outputs": [], "source": [ "!python -m spacy download es_core_news_sm 1> /dev/null" ] }, { "cell_type": "markdown", 
"source": [ "We disable some warning messages:" ], "metadata": { "id": "uVCSKx1cu6Z3" } }, { "cell_type": "code", "source": [ "import warnings\n", "warnings.filterwarnings(\"ignore\")" ], "metadata": { "id": "ltgAo3Y2u9md" }, "execution_count": 17, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "WPpqVNrwSdhL" }, "source": [ "First we import some required libraries" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "id": "zPfF_O0U_J9a" }, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "from tqdm import tqdm" ] }, { "cell_type": "markdown", "metadata": { "id": "lE_O7bEjLebd" }, "source": [ "## About the dataset we will work with" ] }, { "cell_type": "markdown", "metadata": { "id": "H8lcRTa_Li4e" }, "source": [ "As an example we will use a Spanish-language dataset of tweets that different users have posted about product brands or companies in sectors such as food, construction, automotive, etc. Each tweet is also associated with one of the phases of the sales process (also known as the marketing funnel), so the tweets are tagged with the following phases:\n", "\n", " - Awareness – the customer is aware of the existence of a product or service\n", " - Interest – actively expresses interest in a product or service\n", " - Evaluation – aspires to a particular brand or product\n", " - Purchase – takes the next step needed to buy the product or service\n", " - Postpurchase – completion of the purchase process. The customer compares what they wanted with what they got\n", "\n", "Reference: [Spanish Corpus of Tweets for Marketing](http://ceur-ws.org/Vol-2111/paper1.pdf)\n", "\n", "> Note: The version of this dataset that we use here is a preprocessed version of the original." 
] }, { "cell_type": "code", "execution_count": 5, "metadata": { "id": "Gc44Q7do_J9h" }, "outputs": [], "source": [ "tweets = pd.read_csv('Datasets/mascorpus/tweets_marketing.csv')" ] }, { "cell_type": "markdown", "metadata": { "id": "INJwReUXSs4K" }, "source": [ "Let's inspect the dataset" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 206 }, "id": "Gd6EocPdG5A0", "outputId": "df21f703-4542-40a6-97c3-9cb205d3334d" }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " TEXTO SECTOR MARCA \\\n", "0 #tablondeanuncios Funda nordica ikea #madrid h... RETAIL IKEA \n", "1 #tr Me ofrezco para montar muebles de Ikea - H... RETAIL IKEA \n", "2 #VozPópuli Vozpópuli @voz_populi - #LoMásLeido... RETAIL ALCAMPO \n", "3 #ZonaTecno Destacado: Todo lo que hay que sabe... RETAIL CARREFOUR \n", "4 $Carrefour retira pez #Panga. OCU y grupos x #... RETAIL CARREFOUR \n", "\n", " CANAL AWARENESS EVALUATION PURCHASE POSTPURCHASE NC2 \n", "0 Microblog 0 0 0.0 0 1.0 \n", "1 Microblog 0 0 0.0 0 1.0 \n", "2 Microblog 0 0 0.0 0 1.0 \n", "3 Microblog 0 0 0.0 0 1.0 \n", "4 Microblog 0 0 0.0 0 1.0 " ], "text/html": [ "\n", "
\n" ], "application/vnd.google.colaboratory.intrinsic+json": { "type": "dataframe", "variable_name": "tweets", "summary": "{\n \"name\": \"tweets\",\n \"rows\": 3763,\n \"fields\": [\n {\n \"column\": \"TEXTO\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 3678,\n \"samples\": [\n \"El BBVA deber\\u00eda hacer nuevos comerciales con Claudio Bravo.\\nAprovechando que ahora es el rey de la banca.\\n@alebattocchio\",\n \"yo quiero el nuevo citroen c3!! https://t.co/5gKTjThrJk\",\n \"Acabo de correr 5,78 km a un ritmo de 6'36'' con Nike+ https://t.co/Ui6paCTjtC #nikeplus\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"SECTOR\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 7,\n \"samples\": [\n \"RETAIL\",\n \"TELCO\",\n \"BEBIDAS\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"MARCA\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 38,\n \"samples\": [\n \"ESTRELLA GALICIA\",\n \"NIKE\",\n \"LEROY MERLIN\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"CANAL\",\n \"properties\": {\n \"dtype\": \"category\",\n \"num_unique_values\": 1,\n \"samples\": [\n \"Microblog\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"AWARENESS\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0,\n \"min\": 0,\n \"max\": 1,\n \"num_unique_values\": 2,\n \"samples\": [\n 1\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"EVALUATION\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0,\n \"min\": 0,\n \"max\": 1,\n \"num_unique_values\": 2,\n \"samples\": [\n 1\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"PURCHASE\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0.15200028602567445,\n \"min\": 0.0,\n \"max\": 1.0,\n \"num_unique_values\": 2,\n \"samples\": [\n 
1.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"POSTPURCHASE\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0,\n \"min\": 0,\n \"max\": 1,\n \"num_unique_values\": 2,\n \"samples\": [\n 1\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"NC2\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0.5047254454850233,\n \"min\": 0.0,\n \"max\": 11.0,\n \"num_unique_values\": 3,\n \"samples\": [\n 1.0\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" } }, "metadata": {}, "execution_count": 6 } ], "source": [ "tweets.head(5)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 268 }, "id": "uXwJS2Og_J9l", "outputId": "c8c6ca1f-d5e2-4807-9061-51479e9089f8" }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " TEXTO SECTOR\n", "0 #tablondeanuncios Funda nordica ikea #madrid h... RETAIL\n", "725 \"Ilcinsisti lis MB dispiniblis\" te odeeeeeo Mo... TELCO\n", "964 #CarlosSlim y Bimbo lanzarán un vehículo eléct... ALIMENTACION\n", "1298 ‼🏎Toyota #Day, 4ruedas ,1/4 milla, 1 #pasión, ... AUTOMOCION\n", "1748 \"- Tú qué.\\n- Yo na.\"\\nConversaciones banco sa... BANCA\n", "2348 - Cariño, te juro que sólo tenían Cruzcampo en... BEBIDAS\n", "3023 #adidas #hockey Amenabar 2080 CABA https://t.c... DEPORTES" ], "text/html": [ "\n", "
\n" ], "application/vnd.google.colaboratory.intrinsic+json": { "type": "dataframe", "summary": "{\n \"name\": \"tweets\",\n \"rows\": 7,\n \"fields\": [\n {\n \"column\": \"TEXTO\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 7,\n \"samples\": [\n \"#tablondeanuncios Funda nordica ikea #madrid https://t.co/9TvaZSa1De https://t.co/J3bC6t6C9u\",\n \"\\\"Ilcinsisti lis MB dispiniblis\\\" te odeeeeeo Movistar dir\\u00eda la golda peluda jajaja\",\n \"- Cari\\u00f1o, te juro que s\\u00f3lo ten\\u00edan Cruzcampo en la tienda... https://t.co/m064KIDlpM\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"SECTOR\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 7,\n \"samples\": [\n \"RETAIL\",\n \"TELCO\",\n \"BEBIDAS\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" } }, "metadata": {}, "execution_count": 7 } ], "source": [ "tweets.groupby('SECTOR').head(1)[['TEXTO', 'SECTOR']]" ] }, { "cell_type": "markdown", "metadata": { "id": "r9FcIehJ_J9q" }, "source": [ "## Preprocessing" ] }, { "cell_type": "markdown", "metadata": { "id": "gfUnlH25HEeM" }, "source": [ "As in any NLP task, and more generally in machine learning, we will spend part of our time preprocessing the data to build useful representations and to get rid of problems specific to our dataset." ] }, { "cell_type": "markdown", "metadata": { "id": "UZISG0Cs_J-g" }, "source": [ "### Creating a text preprocessing routine\n", "\n", "We will perform the usual preprocessing tasks. In addition, our routine will:\n", "\n", " - Remove special characters: accents and special characters can complicate the word representation, so we will remove them.\n", " - Remove URLs and handles, which are typical on Twitter. This is specific to this dataset, since a URL carries no information in this context." 
] }, { "cell_type": "code", "execution_count": 8, "metadata": { "id": "0eJxv1LA_J-g" }, "outputs": [], "source": [ "import unidecode\n", "import spacy\n", "import es_core_news_sm as spa\n", "import re\n", "import nltk\n", "from nltk import stem\n", "from nltk.corpus import stopwords\n", "from nltk.tokenize.casual import TweetTokenizer\n", "\n", "nltk.download('stopwords', quiet=True)\n", "\n", "parser = spa.load() # Load the Spanish parser\n", "tokenizer = TweetTokenizer(strip_handles=True, reduce_len=True) # Create a tokenizer\n", "stemmer = stem.SnowballStemmer(language='spanish') # Create a stemmer\n", "lemmatizer = lambda word : \" \".join([token.lemma_ for token in parser(word)]) # Create a lemmatizer\n", "stopwords = set(stopwords.words('spanish')) # Load the Spanish stopwords\n", "urls_regex = re.compile(r'http\\S+') # Regular expression that matches URLs\n", "\n", "def process_text(text):\n", "    tokens = tokenizer.tokenize(text)\n", "    tokens = [token for token in tokens if not re.match(urls_regex, token)]\n", "    tokens = [token for token in tokens if len(token) > 4]\n", "    tokens = [token for token in tokens if token not in stopwords]\n", "    tokens = [unidecode.unidecode(token) for token in tokens] # Strip accents\n", "    tokens = [lemmatizer(token) for token in tokens]\n", "    return tokens" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "U4uXheye_J-j", "outputId": "e28c8dfc-abd3-45c4-f9e3-b274cd453872" }, "outputs": [ { "output_type": "stream", "name": "stderr", "text": [ "100%|██████████| 3763/3763 [02:18<00:00, 27.21it/s]\n" ] } ], "source": [ "doc_list = []\n", "\n", "for doc in tqdm(tweets['TEXTO']):\n", "    tokens = process_text(doc)\n", "    doc_list.append(' '.join(tokens))" ] }, { "cell_type": "markdown", "metadata": { "id": "LtykSqbA_J-l" }, "source": [ "Let's review some results:" ] }, { "cell_type": "code", "execution_count": 18, 
"metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 35 }, "id": "QQgWOejW_J-l", "outputId": "7cb2e9a3-5cbe-4a23-c7b1-b79bb7fa51b2" }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "'#VozPópuli Vozpópuli @voz_populi - #LoMásLeidoHoy Mercadona, DIA o Alcampo guardan silencio ante la ola europea... https://t.co/aJTuA4J9UV'" ], "application/vnd.google.colaboratory.intrinsic+json": { "type": "string" } }, "metadata": {}, "execution_count": 18 } ], "source": [ "tweets['TEXTO'][2]" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 35 }, "id": "huZ2Cu2Q_J-n", "outputId": "6d2db30f-f339-4f93-c4e8-6b6271f12d54" }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "'# vozpopuli Vozpopuli # lomasleidohoy Mercadona Alcampo guardar silencio europeo'" ], "application/vnd.google.colaboratory.intrinsic+json": { "type": "string" } }, "metadata": {}, "execution_count": 11 } ], "source": [ "doc_list[2]" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "6WSbMjTnRYXT", "outputId": "afc28ea6-8bc0-4f06-fa0d-34e5e6f20e48" }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "3763" ] }, "metadata": {}, "execution_count": 15 } ], "source": [ "len(doc_list)" ] }, { "cell_type": "markdown", "metadata": { "id": "-PXVzn24_J-p" }, "source": [ "## Vectorization" ] }, { "cell_type": "markdown", "metadata": { "id": "TLwWu__sX4Rq" }, "source": [ "Once our text has been preprocessed to keep only the words that are relevant to us, we move on to generating vectors from the words that make up our vocabulary. Our models cannot operate on words, so we need a numerical representation of them." 
] }, { "cell_type": "code", "execution_count": 20, "metadata": { "id": "JUldeeN6_J-7" }, "outputs": [], "source": [ "from sklearn.feature_extraction.text import TfidfVectorizer\n", "\n", "vectorizer = TfidfVectorizer(use_idf=True, sublinear_tf=True, norm='l2')\n", "vectors = np.asarray(vectorizer.fit_transform(doc_list).todense())" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "id": "pTQYz9ABsFz7", "outputId": "ae0d5ad9-8649-488a-ba96-5f99cd0e118b", "colab": { "base_uri": "https://localhost:8080/" } }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "(3763, 6924)" ] }, "metadata": {}, "execution_count": 21 } ], "source": [ "vectors.shape" ] }, { "cell_type": "markdown", "metadata": { "id": "j5YtJxjQXgxH" }, "source": [ "## Dimensionality reduction: featurization\n", "Once our text is represented as vectors, a new problem appears: the vectors are still too large! In the example above we are working with vectors in a space of 6,924 dimensions. We need to reduce this dimensionality. To do so, we will use dimensionality reduction methods with the goal of generating features that are more useful to us. We will generate these features in an \"unsupervised\" way." ] }, { "cell_type": "markdown", "metadata": { "id": "cI4HLD1tkTBT" }, "source": [ "### Methods based on matrix decomposition" ] }, { "cell_type": "markdown", "metadata": { "id": "-8tC0kyK_J-6" }, "source": [ "Models based on matrix factorization try to reduce the dimensionality of a matrix by approximating it with two matrices that represent word embeddings and document embeddings, plus a diagonal matrix of singular values that links the two. This approach is popular not only in NLP but also in recommender systems, where a related factorization (Funk SVD) became widely known during the Netflix Prize.\n", "\n", "In SVD notation, $A = U \\Sigma V^T$.\n", "\n", "U and V (transposed) are orthogonal. This is to be expected: if certain properties define a given latent factor, those properties will have little relevance in the remaining factors (otherwise it would make no sense for them to form a separate factor in the first place).\n", "\n", "SVD is an exact decomposition method, which means that the matrices U and V are large enough to reproduce the matrix A exactly." ] }, { "cell_type": "markdown", "metadata": { "id": "Kh1u5cZzkKbO" }, "source": [ "### LSI - Latent Semantic Indexing" ] }, { "cell_type": "markdown", "metadata": { "id": "U1pbwdPv_J-7" }, "source": [ "LSI is a particular case of matrix factorization. When SVD is used to extract topics from text and the values of the matrix A correspond to word frequencies, the method is called Latent Semantic Analysis (LSA); in information retrieval the same technique is usually referred to as Latent Semantic Indexing (LSI).\n", "\n", "Because SVD is an exact decomposition, it reproduces A in full, sparsity and noise included. To avoid this, a modified version known as Truncated SVD is used, which computes only the k largest components of the decomposition. This helps LSI deal effectively with the sparse matrices that tend to arise in text corpora with synonyms and words that mean different things depending on context. Truncated SVD stops being an exact decomposition method because it approximates the matrix A using only the k most relevant topics.\n", "\n", "(Facebook Research: Fast Randomized SVD [https://research.fb.com/fast-randomized-svd/])\n", "\n", "In this setting:\n", " - A document is nothing more than the distribution of the words that occur in it (bag of words).\n", " - A is an m x n matrix, where m is the number of documents or observations and n is the number of words in the vocabulary.\n", " - The values of A correspond to the frequency of each vocabulary word in each observation or document.\n", " - A is a matrix subject to noise with a Gaussian distribution.\n", "\n", "\n", "Reference: Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions [https://arxiv.org/abs/0909.4061]" ] }, { "cell_type": "markdown", "metadata": { "id": "c1iGWgQZca0C" }, "source": [ "The main parameter in LSI is the number of factors we want to generate (the parameter k). There is no rule for choosing it, since it depends on the scenario. Very small values can force documents to collide in the topics they are assigned to, while very large values can let rare, infrequent words end up defining their own \"topic\".\n", "\n", "> Typical values for this parameter are in the range 50 < k < 300\n", "\n", "In the `scikit-learn` library this value is set with `n_components`. The `algorithm` parameter refers to the method used to compute the decomposition:" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "id": "QZ2uzF82_J--" }, "outputs": [], "source": [ "from sklearn.decomposition import TruncatedSVD\n", "\n", "svd = TruncatedSVD(n_components=7, algorithm='randomized')\n", "\n", "USigma = svd.fit_transform(vectors)\n", "Sigma = svd.singular_values_\n", "VT = svd.components_" ] }, { "cell_type": "markdown", "metadata": { "id": "3bFhXZc6eHQL" }, "source": [ "Although the code above exposes all three factors (`fit_transform` returns the product of U and Sigma), we are only interested in the matrix VT. Why? Remember that our input is a set of words that we then vectorized using TF-IDF. Each document is represented by this set of words. Our goal is to have a way to convert this set of words into \"topics\" that are more informative than the words themselves. **Consequently, the only thing we care about here is the matrix VT.**" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "Xvvt3kpu_J_A", "outputId": "a12c44cd-ae4d-4bfb-cc89-f9a51848b94c" }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "(7, 6924)" ] }, "metadata": {}, "execution_count": 23 } ], "source": [ "VT.shape" ] }, { "cell_type": "markdown", "metadata": { "id": "uwu_KCg__J_C" }, "source": [ "Internally, `TruncatedSVD` is a wrapper around the `randomized_svd` function, where the matrix Q we saw earlier is generated through random sampling. The following lines are equivalent to what we saw above (results may differ slightly because of the random sampling):" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "_kigl-Xl_J_D" }, "outputs": [], "source": [ "from sklearn.utils.extmath import randomized_svd\n", "\n", "U, Sigma, VT = randomized_svd(vectors,\n", "                              n_components=7,\n", "                              n_iter=5)" ] }, { "cell_type": "markdown", "metadata": { "id": "HV9fOGLB_J_F" }, "source": [ "We can verify that the columns of U are orthonormal" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "uLYEo9Gm_J_F", "outputId": "6277fcdb-e3cc-4da7-d3e3-97b22f83ef39" }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "True" ] }, "metadata": {}, "execution_count": 16 } ], "source": [ "np.allclose(U.T @ U, np.eye(U.shape[1]))" ] }, { "cell_type": "markdown", "metadata": { "id": "xSv1wUgK_J_I" }, "source": [ "The following is for information only: the values of Sigma show the relative importance of each of the topics we found. If we plot them, we see that the values decrease fairly quickly, which supports the assumption that Truncated SVD keeps the k most relevant topics." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 450 }, "id": "HVI80HRw_J_I", "outputId": "46a9dc99-7235-4423-beb1-2dd0dcac8671" }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "[]" ] }, "metadata": {}, "execution_count": 17 }, { "output_type": "display_data", "data": { "text/plain": [ "
" ], "image/png": "iVBORw0KGgoAAAANSUhEUgAAAiMAAAGgCAYAAAB45mdaAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjcuMSwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/bCgiHAAAACXBIWXMAAA9hAAAPYQGoP6dpAAA+PElEQVR4nO3deXhU5cH+8XsmyyRkg4SsELKwI7tsASkutFTRqvWHyouCC7YqVsXaKt3EWkXfFl9bFxStgAta1IIWFWURXAh7QRBZQnZIwpZkspB15vdHIBohkElCnpnM93Nd5zJz5pyZO3Np5vY5zznH4nQ6nQIAADDEajoAAADwbpQRAABgFGUEAAAYRRkBAABGUUYAAIBRlBEAAGAUZQQAABhFGQEAAEZRRgAAgFGUEQAAYJRLZaS2tlZ//OMflZSUpMDAQHXv3l2PPfaYznVF+bVr12ro0KGy2Wzq0aOHFi5c2JLMAACgHfF1ZeOnnnpK8+bN06JFi3TBBRdoy5YtuvXWWxUWFqZ77733jPtkZGRo4sSJuvPOO/Xmm29q9erVmj59umJjYzVhwoQmva/D4dChQ4cUEhIii8XiSmQAAGCI0+lUSUmJ4uLiZLU2Pv5hceVGeVdeeaWio6P1z3/+s37dddddp8DAQL3xxhtn3Oehhx7Shx9+qF27dtWvu/HGG1VUVKQVK1Y06X1zc3MVHx/f1JgAAMCN5OTkqGvXro0+79LIyOjRozV//nzt27dPvXr10o4dO/Tll1/q6aefbnSf1NRUjR8/vsG6CRMm6P777290n8rKSlVWVtY/PtWXcnJyFBoa6kpkAABgiN1uV3x8vEJCQs66nUtl5OGHH5bdblefPn3k4+Oj2tpaPf7445oyZUqj++Tn5ys6OrrBuujoaNntdp04cUKBgYGn7TNnzhw9+uijp60PDQ2ljAAA4GHONcXCpQmsS5Ys0ZtvvqnFixdr27ZtWrRokf72t79p0aJFLQr5Q7NmzVJxcXH9kpOT06qvDwAA3IdLIyO/+c1v9PDDD+vGG2+UJA0YMEBZWVmaM2eOpk2bdsZ9YmJiVFBQ0GBdQUGBQkNDzzgqIkk2m002m82VaAAAwEO5NDJSXl5+2mxYHx8fORyORvdJSUnR6tWrG6xbuXKlUlJSXHlrAADQTrlURq666io9/vjj+vDDD5WZmamlS5fq6aef1rXXXlu/zaxZszR16tT6x3feeafS09P129/+Vnv27NELL7ygJUuWaObMma33WwAAAI/l0mGaZ599Vn/84x9199136/Dhw4qLi9Mvf/lL/elPf6rfJi8vT9nZ2fWPk5KS9OGHH2rmzJn6+9//rq5du+qVV15p8jVGAABA++bSdUZMsdvtCgsLU3FxMWfTAADgIZr6/c29aQAAgFGUEQAAYBRlBAAAGEUZAQAARlFGAACAUZQRAABglFeXkS/3H9XUVzeporrWdBQAALyW15aRiupazVyyXZ/vO6K5n+41HQcAAK/ltWUkwM9HT/58gCTplS8ztDH9mOFEAAB4J68tI5J0Wd9o3TAsXk6n9Ot3dqi0ssZ0JAAAvI5XlxFJ+sOVfdWlY6ByC0/o8Q93m44DAIDX8foyEhLgp7nXD5LFIr21KUdr9hSYjgQAgFfx+jIiSaOSI3T7mCRJ0kPv7VRhWZXhRAAAeA/KyEkPTuitHlHBOlJSqT8s2yUPuJkxAADtAmXkpAA/H/3f9YPla7Xow515+mDHIdORAADwCpSR7xnQNUz3XNpDkvSn979RfnGF4UQAALR/lJEfmHFJDw3sGqbiE9V66L2vOVwDAMB5Rhn5AT8fq56+fpD8fa1at++IFm/KNh0JAIB2jTJyBj2iQvTQT/tIkh7/8FtlHSsznAgAgPaLMtKIW0cnalRyuMqravXrJTtU6+BwDQAA5wNlpBFWq0V/mzRIwTZfbckq1MtfpJuOBABAu0QZOYuunTroT1f1kyQ9/ek+7cm3G
04EAED7Qxk5h0kXdtX4vlGqqnVo5r92qKrGYToSAADtCmXkHCwWi+b8fKDCg/z1bZ5df1+9z3QkAADaFcpIE0SG2PT4Nf0lSfPWHtC27ELDiQAAaD8oI010+YBYXTukixxO6ddLdqi8qsZ0JAAA2gXKiAtm/+wCxYQGKONomZ76eI/pOAAAtAuUEReEBfrpr5MGSpIWpWbpi/1HDCcCAMDzUUZcNLZnpKamJEiSfvPO1yo+UW04EQAAno0y0gwPX95HiREdlG+v0KMffGM6DgAAHo0y0gwd/H019/rBslqkf//3oFbsyjMdCQAAj0UZaaYLEzrpznHdJUm/W7pLR0oqDScCAMAzUUZa4P7xvdQ3NlTHy6o069875XRyMz0AAFxFGWkBf1+rnr5+kPx9rFr1bYHe3ZprOhIAAB6HMtJCfWNDNfPHvSRJj/5nt3ILyw0nAgDAs1BGWsEvfpSsCxM6qbSyRg++s0MOB4drAABoKspIK/CxWjR30iAF+vloQ/pxLVyfaToSAAAew6UykpiYKIvFctoyY8aMM26/cOHC07YNCAholeDuJrFzkH4/sa8k6akVe5R2uNRwIgAAPINLZWTz5s3Ky8urX1auXClJmjRpUqP7hIaGNtgnKyurZYnd2JSR3fSjXpGqrHHogSXbVV3rMB0JAAC351IZiYyMVExMTP2yfPlyde/eXePGjWt0H4vF0mCf6OjoFod2VxaLRf973UCFBvjq69xivfDZAdORAABwe82eM1JVVaU33nhDt912mywWS6PblZaWKiEhQfHx8br66qv1zTft+/LpMWEBeuya/pKkZ9fs187cYsOJAABwb80uI8uWLVNRUZFuueWWRrfp3bu3Xn31Vb3//vt644035HA4NHr0aOXmnv16HJWVlbLb7Q0WT/KzQXGaOCBWNQ6nZi7ZrorqWtORAABwWxZnMy8bOmHCBPn7++s///lPk/eprq5W3759NXnyZD322GONbjd79mw9+uijp60vLi5WaGhoc+K2ueNlVZrwzOc6UlKp6Rcl6Q9X9jMdCQCANmW32xUWFnbO7+9mjYxkZWVp1apVmj59ukv7+fn5aciQIUpLSzvrdrNmzVJxcXH9kpOT05yYRoUH+eup6wZIkv75VYY2pB8znAgAAPfUrDKyYMECRUVFaeLEiS7tV1tbq507dyo2Nvas29lsNoWGhjZYPNGlfaJ14/B4OZ3Sg+/sUGlljelIAAC4HZfLiMPh0IIFCzRt2jT5+vo2eG7q1KmaNWtW/eM///nP+vTTT5Wenq5t27bppptuUlZWlssjKp7sD1f2U9dOgcotPKG/LN9tOg4AAG7H5TKyatUqZWdn67bbbjvtuezsbOXl5dU/Liws1B133KG+ffvqiiuukN1u1/r169Wvn/fMnwi2+epvkwbJYpHe3pyjNXsKTEcCAMCtNHsCa1tq6gQYd/b4h7v18hcZ6hxs06czf6TwIH/TkQAAOK/O6wRWuO7XP+mtnlHBOlpaqT8s2ykP6IAAALQJykgbCfDz0dPXD5av1aKPdubrgx2HTEcCAMAtUEba0ICuYfrVpT0lSX9ctkv5xRWGEwEAYB5lpI3dfUl3DeoaJntFjX773tccrgEAeD3KSBvz87Fq7vWDZfO16vN9R/TmxmzTkQAAMIoyYkCPqGA99NM+kqTHP/xWmUfLDCcCAMAcyoght4xOVEpyhE5U1+rX7+xQrYPDNQAA70QZMcRqteivkwYq2OarrVmFmv95uulIAAAYQRkxqGunDnrkqrqr0T69cq++zbMbTgQAQNujjBj2/y7sqvF9o1Vd69TMf21XZU2t6UgAALQpyohhFotFc34+QOFB/tqTX6K/r9pvOhIAAG2KMuIGIkNseuLa/pKkF9cd0NasQsOJAABoO5QRN/HT/rH6+ZAucjilXy/ZrvKqGtORAABoE5QRN/LIzy5QbFiAMo+V68mP95iOAwBAm6CMuJGwQD/99f8NkiS9lpqlz/cdM
ZwIAIDzjzLiZi7q2VnTUhIkSb9992sVl1cbTgQAwPlFGXFDD1/eV0mdg5Rvr9Ds/3xjOg4AAOcVZcQNBfr7aO71g2S1SEv/e1Af78wzHQkAgPOGMuKmhnbrpLsu7i5J+t3SnTpcUmE4EQAA5wdlxI3dd1kv9Y0NVWF5tX73751yOrmZHgCg/aGMuDF/X6v+74ZB8vexatW3h/XO1lzTkQAAaHWUETfXJyZUD/yklyTpz//ZrZzj5YYTAQDQuigjHuCOsckaltBJpZU1evCdHXI4OFwDAGg/KCMewMdq0dzrB6mDv482ZhzXgvWZpiMBANBqKCMeIiEiSL+f2FeS9NSKPUo7XGI4EQAArYMy4kH+Z0Q3jesVqaoahx5YskPVtQ7TkQAAaDHKiAexWCx66rqBCgv009e5xXr+szTTkQAAaDHKiIeJCQvQY9f0lyQ9tyZNX+cWmQ0EAEALUUY80M8GxWniwFjVOJx6YMkOVVTXmo4EAECzUUY81F+u7q/IEJvSDpfqr5/sNR0HAIBmo4x4qE5B/vrf6wZKkl79KkOpB44ZTgQAQPNQRjzYJX2iNHlEvJxO6cF3dqikotp0JAAAXEYZ8XC/n9hP8eGBOlh0Qn9Z/q3pOAAAuIwy4uGCbb762/8bJItF+teWHK3aXWA6EgAALqGMtAMjkyN0x9hkSdLD/96p42VVhhMBANB0lJF24oEf91Kv6GAdLa3U75fulNPJzfQAAJ6BMtJOBPj56OnrB8vXatHHu/L1/vZDpiMBANAklJF2pH+XMN17WU9J0p/e36W84hOGEwEAcG4ulZHExERZLJbTlhkzZjS6zzvvvKM+ffooICBAAwYM0EcffdTi0Gjc3Rd316D4jrJX1Oi3737N4RoAgNtzqYxs3rxZeXl59cvKlSslSZMmTTrj9uvXr9fkyZN1++2367///a+uueYaXXPNNdq1a1fLk+OMfH2smjtpkGy+Vn2x/6je2JhtOhIAAGdlcbbgf53vv/9+LV++XPv375fFYjnt+RtuuEFlZWVavnx5/bpRo0Zp8ODBevHFF5v8Pna7XWFhYSouLlZoaGhz43qVBV9l6NH/7Fagn48+vm+sEjsHmY4EAPAyTf3+bvackaqqKr3xxhu67bbbzlhEJCk1NVXjx49vsG7ChAlKTU0962tXVlbKbrc3WOCaaSmJGt09Qieqa/XAku2qdXC4BgDgnppdRpYtW6aioiLdcsstjW6Tn5+v6OjoBuuio6OVn59/1teeM2eOwsLC6pf4+PjmxvRaVqtFf500SCE2X23LLtJLnx8wHQkAgDNqdhn55z//qcsvv1xxcXGtmUeSNGvWLBUXF9cvOTk5rf4e3qBLx0A98rMLJEn/t3Kfdh9ihAkA4H6aVUaysrK0atUqTZ8+/azbxcTEqKCg4eXJCwoKFBMTc9b9bDabQkNDGyxonuuGdtGP+0WrutapB5ZsV2VNrelIAAA00KwysmDBAkVFRWnixIln3S4lJUWrV69usG7lypVKSUlpztuiGSwWi+b8fIAigvy1J79Ez6zabzoSAAANuFxGHA6HFixYoGnTpsnX17fBc1OnTtWsWbPqH993331asWKF5s6dqz179mj27NnasmWL7rnnnpYnR5N1Drbp8WsHSJJeWndAW7OOG04EAMB3XC4jq1atUnZ2tm677bbTnsvOzlZeXl7949GjR2vx4sWaP3++Bg0apHfffVfLli1T//79W5YaLvtp/xj9fGgXOZzSA0t2qLyqxnQkAAAktfA6I22F64y0juIT1br8mc91qLhCN49K0GPXUAoBAOfPeb/OCDxPWKCf/jppkCTp9Q1ZWrfviOFEAABQRrzOmB6ddcvoREnSb9/doeLyarOBAABejzLihR76aR8ldw5Sgb1Sj3zAfYIAAGZRRrxQoL+P5l4/SFaLtGz7IX20M+/cOwEAcJ5QRrzUkG6ddPfFPSRJv1+6U4dLKgwnAgB4K8qIF7v3sp7qFxuqwvJqz
XpvpzzgxCoAQDtEGfFi/r5W/d8Ng+XvY9XqPYf1zpZc05EAAF6IMuLleseE6Nc/6SVJevQ/3yjneLnhRAAAb0MZgaaPTdbwxE4qq6rVg+/skMPB4RoAQNuhjEA+VovmThqsDv4+2phxXK9+lWE6EgDAi1BGIEnqFtFBf5jYT5L0v5/s1f6CEsOJAADegjKCepNHxOvi3pGqqnHogSU7VF3rMB0JAOAFKCOoZ7FY9NR1AxUW6KedB4v13Jo005EAAF6AMoIGokMD9JeTd/N97rM0fZ1bZDYQAKDdo4zgNFcNitOVA2NV63Bq5r+2q6K61nQkAEA7RhnBGT12dX9Fhdh04EiZ/nfFXtNxAADtGGUEZ9QpyF9PXTdQkvTqVxlaf+Co4UQAgPaKMoJGXdInSpNHdJMk/eadr1VSUW04EQCgPaKM4Kx+P7Gv4sMDdbDohB5bvtt0HABAO0QZwVkF23w1d9JgWSzSki25Wrv3sOlIAIB2hjKCcxqRFK5bRydJkp5euU9OJ/euAQC0HsoImmTGJd0V6Oejr3OLtW7fEdNxAADtCGUETRIRbNOUkXWTWZ9dk8boCACg1VBG0GS/+FGy/H2t2ppVqNT0Y6bjAADaCcoImiwqNEA3Do+XJD27mvvWAABaB2UELvnluO7y87EoNf2YtmQeNx0HANAOUEbgki4dA3Xd0K6S6uaOAADQUpQRuOzui3vIx2rRun1HtCOnyHQcAICHo4zAZd0iOujqwXGSpOc+Y3QEANAylBE0y90X95DFIq3cXaBv8+ym4wAAPBhlBM3SIypYEwfESpKeY+4IAKAFKCNotnsu7SFJ+mhXntIOlxhOAwDwVJQRNFufmFD9pF+0nE7p+c8OmI4DAPBQlBG0yK8u7SlJen/7QWUeLTOcBgDgiSgjaJEBXcN0ce9IOZzSvLWMjgAAXEcZQYudGh15b1uucgvLDacBAHgaygha7MKEThrTI0I1DqdeXMfoCADANS6XkYMHD+qmm25SRESEAgMDNWDAAG3ZsqXR7deuXSuLxXLakp+f36LgcC/3XFI3OrJkc64K7BWG0wAAPIlLZaSwsFBjxoyRn5+fPv74Y+3evVtz585Vp06dzrnv3r17lZeXV79ERUU1OzTcz6jkcA1P7KSqWodeWpduOg4AwIP4urLxU089pfj4eC1YsKB+XVJSUpP2jYqKUseOHV0KB89hsVj0q0t7auqrm7R4U5buvqS7OgfbTMcCAHgAl0ZGPvjgAw0bNkyTJk1SVFSUhgwZopdffrlJ+w4ePFixsbH68Y9/rK+++qpZYeHexvbsrEFdw1RR7dArX2SYjgMA8BAulZH09HTNmzdPPXv21CeffKK77rpL9957rxYtWtToPrGxsXrxxRf13nvv6b333lN8fLwuvvhibdu2rdF9KisrZbfbGyxwf6dGRyTp9dRMFZZVGU4EAPAEFqfT6Wzqxv7+/ho2bJjWr19fv+7ee+/V5s2blZqa2uQ3HTdunLp166bXX3/9jM/Pnj1bjz766Gnri4uLFRoa2uT3QdtzOp264h9f6ts8u+69rKce+HEv05EAAIbY7XaFhYWd8/vbpZGR2NhY9evXr8G6vn37Kjs726VwI0aMUFpa4zdXmzVrloqLi+uXnJwcl14f5tSNjtTds2bBVxmyV1QbTgQAcHculZExY8Zo7969Ddbt27dPCQkJLr3p9u3bFRsb2+jzNptNoaGhDRZ4jp9eEKMeUcEqqajRa+szTccBALg5l8rIzJkztWHDBj3xxBNKS0vT4sWLNX/+fM2YMaN+m1mzZmnq1Kn1j5955hm9//77SktL065du3T//fdrzZo1DfZB+2K1WnTPJXWjI//8MkNllTWGEwEA3JlLZWT48OFaunSp3nrrLfXv31+PPfaYnnnmGU2ZMqV+m7y8vAaHbaqqqvTrX/9aAwYM0Lhx47Rjxw6tWrVKl112Wev9FnA7Vw6MVWJEBxWWV+vNjVmm4wAA3JhLE1hNa
eoEGLiXJVty9Nt3v1bnYJu+fOgSBfj5mI4EAGhD52UCK+CKa4d0UZeOgTpaWqm3N7k2yRkA4D0oIzhv/Hysuuvi7pKkF9elq7Km1nAiAIA7oozgvJo0rKuiQ23Kt1fova0HTccBALghygjOK5uvj375o7rRkRfWpqm61mE4EQDA3VBGcN5NHtFNnYP9lVt4Qsv+y+gIAKAhygjOu0B/H00fmyxJemHtAdU63P4ELgBAG6KMoE3cNCpBHTv4KeNomZZ/fch0HACAG6GMoE0E23x1+5gkSdLzn6XJwegIAOAkygjazNTRiQqx+WpfQak+3Z1vOg4AwE1QRtBmwgL9dMuYREnSs2vS5AEX/wUAtAHKCNrUbWOS1MHfR98csmvNnsOm4wAA3ABlBG2qU5C/bh6VIInREQBAHcoI2tz0sckK8LNqe06Rvkw7ajoOAMAwygjaXGSITZNHdJMkPbs6zXAaAIBplBEY8csfdZe/j1WbMo9rY/ox03EAAAZRRmBETFiAJg3rKqlu7ggAwHtRRmDMneO6y9dq0ZdpR7Utu9B0HACAIZQRGBMf3kHXDukiSXqO0REA8FqUERh19yU9ZLVIa/Yc1q6DxabjAAAMoIzAqKTOQbpqUJwkRkcAwFtRRmDcPZf0kMUirfgmX3vzS0zHAQC0McoIjOsZHaLL+8dIkp77jNERAPA2lBG4hRmX9JAkLf/6kA4cKTWcBgDQligjcAsXxIVpfN8oOZ3SC58dMB0HANCGKCNwG/dc2lOStGz7QWUfKzecBgDQVigjcBuD4ztqbM/OqnU4NW8doyMA4C0oI3Ar915WNzry7tYcHSo6YTgNAKAtUEbgVoYnhmtkUriqa52a/3m66TgAgDZAGYHbOTU68tambB0uqTCcBgBwvlFG4HZGd4/Q0G4dVVnj0MuMjgBAu0cZgduxWCz61ckza97YkK3jZVWGEwEAzifKCNzSxb0jNaBLmE5U1+qfXzI6AgDtGWUEbsliseieS+uuyrpofZaKy6sNJwIAnC+UEbitH/eNVu/oEJVW1mjh+kzTcQAA5wllBG7Lav1udOTVrzJUUsHoCAC0R5QRuLUrBsQqOTJIxSeq9fqGLNNxAADnAWUEbs3HatGMi+tGR175IkPlVTWGEwEAWhtlBG7v6sFxig8P1PGyKi3emG06DgCglblcRg4ePKibbrpJERERCgwM1IABA7Rly5az7rN27VoNHTpUNptNPXr00MKFC5ubF17I18equ0+Ojsz/PF0V1bWGEwEAWpNLZaSwsFBjxoyRn5+fPv74Y+3evVtz585Vp06dGt0nIyNDEydO1CWXXKLt27fr/vvv1/Tp0/XJJ5+0ODy8x3VDuyouLECHSyr1zpYc03EAAK3I4nQ6nU3d+OGHH9ZXX32lL774oslv8NBDD+nDDz/Url276tfdeOONKioq0ooVK5r0Gna7XWFhYSouLlZoaGiT3xvty2upmfrT+98oLixAa39zifx9OcoIAO6sqd/fLv01/+CDDzRs2DBNmjRJUVFRGjJkiF5++eWz7pOamqrx48c3WDdhwgSlpqY2uk9lZaXsdnuDBbh+WLwiQ2w6VFyhpf/NNR0HANBKXCoj6enpmjdvnnr27KlPPvlEd911l+69914tWrSo0X3y8/MVHR3dYF10dLTsdrtOnDhxxn3mzJmjsLCw+iU+Pt6VmGinAvx89MsfJUuSnv/sgGpqHYYTAQBag0tlxOFwaOjQoXriiSc0ZMgQ/eIXv9Add9yhF198sVVDzZo1S8XFxfVLTg5zBFDnf0Z2U3iQv7KPl+uDHYdMxwEAtAKXykhsbKz69evXYF3fvn2Vnd346ZYxMTEqKChosK6goEChoaEKDAw84z42m02hoaENFkCSOvj76vaLkiRJz3+WplpHk6c8AQDclEtlZMyYMdq7d2+Ddfv27VNCQkKj+6SkpGj16tUN1q1cuVIpKSmuvDVQb2pKgsIC/XTgSJk+3pVnOg4AoIVcK
iMzZ87Uhg0b9MQTTygtLU2LFy/W/PnzNWPGjPptZs2apalTp9Y/vvPOO5Wenq7f/va32rNnj1544QUtWbJEM2fObL3fAl4lJMBPt45JlCQ9tyZNDkZHAMCjuVRGhg8frqVLl+qtt95S//799dhjj+mZZ57RlClT6rfJy8trcNgmKSlJH374oVauXKlBgwZp7ty5euWVVzRhwoTW+y3gdW4dnaRgm6/25Jdo1bcF594BAOC2XLrOiClcZwRn8r8r9uiFtQc0oEuYPrhnjCwWi+lIAIDvOS/XGQHcye0XJSnQz0c7DxZr7b4jpuMAAJqJMgKPFRFs05SR3SRJz67eLw8Y5AMAnAFlBB7tFz9Klr+vVduyi5R64JjpOACAZqCMwKNFhQboxuF1V+h9dk2a4TQAgOagjMDj3Tmuu/x8LEpNP6YtmcdNxwEAuIgyAo8X1zFQ/+/CrpKkfzA6AgAehzKCduGucT3kY7Xo831HtCOnyHQcAIALKCNoF7pFdNDVg+MkMXcEADwNZQTtxoxLeshikVZ9W6Ddh+ym4wAAmogygnaje2SwJg6IlVR3R18AgGegjKBduefSHpKkj3blKe1wieE0AICmoIygXekTE6qf9IuW01l3R18AgPujjKDd+dWlPSVJH+w4pMyjZYbTAADOhTKCdmdA1zBd0jtSDqf0wlpGRwDA3VFG0C7dc3J05N/bDirneLnhNACAs6GMoF26MKGTxvSIUI3DqZc+P2A6DgDgLCgjaLdOzR1ZsjlX+cUVhtMAABpDGUG7NTIpXMMTO6mq1qH5n6ebjgMAaARlBO2WxWKpHx1ZvClLR0srDScCAJwJZQTt2tienTUovqMqqh16+QtGRwDAHVFG0K5ZLBb96pK6q7K+kZqlwrIqw4kAAD9EGUG7d1nfKPWNDVVZVa0WfJVhOg4A4AcoI2j36uaO1I2OLFifKXtFteFEAIDvo4zAK/z0ghj1jApWSUWNXlufaToOAOB7KCPwClarpf6Ovv/8MkNllTWGEwEATqGMwGtMHBCrxIgOKiyv1hsbskzHAQCcRBmB1/D1seruk2fWvPxFuiqqaw0nAgBIlBF4mWuHdFGXjoE6WlqltzZlm44DABBlBF7Gz8equy7uLkl6aV26KmsYHQEA0ygj8DqThnVVTGiA8u0Vendrruk4AOD1KCPwOjZfH/1yXLIkad7aA6qudRhOBADejTICr3Tj8G7qHOyv3MITWvbfg6bjAIBXo4zAKwX6++iOsXWjIy+sPaBah9NwIgDwXpQReK0poxLUsYOfMo6WafnXh0zHAQCvRRmB1wq2+er2MUmSpOfWpMnB6AgAGEEZgVebNiZRIQG+2n+4VJ98k286DgB4JcoIvFpogJ9uGZ0oSXp2TZqcTkZHAKCtuVRGZs+eLYvF0mDp06dPo9svXLjwtO0DAgJaHBpoTbeNSVKQv49259m1Zs9h03EAwOv4urrDBRdcoFWrVn33Ar5nf4nQ0FDt3bu3/rHFYnH1LYHzqlOQv25KSdBL69L1jzVpurRPFP+eAkAbcrmM+Pr6KiYmpsnbWywWl7YHTJh+UbIWrc/UjpwifZl2VGN7RpqOBABew+U5I/v371dcXJySk5M1ZcoUZWef/WZjpaWlSkhIUHx8vK6++mp98803zQ4LnC+RITZNHtFNkvTs6jTDaQDAu7hURkaOHKmFCxdqxYoVmjdvnjIyMjR27FiVlJSccfvevXvr1Vdf1fvvv6833nhDDodDo0ePVm7u2e8HUllZKbvd3mABzrdf/qi7/H2s2pR5XBvSj5mOAwBew+JswekDRUVFSkhI0NNPP63bb7/9nNtXV1erb9++mjx5sh577LFGt5s9e7YeffTR09YXFxcrNDS0uXGBc/r90p16c2O2LurRWW9MH2k6DgB4NLvdrrCwsHN+f7fo1N6OHTuqV69eSktr2rC2n5+fhgwZcs7tZ82apeLi4volJyenJTGBJrtzXHf5Wi36Mu2otmUXm
o4DAF6hRWWktLRUBw4cUGxsbJO2r62t1c6dO8+5vc1mU2hoaIMFaAvx4R107ZAukqRnV+83nAYAvINLZeTBBx/UunXrlJmZqfXr1+vaa6+Vj4+PJk+eLEmaOnWqZs2aVb/9n//8Z3366adKT0/Xtm3bdNNNNykrK0vTp09v3d8CaEUzLukhq0X6bO8R7TpYbDoOALR7LpWR3NxcTZ48Wb1799b111+viIgIbdiwQZGRdadBZmdnKy8vr377wsJC3XHHHerbt6+uuOIK2e12rV+/Xv369Wvd3wJoRYmdg/SzQXGSpGfXMDoCAOdbiyawtpWmToABWsv+ghL95JnP5XRKK+4fqz4x/HsHAK5qkwmsQHvVMzpEl/evu1jf858dMJwGANo3ygjQiHsu6SlJWv71IR04Umo4DQC0X5QRoBH94kI1vm+UnE7p+c+4KisAnC+UEeAsfnVp3ejI+9sPKftYueE0ANA+UUaAsxgU31E/6hWpWodT89YxOgIA5wNlBDiHX13aQ5L07tZcHSo6YTgNALQ/lBHgHIYnhmtUcriqa516aR1n1gBAa6OMAE1wau7IW5tzdNheYTgNALQvlBGgCUZ3j9DQbh1VVePQy1+km44DAO0KZQRoAovFol9dVjc68saGbB0rrTScCADaD8oI0EQX94rUgC5hOlFdq39+mWE6DgC0G5QRoIksFovuOXlmzWupWSourzacCADaB8oI4IIf941Wn5gQlVbWaMF6RkcAoDVQRgAXWK0WzbikbnTk1S8zVFLB6AgAtBRlBHDRFQNilRwZJHtFjV7fkGU6DgB4PMoI4CIfq0X3nBwdeeWLDJVX1RhOBACejTICNMPPBsWpW3gHHS+r0uKN2abjAIBHo4wAzeDrY9XdF3eXJM3/PF0V1bWGEwGA56KMAM3086FdFRcWoMMllVqyJcd0HADwWJQRoJn8fa268+ToyItrD6iqxmE4EQB4JsoI0ALXD4tXZIhNh4or9O9tuabjAIBHoowALRDg56Nf/ihZkvTC2gOqqWV0BABcRRkBWuh/RnZTeJC/so+X64Mdh0zHAQCPQxkBWqiDv6+mj02SJD33WZpqHU7DiQDAs1BGgFZw86gEhQX6Kf1ImT7amWc6DgB4FMoI0ApCAvx065hESdJza9LkYHQEAJqMMgK0kltHJynY5qu9BSVa+W2B6TgA4DEoI0ArCevgp6kpCZKkZ9fsl9PJ6AgANAVlBGhFt1+UpEA/H+06aNdrqVmq5lRfADgnygjQiiKCbbr55OjIIx98o4ueWqNnVu3TYXuF4WQA4L4sTg8YS7bb7QoLC1NxcbFCQ0NNxwHOqrKmVs9/dkCLN2braGmlJMnXatGE/jGaOipBI5LCZbFYDKcEgPOvqd/flBHgPKmqcejjXXl6PTVLW7IK69f3iQnRzSkJumZwFwXZfA0mBIDzizICuJFvDhXrjQ1ZWvbfQzpRXStJCrH56roLu+rmlAR1jww2nBAAWh9lBHBDxSeq9e7WXL2emqnMY+X16y/q0Vk3pyTosj5R8vVhKheA9oEyArgxh8OpL9KO6vXUTK3ec1in/iuMCwvQlFEJumF4vDoH28yGBIAWoowAHiLneLne3Jitf23OVmF5tSTJ38eqKwbE6OaURA3t1pEJrwA8EmUE8DAV1bX68Os8vbYhSztyiurXXxAXqmkpibpqUJwC/X3MBQQAF1FGAA/2dW6RXkvN0gc7Dqmqpu7CaWGBfrp+WFfdNCpBCRFBhhMCwLk19fvbpZlys2fPlsViabD06dPnrPu888476tOnjwICAjRgwAB99NFHrrwl4JUGdu2ov00apA2zLtPDl/dR106BKj5RrZe/yNDFf1urWxZs0po9BdyQD0C74PJFDi644AKtWrXquxfwbfwl1q9fr8mTJ2vOnDm68sortXjxYl1zzTXatm2b+vfv37zEgBcJD/LXneO6646xyVq797BeS83Sun1HtHZv3dItvINuGtVN1w+LV
8cO/qbjAkCzuHSYZvbs2Vq2bJm2b9/epO1vuOEGlZWVafny5fXrRo0apcGDB+vFF19sckgO0wDfyTxapjc2ZGnJlhzZK2okSTZfq342KE5TUxI1oGuY4YQAUOe8HKaRpP379ysuLk7JycmaMmWKsrOzG902NTVV48ePb7BuwoQJSk1NPet7VFZWym63N1gA1EnsHKQ/XNlPG383Xk/+fID6xYaqssahd7bm6qrnvtQ1z3+lf2/LVcXJi6sBgLtzqYyMHDlSCxcu1IoVKzRv3jxlZGRo7NixKikpOeP2+fn5io6ObrAuOjpa+fn5Z32fOXPmKCwsrH6Jj493JSbgFQL9fXTjiG768N6L9N5dKbp6cJz8fCzanlOkB5bs0Ogn1+ipFXuUW1h+7hcDAINadDZNUVGREhIS9PTTT+v2228/7Xl/f38tWrRIkydPrl/3wgsv6NFHH1VBQUGjr1tZWanKysr6x3a7XfHx8RymAc7hSEml/rU5W29uzFZecd2dgq0W6dI+0ZqakqCLenSW1co1SwC0jaYepmnRXbo6duyoXr16KS0t7YzPx8TEnFY6CgoKFBMTc9bXtdlsstm4+iTgqsgQm+65tKfuHNddq749rNc3ZOqrtGNa9W2BVn1boOTOQbppVIKuu7CrwgL9TMcFAEnNmDPyfaWlpTpw4IBiY2PP+HxKSopWr17dYN3KlSuVkpLSkrcFcA6+Plb9tH+M3pw+SqseGKdbRicq2Oar9KNl+vPy3Rr1xGrN+vdOfZvHfCwA5rl0mObBBx/UVVddpYSEBB06dEiPPPKItm/frt27dysyMlJTp05Vly5dNGfOHEl1p/aOGzdOTz75pCZOnKi3335bTzzxhMun9nI2DdBypZU1Wvrfg3o9NVP7Ckrr1w9P7KSbUxL10wti5O/LTfoAtJ7zcpgmNzdXkydP1rFjxxQZGamLLrpIGzZsUGRkpCQpOztbVut3f8xGjx6txYsX6w9/+IN+97vfqWfPnlq2bBnXGAEMCLb56uZRCbppZDdtzDiu11Oz9Mk3+dqcWajNmYWKDLFp8vB4/c/IBMWEBZiOC8CLcDl4wIsV2Cu0eGO23tqUrcMldZPGfawW/aRftG5OSVBKcgQ36QPQbNybBkCTVdc69Mk3+XotNUubMo7Xr+8ZFaypKQm6dmhXBdtaNN8dgBeijABolj35dr2WmqVl/z2o8qq6C6cF23z186FdNDUlQT2iQgwnBOApKCMAWsReUa33tubq9Q1ZSj9SVr8+JTlCU1MS9ON+0fL1YcIrgMZRRgC0CqfTqa/Sjum11Eyt+rZAp24UHBsWoP8Z0U03juimyBCuCwTgdJQRAK3uYNEJLd6Ypbc35ehYWZUkyc/Hosv7x2pqSoIuTOjEhFcA9SgjAM6byppafbQzT6+lZum/2UX16/vGhmpqSoKuHhynDv5MeAW8HWUEQJvYdbBYr6Vm6v3th1RZ45AkhQT4atKF8bo5JUFJnYMMJwRgCmUEQJsqKq/SO1vqJrxmH//uTsFje3bW1JREXdonSj7cpA/wKpQRAEY4HE6t239Er6dm6bO9h3XqL0yXjoG6aVSCbhger/Agf7MhAbQJyggA47KPlevNjVn615YcFZVXS5L8fa26cmCspqYkanB8R7MBAZxXlBEAbqOiulYf7Dik11OztPNgcf36gV3DdPOoBF0+IJYrvALtEGUEgNtxOp3anlOk11OztPzrPFXV1k149bFaNKBLmEYmh2tUcoSGJXRSSICf4bQAWooyAsCtHSut1L+25GjJ5hxlHitv8JyP1aL+caEalRyhkcnhGpYYrlDKCeBxKCMAPMbBohPamH5MG9KPaWPGcWX9oJxYLVL/LmEamXRy5CQxXGGBlBPA3VFGAHisvOIT2ph+XBtOFpQfjpxYLVK/uFCNSorQqOQIDU+inADuiDICoN3IL67QxoyTIyfpx5V+tKzB8xaL1C82VCOTIjQqOVwjksLVsQOnDwOmUUYAt
FsF9or6Qzob0o81uKuwVFdO+sSEatTJCbEjEsPViWubAG2OMgLAaxy2V9QXk40Zx5V2uPS0bfrEhGhU8qmRkwguvAa0AcoIAK91pKRSGzOO1c872X+GctI7OuS7kZOkcEUE2wwkBdo3yggAnHS0tFKbMr6bELuv4PRy0is6uO5U4qS604k7U06AFqOMAEAjjn2vnGzMOK49+SWnbdMzKrj+ImwjkyIUGUI5AVxFGQGAJjpeVqVNGce04eRhnTOVk+6RQScvwhahUUnhigoNMJAU8CyUEQBopsKyKm3KPF5/KvG3+Xb98C9lcmRQ/anEo5IjFE05AU5DGQGAVlJUXqVNGcfrz9jZnXd6OUnqHKRRyeEnC0qEYsIoJwBlBADOk+Lyam0+OXKyIeOYdh+yy/GDv6SJER3qikn3uoIS1zHQTFjAIMoIALSR4hPV2pL53YTYXQeLTysn3cI7fDdy0j1CXSgn8AKUEQAwxF5Rra2ZhfWnEu88QzmJDw+sP6QzMilc8eEdzIQFziPKCAC4iZKKam3JKqyfELvzYLFqf9BOunQMPHm2TrhSkiPUtVOgLBaLocRA66CMAICbKq2s0ZbM7ybEfp175nIyMin85CXsIxQfTjmB56GMAICHKKus0das7w7rfJ1brJoflJPYsACNSo7QgC5hSuocpMTOQeraKVB+PlZDqYFzo4wAgIcqr/qunGxMP64duUWqrj39T7Wv1aKunQKV2DlIiRFBSo6s+2dS5yDFdQyUj5WRFJhFGQGAduJEVa22ZRdq48n76mQeK1PmsTJVVDsa3cffx6r48MC6UZSIupGUUyMqsaEBslJU0AYoIwDQjjkcThWUVCjjaJkyj5Yr81iZ0o/UlZTsY+Wqqm28qNh8rScLSoe6kvK9shIVYmNuCloNZQQAvFStw6lDRSfqRlCOlinjZFnJPFqm7OPlp81H+b4O/j5KiAhSUucO9Yd8To2oRAT5U1TgEsoIAOA0NbUOHSw6ofSjdeUk82iZMo6VK/NomXILy0+7Hsr3hdh86+andA5SUkSH+p+TOwepYwf/tvsl4DEoIwAAl1TVOJRTWH5yNKXukM+pw0CHik+cdj+e7+vYwa9+JOXUIaBTIyqhAX5t90vArTT1+9u3DTMBANyYv69V3SOD1T0y+LTnKqprlX28/GQ5aVhU8u0VKiqv1vbyIm3PKTpt34gg/9PO+Ek8eRgoyMbXEFpYRp588knNmjVL9913n5555pkzbrNw4ULdeuutDdbZbDZVVFS05K0BAG0owM9HvaJD1Cs65LTnyqtq6ifRNiwr5TpaWqljZVU6VlalrVmFp+0bFWL7wSTaDvXFJcDPpy1+NbiBZpeRzZs366WXXtLAgQPPuW1oaKj27t1b/5gJUADQfnTw91W/uFD1izt9GL6kolpZx74bUck4WqaMk5NpC8urdbikUodLKrUp4/hp+8aFBXxvjsp3ZSU+vINsvhSV9qRZZaS0tFRTpkzRyy+/rL/85S/n3N5isSgmJqY5bwUA8GAhAX7q3yVM/buEnfZccXl1fTE5NUfl1M/2ihodKq7QoeIKrT9wrMF+VosU1/G7a6h8/4wfrkrrmZpVRmbMmKGJEydq/PjxTSojpaWlSkhIkMPh0NChQ/XEE0/oggsuaM5bAwDaibAOfhrcoaMGx3dssN7pdOp4WVX9oZ7M742mZB4tU1lVrXILTyi38IS+2H+0wb4+VoviT16VNukHS1xYIBd7c1Mul5G3335b27Zt0+bNm5u0fe/evfXqq69q4MCBKi4u1t/+9jeNHj1a33zzjbp27XrGfSorK1VZWVn/2G63uxoTAOChLBaLIoJtigi26cKE8AbPOZ1OHSmtrJujcrTsu1OUv3dV2sxj5co8Vq61e4802Nff16rEiO/O8knuHKSkzsFK7NxBkcFc7M0kl8pITk6O7rvvPq1cuVIBAQFN2iclJUUpKSn1j0ePHq2+ffvqpZde0mOPPXbGfebMmaNHH
33UlWgAAC9gsVgUFRKgqJAAjUhqWFQauyptxtFSZR8vV1WNQ/sKSrWvoPS01w22+daXlKSTReXUz2GBnJp8vrl0nZFly5bp2muvlY/PdxOHamtrZbFYZLVaVVlZ2eC5xkyaNEm+vr566623zvj8mUZG4uPjuc4IAKBZamodOlRUofSjpfXzUtJPjqjkFp79GiqnTk3+4WGfxIggBfozkfZszst1Ri677DLt3Lmzwbpbb71Vffr00UMPPdSkIlJbW6udO3fqiiuuaHQbm80mm83mSjQAABrl62NVt4gO6hbRQerd8LmK6lrlnLyGyg+XwyVnPzU5NizgB4d96n6O79RB/r5MpG0ql8pISEiI+vfv32BdUFCQIiIi6tdPnTpVXbp00Zw5cyRJf/7znzVq1Cj16NFDRUVF+utf/6qsrCxNnz69lX4FAACaL8DPRz2jQ9TzDNdQKa2s+e6U5JPzU9JP/lx8olp5xRXKO8MZPz+cSPv9wz5MpD1dq1/6Ljs7W1brd22wsLBQd9xxh/Lz89WpUyddeOGFWr9+vfr169fabw0AQKsKtvk2empyYVlV/QTaH46onKiubXQirc3XqoSTE2mTOgcrqXOHk/8MUudg77wZIfemAQCgFTmdThXYK79XTkqVcbS8fiJtdW3jX7unJtKe6dCPJ06k5UZ5AAC4me9PpP3hYZ+DReeeSHumM37ceSItZQQAAA9yaiLt9w/9nPr5cEnlWfc9NZH2h0t8eAejV6SljAAA0E78cCLtqSX9SKnsFTWN7ndqIu2ZLvTWFhNpKSMAALRzTqdTheXVDeanZB79bnTlRHVto/vafK1KjAhS4skJtDcMj1dS56BWzXderjMCAADch8ViUXiQv8KD/HVhQqcGz52aSPv9+SmnDv3kHC9XZY1DewtKtLegRFKBftwvutXLSFNRRgAAaIcsFotiwgIUExag0d07N3iuptahg0UnGhzy6R5ppohIlBEAALyOr49VCRFBSogI0sW9z739+ca1agEAgFGUEQAAYBRlBAAAGEUZAQAARlFGAACAUZQRAABgFGUEAAAYRRkBAABGUUYAAIBRlBEAAGAUZQQAABhFGQEAAEZRRgAAgFEecddep9MpSbLb7YaTAACApjr1vX3qe7wxHlFGSkpKJEnx8fGGkwAAAFeVlJQoLCys0ectznPVFTfgcDh06NAhhYSEyGKxtNrr2u12xcfHKycnR6Ghoa32uu0Rn5Vr+Lyajs+q6fismo7PqunO52fldDpVUlKiuLg4Wa2NzwzxiJERq9Wqrl27nrfXDw0N5V/WJuKzcg2fV9PxWTUdn1XT8Vk13fn6rM42InIKE1gBAIBRlBEAAGCUV5cRm82mRx55RDabzXQUt8dn5Ro+r6bjs2o6Pqum47NqOnf4rDxiAisAAGi/vHpkBAAAmEcZAQAARlFGAACAUZQRAABglFeXkeeff16JiYkKCAjQyJEjtWnTJtOR3NLnn3+uq666SnFxcbJYLFq2bJnpSG5pzpw5Gj58uEJCQhQVFaVrrrlGe/fuNR3LLc2bN08DBw6sv8hSSkqKPv74Y9OxPMKTTz4pi8Wi+++/33QUtzR79mxZLJYGS58+fUzHclsHDx7UTTfdpIiICAUGBmrAgAHasmVLm+fw2jLyr3/9Sw888IAeeeQRbdu2TYMGDdKECRN0+PBh09HcTllZmQYNGqTnn3/edBS3tm7dOs2YMUMbNmzQypUrVV1drZ/85CcqKyszHc3tdO3aVU8++aS2bt2qLVu26NJLL9XVV1+tb775xnQ0t7Z582a99NJLGjhwoOkobu2CCy5QXl5e/fLll1+ajuSWCgsLNWbMGPn5+enjjz/W7t27NXfuXHXq1Kntwzi91IgRI5wzZsyof1xbW+uMi4tzzpkzx2Aq9yfJuXTpUtMxPMLhw4edkpzr1q0zHcUjdOrUyfnKK6+YjuG2SkpKnD179nSuXLnSOW7cOOd9991nOpJbeuSRR5yDB
g0yHcMjPPTQQ86LLrrIdAyn0+l0euXISFVVlbZu3arx48fXr7NarRo/frxSU1MNJkN7UlxcLEkKDw83nMS91dbW6u2331ZZWZlSUlJMx3FbM2bM0MSJExv83cKZ7d+/X3FxcUpOTtaUKVOUnZ1tOpJb+uCDDzRs2DBNmjRJUVFRGjJkiF5++WUjWbyyjBw9elS1tbWKjo5usD46Olr5+fmGUqE9cTgcuv/++zVmzBj179/fdBy3tHPnTgUHB8tms+nOO+/U0qVL1a9fP9Ox3NLbb7+tbdu2ac6cOaajuL2RI0dq4cKFWrFihebNm6eMjAyNHTtWJSUlpqO5nfT0dM2bN089e/bUJ598orvuukv33nuvFi1a1OZZPOKuvYCnmTFjhnbt2sWx6rPo3bu3tm/fruLiYr377ruaNm2a1q1bRyH5gZycHN13331auXKlAgICTMdxe5dffnn9zwMHDtTIkSOVkJCgJUuW6PbbbzeYzP04HA4NGzZMTzzxhCRpyJAh2rVrl1588UVNmzatTbN45chI586d5ePjo4KCggbrCwoKFBMTYygV2ot77rlHy5cv12effaauXbuajuO2/P391aNHD1144YWaM2eOBg0apL///e+mY7mdrVu36vDhwxo6dKh8fX3l6+urdevW6R//+Id8fX1VW1trOqJb69ixo3r16qW0tDTTUdxObGzsaeW/b9++Rg5reWUZ8ff314UXXqjVq1fXr3M4HFq9ejXHrNFsTqdT99xzj5YuXao1a9YoKSnJdCSP4nA4VFlZaTqG27nsssu0c+dObd++vX4ZNmyYpkyZou3bt8vHx8d0RLdWWlqqAwcOKDY21nQUtzNmzJjTLj+wb98+JSQktHkWrz1M88ADD2jatGkaNmyYRowYoWeeeUZlZWW69dZbTUdzO6WlpQ3+ryIjI0Pbt29XeHi4unXrZjCZe5kxY4YWL16s999/XyEhIfXzj8LCwhQYGGg4nXuZNWuWLr/8cnXr1k0lJSVavHix1q5dq08++cR0NLcTEhJy2ryjoKAgRUREMB/pDB588EFdddVVSkhI0KFDh/TII4/Ix8dHkydPNh3N7cycOVOjR4/WE088oeuvv16bNm3S/PnzNX/+/LYPY/p0HpOeffZZZ7du3Zz+/v7OESNGODds2GA6klv67LPPnJJOW6ZNm2Y6mls502ckyblgwQLT0dzObbfd5kxISHD6+/s7IyMjnZdddpnz008/NR3LY3Bqb+NuuOEGZ2xsrNPf39/ZpUsX5w033OBMS0szHctt/ec//3H279/fabPZnH369HHOnz/fSA6L0+l0tn0FAgAAqOOVc0YAAID7oIwAAACjKCMAAMAoyggAADCKMgIAAIyijAAAAKMoIwAAwCjKCAAAMIoyAgAAjKKMAAAAoygjAADAKMoIAAAw6v8DG/Xh48ZJ4o0AAAAASUVORK5CYII=\n" }, "metadata": {} } ], "source": [ "plt.plot(Sigma)" ] }, { "cell_type": "markdown", "metadata": { "id": "IR8CpAYe_J_K" }, "source": [ "#### Interpretando los tópicos\n" ] }, { "cell_type": "markdown", "metadata": { "id": "ra5f0S7FfKjG" }, "source": [ "La siguiente función solo toma la matriz de VT y obtiene las 8 palabras más importantes en este topico. Si quieren pueden variar este parametro para ver más palabras e inspeccionar los tópicos. 
This matters because LSI is an unsupervised method, so we cannot know a priori whether a topic is good or bad. It is up to us to give each topic its meaning:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "fBNIN0xE_J_K" }, "outputs": [], "source": [ "# Map each column index of the document-term matrix back to its word\n", "vocab = {value:key for (key, value) in vectorizer.vocabulary_.items()}\n", "\n", "def show_topics(a):\n", "    # NOTE: [-8:-1] keeps 7 words per topic, in ascending weight, and leaves out\n", "    # the single highest-weighted word; use [-8:] to include it.\n", "    top_words = lambda t: [vocab[i] for i in np.argsort(t)[-8:-1]]\n", "    topic_words = ([top_words(t) for t in a])\n", "    return [' '.join(t) for t in topic_words]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "IDhDFVY9_J_M", "outputId": "798f4133-6362-4947-bd1d-965506e9b09e" }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "['siempre anuncio superstar zapatilla aldub81stweeksary nuevo camiseta',\n", " 'necesitar gustar gana querer medalla cerveza corona',\n", " '50 10 terminar carrera ritmo correr acabo',\n", " 'cruzcampir gracias bimbo movistar mejor carrefour cruzcampo',\n", " 'mahou invitar estrella gustar arruinaunacitacon4palabra cruzcampir cerveza',\n", " 'cruzcampir comer gustar fundador lorenzo servitje osito',\n", " 'querer mejor alianza suzuki milko toyota movistar']" ] }, "metadata": {}, "execution_count": 19 } ], "source": [ "show_topics(VT)" ] }, { "cell_type": "markdown", "metadata": { "id": "WKCo8UOd_J_Q" }, "source": [ "Limitations of LSI:\n", " - LSI suffers from a problem known as \"sign indeterminacy\": the signs of the entries in the VT and USigma matrices depend on the algorithm used to compute them and on the initial conditions (initial random state). In this context, what does it mean for a topic to be related to a word through a negative value?" 
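, "\n", "\n", "A minimal sketch of the sign indeterminacy described above, using NumPy on a small made-up matrix (the matrix `A` and every variable name here are illustrative assumptions, not part of the notebook's data). Flipping the sign of one topic in both factors leaves the reconstruction unchanged, so the data alone cannot fix the sign:\n", "\n", "
```python
import numpy as np

# Toy document-term matrix (4 documents x 3 terms); the values are invented for illustration.
A = np.array([[2.0, 0.0, 1.0],
              [1.0, 1.0, 0.0],
              [0.0, 3.0, 1.0],
              [1.0, 0.0, 2.0]])

# Thin SVD: A = U @ diag(S) @ VT
U, S, VT = np.linalg.svd(A, full_matrices=False)

# Flip the sign of the first topic in both factors.
U_flipped = U.copy()
VT_flipped = VT.copy()
U_flipped[:, 0] *= -1
VT_flipped[0, :] *= -1

# Both factorizations rebuild the same matrix, so neither sign choice is preferred by the data.
reconstruction = U @ np.diag(S) @ VT
reconstruction_flipped = U_flipped @ np.diag(S) @ VT_flipped
same_reconstruction = np.allclose(reconstruction, reconstruction_flipped)
print(same_reconstruction)
```
", "\n", "Since every entry of a topic flips together, the relative signs of the words within one topic stay meaningful, while the overall sign of the topic does not."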
] }, { "cell_type": "markdown", "metadata": { "id": "no6V6rtO_J_Q" }, "source": [ "### NMF: Non-negative Matrix Factorization\n", "\n", "Motivation: instead of building our factors under the constraint that they be orthogonal, the idea is to build them so that they are non-negative." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "NQwyjnN8_J_R" }, "outputs": [], "source": [ "from sklearn.decomposition import NMF\n", "\n", "nmf = NMF(n_components=7, random_state=1234)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "cdTfBg2n_J_U" }, "outputs": [], "source": [ "# W1: document-topic weights; H1: topic-term weights\n", "W1 = nmf.fit_transform(vectors)\n", "H1 = nmf.components_" ] }, { "cell_type": "markdown", "metadata": { "id": "lIYwacxIfxqQ" }, "source": [ "In this case, the matrix we are interested in is H1, which has one row per topic." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "S39Zrr06_J_a", "outputId": "d9b15015-0242-42f2-df98-702d5731152d" }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "(7, 6899)" ] }, "metadata": {}, "execution_count": 23 } ], "source": [ "H1.shape" ] }, { "cell_type": "markdown", "metadata": { "id": "qkEBkOc-fqtT" }, "source": [ "#### Interpreting the topics" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "X8qfO_Dx_J_d", "outputId": "97bbb9d9-f661-4897-c35a-d761dd72a9de" }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "['siempre anuncio superstar zapatilla nuevo aldub81stweeksary camiseta',\n", " 'tomar necesitar querer gana cerveza medalla corona',\n", " '50 10 terminar carrera ritmo correr acabo',\n", " 'carne alcampo comprar vender mejor manán gracias',\n", " 'galicia mejor estrella arruinaunacitacon4palabra gustar cruzcampir cerveza',\n", " 'grupo gustar comer fundador lorenzo servitje osito',\n", " 'gracias él poder mejor toyota milko 
movistar']" ] }, "metadata": {}, "execution_count": 24 } ], "source": [ "show_topics(H1)" ] }, { "cell_type": "markdown", "metadata": { "id": "cbMNkPv7_J_i" }, "source": [ "### LDA: Latent Dirichlet Allocation\n", "\n", "LDA is a Bayesian method based on the Dirichlet distribution, which is a distribution over probability vectors across K categories. LDA assumes that our documents belong to K distinct categories whose distribution is unknown; it does assume, however, that all the fragments that make up the text were generated by the same generative process." ] }, { "cell_type": "markdown", "metadata": { "id": "0bct-Ck5f6fs" }, "source": [ "The Dirichlet distribution is a generalization of the Beta distribution to a multidimensional space. Just as the Beta distribution is the conjugate prior of the binomial, the Dirichlet distribution is the conjugate prior of the multinomial.\n", "\n", "$$ P(w, d) = P(d)\\sum_k P(k\\mid d)P(w\\mid k) $$\n", "\n", "*Do you notice any similarity with SVD?*\n", "\n", "[David Blei, Andrew Ng, Michael Jordan: Latent Dirichlet Allocation (https://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf)](https://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "cN1vPUB3_J_j" }, "outputs": [], "source": [ "from sklearn.decomposition import LatentDirichletAllocation\n", "\n", "lda = LatentDirichletAllocation(n_components=7)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 74 }, "id": "d1dC1tN6_J_k", "outputId": "9f63ce3f-2762-42fa-f463-2402d597ebe0" }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "LatentDirichletAllocation(n_components=7)" ], "text/html": [ "
LatentDirichletAllocation(n_components=7)
" ] }, "metadata": {}, "execution_count": 26 } ], "source": [ "lda.fit(vectors)" ] }, { "cell_type": "markdown", "source": [ "#### Interpreting the topics\n", "\n", "We can list the 10 most relevant words of each of the 7 topics found by LDA as follows:" ], "metadata": { "id": "7aICU5GKtNEp" } }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "cF-MpBfDgk2O", "outputId": "76218443-732f-4008-d83d-20ee678b727b" }, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Topic 0 toyota heineken movistar suzuki carrefour cruzcampir mejor alianza necesitar tener\n", "Topic 1 bimbo mueble adidas bankia adida cliente montar carrefour mercadona amigo\n", "Topic 2 heineken milko él mercadona santander gana quiero favor carrefour banco\n", "Topic 3 adida mercadona camiseta cruzcampo cerveza hombre color estrella galicia nuevo\n", "Topic 4 heineken gracias gustar movistar siempre querer peugeot corona quedar mercadona\n", "Topic 5 acabo nikeplus correr adida ritmo heineken mejor milka refugiado mercadona\n", "Topic 6 cruzcampo bimbo chocolate adida poder banco bankia querer patrocinado heineken\n" ] } ], "source": [ "# argsort()[:-10 - 1:-1] walks the sorted indices backwards: the 10 highest weights, largest first\n", "for idx, topic in enumerate(lda.components_):\n", "    print(\"Topic \", idx, \" \".join(vocab[i] for i in topic.argsort()[:-10 - 1:-1]))" ] }, { "cell_type": "markdown", "source": [ "Another option is to use a library built specifically for these visualizations. `pyLDAvis` is a Python library for visualizing topic models. It is a port of the fabulous R package by Carson Sievert and Kenny Shirley.\n", "\n", "`pyLDAvis` is designed to help users interpret the topics in a topic model that has been fit to a corpus of text data. 
The package extracts information from an LDA model.\n", "\n", "The visualization is intended to be used inside a Jupyter notebook, but it can also be saved to a standalone HTML file for easy sharing." ], "metadata": { "id": "Cclo_eRivRwQ" } }, { "cell_type": "code", "source": [ "import pyLDAvis\n", "import pyLDAvis.lda_model\n", "\n", "pyLDAvis.enable_notebook()" ], "metadata": { "id": "eaPP3P_6tNkL", "outputId": "e2012257-d14d-4b91-934a-2571f8734bcd", "colab": { "base_uri": "https://localhost:8080/" } }, "execution_count": null, "outputs": [ { "output_type": "stream", "name": "stderr", "text": [ "/usr/local/lib/python3.10/dist-packages/ipykernel/ipkernel.py:283: DeprecationWarning: `should_run_async` will not call `transform_cell` automatically in the future. Please pass the result to `transformed_cell` argument and any exception that happen during thetransform in `preprocessing_exc_tuple` in IPython 7.17 and above.\n", " and should_run_async(code)\n" ] } ] }, { "cell_type": "markdown", "source": [ "> Note: the following cell works around a bug in the `pyLDAvis` library when it is used with recent versions of `scikit-learn`. You can skip the details." ], "metadata": { "id": "1MQDa_4hPNqr" } }, { "cell_type": "code", "source": [ "# Override two private pyLDAvis helpers: row sums give document lengths, column sums give term frequencies\n", "pyLDAvis.lda_model._get_doc_lengths = lambda dtm: dtm.sum(axis=1).ravel()\n", "pyLDAvis.lda_model._get_term_freqs = lambda dtm: dtm.sum(axis=0).ravel()" ], "metadata": { "id": "g2NeFT7zNVOD", "outputId": "a245d9c3-9280-4271-83d5-8034c388fa7f", "colab": { "base_uri": "https://localhost:8080/" } }, "execution_count": null, "outputs": [ { "output_type": "stream", "name": "stderr", "text": [ "/usr/local/lib/python3.10/dist-packages/ipykernel/ipkernel.py:283: DeprecationWarning: `should_run_async` will not call `transform_cell` automatically in the future. 
Please pass the result to `transformed_cell` argument and any exception that happen during thetransform in `preprocessing_exc_tuple` in IPython 7.17 and above.\n", " and should_run_async(code)\n" ] }, { "output_type": "execute_result", "data": { "text/plain": [ "" ] }, "metadata": {}, "execution_count": 40 } ] }, { "cell_type": "code", "source": [ "pyLDAvis.lda_model.prepare(lda, vectors, vectorizer, mds='tsne')" ], "metadata": { "id": "pgvdGqeluJSu", "outputId": "13417724-1835-4825-db74-e323caba8e4d", "colab": { "base_uri": "https://localhost:8080/", "height": 916 } }, "execution_count": null, "outputs": [ { "output_type": "stream", "name": "stderr", "text": [ "/usr/local/lib/python3.10/dist-packages/ipykernel/ipkernel.py:283: DeprecationWarning: `should_run_async` will not call `transform_cell` automatically in the future. Please pass the result to `transformed_cell` argument and any exception that happen during thetransform in `preprocessing_exc_tuple` in IPython 7.17 and above.\n", " and should_run_async(code)\n" ] }, { "output_type": "execute_result", "data": { "text/plain": [ "PreparedData(topic_coordinates= x y topics cluster Freq\n", "topic \n", "0 73.242546 10.512508 1 1 15.729860\n", "4 -135.212036 -19.932354 2 1 15.433575\n", "2 8.827081 93.134766 3 1 14.480544\n", "1 34.502068 -87.146095 4 1 14.194946\n", "3 -30.950050 -4.515550 5 1 14.092400\n", "5 -95.510620 77.838768 6 1 13.769977\n", "6 -69.408546 -102.083939 7 1 12.298696, topic_info= Term Freq Total Category logprob loglift\n", "336 adida 59.000000 59.000000 Default 30.0000 30.0000\n", "4534 nikeplus 9.000000 9.000000 Default 29.0000 29.0000\n", "1895 correr 9.000000 9.000000 Default 28.0000 28.0000\n", "267 acabo 11.000000 11.000000 Default 27.0000 27.0000\n", "6212 suzuki 9.000000 9.000000 Default 26.0000 26.0000\n", "... ... ... ... ... ... 
...\n", "3283 heineken 2.331148 58.332809 Topic7 -6.0977 -1.1241\n", "4194 mejor 1.883641 20.082832 Topic7 -6.3108 -0.2710\n", "1315 carrefour 1.898977 29.115251 Topic7 -6.3027 -0.6343\n", "4743 pagar 1.563327 7.003053 Topic7 -6.4972 0.5961\n", "6898 él 1.597090 19.333292 Topic7 -6.4758 -0.3980\n", "\n", "[455 rows x 6 columns], token_table= Topic Freq Term\n", "term \n", "42 7 0.660624 10800\n", "50 2 0.934872 11780\n", "76 5 0.704733 160mil\n", "107 6 0.639032 23\n", "226 5 0.835984 95\n", "... ... ... ...\n", "6898 3 0.362070 él\n", "6898 4 0.051724 él\n", "6898 5 0.051724 él\n", "6898 6 0.155173 él\n", "6898 7 0.103448 él\n", "\n", "[730 rows x 3 columns], R=30, lambda_step=0.01, plot_opts={'xlab': 'PC1', 'ylab': 'PC2'}, topic_order=[1, 5, 3, 2, 4, 6, 7])" ], "text/html": [ "\n", "\n", "\n", "\n", "
\n", "" ] }, "metadata": {}, "execution_count": 41 } ] }, { "cell_type": "markdown", "source": [ "For more information about the `pyLDAvis` library, we recommend reading [the original paper](http://nlp.stanford.edu/events/illvi2014/papers/sievert-illvi2014.pdf), which was presented at the [*ACL Workshop on Interactive Language Learning, Visualization, and Interfaces*](http://nlp.stanford.edu/events/illvi2014/) in Baltimore on June 27, 2014." ], "metadata": { "id": "XZ8bz63cwLrn" } } ], "metadata": { "colab": { "name": "Topic Modeling.ipynb", "provenance": [] }, "interpreter": { "hash": "bea38c2984299ac640e8421861d34b2e05ee614f6236d2975c05eeb77366835f" }, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.11" } }, "nbformat": 4, "nbformat_minor": 0 }