{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "<figure>\n", "<img src=\"../Imagenes/logo-final-ap.png\" width=\"80\" height=\"80\" align=\"left\"/> \n", "</figure>\n", "\n", "# <span style=\"color:#4361EE\"><left>Aprendizaje Profundo</left></span>" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "# <span style=\"color:red\"><center>Diplomado en Inteligencia Artificial y Aprendizaje Profundo</center></span>" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "# <span style=\"color:green\"><center>Modern document classification techniques</center></span>" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "<center>word2vec techniques</center>" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "<figure>\n", "<center>\n", "<img src=\"../Imagenes/Taxonomy.png\" width=\"400\" height=\"400\" align=\"center\"/>\n", "</center>\n", "</figure>\n", "\n", "<center>Source: <a href=\"https://commons.wikimedia.org/wiki/File:Taxonomic_Rank_Graph.svg\">Annina Breen</a>, <a href=\"https://creativecommons.org/licenses/by-sa/4.0\">CC BY-SA 4.0</a>, via Wikimedia Commons</center>" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## <span style=\"color:#4361EE\">Instructors</span>" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "1. Alvaro Montenegro, PhD, ammontenegrod@unal.edu.co\n", "1. Camilo José Torres Jiménez, MSc, cjtorresj@unal.edu.co\n", "1. Daniel Montenegro, MSc, dextronomo@gmail.com " ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## <span style=\"color:#4361EE\">Media and digital marketing advisors</span>" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "4. Maria del Pilar Montenegro, pmontenegro88@gmail.com\n", "5. Jessica López Mejía, jelopezme@unal.edu.co\n", "6. 
Venus Puertas, vpuertasg@unal.edu.co" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## <span style=\"color:#4361EE\">Head of Legal</span>" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "7. Paula Andrea Guzmán, guzmancruz.paula@gmail.com" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## <span style=\"color:#4361EE\">Legal Coordinator</span>" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "8. David Fuentes, fuentesd065@gmail.com" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## <span style=\"color:#4361EE\">Lead Developers</span>" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "9. Dairo Moreno, damoralesj@unal.edu.co\n", "10. Joan Castro, jocastroc@unal.edu.co\n", "11. Bryan Riveros, briveros@unal.edu.co\n", "12. Rosmer Vargas, rovargasc@unal.edu.co" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## <span style=\"color:#4361EE\">Database Experts</span>" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "13. Giovvani Barrera, udgiovanni@gmail.com\n", "14. Camilo Chitivo, cchitivo@unal.edu.co" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## <span style=\"color:#4361EE\">References</span> " ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "1. Adapted from [deep-learning-methods-for-text-data](https://towardsdatascience.com/understanding-feature-engineering-part-4-deep-learning-methods-for-text-data-96c44370bbfa).\n", "2. Mikolov et al. 2013a, Google, [Distributed Representations of Words and Phrases and their Compositionality](https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf).\n", "3. 
Xin Rong, 2016, [word2vec Parameter Learning Explained](https://arxiv.org/pdf/1411.2738.pdf).\n", "4. Mikolov et al. 2013b, Google, [Efficient Estimation of Word Representations in Vector Space](https://arxiv.org/pdf/1301.3781.pdf).\n", "5. Baroni et al., 2014, [Don’t count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors](https://www.aclweb.org/anthology/P14-1023.pdf)." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## <span style=\"color:#4361EE\">Contents</span> " ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "* [Introduction](#Introduction)\n", "* [Do not count, predict](#Do-not-count,-predict)\n", "* [Word embeddings](#Word-embeddings)\n", "* [Word2Vec models](#Word2Vec-models)\n", "* [CBOW model](#CBOW-model)\n", "* [Skip-gram model](#Skip-gram-model)\n", "* [What we can expect](#What-we-can-expect)\n", "* [Linguistic tasks](#Linguistic-tasks)\n", "* [Linguistic resources](#Linguistic-resources)\n", "* [Toy corpus](#Toy-corpus)\n", "* [Example: The King James Version of the Bible](#The-King-James-Version-of-the-Bible)\n", "* [Text preprocessing](#Text-preprocessing)\n", "* [Models with Gensim](#Models-with-Gensim)\n", "* [Visualizing embeddings with t-SNE](#Visualización-de-incrustaciones-con-TSNE)\n", "* [Document embeddings](#Incrustación-de-documentos)\n", "* [Pre-trained models. 
The GloVe model](#Modelos-pre-entrenados.-El-modelo-Glove)\n", "* [Introduction to spaCy](#Introducción-a-spaCy)\n", "* [Spanish models](#Modelos-en-Español)\n", "* [The FastText model](#El-modelo-FastText)\n", "* [Pre-trained Spanish embedding models](#Modelos-de-incrustaciones-del-Español-Preentrenados)\n", "* [Conclusion](#Conclusión)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## <span style=\"color:#4361EE\">Introduction</span>" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Working with unstructured text data is hard, especially when you are trying to build an intelligent system that interprets and understands free-flowing natural language the way humans do. \n", "\n", "\n", "Such a system must be able to process noisy, unstructured text and transform it into structured, vectorized formats that any machine learning algorithm can work with. \n", "\n", "\n", "The principles of natural language processing, machine learning, and deep learning, all of which fall under the broad umbrella of artificial intelligence, are effective tools of the trade. \n", "\n", "\n", "An important point to remember is that every machine learning algorithm rests on principles of statistics, mathematics, and optimization. \n", "\n", "These algorithms are therefore not intelligent enough to process text in its raw form, without preprocessing. \n", "\n", "In this lesson we review modern methods for topic discovery and document classification. 
We will use some pre-trained models that are freely available.\n", "\n", "More specifically, we will cover the `Word2Vec`, `GloVe`, and `FastText` models, and we will use the `nltk`, `gensim`, and `spacy` tools.\n", "\n", "We will not use TensorFlow in this lesson.\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## <span style=\"color:#4361EE\">Do not count, predict</span>" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "To overcome the shortcomings of the bag-of-words model, namely the loss of semantics and the sparsity of its features, we need vector space models (**Vector Space Models**, VSM), which let us embed word vectors in a continuous vector space based on semantic and contextual similarity. \n", "\n", "\n", "In fact, the *distributional hypothesis* in the field of *distributional semantics* tells us that **words that occur and are used in the same context are semantically similar to one another and have similar meanings**. \n", "\n", "\n", "In simple terms, **a word is characterized by the company it keeps**. A well-known paper that discusses these semantic word vectors and their various types in detail is [Don’t count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors](https://www.aclweb.org/anthology/P14-1023.pdf) by Baroni et al.\n", "\n", "\n", "We will not go into much depth but, in short, there are two main families of methods for contextual word vectors. \n", "\n", "- **Count-based methods**, such as *Latent Semantic Analysis* (LSA), compute statistical measures of how often words co-occur with their neighboring words in a corpus and then build dense word vectors for each word from those measures. 
\n", "- **Predictive methods**, such as neural-network-based language models, try to predict words from their neighboring words by looking at word sequences in the corpus and, in the process, learn distributed representations that give us dense word embeddings. \n", "\n", "We will focus on these predictive methods in this lesson.\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## <span style=\"color:#4361EE\">Word embeddings</span>" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "In speech or image recognition systems, all the information is already present as rich, dense feature vectors embedded in high-dimensional data such as audio spectrograms and image pixel intensities, as we have studied in other lessons.\n", "\n", "\n", "With raw text, however, and especially with count-based models such as bag of words, we are dealing with individual words that each have their own identifier, and these identifiers do not capture the semantic relationships between words. \n", "\n", "In earlier lessons we worked with the bag-of-words representation in the Latent Dirichlet Allocation (LDA) technique. 
\n", "\n", "This leads to huge, sparse word vectors for text data; if we do not have enough data, we can end up with poor models or even overfit the data because of the curse of dimensionality.\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## <span style=\"color:#4361EE\">Word2Vec models</span>" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "The Word2Vec family of models is unsupervised: you can simply give it a corpus without labels or any additional information, and it can build dense word embeddings from that corpus. \n", "\n", "Even so, once you have the corpus you still need to set up a supervised classification task in order to obtain these embeddings. \n", "\n", "We will do this from the corpus itself, without any auxiliary information. The CBOW architecture can be modeled as a deep learning classification model that takes the context words as its input, $X$, and tries to predict the target word, $Y$. " ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## <span style=\"color:#4361EE\">CBOW model</span>" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "\n", "<figure>\n", "<center>\n", "<img src=\"../Imagenes/cbow_1_palabra.png\" width=\"600\" height=\"400\" align=\"center\"/>\n", "</center>\n", "<figcaption>\n", "<p style=\"text-align:center\">Architecture of the CBOW model with one context word</p>\n", "</figcaption>\n", "</figure>\n", "\n", "Source: Alvaro Montenegro" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "The following image shows the general CBOW architecture."
 ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "\n", "<figure>\n", "<center>\n", "<img src=\"../Imagenes/CBOW.png\" width=\"300\" height=\"200\" align=\"center\"/>\n", "</center>\n", "<figcaption>\n", "<p style=\"text-align:center\">General architecture of the CBOW model</p>\n", "</figcaption>\n", "</figure>\n", "\n", "Source: \n", "[Efficient Estimation of Word Representations in Vector Space](https://arxiv.org/pdf/1301.3781.pdf)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Let us look at an example of how the data are prepared for CBOW.\n", "\n", "We will use the special symbol *PAD* to encode the missing positions (in TensorFlow we did this with the *pad_sequences* function; remember?).\n", "\n", "Consider the following English sentence (later we will show how to use the tools for Spanish):\n", "\n", "- *the quick brown fox jumps over the lazy dog*\n", "\n", "With a context window of size 2 on each side of the target word, the contexts are built as follows:\n", "\n", "1. (PAD, PAD, quick, brown) -> the\n", "2. (PAD, the, brown, fox) -> quick\n", "3. (the, quick, fox, jumps) -> brown\n", "4. (quick, brown, jumps, over) -> fox\n", "5. (brown, fox, over, the) -> jumps\n", "6. (fox, jumps, the, lazy) -> over\n", "7. (jumps, over, lazy, dog) -> the\n", "8. (over, the, dog, PAD) -> lazy\n", "9. (the, lazy, PAD, PAD) -> dog\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## <span style=\"color:#4361EE\">Skip-gram model</span>" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "The Skip-gram architecture essentially tries to do the opposite of what the CBOW model does. \n", "\n", "\n", "It tries to predict the source context words (the surrounding words) given a target word (the center word). Consider again our simple sentence from before, \"the quick brown fox jumps over the lazy dog\". 
\n", "\n", "With the CBOW model we get (context_window, target_word) pairs; taking one context word on each side of the target (a total window of size 2), we have examples such as ([quick, fox], brown), ([the, brown], quick), ([the, dog], lazy), and so on. \n", "\n", "Since the goal of the *skip-gram* model is to predict the context from the target word, the model typically inverts contexts and targets and tries to predict each context word from its target word. \n", "\n", "The task therefore becomes predicting the context [quick, fox] given the target word *brown*, or [the, brown] given the target word *quick*, and so on. \n", "\n", "In short, the model tries to predict the context_window words from the target_word.\n", "\n", "The following figures illustrate the Skip-gram architecture.\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "\n", "<figure>\n", "<center>\n", "<img src=\"../Imagenes/SKIP_gram.png\" width=\"300\" height=\"300\" align=\"center\"/>\n", "</center>\n", "<figcaption>\n", "<p style=\"text-align:center\">Skip-gram model</p>\n", "</figcaption>\n", "</figure>\n", "\n", "Source: \n", "[Dipanjan (DJ) Sarkar](https://towardsdatascience.com/understanding-feature-engineering-part-4-deep-learning-methods-for-text-data-96c44370bbfa)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "<figure>\n", "<center>\n", "<img src=\"../Imagenes/Skip-Gram-architecture.jpg\" width=\"600\" height=\"600\" align=\"center\"/>\n", "</center>\n", "<figcaption>\n", "<p style=\"text-align:center\">Skip-gram architecture</p>\n", "</figcaption>\n", "</figure>\n", "\n", "Source: \n", "[Exploring chemical space using natural language processing](https://www.researchgate.net/publication/339013257_Exploring_chemical_space_using_natural_language_processing_methodologies_for_drug_discovery/figures?lo=1)\n" ] }, { 
"attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "To train it, we feed the skip-gram model pairs $(X, Y)$, where $X$ is our input and $Y$ is our label. \n", "\n", "We do this using pairs [(target, context), 1] as positive input samples, where target is our word of interest, context is a word that occurs near the target word, and the positive label 1 indicates that the pair is contextually relevant. \n", "\n", "\n", "We also feed in pairs [(target, random), 0] as negative input samples, where target is again our word of interest but random is just a word picked at random from the vocabulary that has no context or association with the target word. \n", "\n", "\n", "The negative label 0 therefore indicates a contextually irrelevant pair. We do this so that the model can learn which word pairs are contextually relevant and which are not, and produce similar embeddings for semantically similar words.\n", "\n", "The [(target, random), 0] pairs are built by picking target words at random and pairing each one, also at random, with words with which it has not formed a context pair. The authors cited in the references propose more than one way to generate these pairs, based on different sampling schemes. It is usually suggested that there be as many positive pairs as negative ones.\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "For the same example as before, \n", "\n", "- *the quick brown fox jumps over the lazy dog*\n", "\n", "the positive pairs (with a window of size 2 on each side) are:\n", "\n", "1. the: (the, quick), (the, brown), (the, jumps), (the, over), (the, lazy), (the, dog)\n", "2. quick: (quick, the), (quick, brown), (quick, fox)\n", "3. brown: (brown, the), (brown, quick), (brown, fox), (brown, jumps)\n", "4. 
fox: (fox, quick), (fox, brown), (fox, jumps), (fox, over)\n", "5. jumps: (jumps, brown), (jumps, fox), (jumps, over), (jumps, the)\n", "6. over: (over, fox), (over, jumps), (over, the), (over, lazy)\n", "7. lazy: (lazy, over), (lazy, the), (lazy, dog)\n", "8. dog: (dog, the), (dog, lazy)\n", "\n", "An example of a negative pair is (quick, lazy)." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## <span style=\"color:#4361EE\">What we can expect</span> " ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Suppose we have obtained the following word2vec encoding for comic-book characters.\n", "\n", "word2vec(‘Batman’) = [0.9, 0.8, 0.2]\n", "\n", "word2vec(‘Joker’) = [0.8, 0.3, 0.1]\n", "\n", "word2vec(‘Spiderman’) = [0.2, 0.9, 0.8]\n", "\n", "word2vec(‘Thanos’) = [0.3, 0.1, 0.9]\n", "\n", "1. The first feature seems to represent membership in the DC Universe. Note that \"Batman\" and \"Joker\" have higher values for their first feature because they belong to the DC Universe.\n", "2. Perhaps the second element of the word2vec representation captures the hero/villain trait. That is why \"Batman\" and \"Spiderman\" have higher values, while \"Joker\" and \"Thanos\" have smaller ones.\n", "3. One could say that the third component of the word vectors represents supernatural powers or abilities. We all know that \"Batman\" and \"Joker\" have no superpowers, which is why their vectors have small numbers in the third position."
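, "\n", "\n", "As a rough illustration of how such vectors can be compared, the following cell computes the cosine similarity between the toy vectors above. The names `toy_vectors` and `cosine_similarity` are introduced here only for illustration, and the numbers are the made-up values of this example, not the output of a trained model. Cosine similarity depends only on the angle between two vectors, so characters that share large components come out as more similar." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "# hypothetical toy vectors copied from the example above (not learned embeddings)\n", "toy_vectors = {\n", "    'Batman': np.array([0.9, 0.8, 0.2]),\n", "    'Joker': np.array([0.8, 0.3, 0.1]),\n", "    'Spiderman': np.array([0.2, 0.9, 0.8]),\n", "    'Thanos': np.array([0.3, 0.1, 0.9]),\n", "}\n", "\n", "def cosine_similarity(u, v):\n", "    # cosine of the angle between u and v: dot product divided by the product of the norms\n", "    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))\n", "\n", "# compare Batman with every other character\n", "for other in ['Joker', 'Spiderman', 'Thanos']:\n", "    similarity = cosine_similarity(toy_vectors['Batman'], toy_vectors[other])\n", "    print('Batman', other, round(similarity, 3))"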
 ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "\n", "<figure>\n", "<center>\n", "<img src=\"../Imagenes/word2vec-models.jpg\" width=\"600\" height=\"600\" align=\"center\"/>\n", "</center>\n", "<figcaption>\n", "<p style=\"text-align:center\">word2vec architecture</p>\n", "</figcaption>\n", "</figure>\n", "\n", "Source: \n", "[word2vec architecture](https://figshare.com/articles/figure/_i_Word2Vec_i_architecture_The_figure_shows_two_variants_of_word2vec_architecture_CBOW_and_Skip_gram_26_for_a_sample_/11982951)\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## <span style=\"color:#4361EE\">Linguistic tasks</span> " ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "The paper by Baroni et al., 2014, [Don’t count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors](https://www.aclweb.org/anthology/P14-1023.pdf), compares the classical count-based methods of distributional semantics with the predictive models presented in this lesson. For the comparison, the authors carry out the following linguistic tasks:\n", "\n", "1. Semantic relatedness of terms.\n", "2. Synonym detection.\n", "3. Concept categorization.\n", "4. Selectional preferences (verb, noun): variety in language use.\n", "5. Analogies. 
[(brother, sister), grandson] -> granddaughter.\n", "\n", "We encourage the interested reader to read the paper.\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## <span style=\"color:#4361EE\">Linguistic resources</span> " ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "[NLTK](https://www.nltk.org/py-modindex.html) includes a small selection of texts from the Project Gutenberg electronic text archive, which contains some 25,000 free e-books and is hosted at [Project Gutenberg](http://www.gutenberg.org/).\n", "\n", "*Punkt sentence tokenizer*\n", "\n", "This tokenizer splits a text into a list of sentences, using an unsupervised algorithm to build a model of abbreviations, collocations, and sentence-starting words. It must be trained on a large collection of plain text in the target language before it can be used.\n", "\n", "The NLTK data package includes a pre-trained Punkt tokenizer for English." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## <span style=\"color:#4361EE\">Toy corpus</span>" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "To begin with, we will use the following *toy corpus*.\n", "\n", "Our toy corpus consists of documents belonging to several categories. \n", "\n", "Another corpus we will use in this lesson is the *King James Version of the Bible*, freely available from *Project Gutenberg* through the corpus module in nltk.\n", "\n", "We will load it shortly, in the next section. Before any analysis we need to preprocess and normalize the text."
 ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### <span style=\"color:#4CC9F0\">Import libraries (modules)</span>" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import re\n", "import nltk\n", "import matplotlib.pyplot as plt\n", "\n", "pd.options.display.max_colwidth = 200\n", "\n", "%matplotlib inline" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### <span style=\"color:#4CC9F0\">Toy corpus</span>" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>Document</th>\n", " <th>Category</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>The sky is blue and beautiful.</td>\n", " <td>weather</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>Love this blue and beautiful sky!</td>\n", " <td>weather</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>The quick brown fox jumps over the lazy dog.</td>\n", " <td>animals</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>A king's breakfast has sausages, ham, bacon, e...</td>\n", " <td>food</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>I love green eggs, ham, sausages and bacon!</td>\n", " <td>food</td>\n", " </tr>\n", " <tr>\n", " <th>5</th>\n", " <td>The brown fox is quick and the blue dog is lazy!</td>\n", " <td>animals</td>\n", " </tr>\n", " <tr>\n", " <th>6</th>\n", " <td>The sky is very blue and the sky is very beaut...</td>\n", " <td>weather</td>\n", " </tr>\n", " 
<tr>\n", " <th>7</th>\n", " <td>The dog is lazy but the brown fox is quick!</td>\n", " <td>animals</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " Document Category\n", "0 The sky is blue and beautiful. weather\n", "1 Love this blue and beautiful sky! weather\n", "2 The quick brown fox jumps over the lazy dog. animals\n", "3 A king's breakfast has sausages, ham, bacon, e... food\n", "4 I love green eggs, ham, sausages and bacon! food\n", "5 The brown fox is quick and the blue dog is lazy! animals\n", "6 The sky is very blue and the sky is very beaut... weather\n", "7 The dog is lazy but the brown fox is quick! animals" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "corpus = ['The sky is blue and beautiful.',\n", " 'Love this blue and beautiful sky!',\n", " 'The quick brown fox jumps over the lazy dog.',\n", " \"A king's breakfast has sausages, ham, bacon, eggs, toast and beans\",\n", " 'I love green eggs, ham, sausages and bacon!',\n", " 'The brown fox is quick and the blue dog is lazy!',\n", " 'The sky is very blue and the sky is very beautiful today',\n", " 'The dog is lazy but the brown fox is quick!' \n", "]\n", "labels = ['weather', 'weather', 'animals', 'food', 'food', 'animals', 'weather', 'animals']\n", "\n", "corpus = np.array(corpus)\n", "corpus_df = pd.DataFrame({'Document': corpus, \n", " 'Category': labels})\n", "corpus_df = corpus_df[['Document', 'Category']]\n", "corpus_df" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## <span style=\"color:#4361EE\">Text preprocessing</span>" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "We will preprocess each corpus differently."
 ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "# the stop-word list must be available locally (no-op if already downloaded)\n", "nltk.download('stopwords', quiet=True)\n", "\n", "wpt = nltk.WordPunctTokenizer()\n", "stop_words = nltk.corpus.stopwords.words('english')\n", "\n", "def normalize_document(doc):\n", "    # remove everything except ASCII letters and whitespace\n", "    # (flags must be passed by keyword: the fourth positional argument of re.sub is count)\n", "    doc = re.sub(r'[^a-zA-Z\\s]', '', doc, flags=re.I|re.A)\n", "    # convert to lower case\n", "    doc = doc.lower()\n", "    # strip leading and trailing whitespace\n", "    doc = doc.strip()\n", "    # tokenize the document\n", "    tokens = wpt.tokenize(doc)\n", "    # filter stop words out of the document\n", "    filtered_tokens = [token for token in tokens if token not in stop_words]\n", "    # rebuild the document from the filtered tokens\n", "    doc = ' '.join(filtered_tokens)\n", "    return doc\n", "\n", "# vectorized version of the function, so that it can act on multiple texts\n", "normalize_corpus = np.vectorize(normalize_document)\n", "#normalize_corpus" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([['sky blue beautiful', 'weather'],\n", " ['love blue beautiful sky', 'weather'],\n", " ['quick brown fox jumps lazy dog', 'animals'],\n", " ['kings breakfast sausages ham bacon eggs toast beans', 'food'],\n", " ['love green eggs ham sausages bacon', 'food'],\n", " ['brown fox quick blue dog lazy', 'animals'],\n", " ['sky blue sky beautiful today', 'weather'],\n", " ['dog lazy brown fox quick', 'animals']], dtype='<U51')" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "norm_corpus_toy = normalize_corpus(corpus_df)\n", "norm_corpus_toy" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## <span style=\"color:#4361EE\">The King James Version of the Bible</span>" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['!', '\"', '#', '$', '%', '&', \"'\", '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', '?', '@', '[', '\\\\', ']', '^', 
'_', '`', '{', '|', '}', '~', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9']\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "[nltk_data] Downloading package gutenberg to /home/alvaro/nltk_data...\n", "[nltk_data] Package gutenberg is already up-to-date!\n" ] } ], "source": [ "from nltk.corpus import gutenberg\n", "from string import punctuation\n", "\n", "nltk.download('gutenberg')\n", "\n", "bible = gutenberg.sents('bible-kjv.txt') # corpus already split into tokenized sentences\n", "remove_terms = list(punctuation + '0123456789')\n", "print(remove_terms)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Total lines: 30103\n", "\n", "Sample line: ['1', ':', '6', 'And', 'God', 'said', ',', 'Let', 'there', 'be', 'a', 'firmament', 'in', 'the', 'midst', 'of', 'the', 'waters', ',', 'and', 'let', 'it', 'divide', 'the', 'waters', 'from', 'the', 'waters', '.']\n", "\n", "Processed line: god said let firmament midst waters let divide waters waters\n" ] } ], "source": [ "norm_bible = [[word.lower() for word in sent if word not in remove_terms] for sent in bible]\n", "norm_bible = [' '.join(tok_sent) for tok_sent in norm_bible]\n", "norm_bible = filter(None, normalize_corpus(norm_bible))\n", "norm_bible = [tok_sent for tok_sent in norm_bible if len(tok_sent.split()) > 2]\n", "\n", "print('Total lines:', len(bible))\n", "print('\\nSample line:', bible[10])\n", "print('\\nProcessed line:', norm_bible[10])" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## <span style=\"color:#4361EE\">Models with Gensim</span>" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "The gensim word2vec object can use either CBOW or Skip-gram. The gensim authors took the original code written by Mikolov and colleagues in C, optimized it, and packaged it as a *black box*. 
According to them, gensim is 7 times faster than a plain NumPy implementation such as the one we wrote by hand in another lesson." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### <span style=\"color:#4CC9F0\">Bible corpus</span>" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "from gensim.models import word2vec\n", "\n", "# tokenize sentences in corpus\n", "wpt = nltk.WordPunctTokenizer()\n", "tokenized_corpus_bible = [wpt.tokenize(document) for document in norm_bible]\n", "\n", "# Set values for various parameters\n", "feature_size = 100 # Word vector dimensionality (embedding dim)\n", "window_context = 30 # Context window size \n", "min_word_count = 1 # Minimum word count \n", "sample = 1e-3 # Downsample setting for frequent words\n", "\n", "# The default sg=0 trains CBOW; pass sg=1 to train Skip-gram instead.\n", "# Note: gensim >= 4.0 renames `size` to `vector_size` and `iter` to `epochs`.\n", "w2v_model = word2vec.Word2Vec(tokenized_corpus_bible, size=feature_size, \n", "                              window=window_context, min_count=min_word_count,\n", "                              sample=sample, iter=50)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### <span style=\"color:#4CC9F0\">Similar words, Bible corpus</span>" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'god': ['worldly', 'lord', 'reasonable', 'covenant', 'sworn'],\n", " 'jesus': ['peter', 'messias', 'immediately', 'apostles', 'nathanael'],\n", " 'noah': ['ham', 'shem', 'japheth', 'kenan', 'enosh'],\n", " 'egypt': ['pharaoh', 'egyptians', 'bondage', 'flowing', 'rid'],\n", " 'john': ['james', 'baptist', 'peter', 'devine', 'galilee'],\n", " 'gospel': ['christ', 'repentance', 'faith', 'godly', 'hope'],\n", " 'moses': ['congregation', 'ordinance', 'children', 'doctor', 'aaron'],\n", " 'famine': ['pestilence', 'peril', 'mildew', 'blasting', 'sword']}" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# view similar words based on gensim's model\n", "similar_words = {search_term: [item[0] for item 
in w2v_model.wv.most_similar([search_term], topn=5)]\n", " for search_term in ['god', 'jesus', 'noah', 'egypt', 'john', 'gospel', 'moses','famine']}\n", "similar_words" ] }, { "cell_type": "raw", "metadata": {}, "source": [ "{'dios': ['mundano', 'señor', 'razonable', 'menos', 'nosotros'],\n", " 'jesús': ['pedro', 'mesías', 'juan', 'apóstoles', 'james'],\n", " 'noah': ['japheth', 'shem', 'ham', 'henoch', 'enosh'],\n", " 'egipto': ['egipcios', 'esclavitud', 'faraón', 'fluyendo', 'piojos'],\n", " 'john': ['james', 'baptist', 'devine', 'pedro', 'baptism'],\n", " 'evangelio': ['cristo', 'fe', 'predicar', 'arrepentimiento', 'esperanza'],\n", " 'moisés': ['elisheba', 'congregación', 'joshua', 'naashon', 'doctor'],\n", " 'hambruna': ['pestilencia', 'peligro', 'alcanza', 'muertes', 'moho']}" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "help(word2vec.Word2Vec)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### <span style=\"color:#4CC9F0\">t-SNE plot</span>" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "t-distributed stochastic neighbor embedding.\n", "\n", "[t-SNE](https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html) ([Laurens van der Maaten and Geoffrey Hinton](https://www.jmlr.org/papers/volume9/vandermaaten08a/vandermaaten08a.pdf)) is a tool for visualizing high-dimensional data. It converts similarities between data points into joint probabilities and tries to minimize the Kullback-Leibler divergence between the joint probabilities of the low-dimensional embedding and the high-dimensional data. 
t-SNE has a non-convex cost function, that is, different initializations can yield different results.\n", "\n", "It is strongly recommended to use another dimensionality reduction method (for example, PCA for dense data or TruncatedSVD for sparse data) to reduce the number of dimensions to a reasonable amount (for example, 50) if the number of features is very high. This will suppress some noise and speed up the computation of pairwise distances between samples. For more tips, see the FAQ by [Laurens van der Maaten](https://lvdmaaten.github.io/tsne/)\n" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "<Figure size 1008x576 with 1 Axes>" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "from sklearn.manifold import TSNE\n", "\n", "words = sum([[k] + v for k, v in similar_words.items()], []) # a list of the words in similar_words\n", "wvs = w2v_model.wv[words] # Coordinates of the words\n", "\n", "tsne = TSNE(n_components=2, random_state=100, n_iter=10000, perplexity=2)\n", "np.set_printoptions(suppress=True)\n", "T = tsne.fit_transform(wvs)\n", "labels = words\n", "\n", "plt.figure(figsize=(14, 8))\n", "plt.scatter(T[:, 0], T[:, 1], c='orange', edgecolors='r')\n", "for label, x, y in zip(labels, T[:, 0], T[:, 1]):\n", "    plt.annotate(label, xy=(x+1, y+1), xytext=(0, 0), textcoords='offset points')" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[-0.34405723, -2.2080076 , 0.22129701, ..., -0.27648592,\n", " -0.12405743, -0.15957785],\n", " [-0.11122978, -0.63676333, -0.18996008, ..., 0.06530013,\n", " 0.13042438, 0.25724903],\n", " [ 0.9176589 , 1.1386521 , 0.03363332, ..., 1.1606003 ,\n", " -1.6409662 , -1.9810503 ],\n", " ...,\n", " [ 1.0055212 , 0.5959988 , -1.167708 , ..., -0.2223904 ,\n", " 
0.6689782 , -0.24536332],\n", " [ 1.0243756 , 0.62595314, -1.1647514 , ..., -0.17356953,\n", " 0.59791714, -0.216457 ],\n", " [-1.8861887 , 0.5809999 , -0.08317056, ..., 2.7208266 ,\n", " 5.5738573 , 0.00431524]], dtype=float32)" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "wvs" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['god',\n", " 'worldly',\n", " 'lord',\n", " 'reasonable',\n", " 'covenant',\n", " 'sworn',\n", " 'jesus',\n", " 'peter',\n", " 'messias',\n", " 'immediately',\n", " 'apostles',\n", " 'nathanael',\n", " 'noah',\n", " 'ham',\n", " 'shem',\n", " 'japheth',\n", " 'kenan',\n", " 'enosh',\n", " 'egypt',\n", " 'pharaoh',\n", " 'egyptians',\n", " 'bondage',\n", " 'flowing',\n", " 'rid',\n", " 'john',\n", " 'james',\n", " 'baptist',\n", " 'peter',\n", " 'devine',\n", " 'galilee',\n", " 'gospel',\n", " 'christ',\n", " 'repentance',\n", " 'faith',\n", " 'godly',\n", " 'hope',\n", " 'moses',\n", " 'congregation',\n", " 'ordinance',\n", " 'children',\n", " 'doctor',\n", " 'aaron',\n", " 'famine',\n", " 'pestilence',\n", " 'peril',\n", " 'mildew',\n", " 'blasting',\n", " 'sword']" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "words" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## <span style=\"color:#4361EE\">Application to automatic text labeling</span>" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "For this application we will use our toy example." 
] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'brown fox quick blue dog lazy'" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "norm_toy = [text for text, category in norm_corpus_toy]\n", "#norm_toy = [tok_sent for tok_sent in norm_toy if len(tok_sent.split()) > 2]\n", "norm_toy[5]" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [], "source": [ "# build word2vec model\n", "wpt = nltk.WordPunctTokenizer()\n", "\n", "tokenized_corpus_toy = [wpt.tokenize(document) for document in norm_toy]\n", "\n", "# Set values for various parameters\n", "feature_size = 10 # Word vector dimensionality \n", "window_context = 10 # Context window size \n", "min_word_count = 1 # Minimum word count \n", "sample = 1e-3 # Downsample setting for frequent words\n", "\n", "# Note: gensim >= 4.0 renames `size` to `vector_size` and `iter` to `epochs`\n", "w2v_model = word2vec.Word2Vec(tokenized_corpus_toy, size=feature_size, \n", " window=window_context, min_count=min_word_count,\n", " sample=sample, iter=100)\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## <span style=\"color:#4361EE\">Visualizing embeddings with t-SNE</span>" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "<Figure size 864x432 with 1 Axes>" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# visualize embeddings\n", "from sklearn.manifold import TSNE\n", "\n", "# Note: gensim >= 4.0 renames `wv.index2word` to `wv.index_to_key`\n", "words = w2v_model.wv.index2word\n", "wvs = w2v_model.wv[words]\n", "\n", "tsne = TSNE(n_components=2, random_state=200, n_iter=5000, perplexity=2)\n", "np.set_printoptions(suppress=True)\n", "T = tsne.fit_transform(wvs)\n", "labels = words\n", "\n", "plt.figure(figsize=(12, 6))\n", "plt.scatter(T[:, 0], T[:, 1], c='orange', edgecolors='r')\n", "for label, x, y in zip(labels, T[:, 0], T[:, 1]):\n", "    plt.annotate(label, xy=(x+1, 
y+1), xytext=(0, 0), textcoords='offset points')" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Remember that our corpus is extremely small; to obtain meaningful word embeddings and let the model capture more context and semantics, more data helps.\n", "\n", "Now, what is a word embedding in this scenario? \n", "\n", "Typically, it is a dense vector for each word, as shown in the following example for the word *sky*." ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([ 0.0222188 , -0.00322705, 0.00577812, 0.02297305, 0.00253111,\n", " -0.03698301, -0.00314448, -0.03604106, 0.00387173, 0.01324753],\n", " dtype=float32)" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "w2v_model.wv['sky']" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## <span style=\"color:#4361EE\">Document embeddings</span>" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Now suppose we wanted to cluster the eight documents in our toy corpus; we would need document-level embeddings built from the words present in each document. \n", "\n", "One strategy is to average the embeddings of all the words in a document. \n", "\n", "This is an extremely useful strategy, and you can adopt it for your own problems. Let's apply it now to our corpus to obtain features for each document." 
] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [], "source": [ "def average_word_vectors(words, model, vocabulary, num_features):\n", "    # Average the vectors of the in-vocabulary words of one document\n", "    feature_vector = np.zeros((num_features,), dtype=\"float64\")\n", "    nwords = 0.\n", "\n", "    for word in words:\n", "        if word in vocabulary:\n", "            nwords = nwords + 1.\n", "            # model.wv[word] avoids the deprecated model[word] lookup\n", "            feature_vector = np.add(feature_vector, model.wv[word])\n", "\n", "    if nwords:\n", "        feature_vector = np.divide(feature_vector, nwords)\n", "\n", "    return feature_vector\n", "\n", "\n", "def averaged_word_vectorizer(corpus, model, num_features):\n", "    # Note: gensim >= 4.0 renames `wv.index2word` to `wv.index_to_key`\n", "    vocabulary = set(model.wv.index2word)\n", "    features = [average_word_vectors(tokenized_sentence, model, vocabulary, num_features)\n", "                for tokenized_sentence in corpus]\n", "    return np.array(features)\n", "\n", "\n", "# get document level embeddings\n", "w2v_feature_array = averaged_word_vectorizer(corpus=tokenized_corpus_toy, model=w2v_model, num_features=feature_size)\n" ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>0</th>\n", " <th>1</th>\n", " <th>2</th>\n", " <th>3</th>\n", " <th>4</th>\n", " <th>5</th>\n", " <th>6</th>\n", " <th>7</th>\n", " <th>8</th>\n", " <th>9</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " 
<td>-0.003041</td>\n", " <td>-0.007926</td>\n", " <td>0.018282</td>\n", " <td>0.003400</td>\n", " <td>-0.003612</td>\n", " <td>0.017513</td>\n", " <td>-0.021539</td>\n", " <td>0.000254</td>\n", " <td>0.014870</td>\n", " <td>0.001121</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>0.004246</td>\n", " <td>-0.016616</td>\n", " <td>0.002898</td>\n", " <td>-0.003310</td>\n", " <td>0.001360</td>\n", " <td>0.015432</td>\n", " <td>-0.020553</td>\n", " <td>0.002866</td>\n", " <td>0.006426</td>\n", " <td>-0.000660</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>0.005549</td>\n", " <td>-0.010005</td>\n", " <td>-0.016900</td>\n", " <td>0.000572</td>\n", " <td>-0.012724</td>\n", " <td>0.019170</td>\n", " <td>-0.025599</td>\n", " <td>-0.007053</td>\n", " <td>-0.014700</td>\n", " <td>-0.017178</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>0.006906</td>\n", " <td>0.009039</td>\n", " <td>0.000851</td>\n", " <td>0.010022</td>\n", " <td>0.001052</td>\n", " <td>-0.004086</td>\n", " <td>0.006732</td>\n", " <td>-0.004544</td>\n", " <td>-0.006839</td>\n", " <td>-0.005179</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>0.007297</td>\n", " <td>0.004711</td>\n", " <td>-0.014360</td>\n", " <td>0.012438</td>\n", " <td>-0.011912</td>\n", " <td>0.015900</td>\n", " <td>0.001450</td>\n", " <td>-0.018203</td>\n", " <td>-0.009471</td>\n", " <td>-0.005930</td>\n", " </tr>\n", " <tr>\n", " <th>5</th>\n", " <td>0.012570</td>\n", " <td>-0.009762</td>\n", " <td>-0.005562</td>\n", " <td>0.001511</td>\n", " <td>-0.006945</td>\n", " <td>0.023436</td>\n", " <td>-0.029692</td>\n", " <td>0.000243</td>\n", " <td>-0.008374</td>\n", " <td>-0.003993</td>\n", " </tr>\n", " <tr>\n", " <th>6</th>\n", " <td>-0.000583</td>\n", " <td>-0.003620</td>\n", " <td>0.010479</td>\n", " <td>0.014733</td>\n", " <td>-0.004924</td>\n", " <td>0.005844</td>\n", " <td>-0.010194</td>\n", " <td>-0.009180</td>\n", " <td>0.019397</td>\n", " <td>0.006925</td>\n", " </tr>\n", " <tr>\n", " <th>7</th>\n", " 
<td>0.015602</td>\n", " <td>-0.009999</td>\n", " <td>-0.014323</td>\n", " <td>0.008859</td>\n", " <td>-0.010401</td>\n", " <td>0.020386</td>\n", " <td>-0.026200</td>\n", " <td>-0.004136</td>\n", " <td>-0.011145</td>\n", " <td>-0.012100</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " 0 1 2 3 4 5 6 \\\n", "0 -0.003041 -0.007926 0.018282 0.003400 -0.003612 0.017513 -0.021539 \n", "1 0.004246 -0.016616 0.002898 -0.003310 0.001360 0.015432 -0.020553 \n", "2 0.005549 -0.010005 -0.016900 0.000572 -0.012724 0.019170 -0.025599 \n", "3 0.006906 0.009039 0.000851 0.010022 0.001052 -0.004086 0.006732 \n", "4 0.007297 0.004711 -0.014360 0.012438 -0.011912 0.015900 0.001450 \n", "5 0.012570 -0.009762 -0.005562 0.001511 -0.006945 0.023436 -0.029692 \n", "6 -0.000583 -0.003620 0.010479 0.014733 -0.004924 0.005844 -0.010194 \n", "7 0.015602 -0.009999 -0.014323 0.008859 -0.010401 0.020386 -0.026200 \n", "\n", " 7 8 9 \n", "0 0.000254 0.014870 0.001121 \n", "1 0.002866 0.006426 -0.000660 \n", "2 -0.007053 -0.014700 -0.017178 \n", "3 -0.004544 -0.006839 -0.005179 \n", "4 -0.018203 -0.009471 -0.005930 \n", "5 0.000243 -0.008374 -0.003993 \n", "6 -0.009180 0.019397 0.006925 \n", "7 -0.004136 -0.011145 -0.012100 " ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ " \n", "pp = pd.DataFrame(w2v_feature_array)\n", "pp" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Now that we have features for each document, let's cluster the documents using the **affinity propagation algorithm**, a clustering algorithm based on the concept of \"message passing\" between data points. Unlike partition-based clustering algorithms, it does not require the number of clusters as an explicit input." 
] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### <span style=\"color:#4CC9F0\">Affinity propagation</span>" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "This is the abstract of the paper by [Frey and Dueck](https://science.sciencemag.org/content/315/5814/972).\n", "\n", "\"Clustering data by identifying a subset of representative examples is important for processing sensory signals and detecting patterns in data. \n", "\n", "Such \"exemplars\" can be found by randomly choosing an initial subset of data points and then iteratively refining it, but this works well only if that initial choice is close to a good solution.\n", "\n", "\n", "We devised a method called \"affinity propagation,\" which takes as input measures of similarity between pairs of data points. Real-valued messages are exchanged between data points until a high-quality set of exemplars and corresponding clusters gradually emerges. \n", "\n", "We used affinity propagation to cluster images of faces, detect genes in microarray data, identify representative sentences in this manuscript, and identify cities that are efficiently accessed by airline travel. \n", "\n", "Affinity propagation found clusters with much lower error than other methods, and it did so in less than one-hundredth the amount of time.\"" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/alvaro/anaconda3/envs/nlp/lib/python3.8/site-packages/sklearn/cluster/_affinity_propagation.py:148: FutureWarning: 'random_state' has been introduced in 0.23. It will be set to None starting from 1.0 (renaming of 0.25) which means that results will differ at every function call. 
Set 'random_state' to None to silence this warning, or to 0 to keep the behavior of versions <0.23.\n", " warnings.warn(\n" ] }, { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>Document</th>\n", " <th>Category</th>\n", " <th>ClusterLabel</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>The sky is blue and beautiful.</td>\n", " <td>weather</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>Love this blue and beautiful sky!</td>\n", " <td>weather</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>The quick brown fox jumps over the lazy dog.</td>\n", " <td>animals</td>\n", " <td>2</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>A king's breakfast has sausages, ham, bacon, eggs, toast and beans</td>\n", " <td>food</td>\n", " <td>1</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>I love green eggs, ham, sausages and bacon!</td>\n", " <td>food</td>\n", " <td>1</td>\n", " </tr>\n", " <tr>\n", " <th>5</th>\n", " <td>The brown fox is quick and the blue dog is lazy!</td>\n", " <td>animals</td>\n", " <td>2</td>\n", " </tr>\n", " <tr>\n", " <th>6</th>\n", " <td>The sky is very blue and the sky is very beautiful today</td>\n", " <td>weather</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>7</th>\n", " <td>The dog is lazy but the brown fox is quick!</td>\n", " <td>animals</td>\n", " <td>2</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " Document \\\n", "0 The sky is blue and beautiful. \n", "1 Love this blue and beautiful sky! \n", "2 The quick brown fox jumps over the lazy dog. 
\n", "3 A king's breakfast has sausages, ham, bacon, eggs, toast and beans \n", "4 I love green eggs, ham, sausages and bacon! \n", "5 The brown fox is quick and the blue dog is lazy! \n", "6 The sky is very blue and the sky is very beautiful today \n", "7 The dog is lazy but the brown fox is quick! \n", "\n", " Category ClusterLabel \n", "0 weather 0 \n", "1 weather 0 \n", "2 animals 2 \n", "3 food 1 \n", "4 food 1 \n", "5 animals 2 \n", "6 weather 0 \n", "7 animals 2 " ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.cluster import AffinityPropagation\n", "\n", "ap = AffinityPropagation()\n", "ap.fit(w2v_feature_array)\n", "cluster_labels = ap.labels_\n", "cluster_labels = pd.DataFrame(cluster_labels, columns=['ClusterLabel'])\n", "pd.concat([corpus_df, cluster_labels], axis=1)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Finally, let's plot the documents using principal component analysis (PCA)." 
] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "<Figure size 576x432 with 1 Axes>" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "from sklearn.decomposition import PCA\n", "\n", "pca = PCA(n_components=2, random_state=0)\n", "pcs = pca.fit_transform(w2v_feature_array)\n", "labels = ap.labels_\n", "categories = list(corpus_df['Category'])\n", "plt.figure(figsize=(8, 6))\n", "\n", "for i in range(len(labels)):\n", "    label = labels[i]\n", "    color = 'orange' if label == 0 else 'blue' if label == 1 else 'green'\n", "    annotation_label = categories[i]\n", "    x, y = pcs[i]\n", "    plt.scatter(x, y, c=color, edgecolors='k')\n", "    plt.annotate(annotation_label, xy=(x+1e-4, y+1e-3), xytext=(0, 0), textcoords='offset points')\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## <span style=\"color:#4361EE\">Pre-trained models. The GloVe model</span>" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "GloVe stands for Global Vectors; it is an unsupervised learning model that can be used to obtain dense word vectors similar to Word2Vec. \n", "\n", "However, the technique is different: training is performed on a global word-word co-occurrence matrix, which gives us a vector space with meaningful substructures. \n", "\n", "This method was invented at Stanford by Pennington et al., and we recommend reading the original GloVe paper, [GloVe: Global Vectors for Word Representation](https://nlp.stanford.edu/pubs/glove.pdf) by Pennington et al., 
which is an excellent read for understanding how this model works.\n", "\n", "\n", "\n", "We will not cover the implementation of the model from scratch in much detail here, but if you are interested in the actual code, you can check the official [GloVe](https://nlp.stanford.edu/projects/glove/) page. \n", "\n", "Here we will keep things simple and try to understand the basic concepts behind the GloVe model. We have talked about count-based matrix factorization methods such as LSA and predictive methods such as Word2Vec. \n", "\n", "\n", "The paper claims that, currently, both families suffer from significant drawbacks. \n", "\n", "\n", "1. Methods like LSA leverage statistical information efficiently, but they perform relatively poorly on the word analogy task, that is, the way we discover semantically similar words. \n", "2. Methods like skip-gram may do better on the analogy task, but they make poorer use of corpus statistics at a global level.\n", "\n", "\n", "The basic methodology of the GloVe model is to first build a huge word-context co-occurrence matrix made up of (word, context) pairs, such that each element of this matrix represents how often a word occurs with the context (which can be a sequence of words).\n", "\n", "The idea is then to apply matrix factorization to approximate this matrix, as shown in the following figure." 
] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "<figure>\n", "<center>\n", "<img src=\"../Imagenes/glove_matrix.png\" width=\"400\" height=\"300\" align=\"center\"/>\n", "</center>\n", "<figcaption>\n", "<p style=\"text-align:center\">Mathematical basis of the GloVe model</p>\n", "</figcaption>\n", "</figure>\n", "\n", "Source: [understanding-feature-engineering](https://towardsdatascience.com/understanding-feature-engineering-part-4-deep-learning-methods-for-text-data-96c44370bbfa)\n", "\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Considering the **Word-Context (WC)** matrix, the **Word-Feature (WF)** matrix, and the **Feature-Context (FC)** matrix, we try to factorize **WC = WF x FC**, so that our goal is to reconstruct WC from WF and FC by multiplying them. \n", "\n", "\n", "To do this, we typically initialize WF and FC with random weights and multiply them to obtain WC' (an approximation of WC), then measure how close it is to WC. We repeat this many times using stochastic gradient descent (SGD) to minimize the error. \n", "\n", "Finally, the Word-Feature (WF) matrix gives us the embedding of each word, where F can be preset to a specific number of dimensions.\n", "\n", "A very important point to remember is that the Word2Vec and GloVe models are very similar in how they work. \n", "\n", "\n", "Both aim to build a vector space where the position of each word is influenced by its neighboring words, based on their context and semantics. \n", "\n", "Word2Vec starts from local individual examples of word co-occurrence pairs, and GloVe starts from global co-occurrence statistics aggregated over all the words in the corpus." 
] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## <span style=\"color:#4361EE\">Introduction to spaCy</span>" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Let's try to leverage GloVe-based embeddings for our document clustering task. \n", "\n", "The **spaCy** framework is very popular and comes with capabilities to leverage GloVe embeddings based on different language models. \n", "\n", "You can also obtain pre-trained word vectors and load them as needed using gensim or spaCy. \n", "\n", "First we will install spaCy and use the **en_core_web_md** model, a medium-sized English model. If you want more powerful results, install **en_vectors_web_lg**, which consists of 300-dimensional word vectors trained on Common Crawl with [GloVe](https://nlp.stanford.edu/projects/glove/).\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## <span style=\"color:#4361EE\">Spanish models</span>" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "On the [spacy-models](https://spacy.io/usage/models) page you will find the models available in *spaCy*, based on GloVe, for different languages. \n", "\n", "In particular, for Spanish there are [Spanish models in spaCy](https://spacy.io/models/es): \n", "\n", "1. es_core_news_sm (small): 15 MB\n", "2. es_core_news_md (medium): 500k keys, 20k unique vectors (300 dimensions), 45 MB\n", "3. 
es_core_news_lg (large): 500k keys, 500k unique vectors (300 dimensions), 546 MB\n", "\n", "To download and install, for example, *es_core_news_md*, you can run\n", "\n", "*python -m spacy download es_core_news_md*\n", "\n", "Let's see\n" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Total word vectors: 20000\n" ] } ], "source": [ "#!conda install -c conda-forge spacy\n", "# !python -m spacy download en_core_web_md\n", "\n", "import spacy\n", "\n", "nlp = spacy.load('en_core_web_md')\n", "\n", "\n", "total_vectors = len(nlp.vocab.vectors)\n", "print('Total word vectors:', total_vectors)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "If everything has gone well so far, we can continue." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>0</th>\n", " <th>1</th>\n", " <th>2</th>\n", " <th>3</th>\n", " <th>4</th>\n", " <th>5</th>\n", " <th>6</th>\n", " <th>7</th>\n", " <th>8</th>\n", " <th>9</th>\n", " <th>...</th>\n", " <th>290</th>\n", " <th>291</th>\n", " <th>292</th>\n", " <th>293</th>\n", " <th>294</th>\n", " <th>295</th>\n", " <th>296</th>\n", " <th>297</th>\n", " <th>298</th>\n", " <th>299</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>green</th>\n", " <td>-0.072368</td>\n", " <td>0.233200</td>\n", " <td>0.137260</td>\n", " <td>-0.156630</td>\n", " <td>0.248440</td>\n", " <td>0.349870</td>\n", " <td>-0.241700</td>\n", " <td>-0.091426</td>\n", " <td>-0.530150</td>\n", " <td>1.34130</td>\n", " 
<td>...</td>\n", " <td>-0.405170</td>\n", " <td>0.243570</td>\n", " <td>0.437300</td>\n", " <td>-0.461520</td>\n", " <td>-0.352710</td>\n", " <td>0.336250</td>\n", " <td>0.069899</td>\n", " <td>-0.111550</td>\n", " <td>0.532930</td>\n", " <td>0.712680</td>\n", " </tr>\n", " <tr>\n", " <th>dog</th>\n", " <td>-0.401760</td>\n", " <td>0.370570</td>\n", " <td>0.021281</td>\n", " <td>-0.341250</td>\n", " <td>0.049538</td>\n", " <td>0.294400</td>\n", " <td>-0.173760</td>\n", " <td>-0.279820</td>\n", " <td>0.067622</td>\n", " <td>2.16930</td>\n", " <td>...</td>\n", " <td>0.022908</td>\n", " <td>-0.259290</td>\n", " <td>-0.308620</td>\n", " <td>0.001754</td>\n", " <td>-0.189620</td>\n", " <td>0.547890</td>\n", " <td>0.311940</td>\n", " <td>0.246930</td>\n", " <td>0.299290</td>\n", " <td>-0.074861</td>\n", " </tr>\n", " <tr>\n", " <th>fox</th>\n", " <td>-0.348680</td>\n", " <td>-0.077720</td>\n", " <td>0.177750</td>\n", " <td>-0.094953</td>\n", " <td>-0.452890</td>\n", " <td>0.237790</td>\n", " <td>0.209440</td>\n", " <td>0.037886</td>\n", " <td>0.035064</td>\n", " <td>0.89901</td>\n", " <td>...</td>\n", " <td>-0.283050</td>\n", " <td>0.270240</td>\n", " <td>-0.654800</td>\n", " <td>0.105300</td>\n", " <td>-0.068738</td>\n", " <td>-0.534750</td>\n", " <td>0.061783</td>\n", " <td>0.123610</td>\n", " <td>-0.553700</td>\n", " <td>-0.544790</td>\n", " </tr>\n", " <tr>\n", " <th>toast</th>\n", " <td>0.130740</td>\n", " <td>-0.193730</td>\n", " <td>0.253270</td>\n", " <td>0.090102</td>\n", " <td>-0.272580</td>\n", " <td>-0.030571</td>\n", " <td>0.096945</td>\n", " <td>-0.115060</td>\n", " <td>0.484000</td>\n", " <td>0.84838</td>\n", " <td>...</td>\n", " <td>0.142080</td>\n", " <td>0.481910</td>\n", " <td>0.045167</td>\n", " <td>0.057151</td>\n", " <td>-0.149520</td>\n", " <td>-0.495130</td>\n", " <td>-0.086677</td>\n", " <td>-0.569040</td>\n", " <td>-0.359290</td>\n", " <td>0.097443</td>\n", " </tr>\n", " <tr>\n", " <th>lazy</th>\n", " <td>-0.353320</td>\n", " 
<td>-0.299710</td>\n", " <td>-0.176230</td>\n", " <td>-0.321940</td>\n", " <td>-0.385640</td>\n", " <td>0.586110</td>\n", " <td>0.411160</td>\n", " <td>-0.418680</td>\n", " <td>0.073093</td>\n", " <td>1.48650</td>\n", " <td>...</td>\n", " <td>0.402310</td>\n", " <td>-0.038554</td>\n", " <td>-0.288670</td>\n", " <td>-0.244130</td>\n", " <td>0.460990</td>\n", " <td>0.514170</td>\n", " <td>0.136260</td>\n", " <td>0.344190</td>\n", " <td>-0.845300</td>\n", " <td>-0.077383</td>\n", " </tr>\n", " <tr>\n", " <th>brown</th>\n", " <td>-0.374120</td>\n", " <td>-0.076264</td>\n", " <td>0.109260</td>\n", " <td>0.186620</td>\n", " <td>0.029943</td>\n", " <td>0.182700</td>\n", " <td>-0.631980</td>\n", " <td>0.133060</td>\n", " <td>-0.128980</td>\n", " <td>0.60343</td>\n", " <td>...</td>\n", " <td>-0.015404</td>\n", " <td>0.392890</td>\n", " <td>-0.034826</td>\n", " <td>-0.720300</td>\n", " <td>-0.365320</td>\n", " <td>0.740510</td>\n", " <td>0.108390</td>\n", " <td>-0.365760</td>\n", " <td>-0.288190</td>\n", " <td>0.114630</td>\n", " </tr>\n", " <tr>\n", " <th>ham</th>\n", " <td>-0.773320</td>\n", " <td>-0.282540</td>\n", " <td>0.580760</td>\n", " <td>0.841480</td>\n", " <td>0.258540</td>\n", " <td>0.585210</td>\n", " <td>-0.021890</td>\n", " <td>-0.463680</td>\n", " <td>0.139070</td>\n", " <td>0.65872</td>\n", " <td>...</td>\n", " <td>0.464470</td>\n", " <td>0.481400</td>\n", " <td>-0.829200</td>\n", " <td>0.354910</td>\n", " <td>0.224530</td>\n", " <td>-0.493920</td>\n", " <td>0.456930</td>\n", " <td>-0.649100</td>\n", " <td>-0.131930</td>\n", " <td>0.372040</td>\n", " </tr>\n", " <tr>\n", " <th>jumps</th>\n", " <td>-0.334840</td>\n", " <td>0.215990</td>\n", " <td>-0.350440</td>\n", " <td>-0.260020</td>\n", " <td>0.411070</td>\n", " <td>0.154010</td>\n", " <td>-0.386110</td>\n", " <td>0.206380</td>\n", " <td>0.386700</td>\n", " <td>1.46050</td>\n", " <td>...</td>\n", " <td>-0.107030</td>\n", " <td>-0.279480</td>\n", " <td>-0.186200</td>\n", " <td>-0.543140</td>\n", " 
<td>-0.479980</td>\n", " <td>-0.284680</td>\n", " <td>0.036022</td>\n", " <td>0.190290</td>\n", " <td>0.692290</td>\n", " <td>-0.071501</td>\n", " </tr>\n", " <tr>\n", " <th>today</th>\n", " <td>-0.156570</td>\n", " <td>0.594890</td>\n", " <td>-0.031445</td>\n", " <td>-0.077586</td>\n", " <td>0.278630</td>\n", " <td>-0.509210</td>\n", " <td>-0.066350</td>\n", " <td>-0.081890</td>\n", " <td>-0.047986</td>\n", " <td>2.80360</td>\n", " <td>...</td>\n", " <td>-0.326580</td>\n", " <td>-0.413380</td>\n", " <td>0.367910</td>\n", " <td>-0.262630</td>\n", " <td>-0.203690</td>\n", " <td>-0.296560</td>\n", " <td>-0.014873</td>\n", " <td>-0.250060</td>\n", " <td>-0.115940</td>\n", " <td>0.083741</td>\n", " </tr>\n", " <tr>\n", " <th>quick</th>\n", " <td>-0.445630</td>\n", " <td>0.191510</td>\n", " <td>-0.249210</td>\n", " <td>0.465900</td>\n", " <td>0.161950</td>\n", " <td>0.212780</td>\n", " <td>-0.046480</td>\n", " <td>0.021170</td>\n", " <td>0.417660</td>\n", " <td>1.68690</td>\n", " <td>...</td>\n", " <td>-0.329460</td>\n", " <td>0.421860</td>\n", " <td>-0.039543</td>\n", " <td>0.150180</td>\n", " <td>0.338220</td>\n", " <td>0.049554</td>\n", " <td>0.149420</td>\n", " <td>-0.038789</td>\n", " <td>-0.019069</td>\n", " <td>0.348650</td>\n", " </tr>\n", " <tr>\n", " <th>eggs</th>\n", " <td>-0.417810</td>\n", " <td>-0.035192</td>\n", " <td>-0.126150</td>\n", " <td>-0.215930</td>\n", " <td>-0.669740</td>\n", " <td>0.513250</td>\n", " <td>-0.797090</td>\n", " <td>-0.068611</td>\n", " <td>0.634660</td>\n", " <td>1.25630</td>\n", " <td>...</td>\n", " <td>-0.232860</td>\n", " <td>-0.139740</td>\n", " <td>-0.681080</td>\n", " <td>-0.370920</td>\n", " <td>-0.545510</td>\n", " <td>0.073728</td>\n", " <td>0.111620</td>\n", " <td>-0.324700</td>\n", " <td>0.059721</td>\n", " <td>0.159160</td>\n", " </tr>\n", " <tr>\n", " <th>breakfast</th>\n", " <td>0.073378</td>\n", " <td>0.227670</td>\n", " <td>0.208420</td>\n", " <td>-0.456790</td>\n", " <td>-0.078219</td>\n", " <td>0.601960</td>\n", 
" <td>-0.024494</td>\n", " <td>-0.467980</td>\n", " <td>0.054627</td>\n", " <td>2.28370</td>\n", " <td>...</td>\n", " <td>0.647710</td>\n", " <td>0.373820</td>\n", " <td>0.019931</td>\n", " <td>-0.033672</td>\n", " <td>-0.073184</td>\n", " <td>0.296830</td>\n", " <td>0.340420</td>\n", " <td>-0.599390</td>\n", " <td>-0.061114</td>\n", " <td>0.232200</td>\n", " </tr>\n", " <tr>\n", " <th>beautiful</th>\n", " <td>0.171200</td>\n", " <td>0.534390</td>\n", " <td>-0.348540</td>\n", " <td>-0.097234</td>\n", " <td>0.101800</td>\n", " <td>-0.170860</td>\n", " <td>0.295650</td>\n", " <td>-0.041816</td>\n", " <td>-0.516550</td>\n", " <td>2.11720</td>\n", " <td>...</td>\n", " <td>-0.285540</td>\n", " <td>0.104670</td>\n", " <td>0.126310</td>\n", " <td>0.120040</td>\n", " <td>0.254380</td>\n", " <td>0.247400</td>\n", " <td>0.207670</td>\n", " <td>0.172580</td>\n", " <td>0.063875</td>\n", " <td>0.350990</td>\n", " </tr>\n", " <tr>\n", " <th>kings</th>\n", " <td>0.259230</td>\n", " <td>-0.854690</td>\n", " <td>0.360010</td>\n", " <td>-0.642000</td>\n", " <td>0.568530</td>\n", " <td>-0.321420</td>\n", " <td>0.173250</td>\n", " <td>0.133030</td>\n", " <td>-0.089720</td>\n", " <td>1.52860</td>\n", " <td>...</td>\n", " <td>-0.470090</td>\n", " <td>0.063743</td>\n", " <td>-0.545210</td>\n", " <td>-0.192310</td>\n", " <td>-0.301020</td>\n", " <td>1.068500</td>\n", " <td>0.231160</td>\n", " <td>-0.147330</td>\n", " <td>0.662490</td>\n", " <td>-0.577420</td>\n", " </tr>\n", " <tr>\n", " <th>love</th>\n", " <td>0.139490</td>\n", " <td>0.534530</td>\n", " <td>-0.252470</td>\n", " <td>-0.125650</td>\n", " <td>0.048748</td>\n", " <td>0.152440</td>\n", " <td>0.199060</td>\n", " <td>-0.065970</td>\n", " <td>0.128830</td>\n", " <td>2.05590</td>\n", " <td>...</td>\n", " <td>-0.124380</td>\n", " <td>0.178440</td>\n", " <td>-0.099469</td>\n", " <td>0.008682</td>\n", " <td>0.089213</td>\n", " <td>-0.075513</td>\n", " <td>-0.049069</td>\n", " <td>-0.015228</td>\n", " <td>0.088408</td>\n", " 
<td>0.302170</td>\n", " </tr>\n", " <tr>\n", " <th>bacon</th>\n", " <td>-0.430730</td>\n", " <td>-0.016025</td>\n", " <td>0.484620</td>\n", " <td>0.101390</td>\n", " <td>-0.299200</td>\n", " <td>0.761820</td>\n", " <td>-0.353130</td>\n", " <td>-0.325290</td>\n", " <td>0.156730</td>\n", " <td>0.87321</td>\n", " <td>...</td>\n", " <td>0.304240</td>\n", " <td>0.413440</td>\n", " <td>-0.540730</td>\n", " <td>-0.035930</td>\n", " <td>-0.429450</td>\n", " <td>-0.246590</td>\n", " <td>0.161490</td>\n", " <td>-1.065400</td>\n", " <td>-0.244940</td>\n", " <td>0.269540</td>\n", " </tr>\n", " <tr>\n", " <th>sausages</th>\n", " <td>-0.481340</td>\n", " <td>0.023467</td>\n", " <td>0.396470</td>\n", " <td>0.364770</td>\n", " <td>-0.083069</td>\n", " <td>0.684590</td>\n", " <td>0.007079</td>\n", " <td>-0.210320</td>\n", " <td>-0.021993</td>\n", " <td>0.81876</td>\n", " <td>...</td>\n", " <td>0.602560</td>\n", " <td>0.297010</td>\n", " <td>-0.543030</td>\n", " <td>-0.169150</td>\n", " <td>-0.689910</td>\n", " <td>-0.307360</td>\n", " <td>0.193250</td>\n", " <td>-0.634980</td>\n", " <td>0.183010</td>\n", " <td>0.285290</td>\n", " </tr>\n", " <tr>\n", " <th>sky</th>\n", " <td>0.312550</td>\n", " <td>-0.303080</td>\n", " <td>0.019587</td>\n", " <td>-0.354940</td>\n", " <td>0.100180</td>\n", " <td>-0.141530</td>\n", " <td>-0.514270</td>\n", " <td>0.886110</td>\n", " <td>-0.530540</td>\n", " <td>1.55660</td>\n", " <td>...</td>\n", " <td>-0.667050</td>\n", " <td>0.279110</td>\n", " <td>0.500970</td>\n", " <td>-0.277580</td>\n", " <td>-0.143720</td>\n", " <td>0.342710</td>\n", " <td>0.287580</td>\n", " <td>0.537740</td>\n", " <td>0.363490</td>\n", " <td>0.496920</td>\n", " </tr>\n", " <tr>\n", " <th>blue</th>\n", " <td>0.129450</td>\n", " <td>0.036518</td>\n", " <td>0.032298</td>\n", " <td>-0.060034</td>\n", " <td>0.399840</td>\n", " <td>-0.103020</td>\n", " <td>-0.507880</td>\n", " <td>0.076630</td>\n", " <td>-0.422920</td>\n", " <td>0.81573</td>\n", " <td>...</td>\n", " 
<td>-0.501280</td>\n", " <td>0.169010</td>\n", " <td>0.548250</td>\n", " <td>-0.319380</td>\n", " <td>-0.072887</td>\n", " <td>0.382950</td>\n", " <td>0.237410</td>\n", " <td>0.052289</td>\n", " <td>0.182060</td>\n", " <td>0.412640</td>\n", " </tr>\n", " <tr>\n", " <th>beans</th>\n", " <td>-0.423290</td>\n", " <td>-0.264500</td>\n", " <td>0.200870</td>\n", " <td>0.082187</td>\n", " <td>0.066944</td>\n", " <td>1.027600</td>\n", " <td>-0.989140</td>\n", " <td>-0.259950</td>\n", " <td>0.145960</td>\n", " <td>0.76645</td>\n", " <td>...</td>\n", " <td>0.048760</td>\n", " <td>0.351680</td>\n", " <td>-0.786260</td>\n", " <td>-0.368790</td>\n", " <td>-0.528640</td>\n", " <td>0.287650</td>\n", " <td>-0.273120</td>\n", " <td>-1.114000</td>\n", " <td>0.064322</td>\n", " <td>0.223620</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "<p>20 rows × 300 columns</p>\n", "</div>" ], "text/plain": [ " 0 1 2 3 4 5 \\\n", "green -0.072368 0.233200 0.137260 -0.156630 0.248440 0.349870 \n", "dog -0.401760 0.370570 0.021281 -0.341250 0.049538 0.294400 \n", "fox -0.348680 -0.077720 0.177750 -0.094953 -0.452890 0.237790 \n", "toast 0.130740 -0.193730 0.253270 0.090102 -0.272580 -0.030571 \n", "lazy -0.353320 -0.299710 -0.176230 -0.321940 -0.385640 0.586110 \n", "brown -0.374120 -0.076264 0.109260 0.186620 0.029943 0.182700 \n", "ham -0.773320 -0.282540 0.580760 0.841480 0.258540 0.585210 \n", "jumps -0.334840 0.215990 -0.350440 -0.260020 0.411070 0.154010 \n", "today -0.156570 0.594890 -0.031445 -0.077586 0.278630 -0.509210 \n", "quick -0.445630 0.191510 -0.249210 0.465900 0.161950 0.212780 \n", "eggs -0.417810 -0.035192 -0.126150 -0.215930 -0.669740 0.513250 \n", "breakfast 0.073378 0.227670 0.208420 -0.456790 -0.078219 0.601960 \n", "beautiful 0.171200 0.534390 -0.348540 -0.097234 0.101800 -0.170860 \n", "kings 0.259230 -0.854690 0.360010 -0.642000 0.568530 -0.321420 \n", "love 0.139490 0.534530 -0.252470 -0.125650 0.048748 0.152440 \n", "bacon -0.430730 -0.016025 0.484620 0.101390 
-0.299200 0.761820 \n", "sausages -0.481340 0.023467 0.396470 0.364770 -0.083069 0.684590 \n", "sky 0.312550 -0.303080 0.019587 -0.354940 0.100180 -0.141530 \n", "blue 0.129450 0.036518 0.032298 -0.060034 0.399840 -0.103020 \n", "beans -0.423290 -0.264500 0.200870 0.082187 0.066944 1.027600 \n", "\n", " 6 7 8 9 ... 290 291 \\\n", "green -0.241700 -0.091426 -0.530150 1.34130 ... -0.405170 0.243570 \n", "dog -0.173760 -0.279820 0.067622 2.16930 ... 0.022908 -0.259290 \n", "fox 0.209440 0.037886 0.035064 0.89901 ... -0.283050 0.270240 \n", "toast 0.096945 -0.115060 0.484000 0.84838 ... 0.142080 0.481910 \n", "lazy 0.411160 -0.418680 0.073093 1.48650 ... 0.402310 -0.038554 \n", "brown -0.631980 0.133060 -0.128980 0.60343 ... -0.015404 0.392890 \n", "ham -0.021890 -0.463680 0.139070 0.65872 ... 0.464470 0.481400 \n", "jumps -0.386110 0.206380 0.386700 1.46050 ... -0.107030 -0.279480 \n", "today -0.066350 -0.081890 -0.047986 2.80360 ... -0.326580 -0.413380 \n", "quick -0.046480 0.021170 0.417660 1.68690 ... -0.329460 0.421860 \n", "eggs -0.797090 -0.068611 0.634660 1.25630 ... -0.232860 -0.139740 \n", "breakfast -0.024494 -0.467980 0.054627 2.28370 ... 0.647710 0.373820 \n", "beautiful 0.295650 -0.041816 -0.516550 2.11720 ... -0.285540 0.104670 \n", "kings 0.173250 0.133030 -0.089720 1.52860 ... -0.470090 0.063743 \n", "love 0.199060 -0.065970 0.128830 2.05590 ... -0.124380 0.178440 \n", "bacon -0.353130 -0.325290 0.156730 0.87321 ... 0.304240 0.413440 \n", "sausages 0.007079 -0.210320 -0.021993 0.81876 ... 0.602560 0.297010 \n", "sky -0.514270 0.886110 -0.530540 1.55660 ... -0.667050 0.279110 \n", "blue -0.507880 0.076630 -0.422920 0.81573 ... -0.501280 0.169010 \n", "beans -0.989140 -0.259950 0.145960 0.76645 ... 
0.048760 0.351680 \n", "\n", " 292 293 294 295 296 297 \\\n", "green 0.437300 -0.461520 -0.352710 0.336250 0.069899 -0.111550 \n", "dog -0.308620 0.001754 -0.189620 0.547890 0.311940 0.246930 \n", "fox -0.654800 0.105300 -0.068738 -0.534750 0.061783 0.123610 \n", "toast 0.045167 0.057151 -0.149520 -0.495130 -0.086677 -0.569040 \n", "lazy -0.288670 -0.244130 0.460990 0.514170 0.136260 0.344190 \n", "brown -0.034826 -0.720300 -0.365320 0.740510 0.108390 -0.365760 \n", "ham -0.829200 0.354910 0.224530 -0.493920 0.456930 -0.649100 \n", "jumps -0.186200 -0.543140 -0.479980 -0.284680 0.036022 0.190290 \n", "today 0.367910 -0.262630 -0.203690 -0.296560 -0.014873 -0.250060 \n", "quick -0.039543 0.150180 0.338220 0.049554 0.149420 -0.038789 \n", "eggs -0.681080 -0.370920 -0.545510 0.073728 0.111620 -0.324700 \n", "breakfast 0.019931 -0.033672 -0.073184 0.296830 0.340420 -0.599390 \n", "beautiful 0.126310 0.120040 0.254380 0.247400 0.207670 0.172580 \n", "kings -0.545210 -0.192310 -0.301020 1.068500 0.231160 -0.147330 \n", "love -0.099469 0.008682 0.089213 -0.075513 -0.049069 -0.015228 \n", "bacon -0.540730 -0.035930 -0.429450 -0.246590 0.161490 -1.065400 \n", "sausages -0.543030 -0.169150 -0.689910 -0.307360 0.193250 -0.634980 \n", "sky 0.500970 -0.277580 -0.143720 0.342710 0.287580 0.537740 \n", "blue 0.548250 -0.319380 -0.072887 0.382950 0.237410 0.052289 \n", "beans -0.786260 -0.368790 -0.528640 0.287650 -0.273120 -1.114000 \n", "\n", " 298 299 \n", "green 0.532930 0.712680 \n", "dog 0.299290 -0.074861 \n", "fox -0.553700 -0.544790 \n", "toast -0.359290 0.097443 \n", "lazy -0.845300 -0.077383 \n", "brown -0.288190 0.114630 \n", "ham -0.131930 0.372040 \n", "jumps 0.692290 -0.071501 \n", "today -0.115940 0.083741 \n", "quick -0.019069 0.348650 \n", "eggs 0.059721 0.159160 \n", "breakfast -0.061114 0.232200 \n", "beautiful 0.063875 0.350990 \n", "kings 0.662490 -0.577420 \n", "love 0.088408 0.302170 \n", "bacon -0.244940 0.269540 \n", "sausages 0.183010 0.285290 \n", "sky 
0.363490 0.496920 \n", "blue 0.182060 0.412640 \n", "beans 0.064322 0.223620 \n", "\n", "[20 rows x 300 columns]" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Collect the distinct tokens of the normalized corpus.\n", "unique_words = list({word for doc in norm_corpus for word in doc.split()})\n", "\n", "# Look up the pretrained vector of each word and show them as a table.\n", "word_glove_vectors = np.array([nlp(word).vector for word in unique_words])\n", "pd.DataFrame(word_glove_vectors, index=unique_words)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Let's look at a t-SNE plot for this case." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "<Figure size 864x432 with 1 Axes>" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "from sklearn.manifold import TSNE\n", "\n", "# Project the word vectors to 2D; perplexity is kept small because there are only 20 words.\n", "# Note: recent scikit-learn releases name the n_iter parameter max_iter.\n", "tsne = TSNE(n_components=2, random_state=0, n_iter=5000, perplexity=3)\n", "np.set_printoptions(suppress=True)\n", "T = tsne.fit_transform(word_glove_vectors)\n", "labels = unique_words\n", "\n", "# Scatter plot of the projected words, each annotated with its label.\n", "plt.figure(figsize=(12, 6))\n", "plt.scatter(T[:, 0], T[:, 1], c='orange', edgecolors='r')\n", "for label, x, y in zip(labels, T[:, 0], T[:, 1]):\n", "    plt.annotate(label, xy=(x+1, y+1), xytext=(0, 0), textcoords='offset points')" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Finally, a k-means clustering of the documents."
] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>Document</th>\n", " <th>Category</th>\n", " <th>ClusterLabel</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>The sky is blue and beautiful.</td>\n", " <td>weather</td>\n", " <td>1</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>Love this blue and beautiful sky!</td>\n", " <td>weather</td>\n", " <td>1</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>The quick brown fox jumps over the lazy dog.</td>\n", " <td>animals</td>\n", " <td>2</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>A king's breakfast has sausages, ham, bacon, eggs, toast and beans</td>\n", " <td>food</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>I love green eggs, ham, sausages and bacon!</td>\n", " <td>food</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>5</th>\n", " <td>The brown fox is quick and the blue dog is lazy!</td>\n", " <td>animals</td>\n", " <td>2</td>\n", " </tr>\n", " <tr>\n", " <th>6</th>\n", " <td>The sky is very blue and the sky is very beautiful today</td>\n", " <td>weather</td>\n", " <td>1</td>\n", " </tr>\n", " <tr>\n", " <th>7</th>\n", " <td>The dog is lazy but the brown fox is quick!</td>\n", " <td>animals</td>\n", " <td>2</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " Document \\\n", "0 The sky is blue and beautiful. \n", "1 Love this blue and beautiful sky! \n", "2 The quick brown fox jumps over the lazy dog. 
\n", "3 A king's breakfast has sausages, ham, bacon, eggs, toast and beans \n", "4 I love green eggs, ham, sausages and bacon! \n", "5 The brown fox is quick and the blue dog is lazy! \n", "6 The sky is very blue and the sky is very beautiful today \n", "7 The dog is lazy but the brown fox is quick! \n", "\n", " Category ClusterLabel \n", "0 weather 1 \n", "1 weather 1 \n", "2 animals 2 \n", "3 food 0 \n", "4 food 0 \n", "5 animals 2 \n", "6 weather 1 \n", "7 animals 2 " ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.cluster import KMeans\n", "\n", "# One vector per document: spaCy's Doc.vector averages the token vectors.\n", "doc_glove_vectors = np.array([nlp(str(doc)).vector for doc in norm_corpus])\n", "\n", "# Group the documents into three clusters and show the labels next to the corpus.\n", "km = KMeans(n_clusters=3, random_state=0)\n", "km.fit(doc_glove_vectors)\n", "cluster_labels = km.labels_\n", "cluster_labels = pd.DataFrame(cluster_labels, columns=['ClusterLabel'])\n", "pd.concat([corpus_df, cluster_labels], axis=1)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## <span style=\"color:#4361EE\">The FastText Model</span>" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "The [FastText](https://fasttext.cc/) model was first introduced by Facebook in 2016 as an extension of, and reportedly an improvement on, the vanilla Word2Vec model. \n", "\n", "It is based on the paper [Enriching Word Vectors with Subword Information](https://arxiv.org/pdf/1607.04606.pdf) by Bojanowski et al. (with Mikolov as a co-author), which is an excellent read for a deeper understanding of how the model works. In general, FastText is a framework for learning word representations and for performing robust, fast, and accurate text classification. \n", "\n", "Facebook open-sourced the framework on GitHub, and it claims to offer the following:\n", "\n", "1. State-of-the-art English word vectors.\n", "2. 
Word vectors for 157 languages trained on Wikipedia and Common Crawl.\n", "3. Models for language identification and various supervised tasks.\n", "\n", "According to the authors, predictive models such as *Word2Vec* typically treat each word as a distinct entity (for example, *where*) and generate a dense embedding for it. \n", "\n", "However, this is a serious limitation for languages with a massive vocabulary and many rare words that may seldom appear across corpora. The Word2Vec model ignores the morphological structure of each word and treats a word as a single entity. \n", "\n", "The **FastText** model treats each word as a bag of character n-grams. The paper also calls this the subword model.\n", "\n", "\n", "Special boundary symbols `<` and `>` are added at the beginning and end of each word. This makes it possible to distinguish prefixes and suffixes from other character sequences. The word *w* itself is also included in the set of its n-grams, so that a representation is learned for each word (in addition to its character n-grams). \n", " \n", " \n", "Taking the word *where* and n = 3 (trigrams) as an example, it is represented by the character n-grams `<wh`, `whe`, `her`, `ere`, `re>` and the special sequence `<where>`, which represents the whole word. \n", " \n", "Note that the sequence `<her>`, corresponding to the word *her*, is different from the trigram `her` taken from the word *where*.\n", " \n", "In practice, the paper recommends extracting all n-grams for $3\\le n \\le 6$. This is a very simple approach, and different sets of n-grams could be considered, for example all prefixes and suffixes. \n", " \n", "A vector representation (embedding) is associated with each n-gram of a word. 
A word can therefore be represented by the sum of the vector representations of its n-grams, or by the average of those n-gram embeddings. \n", " \n", "According to the authors, because the model exploits the character n-grams of individual words, rare words are more likely to get a good representation, since their character n-grams should also appear in other words of the corpus.\n", " \n", "Let's put this into practice." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### <span style=\"color:#4CC9F0\">PCA-Based Plot</span>" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "<Figure size 1296x720 with 1 Axes>" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "from sklearn.decomposition import PCA\n", "\n", "# Flatten each query word and its similar words into a single list.\n", "words = sum([[k] + v for k, v in similar_words.items()], [])\n", "wvs = ft_model.wv[words]\n", "\n", "# Project the word vectors onto their first two principal components.\n", "pca = PCA(n_components=2)\n", "np.set_printoptions(suppress=True)\n", "P = pca.fit_transform(wvs)\n", "labels = words\n", "\n", "# Scatter plot of the projected words, each annotated with its label.\n", "plt.figure(figsize=(18, 10))\n", "plt.scatter(P[:, 0], P[:, 1], c='lightgreen', edgecolors='g')\n", "for label, x, y in zip(labels, P[:, 0], P[:, 1]):\n", "    plt.annotate(label, xy=(x+0.06, y+0.03), xytext=(0, 0), textcoords='offset points')\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "We can see many interesting patterns! Noah, his son Shem, and his grandfather Methuselah are close to one another. \n", "\n", "We also see God associated with Moses and Egypt, which endured the biblical plagues, including famine and pestilence. Jesus and some of his disciples are also placed very close to one another.\n", "\n", "To access the embedding of any word, index the model's word vectors with that word, as follows."
] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([ 0.17718132, -0.19709586, 0.25530368, -0.03545383, 0.76611197,\n", " 0.13848874, -0.05604259, -0.02210611, 0.20488791, 0.17241424,\n", " -0.48582503, 0.05138803, 0.18877277, -0.05722686, 0.16079944,\n", " -0.35822144, 0.5743341 , 0.06780402, -0.3008401 , -0.01876839,\n", " 0.09861004, 0.19337818, -0.41912812, -0.09735594, 0.27861547,\n", " -0.16222528, 0.4602971 , -0.46578154, -0.1064477 , 0.42992973,\n", " 0.06472561, 0.16854142, 0.00933973, 0.04791377, 0.18041942,\n", " 0.05097766, -0.27860153, 0.11862674, 0.06555321, 0.06788372,\n", " -0.20273678, 0.40175107, -0.00107946, 0.31407824, -0.08528732,\n", " 0.0850415 , -0.47559258, 0.11246356, -0.20559783, -0.14798933,\n", " 0.14808579, -0.23450403, -0.15976913, 0.15428546, -0.17971121,\n", " 0.40536925, -0.02532636, -0.49386576, 0.20632172, -0.2625528 ,\n", " 0.00773285, 0.50493914, -0.12690398, -0.23250476, -0.41272193,\n", " -0.28254688, -0.43109837, 0.13680667, 0.2277264 , 0.3559441 ,\n", " 0.19570442, 0.04869397, 0.23427969, -0.1791803 , 0.17570105,\n", " 0.4019909 , 0.17426194, -0.06929158, 0.09311975, -0.13352323,\n", " -0.34312806, 0.46036175, 0.32394302, -0.5765276 , -0.0285289 ,\n", " -0.20002018, 0.16436048, -0.2363575 , -0.21630326, -0.35550892,\n", " -0.37143084, 0.33584678, 0.05374669, 0.0320545 , 0.1500369 ,\n", " 0.11470003, 0.50080407, -0.21105155, 0.10132139, -0.00495873],\n", " dtype=float32)" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Embedding learned for the word 'jesus'.\n", "ft_model.wv['jesus']" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "With these embeddings we can perform some interesting natural language tasks. One of them is measuring the similarity between different words (entities)."
] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.3183717\n", "0.6547534\n" ] } ], "source": [ "# Cosine similarity between pairs of word vectors.\n", "print(ft_model.wv.similarity(w1='god', w2='satan'))\n", "print(ft_model.wv.similarity(w1='god', w2='jesus'))" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "We can also ask the model which word in a group does not belong with the others." ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Odd one out for [ god jesus satan john ]: satan\n", "Odd one out for [ john peter james judas ]: judas\n" ] } ], "source": [ "# doesnt_match returns the word that fits least well with the rest of the list.\n", "st1 = \"god jesus satan john\"\n", "print('Odd one out for [', st1, ']:',\n", "      ft_model.wv.doesnt_match(st1.split()))\n", "st2 = \"john peter james judas\"\n", "print('Odd one out for [', st2, ']:',\n", "      ft_model.wv.doesnt_match(st2.split()))\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## <span style=\"color:#4361EE\">Pretrained Spanish Embedding Models</span>" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "The repository linked below collects Spanish word embeddings computed with different methods and from different corpora. 
This work is led by the Department of Computer Science of the Universidad de Chile.\n", "\n", "It includes a description of the parameters used to compute the embeddings, along with simple statistics of the vectors, the vocabulary, and a description of the corpus from which the embeddings were computed. \n", "\n", "Direct links to the embeddings are provided, so please consult the original sources for proper citation (see also the References). The repository also points to an example and a tutorial (both in Spanish) that show how to use some of these embeddings.\n", "\n", "[dccuchile](https://github.com/dccuchile/spanish-word-embeddings/blob/master/README.md)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## <span style=\"color:#4361EE\">Conclusion</span>" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "\n", "These examples should give you a good idea of newer, more efficient strategies for using deep learning language models to extract features from text data, and for addressing issues such as word semantics, context, and data sparsity. " ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.8" } }, "nbformat": 4, "nbformat_minor": 4 }