{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "08e99600",
   "metadata": {},
   "source": [
    "<figure>\n",
    "<img src=\"../Imagenes/logo-final-ap.png\"  width=\"80\" height=\"80\" align=\"left\"/> \n",
    "</figure>\n",
    "\n",
    "# <span style=\"color:#4361EE\"><left>Aprendizaje Profundo</left></span>"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "fb128ef3",
   "metadata": {},
   "source": [
    "# <span style=\"color:red\"><center>Diplomado en Inteligencia Artificial y Aprendizaje Profundo</center></span>"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "de5f9582",
   "metadata": {
    "tags": []
   },
   "source": [
    "# <span style=\"color:green\"><center>Introducción a FastText</center></span>"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "929768fc",
   "metadata": {},
   "source": [
    "<figure>\n",
    "<center>\n",
    "<img src=\"../Imagenes/Fast_castle.jpg\" width=\"600\" height=\"400\" align=\"center\"/>\n",
    "</center>\n",
    "</figure>\n",
    "\n",
    "\n",
    "Fuente \n",
    "(<a href=\"https://commons.wikimedia.org/wiki/File:Fast_castle_-_19092010.jpg\">Bubobubo2</a>, <a href=\"https://creativecommons.org/licenses/by-sa/3.0\">CC BY-SA 3.0</a>, via Wikimedia Commons)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "c6df025a",
   "metadata": {},
   "source": [
    "## <span style=\"color:#4361EE\">Profesores</span>"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "c870291c",
   "metadata": {},
   "source": [
    "1. Alvaro  Montenegro, PhD, ammontenegrod@unal.edu.co\n",
    "1. Camilo José Torres Jiménez, Msc, cjtorresj@unal.edu.co\n",
    "1. Daniel  Montenegro, Msc, dextronomo@gmail.com "
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "5d0db68a",
   "metadata": {},
   "source": [
    "## <span style=\"color:#4361EE\">Asesora Medios y Marketing digital</span>"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "c076681a",
   "metadata": {},
   "source": [
    "4. Maria del Pilar Montenegro, pmontenegro88@gmail.com\n",
    "5. Jessica López Mejía, jelopezme@unal.edu.co\n",
    "6. Venus Puertas, vpuertasg@unal.edu.co"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "dd599a04",
   "metadata": {},
   "source": [
    "## <span style=\"color:#4361EE\">Jefe Jurídica</span>"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "2d9a8916",
   "metadata": {},
   "source": [
    "7. Paula Andrea Guzmán, guzmancruz.paula@gmail.com"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "7c9d467e",
   "metadata": {},
   "source": [
    "## <span style=\"color:#4361EE\">Coordinador Jurídico</span>"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "1e724da2",
   "metadata": {},
   "source": [
    "8. David Fuentes, fuentesd065@gmail.com"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "234ed1af",
   "metadata": {},
   "source": [
    "## <span style=\"color:#4361EE\">Desarrolladores Principales</span>"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "ecea2c8d",
   "metadata": {},
   "source": [
    "9. Dairo Moreno, damoralesj@unal.edu.co\n",
    "10. Joan Castro, jocastroc@unal.edu.co\n",
    "11. Bryan Riveros, briveros@unal.edu.co\n",
    "12. Rosmer Vargas, rovargasc@unal.edu.co"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "5288fefe",
   "metadata": {},
   "source": [
    "## <span style=\"color:#4361EE\">Expertos en Bases de Datos</span>"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "ff6732f3",
   "metadata": {},
   "source": [
    "13. Giovvani Barrera, udgiovanni@gmail.com\n",
    "14. Camilo Chitivo, cchitivo@unal.edu.co"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "7a3a72bf",
   "metadata": {},
   "source": [
    "## <span style=\"color:#4361EE\">Referencias</span> "
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "e7703b83",
   "metadata": {},
   "source": [
    "1. [P. Bojanowski*, E. Grave*, A. Joulin, T. Mikolov, Enriching Word Vectors with Subword Information](https://arxiv.org/pdf/1607.04606.pdf)\n",
    "1. [A. Joulin, E. Grave, P. Bojanowski, T. Mikolov, Bag of Tricks for Efficient Text Classification](https://arxiv.org/pdf/1607.01759.pdf)\n",
    "1. [A. Joulin, E. Grave, P. Bojanowski, M. Douze, H. Jégou, T. Mikolov, FastText.zip: Compressing text classification models](https://arxiv.org/pdf/1612.03651.pdf)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "031057e8",
   "metadata": {},
   "source": [
    "## <span style=\"color:#4361EE\">Contenido</span>"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "19d3ffb3",
   "metadata": {},
   "source": [
    "* [Introducción](#Introducción)\n",
    "* [Instalación de FastText](#Instalación-de-FastText)\n",
    "* [Modelo No Supervisado](#Modelo-No-Suervisado)\n",
    "    * [Uso del modelo con un ejemplo de juguete](#Uso-del-modelo-con-un-ejemplo-de-juguete)\n",
    "        * [Jugando con los Parámetros](#Jugando-con-los-Parámetros)\n",
    "        * [Palabras vecinas más cercanas](#Palabras-vecinas-más-cercanas)\n",
    "    * [Uso del Modelo con Wikipedia](#Uso-del-Modelo-con-Wikipedia)\n",
    "        * [Preprocesamiento](#Preprocesamiento)\n",
    "        * [Entrenando el Modelo](#Entrenando-el-Modelo)\n",
    "        * [Palabras más cercanas](#Palabras-más-cercanas)\n",
    "        * [Curiosidades](#Curiosidades)\n",
    "        * [Analogía de Palabras](#Analogía-de-Palabras)\n",
    "* [Modelo Supervisado](#Modelo-Supervisado)\n",
    "    * [Uso del Modelo](#Uso-del-Modelo)\n",
    "    * [Mejorando el Modelo](#Mejorando-el-Modelo)\n",
    "    * [Auto-ajuste de Parámetros](#Autotuning-de-Parámetros)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "65ed6191",
   "metadata": {},
   "source": [
    "## <span style=\"color:#4361EE\">Introducción</span>"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "fc728f71",
   "metadata": {},
   "source": [
    "El modelo [FastText](https://fasttext.cc/) fue introducido por primera vez por Facebook en 2016 como una extensión y supuestamente una mejora del modelo vainilla de Word2Vec. \n",
    "\n",
    "Está basado en el artículo original titulado [Enriching Word Vectors with Subword Information](https://arxiv.org/pdf/1607.04606.pdf) de Mikolov et al. que es una lectura excelente para obtener una comprensión profunda de cómo funciona este modelo. En general, FastText es un marco para el aprendizaje de representaciones de palabras y también para realizar una clasificación de texto sólida, rápida y precisa. \n",
    "\n",
    "FastText es de código abierto desarrollado en Facebook y disponible GitHub. Facebook afirma que el producto dispone de lo siguiente:\n",
    "\n",
    "1. Vectores de palabras en inglés de última generación.\n",
    "1. Vectores de palabras para 157 idiomas entrenados en Wikipedia y otras fuentes.\n",
    "1. Modelos para identificación de idiomas y diversas tareas supervisadas.\n",
    "\n",
    "De acuerdo con los autores, en general, los modelos predictivos como el modelo *Word2Vec* suelen considerar cada palabra como una entidad distinta (por ejemplo la palabra  *donde*) y generan una incrustación densa para la palabra. \n",
    "\n",
    "Sin embargo, esto representa una seria limitación con los idiomas que tienen un vocabulario masivo y muchas palabras raras que pueden no aparecer mucho en diferentes corpus. El modelo `Word2Vec` normalmente ignora la estructura morfológica de cada palabra y considera una palabra como una sola entidad. \n",
    "\n",
    "El modelo **FastText** considera cada palabra como una bolsa de n-gramas de caracteres. Esto también se denomina modelo de subpalabras en el documento.\n",
    "\n",
    "\n",
    "Se agregan símbolos de límites especiales al principio y al final de las palabras. Esto permite distinguir prefijos y sufijos de otras secuencias de caracteres. También incluyen la propia palabra *d* en el conjunto de sus n-gramas, para aprender una representación de cada palabra (además de su carácter n-gramas). \n",
    " \n",
    " \n",
    "Tomando la palabra *donde* como ejemplo, con n = 3 (tri-gramas) estará representada por la lista de  n-gramas: \n",
    "\n",
    "* [don, ond, nde].\n",
    " \n",
    "Tenga en cuenta que la secuencia, correspondiente a la palabra [don] es diferente del trigrama [don] de la palabra *donde*. En la práctica, el artículo recomienda extraer todos los n-gramas para $3\\le n \\le 6$ Este es un enfoque muy simple, y se podrían considerar diferentes conjuntos de n-gramas, por ejemplo, tomando todos los prefijos y sufijos. \n",
    "\n",
    "Normalmente asociamos una representación vectorial (incrustación) a cada n-grama de una palabra. Por tanto, podemos representar una palabra mediante la suma de las representaciones vectoriales de sus n-gramas o el promedio de la incrustación de estos n-gramas. \n",
    "\n",
    "Según los autores, debido a este efecto de aprovechar los n-gramas de palabras individuales basadas en sus caracteres, existe una mayor probabilidad de que las palabras raras obtengan una buena representación, ya que sus n-gramas basados en caracteres deben aparecer en otras palabras del corpus.\n",
    "\n",
    "Vamos a la práctica.\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "a4a1607f",
   "metadata": {},
   "source": [
    "## <span style=\"color:#4361EE\">Instalación de FastText</span>"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "c5b62f0e",
   "metadata": {},
   "source": [
    "`$ git clone https://github.com/facebookresearch/fastText.git`\n",
    "\n",
    "`$ cd fastText`\n",
    "\n",
    "`$ sudo pip install .`\n",
    "\n",
    "`$ # or :`\n",
    "\n",
    "`$ sudo python setup.py install`"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "4110a8c3",
   "metadata": {},
   "source": [
    "Si todo va bien, el siguiente comando debería funcionar:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "44451ab0",
   "metadata": {},
   "outputs": [],
   "source": [
    "import fasttext as ft"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "88d739c0",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Help on module fasttext.FastText in fasttext:\n",
      "\n",
      "NAME\n",
      "    fasttext.FastText\n",
      "\n",
      "DESCRIPTION\n",
      "    # Copyright (c) 2017-present, Facebook, Inc.\n",
      "    # All rights reserved.\n",
      "    #\n",
      "    # This source code is licensed under the MIT license found in the\n",
      "    # LICENSE file in the root directory of this source tree.\n",
      "\n",
      "FUNCTIONS\n",
      "    cbow(*kargs, **kwargs)\n",
      "    \n",
      "    load_model(path)\n",
      "        Load a model given a filepath and return a model object.\n",
      "    \n",
      "    read_args(arg_list, arg_dict, arg_names, default_values)\n",
      "    \n",
      "    skipgram(*kargs, **kwargs)\n",
      "    \n",
      "    supervised(*kargs, **kwargs)\n",
      "    \n",
      "    tokenize(text)\n",
      "        Given a string of text, tokenize it and return a list of tokens\n",
      "    \n",
      "    train_supervised(*kargs, **kwargs)\n",
      "        Train a supervised model and return a model object.\n",
      "        \n",
      "        input must be a filepath. The input text does not need to be tokenized\n",
      "        as per the tokenize function, but it must be preprocessed and encoded\n",
      "        as UTF-8. You might want to consult standard preprocessing scripts such\n",
      "        as tokenizer.perl mentioned here: http://www.statmt.org/wmt07/baseline.html\n",
      "        \n",
      "        The input file must must contain at least one label per line. For an\n",
      "        example consult the example datasets which are part of the fastText\n",
      "        repository such as the dataset pulled by classification-example.sh.\n",
      "    \n",
      "    train_unsupervised(*kargs, **kwargs)\n",
      "        Train an unsupervised model and return a model object.\n",
      "        \n",
      "        input must be a filepath. The input text does not need to be tokenized\n",
      "        as per the tokenize function, but it must be preprocessed and encoded\n",
      "        as UTF-8. You might want to consult standard preprocessing scripts such\n",
      "        as tokenizer.perl mentioned here: http://www.statmt.org/wmt07/baseline.html\n",
      "        \n",
      "        The input field must not contain any labels or use the specified label prefix\n",
      "        unless it is ok for those words to be ignored. For an example consult the\n",
      "        dataset pulled by the example script word-vector-example.sh, which is\n",
      "        part of the fastText repository.\n",
      "\n",
      "DATA\n",
      "    BOW = '<'\n",
      "    EOS = '</s>'\n",
      "    EOW = '>'\n",
      "    absolute_import = _Feature((2, 5, 0, 'alpha', 1), (3, 0, 0, 'alpha', 0...\n",
      "    displayed_errors = {}\n",
      "    division = _Feature((2, 2, 0, 'alpha', 2), (3, 0, 0, 'alpha', 0), 1310...\n",
      "    print_function = _Feature((2, 6, 0, 'alpha', 2), (3, 0, 0, 'alpha', 0)...\n",
      "    unicode_literals = _Feature((2, 6, 0, 'alpha', 2), (3, 0, 0, 'alpha', ...\n",
      "    unsupervised_default = {'autotuneDuration': 300, 'autotuneMetric': 'f1...\n",
      "\n",
      "FILE\n",
      "    /home/alvaro/miniconda3/envs/ia/lib/python3.8/site-packages/fasttext-0.9.2-py3.8-linux-x86_64.egg/fasttext/FastText.py\n",
      "\n",
      "\n"
     ]
    }
   ],
   "source": [
    "help(ft.FastText)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "40fdd41a",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['BOW',\n",
       " 'EOS',\n",
       " 'EOW',\n",
       " 'FastText',\n",
       " '__builtins__',\n",
       " '__cached__',\n",
       " '__doc__',\n",
       " '__file__',\n",
       " '__loader__',\n",
       " '__name__',\n",
       " '__package__',\n",
       " '__path__',\n",
       " '__spec__',\n",
       " 'absolute_import',\n",
       " 'cbow',\n",
       " 'division',\n",
       " 'load_model',\n",
       " 'print_function',\n",
       " 'skipgram',\n",
       " 'supervised',\n",
       " 'tokenize',\n",
       " 'train_supervised',\n",
       " 'train_unsupervised',\n",
       " 'unicode_literals']"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dir(ft)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "97d42654",
   "metadata": {},
   "source": [
    "Como podemos observar, contamos con dos modelos: **unsupervised** y **supervised**.\n",
    "\n",
    "Veamos cómo funcionan."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "5e399dd7",
   "metadata": {},
   "source": [
    "## <span style=\"color:#4361EE\">Modelo No Supervisado</span>"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "c2e9d441",
   "metadata": {},
   "source": [
    "Para este ejemplo de juguete, usemos los poemas de Daniel."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "b4110828",
   "metadata": {},
   "source": [
    "## <span style=\"color:#4361EE\">Uso del modelo con un ejemplo de juguete</span>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "e066186f",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Read 0M words\n",
      "Number of words:  184\n",
      "Number of labels: 0\n",
      "Progress: 100.0% words/sec/thread:  159899 lr:  0.000000 avg.loss:  4.119662 ETA:   0h 0m 0s\n"
     ]
    }
   ],
   "source": [
    "model = ft.train_unsupervised('../Datos/Poemas_Todo.txt')"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "eae390de",
   "metadata": {},
   "source": [
    "Veamos qué tiene el modelo por dentro:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "44abf1aa",
   "metadata": {},
   "outputs": [],
   "source": [
    "dir(model)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "a4066ec6",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['</s>', 'y', 'de', 'que', 'la', 'el', 'en', 'las', 'con', 'los', 'por', 'un', 'se', 'tu', 'del', 'a', 'una', 'no', 'mi', 'te', 'es', 'me', 'como', 'su', 'mis', 'para', 'sin', 'tus', 'entre', 'porque', 'más', 'lo', 'sus', 'nos', 'noche', 'ese', 'cada', 'hasta', 'pero', 'al', 'todo', 'manos', 'Y', 'cielo', 'cuerpo', 'día', 'ojos', 'ni', 'sueño', 'cuando', 'son', 'ser', 'qué', 'le', 'o', 'Me', 'Te', 'labios', 'ya', 'palabras', 'cuerpos', 'luna', 'cosas', 'si', 'mundo', 'vida', 'eres', 'mujer', 'sobre', 'esa', 'sólo', 'donde', 'este', 'toda', 'sueños', 'jamás', 'En', 'piel,', 'amor', 'e', 'Es', 'siempre', 'Ese', 'va', 'hay', 'visto', 'tan', 'he', 'gusta', 'sueños,', 'lleno', 'imagen', 'contra', 'mar', 'dos', 'Yo', 'He', 'rostro', 'La', 'noche,', 'pasión', 'nuestra', 'quien', 'deseo', 'ha', 'nubes', 'puedo', 'frío', 'abismo', 'alma', 'aire', 'lejos', 'profundo', 'éste', 'volando', 'ésta', 'aún', 'dulce', 'colores', 'cara', 'mismo', 'música', 'El', 'vez', 'letras', 'formas', 'apenas', 'ojos,', 'piel', 'dónde', 'desde', 'ante', 'mirando', 'tiempo', 'veces', 'somos', 'así', 'hacia', 'besos', 'brisa', 'entonces', 'nuestro', 'misma', 'esperar', 'A', 'tí', 'algún', 'busco', 'frase', 'suaves', 'No', 'Déjame', 'tí,', 'agua', 'luces', 'yo', 'mí', 'emociones', 'sonido', 'dame', 'casi', 'Un', 'soy', 'ideas', 'veo', 'mano', 'permanente', 'caricias', 'mientras', 'hace', 'Vivo', 'luz', 'polvo', 'esperando', 'lugar', 'quiero', 'fue', 'seres', 'sol', 'sirve', 'bajo', 'ver', 'colores,', 'brazos,', 'Ayer', 'culpa', 'hora', 'estar', 'llena', 'Con', 'pecho', 'maraña', 'voz', 'interminable', 'espacio', 'sé', 'lágrimas', 'aroma', 'noches', 'caen', 'fuego,', 'sueño,', 'gustó', 'suave', 'caminar', 'tú', 'Sé', 'fuego', 'noche.', 'locos', 'brillo', 'estado', 'queriendo', 'Por', 'nubes,', 'Mi', 'has', 'calor', 'engaño', 'vas', 'paraísos,', 'juntar', 'palabras,', 'silencio', 'De', 'respiro', 'pide', 'alto', 'tarde', 'labios,', 'pasión,', 'alma,', 'salir', 'digo', 'alguna', 'fiel', 'cruel', 
'Ud.', 'dentro', 'caída', 'melodías', 'culpa,', 'horas', 'placer', 'tal', 'pobre', 'así,', 'unos', 'cómo', 'pensamientos', 'tanto', 'medio', 'llama', 'espalda,', 'nada', 'deseos,', 'rincón', 'canto,', 'mañana,', 'pobres', 'hacer', 'centro', 'Todo', 'ahora', 'misterio', 'Aquí', 'miran', 'claro', 'viento', 'Allá', 'deseo,', 'sentir', 'pecho,', 'iban', 'Tengo', 'viaje', 'viento,', 'cuerpo,', 'partes', 'esos', 'repleta', 'almas', 'demás', 'piernas', 'espalda', 'capaz', 'canto', 'exquisito,', 'descansar', 'versos', 'pies,', 'goce', 'rojo', 'pliegues', 'querer,', 'hombre', 'hemos', 'mar,', 'punto', 'está', 'frente', 'manos,', 'entonces,', 'pequeños', 'placeres', 'soñando,', 'nada,', 'posa', 'besos,', 'miradas', 'Porque', 'existe', 'peligro', 'Ya', 'mirar', 'sonidos', 'iba', 'hecho', 'ciudad', 'color', 'era', 'destrucción', 'tango', 'habían', 'the', 'peligroso', 'Como', 'íntimo', 'roce', 'azul', 'brisa,', 'guerra', 'éstos', 'tiempo,', 'frases', 'Si', 'pues', 'imaginación,', 'escribir', 'beso', 'ir', 'puede', 'ausente', 'árboles', 'eso', 'lugares', 'formación', 'pequeña', 'vulnerable', 'matemática', 'pretendes', 'caprichos,', 'dado', 'decepciones,', 'dibuja', 'bello', 'pequeño', 'miras', 'cosa', 'cosmos', 'solo', 'Mójate', 'sueña.', 'locura', 'conduce', 'tú,', 'paraíso', 'nosotros', 'todas', 'imágenes', 'gran', 'tristeza,', 'allí', 'mí,', 'viajar', 'palabra', 'lado', 'figura', 'escuchar', 'Somos', 'vivir', 'cortas', 'caricias,', 'junto', 'manos.', 'infernal', 'despertar', 'líneas', 'Cómo', 'nocturno', 'ausentes,', 'estos', 'buenos', 'Para', 'algo', 'todo,', 'barco', 'felicidad,', 'tristeza', 'todos', 'é', 'ella', 'tristes', 'caer', 'sienes', 'verdugo', 'poder', 'sonrisa', 'frío,', 'cama', 'máscara', 'seré', '¿Qué', 'alba,', 'otro,', 'sino', 'nuestras', 'da', 'allá', 'vuelo', 'inspiración', 'mástil,', 'apacible', 'compartir', 'velas', 'mortal', 'esta', 'infinita', 'escuchen,', 'despacio,', 'dejando', 'cansados', 'teclas', 'miel,', 'eres,', 'estruendoso,', 'además', 'cobre', 
'alas', 'amor,', 'verdad', 'cascabel,', 'Dice', 'poco', 'ninfas', 'pozo', 'loco,', 'sos', 'baila', 'piedra', 'melodía', 'inútil', 'Vendrás', 'Pero', 'señal,', 'sentirme', 'estar,', 'qué,', 'pensamiento', 'hombros', 'olvido', 'aliento', 'lenguaje', 'condenado', 'historia', 'aire,', 'sigues...', 'ojos.', 'les', 'puedan', 'círculo', 'atardecer', 'mezclas', '(no', 'constante', 'mausoleo', 'han', 'actos', 'otoño,', 'llegar', 'ritmo', 'bohemia', 'cuello,', 'soledad', 'vuelta,', 'fuerte', 'tomarte', 'bailar', 'duradero', 'venido', 'presión', 'sean', 'perderme', 'perfumes', 'distancia', 'ti,', 'madrugada,', 'acerca', 'despacio', 'valentía', 'camino', 'eterno', 'realidad', 'lecho', 'copas', 'fija', 'olor', 'tanta', 'final', 'pequeñas', 'crea', 'entero,', 'nuevos', 'nuestros', 'mil', 'triste', 'anhelos.', 'pesar', 'simple', 'pura', 'conocer', 'esconden', 'repletas', 'noche;', 'excelsa', 'impiden', 'maravillas', 'ven', 'desnuda', 'desesperación', 'gota', 'breves', 'vida,', 'mía,', 'gusto', 'veo,', 'prisa,', 'muerte', 'sutil,', 'fría,', 'terciopelo,', 'flor', 'seda', 'fuerzas', 'caos', 'puertas', 'repleto', 'amar,', 'mujeres', 'belleza,', 'sí,', 'inefable', 'carne', 'tengo', 'lienzo', 'eclipse', '¿Estás', 'posaba', 'ropa,', 'vuela', 'bocas', 'gardenias', 'universo,', 'belleza', 'Tu', 'fondo', 'razón,', 'troncos', 'queda', 'placer...', 'máscaras,', 'eterna', 'anhelos,', 'hecha', 'forma', 'mañana', 'pensar', 'viajan', 'saltan', 'límite,', 'suerte', 'fin', 'saber', 'felizmente', 'pintura', 'tirando', 'estás', 'dolor', 'sí', 'ilumina', 'observas', 'estancia', 'Qué', 'base', 'total', 'única', 'destino', 'sola', 'nadie', 'incesante', 'Siente', 'respiro,', 'perderse', 'aquel', 'Hoy', 'completa', 'obra', 'pliegues;', 'llevarme', 'aviones', 'inundando', 'tibios', 'Eres', 'bailando', 'tranquilidad', 'siluetas', 'Regálame', 'tienes', 'Mujer,', 'deja', 'dar', 'danza', 'calles', 'júbilo', 'final.', 'vuelven', 'cielo,', 'pipa', 'transformar', 'amada', 'infinito.', 'concretar', 'razones,', 
'sensaciones', 'iré', 'cromáticos,', 'esperaría', 'suspiro', 'dice', 'cigarrillo', 'fielmente', 'especie', 'envolvente', 'cuidando', 'vientos', 'trabajo.', 'olvidado', 'allí,', 'estucado', 'ecos', 'enlazan', 'retoma', 'estrellas,', 'arrulla', 'bifurcados...', 'Mira', 'pies', 'última', 'observa', 'quizá', 'bonito', 'espejos', 'nadan', 'recuerdo', 'Soñaré', 'otoño', 'furia', 'musas', 'alba.', 'van', 'llueve', 'imagen,', 'conciencia', 'universo', 'dedos', 'Seremos', 'ellos,', 'suelo,', 'repente', 'espacio,', 'sienten', 'pintar', 'milongas', 'miedos,', 'tratan', 'habitación', 'tiene', 'sabor', 'duerme...', 'Entre', 'conciencia.', 'Sin', 'suelo', 'importar', 'crueldad', 'rota', 'darle', 'tibio,', 'culpas,', 'existirá', 'duda', 'hablo,', 'inerte.', 'recurso,', 'golondrinas,', 'extinción', 'déjame', 'versos,', 'así...', 'magia', 'quiere', 'Así', 'fluye', 'demencial.', 'Piazzolla', 'dulces', 'pensamiento,', 'milonga...', 'obra,', 'pintura,', 'pincel.', 'estoy', 'eco', 'muerte.', 'Esa', 'susurra', 'lejanía.', 'mares', 'jugaban', 'indiferencia,', 'confesar,', 'vengador,', 'inunda', 'exquisito', 'ventanas', 'están', 'modernidad...', 'sacudir...', 'vidrio', 'habrán', 'pasa,', 'adicta', 'vueltas', 'Vuélvete', 'mundo,', 'basura', 'hablan', 'libros', 'acercando', 'angustia,', 'moralidad', 'sistema', 'contaminadas', 'fábulas', 'allá,', 'sexual', 'Entonces', 'vaivén', 'aquí', 'cuerpos,', 'alegrías', 'figuras', 'placeres,', 'moldes', 'decirte', 'río', 'iniciado', 'torneado', 'engaño,', 'bossa,', 'minutos', 'controlado', 'efímeras,', 'ciclo', 'tristezas,', 'sed', 'mentiras,', 'escapar,', 'decir', 'momento', 'placer,', 'refiero.', 'cariño,', 'desespero', 'letras,', 'bella', 'escorpiones,', 'calmar', 'rocosa,', 'dirigirte', 'Una', 'mirarte', 'haciendo', 'gozo', 'sucios', 'valles', 'sábanas', 'vive', 'juego', 'refleja,', 'tiempo.', 'huesos', 'dragones', 'sábana...', 'recorriendo', 'propia', 'llegue', '(la', 'arena', 'mutan', 'repetición,', 'Despertemos', 'dejar', 'dormir', 
'recordarás.', 'cuello', 'monstruo', 'jardín', 'aquello', '¿Lo', 'ves?', 'fantasea', 'ráfagas', 'igual.', 'loco', 'Las', 'satisfacer', 'sombra', 'mujer,', 'volátil.', 'último', 'Más', 'tumultuosas,', 'siento', 'ramas', 'fíjate,', 'grabado', 'Cuando', 'quedar', 'voy', 'ajenos...', 'multiplica', 'manera', 'combina', 'convertirá', 'aguarda']\n"
     ]
    }
   ],
   "source": [
    "print(model.words)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "a3f71a19",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "184"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(model.words)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "aa70c2a4",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([-3.0347307e-03,  6.1599063e-03, -1.3848637e-03, -3.1608273e-03,\n",
       "        6.5026223e-03,  6.3184551e-03, -4.6846778e-03, -2.2446606e-03,\n",
       "        8.9806495e-03,  3.7756080e-03,  1.1548055e-03, -4.1064568e-04,\n",
       "        1.2033727e-03,  7.6985159e-03, -5.0652046e-03, -2.6354317e-03,\n",
       "       -2.4917696e-03,  7.0066674e-04, -6.4257701e-04, -4.5810528e-03,\n",
       "        7.7987812e-04,  2.5302153e-03,  1.4552932e-03, -3.1920432e-03,\n",
       "        4.0789871e-03, -2.7199865e-03,  3.3235915e-03, -2.0251921e-03,\n",
       "       -2.2953239e-03,  6.2371460e-03, -3.2799956e-03,  8.9811842e-04,\n",
       "       -2.5823724e-03, -4.4579767e-03, -7.3911683e-03, -1.6582392e-04,\n",
       "        2.8145671e-04, -3.2918619e-03, -2.6939313e-03, -5.1317532e-03,\n",
       "        2.5532853e-03,  6.5070619e-03,  1.0551591e-03,  3.2300127e-03,\n",
       "       -1.1547272e-03, -1.2535534e-03,  1.4361716e-03,  1.7629942e-03,\n",
       "       -2.8988889e-03,  2.0974237e-03, -5.8227698e-03,  8.1397137e-03,\n",
       "       -2.8023466e-03,  9.3555875e-04, -5.7098031e-04, -1.3912616e-03,\n",
       "       -2.5984519e-03, -1.8401010e-03,  4.7945334e-03,  4.0792683e-03,\n",
       "       -9.1515179e-04,  1.2824327e-03,  8.2337502e-03,  3.5726540e-03,\n",
       "        1.2970364e-03, -7.8641268e-04, -4.6819295e-03, -2.3468505e-04,\n",
       "        1.1813758e-03,  2.1915610e-03, -5.0192545e-03,  5.6051221e-03,\n",
       "        3.9733844e-03, -1.9811334e-03, -3.3446504e-03, -4.9200305e-04,\n",
       "        8.0256257e-03,  3.2809959e-03,  2.6104650e-03, -1.7321646e-03,\n",
       "        2.1246636e-04,  2.4007384e-03,  7.8085684e-03, -2.1588639e-03,\n",
       "       -1.0148875e-03, -6.3766371e-03,  7.4433633e-06,  4.0005501e-03,\n",
       "       -2.1610935e-03, -1.7342754e-03,  9.3174335e-03,  4.0063281e-03,\n",
       "        5.8194166e-03, -6.3449475e-03,  3.4241099e-03, -5.0236946e-03,\n",
       "        5.8575752e-03,  1.3908240e-03, -8.8924175e-04, -1.1105216e-03],\n",
       "      dtype=float32)"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "soy_vector = model.get_word_vector(\"soy\")\n",
    "soy_vector"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "92e3c255",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(100,)"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "soy_vector.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "77ee2226",
   "metadata": {},
   "outputs": [],
   "source": [
    "model.save_model(\"../Modelos/Poemas.bin\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "52c1490b",
   "metadata": {},
   "outputs": [],
   "source": [
    "model = ft.load_model(\"../Modelos/Poemas.bin\")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "438609a4",
   "metadata": {},
   "source": [
    "Por defecto, el modelo entrenado es skip-gram, pero también tenemos disponible la arquitectura cbow.\n",
    "\n",
    "\n",
    "<figure>\n",
    "<center>\n",
    "<img src=\"../Imagenes/cbow-skip-gram-3.png\" width=\"1600\" height=\"1000\" align=\"center\"/>\n",
    "</center>\n",
    "<figcaption>\n",
    "<p style=\"text-align:center\">Modelos BOW y Skip-gram</p>\n",
    "</figcaption>\n",
    "</figure>\n",
    "\n",
    "\n",
    "Fuente: Alvaro Montenegro"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "5d7ffac6",
   "metadata": {},
   "outputs": [],
   "source": [
    "# model_cbow = fasttext.train_unsupervised('../Datos/Poemas_Todo.txt', \"cbow\")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "bc78a355",
   "metadata": {},
   "source": [
    "En palabras de Facebook,\n",
    "\n",
    "***In practice, we observe that skipgram models works better with subword information than cbow.***"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "c0d4ddb3",
   "metadata": {},
   "source": [
    "[[Volver al inicio]](#Contenido)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "eab97c02",
   "metadata": {},
   "source": [
    "### <span style=\"color:#4CC9F0\">Jugando con los Parámetros</span>"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "268bec49",
   "metadata": {},
   "source": [
    "Dependiendo del problema, puede ser que los parámetros por defecto no sean los más adecuados.\n",
    "\n",
    "Para conocer todos los parámetros de FastText, podemos ingresar [aquí](https://fasttext.cc/docs/en/python-module.html#train_unsupervised-parameters)\n",
    "\n",
    "Por ejemplo, modifiquemos los parámetros **mínimos** y **máximos** de los n-gramas permitidos, la **dimensión** del vector, las **epochs** y la frecuencia mínima de palabras:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "2863c877",
   "metadata": {},
   "outputs": [],
   "source": [
    "import fasttext as ft"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "da0bbfc5",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Read 0M words\n",
      "Number of words:  846\n",
      "Number of labels: 0\n",
      "Progress: 100.0% words/sec/thread:   80034 lr:  0.000000 avg.loss:  3.471876 ETA:   0h 0m 0s\n"
     ]
    }
   ],
   "source": [
    "model = ft.train_unsupervised('../Datos/Poemas_Todo.txt', minCount=2, minn=2, maxn=5, dim=300)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "5aef7357",
   "metadata": {},
   "source": [
    "Verifiquemos la longitud de las palabras:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "778eac06",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(300,)"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "model.get_word_vector(\"soy\").shape"
   ]
  },
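  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "9b3e51a0",
   "metadata": {},
   "source": [
    "To make `minn` and `maxn` more concrete, here is a minimal sketch of how character n-grams can be extracted from a word. It assumes the convention described in the FastText paper, where each word is wrapped in the boundary symbols `<` and `>` before its n-grams are taken. The function `char_ngrams` is a hypothetical helper and is not part of the `fasttext` API; the library computes and hashes its subwords internally and may differ in details."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9b3e51a1",
   "metadata": {},
   "outputs": [],
   "source": [
    "def char_ngrams(word, minn=2, maxn=5):\n",
    "    # Hypothetical helper: boundary symbols mark the start and end of the word\n",
    "    wrapped = f'<{word}>'\n",
    "    ngrams = []\n",
    "    # Collect every substring whose length is between minn and maxn\n",
    "    for n in range(minn, maxn + 1):\n",
    "        for start in range(len(wrapped) - n + 1):\n",
    "            ngrams.append(wrapped[start:start + n])\n",
    "    return ngrams\n",
    "\n",
    "# Same minn/maxn values as in the training call above\n",
    "print(char_ngrams('soy', minn=2, maxn=5))"
   ]
  },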
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "4f38c146",
   "metadata": {},
   "source": [
    "[[Back to top]](#Contenido)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "6f169d23",
   "metadata": {},
   "source": [
    "### <span style=\"color:#4CC9F0\">Nearest Neighbor Words</span>"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "bd39decd",
   "metadata": {},
   "source": [
    "Since every word in the corpus has an associated vector, we can find nearby words using cosine similarity:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "edf7526f",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[(0.999998927116394, 'sueño,'),\n",
       " (0.9999988079071045, 'sueños,'),\n",
       " (0.9999986290931702, 'contra'),\n",
       " (0.9999985694885254, 'sueños'),\n",
       " (0.9999983906745911, 'suelo,'),\n",
       " (0.9999983906745911, 'cielo'),\n",
       " (0.9999983906745911, 'mundo'),\n",
       " (0.9999982714653015, 'manera'),\n",
       " (0.9999982118606567, 'lienzo'),\n",
       " (0.9999982118606567, 'repletas')]"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "model.get_nearest_neighbors('sueño')"
   ]
  },
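  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "4d7c2f18",
   "metadata": {},
   "source": [
    "The cosine similarity behind the neighbor scores can be written in a few lines. This is a minimal sketch using only the standard library; the helper `cosine_similarity` and the toy vectors are hypothetical, and it assumes neither vector is all zeros. With a trained model, the inputs would come from `model.get_word_vector`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4d7c2f19",
   "metadata": {},
   "outputs": [],
   "source": [
    "import math\n",
    "\n",
    "def cosine_similarity(u, v):\n",
    "    # Dot product divided by the product of the Euclidean norms\n",
    "    dot = sum(a * b for a, b in zip(u, v))\n",
    "    norm_u = math.sqrt(sum(a * a for a in u))\n",
    "    norm_v = math.sqrt(sum(b * b for b in v))\n",
    "    return dot / (norm_u * norm_v)\n",
    "\n",
    "# Toy vectors, for illustration only\n",
    "print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))  # same direction\n",
    "print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))  # orthogonal"
   ]
  },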
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "af3dddc5",
   "metadata": {},
   "source": [
    "[[Back to top]](#Contenido)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "1a49f3f8",
   "metadata": {},
   "source": [
    "## <span style=\"color:#4361EE\">Using the Model with Wikipedia</span>"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "5f112f1c",
   "metadata": {},
   "source": [
    "First, let's download the data we need:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "2821f43e",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "--2021-11-19 16:13:05--  http://mattmahoney.net/dc/enwik9.zip\n",
      "Resolving mattmahoney.net (mattmahoney.net)... 67.195.197.24\n",
      "Connecting to mattmahoney.net (mattmahoney.net)|67.195.197.24|:80... connected.\n",
      "HTTP request sent, awaiting response... 206 Partial Content\n",
      "Length: 322592222 (308M), 294168810 (281M) remaining [application/zip]\n",
      "Saving to: ‘../Datos/enwik9.zip’\n",
      "\n",
      "enwik9.zip          100%[+==================>] 307,65M   566KB/s    in 8m 28s  \n",
      "\n",
      "2021-11-19 16:21:34 (565 KB/s) - ‘../Datos/enwik9.zip’ saved [322592222/322592222]\n",
      "\n"
     ]
    }
   ],
   "source": [
    "# Full corpus (wget needs -O, uppercase, to choose the output file; lowercase -o writes a log file)\n",
    "#!wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2 -O ../Datos/enwiki-latest-pages-articles.xml.bz2\n",
    "# First 10^9 bytes of the corpus (about 1 GB)\n",
    "!wget -c http://mattmahoney.net/dc/enwik9.zip -P ../Datos"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "8768a281",
   "metadata": {},
   "source": [
    "Since the file is a `.zip` archive, let's extract its contents:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "9ffb77c0",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Archive:  ../Datos/enwik9.zip\n",
      "  inflating: ../Datos/enwik9         \n"
     ]
    }
   ],
   "source": [
    "!unzip ../Datos/enwik9.zip -d ../Datos"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "8131712b",
   "metadata": {},
   "source": [
    "Let's look at the first 2000 bytes of the file:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "5849eefa",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<mediawiki xmlns=\"http://www.mediawiki.org/xml/export-0.3/\" xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\" xsi:schemaLocation=\"http://www.mediawiki.org/xml/export-0.3/ http://www.mediawiki.org/xml/export-0.3.xsd\" version=\"0.3\" xml:lang=\"en\">\n",
      "  <siteinfo>\n",
      "    <sitename>Wikipedia</sitename>\n",
      "    <base>http://en.wikipedia.org/wiki/Main_Page</base>\n",
      "    <generator>MediaWiki 1.6alpha</generator>\n",
      "    <case>first-letter</case>\n",
      "      <namespaces>\n",
      "      <namespace key=\"-2\">Media</namespace>\n",
      "      <namespace key=\"-1\">Special</namespace>\n",
      "      <namespace key=\"0\" />\n",
      "      <namespace key=\"1\">Talk</namespace>\n",
      "      <namespace key=\"2\">User</namespace>\n",
      "      <namespace key=\"3\">User talk</namespace>\n",
      "      <namespace key=\"4\">Wikipedia</namespace>\n",
      "      <namespace key=\"5\">Wikipedia talk</namespace>\n",
      "      <namespace key=\"6\">Image</namespace>\n",
      "      <namespace key=\"7\">Image talk</namespace>\n",
      "      <namespace key=\"8\">MediaWiki</namespace>\n",
      "      <namespace key=\"9\">MediaWiki talk</namespace>\n",
      "      <namespace key=\"10\">Template</namespace>\n",
      "      <namespace key=\"11\">Template talk</namespace>\n",
      "      <namespace key=\"12\">Help</namespace>\n",
      "      <namespace key=\"13\">Help talk</namespace>\n",
      "      <namespace key=\"14\">Category</namespace>\n",
      "      <namespace key=\"15\">Category talk</namespace>\n",
      "      <namespace key=\"100\">Portal</namespace>\n",
      "      <namespace key=\"101\">Portal talk</namespace>\n",
      "    </namespaces>\n",
      "  </siteinfo>\n",
      "  <page>\n",
      "    <title>AaA</title>\n",
      "    <id>1</id>\n",
      "    <revision>\n",
      "      <id>32899315</id>\n",
      "      <timestamp>2005-12-27T18:46:47Z</timestamp>\n",
      "      <contributor>\n",
      "        <username>Jsmethers</username>\n",
      "        <id>614213</id>\n",
      "      </contributor>\n",
      "      <text xml:space=\"preserve\">#REDIRECT [[AAA]]</text>\n",
      "    </revision>\n",
      "  </page>\n",
      "  <page>\n",
      "    <title>AlgeriA</title>\n",
      "    <id>5</id>\n",
      "    <revision>\n",
      "      <id>18063769</id>\n",
      "      <timestamp>2005-07-03T11:13:13Z</timestamp>\n",
      "      <contributor>\n",
      "        <username>Docu</username>\n",
      "        <id>8029</id>\n",
      "      </contributor>\n",
      "      <minor />\n",
      "      <comment>addi"
     ]
    }
   ],
   "source": [
    "!head -c 2000 ../Datos/enwik9"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "f7792c38",
   "metadata": {},
   "source": [
    "Because a raw Wikipedia dump contains a large amount of HTML/XML markup, we will preprocess it with the `wikifil.pl` script written by Matt Mahoney. The original script is available from his personal page [here](http://mattmahoney.net/).\n",
    "\n",
    "The script is written in Perl and is run below:"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "ca2ed66c",
   "metadata": {
    "tags": []
   },
   "source": [
    "### <span style=\"color:#4CC9F0\">Preprocessing</span>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "8e996d11",
   "metadata": {},
   "outputs": [],
   "source": [
    "!perl ../Datos/wikifil.pl ../Datos/enwik9 > ../Datos/fil9"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "29342f65",
   "metadata": {},
   "source": [
    "We can inspect part of the result with the following command:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "a00dabd8",
   "metadata": {},
   "outputs": [],
   "source": [
    "!head -c 2000 ../Datos/fil9"
   ]
  },
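  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "e05a9c63",
   "metadata": {},
   "source": [
    "To give an idea of what this kind of cleaning involves, here is a rough sketch in Python. It is an assumption-laden simplification and not a port of `wikifil.pl`, which does considerably more work on the wiki markup. The helper `simple_clean` is hypothetical: it only removes anything that looks like a tag, lowercases the text and keeps the letters a-z separated by single spaces."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e05a9c64",
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "\n",
    "def simple_clean(text):\n",
    "    # Replace anything that looks like an XML/HTML tag with a space\n",
    "    text = re.sub(r'<[^>]*>', ' ', text)\n",
    "    # Lowercase, then turn every run of characters outside a-z into one space\n",
    "    text = re.sub(r'[^a-z]+', ' ', text.lower())\n",
    "    # Trim leading and trailing spaces\n",
    "    return text.strip()\n",
    "\n",
    "# Made-up fragment in the style of the raw dump\n",
    "print(simple_clean('<title>AlgeriA</title> Hello, World!'))"
   ]
  },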
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "a28c2483",
   "metadata": {},
   "source": [
    "[[Back to top]](#Contenido)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "5d5ce2ea",
   "metadata": {
    "tags": []
   },
   "source": [
    "### <span style=\"color:#4CC9F0\">Training the Model</span>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "39e9714a",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Read 124M words\n",
      "Number of words:  218316\n",
      "Number of labels: 0\n",
      "Progress:  31.5% words/sec/thread:   45623 lr:  0.034237 avg.loss:  0.601562 ETA:   0h51m49s"
     ]
    }
   ],
   "source": [
    "model = ft.train_unsupervised('../Datos/fil9')"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "7e39e3cc",
   "metadata": {},
   "source": [
    "[[Back to top]](#Contenido)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "3f1bae27",
   "metadata": {
    "tags": []
   },
   "source": [
    "### <span style=\"color:#4CC9F0\">Nearest Words</span>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d22e3aad",
   "metadata": {},
   "outputs": [],
   "source": [
    "model.get_nearest_neighbors('asparagus')"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "2de11846",
   "metadata": {},
   "source": [
    "[[Back to top]](#Contenido)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "f2b25be5",
   "metadata": {},
   "source": [
    "### <span style=\"color:#4CC9F0\">Curiosities</span>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "490c1d57",
   "metadata": {},
   "outputs": [],
   "source": [
    "model.get_nearest_neighbors('pidgey')"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "fa3e18d9",
   "metadata": {},
   "source": [
    "[[Back to top]](#Contenido)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "0f3c1aa9",
   "metadata": {},
   "source": [
    "### <span style=\"color:#4CC9F0\">Word Analogies</span>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bb552179",
   "metadata": {},
   "outputs": [],
   "source": [
    "model.get_analogies(\"berlin\", \"germany\", \"france\")"
   ]
  },
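  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "71f8b2d4",
   "metadata": {},
   "source": [
    "The analogy query can be read as vector arithmetic: take the vector of the first word, subtract the second, add the third, and look for the words closest to the result. The sketch below illustrates that idea with hand-made two-dimensional vectors. The vectors, the tiny vocabulary and the helpers `cosine` and `analogy` are all hypothetical, and the library's own implementation may differ in details such as normalization."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "71f8b2d5",
   "metadata": {},
   "outputs": [],
   "source": [
    "import math\n",
    "\n",
    "# Hand-made toy vectors: the first axis separates the two countries,\n",
    "# the second axis marks the word as a capital city\n",
    "toy = {\n",
    "    'germany': [1.0, 0.0],\n",
    "    'berlin': [1.0, 1.0],\n",
    "    'france': [-1.0, 0.0],\n",
    "    'paris': [-1.0, 1.0],\n",
    "    'italy': [0.0, -1.0],\n",
    "}\n",
    "\n",
    "def cosine(u, v):\n",
    "    dot = sum(a * b for a, b in zip(u, v))\n",
    "    return dot / (math.hypot(*u) * math.hypot(*v))\n",
    "\n",
    "def analogy(a, b, c, vectors):\n",
    "    # Query vector: a - b + c\n",
    "    query = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]\n",
    "    # Rank every word except the three inputs by cosine similarity to the query\n",
    "    candidates = [w for w in vectors if w not in (a, b, c)]\n",
    "    return max(candidates, key=lambda w: cosine(query, vectors[w]))\n",
    "\n",
    "print(analogy('berlin', 'germany', 'france', toy))"
   ]
  },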
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "87d475c7",
   "metadata": {},
   "source": [
    "[[Back to top]](#Contenido)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "2de2b4bd",
   "metadata": {
    "tags": []
   },
   "source": [
    "## <span style=\"color:#4361EE\">Supervised Model</span>"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "cf713f51",
   "metadata": {},
   "source": [
    "For this example, we will use the dataset of questions from [the cooking section of Stack Exchange](https://cooking.stackexchange.com/):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1fc5a85d",
   "metadata": {},
   "outputs": [],
   "source": [
    "!wget https://dl.fbaipublicfiles.com/fasttext/data/cooking.stackexchange.tar.gz && tar xvzf cooking.stackexchange.tar.gz"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ac16dfec",
   "metadata": {},
   "outputs": [],
   "source": [
    "!head cooking.stackexchange.txt"
   ]
  },
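  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "2ac46e9f",
   "metadata": {},
   "source": [
    "Each line of the file holds one question preceded by one or more labels, and FastText recognizes labels by the `__label__` prefix. Below is a minimal sketch of how such a line can be split into labels and text. The helper `split_labels` and the sample line are made up for illustration and are not part of the `fasttext` API."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2ac46ea0",
   "metadata": {},
   "outputs": [],
   "source": [
    "def split_labels(line, prefix='__label__'):\n",
    "    # Tokens that start with the prefix are labels; the rest is the question text\n",
    "    tokens = line.split()\n",
    "    labels = [t[len(prefix):] for t in tokens if t.startswith(prefix)]\n",
    "    text = ' '.join(t for t in tokens if not t.startswith(prefix))\n",
    "    return labels, text\n",
    "\n",
    "# Made-up line in the same format as the dataset\n",
    "sample = '__label__baking __label__bread Which pan is best for banana bread ?'\n",
    "print(split_labels(sample))"
   ]
  },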
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "45aee0b0",
   "metadata": {},
   "outputs": [],
   "source": [
    "!wc cooking.stackexchange.txt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2563fca0",
   "metadata": {},
   "outputs": [],
   "source": [
    "!head -n 12404 cooking.stackexchange.txt > cooking.train"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0cb6a149",
   "metadata": {},
   "outputs": [],
   "source": [
    "!tail -n 3000 cooking.stackexchange.txt > cooking.valid"
   ]
  },
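  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "b8d03157",
   "metadata": {},
   "source": [
    "The `head`/`tail` split above takes the first lines for training and the last lines for validation. The same idea can be sketched in Python, which may help on systems without those commands. The helper `split_lines` is hypothetical, and it assumes the two parts should not overlap."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b8d03158",
   "metadata": {},
   "outputs": [],
   "source": [
    "def split_lines(lines, n_train, n_valid):\n",
    "    # First n_train lines for training, last n_valid lines for validation\n",
    "    if n_train + n_valid > len(lines):\n",
    "        raise ValueError('train and validation parts would overlap')\n",
    "    return lines[:n_train], lines[len(lines) - n_valid:]\n",
    "\n",
    "# Toy example with ten numbered lines\n",
    "train, valid = split_lines([f'line {i}' for i in range(10)], n_train=7, n_valid=3)\n",
    "print(len(train), len(valid))"
   ]
  },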
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "ad216d34",
   "metadata": {},
   "source": [
    "[[Back to top]](#Contenido)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "703c1260",
   "metadata": {},
   "source": [
    "### <span style=\"color:#4CC9F0\">Using the Model</span>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1f44e006",
   "metadata": {},
   "outputs": [],
   "source": [
    "import fasttext"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "181029db",
   "metadata": {},
   "outputs": [],
   "source": [
    "model = fasttext.train_supervised(input=\"cooking.train\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6608fa51",
   "metadata": {},
   "outputs": [],
   "source": [
    "model.predict(\"Which baking dish is best to bake a banana bread ?\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "695d73bb",
   "metadata": {},
   "outputs": [],
   "source": [
    "model.test(\"cooking.valid\")"
   ]
  },
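  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "5e6f7a12",
   "metadata": {},
   "source": [
    "`model.test` returns a tuple with the number of examples, the precision at `k` and the recall at `k`, with `k=1` by default. The sketch below shows how those two metrics can be computed from ranked predictions. The function `precision_recall_at_k` and the toy labels are hypothetical; the sketch pools the counts over all examples, which is an assumption about how the totals are aggregated."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5e6f7a13",
   "metadata": {},
   "outputs": [],
   "source": [
    "def precision_recall_at_k(examples, k=1):\n",
    "    # examples: list of (ranked predicted labels, set of true labels)\n",
    "    n_correct = n_predicted = n_true = 0\n",
    "    for predicted, true in examples:\n",
    "        top_k = predicted[:k]\n",
    "        n_correct += sum(1 for label in top_k if label in true)\n",
    "        n_predicted += len(top_k)\n",
    "        n_true += len(true)\n",
    "    # Precision: correct / predicted; recall: correct / true labels\n",
    "    return n_correct / n_predicted, n_correct / n_true\n",
    "\n",
    "# Toy predictions, for illustration only\n",
    "toy = [\n",
    "    (['baking', 'bread'], {'baking', 'bread'}),\n",
    "    (['sauce', 'cheese'], {'cheese'}),\n",
    "]\n",
    "print(precision_recall_at_k(toy, k=1))"
   ]
  },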
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "d5c9d6b7",
   "metadata": {},
   "source": [
    "[[Back to top]](#Contenido)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "cfde74bf",
   "metadata": {
    "tags": []
   },
   "source": [
    "### <span style=\"color:#4CC9F0\">Improving the Model</span>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6551eb5d",
   "metadata": {},
   "outputs": [],
   "source": [
    "!cat cooking.stackexchange.txt | sed -e \"s/\\([.\\!?,'/()]\\)/ \\1 /g\" | tr \"[:upper:]\" \"[:lower:]\" > cooking.preprocessed.txt\n",
    "!head -n 12404 cooking.preprocessed.txt > cooking.train\n",
    "!tail -n 3000 cooking.preprocessed.txt > cooking.valid"
   ]
  },
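  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "d3914c8b",
   "metadata": {},
   "source": [
    "The shell pipeline above puts spaces around punctuation marks and lowercases the text. A roughly equivalent sketch in Python is shown below. The helper `normalize` is hypothetical, and small differences from the `sed`/`tr` behavior are possible, for example in how non-ASCII letters are lowercased."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d3914c8c",
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "\n",
    "def normalize(line):\n",
    "    # Surround each listed punctuation mark with spaces, then lowercase\n",
    "    line = re.sub(r\"([.!?,'/()])\", r\" \\1 \", line)\n",
    "    return line.lower()\n",
    "\n",
    "# Made-up question, for illustration only\n",
    "print(normalize('Which baking dish is best? (Banana bread)'))"
   ]
  },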
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0cf3d073",
   "metadata": {},
   "outputs": [],
   "source": [
    "model = fasttext.train_supervised(input=\"cooking.train\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "413932d0",
   "metadata": {},
   "outputs": [],
   "source": [
    "model.test(\"cooking.valid\")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "f64f56cf",
   "metadata": {},
   "source": [
    "[[Back to top]](#Contenido)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "16b5306e",
   "metadata": {},
   "source": [
    "### <span style=\"color:#4CC9F0\">Automatic Hyperparameter Tuning</span>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1b731e89",
   "metadata": {},
   "outputs": [],
   "source": [
    "model = fasttext.train_supervised(input='cooking.train', autotuneValidationFile='cooking.valid')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1ff21932",
   "metadata": {},
   "outputs": [],
   "source": [
    "model.test(\"cooking.valid\")"
   ]
  },
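  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "06b7e2f5",
   "metadata": {},
   "source": [
    "The automatic search runs within a time budget. As a hedged sketch, the budget can be changed with the `autotuneDuration` argument (in seconds) described in the FastText autotune documentation. The value of 600 is an arbitrary choice for illustration, and the cell assumes that `cooking.train` and `cooking.valid` were created by the cells above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "06b7e2f6",
   "metadata": {},
   "outputs": [],
   "source": [
    "import fasttext\n",
    "\n",
    "# Assumes cooking.train and cooking.valid already exist in the working directory\n",
    "model = fasttext.train_supervised(\n",
    "    input='cooking.train',\n",
    "    autotuneValidationFile='cooking.valid',\n",
    "    autotuneDuration=600,  # search budget in seconds (arbitrary example value)\n",
    ")\n",
    "\n",
    "# Number of examples, precision at 1 and recall at 1 on the validation file\n",
    "print(model.test('cooking.valid'))"
   ]
  },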
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "7c7e61a3",
   "metadata": {},
   "source": [
    "[[Back to top]](#Contenido)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}