{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "![](imgs/miredtwitter_header2.png) \n", "
\n", "

\n", "Recomendación de Información basada en
\n", "Análisis de Redes Sociales
\n", "y Procesamiento de Lenguaje Natural\n", "

\n", "

Trabajo Final de la Licenciatura en Ciencias de la Computación

\n", "

Pablo Gabriel Celayes

\n", "

Martes 30 de mayo de 2017

\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Resumen\n", "\n", "* Introducción\n", "\n", "* Herramientas\n", "\n", "* Datos\n", "\n", "* Predicción social\n", "\n", "* Agregando PLN\n", "\n", "* Conclusiones" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Introducción\n", "## Idea original\n", "\n", "- Recomendador *personalizado* de artículos basado en contenido ( NLP )\n", "- Mejorarlo con información social ( de fuentes externas o relaciones *inferidas* ) \n", "- Preferencias de usuarios de Cogfor\n", "\n", "![](imgs/cogfor.png)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Mutación\n", "\n", "- Set the datos propio ( usuarios, preferencias, conexiones ) \n", "- Predecir preferencias usando información social\n", "- Combinar con recomendación basada en contenido\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Herramientas\n", "\n", "## Datos\n", "![](imgs/tweepy.png)\n", "![](imgs/sqlalchemy.png)\n", "\n", "## Grafos\n", "![](imgs/networkx.png)\n", "![](imgs/graphtool.png)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Análisis\n", "![](imgs/numpy.png)\n", "![](imgs/pandas.png)\n", "![](imgs/jupyter.png)\n", "\n", "## ML + PLN\n", "![](imgs/sklearn.png)\n", "![](imgs/nltk.png)\n", "![](imgs/gensim.png)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Visualización\n", "![](imgs/bokeh.png)\n", "![](imgs/gephi_small.png)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Datos: Grafo social\n", "\n", "- Todos los usuarios que sigo en Twitter\n", "- relación $\\texttt{seguir}$ entre ellos\n", "- $1213$ nodos-usuarios\n", "- $40489$ aristas-relaciones " ] }, { 
"cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "![](imgs/miredtwitter.png)\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "# Datos: Contenido\n", "\n", "- *timelines* entre 18/3/2017 y 17/4/2017 (+ *retweets* y *favs*)\n", "- $163173$ tweets totales.\n", "- $109040$ en español\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Predicción social\n", "\n", "¿ Dado un usuario, cuanto puedo saber de sus preferencias sabiendo las de su entorno ?\n", "\n", "- **Contenido** = tweets en español\n", "- **Preferencias** = retweets\n", "- **Entorno** = seguidos + seguidos-por-seguidos\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Selección de usuarios\n", "\n", "* Al menos $100$ (re)tweets\n", "* Al menos $100$ usuarios en su entorno\n", "* $98$ usuarios" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Tweets visibles\n", "\n", "* Contenido compartido por $u$ o sus seguidos\n", "* Excluímos contenido generado por $u$\n", "* $ T_u := (\\bigcup_{x \\in \\{ u \\} \\cup \\texttt{seguidos}(u)} \\texttt{timeline}(x)) - \\{ t \\in T | \\texttt{autor}(t) = u \\} $\n", "\n", "* máximo $10000$ ( submuestra de negativos si es necesario )" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Extracción de características\n", "\n", "* Entorno $ E_u = (\\bigcup_{x \\in \\{ u \\} \\cup \\texttt{seguidos}(u)} \\texttt{seguidos}(x)) - \\{ u \\} $\n", "* $E_u = \\{u_1, u_2, \\ldots , u_n \\}$ y $T_u = \\{ t_1, \\ldots, t_m \\}$\n", "\n", "$$\n", " M_u := [ \\verb|tweet_en_tl|(t_i, u_j) ]_{\\substack{ 1 \\leq i \\leq m \\\\ 1 \\leq j \\leq n}} \n", "$$\n", "\n", "\n", "$$\n", " y_u := [ \\texttt{tweet_en_tl}(t_i, u) ]_{ 1 \\leq i \\leq m } \n", "$$\n", "\n" ] 
}, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Classification problem\n", "\n", "* Predict $y_u$ from the rows of $M_u$\n", "* Split:\n", "  * $70\\%$ training ($M^{en}_u, y^{en}_u$)\n", "  * $10\\%$ tuning ($M^{aj}_u, y^{aj}_u$)\n", "  * $20\\%$ evaluation ($M^{ev}_u, y^{ev}_u$)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Support Vector Machines\n", "\n", "* Goal: **maximize** the margin and **minimize** errors\n", "* *Kernel* functions make it possible to find **nonlinear** boundaries:\n", "  * *Radial Basis Function*\n", "  * Polynomial\n", "\n", "![](imgs/svm_linsep_err.png)\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Classification quality\n", "\n", "$$\\texttt{precision} := \\frac{|\\{x_i | f(x_i) = 1 \\text{ and } y_i = 1 \\}|}{|\\{x_i | f(x_i) = 1 \\}|}$$\n", "\n", "$$ $$\n", "\n", "$$\\texttt{recall} := \\frac{|\\{x_i | f(x_i) = 1 \\text{ and } y_i = 1 \\}|}{|\\{x_i | y_i = 1 \\}|}$$\n", "\n", "$$ $$\n", "\n", "$$\\texttt{F1} := \\frac{2}{\\frac{1}{\\texttt{precision}} + \\frac{1}{\\texttt{recall}}} = \\frac{2 * \\texttt{precision} * \\texttt{recall} }{\\texttt{precision} + \\texttt{recall}}$$\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Parameter tuning\n", "\n", "* Exhaustive search with $\\verb|GridSearchCV|$\n", "* $3$-fold cross-validation\n", "* Objective: maximize $F1$\n", "* Grid:\n", "```\n", "{\n", "    \"C\": [ 0.01, 0.1, 1 ],\n", "    \"class_weight\": [ \"balanced\", None ],\n", "    \"gamma\": [ 0.1, 1, 10 ],\n", "    \"kernel\": [ \"rbf\", \"poly\" ]\n", "}\n", "```\n", "* $C$: controls the trade-off between margin and errors\n", "* $class\\_weight$: give more weight to the minority class?\n", "* $gamma$: shape of the decision boundary" ] }, { "cell_type": "markdown", 
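"metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [
"A minimal sketch of how the search on the previous slide could be set up with scikit-learn. The grid, the $3$-fold cross-validation and the $F1$ objective come from that slide; the synthetic `M_train` and `y_train` are assumptions standing in for a real training split, and this is not necessarily how the original experiments were wired.\n",
"\n",
"```python\n",
"import numpy as np\n",
"from sklearn.model_selection import GridSearchCV\n",
"from sklearn.svm import SVC\n",
"\n",
"# Grid copied from the parameter tuning slide.\n",
"param_grid = {\n",
"    'C': [0.01, 0.1, 1],\n",
"    'class_weight': ['balanced', None],\n",
"    'gamma': [0.1, 1, 10],\n",
"    'kernel': ['rbf', 'poly'],\n",
"}\n",
"\n",
"# Hypothetical stand-in for a training split: random binary features,\n",
"# with a label that depends on the first two columns.\n",
"rng = np.random.RandomState(0)\n",
"M_train = rng.randint(0, 2, size=(120, 8))\n",
"y_train = ((M_train[:, 0] + M_train[:, 1]) >= 1).astype(int)\n",
"\n",
"# Exhaustive search, 3-fold cross-validation, F1 as the objective.\n",
"search = GridSearchCV(SVC(), param_grid, scoring='f1', cv=3)\n",
"search.fit(M_train, y_train)\n",
"print(search.best_params_)\n",
"print(search.best_score_)\n",
"```\n",
"\n",
"After fitting, `search.best_estimator_` holds the model refit with the best parameter combination, which can then be scored on held-out data.\n"
] }, { "cell_type": "markdown", 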
"metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Resultados\n", "\n", "* $F1$ sobre $M^{aj}_u$\n", "* Promedio $0,84$\n", "\n", "![](imgs/f1svalidsocial__.png)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "# *¡Y sin tener en cuenta el contenido!*\n", "\n", "\n", "\n", "![](imgs/robot2.png)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Agregando PLN\n" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true, "slideshow": { "slide_type": "subslide" } }, "source": [ "## Selección de usuarios\n", "\n", "* $F1 < 0,75$ en $M^{aj}_u$ ( $23$ usuarios )\n", "\n", "* $10$ usuarios más (al azar)" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true, "slideshow": { "slide_type": "subslide" } }, "source": [ "## Pre-procesamiento\n", "\n", "- Normalización\n", "- Tokenizado\n", "- Extracción de frases\n", "- Diccionario de términos\n", "- *Bag of terms*\n", "- LDA\n" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true, "slideshow": { "slide_type": "subslide" } }, "source": [ "## Normalización\n", "\n", "![](imgs/normalize.png)" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true, "slideshow": { "slide_type": "subslide" } }, "source": [ "## Tokenizado\n", "\n", "![](imgs/tokenize.png)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Extracción de frases\n", "\n", "* $\\verb|Phraser|$ de $\\verb|gensim|$ \n", "* Bigramas $a\\ b$ relevantes:\n", " * $min\\_count = 5$\n", " * $cnt(a, b) \\geq min\\_count$\n", " * $ \\frac{(cnt(a, b) - min\\_count) * N}{cnt(a) * cnt(b)} > 10 $\n", "* 2 pasadas sobre el corpus tokenizado ( bigramas y trigramas )" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true, "slideshow": { "slide_type": "subslide" } }, "source": [ "## Diccionario de términos\n", "\n", "* $T$ retokenizado con frases\n", "* 
term = word or phrase\n", "* significant (occurs at least $3$ times).\n", "* informative (occurs in fewer than $30\\%$ of the tweets).\n", "\n", "* Dictionary $D$ of $26201$ terms.\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## *Bag of terms*\n", "\n", "- Text $t$ $\\rightarrow$ multiset (*bag*) of the terms in $t$.\n", "- Order does not matter, but repetitions do.\n", "- Tweets ($\\leq 140$ characters) are usually plain *sets* ($0$ or $1$ occurrences of each term).\n", "- Fixed dictionary $D = \\{ t_1, \\ldots, t_{26201} \\}$ $\\rightarrow$ vector of integer (in practice boolean) features:\n", "\n", "$$ v_{BOT}(tweet) = [count(t_i, tokens(tweet))]_{i=1}^{26201} $$\n", "\n", "- $v_{BOT}$ is stored as a *sparse* vector:\n", "\n", "![](imgs/bagofterms.png)\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## LDA\n", "\n", "* Discovers underlying topics in texts\n", "\n", "* Dimensionality reduction (from *term* space to *topic* space)\n", "\n", "* We tried models with $10$ and $20$ topics\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Example: LDA with 10 topics\n", "\n", "![](imgs/lda10.png)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Extending the features\n", "\n", "* Social features + LDA:\n", "\n", "![](imgs/socialldavec.png)\n", " \n", "* We add scaling (each column rescaled to mean $0$ and variance $1$)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Evaluating on $M_u^{ev}$\n", "\n", "* $LDA10$\n", "  * improvement for $4$ users\n", "  * mean improvement of $2.8\\%$\n", "* $LDA20$\n", "  * improvement for $5$ users\n", "  * mean improvement of $3.8\\%$." 
] }, { "cell_type": "markdown", "metadata": { "collapsed": true, "slideshow": { "slide_type": "subslide" } }, "source": [ "![](imgs/f1lda_eval.png)" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true, "slideshow": { "slide_type": "subslide" } }, "source": [ "## Overfitting\n", "![](imgs/f1lda_train.png)\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Discretizing...\n", "\n", "* Overfitting may come from features that are *too* descriptive\n", "* We assign $1$ to each topic with score $\\geq 0.25$ and $0$ otherwise" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Improvements?\n", "\n", "* $LDAbool10$\n", "  * improvement for $8$ users\n", "  * mean improvement of $2.4\\%$\n", "* $LDAbool20$\n", "  * improvement for $7$ users\n", "  * mean improvement of $2.5\\%$.\n", "\n", "It helps more users, but the mean improvement is smaller" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true, "slideshow": { "slide_type": "subslide" } }, "source": [ "![](imgs/f1ldabool_eval.png)" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true, "slideshow": { "slide_type": "subslide" } }, "source": [ "## ...but the overfitting remains\n", "\n", "![](imgs/f1ldabool_train.png)\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Conclusions\n", "\n", "* Purely social prediction works better than expected\n", "* LDA improves the weak cases, but the overfitting still has to be solved\n", "* Twitter is a very good data source" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "# Next steps\n", "\n", "- Reduce overfitting\n", "- Additional features\n", "- Take temporality into account\n", "- Generalize (a model that does not depend on the user)\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Questions?\n", "\n", 
"![](imgs/twitter.png) @PCelayes\n", "\n", "![](imgs/github.jpg) https://github.com/pablocelayes/sna_classifier\n", "\n" ] } ], "metadata": { "celltoolbar": "Slideshow", "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.11+" } }, "nbformat": 4, "nbformat_minor": 0 }