{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Topic Modeling of Twitter Followers\n", "\n", "This notebook accompanies [this article on my blog](http://alexperrier.github.io/jekyll/update/2015/09/04/topic-modeling-of-twitter-followers.html).\n", "\n", "We use LDAvis to visualize several LDA models of the followers of the [@alexip](https://twitter.com/alexip) account.\n", "\n", "The LDA models were trained with the following parameters:\n", "\n", "* 10 topics, 10 passes, alpha = 0.001\n", "* 50 topics, 50 passes, alpha = 0.01\n", "* 40 topics, 100 passes, alpha = 0.001\n", "\n", "The data was extracted from Twitter with [this Python 2 script](https://github.com/alexperrier/datatalks/tree/master/twitter),\n", "and the dictionary and corpus were created with [this one](https://github.com/alexperrier/datatalks/tree/master/twitter).\n", "\n", "For the best results, set lambda between 0.5 and 0.6. Lowering lambda gives more weight to words that are discriminatory for the active topic, that is, the words that best define it.\n", "\n", "You can skip the first two models and jump to the last one (40 topics), which is the best.\n", "\n", "A working version of this notebook is available on [nbviewer](http://nbviewer.ipython.org/github/alexperrier/datatalks/blob/master/twitter/LDAvis.ipynb).\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Load the corpus and dictionary\n", "from gensim import corpora, models\n", "import pyLDAvis.gensim\n", "\n", "corpus = corpora.MmCorpus('data/alexip_followers_py27.mm')\n", "dictionary = corpora.Dictionary.load('data/alexip_followers_py27.dict')" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "
\n", "" ], "text/plain": [ "" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# First LDA model with 10 topics, 10 passes, alpha = 0.001\n", "lda = models.LdaModel.load('data/alexip_followers_py27_t10_p10_a001_b01.lda')\n", "followers_data = pyLDAvis.gensim.prepare(lda, corpus, dictionary)\n", "pyLDAvis.display(followers_data)" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "With K=10 topics, nearly all the topics are clustered together and difficult to distinguish.\n", "Even the topics that stand apart (4 and 9) are not very cohesive: topic 4, for instance, mixes bitcoin, ruby/rails and London.\n", "In the following example, we set K to 50 topics and increase the number of passes from 10 to 50." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "
\n", "" ], "text/plain": [ "" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Second LDA model with 50 topics, 50 passes\n", "lda = models.LdaModel.load('data/alexip_followers_py27_t50_p50_a001.lda')\n", "followers_data = pyLDAvis.gensim.prepare(lda, corpus, dictionary)\n", "pyLDAvis.display(followers_data)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The topics are much better spread out,\n", "and most of them are coherent.\n", "\n", "* 1 is about the social web\n", "* 2 is more colloquial and about daily life, with words like happy, lol, guys, sure, ever, day, ...\n", "* 12 is all about data science, with words like dataviz, spark, drivendata, ipython, machinelearning, ...\n", "* 5 is about bitcoins and cryptocurrencies\n", "* 17 most probably comes from French followers who slipped through our filtering of non-English accounts\n", "* 29 is about jobs in the UK\n", "\n", "The final model has fewer topics (40) and was allowed to converge further, with 100 passes.\n" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "
\n", "" ], "text/plain": [ "" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Final LDA model with 40 topics, 100 passes\n", "lda = models.LdaModel.load('data/alexip_followers_py27_t40_p100_a001.lda')\n", "followers_data = pyLDAvis.gensim.prepare(lda, corpus, dictionary)\n", "pyLDAvis.display(followers_data)\n" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "\n", "* 1 is about working life in general: work, love, well, time, people, email\n", "* 2 is about the social tech world in Boston, with discriminating words like sdlc, edcampldr, ... and the names of the big social network companies\n", "* 3 is more about brands and shopping: personal, things, brand, shop, love\n", "* 4 is about social media marketing: marketing, hubspot, customers, ... (with leakage from project management: pmp, exam)\n", "* 5 is a mess and shows French words that leaked through our language filtering\n", "* 6 is about bitcoins\n", "* 7 is about data science: rstats, bigdata, machinelearning, python, dataviz, ...\n", "* 8 is about rails, ruby and hosting. NewtonMa is also relevant, as the Boston ruby meetup is in Newton, MA.\n", "* 9 is about casinos and games\n", "* 10 could be about learning analytics (a less cohesive topic)\n", "\n", "* 13 is about python, pydata, conda, ... (with a bit of angular mixed in)\n", "\n", "etc.\n", "\n", "It appears that the last model, with K=40 topics and 100 passes, is the best so far.\n", "The top 10 topics are relevant and cohesive.
\n", "\n" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[u'0.055*app + 0.045*team + 0.043*contact + 0.043*idea + 0.029*quote + 0.022*free + 0.020*development + 0.019*looking + 0.017*startup + 0.017*build',\n", " u'0.033*socialmedia + 0.022*python + 0.015*collaborative + 0.014*economy + 0.010*apple + 0.007*conda + 0.007*pydata + 0.007*talk + 0.007*check + 0.006*anaconda',\n", " u'0.053*week + 0.041*followers + 0.033*community + 0.030*insight + 0.010*follow + 0.007*world + 0.007*stats + 0.007*sharing + 0.006*unfollowers + 0.006*blog',\n", " u'0.014*thx + 0.010*event + 0.008*app + 0.007*travel + 0.006*social + 0.006*check + 0.006*marketing + 0.005*follow + 0.005*also + 0.005*time',\n", " u'0.044*docker + 0.036*prodmgmt + 0.029*product + 0.018*productmanagement + 0.017*programming + 0.012*tipoftheday + 0.010*security + 0.009*javascript + 0.009*manager + 0.009*containers',\n", " u'0.089*love + 0.035*john + 0.026*update + 0.022*heart + 0.015*peace + 0.014*beautiful + 0.012*beauty + 0.010*life + 0.010*shanti + 0.009*stories',\n", " u'0.033*geek + 0.009*architecture + 0.007*code + 0.007*products + 0.007*parts + 0.007*charts + 0.007*software + 0.006*cryptrader + 0.006*moombo + 0.006*book',\n", " u'0.049*stories + 0.046*network + 0.044*virginia + 0.044*entrepreneur + 0.039*etmchat + 0.025*etmooc + 0.021*etm + 0.015*join + 0.014*deis + 0.010*today',\n", " u'0.056*slots + 0.053*bonus + 0.052*fsiug + 0.039*casino + 0.031*slot + 0.024*online + 0.014*free + 0.013*hootchat + 0.010*win + 0.009*bonuses',\n", " u'0.056*video + 0.043*add + 0.042*message + 0.032*blog + 0.027*posts + 0.027*media + 0.025*training + 0.017*check + 0.013*gotta + 0.010*insider']" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "lda.show_topics()\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { 
"kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.10" } }, "nbformat": 4, "nbformat_minor": 0 }