{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Data science on twitter\n", "\n", "\n", "Twitter is an indispensable resource for data scientists as well as for the broader data science community. With the right connections, you can use twitter to learn data science, discover new technologies, computational tools and methodologies, and you can contribute to and build a community of data scientists working for the social good. This type of value is generally only available to attendees of top data science conferences on disruptive data science, open data science and data science for good. Indeed, with a good twitter list, you can bring much of this content directly to your twitter feed!\n", "\n", "Data science is a highly diverse and interdisciplinary field, but does data science twitter chatter reflect its interdisciplinary nature? Are there distinct communities of data scientists that interact with and cater to distinct sub-fields? To begin seeking an answer to this question, we will walk you through the simple analysis of a week's worth data science related tweets.\n", "\n", "## A data science twitter network\n", "\n", "Tweets were collected using a tweepy listener (see here1 for a tutorial on building a twitter listener), and stored in a text file named \"data_science_twitter.txt\". Let's first load the tweets and extract user mentions to take a quick look at the volume of data science tweets from this week." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import os\n", "import sys\n", "import json\n", "\n", "def tweets_n_edges(tweet_file):\n", " tweets=[]\n", " edges=[]\n", "\n", " for i in open(tweet_file,\"r\"):\n", " if i==\"\\n\":\n", " next\n", " else:\n", " try:\n", " tweet = json.JSONDecoder().raw_decode(i)[0]\n", " usr_mentions= tweet['entities']['user_mentions']\n", " if len(usr_mentions)>0:\n", " for ii in usr_mentions:\n", " if tweet['user']['screen_name'] != ii['screen_name']:\n", " edges.append((tweet['user']['screen_name'], ii['screen_name']))\n", " tweets.append(tweet)\n", " except: # if no user mentions, or something unexpected\n", " continue\n", "\n", " return (tweets,edges)\n", "\n", "\n", "tweets,edges = tweets_n_edges(\"data_science_twitter.txt\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Tweets and network edges (links between twitter users) were gathered based on user mentions. How many tweets and user mentions were there?" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "There are 159600 tweets about data science this week, and 162070 user mentions!\n" ] } ], "source": [ "print \"There are %s tweets about data science this week, and %s user mentions!\" % ( len(tweets), len(edges) )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The data science twitter community is incredibly active; we saw almost 160,000 tweets within a single week! And, there seems to be just as much interaction within the community, as there is about the same number of user mentions, not including self-mentions.\n", "\n", "But what does the network look actually like? To build a network and find the most influential data science twitter uses, we will use the NetworkX2 package to create a directed graph and to calculate eigenvector centrality (a measure of network influence) among the nodes (twitter users). The resulting network is plotted using Gephi3." 
] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[(1, (u'GilPress', 0.38942565243403915)),\n", " (2, (u'KirkDBorne', 0.30906334335611996)),\n", " (3, (u'Forbes', 0.23035596746895132)),\n", " (4, (u'BernardMarr', 0.21142119479688257)),\n", " (5, (u'bobehayes', 0.2072355059058224)),\n", " (6, (u'kdnuggets', 0.15597621686762647)),\n", " (7, (u'Ronald_vanLoon', 0.15518713444196847)),\n", " (8, (u'LinkedIn', 0.12561861905035457)),\n", " (9, (u'DataScienceCtrl', 0.11756733241544594)),\n", " (10, (u'BoozAllen', 0.11138358070618962))]" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import networkx as nx\n", "\n", "G=nx.DiGraph() # initiate a directed graph\n", "G.add_edges_from(edges) # add edges to the graph from user mentions\n", "ev_cent=nx.eigenvector_centrality(G,max_iter=10000) # compute eigenvector centrality\n", "\n", "ev_tuple = []\n", "for i in ev_cent.keys():\n", " ev_tuple.append((i,ev_cent[i]))\n", " \n", "zip(range(1,11)[::-1],sorted(ev_tuple,key=lambda x: x[1])[-10:])[::-1] # get the top 10 network influencers" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "Nodes represent twitter handles and the edges between the nodes represent user mentions. The size and color of the nodes correspond to eigenvector centrality values, which, again, is one measure of network influence. Let's take a quick peek at the top 10 influencers (who are also plotted above):\n", "\n", "
    \n", "
  1. GilPress
  2. \n", "
  3. KirkDBorne
  4. \n", "
  5. Forbes
  6. \n", "
  7. BernardMarr
  8. \n", "
  9. bobehayes
  10. \n", "
  11. kdnuggets
  12. \n", "
  13. Ronald_vanLoon
  14. \n", "
  15. LinkedIn
  16. \n", "
  17. DataScienceCtrl
  18. \n", "
  19. BoozAllen
  20. \n", "
\n", "\n", "The top 10 influencers include some of the most respected individuals and organizations in data science, and so their influence among data scientists on twitter is not at all surprising.\n", "\n", "However, data science is a highly interdisciplinary field. Different communities may have different topic foci and different community influencers. For data scientists working in different sub-fields or in different spheres of data science, it is important to know who the most influential figures in the various sub-domains are, as these will be the people/handles to follow for the most up-to date news, analyses, methods and tools. To find distinct data science communities, we will use a community detection algorithm implemented to work on top of the NetworkX package, Community4. It implements the louvain method5 for community detection." ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import community\n", "\n", "def get_communities(tweets, edges):\n", " G_un=nx.Graph()\n", " G_un.add_edges_from(edges)\n", " parts = community.best_partition(G_un)\n", " values = [parts.get(node) for node in G_un.nodes()]\n", "\n", " communities = {}\n", "\n", " for i in tweets:\n", " screen_name = i['user']['screen_name'].encode(\"ascii\",\"ignore\")\n", " raw_text = i['text'].encode(\"ascii\",\"ignore\")\n", " if screen_name in parts.keys() and i['lang'] in ('en','und'): # get english tweets\n", " comm_num = parts[screen_name]\n", " if comm_num in communities.keys():\n", " if screen_name in communities[comm_num].keys():\n", " text = communities[comm_num][screen_name]['raw_text']\n", " communities[comm_num][screen_name]['n_tweets'] += 1\n", " communities[comm_num][screen_name]['raw_text'] = ' '.join([text, raw_text]) \n", " else:\n", " communities[comm_num][screen_name] = {\n", " 'raw_text' : raw_text,\n", " 'n_tweets' : 1 \n", " }\n", " else:\n", " communities[comm_num] = {}\n", " communities[comm_num][screen_name] = {\n", " 'raw_text' : raw_text,\n", " 'n_tweets' : 1 \n", " }\n", " else:\n", " continue\n", "\n", " return communities\n", "\n", "communities = get_communities(tweets,edges)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1234 distinct communities were detected \n", "\n", "Here are the top 10 most populous communities:\n", "\n", "Community 25 has 2883 members\n", "Community 3 has 2841 members\n", "Community 11 has 2564 members\n", "Community 22 has 2027 members\n", "Community 13 has 1629 members\n", "Community 17 has 1619 members\n", "Community 39 has 1442 members\n", "Community 38 has 822 members\n", "Community 19 has 785 members\n", "Community 45 has 776 members\n" ] } ], "source": [ "community_size = []\n", "for i in communities.keys():\n", " community_size.append((i,len(communities[i].keys())))\n", "\n", "print \"%s distinct communities were detected \\n\" % len(communities.keys())\n", "\n", "print \"Here are the top 10 most populous communities:\\n\"\n", "for i,j in sorted(community_size,key=lambda x: x[1])[::-1][:10]:\n", " print \"Community %s has %s members\" % (i,j)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Chatter among data science communities\n", "\n", "We see that there are a number of highly populous communities detected in the larger network, and many more communities that are smaller in size. Let's take a quick look at a few of the most populous communities. 
] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import community\n", "\n", "def get_communities(tweets, edges):\n", "    \"\"\"Group each user's tweet text by detected community.\"\"\"\n", "    G_un = nx.Graph()\n", "    G_un.add_edges_from(edges)\n", "    parts = community.best_partition(G_un)  # Louvain partition: {screen_name: community number}\n", "\n", "    communities = {}\n", "\n", "    for tweet in tweets:\n", "        screen_name = tweet['user']['screen_name'].encode(\"ascii\", \"ignore\")\n", "        raw_text = tweet['text'].encode(\"ascii\", \"ignore\")\n", "        if screen_name in parts and tweet['lang'] in ('en', 'und'):  # keep English tweets\n", "            comm_num = parts[screen_name]\n", "            comm_users = communities.setdefault(comm_num, {})\n", "            if screen_name in comm_users:\n", "                comm_users[screen_name]['n_tweets'] += 1\n", "                comm_users[screen_name]['raw_text'] += ' ' + raw_text\n", "            else:\n", "                comm_users[screen_name] = {\n", "                    'raw_text': raw_text,\n", "                    'n_tweets': 1\n", "                }\n", "\n", "    return communities\n", "\n", "communities = get_communities(tweets, edges)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1234 distinct communities were detected \n", "\n", "Here are the top 10 most populous communities:\n", "\n", "Community 25 has 2883 members\n", "Community 3 has 2841 members\n", "Community 11 has 2564 members\n", "Community 22 has 2027 members\n", "Community 13 has 1629 members\n", "Community 17 has 1619 members\n", "Community 39 has 1442 members\n", "Community 38 has 822 members\n", "Community 19 has 785 members\n", "Community 45 has 776 members\n" ] } ], "source": [ "community_size = []\n", "for comm_num in communities:\n", "    community_size.append((comm_num, len(communities[comm_num])))\n", "\n", "print \"%s distinct communities were detected \\n\" % len(communities)\n", "\n", "print \"Here are the top 10 most populous communities:\\n\"\n", "for comm_num, size in sorted(community_size, key=lambda x: x[1], reverse=True)[:10]:\n", "    print \"Community %s has %s members\" % (comm_num, size)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Chatter among data science communities\n", "\n", "We see a number of highly populous communities in the larger network, and many more communities that are smaller in size. Let's take a quick look at a few of the most populous communities. We will identify the most influential users within each community we examine, and use topic modeling to find the popular topics each community focuses on. Our analyses will focus on communities 11, 13 and 38.\n", "\n", "Let's start by visually inspecting the sub-network associated with community 11:\n", "\n", "\n", "\n", "We see a number of influential handles in this subnetwork, but the top 5 are:\n", "\n",
    \n", "
  1. BernardMarr
  2. \n", "
  3. DataScienceCtrl
  4. \n", "
  5. EvanSinar
  6. \n", "
  7. Datafloq
  8. \n", "
  9. kdnuggets
  10. \n", "
\n", "\n", "But what is this data science community talking about? To take a quick look at the types of topics that this community is interested in, we will use the topic modeling package Topik6 from Continuum. Topik gives a high-level interface to wildly popular topic modeling libraries in Python.\n", "\n", "First we want to set up a directory structure for Topik to read from. We make each twitter user in the community a document that Topik can read:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import re\n", "\n", "def make_dir_struc(communities):\n", " os.makedirs(\"communities\")\n", "\n", " for i in communities.keys():\n", " os.makedirs(\"./communities/\"+str(i))\n", " for ii in communities[i].keys():\n", " if communities[i][ii]['n_tweets']>2:\n", " raw_text = communities[i][ii]['raw_text']\n", "\n", " # try to get rid of links\n", " taw_text = re.sub(r'\\w+:\\/{2}[\\d\\w-]+(\\.[\\d\\w-]+)*(?:(?:\\/[^\\s/]*))*', '', raw_text)\n", " raw_text = ' '.join([iii for iii in raw_text.split() if iii[:4] !=\"http\"])\n", "\n", " # try to get rid of hashtags and user mentions \n", " raw_text = ' '.join([iii for iii in raw_text.split() if \"#\" not in iii])\n", " raw_text = ' '.join([iii for iii in raw_text.split() if \"@\" not in iii])\n", "\n", " # clean up\n", " raw_text = raw_text.encode(\"ascii\",\"ignore\").replace('\\n', ' ')\n", " if len(raw_text.split()) > 100:\n", " comm_user = open(\"./communities/\"+str(i)+\"/\"+ii,\"w\")\n", " comm_user.write(raw_text)\n", " comm_user.close()\n", " \n", "make_dir_struc(communities)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's now build a topic model for community 11 and visualize the result. Topik enables us to do so in a very streamlined way. We will simply tokenize the data, input a list of stop words and the number of topics to search for, then build the model and visualize the results using Topik." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n", "
\n", " \n", " Loading BokehJS ...\n", "
" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/javascript": [ "\n", "(function(global) {\n", " function now() {\n", " return new Date();\n", " }\n", "\n", " if (typeof (window._bokeh_onload_callbacks) === \"undefined\") {\n", " window._bokeh_onload_callbacks = [];\n", " }\n", "\n", " function run_callbacks() {\n", " window._bokeh_onload_callbacks.forEach(function(callback) { callback() });\n", " delete window._bokeh_onload_callbacks\n", " console.info(\"Bokeh: all callbacks have finished\");\n", " }\n", "\n", " function load_libs(js_urls, callback) {\n", " window._bokeh_onload_callbacks.push(callback);\n", " if (window._bokeh_is_loading > 0) {\n", " console.log(\"Bokeh: BokehJS is being loaded, scheduling callback at\", now());\n", " return null;\n", " }\n", " if (js_urls == null || js_urls.length === 0) {\n", " run_callbacks();\n", " return null;\n", " }\n", " console.log(\"Bokeh: BokehJS not loaded, scheduling load and callback at\", now());\n", " window._bokeh_is_loading = js_urls.length;\n", " for (var i = 0; i < js_urls.length; i++) {\n", " var url = js_urls[i];\n", " var s = document.createElement('script');\n", " s.src = url;\n", " s.async = false;\n", " s.onreadystatechange = s.onload = function() {\n", " window._bokeh_is_loading--;\n", " if (window._bokeh_is_loading === 0) {\n", " console.log(\"Bokeh: all BokehJS libraries loaded\");\n", " run_callbacks()\n", " }\n", " };\n", " s.onerror = function() {\n", " console.warn(\"failed to load library \" + url);\n", " };\n", " console.log(\"Bokeh: injecting script tag for BokehJS library: \", url);\n", " document.getElementsByTagName(\"head\")[0].appendChild(s);\n", " }\n", " };\n", "\n", " var js_urls = ['https://cdn.pydata.org/bokeh/release/bokeh-0.11.1.min.js', 'https://cdn.pydata.org/bokeh/release/bokeh-widgets-0.11.1.min.js', 'https://cdn.pydata.org/bokeh/release/bokeh-compiler-0.11.1.min.js'];\n", "\n", " var inline_js = [\n", " function(Bokeh) {\n", " Bokeh.set_log_level(\"info\");\n", " },\n", " \n", " function(Bokeh) {\n", " Bokeh.$(\"#7a139010-a634-42e5-ae05-6b7e61bcb528\").text(\"BokehJS successfully loaded\");\n", " },\n", " function(Bokeh) {\n", " console.log(\"Bokeh: injecting CSS: https://cdn.pydata.org/bokeh/release/bokeh-0.11.1.min.css\");\n", " Bokeh.embed.inject_css(\"https://cdn.pydata.org/bokeh/release/bokeh-0.11.1.min.css\");\n", " console.log(\"Bokeh: injecting CSS: https://cdn.pydata.org/bokeh/release/bokeh-widgets-0.11.1.min.css\");\n", " Bokeh.embed.inject_css(\"https://cdn.pydata.org/bokeh/release/bokeh-widgets-0.11.1.min.css\");\n", " }\n", " ];\n", "\n", " function run_inline_js() {\n", " for (var i = 0; i < inline_js.length; i++) {\n", " inline_js[i](window.Bokeh);\n", " }\n", " }\n", "\n", " if (window._bokeh_is_loading === 0) {\n", " console.log(\"Bokeh: BokehJS loaded, going straight to plotting\");\n", " run_inline_js();\n", " } else {\n", " load_libs(js_urls, function() {\n", " console.log(\"Bokeh: BokehJS plotting callback run at\", now());\n", " run_inline_js();\n", " });\n", " }\n", "}(this));" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "import nltk\n", "from topik import read_input, tokenize, vectorize, run_model, visualize\n", "from topik.visualizers.termite_plot import termite\n", "from bokeh.plotting import figure, output_file, show\n", "from bokeh.io import output_notebook\n", "\n", "output_notebook()\n", "\n", "def topic_model(directory, stopwords, ntopics):\n", " raw_data = read_input(directory)\n", " content_field = \"text\"\n", " 
"    raw_data = ((hash(item[content_field]), item[content_field]) for item in raw_data)\n", "    tokenized_corpus = tokenize(raw_data, stopwords=stopwords)\n", "    vectorized_corpus = vectorize(tokenized_corpus)\n", "    model = run_model(vectorized_corpus, ntopics=ntopics)\n", "    return model" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n", "\n",
\n", "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "

<Bokeh Notebook handle for In[23]>

" ], "text/plain": [ "" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "stopwords=['amp','get','got','hey','hmm','hoo','hop','iep','let','ooo','par',\n", " 'pdt','pln','pst','wha','yep','yer','aest','didn','nzdt','via',\n", " 'one','com','new','like','great','make','top','awesome','best',\n", " 'good','wow','yes','say','yay','would','thanks','thank','going','ht',\n", " 'new','use','should','could','best','really','see','want','nice', 'rt',\n", " 'while','know','big','data','bigdatablogs']\n", "\n", "stopwords=set(stopwords+nltk.corpus.stopwords.words(\"english\"))\n", "ntopics = 40\n", "\n", "directory = \"./communities/11/\" # start with community 11\n", "model = topic_model(directory, stopwords, ntopics)\n", "show(termite(model))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The termite plot7 is a nice way to visualize topic modeling results. The x axis lists the topic numbers and the y axis lists frequent topic words. The size of the circle corresponds to the frequency of that word with respect to a topic. The termite plot for community 11 seems to shows us that the twitter chatter for this community includes broad data science topics like machine intelligence, analytics, data mining, but also includes a substantial amount of chatter about data science related blogs, blog posts or stories, as well as data science conferences such as ODSC Boston, tutorials, online classes and careers.\n", "\n", "This community seems to reflect the general data science community, but also twitter handles who are influential community builders that routinely tweet about data science blogging, reporting and other community-focused topics, such as training, conferences and careers. Of course, it is no surprise then that DataScienceCtrl and kdnuggets are among most influential handles in this network. Not only are kdnuggest and DataScienceCtrl regularly the most active and respected sources of data science news and blog postings, but BernardMarr, EvanSinar and Datafloq are all highly respected and influential in the broader data science community." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "If we look at the next community, community 13 (above) we see that the most influential handles include\n", "\n", "
    \n", "
  1. BoozAllen
  2. \n", "
  3. LaurenNealPhD
  4. \n", "
  5. wendykan
  6. \n", "
  7. petrguerra
  8. \n", "
  9. kaggle
  10. \n", "
" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "
\n", "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "

<Bokeh Notebook handle for In[28]>

" ], "text/plain": [ "" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "directory = \"./communities/13/\"\n", "model = topic_model(directory, stopwords, ntopics)\n", "show(termite(model))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The topic modeling and termite plot for this network seems to show a focus on enterprise business analytics, industry applications, data science for social good and disruptive companies. This is consistent with the top influencers in this community. BoozAllen, LaurenNealPhD and petrguerra are all Booz Allen associated accounts, with LaurenNealPjD and petrguerra being a senior associate and VP at Booz Allen, respectively. Interestingly, Booz Allen and Kaggle co-sponsored the data science bowl this year, and we do indeed see a signature of the data science bowl with topics including terms related data science for social good in this community. kaggle and wendykan, a data scientist at kaggle, are also among the top influencers of this community." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "\n", "Lastly, if we take a look at community 38 (above) we see that the influential handles include\n", "\n", "
    \n", "
  1. CloudExpo
  2. \n", "
  3. ThingsExpo
  4. \n", "
  5. DevOpsSummit
  6. \n", "
  7. newrelic
  8. \n", "
  9. johnnewton
  10. \n", "
\n", "\n", "CloudExpo and ThingsExpo are both twitter handles devoted to cloud and IoT meetings." ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "
\n", "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "

<Bokeh Notebook handle for In[29]>

" ], "text/plain": [ "" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "directory = \"./communities/38/\"\n", "model = topic_model(directory, stopwords, ntopics)\n", "show(termite(model))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This last community is enriched for chatter related to IoT, cloud business analytics, online courses, networks and security. This very nicely reflects the fact that the most influential twitter handles are related to cloud conferences and open source analytics/software, and IoT conferences." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Concluding remarks\n", "\n", "We found that the data science twitter community is incredibly active and interactive, and that several important distinct communities exist and that the individual influencers for each community are experts in the subdomain or are otherwise highly regarded in the data science community. With Topik, topic modeling as well as the visualization of the results was incredibly stream-lined, and we can really gain a great deal of insight from collecting and analyzing even a single week of data science twitter chatter." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.11" } }, "nbformat": 4, "nbformat_minor": 0 }