{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Data science on Twitter\n", "\n", "\n", "Twitter is an indispensable resource for data scientists as well as for the broader data science community. With the right connections, you can use Twitter to learn data science, discover new technologies, computational tools and methodologies, and contribute to and build a community of data scientists working for the social good. This type of value is otherwise generally available only to attendees of top data science conferences on disruptive data science, open data science and data science for good. Indeed, with a good Twitter list, you can bring much of this content directly to your Twitter feed!\n", "\n", "Data science is a highly diverse and interdisciplinary field, but does data science Twitter chatter reflect its interdisciplinary nature? Are there distinct communities of data scientists that interact with and cater to distinct sub-fields? To begin answering this question, we will walk you through a simple analysis of a week's worth of data-science-related tweets.\n", "\n", "## A data science Twitter network\n", "\n", "Tweets were collected using a tweepy listener (see [1] for a tutorial on building a Twitter listener) and stored in a text file named \"data_science_twitter.txt\". Let's first load the tweets and extract user mentions to take a quick look at the volume of data science tweets from this week." 
] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import os\n", "import sys\n", "import json\n", "\n", "def tweets_n_edges(tweet_file):\n", "    # Load tweets (one JSON object per line) and collect (author, mentioned user) edges\n", "    tweets = []\n", "    edges = []\n", "\n", "    with open(tweet_file, \"r\") as f:\n", "        for line in f:\n", "            if line == \"\\n\":  # skip blank lines between tweets\n", "                continue\n", "            try:\n", "                tweet = json.JSONDecoder().raw_decode(line)[0]\n", "                usr_mentions = tweet['entities']['user_mentions']\n", "                for mention in usr_mentions:\n", "                    # ignore self-mentions\n", "                    if tweet['user']['screen_name'] != mention['screen_name']:\n", "                        edges.append((tweet['user']['screen_name'], mention['screen_name']))\n", "                tweets.append(tweet)\n", "            except (ValueError, KeyError, TypeError):  # malformed line or missing fields\n", "                continue\n", "\n", "    return (tweets, edges)\n", "\n", "\n", "tweets, edges = tweets_n_edges(\"data_science_twitter.txt\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Tweets and network edges (links between Twitter users) were gathered based on user mentions. How many tweets and user mentions were there?" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "There are 159600 tweets about data science this week, and 162070 user mentions!\n" ] } ], "source": [ "print(\"There are %s tweets about data science this week, and %s user mentions!\" % (len(tweets), len(edges)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The data science Twitter community is incredibly active: we saw almost 160,000 tweets within a single week! There seems to be just as much interaction within the community, since there is about the same number of user mentions, not including self-mentions.\n", "\n", "But what does the network actually look like? 
To build a network and find the most influential data science Twitter users, we will use the NetworkX [2] package to create a directed graph and to calculate eigenvector centrality (a measure of network influence) among the nodes (Twitter users). The resulting network is plotted using Gephi [3]." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[(1, (u'GilPress', 0.38942565243403915)),\n", " (2, (u'KirkDBorne', 0.30906334335611996)),\n", " (3, (u'Forbes', 0.23035596746895132)),\n", " (4, (u'BernardMarr', 0.21142119479688257)),\n", " (5, (u'bobehayes', 0.2072355059058224)),\n", " (6, (u'kdnuggets', 0.15597621686762647)),\n", " (7, (u'Ronald_vanLoon', 0.15518713444196847)),\n", " (8, (u'LinkedIn', 0.12561861905035457)),\n", " (9, (u'DataScienceCtrl', 0.11756733241544594)),\n", " (10, (u'BoozAllen', 0.11138358070618962))]" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import networkx as nx\n", "\n", "G = nx.DiGraph()  # initiate a directed graph\n", "G.add_edges_from(edges)  # add edges to the graph from user mentions\n", "ev_cent = nx.eigenvector_centrality(G, max_iter=10000)  # compute eigenvector centrality\n", "\n", "# rank users by centrality, highest first, and keep the top 10 network influencers\n", "top10 = sorted(ev_cent.items(), key=lambda x: x[1], reverse=True)[:10]\n", "list(enumerate(top10, 1))  # pair each user with a rank from 1 to 10" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "Nodes represent Twitter handles and the edges between the nodes represent user mentions. The size and color of the nodes correspond to eigenvector centrality, which, again, is one measure of network influence. Let's take a quick peek at the top 10 influencers (who are also plotted above):\n", "\n", "
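To make the idea of eigenvector centrality more concrete, here is a minimal sketch that computes it by power iteration on a tiny, made-up mention network. The function name `toy_eigenvector_centrality`, the toy edge list and the tolerance are all assumptions for illustration; this is not the NetworkX implementation, only a rough approximation of the same idea: a user scores highly when mentioned by other high-scoring users.

```python
from collections import defaultdict
from math import sqrt


def toy_eigenvector_centrality(edges, max_iter=1000, tol=1e-9):
    # Hypothetical helper: edges are (author, mentioned_user) pairs.
    nodes = sorted(set(node for edge in edges for node in edge))
    successors = defaultdict(list)
    for source, target in edges:
        successors[source].append(target)

    # Start from a uniform score for every node.
    scores = dict((node, 1.0 / len(nodes)) for node in nodes)
    for _ in range(max_iter):
        previous = scores
        # Keep the old score and add the scores of everyone who mentions you.
        scores = dict(previous)
        for source in nodes:
            for target in successors[source]:
                scores[target] += previous[source]
        # Rescale so the scores have unit Euclidean length.
        norm = sqrt(sum(value * value for value in scores.values()))
        scores = dict((node, value / norm) for node, value in scores.items())
        # Stop once the scores no longer change noticeably.
        change = sum(abs(scores[node] - previous[node]) for node in nodes)
        if change < len(nodes) * tol:
            return scores
    raise RuntimeError('power iteration did not converge')


# Made-up mention network: three users mention hub, and hub mentions a.
toy_edges = [('a', 'hub'), ('b', 'hub'), ('c', 'hub'),
             ('hub', 'a'), ('a', 'b'), ('b', 'c')]
scores = toy_eigenvector_centrality(toy_edges)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

In this toy network the user mentioned by everyone else ends up ranked first, and a user mentioned only by a low-scoring account ends up last, which is the same intuition behind the top 10 list above.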