{ "metadata": { "name": "Chapter4" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Mining the Social Web, 1st Edition - Friends, Followers, and Setwise Operations (Chapter 4)\n", "\n", "If you only have 10 seconds...\n", "\n", "Twitter's new API will prevent you from running much of the code from _Mining the Social Web_, and this IPython Notebook shows you how to roll with the changes and adapt as painlessly as possible until an updated printing is available. In particular, it shows you how to authenticate before executing any API requests illustrated in this chapter. If you haven't already, it is highly recommended that you read the IPython Notebook file for Chapter 1 before attempting the examples in this chapter.\n", "\n", "If you have a couple of minutes...\n", "\n", "Twitter is officially retiring v1.0 of their API as of March 2013, with v1.1 of the API being the new status quo. There are a few fundamental differences that social web miners should consider (see Twitter's blog at https://dev.twitter.com/blog/changes-coming-to-twitter-api and https://dev.twitter.com/docs/api/1.1/overview). The changes most likely to affect an existing workflow are that authentication is now mandatory for *all* requests, rate limiting is now applied on a per-resource basis (as opposed to an overall rate limit based on a fixed number of requests per unit time), various platform objects have changed (for the better), and search semantics have changed to a \"pageless\" approach. 
All in all, the v1.1 API looks much cleaner and more consistent, and it should be a good thing longer-term although it may cause interim pains for folks migrating to it.\n", "\n", "The latest printing of Mining the Social Web (2012-02-22, Third release) reflects v1.0 of the API, and this document is intended to provide readers with updated examples from Chapter 4 of the book until a new printing provides updates.\n", "\n", "Unlike the IPython Notebook for Chapter 1, there is no filler in this notebook at this time. See the Chapter 1 notebook for a good introduction to using the Twitter API and all that it entails.\n", "\n", "As a reader of my book, I want you to know that I'm committed to helping you in any way that I can, so please reach out on Facebook at https://www.facebook.com/MiningTheSocialWeb or on Twitter at http://twitter.com/SocialWebMining if you have any questions or concerns in the meantime. I'd also love your feedback on whether or not you think that IPython Notebook is a good tool for tinkering with the source code for the book, because I'm strongly considering it as a supplement for each chapter.\n", "\n", "Regards - Matthew A. Russell\n", "\n", "\n", "## A Brief Technical Preamble\n", "\n", "* You will need to set your PYTHONPATH environment variable to point to the 'python_code' folder for the GitHub source code when launching this notebook, or some of the examples won't work because they import utility code that's located there.\n", "\n", "* Note that this notebook doesn't repeatedly redefine a connection to the Twitter API. It creates a connection one time and reuses it throughout the remainder of the examples in the notebook.\n", "\n", "* Arguments that are typically passed in through the command line are hardcoded in the examples for convenience. 
CLI arguments are typically in ALL_CAPS, so they're easy to spot and change as needed.\n", "\n", "* For simplicity, examples that harvest data are limited to small numbers so that it's easier to experiment with this notebook (given that @timoreilly, the principal subject of the examples, has vast numbers of followers).\n", "\n", "* The parenthetical file names at the end of the captions for the examples correspond to files in the 'python_code' folder of the GitHub repository.\n", "\n", "* As you'd learn from reading the book, you'll need to have a Redis server running because several of the examples in this chapter store and fetch data from it." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Example 4-1. Fetching extended information about a Twitter user" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import twitter\n", "import json\n", "\n", "# Go to http://twitter.com/apps/new to create an app and get these items\n", "# See https://dev.twitter.com/docs/auth/oauth for more information on Twitter's OAuth implementation\n", "\n", "CONSUMER_KEY = ''\n", "CONSUMER_SECRET = ''\n", "OAUTH_TOKEN = ''\n", "OAUTH_TOKEN_SECRET = ''\n", "\n", "auth = twitter.oauth.OAuth(OAUTH_TOKEN, OAUTH_TOKEN_SECRET,\n", "                           CONSUMER_KEY, CONSUMER_SECRET)\n", "\n", "t = twitter.Twitter(domain='api.twitter.com',\n", "                    api_version='1.1',\n", "                    auth=auth)\n", "\n", "screen_name = 'timoreilly'\n", "\n", "response = t.users.show(screen_name=screen_name)\n", "print json.dumps(response, sort_keys=True, indent=4)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Example 4-2. 
Using OAuth to authenticate and grab some friend data (friends_followers__get_friends.py)" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import sys\n", "import time\n", "import cPickle\n", "import twitter\n", "\n", "SCREEN_NAME = 'timoreilly' # XXX: IPython Notebook cannot prompt for input\n", "\n", "friends_limit = 10000\n", "\n", "ids = []\n", "wait_period = 2 # secs\n", "cursor = -1\n", "\n", "while cursor != 0:\n", "    if wait_period > 3600: # 1 hour\n", "        print >> sys.stderr, 'Too many retries. Saving partial data to disk and exiting'\n", "        f = file('%s.friend_ids' % str(cursor), 'wb')\n", "        cPickle.dump(ids, f)\n", "        f.close()\n", "        exit()\n", "\n", "    try:\n", "        response = t.friends.ids(screen_name=SCREEN_NAME, cursor=cursor)\n", "        ids.extend(response['ids'])\n", "        wait_period = 2\n", "    except twitter.api.TwitterHTTPError, e:\n", "        if e.e.code == 401:\n", "            print >> sys.stderr, 'Encountered 401 Error (Not Authorized)'\n", "            print >> sys.stderr, 'User %s is protecting their tweets' % (SCREEN_NAME, )\n", "            break # No response to read a cursor from, so stop here\n", "        elif e.e.code in (502, 503):\n", "            print >> sys.stderr, \\\n", "                'Encountered %i Error. Trying again in %i seconds' % \\\n", "                (e.e.code, wait_period)\n", "            time.sleep(wait_period)\n", "            wait_period *= 1.5\n", "            continue\n", "        elif e.e.code == 429: # v1.1 signals rate limiting with a 429\n", "            # v1.1 reports rate limits on a per resource basis\n", "            status = t.application.rate_limit_status(resources='friends')\n", "            status = status['resources']['friends']['/friends/ids']\n", "            now = time.time() # UTC\n", "            when_rate_limit_resets = status['reset'] # UTC\n", "            sleep_time = max(when_rate_limit_resets - now, 0)\n", "            print >> sys.stderr, \\\n", "                'Rate limit reached. 
Trying again in %i seconds' % (sleep_time,)\n", "            time.sleep(sleep_time)\n", "            continue\n", "        else:\n", "            raise e # Best to handle this on a case by case basis\n", "\n", "    cursor = response['next_cursor']\n", "    print >> sys.stderr, 'Fetched %i ids for %s' % (len(ids), SCREEN_NAME)\n", "    if len(ids) >= friends_limit:\n", "        break\n", "\n", "# Do something interesting with the ids\n", "\n", "print ids" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Example 4-3. Example 4-2 refactored to use two common utilities for OAuth and making API requests (friends_followers__get_friends_refactored.py)" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import sys\n", "import time\n", "import cPickle\n", "import twitter\n", "from twitter__util import makeTwitterRequest\n", "\n", "SCREEN_NAME = 'timoreilly' # XXX: IPython Notebook cannot prompt for input\n", "FRIENDS_LIMIT = 10000 # XXX: IPython Notebook cannot prompt for input\n", "\n", "def getFriendIds(screen_name=None, user_id=None, friends_limit=10000):\n", "\n", "    ids = []\n", "    cursor = -1\n", "    while cursor != 0:\n", "        params = dict(cursor=cursor)\n", "        if screen_name is not None:\n", "            params['screen_name'] = screen_name\n", "        else:\n", "            params['user_id'] = user_id\n", "\n", "        response = makeTwitterRequest(t.friends.ids, **params)\n", "\n", "        ids.extend(response['ids'])\n", "        cursor = response['next_cursor']\n", "        print >> sys.stderr, \\\n", "            'Fetched %i ids for %s' % (len(ids), screen_name or user_id)\n", "        if len(ids) >= friends_limit:\n", "            break\n", "\n", "    return ids\n", "\n", "if __name__ == '__main__':\n", "    ids = getFriendIds(SCREEN_NAME, friends_limit=FRIENDS_LIMIT)\n", "\n", "    # do something interesting with the ids\n", "\n", "    print ids" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Example 4-4. 
Harvesting, storing, and computing statistics about friends and followers (friends_followers__friend_follower_symmetry.py)" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import sys\n", "import locale\n", "import time\n", "import functools\n", "import twitter\n", "import redis\n", "\n", "# A template-like function for maximizing code reuse,\n", "# which is essentially a wrapper around makeTwitterRequest\n", "# with some additional logic in place for interfacing with\n", "# Redis\n", "from twitter__util import _getFriendsOrFollowersUsingFunc\n", "\n", "# Creates a consistent key value for a user given a screen name\n", "from twitter__util import getRedisIdByScreenName\n", "\n", "SCREEN_NAME = 'timoreilly' # XXX: IPython Notebook cannot prompt for input\n", "\n", "MAXINT = 10000 #sys.maxint\n", "\n", "# For nice number formatting\n", "locale.setlocale(locale.LC_ALL, '')\n", "\n", "# Connect using default settings for localhost\n", "r = redis.Redis()\n", "\n", "# Some wrappers around _getFriendsOrFollowersUsingFunc\n", "# that bind the first four arguments\n", "\n", "getFriends = functools.partial(_getFriendsOrFollowersUsingFunc,\n", "                               t.friends.ids, 'friend_ids', t, r)\n", "\n", "getFollowers = functools.partial(_getFriendsOrFollowersUsingFunc,\n", "                                 t.followers.ids, 'follower_ids', t, r)\n", "\n", "screen_name = SCREEN_NAME\n", "\n", "# get the data\n", "\n", "print >> sys.stderr, 'Getting friends for %s...' % (screen_name, )\n", "getFriends(screen_name, limit=MAXINT)\n", "\n", "print >> sys.stderr, 'Getting followers for %s...' 
% (screen_name, )\n", "getFollowers(screen_name, limit=MAXINT)\n", "\n", "# use redis to compute the numbers\n", "\n", "n_friends = r.scard(getRedisIdByScreenName(screen_name, 'friend_ids'))\n", "\n", "n_followers = r.scard(getRedisIdByScreenName(screen_name, 'follower_ids'))\n", "\n", "n_friends_diff_followers = r.sdiffstore('temp',\n", "    [getRedisIdByScreenName(screen_name, 'friend_ids'),\n", "     getRedisIdByScreenName(screen_name, 'follower_ids')])\n", "r.delete('temp')\n", "\n", "n_followers_diff_friends = r.sdiffstore('temp',\n", "    [getRedisIdByScreenName(screen_name, 'follower_ids'),\n", "     getRedisIdByScreenName(screen_name, 'friend_ids')])\n", "r.delete('temp')\n", "\n", "n_friends_inter_followers = r.sinterstore('temp',\n", "    [getRedisIdByScreenName(screen_name, 'follower_ids'),\n", "     getRedisIdByScreenName(screen_name, 'friend_ids')])\n", "r.delete('temp')\n", "\n", "print '%s is following %s' % (screen_name, locale.format('%d', n_friends, True))\n", "print '%s is being followed by %s' % (screen_name, locale.format('%d',\n", "    n_followers, True))\n", "print '%s of %s are not following %s back' % (locale.format('%d',\n", "    n_friends_diff_followers, True), locale.format('%d', n_friends, True),\n", "    screen_name)\n", "print '%s of %s are not being followed back by %s' % (locale.format('%d',\n", "    n_followers_diff_friends, True), locale.format('%d', n_followers, True),\n", "    screen_name)\n", "print '%s has %s mutual friends' \\\n", "    % (screen_name, locale.format('%d', n_friends_inter_followers, True))" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Example 4-5. 
Resolving basic user information such as screen names from IDs (friends_followers__get_user_info.py)" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import sys\n", "import json\n", "import redis\n", "\n", "# A makeTwitterRequest call through to the /users/lookup\n", "# resource, which accepts a comma separated list of up\n", "# to 100 screen names. Details are fairly uninteresting.\n", "# See also http://dev.twitter.com/doc/get/users/lookup\n", "from twitter__util import getUserInfo\n", "\n", "if __name__ == \"__main__\":\n", "    # XXX: IPython Notebook cannot prompt for input\n", "    screen_names = ['timoreilly', 'socialwebmining', 'ptwobrussell']\n", "\n", "    r = redis.Redis()\n", "\n", "    print json.dumps(\n", "        getUserInfo(t, r, screen_names=screen_names),\n", "        indent=4\n", "    )" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Example 4-7. Finding common friends/followers for multiple Twitterers, with output that's easier on the eyes (friends_followers__friends_followers_in_common.py)" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import sys\n", "import redis\n", "\n", "from twitter__util import getRedisIdByScreenName\n", "\n", "# A pretty-print function for numbers\n", "from twitter__util import pp\n", "\n", "r = redis.Redis()\n", "\n", "def friendsFollowersInCommon(screen_names):\n", "    r.sinterstore('temp$friends_in_common',\n", "                  [getRedisIdByScreenName(screen_name, 'friend_ids')\n", "                   for screen_name in screen_names]\n", "                  )\n", "\n", "    r.sinterstore('temp$followers_in_common',\n", "                  [getRedisIdByScreenName(screen_name, 'follower_ids')\n", "                   for screen_name in screen_names]\n", "                  )\n", "\n", "    print 'Friends in common for %s: %s' % (', '.join(screen_names),\n", "        pp(r.scard('temp$friends_in_common')))\n", "\n", "    print 'Followers in common for %s: %s' % (', '.join(screen_names),\n", "        pp(r.scard('temp$followers_in_common')))\n", "\n", "    # Clean up scratch 
workspace\n", "\n", "    r.delete('temp$friends_in_common')\n", "    r.delete('temp$followers_in_common')\n", "\n", "# Note:\n", "# The assumption is that the screen names you are\n", "# supplying have already been added to Redis.\n", "# See friends_followers__get_friends__refactored.py (Example 4-3)\n", "\n", "# XXX: IPython Notebook cannot prompt for input\n", "friendsFollowersInCommon(['timoreilly', 'socialwebmining'])" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Example 4-8. Crawling friends/followers connections (friends_followers__crawl.py)" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import sys\n", "import redis\n", "import functools\n", "from twitter__util import getUserInfo\n", "from twitter__util import _getFriendsOrFollowersUsingFunc\n", "\n", "SCREEN_NAME = 'timoreilly' # XXX: IPython Notebook cannot prompt for input\n", "\n", "r = redis.Redis()\n", "\n", "# Some wrappers around _getFriendsOrFollowersUsingFunc that\n", "# create convenience functions\n", "\n", "getFriends = functools.partial(_getFriendsOrFollowersUsingFunc,\n", "                               t.friends.ids, 'friend_ids', t, r)\n", "getFollowers = functools.partial(_getFriendsOrFollowersUsingFunc,\n", "                                 t.followers.ids, 'follower_ids', t, r)\n", "\n", "def crawl(\n", "    screen_names,\n", "    friends_limit=10000,\n", "    followers_limit=10000,\n", "    depth=1,\n", "    friends_sample=0.2, #XXX\n", "    followers_sample=0.0,\n", "    ):\n", "\n", "    getUserInfo(t, r, screen_names=screen_names)\n", "    for screen_name in screen_names:\n", "        friend_ids = getFriends(screen_name, limit=friends_limit)\n", "        follower_ids = getFollowers(screen_name, limit=followers_limit)\n", "\n", "        friends_info = getUserInfo(t, r, user_ids=friend_ids,\n", "                                   sample=friends_sample)\n", "\n", "        followers_info = getUserInfo(t, r, user_ids=follower_ids,\n", "                                     sample=followers_sample)\n", "\n", "        next_queue = [u['screen_name'] for u in friends_info + followers_info]\n", "\n", "        d = 
1\n", "        while d < depth:\n", "            d += 1\n", "            (queue, next_queue) = (next_queue, [])\n", "            for _screen_name in queue:\n", "                friend_ids = getFriends(_screen_name, limit=friends_limit)\n", "                follower_ids = getFollowers(_screen_name, limit=followers_limit)\n", "\n", "                next_queue.extend(friend_ids + follower_ids)\n", "\n", "                # Note that this function takes a kw between 0.0 and 1.0 called\n", "                # sample that allows you to crawl only a random sample of nodes\n", "                # at any given level of the graph\n", "\n", "                getUserInfo(t, r, user_ids=next_queue)\n", "\n", "crawl([SCREEN_NAME])\n", "\n", "# The data is now in the system. Do something interesting. For example,\n", "# find someone's most popular followers as an indicator of potential influence.\n", "# See friends_followers__calculate_avg_influence_of_followers.py" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Example 4-9. Calculating a Twitterer's most popular followers (friends_followers__calculate_avg_influence_of_followers.py)" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import sys\n", "import json\n", "import locale\n", "import redis\n", "from prettytable import PrettyTable\n", "\n", "# Pretty printing numbers\n", "from twitter__util import pp\n", "\n", "# These functions create consistent keys from\n", "# screen names and user id values\n", "from twitter__util import getRedisIdByScreenName\n", "from twitter__util import getRedisIdByUserId\n", "\n", "SCREEN_NAME = 'timoreilly' # XXX: IPython Notebook cannot prompt for input\n", "\n", "locale.setlocale(locale.LC_ALL, '')\n", "\n", "def calculate():\n", "    r = redis.Redis() # Default connection settings on localhost\n", "\n", "    follower_ids = list(r.smembers(getRedisIdByScreenName(SCREEN_NAME,\n", "                        'follower_ids')))\n", "\n", "    followers = r.mget([getRedisIdByUserId(follower_id, 'info.json')\n", "                        for follower_id in follower_ids])\n", "    followers = [json.loads(f) for f in followers if 
f is not None]\n", "\n", "    freqs = {}\n", "    for f in followers:\n", "        cnt = f['followers_count']\n", "        if not freqs.has_key(cnt):\n", "            freqs[cnt] = []\n", "\n", "        freqs[cnt].append({'screen_name': f['screen_name'], 'user_id': f['id']})\n", "\n", "    # It could take a few minutes to calculate freqs, so store a snapshot for later use\n", "\n", "    r.set(getRedisIdByScreenName(SCREEN_NAME, 'follower_freqs'),\n", "          json.dumps(freqs))\n", "\n", "    keys = freqs.keys()\n", "    keys.sort()\n", "\n", "    print 'The top 10 followers from the sample:'\n", "\n", "    # Each row holds a follower's screen name and that follower's own follower count\n", "    field_names = ['Screen Name', 'Followers']\n", "    pt = PrettyTable(field_names=field_names)\n", "    pt.align = 'l'\n", "\n", "    for (user, freq) in reversed([(user['screen_name'], k) for k in keys[-10:]\n", "                                  for user in freqs[k]]):\n", "        pt.add_row([user, pp(freq)])\n", "\n", "    print pt\n", "\n", "    all_freqs = [k for k in keys for user in freqs[k]]\n", "    avg = reduce(lambda x, y: x + y, all_freqs) / len(all_freqs)\n", "\n", "    print \"\\nThe average number of followers for %s's followers: %s\" \\\n", "        % (SCREEN_NAME, pp(avg))\n", "\n", "calculate()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Example 4-10. 
Exporting friend/follower data from Redis to NetworkX for easy graph analytics (friends_followers__redis_to_networkx.py)" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# Summary: Build up a graph where an edge exists between two users\n", "# if one of them is following the other. The graph is undirected,\n", "# which is what the clique analysis in Example 4-11 operates on\n", "\n", "import os\n", "import sys\n", "import json\n", "import networkx as nx\n", "import redis\n", "\n", "from twitter__util import getRedisIdByScreenName\n", "from twitter__util import getRedisIdByUserId\n", "\n", "SCREEN_NAME = 'timoreilly' # XXX: IPython Notebook cannot prompt for input\n", "\n", "g = nx.Graph()\n", "r = redis.Redis()\n", "\n", "# Compute all ids for nodes appearing in the graph\n", "\n", "friend_ids = list(r.smembers(getRedisIdByScreenName(SCREEN_NAME, 'friend_ids')))\n", "id_for_screen_name = json.loads(r.get(getRedisIdByScreenName(SCREEN_NAME,\n", "                                'info.json')))['id']\n", "ids = [id_for_screen_name] + friend_ids\n", "\n", "for current_id in ids:\n", "    print >> sys.stderr, 'Processing user with id', current_id\n", "\n", "    try:\n", "        current_info = json.loads(r.get(getRedisIdByUserId(current_id, 'info.json')))\n", "        current_screen_name = current_info['screen_name']\n", "        friend_ids = list(r.smembers(getRedisIdByScreenName(current_screen_name,\n", "                          'friend_ids')))\n", "\n", "        # filter out ids for this person if they aren't also SCREEN_NAME's friends,\n", "        # which is the basis of the query\n", "\n", "        friend_ids = [fid for fid in friend_ids if fid in ids]\n", "    except Exception, e:\n", "        print >> sys.stderr, 'Skipping', current_id\n", "        continue # Don't reuse stale data from the previous iteration\n", "\n", "    for friend_id in friend_ids:\n", "        try:\n", "            friend_info = json.loads(r.get(getRedisIdByUserId(friend_id,\n", "                                     'info.json')))\n", "        except TypeError, e:\n", "            print >> sys.stderr, '\\tSkipping', friend_id, 'for', current_screen_name\n", "            continue\n", "\n", "        g.add_edge(current_screen_name, friend_info['screen_name'])\n", "\n", "# Pickle the graph to disk...\n", "\n", "if not 
os.path.isdir('out'):\n", "    os.mkdir('out')\n", "\n", "filename = os.path.join('out', SCREEN_NAME + '.gpickle')\n", "nx.write_gpickle(g, filename)\n", "\n", "print 'Pickle file stored in: %s' % filename\n", "\n", "# You can un-pickle like so...\n", "\n", "# g = nx.read_gpickle(os.path.join('out', SCREEN_NAME + '.gpickle'))" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Example 4-11. Using NetworkX to find cliques in graphs (friends_followers__clique_analysis.py)" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import sys\n", "import json\n", "import networkx as nx\n", "\n", "G = 'out/timoreilly.gpickle' # IPython Notebook cannot prompt for input\n", "\n", "g = nx.read_gpickle(G)\n", "\n", "# Finding cliques is a hard problem, so this could\n", "# take a while for large graphs.\n", "# See http://en.wikipedia.org/wiki/NP-complete and\n", "# http://en.wikipedia.org/wiki/Clique_problem\n", "\n", "cliques = [c for c in nx.find_cliques(g)]\n", "\n", "num_cliques = len(cliques)\n", "\n", "clique_sizes = [len(c) for c in cliques]\n", "max_clique_size = max(clique_sizes)\n", "avg_clique_size = sum(clique_sizes) / num_cliques\n", "\n", "max_cliques = [c for c in cliques if len(c) == max_clique_size]\n", "\n", "num_max_cliques = len(max_cliques)\n", "\n", "max_clique_sets = [set(c) for c in max_cliques]\n", "people_in_every_max_clique = list(reduce(lambda x, y: x.intersection(y),\n", "                                  max_clique_sets))\n", "\n", "print 'Num cliques:', num_cliques\n", "print 'Avg clique size:', avg_clique_size\n", "print 'Max clique size:', max_clique_size\n", "print 'Num max cliques:', num_max_cliques\n", "print\n", "print 'People in all max cliques:'\n", "print json.dumps(people_in_every_max_clique, indent=4)\n", "print\n", "print 'Max cliques:'\n", "print json.dumps(max_cliques, indent=4)" ], "language": "python", "metadata": {}, "outputs": [] } ], "metadata": {} } ] }