{ "metadata": { "name": "" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "#Mining the Social Web, 2nd Edition\n", "\n", "##Chapter 9: Twitter Cookbook\n", "\n", "This IPython Notebook provides an interactive way to follow along with and explore the numbered examples from [_Mining the Social Web (2nd Edition)_](http://bit.ly/135dHfs). The intent behind this notebook is to reinforce the concepts from the sample code in a fun, convenient, and effective way. This notebook assumes that you are reading along with the book and have the context of the discussion as you work through these exercises.\n", "\n", "In the somewhat unlikely event that you've somehow stumbled across this notebook outside of its context on GitHub, [you can find the full source code repository here](http://bit.ly/16kGNyb).\n", "\n", "## Copyright and Licensing\n", "\n", "You are free to use or adapt this notebook for any purpose you'd like. However, please respect the [Simplified BSD License](https://github.com/ptwobrussell/Mining-the-Social-Web-2nd-Edition/blob/master/LICENSE.txt) that governs its use.\n", "\n", "## Notes\n", "\n", "This notebook is still a work in progress and currently features 25 recipes. The example titles should be fairly self-explanatory, and the code is designed to be reused as you progress further in the notebook --- meaning that you should follow along and execute each cell along the way since later cells may depend on functions being defined from earlier cells. Consider this notebook draft material at this point." ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Example 1. Accessing Twitter's API for development purposes" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import twitter\n", "\n", "def oauth_login():\n", " # XXX: Go to http://twitter.com/apps/new to create an app and get values\n", " # for these credentials that you'll need to provide in place of these\n", " # empty string values that are defined as placeholders.\n", " # See https://dev.twitter.com/docs/auth/oauth for more information \n", " # on Twitter's OAuth implementation.\n", " \n", " CONSUMER_KEY = ''\n", " CONSUMER_SECRET = ''\n", " OAUTH_TOKEN = ''\n", " OAUTH_TOKEN_SECRET = ''\n", " \n", " auth = twitter.oauth.OAuth(OAUTH_TOKEN, OAUTH_TOKEN_SECRET,\n", " CONSUMER_KEY, CONSUMER_SECRET)\n", " \n", " twitter_api = twitter.Twitter(auth=auth)\n", " return twitter_api\n", "\n", "# Sample usage\n", "twitter_api = oauth_login() \n", "\n", "# Nothing to see by displaying twitter_api except that it's now a\n", "# defined variable\n", "\n", "print twitter_api" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Example 2. 
Doing the OAuth dance to access Twitter's API for production purposes" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import json\n", "from flask import Flask, request\n", "import multiprocessing\n", "from threading import Timer\n", "from IPython.display import IFrame\n", "from IPython.display import display\n", "from IPython.display import Javascript as JS\n", "\n", "import twitter\n", "from twitter.oauth_dance import parse_oauth_tokens\n", "from twitter.oauth import read_token_file, write_token_file\n", "\n", "# Note: This code is exactly the flow presented in the _AppendixB notebook\n", "\n", "OAUTH_FILE = \"resources/ch09-twittercookbook/twitter_oauth\"\n", "\n", "# XXX: Go to http://twitter.com/apps/new to create an app and get values\n", "# for these credentials that you'll need to provide in place of these\n", "# empty string values that are defined as placeholders.\n", "# See https://dev.twitter.com/docs/auth/oauth for more information \n", "# on Twitter's OAuth implementation, and ensure that *oauth_callback*\n", "# is defined in your application settings as shown next if you are \n", "# using Flask in this IPython Notebook.\n", "\n", "# Define a few variables that will bleed into the lexical scope of a couple of \n", "# functions that follow\n", "CONSUMER_KEY = ''\n", "CONSUMER_SECRET = ''\n", "oauth_callback = 'http://127.0.0.1:5000/oauth_helper'\n", " \n", "# Set up a callback handler for when Twitter redirects back to us after the user \n", "# authorizes the app\n", "\n", "webserver = Flask(\"TwitterOAuth\")\n", "@webserver.route(\"/oauth_helper\")\n", "def oauth_helper():\n", " \n", " oauth_verifier = request.args.get('oauth_verifier')\n", "\n", " # Pick back up credentials from ipynb_oauth_dance\n", " oauth_token, oauth_token_secret = read_token_file(OAUTH_FILE)\n", " \n", " _twitter = twitter.Twitter(\n", " auth=twitter.OAuth(\n", " oauth_token, oauth_token_secret, CONSUMER_KEY, CONSUMER_SECRET),\n", " format='', api_version=None)\n", "\n", " oauth_token, oauth_token_secret = parse_oauth_tokens(\n", " _twitter.oauth.access_token(oauth_verifier=oauth_verifier))\n", "\n", " # This web server only needs to service one request, so shut it down\n", " shutdown_after_request = request.environ.get('werkzeug.server.shutdown')\n", " shutdown_after_request()\n", "\n", " # Write out the final credentials that can be picked up after the following\n", " # blocking call to webserver.run().\n", " write_token_file(OAUTH_FILE, oauth_token, oauth_token_secret)\n", " return \"%s %s written to %s\" % (oauth_token, oauth_token_secret, OAUTH_FILE)\n", "\n", "# To handle Twitter's OAuth 1.0a implementation, we'll just need to implement a \n", "# custom \"oauth dance\" and will closely follow the pattern defined in \n", "# twitter.oauth_dance.\n", "\n", "def ipynb_oauth_dance():\n", " \n", " _twitter = twitter.Twitter(\n", " auth=twitter.OAuth('', '', CONSUMER_KEY, CONSUMER_SECRET),\n", " format='', api_version=None)\n", "\n", " oauth_token, oauth_token_secret = parse_oauth_tokens(\n", " _twitter.oauth.request_token(oauth_callback=oauth_callback))\n", "\n", " # Need to write these interim values out to a file to pick up on the callback \n", " # from Twitter that is handled by the web server in /oauth_helper\n", " write_token_file(OAUTH_FILE, oauth_token, oauth_token_secret)\n", " \n", " oauth_url = ('http://api.twitter.com/oauth/authorize?oauth_token=' + oauth_token)\n", " \n", " # Tap the browser's native capabilities to access the web server through a new \n", " # window to get user 
authorization\n", " display(JS(\"window.open('%s')\" % oauth_url))\n", "\n", "# After the webserver.run() blocking call, start the OAuth Dance that will\n", "# ultimately cause Twitter to redirect a request back to it. Once that request\n", "# is serviced, the web server will shut down and program flow will resume\n", "# with the OAUTH_FILE containing the necessary credentials.\n", "Timer(1, lambda: ipynb_oauth_dance()).start()\n", "\n", "webserver.run(host='0.0.0.0')\n", "\n", "# The values that are read from this file are written out at\n", "# the end of /oauth_helper\n", "oauth_token, oauth_token_secret = read_token_file(OAUTH_FILE)\n", "\n", "# These four credentials are what is needed to authorize the application\n", "auth = twitter.oauth.OAuth(oauth_token, oauth_token_secret,\n", " CONSUMER_KEY, CONSUMER_SECRET)\n", " \n", "twitter_api = twitter.Twitter(auth=auth)\n", "\n", "print twitter_api" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Example 3. Discovering the trending topics" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import json\n", "import twitter\n", "\n", "def twitter_trends(twitter_api, woe_id):\n", " # Prefix ID with the underscore for query string parameterization.\n", " # Without the underscore, the twitter package appends the ID value\n", " # to the URL itself as a special-case keyword argument.\n", " return twitter_api.trends.place(_id=woe_id)\n", "\n", "# Sample usage\n", "\n", "twitter_api = oauth_login()\n", "\n", "# See https://dev.twitter.com/docs/api/1.1/get/trends/place and\n", "# http://developer.yahoo.com/geo/geoplanet/ for details on\n", "# Yahoo! Where On Earth ID\n", "\n", "WORLD_WOE_ID = 1\n", "world_trends = twitter_trends(twitter_api, WORLD_WOE_ID)\n", "print json.dumps(world_trends, indent=1)\n", "\n", "US_WOE_ID = 23424977\n", "us_trends = twitter_trends(twitter_api, US_WOE_ID)\n", "print json.dumps(us_trends, indent=1)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Example 4. Searching for tweets" ] }, { "cell_type": "code", "collapsed": false, "input": [ "def twitter_search(twitter_api, q, max_results=200, **kw):\n", "\n", " # See https://dev.twitter.com/docs/api/1.1/get/search/tweets and \n", " # https://dev.twitter.com/docs/using-search for details on advanced \n", " # search criteria that may be useful for keyword arguments\n", " \n", " # See https://dev.twitter.com/docs/api/1.1/get/search/tweets \n", " search_results = twitter_api.search.tweets(q=q, count=100, **kw)\n", " \n", " statuses = search_results['statuses']\n", " \n", " # Iterate through batches of results by following the cursor until we\n", " # reach the desired number of results, keeping in mind that OAuth users\n", " # can \"only\" make 180 search queries per 15-minute interval. See\n", " # https://dev.twitter.com/docs/rate-limiting/1.1/limits\n", " # for details. 
A reasonable number of results is ~1000, although\n",
"    # that number of results may not exist for all queries.\n",
"    \n",
"    # Enforce a reasonable limit\n",
"    max_results = min(1000, max_results)\n",
"    \n",
"    for _ in range(10): # 10*100 = 1000\n",
"        try:\n",
"            next_results = search_results['search_metadata']['next_results']\n",
"        except KeyError, e: # No more results when next_results doesn't exist\n",
"            break\n",
"        \n",
"        # Create a dictionary from next_results, which has the following form:\n",
"        # ?max_id=313519052523986943&q=NCAA&include_entities=1\n",
"        kwargs = dict([ kv.split('=') \n",
"                        for kv in next_results[1:].split(\"&\") ])\n",
"        \n",
"        search_results = twitter_api.search.tweets(**kwargs)\n",
"        statuses += search_results['statuses']\n",
"        \n",
"        if len(statuses) > max_results: \n",
"            break\n",
"        \n",
"    return statuses\n",
"\n",
"# Sample usage\n",
"\n",
"twitter_api = oauth_login()\n",
"\n",
"q = \"CrossFit\"\n",
"results = twitter_search(twitter_api, q, max_results=10)\n",
" \n",
"# Show one sample search result by slicing the list...\n",
"print json.dumps(results[0], indent=1)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Example 5. Constructing convenient function calls" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from functools import partial\n",
"\n",
"pp = partial(json.dumps, indent=1)\n",
"\n",
"twitter_world_trends = partial(twitter_trends, twitter_api, WORLD_WOE_ID)\n",
"\n",
"print pp(twitter_world_trends())\n",
"\n",
"authenticated_twitter_search = partial(twitter_search, twitter_api)\n",
"results = authenticated_twitter_search(\"iPhone\")\n",
"print pp(results)\n",
"\n",
"authenticated_iphone_twitter_search = partial(authenticated_twitter_search, \"iPhone\")\n",
"results = authenticated_iphone_twitter_search()\n",
"print pp(results)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Example 6. Saving and restoring JSON data with flat-text files" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import io, json\n",
"\n",
"def save_json(filename, data):\n",
"    with io.open('resources/ch09-twittercookbook/{0}.json'.format(filename), \n",
"                 'w', encoding='utf-8') as f:\n",
"        f.write(unicode(json.dumps(data, ensure_ascii=False)))\n",
"\n",
"def load_json(filename):\n",
"    with io.open('resources/ch09-twittercookbook/{0}.json'.format(filename), \n",
"                 encoding='utf-8') as f:\n",
"        # Parse the JSON text back into Python data structures so that the\n",
"        # round-trip actually restores what save_json wrote out\n",
"        return json.loads(f.read())\n",
"\n",
"# Sample usage\n",
"\n",
"q = 'CrossFit'\n",
"\n",
"twitter_api = oauth_login()\n",
"results = twitter_search(twitter_api, q, max_results=10)\n",
"\n",
"save_json(q, results)\n",
"results = load_json(q)\n",
"\n",
"print json.dumps(results, indent=1)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Example 7. 
Saving and accessing JSON data with MongoDB" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import json\n",
"import pymongo # pip install pymongo\n",
"\n",
"def save_to_mongo(data, mongo_db, mongo_db_coll, **mongo_conn_kw):\n",
"    \n",
"    # Connects to the MongoDB server running on \n",
"    # localhost:27017 by default\n",
"    \n",
"    client = pymongo.MongoClient(**mongo_conn_kw)\n",
"    \n",
"    # Get a reference to a particular database\n",
"    \n",
"    db = client[mongo_db]\n",
"    \n",
"    # Reference a particular collection in the database\n",
"    \n",
"    coll = db[mongo_db_coll]\n",
"    \n",
"    # Perform a bulk insert and return the IDs\n",
"    \n",
"    return coll.insert(data)\n",
"\n",
"def load_from_mongo(mongo_db, mongo_db_coll, return_cursor=False,\n",
"                    criteria=None, projection=None, **mongo_conn_kw):\n",
"    \n",
"    # Optionally, use criteria and projection to limit the data that is \n",
"    # returned as documented in \n",
"    # http://docs.mongodb.org/manual/reference/method/db.collection.find/\n",
"    \n",
"    # Consider leveraging MongoDB's aggregations framework for more \n",
"    # sophisticated queries.\n",
"    \n",
"    client = pymongo.MongoClient(**mongo_conn_kw)\n",
"    db = client[mongo_db]\n",
"    coll = db[mongo_db_coll]\n",
"    \n",
"    if criteria is None:\n",
"        criteria = {}\n",
"    \n",
"    if projection is None:\n",
"        cursor = coll.find(criteria)\n",
"    else:\n",
"        cursor = coll.find(criteria, projection)\n",
"\n",
"    # Returning a cursor is recommended for large amounts of data\n",
"    \n",
"    if return_cursor:\n",
"        return cursor\n",
"    else:\n",
"        return [ item for item in cursor ]\n",
"\n",
"# Sample usage\n",
"\n",
"q = 'CrossFit'\n",
"\n",
"twitter_api = oauth_login()\n",
"results = twitter_search(twitter_api, q, max_results=10)\n",
"\n",
"save_to_mongo(results, 'search_results', q)\n",
"\n",
"load_from_mongo('search_results', q)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Example 8. Sampling the Twitter firehose with the Streaming API" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# Finding topics of interest by using the filtering capabilities it offers.\n",
"\n",
"import sys\n",
"import twitter\n",
"\n",
"# Query terms\n",
"\n",
"q = 'CrossFit' # Comma-separated list of terms\n",
"\n",
"print >> sys.stderr, 'Filtering the public timeline for track=\"%s\"' % (q,)\n",
"\n",
"# Returns an instance of twitter.Twitter\n",
"twitter_api = oauth_login()\n",
"\n",
"# Reference the self.auth parameter\n",
"twitter_stream = twitter.TwitterStream(auth=twitter_api.auth)\n",
"\n",
"# See https://dev.twitter.com/docs/streaming-apis\n",
"stream = twitter_stream.statuses.filter(track=q)\n",
"\n",
"# For illustrative purposes, when all else fails, search for Justin Bieber\n",
"# and something is sure to turn up (at least, on Twitter)\n",
"\n",
"for tweet in stream:\n",
"    print tweet['text']\n",
"    # Save to a database in a particular collection" ], "language": "python", "metadata": {}, "outputs": [] }
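, { "cell_type": "markdown", "metadata": {}, "source": [ "The next cell is an optional sketch rather than one of the numbered recipes: it follows up on the \"save to a database\" comment above by buffering a small batch of tweets from the stream and persisting them with `save_to_mongo` from Example 7. It assumes the earlier cells in this notebook have been executed and that a MongoDB server is running on localhost:27017; the collection name is just an illustrative choice." ] }, { "cell_type": "code", "collapsed": false, "input": [ "# Optional sketch: persist a small batch of streamed tweets with save_to_mongo\n",
"# from Example 7. Assumes a MongoDB server on localhost:27017 and that the\n",
"# cells above (oauth_login, save_to_mongo) have already been executed.\n",
"\n",
"import sys\n",
"import twitter\n",
"\n",
"q = 'CrossFit'\n",
"\n",
"twitter_api = oauth_login()\n",
"twitter_stream = twitter.TwitterStream(auth=twitter_api.auth)\n",
"stream = twitter_stream.statuses.filter(track=q)\n",
"\n",
"buffered_tweets = []\n",
"\n",
"for tweet in stream:\n",
"    # Streams can also deliver non-status messages, so only keep real tweets\n",
"    if not tweet or 'text' not in tweet:\n",
"        continue\n",
"    buffered_tweets.append(tweet)\n",
"    print >> sys.stderr, 'Buffered %i tweets' % (len(buffered_tweets),)\n",
"    if len(buffered_tweets) >= 10: # Keep the sample small for illustration\n",
"        break\n",
"\n",
"save_to_mongo(buffered_tweets, 'streaming_results', q)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Example 9. 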
Collecting time-series data" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import sys\n",
"import datetime\n",
"import time\n",
"import twitter\n",
"\n",
"def get_time_series_data(api_func, mongo_db_name, mongo_db_coll, \n",
"                         secs_per_interval=60, max_intervals=15, **mongo_conn_kw):\n",
"    \n",
"    # Default settings of 15 intervals and 1 API call per interval ensure that \n",
"    # you will not exceed the Twitter rate limit.\n",
"    \n",
"    interval = 0\n",
"    \n",
"    while True:\n",
"        \n",
"        # A timestamp of the form \"2013-06-14 12:52:07\"\n",
"        now = str(datetime.datetime.now()).split(\".\")[0]\n",
"        \n",
"        ids = save_to_mongo(api_func(), mongo_db_name, mongo_db_coll + \"-\" + now)\n",
"        \n",
"        print >> sys.stderr, \"Wrote {0} trends\".format(len(ids))\n",
"        print >> sys.stderr, \"Zzz...\"\n",
"        sys.stderr.flush()\n",
"        \n",
"        time.sleep(secs_per_interval) # seconds\n",
"        interval += 1\n",
"        \n",
"        # Stop once the maximum number of intervals has been collected\n",
"        if interval >= max_intervals:\n",
"            break\n",
"        \n",
"# Sample usage\n",
"\n",
"get_time_series_data(twitter_world_trends, 'time-series', 'twitter_world_trends')" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Example 10. Extracting tweet entities" ] }, { "cell_type": "code", "collapsed": true, "input": [ "def extract_tweet_entities(statuses):\n",
"    \n",
"    # See https://dev.twitter.com/docs/tweet-entities for more details on tweet\n",
"    # entities\n",
"\n",
"    if len(statuses) == 0:\n",
"        return [], [], [], [], []\n",
"    \n",
"    screen_names = [ user_mention['screen_name'] \n",
"                     for status in statuses\n",
"                     for user_mention in status['entities']['user_mentions'] ]\n",
"    \n",
"    hashtags = [ hashtag['text'] \n",
"                 for status in statuses \n",
"                 for hashtag in status['entities']['hashtags'] ]\n",
"\n",
"    urls = [ url['expanded_url'] \n",
"             for status in statuses \n",
"             for url in status['entities']['urls'] ]\n",
"    \n",
"    symbols = [ symbol['text']\n",
"                for status in statuses\n",
"                for symbol in status['entities']['symbols'] ]\n",
"    \n",
"    # In some circumstances (such as search results), the media entity\n",
"    # may not appear, so check each status for it rather than assuming\n",
"    # that it is always present\n",
"    media = [ media['url'] \n",
"              for status in statuses \n",
"              if status['entities'].has_key('media')\n",
"              for media in status['entities']['media'] ]\n",
"\n",
"    return screen_names, hashtags, urls, media, symbols\n",
"\n",
"# Sample usage\n",
"\n",
"q = 'CrossFit'\n",
"\n",
"statuses = twitter_search(twitter_api, q)\n",
"\n",
"screen_names, hashtags, urls, media, symbols = extract_tweet_entities(statuses)\n",
" \n",
"# Explore the first five items for each...\n",
"\n",
"print json.dumps(screen_names[0:5], indent=1) \n",
"print json.dumps(hashtags[0:5], indent=1)\n",
"print json.dumps(urls[0:5], indent=1)\n",
"print json.dumps(media[0:5], indent=1)\n",
"print json.dumps(symbols[0:5], indent=1)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Example 11. 
Finding the most popular tweets in a collection of tweets" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import twitter\n", "\n", "def find_popular_tweets(twitter_api, statuses, retweet_threshold=3):\n", "\n", " # You could also consider using the favorite_count parameter as part of \n", " # this heuristic, possibly using it to provide an additional boost to \n", " # popular tweets in a ranked formulation\n", " \n", " return [ status\n", " for status in statuses \n", " if status['retweet_count'] > retweet_threshold ] \n", " \n", "# Sample usage\n", "\n", "q = \"CrossFit\"\n", "\n", "twitter_api = oauth_login()\n", "search_results = twitter_search(twitter_api, q, max_results=200)\n", "\n", "popular_tweets = find_popular_tweets(twitter_api, search_results)\n", "\n", "for tweet in popular_tweets:\n", " print tweet['text'], tweet['retweet_count']" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Example 12. Finding the most popular tweet entities in a collection of tweets" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import twitter\n", "from collections import Counter\n", "\n", "def get_common_tweet_entities(statuses, entity_threshold=3):\n", "\n", " # Create a flat list of all tweet entities\n", " tweet_entities = [ e\n", " for status in statuses\n", " for entity_type in extract_tweet_entities([status]) \n", " for e in entity_type \n", " ]\n", "\n", " c = Counter(tweet_entities).most_common()\n", "\n", " # Compute frequencies\n", " return [ (k,v) \n", " for (k,v) in c\n", " if v >= entity_threshold\n", " ]\n", "\n", "# Sample usage\n", "\n", "q = 'CrossFit'\n", "\n", "twitter_api = oauth_login()\n", "search_results = twitter_search(twitter_api, q, max_results=100)\n", "common_entities = get_common_tweet_entities(search_results)\n", "\n", "print \"Most common tweet entities\"\n", "print common_entities" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Example 13. Tabulating frequency analysis" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from prettytable import PrettyTable\n", "\n", "# Get some frequency data\n", "\n", "twitter_api = oauth_login()\n", "search_results = twitter_search(twitter_api, q, max_results=100)\n", "common_entities = get_common_tweet_entities(search_results)\n", "\n", "# Use PrettyTable to create a nice tabular display\n", "\n", "pt = PrettyTable(field_names=['Entity', 'Count']) \n", "[ pt.add_row(kv) for kv in common_entities ]\n", "pt.align['Entity'], pt.align['Count'] = 'l', 'r' # Set column alignment\n", "print pt" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Example 14. 
Finding users who have retweeted a status" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import twitter\n", "\n", "twitter_api = oauth_login()\n", "\n", "print \"\"\"User IDs for retweeters of a tweet by @fperez_org\n", "that was retweeted by @SocialWebMining and that @jyeee then retweeted\n", "from @SocialWebMining's timeline\\n\"\"\"\n", "print twitter_api.statuses.retweeters.ids(_id=334188056905129984)['ids']\n", "print json.dumps(twitter_api.statuses.show(_id=334188056905129984), indent=1)\n", "print\n", "\n", "print \"@SocialWeb's retweet of @fperez_org's tweet\\n\"\n", "print twitter_api.statuses.retweeters.ids(_id=345723917798866944)['ids']\n", "print json.dumps(twitter_api.statuses.show(_id=345723917798866944), indent=1)\n", "print\n", "\n", "print \"@jyeee's retweet of @fperez_org's tweet\\n\"\n", "print twitter_api.statuses.retweeters.ids(_id=338835939172417537)['ids']\n", "print json.dumps(twitter_api.statuses.show(_id=338835939172417537), indent=1)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Example 15. Extracting a retweet's attribution" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import re\n", "\n", "def get_rt_attributions(tweet):\n", "\n", " # Regex adapted from Stack Overflow (http://bit.ly/1821y0J)\n", "\n", " rt_patterns = re.compile(r\"(RT|via)((?:\\b\\W*@\\w+)+)\", re.IGNORECASE)\n", " rt_attributions = []\n", "\n", " # Inspect the tweet to see if it was produced with /statuses/retweet/:id.\n", " # See https://dev.twitter.com/docs/api/1.1/get/statuses/retweets/%3Aid.\n", " \n", " if tweet.has_key('retweeted_status'):\n", " attribution = tweet['retweeted_status']['user']['screen_name'].lower()\n", " rt_attributions.append(attribution)\n", "\n", " # Also, inspect the tweet for the presence of \"legacy\" retweet patterns\n", " # such as \"RT\" and \"via\", which are still widely used for various reasons\n", " # and potentially very useful. See https://dev.twitter.com/discussions/2847 \n", " # and https://dev.twitter.com/discussions/1748 for some details on how/why.\n", "\n", " try:\n", " rt_attributions += [ \n", " mention.strip() \n", " for mention in rt_patterns.findall(tweet['text'])[0][1].split() \n", " ]\n", " except IndexError, e:\n", " pass\n", "\n", " # Filter out any duplicates\n", "\n", " return list(set([rta.strip(\"@\").lower() for rta in rt_attributions]))\n", "\n", "# Sample usage\n", "twitter_api = oauth_login()\n", "\n", "tweet = twitter_api.statuses.show(_id=214746575765913602)\n", "print get_rt_attributions(tweet)\n", "print\n", "tweet = twitter_api.statuses.show(_id=345723917798866944)\n", "print get_rt_attributions(tweet)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Example 16. Making robust Twitter requests" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import sys\n", "import time\n", "from urllib2 import URLError\n", "from httplib import BadStatusLine\n", "import json\n", "import twitter\n", "\n", "def make_twitter_request(twitter_api_func, max_errors=10, *args, **kw): \n", " \n", " # A nested helper function that handles common HTTPErrors. Return an updated\n", " # value for wait_period if the problem is a 500 level error. Block until the\n", " # rate limit is reset if it's a rate limiting issue (429 error). 
Returns None\n", " # for 401 and 404 errors, which requires special handling by the caller.\n", " def handle_twitter_http_error(e, wait_period=2, sleep_when_rate_limited=True):\n", " \n", " if wait_period > 3600: # Seconds\n", " print >> sys.stderr, 'Too many retries. Quitting.'\n", " raise e\n", " \n", " # See https://dev.twitter.com/docs/error-codes-responses for common codes\n", " \n", " if e.e.code == 401:\n", " print >> sys.stderr, 'Encountered 401 Error (Not Authorized)'\n", " return None\n", " elif e.e.code == 404:\n", " print >> sys.stderr, 'Encountered 404 Error (Not Found)'\n", " return None\n", " elif e.e.code == 429: \n", " print >> sys.stderr, 'Encountered 429 Error (Rate Limit Exceeded)'\n", " if sleep_when_rate_limited:\n", " print >> sys.stderr, \"Retrying in 15 minutes...ZzZ...\"\n", " sys.stderr.flush()\n", " time.sleep(60*15 + 5)\n", " print >> sys.stderr, '...ZzZ...Awake now and trying again.'\n", " return 2\n", " else:\n", " raise e # Caller must handle the rate limiting issue\n", " elif e.e.code in (500, 502, 503, 504):\n", " print >> sys.stderr, 'Encountered %i Error. Retrying in %i seconds' % \\\n", " (e.e.code, wait_period)\n", " time.sleep(wait_period)\n", " wait_period *= 1.5\n", " return wait_period\n", " else:\n", " raise e\n", "\n", " # End of nested helper function\n", " \n", " wait_period = 2 \n", " error_count = 0 \n", "\n", " while True:\n", " try:\n", " return twitter_api_func(*args, **kw)\n", " except twitter.api.TwitterHTTPError, e:\n", " error_count = 0 \n", " wait_period = handle_twitter_http_error(e, wait_period)\n", " if wait_period is None:\n", " return\n", " except URLError, e:\n", " error_count += 1\n", " print >> sys.stderr, \"URLError encountered. Continuing.\"\n", " if error_count > max_errors:\n", " print >> sys.stderr, \"Too many consecutive errors...bailing out.\"\n", " raise\n", " except BadStatusLine, e:\n", " error_count += 1\n", " print >> sys.stderr, \"BadStatusLine encountered. Continuing.\"\n", " if error_count > max_errors:\n", " print >> sys.stderr, \"Too many consecutive errors...bailing out.\"\n", " raise\n", "\n", "# Sample usage\n", "\n", "twitter_api = oauth_login()\n", "\n", "# See https://dev.twitter.com/docs/api/1.1/get/users/lookup for \n", "# twitter_api.users.lookup\n", "\n", "response = make_twitter_request(twitter_api.users.lookup, \n", " screen_name=\"SocialWebMining\")\n", "\n", "print json.dumps(response, indent=1)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Example 17. 
Resolving user profile information" ] }, { "cell_type": "code", "collapsed": false, "input": [ "def get_user_profile(twitter_api, screen_names=None, user_ids=None):\n", " \n", " # Must have either screen_name or user_id (logical xor)\n", " assert (screen_names != None) != (user_ids != None), \\\n", " \"Must have screen_names or user_ids, but not both\"\n", " \n", " items_to_info = {}\n", "\n", " items = screen_names or user_ids\n", " \n", " while len(items) > 0:\n", "\n", " # Process 100 items at a time per the API specifications for /users/lookup.\n", " # See https://dev.twitter.com/docs/api/1.1/get/users/lookup for details.\n", " \n", " items_str = ','.join([str(item) for item in items[:100]])\n", " items = items[100:]\n", "\n", " if screen_names:\n", " response = make_twitter_request(twitter_api.users.lookup, \n", " screen_name=items_str)\n", " else: # user_ids\n", " response = make_twitter_request(twitter_api.users.lookup, \n", " user_id=items_str)\n", " \n", " for user_info in response:\n", " if screen_names:\n", " items_to_info[user_info['screen_name']] = user_info\n", " else: # user_ids\n", " items_to_info[user_info['id']] = user_info\n", "\n", " return items_to_info\n", "\n", "# Sample usage\n", "\n", "twitter_api = oauth_login()\n", "\n", "print get_user_profile(twitter_api, screen_names=[\"SocialWebMining\", \"ptwobrussell\"]) \n", "#print get_user_profile(twitter_api, user_ids=[132373965])" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Example 18. Extracting tweet entities from arbitrary text" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import twitter_text\n", "\n", "# Sample usage\n", "\n", "txt = \"RT @SocialWebMining Mining 1M+ Tweets About #Syria http://wp.me/p3QiJd-1I\"\n", "\n", "ex = twitter_text.Extractor(txt)\n", "\n", "print \"Screen Names:\", ex.extract_mentioned_screen_names_with_indices()\n", "print \"URLs:\", ex.extract_urls_with_indices()\n", "print \"Hashtags:\", ex.extract_hashtags_with_indices()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Example 19. 
Getting all friends or followers for a user" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from functools import partial\n", "from sys import maxint\n", "\n", "def get_friends_followers_ids(twitter_api, screen_name=None, user_id=None,\n", " friends_limit=maxint, followers_limit=maxint):\n", " \n", " # Must have either screen_name or user_id (logical xor)\n", " assert (screen_name != None) != (user_id != None), \\\n", " \"Must have screen_name or user_id, but not both\"\n", " \n", " # See https://dev.twitter.com/docs/api/1.1/get/friends/ids and\n", " # https://dev.twitter.com/docs/api/1.1/get/followers/ids for details\n", " # on API parameters\n", " \n", " get_friends_ids = partial(make_twitter_request, twitter_api.friends.ids, \n", " count=5000)\n", " get_followers_ids = partial(make_twitter_request, twitter_api.followers.ids, \n", " count=5000)\n", "\n", " friends_ids, followers_ids = [], []\n", " \n", " for twitter_api_func, limit, ids, label in [\n", " [get_friends_ids, friends_limit, friends_ids, \"friends\"], \n", " [get_followers_ids, followers_limit, followers_ids, \"followers\"]\n", " ]:\n", " \n", " if limit == 0: continue\n", " \n", " cursor = -1\n", " while cursor != 0:\n", " \n", " # Use make_twitter_request via the partially bound callable...\n", " if screen_name: \n", " response = twitter_api_func(screen_name=screen_name, cursor=cursor)\n", " else: # user_id\n", " response = twitter_api_func(user_id=user_id, cursor=cursor)\n", "\n", " if response is not None:\n", " ids += response['ids']\n", " cursor = response['next_cursor']\n", " \n", " print >> sys.stderr, 'Fetched {0} total {1} ids for {2}'.format(len(ids), \n", " label, (user_id or screen_name))\n", " \n", " # XXX: You may want to store data during each iteration to provide an \n", " # an additional layer of protection from exceptional circumstances\n", " \n", " if len(ids) >= limit or response is None:\n", " break\n", "\n", " # Do something useful with the IDs, like store them to disk...\n", " return friends_ids[:friends_limit], followers_ids[:followers_limit]\n", "\n", "# Sample usage\n", "\n", "twitter_api = oauth_login()\n", "\n", "friends_ids, followers_ids = get_friends_followers_ids(twitter_api, \n", " screen_name=\"SocialWebMining\", \n", " friends_limit=10, \n", " followers_limit=10)\n", "\n", "print friends_ids\n", "print followers_ids" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Example 20. 
Analyzing a user's friends and followers" ] }, { "cell_type": "code", "collapsed": false, "input": [ "def setwise_friends_followers_analysis(screen_name, friends_ids, followers_ids):\n", " \n", " friends_ids, followers_ids = set(friends_ids), set(followers_ids)\n", " \n", " print '{0} is following {1}'.format(screen_name, len(friends_ids))\n", "\n", " print '{0} is being followed by {1}'.format(screen_name, len(followers_ids))\n", " \n", " print '{0} of {1} are not following {2} back'.format(\n", " len(friends_ids.difference(followers_ids)), \n", " len(friends_ids), screen_name)\n", " \n", " print '{0} of {1} are not being followed back by {2}'.format(\n", " len(followers_ids.difference(friends_ids)), \n", " len(followers_ids), screen_name)\n", " \n", " print '{0} has {1} mutual friends'.format(\n", " screen_name, len(friends_ids.intersection(followers_ids)))\n", "\n", "# Sample usage\n", "\n", "screen_name = \"ptwobrussell\"\n", "\n", "twitter_api = oauth_login()\n", "\n", "friends_ids, followers_ids = get_friends_followers_ids(twitter_api, \n", " screen_name=screen_name)\n", "setwise_friends_followers_analysis(screen_name, friends_ids, followers_ids)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Example 21. Harvesting a user's tweets" ] }, { "cell_type": "code", "collapsed": false, "input": [ "def harvest_user_timeline(twitter_api, screen_name=None, user_id=None, max_results=1000):\n", " \n", " assert (screen_name != None) != (user_id != None), \\\n", " \"Must have screen_name or user_id, but not both\" \n", " \n", " kw = { # Keyword args for the Twitter API call\n", " 'count': 200,\n", " 'trim_user': 'true',\n", " 'include_rts' : 'true',\n", " 'since_id' : 1\n", " }\n", " \n", " if screen_name:\n", " kw['screen_name'] = screen_name\n", " else:\n", " kw['user_id'] = user_id\n", " \n", " max_pages = 16\n", " results = []\n", " \n", " tweets = make_twitter_request(twitter_api.statuses.user_timeline, **kw)\n", " \n", " if tweets is None: # 401 (Not Authorized) - Need to bail out on loop entry\n", " tweets = []\n", " \n", " results += tweets\n", " \n", " print >> sys.stderr, 'Fetched %i tweets' % len(tweets)\n", " \n", " page_num = 1\n", " \n", " # Many Twitter accounts have fewer than 200 tweets so you don't want to enter\n", " # the loop and waste a precious request if max_results = 200.\n", " \n", " # Note: Analogous optimizations could be applied inside the loop to try and \n", " # save requests. e.g. Don't make a third request if you have 287 tweets out of \n", " # a possible 400 tweets after your second request. Twitter does do some \n", " # post-filtering on censored and deleted tweets out of batches of 'count', though,\n", " # so you can't strictly check for the number of results being 200. You might get\n", " # back 198, for example, and still have many more tweets to go. 
If you have the\n",
"    # total number of tweets for an account (by GET /users/lookup/), then you could \n",
"    # simply use this value as a guide.\n",
"    \n",
"    if max_results == kw['count']:\n",
"        page_num = max_pages # Prevent loop entry\n",
"    \n",
"    while page_num < max_pages and len(tweets) > 0 and len(results) < max_results:\n",
"        \n",
"        # Necessary for traversing the timeline in Twitter's v1.1 API:\n",
"        # get the next query's max-id parameter to pass in.\n",
"        # See https://dev.twitter.com/docs/working-with-timelines.\n",
"        kw['max_id'] = min([ tweet['id'] for tweet in tweets]) - 1 \n",
"        \n",
"        tweets = make_twitter_request(twitter_api.statuses.user_timeline, **kw)\n",
"        results += tweets\n",
"\n",
"        print >> sys.stderr, 'Fetched %i tweets' % (len(tweets),)\n",
"        \n",
"        page_num += 1\n",
"    \n",
"    print >> sys.stderr, 'Done fetching tweets'\n",
"\n",
"    return results[:max_results]\n",
"    \n",
"# Sample usage\n",
"\n",
"twitter_api = oauth_login()\n",
"tweets = harvest_user_timeline(twitter_api, screen_name=\"SocialWebMining\", \\\n",
"                               max_results=200)\n",
"\n",
"# Save to MongoDB with save_to_mongo or a local file with save_json..." ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Example 22. Crawling a friendship graph" ] }, { "cell_type": "code", "collapsed": false, "input": [ "def crawl_followers(twitter_api, screen_name, limit=1000000, depth=2):\n",
"    \n",
"    # Resolve the ID for screen_name and start working with IDs for consistency \n",
"    # in storage\n",
"\n",
"    seed_id = str(twitter_api.users.show(screen_name=screen_name)['id'])\n",
"    \n",
"    _, next_queue = get_friends_followers_ids(twitter_api, user_id=seed_id, \n",
"                                              friends_limit=0, followers_limit=limit)\n",
"\n",
"    # Store a seed_id => follower_ids mapping in MongoDB\n",
"    \n",
"    save_to_mongo({'followers' : [ _id for _id in next_queue ]}, 'followers_crawl', \n",
"                  '{0}-follower_ids'.format(seed_id))\n",
"    \n",
"    d = 1\n",
"    while d < depth:\n",
"        d += 1\n",
"        (queue, next_queue) = (next_queue, [])\n",
"        for fid in queue:\n",
"            # Unpack the (friends, followers) tuple; only the followers are needed here\n",
"            _, follower_ids = get_friends_followers_ids(twitter_api, user_id=fid, \n",
"                                                        friends_limit=0, \n",
"                                                        followers_limit=limit)\n",
"            \n",
"            # Store a fid => follower_ids mapping in MongoDB\n",
"            save_to_mongo({'followers' : [ _id for _id in follower_ids ]}, \n",
"                          'followers_crawl', '{0}-follower_ids'.format(fid))\n",
"            \n",
"            next_queue += follower_ids\n",
"\n",
"# Sample usage\n",
"\n",
"screen_name = \"timoreilly\"\n",
"\n",
"twitter_api = oauth_login()\n",
"\n",
"crawl_followers(twitter_api, screen_name, depth=1, limit=10)" ], "language": "python", "metadata": {}, "outputs": [] }
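, { "cell_type": "markdown", "metadata": {}, "source": [ "The next cell is an optional sketch rather than one of the numbered recipes: it follows up on the comment at the end of Example 21 by saving a harvested timeline to a flat file with `save_json` from Example 6 and reading it back with `load_json`. It assumes the earlier cells in this notebook have been executed; the screen name and filename are just illustrative choices." ] }, { "cell_type": "code", "collapsed": false, "input": [ "# Optional sketch: persist a harvested timeline with save_json/load_json from\n",
"# Example 6. Assumes the cells above (oauth_login, harvest_user_timeline,\n",
"# save_json, load_json) have already been executed in this notebook.\n",
"\n",
"twitter_api = oauth_login()\n",
"\n",
"screen_name = 'SocialWebMining'\n",
"tweets = harvest_user_timeline(twitter_api, screen_name=screen_name, \n",
"                               max_results=200)\n",
"\n",
"# Filenames are relative to resources/ch09-twittercookbook/ (see Example 6)\n",
"save_json('timeline-' + screen_name, tweets)\n",
"restored_tweets = load_json('timeline-' + screen_name)\n",
"\n",
"print 'Saved and restored %i tweets for @%s' % (len(restored_tweets), screen_name)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Example 23. 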
Analyzing tweet content" ] }, { "cell_type": "code", "collapsed": false, "input": [ "def analyze_tweet_content(statuses):\n", " \n", " if len(statuses) == 0:\n", " print \"No statuses to analyze\"\n", " return\n", " \n", " # A nested helper function for computing lexical diversity\n", " def lexical_diversity(tokens):\n", " return 1.0*len(set(tokens))/len(tokens) \n", " \n", " # A nested helper function for computing the average number of words per tweet\n", " def average_words(statuses):\n", " total_words = sum([ len(s.split()) for s in statuses ]) \n", " return 1.0*total_words/len(statuses)\n", "\n", " status_texts = [ status['text'] for status in statuses ]\n", " screen_names, hashtags, urls, media, _ = extract_tweet_entities(statuses)\n", " \n", " # Compute a collection of all words from all tweets\n", " words = [ w \n", " for t in status_texts \n", " for w in t.split() ]\n", " \n", " print \"Lexical diversity (words):\", lexical_diversity(words)\n", " print \"Lexical diversity (screen names):\", lexical_diversity(screen_names)\n", " print \"Lexical diversity (hashtags):\", lexical_diversity(hashtags)\n", " print \"Averge words per tweet:\", average_words(status_texts)\n", "\n", " \n", "# Sample usage\n", "\n", "q = 'CrossFit'\n", "twitter_api = oauth_login()\n", "search_results = twitter_search(twitter_api, q)\n", "\n", "analyze_tweet_content(search_results)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Example 24. Summarizing link targets" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import sys\n", "import json\n", "import nltk\n", "import numpy\n", "import urllib2\n", "from boilerpipe.extract import Extractor\n", "\n", "def summarize(url, n=100, cluster_threshold=5, top_sentences=5):\n", "\n", " # Adapted from \"The Automatic Creation of Literature Abstracts\" by H.P. Luhn\n", " #\n", " # Parameters:\n", " # * n - Number of words to consider\n", " # * cluster_threshold - Distance between words to consider\n", " # * top_sentences - Number of sentences to return for a \"top n\" summary\n", " \n", " # Begin - nested helper function\n", " def score_sentences(sentences, important_words):\n", " scores = []\n", " sentence_idx = -1\n", " \n", " for s in [nltk.tokenize.word_tokenize(s) for s in sentences]:\n", " \n", " sentence_idx += 1\n", " word_idx = []\n", " \n", " # For each word in the word list...\n", " for w in important_words:\n", " try:\n", " # Compute an index for important words in each sentence\n", " \n", " word_idx.append(s.index(w))\n", " except ValueError, e: # w not in this particular sentence\n", " pass\n", " \n", " word_idx.sort()\n", " \n", " # It is possible that some sentences may not contain any important words\n", " if len(word_idx)== 0: continue\n", " \n", " # Using the word index, compute clusters with a max distance threshold\n", " # for any two consecutive words\n", " \n", " clusters = []\n", " cluster = [word_idx[0]]\n", " i = 1\n", " while i < len(word_idx):\n", " if word_idx[i] - word_idx[i - 1] < cluster_threshold:\n", " cluster.append(word_idx[i])\n", " else:\n", " clusters.append(cluster[:])\n", " cluster = [word_idx[i]]\n", " i += 1\n", " clusters.append(cluster)\n", " \n", " # Score each cluster. 
The max score for any given cluster is the score \n", " # for the sentence.\n", " \n", " max_cluster_score = 0\n", " for c in clusters:\n", " significant_words_in_cluster = len(c)\n", " total_words_in_cluster = c[-1] - c[0] + 1\n", " score = 1.0 * significant_words_in_cluster \\\n", " * significant_words_in_cluster / total_words_in_cluster\n", " \n", " if score > max_cluster_score:\n", " max_cluster_score = score\n", " \n", " scores.append((sentence_idx, score))\n", " \n", " return scores \n", " \n", " # End - nested helper function\n", " \n", " extractor = Extractor(extractor='ArticleExtractor', url=url)\n", "\n", " # It's entirely possible that this \"clean page\" will be a big mess. YMMV.\n", " # The good news is that the summarize algorithm inherently accounts for handling\n", " # a lot of this noise.\n", "\n", " txt = extractor.getText()\n", " \n", " sentences = [s for s in nltk.tokenize.sent_tokenize(txt)]\n", " normalized_sentences = [s.lower() for s in sentences]\n", "\n", " words = [w.lower() for sentence in normalized_sentences for w in\n", " nltk.tokenize.word_tokenize(sentence)]\n", "\n", " fdist = nltk.FreqDist(words)\n", "\n", " top_n_words = [w[0] for w in fdist.items() \n", " if w[0] not in nltk.corpus.stopwords.words('english')][:n]\n", "\n", " scored_sentences = score_sentences(normalized_sentences, top_n_words)\n", "\n", " # Summarization Approach 1:\n", " # Filter out nonsignificant sentences by using the average score plus a\n", " # fraction of the std dev as a filter\n", "\n", " avg = numpy.mean([s[1] for s in scored_sentences])\n", " std = numpy.std([s[1] for s in scored_sentences])\n", " mean_scored = [(sent_idx, score) for (sent_idx, score) in scored_sentences\n", " if score > avg + 0.5 * std]\n", "\n", " # Summarization Approach 2:\n", " # Another approach would be to return only the top N ranked sentences\n", "\n", " top_n_scored = sorted(scored_sentences, key=lambda s: s[1])[-top_sentences:]\n", " top_n_scored = sorted(top_n_scored, key=lambda s: s[0])\n", "\n", " # Decorate the post object with summaries\n", "\n", " return dict(top_n_summary=[sentences[idx] for (idx, score) in top_n_scored],\n", " mean_scored_summary=[sentences[idx] for (idx, score) in mean_scored])\n", "\n", "# Sample usage\n", "\n", "sample_url = 'http://radar.oreilly.com/2013/06/phishing-in-facebooks-pond.html'\n", "summary = summarize(sample_url)\n", "\n", "print \"-------------------------------------------------\"\n", "print \" 'Top N Summary'\"\n", "print \"-------------------------------------------------\"\n", "print \" \".join(summary['top_n_summary'])\n", "print\n", "print\n", "print \"-------------------------------------------------\"\n", "print \" 'Mean Scored' Summary\"\n", "print \"-------------------------------------------------\"\n", "print \" \".join(summary['mean_scored_summary'])" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Example 25. 
Analyzing a user's favorite tweets" ] }, { "cell_type": "code", "collapsed": false, "input": [ "\n",
"def analyze_favorites(twitter_api, screen_name, entity_threshold=2):\n",
"    \n",
"    # Could fetch more than 200 by walking the cursor as shown in other\n",
"    # recipes, but 200 is a good sample to work with.\n",
"    favs = twitter_api.favorites.list(screen_name=screen_name, count=200)\n",
"    print \"Number of favorites:\", len(favs)\n",
"    \n",
"    # Figure out what some of the common entities are, if any, in the content\n",
"    \n",
"    common_entities = get_common_tweet_entities(favs, \n",
"                                                entity_threshold=entity_threshold)\n",
"    \n",
"    # Use PrettyTable to create a nice tabular display\n",
"    \n",
"    pt = PrettyTable(field_names=['Entity', 'Count']) \n",
"    [ pt.add_row(kv) for kv in common_entities ]\n",
"    pt.align['Entity'], pt.align['Count'] = 'l', 'r' # Set column alignment\n",
"    \n",
"    print\n",
"    print \"Common entities in favorites...\"\n",
"    print pt\n",
"    \n",
"    \n",
"    # Print out some other stats\n",
"    print\n",
"    print \"Some statistics about the content of the favorites...\"\n",
"    print\n",
"    analyze_tweet_content(favs)\n",
"    \n",
"    # Could also start analyzing link content or summarized link content, and more.\n",
"\n",
"# Sample usage\n",
"\n",
"twitter_api = oauth_login()\n",
"analyze_favorites(twitter_api, \"ptwobrussell\")" ], "language": "python", "metadata": {}, "outputs": [] } ], "metadata": {} } ] }