{ "metadata": { "name": "" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Chapter\u00a01.\u00a0Mining Twitter: Exploring Trending Topics, Discovering What People Are Talking About, and More" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This chapter kicks off our journey of mining the social web with\n", " Twitter, a rich source of social data that is a great starting point for\n", " social web mining because of its inherent openness for public consumption,\n", " clean and well-documented API, rich developer tooling, and broad appeal to\n", " users from every walk of life. Twitter data is particularly interesting\n", " because tweets happen at the \"speed of thought\" and are available for\n", " consumption as they happen in near real time, represent the broadest\n", " cross-section of society at an international level, and are so inherently\n", " multifaceted. Tweets and Twitter's \"following\" mechanism link people in a variety of ways, ranging from short\n", " (but often meaningful) conversational dialogues to interest graphs that\n", " connect people and the things that they care about." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since this is the first chapter, we'll take our time acclimating to\n", " our journey in social web mining. However, given that Twitter data is so\n", " accessible and open to public scrutiny, Chapter 9, Twitter Cookbook\n", " further elaborates on the broad number of data mining possibilities by\n", " providing a terse collection of recipes in a convenient problem/solution\n", " format that can be easily manipulated and readily applied to a wide range of\n", " problems. You'll also be able to apply concepts from future chapters to\n", " Twitter data." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This content is a full-text excerpt from Mining the Social Web (2nd Edition) that has been minimally converted to IPython Notebook format so that you can interactively run the example code as you read the book. 
The purpose of this offering is to determine if there is sufficient interest to offer the remainder of the entire book as a collection of IPython Notebooks (as a standard distribution format that's in addition to PDF, Kindle, etc.) based on feedback from you.
\n", "
" ] }, { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Overview" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this chapter, we'll ease into the process of getting situated\n", " with a minimal (but effective) development environment with Python, survey\n", " Twitter's API, and distill some analytical insights from tweets using\n", " frequency analysis. Topics that you'll learn about in this chapter\n", " include:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Twitter's developer platform and how to make API requests
\n", "Tweet metadata and how to use it
\n", "Extracting entities such as user mentions, hashtags, and URLs\n", " from tweets
\n", "Techniques for performing frequency analysis with Python
\n", "Plotting histograms of Twitter data with IPython Notebook" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note: Always get the latest bug-fixed source code for this chapter (and\n", " every other chapter) online at http://bit.ly/MiningTheSocialWeb2E.\n", " Be sure to also take advantage of this book's virtual machine experience,\n", " as described in Appendix A, Information About This Book's Virtual Machine Experience, to maximize your enjoyment of the\n", " sample code.
\n", "We want to be heard.
\n", "We want to satisfy our curiosity.
\n", "We want it easy.
\n", "We want it now.
\n", "" ] }, { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Exploring Twitter's API" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now having a proper frame of reference for Twitter, let us now\n", " transition our attention to the problem of acquiring and analyzing Twitter\n", " data." ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Fundamental Twitter Terminology" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Twitter might be described as a real-time, highly social microblogging\n", " service that allows users to post short status updates, called tweets, that appear on timelines. Tweets may include one or more\n", " entities in their 140 characters of content and reference one or more\n", " places that map to locations in the real world. An understanding\n", " of users, tweets, and timelines is particularly essential to effective\n", " use of Twitter's API, so a\n", " brief introduction to these fundamental Twitter\n", " Platform objects is in order before we interact with the API to\n", " fetch some data. We've largely discussed Twitter users and Twitter's\n", " asymmetric following model for relationships thus far, so this section\n", " briefly introduces tweets and timelines in order to round out a general\n", " understanding of the Twitter platform." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Tweets are the essence of Twitter, and while they are notionally\n", " thought of as the 140 characters of text content associated with a\n", " user's status update, there's really quite a bit more metadata there than meets the eye. In addition to the\n", " textual content of a tweet itself, tweets come bundled with two\n", " additional pieces of metadata that are of particular note:\n", " entities and places. 
Tweet\n", " entities are essentially the user mentions, hashtags, URLs, and media that may be\n", " associated with a tweet, and places are locations in the real world that\n", " may be attached to a tweet. Note that a place may be the actual location\n", " in which a tweet was authored, but it might also be a reference to the\n", " place described in a tweet." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To make it all a bit more concrete, let's consider a sample tweet\n", " with the following text:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "@ptwobrussell is writing @SocialWebMining, 2nd Ed. from his home\n", " office in Franklin, TN. Be \\#social: http://on.fb.me/16WJAf9" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The tweet is 124 characters long and contains four tweet entities:\n", " the user mentions @ptwobrussell and @SocialWebMining, the hashtag\n", " \\#social, and the URL http://on.fb.me/16WJAf9. Although there is a place called\n", " Franklin, Tennessee that's explicitly mentioned in the tweet, the\n", " places metadata associated with the tweet might\n", " include the location in which the tweet was authored, which may or may\n", " not be Franklin, Tennessee. That's a lot of metadata that's packed into\n", " fewer than 140 characters and illustrates just how potent a short quip\n", " can be: it can unambiguously refer to multiple other Twitter users, link\n", " to web pages, and cross-reference topics with hashtags that act as\n", " points of aggregation and horizontally slice through the entire Twitterverse in an easily searchable\n", " fashion." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Sidebar Discussion:\n", "\n", "A fundamental aspect of human intelligence is the desire to classify things and\n", " derive a hierarchy in which each element “belongs to” or is a “child” of\n", " a parent element one level higher in the hierarchy. Leaving aside some\n", " of the finer distinctions between a\n", " taxonomy and an ontology, think of a\n", " taxonomy as a hierarchical structure like a tree\n", " that classifies elements into particular parent/child relationships,\n", " whereas a folksonomy (a\n", " term coined around 2004) describes the universe of\n", " collaborative tagging and social indexing efforts that emerge in various\n", " ecosystems of the Web. It’s a play on words in the sense that it blends\n", " folk and taxonomy. So, in\n", " essence, a folksonomy is just a fancy way of describing the\n", " decentralized universe of tags that emerges as a mechanism of collective intelligence\n", " when you allow people to classify content with labels. One of the things\n", " that's so compelling about the use of hashtags on Twitter is that the\n", " folksonomies that organically emerge act as points of aggregation for\n", " common interests and provide a focused way to explore while still\n", " leaving open the possibility for nearly unbounded serendipity." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, timelines are chronologically\n", " sorted collections of tweets. Abstractly, you might say that a timeline\n", " is any particular collection of tweets displayed in chronological order;\n", " however, you'll commonly see a couple of timelines that are particularly\n", " noteworthy. From the perspective of an arbitrary Twitter user, the\n", " home timeline is the view that you see when you log into your account and look\n", " at all of the tweets from users that you are following, whereas a particular user timeline is a\n", " collection of tweets only from a certain user." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For example, when you log into your Twitter account, your home timeline is located at\n", " https://twitter.com. 
The URL for any\n", " particular user timeline, however, must be suffixed with a context that\n", " identifies the user, such as https://twitter.com/SocialWebMining.\n", " If you're interested in seeing what a particular user's home timeline\n", " looks like from that user's perspective, you can access it with the\n", " additional following suffix appended to the URL.\n", " For example, what Tim O'Reilly sees on his home timeline when he logs\n", " into Twitter is accessible at https://twitter.com/timoreilly/following." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "An application like TweetDeck provides several customizable views into the\n", " tumultuous landscape of tweets, as shown in Figure 1.1, “TweetDeck provides a highly customizable user interface that\n", " can be helpful for analyzing what is happening on Twitter and\n", " demonstrates the kind of data that you have access to through the\n", " Twitter API”, and is worth trying out if you haven't journeyed\n", " far beyond the Twitter.com user interface." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "
twitter
. Like most other Python packages, you\n",
" can install it with pip
by typing pip install\n",
" twitter
in a terminal."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note:See Appendix C, Python and IPython Notebook Tips & Tricks for instructions on how to install\n", "
pip
.
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Sidebar Discussion:\n", "\n", "\n", "\n", "\n", "\n", " \n", "We’ll work through some examples that illustrate the use of the\n", "
\n", "pydoc
) in a few different ways. Outside of a Python shell, running\n", "pydoc
in your terminal on a package\n", " in your PYTHONPATH
is a\n", " nice option. For example, on a Linux or Mac system, you can simply\n", " type pydoc twitter
in a\n", " terminal to get the package-level documentation, whereas pydoc twitter.Twitter
provides\n", " documentation on the pydoc
as a package. Typing python -m pydoc twitter.Twitter
, for\n", " example, would provide information on the twitter.Twitter
class. If you find yourself\n", " reviewing the documentation for certain modules often, you can elect\n", " to pass the -w
option to\n", "pydoc
and write out an HTML page that you can save and\n", " bookmark in your browser. However, more than likely, you'll be in the middle of a working\n", " session when you need some help. The built-in
\n", "help
\n", " function accepts a package or class name and is useful for an ordinary\n", " Python shell, whereas IPython users can suffix a package\n", " or class name with a question mark to view inline help. For example,\n", " you could type help(twitter)
or\n", "help(twitter.Twitter)
in a\n", " regular Python interpreter, while you could use the shortcut\n", "twitter?
or twitter.Twitter?
in IPython or IPython\n", " Notebook. It is highly recommended that you adopt IPython as your standard\n", " Python shell when working outside of IPython Notebook because of the\n", " various convenience functions, such as tab completion, session\n", " history, and \"magic\n", " functions,\" that it offers. Recall that Appendix A, Information About This Book's Virtual Machine Experience provides minimal details on getting oriented with\n", " recommended developer tools such as IPython.
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before you can make any API requests to Twitter, you'll need to\n", " create an application at https://dev.twitter.com/apps.\n", " Creating an application is the standard way for developers to gain API\n", " access and for Twitter to monitor and interact with third-party platform\n", " developers as needed. The process for creating an application is pretty\n", " standard, and all that's needed is read-only access to the API." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the present context, you are creating an app that you are going to authorize to access\n", " your account data, so this might seem a bit\n", " roundabout; why not just plug in your username and password to access\n", " the API? While that approach might work fine for\n", " you, a third party such as a friend or colleague\n", " probably wouldn't feel comfortable forking over a username/password\n", " combination in order to enjoy the same insights from\n", " your app. Giving up credentials is never a sound\n", " practice. Fortunately, some smart people recognized this problem years ago, and now there's a\n", " standardized protocol called OAuth (short for Open Authorization)\n", " that works for these kinds of situations in a generalized way for the\n", " broader social web. The protocol is a social web standard at this\n", " point." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you remember nothing else from this tangent, just remember that\n", " OAuth is a means of allowing users to authorize third-party applications\n", " to access their account data without needing to share sensitive\n", " information like a password. 
Appendix B, OAuth Primer provides a slightly\n", " broader overview of how OAuth works if you're interested, and Twitter's OAuth documentation offers\n", " specific details about its particular implementation.[1]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For simplicity of development, the key pieces of information that\n", " you'll need to take away from your newly created application's settings\n", " are its consumer key, consumer secret, access token, and access\n", " token secret. In tandem, these four credentials provide everything that\n", " an application would ultimately be getting to authorize itself through a\n", " series of redirects involving the user granting authorization, so treat\n", " them with the same sensitivity that you would a password." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note:We'll opt to make programmatic API requests with Python, because\n", " the
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Figure 1.2, “Create a new Twitter application to get OAuth credentials and\n", " API access at https://dev.twitter.com/apps;\n", " the four (blurred) OAuth fields are what you'll use to make API calls\n", " to Twitter's API” shows the context of\n", " retrieving these credentials." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note:See Appendix B, OAuth Primer for details on implementing an OAuth\n", " 2.0 flow that you would need to build an application that requires an\n", " arbitrary user to authorize it to access account data.
GET trends/place
\n",
" resource. While you're at it, go ahead and bookmark the official API documentation as well\n",
" as the REST API v1.1\n",
" resources, because you'll be referencing them regularly as you\n",
" learn the ropes of the developer-facing side of the Twitterverse."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note: As of March 2013, Twitter's API operates at version 1.1 and is\n", " significantly different in a few areas from the previous v1 API that\n", " you may have encountered. Version 1 of the API passed through a\n", " deprecation cycle of approximately six months and is no longer\n", " operational. All sample code in this book presumes version 1.1 of\n", " the API." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let\u2019s fire up IPython Notebook and initiate a search. Follow along\n", " with Example 1.1, “Authorizing an application to access Twitter account\n", " data” by substituting your own\n", " account credentials into the variables at the beginning of the code\n", " example and executing the call to create an instance of the Twitter API.\n", " The code works by using your OAuth credentials to create an object\n", " called
auth
that represents your\n",
" OAuth authorization, which can then be passed to a class called Twitter
that is capable of issuing queries to\n",
" Twitter's API."
]
},
{
"cell_type": "heading",
"level": 4,
"metadata": {},
"source": [
"Example 1.1. Authorizing an application to access Twitter account data"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import twitter\n",
"\n",
"# XXX: Go to http://dev.twitter.com/apps/new to create an app and get values\n",
"# for these credentials, which you'll need to provide in place of these\n",
"# empty string values that are defined as placeholders.\n",
"# See https://dev.twitter.com/docs/auth/oauth for more information \n",
"# on Twitter's OAuth implementation.\n",
"\n",
"CONSUMER_KEY = ''\n",
"CONSUMER_SECRET = ''\n",
"OAUTH_TOKEN = ''\n",
"OAUTH_TOKEN_SECRET = ''\n",
"\n",
"auth = twitter.oauth.OAuth(OAUTH_TOKEN, OAUTH_TOKEN_SECRET,\n",
" CONSUMER_KEY, CONSUMER_SECRET)\n",
"\n",
"twitter_api = twitter.Twitter(auth=auth)\n",
"\n",
"# Nothing to see by displaying twitter_api except that it's now a\n",
"# defined variable\n",
"\n",
"print twitter_api"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The results of this example should simply display an unambiguous\n",
" representation of the twitter\\_api
\n",
" object that we've constructed, such as:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<twitter.api.Twitter object at 0x39d9b50>
"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This indicates that we've successfully used OAuth credentials to\n",
" gain authorization to query Twitter's API."
]
},
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"Exploring Trending Topics"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"With an authorized API connection in place, you can now issue a\n",
" request. Example 1.2, “Retrieving trends” demonstrates how to\n",
" ask Twitter for the topics that are currently trending worldwide, but\n",
" keep in mind that the API can easily be parameterized to constrain the\n",
" topics to more specific locales if you feel inclined to try out some of\n",
" the possibilities. The mechanism for constraining queries is Yahoo!\n",
" GeoPlanet’s Where On Earth (WOE) ID system, which is an API unto\n",
" itself that aims to provide a way to map a unique identifier to any\n",
" named place on Earth (or theoretically, even in a virtual world). If you\n",
" haven't already, go ahead and try out the example that collects a set of\n",
" trends for both the entire world and just the United States."
]
},
{
"cell_type": "heading",
"level": 4,
"metadata": {},
"source": [
"Example 1.2. Retrieving trends"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# The Yahoo! Where On Earth ID for the entire world is 1.\n",
"# See https://dev.twitter.com/docs/api/1.1/get/trends/place and\n",
"# http://developer.yahoo.com/geo/geoplanet/\n",
"\n",
"WORLD_WOE_ID = 1\n",
"US_WOE_ID = 23424977\n",
"\n",
"# Prefix ID with the underscore for query string parameterization.\n",
"# Without the underscore, the twitter package appends the ID value\n",
"# to the URL itself as a special case keyword argument.\n",
"\n",
"world_trends = twitter_api.trends.place(_id=WORLD_WOE_ID)\n",
"us_trends = twitter_api.trends.place(_id=US_WOE_ID)\n",
"\n",
"print world_trends\n",
"print\n",
"print us_trends"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You should see a semireadable response that is a list of Python\n",
" dictionaries from the API (as opposed to any kind of error message),\n",
" such as the following truncated results, before proceeding further. (In\n",
" just a moment, we'll reformat the response to be more easily\n",
" readable.)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[{u'created\\_at': u'2013-03-27T11:50:40Z', u'trends': [{u'url': u'http://twitter.com/search?q=%23MentionSomeoneImportantForYou'...
"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Notice that the sample result contains a URL for a trend\n",
" represented as a search query that corresponds to the hashtag\n",
" \\#MentionSomeoneImportantForYou, where %23 is the URL encoding for the\n",
" hashtag symbol. We'll use this rather benign hashtag throughout the\n",
" remainder of the chapter as a unifying theme for examples that follow.\n",
" Although a sample data file containing tweets for this hashtag is\n",
" available with the book's source code, you'll have much more fun\n",
" exploring a topic that's trending at the time you read this as opposed\n",
" to following along with a canned topic that is no longer\n",
" trending."
]
},
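{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick check of the %23 encoding described above, the following minimal sketch percent-encodes and decodes the hashtag with Python's standard library. The import fallback is an assumption made only so that the snippet runs on both Python 2 (which this chapter's examples use) and Python 3, and the variable names are illustrative.\n",
"
```python
# Sketch: percent-encoding a hashtag for use in a URL query string.
try:
    from urllib import quote, unquote          # Python 2
except ImportError:
    from urllib.parse import quote, unquote    # Python 3

hashtag = '#MentionSomeoneImportantForYou'

# safe='' asks quote to encode every reserved character, including '#'
encoded = quote(hashtag, safe='')
decoded = unquote(encoded)

print(encoded)
print(decoded)
```
Because the # character otherwise marks the start of a URL fragment, it has to be encoded as %23 to survive inside a query string."
]
},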
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The pattern for using the twitter
module is simple\n",
" and predictable: instantiate the Twitter
class with an object\n",
" chain corresponding to a base URL and then invoke methods on the object\n",
" that correspond to URL contexts. For example,\n",
" twitter\_api.trends.place(\_id=WORLD\_WOE\_ID)
initiates an HTTP\n",
" call to GET\n",
" https://api.twitter.com/1.1/trends/place.json?id=1.\n",
" Note the URL mapping to the object chain that's constructed with the\n",
" twitter
package to make the request\n",
" and how query string parameters are passed in as keyword arguments. To\n",
" use the twitter
package for arbitrary\n",
" API requests, you generally construct the request in that kind of\n",
" straightforward manner, with just a couple of minor caveats that we'll\n",
" encounter soon enough."
]
},
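{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the mapping between the object chain and the request URL concrete, here is a minimal sketch of the idea. The build_url helper and the BASE_URL constant are hypothetical names introduced for illustration; this is not how the twitter package is implemented internally, only a picture of the URL that such a chain corresponds to.\n",
"
```python
# Sketch only: mimics the URL that an attribute chain plus keyword
# arguments maps to. This is NOT the twitter package's implementation.
try:
    from urllib import urlencode               # Python 2
except ImportError:
    from urllib.parse import urlencode         # Python 3

BASE_URL = 'https://api.twitter.com/1.1'       # REST API v1.1 base URL

def build_url(path_parts, **params):
    # Hypothetical helper: join the chain into a path, add .json, and
    # append the keyword arguments as a query string.
    url = BASE_URL + '/' + '/'.join(path_parts) + '.json'
    if params:
        # Sort the items so the output is stable across Python versions
        url += '?' + urlencode(sorted(params.items()))
    return url

# The chain trends -> place with the parameter id=1 corresponds to:
trends_url = build_url(['trends', 'place'], id=1)
print(trends_url)
```
In this sketch the parameter is passed as a plain id; the underscore prefix discussed in Example 1.2 is a detail of the twitter package and is not modeled here."
]
},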
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Twitter imposes rate limits on how many requests an application can make to any given API\n",
" resource within a given time window. Twitter's rate limits are well documented, and\n",
" each individual API resource also states its particular limits for your\n",
" convenience. For example, the API request that we just issued for trends\n",
" limits applications to 15 requests per 15-minute window (see Figure 1.3, “Rate limits for Twitter API resources are identified in the\n",
" online documentation for each API call; the particular API resource\n",
" shown here allows 15 requests per \"rate limit window,\" which is\n",
" currently defined as 15 minutes”). For more nuanced information on\n",
" how Twitter's rate limits work, see REST API Rate\n",
" Limiting in v1.1. For the purposes of following along in this\n",
" chapter, it's highly unlikely that you'll get rate\n",
" limited. “Making Robust Twitter Requests” (Example 9.16, “Making robust Twitter requests”) will introduce some\n",
" techniques demonstrating best practices while working with rate\n",
" limits."
]
},
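{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a back-of-the-envelope illustration of the limit quoted above, the following sketch computes how far apart requests need to be spaced to stay under 15 requests per 15-minute window. The helper names are hypothetical, and the sliding-window count is a conservative client-side approximation assumed for illustration, not a description of how Twitter itself accounts for requests.\n",
"
```python
# Sketch: spacing requests evenly inside a rate limit window.
WINDOW_SECONDS = 15 * 60      # the 15-minute window described above
REQUESTS_PER_WINDOW = 15      # limit quoted above for the trends resource

def min_interval(requests_per_window, window_seconds=WINDOW_SECONDS):
    # Hypothetical helper: seconds to wait between requests so that a
    # steady stream of calls never exceeds the limit.
    return float(window_seconds) / requests_per_window

def remaining_requests(timestamps, now, limit=REQUESTS_PER_WINDOW,
                       window_seconds=WINDOW_SECONDS):
    # Count the requests made during the last window (timestamps are in
    # seconds) and report how many are still available.
    recent = [t for t in timestamps if now - t < window_seconds]
    return max(0, limit - len(recent))

interval = min_interval(REQUESTS_PER_WINDOW)
print(interval)
```
For the trends resource this works out to one request per minute, which is far more often than the examples in this chapter need."
]
},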
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note: The developer documentation states that the results of a Trends\n", " API query are updated only once every five minutes, so it's not a\n", " judicious use of your efforts or API requests to ask for results more\n", " often than that." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Although it hasn't explicitly been stated yet, the semireadable\n", " output from Example 1.2, “Retrieving trends” is printed out as\n", " native Python data structures. While an IPython interpreter will \"pretty\n", " print\" the output for you automatically, IPython Notebook and a standard\n", " Python interpreter will not. If you find yourself in these\n", " circumstances, you may find it handy to use the built-in
json
\n",
" package to force a nicer display, as illustrated in Example 1.3, “Displaying API responses as pretty-printed JSON”."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note: JSON is a data\n", " exchange format that you will encounter on a regular\n", " basis. In a nutshell, JSON provides a way to arbitrarily store maps,\n", " lists, primitives such as numbers and strings, and combinations\n", " thereof. In other words, you can theoretically model just about\n", " anything with JSON should you desire to do so." ] }, { "cell_type": "heading", "level": 4, "metadata": {}, "source": [ "Example 1.3. Displaying API responses as pretty-printed JSON" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import json\n", "\n", "print json.dumps(world_trends, indent=1)\n", "print\n", "print json.dumps(us_trends, indent=1)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "An abbreviated sample response from the Trends API produced with\n", "
json.dumps
would look like the\n",
" following:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[\n", " {\n", " \"created\\_at\": \"2013-03-27T11:50:40Z\", \n", " \"trends\": [\n", " {\n", " \"url\": \"http://twitter.com/search?q=%23MentionSomeoneImportantForYou\", \n", " \"query\": \"%23MentionSomeoneImportantForYou\", \n", " \"name\": \"\\#MentionSomeoneImportantForYou\", \n", " \"promoted\\_content\": null, \n", " \"events\": null\n", " },\n", " ...\n", " ]\n", " }\n", "]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Although it's easy enough to skim the two sets of trends and look\n", " for commonality, let's use Python's
set
data\n",
" structure to automatically compute this for us, because that's exactly\n",
" the kind of thing that sets lend themselves to doing. In this instance,\n",
" a set refers to the mathematical notion of a data\n",
" structure that stores an unordered collection of unique items and can be\n",
" computed upon with other sets of items and setwise operations. For\n",
" example, a setwise intersection computes common items between sets, a\n",
" setwise union combines all of the items from sets, and the setwise\n",
" difference among sets acts sort of like a subtraction operation in which\n",
" items from one set are removed from another."
]
},
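{
"cell_type": "markdown",
"metadata": {},
"source": [
"The three setwise operations just described can be tried out directly on Python sets. The trend names below are made up for illustration and do not come from an actual API response.\n",
"
```python
# Sketch with made-up trend names to illustrate setwise operations.
world = set(['#A', '#B', '#C'])
us = set(['#B', '#C', '#D'])

common = world.intersection(us)     # items present in both sets
combined = world.union(us)          # every item from either set
only_world = world.difference(us)   # world items with the us items removed

# Sets are unordered, so sort them for a predictable display
print(sorted(common))
print(sorted(combined))
print(sorted(only_world))
```
Note that the difference is not symmetric: us.difference(world) would leave the items that appear only in the second set."
]
},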
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Example 1.4, “Computing the intersection of two sets of trends” demonstrates how to use a\n",
" Python list\n",
" comprehension to parse out the names of the trending topics from\n",
" the results that were previously queried, cast those lists to sets, and\n",
" compute the setwise intersection to reveal the common items between\n",
" them. Keep in mind that there may or may not be significant overlap\n",
" between any given sets of trends, all depending on what's actually\n",
" happening when you query for the trends. In other words, the results of\n",
" your analysis will be entirely dependent upon your query and the data\n",
" that is returned from it."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"" ] }, { "cell_type": "heading", "level": 4, "metadata": {}, "source": [ "Example 1.4. Computing the intersection of two sets of trends" ] }, { "cell_type": "code", "collapsed": false, "input": [ "world_trends_set = set([trend['name'] \n", " for trend in world_trends[0]['trends']])\n", "\n", "us_trends_set = set([trend['name'] \n", " for trend in us_trends[0]['trends']]) \n", "\n", "common_trends = world_trends_set.intersection(us_trends_set)\n", "\n", "print common_trends" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note: Recall that Appendix C, Python and IPython Notebook Tips & Tricks provides a reference for\n", " some common Python idioms like list comprehensions that you may find\n", " useful to review.
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note: You should complete Example 1.4, “Computing the intersection of two sets of trends”\n", " before moving on in this chapter to ensure that you are able to access\n", " and analyze Twitter data. Can you explain what, if any, correlation\n", " exists between trends in your country and the rest of the\n", " world?
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Sidebar Discussion:\n", "\n", "Computing setwise operations may seem a rather primitive form of\n", " analysis, but the ramifications of set theory for general mathematics\n", " are considerably more profound since it provides the foundation for\n", " many mathematical principles.
\n", "Georg Cantor is generally credited with formalizing the mathematics\n", " behind set theory, and his paper “On a Characteristic Property of All\n", " Real Algebraic Numbers” (1874) formalized set theory as part of his\n", " work on answering questions related to the concept of infinity. To\n", " understand how it worked, consider the following question: is the set\n", " of positive integers larger in cardinality than the set of both\n", " positive and negative integers?
\n", "Although common intuition may be that there are twice as many\n", " positive and negative integers as positive integers alone, Cantor’s\n", " work showed that the cardinalities of the sets are actually equal!\n", " Mathematically, he showed that you can map both sets of numbers such\n", " that they form a sequence with a definite starting point that extends\n", " forever in one direction like this: {1,\n", " –1, 2, –2, 3, –3, ...}.
\n", "Because the numbers can be clearly enumerated but there is never an ending\n", " point, the sets are said to be\n", " countably infinite. In other words, there is a\n", " definite sequence that could be followed deterministically if you\n", " simply had enough time to count them." ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Searching for Tweets" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One of the common items between the sets of trending topics turns out\n", " to be the hashtag \\#MentionSomeoneImportantForYou, so let's use it as the\n", " basis of a search query to fetch some tweets for further analysis. Example 1.5, “Collecting search results” illustrates how to exercise the
GET search/tweets
resource for a particular query of\n",
" interest, including the ability to use a special field that's included\n",
" in the metadata for the search results to easily make additional\n",
" requests for more search results. Coverage of Twitter's Streaming API resources is out of\n",
" scope for this chapter but is introduced in “Sampling the Twitter Firehose with the Streaming API” (Example 9.8, “Sampling the Twitter firehose with the Streaming API”)\n",
" and may be more appropriate for many situations in which you want to\n",
" maintain a constantly updated view of tweets."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"" ] }, { "cell_type": "heading", "level": 4, "metadata": {}, "source": [ "Example 1.5. Collecting search results" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# Import unquote to prevent url encoding errors in next_results\n", "from urllib import unquote\n", "\n", "# XXX: Set this variable to a trending topic, \n", "# or anything else for that matter. The example query below\n", "# was a trending topic when this content was being developed\n", "# and is used throughout the remainder of this chapter.\n", "\n", "q = '#MentionSomeoneImportantForYou' \n", "\n", "count = 100\n", "\n", "# See https://dev.twitter.com/docs/api/1.1/get/search/tweets\n", "\n", "search_results = twitter_api.search.tweets(q=q, count=count)\n", "\n", "statuses = search_results['statuses']\n", "\n", "\n", "# Iterate through 5 more batches of results by following the cursor\n", "\n", "for _ in range(5):\n", " print \"Length of statuses\", len(statuses)\n", " try:\n", " next_results = search_results['search_metadata']['next_results']\n", " except KeyError, e: # No more results when next_results doesn't exist\n", " break\n", " \n", " # Create a dictionary from next_results, which has the following form:\n", " # ?max_id=313519052523986943&q=NCAA&include_entities=1\n", " kwargs = dict([kv.split('=') for kv in unquote(next_results[1:]).split(\"&\") ]) \n", "\n", " \n", " search_results = twitter_api.search.tweets(**kwargs)\n", " statuses += search_results['statuses']\n", "\n", "# Show one sample search result by slicing the list...\n", "print json.dumps(statuses[0], indent=1)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note:The use of
\\*args and \\*\\*kwargs, as illustrated in Example 1.5, “Collecting search results”, as parameters to a function is a\n", " Python idiom for expressing arbitrary arguments and keyword arguments,\n", " respectively. See Appendix C, Python and IPython Notebook Tips & Tricks for a brief overview of this\n", " idiom." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In essence, all the code does is repeatedly make requests to the Search API.\n", " One thing that might initially catch you off guard if you've worked with\n", " other web APIs (including version 1 of Twitter's API) is that there's no\n", " explicit concept of pagination in the Search API\n", " itself. Reviewing the API documentation reveals that this is an\n", " intentional decision, and there are some good reasons for taking a\n", " cursoring approach instead, given the highly\n", " dynamic state of Twitter resources. The best practices for cursoring\n", " vary a bit throughout the Twitter developer platform, with the Search\n", " API providing a slightly simpler way of navigating search results than\n", " other resources such as timelines." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note: Although we're just passing in a hashtag to the Search API at\n", " this point, it's well worth noting that the Search API supports a number of powerful operators that allow you\n", " to filter queries according to the existence or nonexistence of\n", " various keywords, originator of the tweet, location associated with\n", " the tweet, etc." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Search results contain a special search\\_metadata node that embeds a next\\_results field with a query string that\n",
" provides the basis of a subsequent query. If we weren't using a library\n",
" like twitter to make the HTTP\n",
" requests for us, this preconstructed query string would just be appended\n",
" to the Search API URL, and we'd update it with additional parameters for\n",
" handling OAuth. However, since we are not making our HTTP requests\n",
" directly, we must parse the query string into its constituent key/value\n",
" pairs and provide them as keyword arguments."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In Python parlance, we are unpacking the\n",
" values in a dictionary into keyword arguments that the function\n",
" receives. In other words, the function call inside of the for
loop in Example 1.5, “Collecting search results” ultimately invokes a function such as\n",
" twitter\\_api.search.tweets
(q='%23MentionSomeoneImportantForYou',\n",
" include\\_entities=1, max\\_id=313519
052523986943)
even though it appears in\n",
" the source code as twitter\\_api
.search.tweets(\\*\\*kwargs)
, with kwargs
being a dictionary of key/value\n",
" pairs."
]
},
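{
"cell_type": "markdown",
"metadata": {},
"source": [
"The unpacking described above can be tried without calling Twitter at all. The following is a minimal sketch, written to run on Python 2.7 or Python 3, that assumes a hypothetical helper named query_string_to_kwargs and a stand-in function named fake_search in place of twitter_api.search.tweets; neither name comes from the twitter package. It relies on the standard library's parse_qsl, which also decodes percent-encoded values such as %23, and the sample query string is purely illustrative.

```python
try:
    from urllib.parse import parse_qsl  # Python 3
except ImportError:
    from urlparse import parse_qsl      # Python 2


def query_string_to_kwargs(next_results):
    # Hypothetical helper: drop the leading '?' and decode each
    # percent-encoded key/value pair into a plain dictionary.
    return dict(parse_qsl(next_results.lstrip('?')))


def fake_search(**kwargs):
    # Stand-in for twitter_api.search.tweets; it only echoes what it
    # receives so the unpacking can be observed without a network call.
    return kwargs


# Illustrative value shaped like the next_results field
next_results = '?max_id=313519052523986943&q=%23NCAA&include_entities=1'
kwargs = query_string_to_kwargs(next_results)

# These two calls are equivalent: ** unpacks the dictionary into
# keyword arguments.
unpacked = fake_search(**kwargs)
explicit = fake_search(max_id='313519052523986943', q='#NCAA',
                       include_entities='1')
print(unpacked == explicit)
```

Because the query string is parsed as text, every value arrives as a string, including the numeric-looking max_id."
]
},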
{
"cell_type": "markdown",
"metadata": {},
"source": [
"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The next sample tweet shows the search results for a query for\n", " \\#MentionSomeoneImportantForYou. Take a moment to peruse (all of) it. As\n", " I mentioned earlier, there's a lot more to a tweet than meets the eye.\n", " The particular tweet that follows is fairly representative and contains\n", " in excess of 5 KB of total content when represented in uncompressed\n", " JSON. That's more than 40 times the amount of data that makes up the 140\n", " characters of text that's normally thought of as a tweet!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note:The
search\\_metadata field\n", " also contains a refresh\\_url value\n", " that can be used if you'd like to maintain and periodically update\n", " your collection of results with new information that's become\n", " available since the previous query." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[\n", " {\n", " \"contributors\": null, \n", " \"truncated\": false, \n", " \"text\": \"RT @hassanmusician: \\#MentionSomeoneImportantForYou God.\", \n", " \"in\\_reply\\_to\\_status\\_id\": null, \n", " \"id\": 316948241264549888, \n", " \"favorite\\_count\": 0, \n", " \"source\": \"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Tweets are imbued with some of the richest metadata that you'll\n", " find on the social web, and Chapter 9, Twitter Cookbook\n", " elaborates on some of the many possibilities." ] }, { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Analyzing the 140 Characters" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The online documentation is always the definitive source for Twitter\n", " platform objects, and it's worthwhile to bookmark the Tweets page, because it's one that\n", " you'll refer to quite frequently as you get familiarized with the basic\n", " anatomy of a tweet. No attempt is made here or elsewhere in the book to\n", " regurgitate online documentation, but a few notes are of interest given\n", " that you might still be a bit overwhelmed by the 5 KB of information that\n", " a tweet comprises. For simplicity of nomenclature, let's assume that we've\n", " extracted a single tweet from the search results and stored it in a\n", " variable named t. For example, t.keys() returns the top-level fields for the\n", " tweet and t['id'] accesses the\n", " identifier of the tweet." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note: If you're following along with the IPython Notebook for this\n", " chapter, the exact tweet that's under scrutiny is stored in a variable\n", " named t so that you can\n", " interactively access its fields and explore more easily. The current\n", " discussion assumes the same nomenclature, so values should correspond\n", " one-for-one.
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* The human-readable text of a tweet is available through t['text'].\n",
"* The entities in the text of a tweet are conveniently processed\n",
" for you and available through t['entities'].\n",
"* Clues as to the \"interestingness\" of a tweet are available through t['favorite\\_count'] and t['retweet\\_count'], which return the\n",
" number of times it's been bookmarked or retweeted,\n",
" respectively.\n",
"* If a tweet has been retweeted, the t['retweeted\\_status'] field provides\n",
" significant detail about the original tweet itself and its author.\n",
" Keep in mind that sometimes the text of a tweet changes as it is\n",
" retweeted, as users add reactions or otherwise manipulate the\n",
" text.\n",
"* The t['retweeted'] field\n",
" denotes whether or not the authenticated user (via an\n",
" authorized application) has retweeted this particular tweet. Fields\n",
" that vary depending upon the\n",
" point of view of the particular user are denoted in Twitter's\n",
" developer documentation as perspectival, which\n",
" means that their values will vary depending upon the perspective of\n",
" the user.\n",
"* Additionally, note that only original tweets are retweeted\n",
" from the standpoint of the API and information management. Thus, the\n",
" retweet\\_count reflects the total\n",
" number of times that the original tweet has been retweeted and\n",
" should reflect the same value in both the original tweet and all\n",
" subsequent retweets. In other words, retweets aren't retweeted. It\n",
" may be a bit counterintuitive at first, but if you think you're\n",
" retweeting a retweet, you're actually just retweeting the original\n",
" tweet that you were exposed to through a proxy. See “Examining Patterns in Retweets” later in this chapter for a\n",
" more nuanced discussion about the difference between retweeting versus\n",
" quoting a tweet.
" ] }, { "cell_type": "heading", "level": 4, "metadata": {}, "source": [ "Example 1.6. Extracting text, screen names, and hashtags from tweets" ] }, { "cell_type": "code", "collapsed": false, "input": [ "status_texts = [ status['text'] \n", " for status in statuses ]\n", "\n", "screen_names = [ user_mention['screen_name'] \n", " for status in statuses\n", " for user_mention in status['entities']['user_mentions'] ]\n", "\n", "hashtags = [ hashtag['text'] \n", " for status in statuses\n", " for hashtag in status['entities']['hashtags'] ]\n", "\n", "# Compute a collection of all words from all tweets\n", "words = [ w \n", " for t in status_texts \n", " for w in t.split() ]\n", "\n", "# Explore the first 5 items for each...\n", "\n", "print json.dumps(status_texts[0:5], indent=1)\n", "print json.dumps(screen_names[0:5], indent=1) \n", "print json.dumps(hashtags[0:5], indent=1)\n", "print json.dumps(words[0:5], indent=1)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Sample output follows; it displays five status texts, screen\n", " names, and hashtags to provide a feel for what's in the data." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note:List comprehensions are used frequently throughout this book,\n", " and it's worth consulting Appendix C, Python and IPython Notebook Tips & Tricks or the official Python tutorial for\n", " more details if you'd like additional context.
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note:In Python, syntax in which square brackets appear after a list\n", " or string value, such as
status\\_texts[0:5]
, is indicative of\n", " slicing, whereby you can easily extract items\n", " from lists or substrings from strings. In this particular case,\n", "[0:5]
indicates that you'd like the\n", " first five items in the liststatus\\_texts
(corresponding to items at\n", " indices 0 through 4). See Appendix C, Python and IPython Notebook Tips & Tricks for a more extended\n", " description of slicing in Python.
[\n",
" \"\\u201c@KathleenMariee\\_: \\#MentionSomeOneImportantForYou @AhhlicksCruise..., \n",
" \"\\#MentionSomeoneImportantForYou My bf @Linkin\\_Sunrise.\", \n",
" \"RT @hassanmusician: \\#MentionSomeoneImportantForYou God.\", \n",
" \"\\#MentionSomeoneImportantForYou @Louis\\_Tomlinson\", \n",
" \"\\#MentionSomeoneImportantForYou @Delta\\_Universe\"\n",
"]\n",
"[\n",
" \"KathleenMariee\\_\", \n",
" \"AhhlicksCruise\", \n",
" \"itsravennn\\_cx\", \n",
" \"kandykisses\\_13\", \n",
" \"BMOLOGY\"\n",
"]\n",
"[\n",
" \"MentionSomeOneImportantForYou\", \n",
" \"MentionSomeoneImportantForYou\", \n",
" \"MentionSomeoneImportantForYou\", \n",
" \"MentionSomeoneImportantForYou\", \n",
" \"MentionSomeoneImportantForYou\"\n",
"]\n",
"[\n",
" \"\\u201c@KathleenMariee\\_:\", \n",
" \"\\#MentionSomeOneImportantForYou\", \n",
" \"@AhhlicksCruise\", \n",
" \",\", \n",
" \"@itsravennn\\_cx\"\n",
"]
"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As expected, \\#MentionSomeoneImportantForYou dominates the hashtag\n",
" output. The output also provides a few commonly occurring screen names\n",
" that are worth investigating."
]
},
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"Analyzing Tweets and Tweet Entities with Frequency Analysis"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Virtually all analysis boils down to the simple exercise of counting things on\n",
" some level, and much of what we'll be doing in this book is manipulating\n",
" data so that it can be counted and further manipulated in meaningful\n",
" ways."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"From an empirical standpoint, counting observable things is the\n",
" starting point for just about everything, and thus the starting point\n",
" for any kind of statistical filtering or manipulation that strives to\n",
" find what may be a faint signal in noisy data. Whereas we just extracted\n",
" the first 5 items of each unranked list to get a feel for the data,\n",
" let's now take a closer look at what's in the data by computing a\n",
" frequency distribution and looking at the top 10 items in each\n",
" list."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As of Python 2.7, a collections
module is available that provides a counter that makes computing a\n",
" frequency distribution rather trivial. Example 1.7, “Creating a basic frequency distribution from the words in\n",
" tweets” demonstrates how to use a Counter
to compute frequency distributions as\n",
" ranked lists of terms. Among the more compelling reasons for mining\n",
" Twitter data is to try to answer the question of what people are talking\n",
" about right now. One of the simplest techniques you\n",
" could apply to answer this question is basic frequency analysis, just as\n",
" we are performing here."
]
},
{
"cell_type": "heading",
"level": 4,
"metadata": {},
"source": [
"Example 1.7. Creating a basic frequency distribution from the words in tweets"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from collections import Counter\n",
"\n",
"for item in [words, screen_names, hashtags]:\n",
" c = Counter(item)\n",
" print c.most_common()[:10] # top 10\n",
" print"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here are some sample results from frequency analysis of\n",
" tweets:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[(u'\\#MentionSomeoneImportantForYou', 92), (u'RT', 34), (u'my', 10), \n", " (u',', 6), (u'@justinbieber', 6), (u'<3', 6), (u'My', 5), (u'and', 4), \n", " (u'I', 4), (u'te', 3)]\n", "\n", "[(u'justinbieber', 6), (u'Kid\\_Charliej', 2), (u'Cavillafuerte', 2), \n", " (u'touchmestyles\\_', 1), (u'aliceorr96', 1), (u'gymleeam', 1), (u'fienas', 1), \n", " (u'nayely\\_1D', 1), (u'angelchute', 1)]\n", "\n", "[(u'MentionSomeoneImportantForYou', 94), (u'mentionsomeoneimportantforyou', 3), \n", " (u'NoHomo', 1), (u'Love', 1), (u'MentionSomeOneImportantForYou', 1), \n", " (u'MyHeart', 1), (u'bebesito', 1)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The result of the frequency distribution is a map of key/value\n", " pairs corresponding to terms and their frequencies, so let's make\n", " reviewing the results a little easier on the eyes by emitting a tabular\n", " format. You can install a package called
prettytable by typing pip install prettytable in a terminal; this\n",
" package provides a convenient way to emit a fixed-width tabular format\n",
" that can be easily copied and pasted."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Example 1.8, “Using prettytable to display tuples in a nice tabular\n",
" format” shows how to use it to display the\n",
" same results."
]
},
{
"cell_type": "heading",
"level": 4,
"metadata": {},
"source": [
"Example 1.8. Using prettytable to display tuples in a nice tabular format"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from prettytable import PrettyTable\n",
"\n",
"for label, data in (('Word', words), \n",
" ('Screen Name', screen_names), \n",
" ('Hashtag', hashtags)):\n",
" pt = PrettyTable(field_names=[label, 'Count']) \n",
" c = Counter(data)\n",
" [ pt.add_row(kv) for kv in c.most_common()[:10] ]\n",
" pt.align[label], pt.align['Count'] = 'l', 'r' # Set column alignment\n",
" print pt"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"+--------------------------------+-------+\n", "| Word | Count |\n", "+--------------------------------+-------+\n", "| \\#MentionSomeoneImportantForYou | 92 |\n", "| RT | 34 |\n", "| my | 10 |\n", "| , | 6 |\n", "| @justinbieber | 6 |\n", "| <3 | 6 |\n", "| My | 5 |\n", "| and | 4 |\n", "| I | 4 |\n", "| te | 3 |\n", "+--------------------------------+-------+\n", "+----------------+-------+\n", "| Screen Name | Count |\n", "+----------------+-------+\n", "| justinbieber | 6 |\n", "| Kid\\_Charliej | 2 |\n", "| Cavillafuerte | 2 |\n", "| touchmestyles\\_ | 1 |\n", "| aliceorr96 | 1 |\n", "| gymleeam | 1 |\n", "| fienas | 1 |\n", "| nayely\\_1D | 1 |\n", "| angelchute | 1 |\n", "+----------------+-------+\n", "+-------------------------------+-------+\n", "| Hashtag | Count |\n", "+-------------------------------+-------+\n", "| MentionSomeoneImportantForYou | 94 |\n", "| mentionsomeoneimportantforyou | 3 |\n", "| NoHomo | 1 |\n", "| Love | 1 |\n", "| MentionSomeOneImportantForYou | 1 |\n", "| MyHeart | 1 |\n", "| bebesito | 1 |\n", "+-------------------------------+-------+" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A quick skim of the results reveals at least one marginally\n", " surprising thing: Justin Bieber is high on the list of entities for this\n", " small sample of data, and given his popularity with tweens on Twitter he\n", " may very well have been the \"most important someone\" for this trending\n", " topic, though the results here are inconclusive. The appearance of\n", "
&lt;3 in the raw tweet text (displayed as <3 in the output above) is also interesting because\n",
" it is an escaped form of <3, which\n",
" represents a heart shape (that's rotated 90 degrees, like other\n",
" emoticons and smileys) and is a common abbreviation for \"loves.\" Given\n",
" the nature of the query, it's not surprising to see a value like\n",
" &lt;3, although it may initially\n",
" seem like junk or noise."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Although the entities with a frequency greater than two are\n",
" interesting, the broader results are also revealing in other ways. For\n",
" example, \"RT\" was a very common token, implying that there were a\n",
" significant number of retweets (we'll investigate this observation\n",
" further in “Examining Patterns in Retweets”). Finally, as\n",
" might be expected, the \\#MentionSomeoneImportantForYou hashtag and a\n",
" couple of case-sensitive variations dominated the hashtags; a\n",
" data-processing takeaway is that it would be worthwhile to normalize\n",
" each word, screen name, and hashtag to lowercase when tabulating\n",
" frequencies since there will inevitably be variation in tweets."
]
},
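{
"cell_type": "markdown",
"metadata": {},
"source": [
"The normalization suggested above might look like the following minimal sketch, written to run on Python 2.7 or Python 3. The helper name normalized_counts and the sample list are hypothetical; the list is modeled on the hashtag output shown earlier rather than taken from live data.

```python
from collections import Counter


def normalized_counts(tokens):
    # Lowercase every token before counting so that variants such as
    # 'Love' and 'love' are tallied together.
    return Counter(token.lower() for token in tokens)


# Hypothetical sample modeled on the hashtag output shown earlier
sample_hashtags = ['MentionSomeoneImportantForYou',
                   'mentionsomeoneimportantforyou',
                   'MentionSomeOneImportantForYou',
                   'Love']

counts = normalized_counts(sample_hashtags)
print(counts.most_common(2))
```

Whether to fold case depends on the question being asked: lowercasing merges variants of the same hashtag, but it also discards capitalization that might matter elsewhere, such as in words from the tweet text."
]
},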
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"Computing the Lexical Diversity of Tweets"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A slightly more advanced measurement that involves calculating simple\n",
" frequencies and can be applied to unstructured text is a metric called\n",
" lexical diversity. Mathematically, this is an\n",
" expression of the number of unique tokens in the\n",
" text divided by the total number of tokens in the\n",
" text, which are both elementary yet important metrics in and of\n",
" themselves. Lexical diversity is an interesting concept in the area of\n",
" interpersonal communications because it provides a quantitative measure\n",
" for the diversity of an individual's or group's vocabulary. For example,\n",
" suppose you are listening to someone who repeatedly says \"and stuff\" to\n",
" broadly generalize information as opposed to providing specific examples\n",
" to reinforce points with more detail or clarity. Now, contrast that\n",
" speaker to someone else who seldom uses the word \"stuff\" to generalize\n",
" and instead reinforces points with concrete examples. The speaker who\n",
" repeatedly says \"and stuff\" would have a lower lexical diversity than\n",
" the speaker who uses a more diverse vocabulary, and chances are\n",
" reasonably good that you'd walk away from the conversation feeling as\n",
" though the speaker with the higher lexical diversity understands the\n",
" subject matter better."
]
},
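{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a small worked illustration of the ratio described above, the following sketch compares two made-up utterances. The function name lexical_diversity_sketch and both sentences are hypothetical, and returning 0.0 for empty input is an assumption made here for convenience rather than part of the definition.

```python
def lexical_diversity_sketch(tokens):
    # Unique tokens divided by total tokens; an empty input is treated
    # as having a diversity of 0.0 rather than raising an error.
    if not tokens:
        return 0.0
    return 1.0 * len(set(tokens)) / len(tokens)


# Hypothetical utterances from two speakers
vague = 'we sell phones and stuff and tablets and stuff'.split()
specific = 'we sell phones tablets laptops and smart watches'.split()

print(lexical_diversity_sketch(vague))     # 6 unique tokens out of 9
print(lexical_diversity_sketch(specific))  # 8 unique tokens out of 8
```

The repeated filler words pull the first ratio down to two thirds, while the second utterance never repeats a token."
]
},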
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As applied to tweets or similar online communications, lexical\n",
" diversity can be worth considering as a primitive statistic for\n",
" answering a number of questions, such as how broad or narrow the subject\n",
" matter is that an individual or group discusses. Although an overall\n",
" assessment could be interesting, breaking down the analysis to specific\n",
" time periods could yield additional insight, as could comparing\n",
" different groups or individuals. For example, it would be interesting to\n",
" measure whether or not there is a significant difference between the\n",
" lexical diversity of two soft drink companies such as Coca-Cola and Pepsi as an entry point for\n",
" exploration if you were comparing the effectiveness of their social\n",
" media marketing campaigns on Twitter."
]
},
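{
"cell_type": "markdown",
"metadata": {},
"source": [
"One way to sketch the group-by-group comparison described above is to pool the tokens for each label and compute the ratio per label. The function name diversity_by_group, the labels, and the texts below are all hypothetical; the labels could just as easily be dates if you were comparing time periods.

```python
from collections import defaultdict


def diversity_by_group(labeled_texts):
    # Collect lowercased, whitespace-delimited tokens per label, then
    # compute unique tokens / total tokens for each label.
    tokens_by_label = defaultdict(list)
    for label, text in labeled_texts:
        tokens_by_label[label].extend(text.lower().split())
    return dict((label, 1.0 * len(set(tokens)) / len(tokens))
                for label, tokens in tokens_by_label.items()
                if tokens)


# Hypothetical (label, text) pairs
labeled_texts = [
    ('brand_a', 'try our new cola'),
    ('brand_a', 'try our new diet cola'),
    ('brand_b', 'summer concert tickets giveaway starts today'),
]

scores = diversity_by_group(labeled_texts)
for label, score in sorted(scores.items()):
    print('%s %.2f' % (label, score))
```

Pooling tokens per label means that groups with many more tweets will tend to score lower simply because common words repeat, so comparisons are most meaningful between samples of similar size."
]
},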
{
"cell_type": "markdown",
"metadata": {},
"source": [
"With a basic understanding of how to use a statistic like lexical\n",
" diversity to analyze textual content such as tweets, let's now compute\n",
" the lexical diversity for statuses, screen names, and hashtags for our\n",
" working data set, as shown in Example 1.9, “Calculating lexical diversity for tweets”."
]
},
{
"cell_type": "heading",
"level": 4,
"metadata": {},
"source": [
"Example 1.9. Calculating lexical diversity for tweets"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# A function for computing lexical diversity\n",
"def lexical_diversity(tokens):\n",
" return 1.0*len(set(tokens))/len(tokens) \n",
"\n",
"# A function for computing the average number of words per tweet\n",
"def average_words(statuses):\n",
" total_words = sum([ len(s.split()) for s in statuses ]) \n",
" return 1.0*total_words/len(statuses)\n",
"\n",
"print lexical_diversity(words)\n",
"print lexical_diversity(screen_names)\n",
"print lexical_diversity(hashtags)\n",
"print average_words(status_texts)"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The results of Example 1.9, “Calculating lexical diversity for tweets”\n",
" follow:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"0.67610619469\n", "0.955414012739\n", "0.0686274509804\n", "5.76530612245" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are a few observations worth considering in the\n", " results:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
* The lexical diversity of the words in the text of the tweets\n", " is around 0.67. One way to interpret that figure would be to say\n", " that about two out of every three words are unique, or you might\n", " say that each status update carries around 67% unique information.\n", " Given that the average number of words in each tweet is around\n", " six, that translates to about four unique words per tweet.\n", " Intuition aligns with the data in that the nature of a\n", " \\#MentionSomeoneImportantForYou trending hashtag is to solicit a\n", " response that will probably be a few words long. In any event, a\n", " value of 0.67 is on the high side for lexical diversity of\n", " ordinary human communication, but given the nature of the data, it\n", " seems very reasonable.\n",
"* The lexical diversity of the screen names, however, is even higher, with a value\n", " of 0.95, which means that about 19 out of 20 screen names\n", " mentioned are unique. This observation also makes sense given that\n", " many answers to the question will be a screen name, and that most\n", " people won't be providing the same responses for the solicitous\n", " hashtag.\n",
"* The lexical diversity of the hashtags is extremely low at a value of around 0.068, implying\n", " that very few values other than the \\#MentionSomeoneImportantForYou\n", " hashtag appear multiple times in the results. Again, this makes\n", " good sense given that most responses are short and that hashtags\n", " really wouldn't make much sense to introduce as a response to the\n", " prompt of mentioning someone important for you.\n",
"* The average number of words per tweet is very low at a value\n", " of just under 6, which makes sense given the nature of the\n", " hashtag, which is designed to solicit short responses consisting\n", " of just a few words." ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Examining Patterns in Retweets" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Even though Twitter's native Retweet API populates status values such as retweet\\_count and retweeted\\_status, some Twitter users may prefer to quote a tweet, which entails a workflow involving copying and pasting the text\n",
" and prepending \"RT @username\" or suffixing \"/via\n",
" @username\" to provide attribution."
]
},
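{
"cell_type": "markdown",
"metadata": {},
"source": [
"The quoting conventions described above can be detected with a simple text heuristic. The following is a minimal sketch, not a complete solution: the pattern name ATTRIBUTION and the helper quoted_sources are hypothetical, the sample strings are made up or adapted from the output in this chapter, and the pattern assumes that screen names consist of letters, digits, and underscores.

```python
import re

# Heuristic pattern for quoted-tweet conventions: 'RT @username' or
# 'via @username', preceded by the start of the text or by a character
# that cannot be part of a word. Character classes are spelled out
# explicitly.
ATTRIBUTION = re.compile(
    '(?:^|[^A-Za-z0-9_])(?:RT|via) +@([A-Za-z0-9_]+)', re.IGNORECASE)


def quoted_sources(text):
    # Return the screen names credited through either convention
    return ATTRIBUTION.findall(text)


print(quoted_sources('RT @hassanmusician: #MentionSomeoneImportantForYou God.'))
print(quoted_sources('My mum, always /via @SCOTTSUMME'))
print(quoted_sources('No attribution here'))
```

A heuristic like this complements the retweeted\\_status metadata rather than replacing it, since it catches only tweets whose authors followed one of the two conventions."
]
},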
{
"cell_type": "markdown",
"metadata": {},
"source": [
"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A good exercise at this point would be to further analyze the data\n", " to determine if there was a particular tweet that was highly retweeted\n", " or if there were just lots of \"one-off\" retweets. The approach we'll\n", " take to find the most popular retweets is to simply iterate over each\n", " status update and store out the retweet count, originator of the\n", " retweet, and text of the retweet if the status update is a retweet.\n", " Example 1.10, “Finding the most popular retweets” demonstrates how to capture\n", " these values with a list comprehension and sort by the retweet count to\n", " display the top few results." ] }, { "cell_type": "heading", "level": 4, "metadata": {}, "source": [ "Example 1.10. Finding the most popular retweets" ] }, { "cell_type": "code", "collapsed": false, "input": [ "retweets = [\n", " # Store out a tuple of these three values ...\n", " (status['retweet_count'], \n", " status['retweeted_status']['user']['screen_name'],\n", " status['text']) \n", " \n", " # ... for each status ...\n", " for status in statuses \n", " \n", " # ... 
so long as the status meets this condition.\n", " if status.has_key('retweeted_status')\n", " ]\n", "\n", "# Slice off the first 5 from the sorted results and display each item in the tuple\n", "\n", "pt = PrettyTable(field_names=['Count', 'Screen Name', 'Text'])\n", "[ pt.add_row(row) for row in sorted(retweets, reverse=True)[:5] ]\n", "pt.max_width['Text'] = 50\n", "pt.align= 'l'\n", "print pt" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Results from Example 1.10, “Finding the most popular retweets” are\n", " interesting:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note:When mining Twitter data, you'll probably want to both account\n", " for the tweet metadata and use heuristics to analyze the 140\n", " characters for conventions such as \"RT @username\"\n", " or \"/via @username\" when considering retweets, in\n", " order to maximize the efficacy of your analysis. See “Finding Users Who Have Retweeted a Status” for a more\n", " detailed discussion on retweeting with Twitter's native Retweet API\n", " versus \"quoting\" tweets and using conventions to apply\n", " attribution.
+-------+----------------+----------------------------------------------------+\n", "| Count | Screen Name | Text |\n", "+-------+----------------+----------------------------------------------------+\n", "| 23 | hassanmusician | RT @hassanmusician: \\#MentionSomeoneImportantForYou |\n", "| | | God. |\n", "| 21 | HSweethearts | RT @HSweethearts: \\#MentionSomeoneImportantForYou |\n", "| | | my high school sweetheart \u2764 |\n", "| 15 | LosAlejandro\\_ | RT @LosAlejandro\\_: \u00bfNadie te menciono en |\n", "| | | \"\\#MentionSomeoneImportantForYou\"? JAJAJAJAJAJAJAJA |\n", "| | | JAJAJAJAJAJAJAJAJAJAJAJAJAJAJAJAJAJAJAJA Ven, ... |\n", "| 9 | SCOTTSUMME | RT @SCOTTSUMME: \\#MentionSomeoneImportantForYou My |\n", "| | | Mum. Shes loving, caring, strong, all in one. I |\n", "| | | love her so much \u2764\u2764\u2764\u2764 |\n", "| 7 | degrassihaha | RT @degrassihaha: \\#MentionSomeoneImportantForYou I |\n", "| | | can't put every Degrassi cast member, crew member, |\n", "| | | and writer in just one tweet.... |\n", "+-------+----------------+----------------------------------------------------+" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\"God\" tops the list, followed closely by \"my high school\n", " sweetheart,\" and coming in at number four on the list is \"My Mum.\" None\n", " of the top five items in the list correspond to Twitter user accounts,\n", " although we might have suspected this (with the exception of\n", " @justinbieber) from the previous analysis. Inspection of results further\n", " down the list does reveal particular user mentions, but the sample we\n", " have drawn from for this query is so small that no trends emerge.\n", " Searching for a larger sample of results would likely yield some user\n", " mentions with a frequency greater than one, which would be interesting\n", " to further analyze. 
The possibilities for further analysis are pretty\n", " open-ended, and by now, hopefully, you're itching to try out some custom\n", " queries of your own." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
Note: Suggested exercises are at the end of this chapter. Be sure to\n", " also check out Chapter 9, Twitter Cookbook as a source of\n", " inspiration: it includes more than two dozen recipes presented in a\n", " cookbook-style format." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before we move on, a subtlety worth noting is that it's quite\n", " possible (and probable, given the relatively low frequencies of the\n", " retweets observed in this section) that the original tweets that were\n", " retweeted may not exist in our sample search results set. For example,\n", " the most popular retweet in the sample results originated from a user\n", " with a screen name of @hassanmusician and was retweeted 23 times.\n", " However, closer inspection of the data reveals that we collected only 1\n", " of the 23 retweets in our search results. Neither the original tweet nor\n", " any of the other 22 retweets appears in the data set. This doesn't pose\n", " any particular problems, although it might raise the question of who the\n", " other 22 retweeters for this status were." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The answer to this kind of question is a valuable one because it\n", " allows us to take content that represents a concept, such as \"God\" in\n", " this case, and discover a group of other users who apparently share the\n", " same sentiment or common interest. As previously mentioned, a handy way\n", " to model data involving people and the things that they're interested in\n", " is an interest graph; this is the\n", " primary data structure that supports analysis in Chapter 7, Mining GitHub: Inspecting Software Collaboration Habits, Building\n", " Interest Graphs, and More.\n", " Interpretative speculation about these users could suggest that they are\n", " spiritual or religious individuals, and further analysis of their\n", " particular tweets might corroborate that inference. Example 1.11, “Looking up users who have retweeted a status” shows how to find these individuals\n", " with the GET statuses/retweets/:id API."
]
},
{
"cell_type": "heading",
"level": 4,
"metadata": {},
"source": [
"Example 1.11. Looking up users who have retweeted a status"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# Get the original tweet id for a tweet from its retweeted_status node\n",
"# and insert it here in place of the sample value that is provided\n",
"# from the text of the book\n",
"\n",
"_retweets = twitter_api.statuses.retweets(id=317127304981667841)\n",
"\n",
"# Parenthesized print of a single value works in both Python 2 and 3\n",
"print([r['user']['screen_name'] for r in _retweets])"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Determining whether the users who retweeted this status share\n",
" any religious or spiritual affiliation is left as an\n",
" independent exercise."
]
},
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"Visualizing Frequency Data with Histograms"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A nice feature of IPython Notebook is its ability to generate and insert\n",
" high-quality and customizable plots of data as part of an interactive\n",
" workflow. In particular, the matplotlib package and\n",
" other scientific computing tools that are available for IPython Notebook\n",
" are quite powerful and capable of generating complex figures with very\n",
" little effort once you understand the basic workflows."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To illustrate the use of matplotlib's plotting capabilities, let's plot\n",
" some data for display. To get warmed up, we’ll consider a plot that\n",
" displays the results from the words variable\n",
" as defined in Example 1.9, “Calculating lexical diversity for tweets”. With\n",
" the help of a Counter, it's easy to\n",
" generate a sorted list of tuples where each tuple is a (word, frequency) pair; the x-axis value will\n",
" correspond to the index of the tuple, and the y-axis will correspond to\n",
" the frequency for the word in that tuple. It would generally be\n",
" impractical to label each word on the x-axis, although\n",
" each x-axis position does represent a word. Figure 1.4, “A plot displaying the sorted frequencies for the words computed\n",
" by Example 1.8, “Using prettytable to display tuples in a nice tabular\n",
" format””\n",
" displays a plot for the same words data that we previously rendered as a\n",
" table in Example 1.8, “Using prettytable to display tuples in a nice tabular\n",
" format”. The y-axis values on the plot\n",
" correspond to the number of times a word appeared. Although labels for\n",
" each word are not provided, x-axis values have been sorted so that the\n",
" relationship between word frequencies is more apparent. Each axis has\n",
" been adjusted to a logarithmic scale to \"squash\" the curve being\n",
" displayed. The plot can be generated directly in IPython Notebook\n",
" with the code shown in Example 1.12, “Plotting frequencies of words”."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"" ] }, { "cell_type": "heading", "level": 4, "metadata": {}, "source": [ "Example 1.12. Plotting frequencies of words" ] }, { "cell_type": "code", "collapsed": false, "input": [ "word_counts = sorted(Counter(words).values(), reverse=True)\n", "\n", "plt.loglog(word_counts)\n", "plt.ylabel(\"Freq\")\n", "plt.xlabel(\"Word Rank\")" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A plot of frequency values is intuitive and convenient, but it can also be\n", " useful to group together data values into bins that correspond to a\n", " range of frequencies. For example, how many words have a frequency\n", " between 1 and 5, between 5 and 10, between 10 and 15, and so forth? A\n", " histogram is\n", " designed for precisely this purpose and provides a convenient\n", " visualization for displaying tabulated frequencies as adjacent\n", " rectangles, where the area of each rectangle is a measure of the data\n", " values that fall within that particular range of values. Figures 1.5 and 1.6 show histograms of\n", " the tabular data generated from Examples 1.8 and 1.10,\n", " respectively. Although the histograms don't have x-axis labels that show\n", " us which words have which frequencies, that's not really their purpose.\n", " A histogram gives us insight into the underlying frequency distribution,\n", " with the x-axis corresponding to a range for words that each have a\n", " frequency within that range and the y-axis corresponding to the total\n", " frequency of all words that appear within that range." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When interpreting Figure 1.5, “Histograms of tabulated frequency data for words, screen names,\n", " and hashtags, each displaying a particular kind of data that is\n", " grouped by frequency”, look back to the\n", " corresponding tabular data and consider that there are a large number of\n", " words, screen names, or hashtags that have low frequencies and appear few times in the text;\n", " however, when we combine all of these low-frequency terms and bin them\n", " together into a range of \"all words with frequency between 1 and 10,\" we\n", " see that the total number of these low-frequency words accounts for most\n", " of the text. More concretely, we see that there are approximately 10\n", " words that account for almost all of the frequencies as rendered by the\n", " area of the large blue rectangle, while there are just a couple of words\n", " with much higher frequencies: \"\\#MentionSomeoneImportantForYou\" and \"RT,\"\n", " with respective frequencies of 34 and 92 as given by our tabulated\n", " data." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Likewise, when interpreting Figure 1.6, “A histogram of retweet frequencies”, we see that\n", " there are a select few tweets that are retweeted with much higher\n", " frequencies than the bulk of the tweets, which are retweeted only once\n", " and account for the majority of the volume given by the largest blue\n", " rectangle on the left side of the histogram." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note: If you are using the virtual machine, your IPython Notebooks\n", " should be configured to use plotting capabilities out of the box. If\n", " you are running on your own local environment, be sure to have started\n", " IPython Notebook with PyLab\n", " enabled as follows:\n", "\n", "    ipython notebook --pylab=inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One of the primary takeaways from this chapter from an analytical\n", " standpoint is that counting is generally the first step to any kind of\n", " meaningful quantitative analysis. Although basic frequency analysis is\n", " simple, it is a powerful tool for your repertoire that shouldn\u2019t be\n", " overlooked just because it\u2019s so obvious; besides, many other advanced\n", " statistics depend on it. On the contrary, frequency analysis and measures\n", " such as lexical diversity should be employed early and often, for\n", " precisely the reason that doing so is so obvious and simple. Oftentimes,\n", " but not always, the results from the simplest techniques can rival the\n", " quality of those from more sophisticated analytics. With respect to data\n", " in the Twitterverse, these modest techniques can usually get you quite a\n", " long way toward answering the question, \u201cWhat are people talking about\n", " right now?\u201d Now that's something we'd all like to know, isn't it?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note: Chapter 9, Twitter Cookbook contains a number of Twitter\n", " recipes covering a broad array of topics that range from tweet\n", " harvesting and analysis to the effective use of storage for archiving\n", " tweets to techniques for analyzing followers for insights.
" ] }, { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Recommended Exercises" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note: The source code outlined for this chapter and all other chapters\n", " is available at GitHub in a\n", " convenient IPython Notebook format that you're highly encouraged to try\n", " out from the comfort of your own web browser.\n",
"\n",
"* Bookmark and spend some time reviewing Twitter's API documentation. In\n",
" particular, spend some time browsing the information on the REST API and platform objects.\n",
"* If you haven't already, get comfortable working in IPython and IPython Notebook as a more\n",
" productive alternative to the traditional Python interpreter. Over\n",
" the course of your social web mining career, the saved time and\n",
" increased productivity will really start to add up.\n",
"* If you have a Twitter account with a nontrivial number of tweets, request your\n",
" historical tweet archive from your account settings and analyze it.\n",
" The export of your account data includes files organized by time\n",
" period in a convenient JSON format. See the\n",
" README.txt file included in the downloaded\n",
" archive for more details. What are the most common terms that appear\n",
" in your tweets? Who do you retweet the most often? How many of your\n",
" tweets are retweeted (and why do you think this is the case)?\n",
"* Take some time to explore Twitter's REST API with its developer console. Although we\n",
" opted to dive in with the twitter Python package\n",
" in a programmatic fashion in this chapter, the\n",
" console can be useful for exploring the API, the effects of\n",
" parameters, and more. The command-line tool Twurl is another option to\n",
" consider if you prefer working in a terminal.\n",
"* Complete the exercise of determining whether there seems to be\n",
" a spiritual or religious affiliation for the users who retweeted the\n",
" status citing \"God\" as someone important to them, or follow the\n",
" workflow in this chapter for a trending topic or arbitrary search\n",
" query of your own choosing. Explore some of the advanced search features that\n",
" are available for more precise querying.\n",
"* Explore Yahoo! GeoPlanet's\n",
" Where On Earth ID API so that you can compare and contrast\n",
" trends from different locales.\n",
"* Take a closer look at matplotlib and learn how to create\n",
" beautiful plots of 2D and 3D data\n",
" with IPython Notebook.\n",
"* Explore and apply some of the exercises from Chapter 9, Twitter Cookbook.
" ] }, { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Online Resources" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Beautiful plots of 2D and 3D data with IPython Notebook\n",
"* IPython \"magic functions\"\n",
"* json.org\n",
"* PyLab\n",
"* Python list comprehensions\n",
"* The official Python tutorial\n",
"* OAuth\n",
"* Twitter API documentation\n",
"* Twitter API Rate Limiting in v1.1\n",
"* Twitter developer console\n",
"* Twitter Developer Rules of the Road\n",
"* Twitter's OAuth documentation\n",
"* Twitter Search API operators\n",
"* Twitter Streaming API\n",
"* Twitter terms of service\n",
"* Twurl\n",
"* Yahoo! GeoPlanet's Where On Earth ID API
\n", "[1] Although it's an implementation detail, it may be worth noting\n", " that Twitter's v1.1 API still implements OAuth 1.0a, whereas many\n", " other social web properties have since upgraded to OAuth 2.0.
\n", "