{ "metadata": { "name": "" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Chapter\u00a01.\u00a0Mining Twitter: Exploring Trending Topics, Discovering What People Are Talking About, and More" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

This content is a full-text excerpt from Mining the Social Web (2nd Edition) that has been minimally converted to IPython Notebook format so that you can interactively run the example code as you read the book. The purpose of this offering is to determine if there is sufficient interest to offer the remainder of the entire book as a collection of IPython Notebooks (as a standard distribution format that's in addition to PDF, Kindle, etc.) based on feedback from you.

\n", "

If you would like to see the full-text of this book (or other books like it) offered in a native IPython Notebook format, tweet something like \"@OReillyMedia: Please distribute @SocialWebMining in IPython Notebook format\" so that both O'Reilly Media and the author receive your feedback. Alternatively, contact O'Reilly Media using the link above. Thanks!

\n", "

\n", "You can also view this sampler chapter in O'Reilly's new Chimera ebook reader or download a PDF.\n", "

\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This chapter kicks off our journey of mining the social web with\n", " Twitter, a rich source of social data that is a great starting point for\n", " social web mining because of its inherent openness for public consumption,\n", " clean and well-documented API, rich developer tooling, and broad appeal to\n", " users from every walk of life. Twitter data is particularly interesting\n", " because tweets happen at the \"speed of thought\" and are available for\n", " consumption as they happen in near real time, represent the broadest\n", " cross-section of society at an international level, and are so inherently\n", " multifaceted. Tweets and Twitter's \"following\" mechanism link people in a variety of ways, ranging from short\n", " (but often meaningful) conversational dialogues to interest graphs that\n", " connect people and the things that they care about." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since this is the first chapter, we'll take our time acclimating to\n", " our journey in social web mining. However, given that Twitter data is so\n", " accessible and open to public scrutiny, Chapter 9, Twitter Cookbook\n", " further elaborates on the broad number of data mining possibilities by\n", " providing a terse collection of recipes in a convenient problem/solution\n", " format that can be easily manipulated and readily applied to a wide range of\n", " problems. You'll also be able to apply concepts from future chapters to\n", " Twitter data." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
Note:

Always get the latest bug-fixed source code for this chapter (and\n", " every other chapter) online at http://bit.ly/MiningTheSocialWeb2E.\n", " Be sure to also take advantage of this book's virtual machine experience,\n", " as described in Appendix A, Information About This Book's Virtual Machine Experience, to maximize your enjoyment of the\n", " sample code.

" ] }, { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Overview" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this chapter, we'll ease into the process of getting situated\n", " with a minimal (but effective) development environment with Python, survey\n", " Twitter's API, and distill some analytical insights from tweets using\n", " frequency analysis. Topics that you'll learn about in this chapter\n", " include:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Why Is Twitter All the Rage?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Most chapters won't open with a reflective discussion, but since this is\n", " the first chapter of the book and introduces a social website that is\n", " often misunderstood, it seems appropriate to take a moment to examine\n", " Twitter at a fundamental level." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How would you define Twitter?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are many ways to answer this question, but let's consider it\n", " from an overarching angle that addresses some fundamental aspects of our\n", " shared humanity that any technology needs to account for in order to be\n", " useful and successful. After all, the purpose of technology is to enhance\n", " our human experience." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As humans, what are some things that we want that technology might\n", " help us to get?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the context of the current discussion, these are just a few\n", " observations that are generally true of humanity. 
We have a deeply rooted\n", " need to share our ideas and experiences, which gives us the ability to\n", " connect with other people, to be heard, and to feel a sense of worth and\n", " importance. We are curious about the world around us and how to organize\n", " and manipulate it, and we use communication to share our observations, ask\n", " questions, and engage with other people in meaningful dialogues about our\n", " quandaries." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The last two bullet points highlight our inherent intolerance to\n", " friction. Ideally, we don't want to have to work any harder than is\n", " absolutely necessary to satisfy our curiosity or get any particular job\n", " done; we'd rather be doing \"something else\" or moving on to the next thing\n", " because our time on this planet is so precious and short. Along similar\n", " lines, we want things now and tend to be impatient\n", " when actual progress doesn't happen at the speed of our own\n", " thought." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One way to describe Twitter is as a microblogging service that\n", " allows people to communicate with short, 140-character messages that\n", " roughly correspond to thoughts or ideas. In that regard, you could think\n", " of Twitter as being akin to a free, high-speed, global text-messaging\n", " service. In other words, it's a glorified piece of valuable infrastructure\n", " that enables rapid and easy communication. However, that\u2019s not all of the\n", " story. It doesn't adequately address our inherent curiosity and the value\n", " proposition that emerges when you have over 500 million curious people registered, with\n", " over 100 million of them actively engaging their curiosity on a\n", " regular monthly basis." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Besides the macro-level possibilities for marketing and\n", " advertising\u2014which are always lucrative with a user base of that size\u2014it's\n", " the underlying network dynamics that created the gravity for such a user\n", " base to emerge that are truly interesting, and that's why Twitter is all\n", " the rage. While the communication bus that enables users to share short\n", " quips at the speed of thought may be a necessary\n", " condition for viral adoption and sustained engagement on the Twitter\n", " platform, it's not a sufficient condition. The extra\n", " ingredient that makes it sufficient is that Twitter's asymmetric\n", " following model satisfies our curiosity. It is the\n", " asymmetric following model that casts Twitter as more of an interest graph\n", " than a social network, and the APIs that provide just enough of a\n", " framework for structure and self-organizing behavior to emerge from the\n", " chaos." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In other words, whereas some social websites like Facebook and\n", " LinkedIn require the mutual acceptance of a connection between users\n", " (which usually implies a real-world connection of some kind), Twitter's\n", " relationship model allows you to keep up with the latest happenings of\n", " any other user, even though that other user may not\n", " choose to follow you back or even know that you exist. Twitter's\n", " following model is simple but exploits a fundamental\n", " aspect of what makes us human: our curiosity. Whether it be an infatuation\n", " with celebrity gossip, an urge to keep up with a favorite sports team, a\n", " keen interest in a particular political topic, or a desire to connect with\n", " someone new, Twitter provides you with boundless opportunities to satisfy\n", " your curiosity." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Think of an interest graph as a way of modeling\n", " connections between people and their arbitrary interests. Interest graphs\n", " provide a profound number of possibilities in the data mining realm that\n", " primarily involve measuring correlations between things for the objective\n", " of making intelligent recommendations and other applications in machine\n", " learning. For example, you could use an interest graph to measure\n", " correlations and make recommendations ranging from whom to follow on\n", " Twitter to what to purchase online to whom you should date. To illustrate\n", " the notion of Twitter as an interest graph, consider that a Twitter user\n", " need not be a real person; it very well could be a person, but it could\n", " also be an inanimate object, a company, a musical group, an imaginary\n", " persona, an impersonation of someone (living or dead), or just about\n", " anything else." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For example, the @HomerJSimpson account is the official\n", " account for Homer Simpson, a popular character from The\n", " Simpsons television show. Although Homer Simpson isn't a real\n", " person, he's a well-known personality throughout the world, and the\n", " @HomerJSimpson Twitter persona acts as a conduit for him (or his\n", " creators, actually) to engage his fans. Likewise, although this book will\n", " probably never reach the popularity of Homer Simpson, @SocialWebMining is its official\n", " Twitter account and provides a means for a community that's interested in\n", " its content to connect and engage on various levels. When you realize that\n", " Twitter enables you to create, connect, and explore a community of\n", " interest for an arbitrary topic of interest, the power of Twitter and the\n", " insights you can gain from mining its data become much more\n", " obvious." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There is very little governance of what a Twitter account can be aside from the badges on some\n", " accounts that identify celebrities and public figures as \"verified\n", " accounts\" and basic restrictions in Twitter's Terms of Service agreement, which is\n", " required for using the service. It may seem very subtle, but it's an\n", " important distinction from some social websites in which accounts must\n", " correspond to real, living people, businesses, or entities of a similar\n", " nature that fit into a particular taxonomy. Twitter places no particular\n", " restrictions on the persona of an account and relies on self-organizing\n", " behavior such as following relationships and folksonomies that emerge from\n", " the use of hashtags to create a certain kind of order within the system." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
Sidebar Discussion:
\n", "
\n", "
\n", "
Taxonomies and Folksonomies
\n", "
\n", "
\n", "
\n", "

A fundamental aspect of human intelligence is the desire to classify things and\n", " derive a hierarchy in which each element “belongs to” or is a “child” of\n", " a parent element one level higher in the hierarchy. Leaving aside some\n", " of the finer distinctions between a\n", " taxonomy and an ontology, think of a\n", " taxonomy as a hierarchical structure like a tree\n", " that classifies elements into particular parent/child relationships,\n", " whereas a folksonomy (a\n", " term coined around 2004) describes the universe of\n", " collaborative tagging and social indexing efforts that emerge in various\n", " ecosystems of the Web. It’s a play on words in the sense that it blends\n", " folk and taxonomy. So, in\n", " essence, a folksonomy is just a fancy way of describing the\n", " decentralized universe of tags that emerges as a mechanism of collective intelligence\n", " when you allow people to classify content with labels. One of the things\n", " that's so compelling about the use of hashtags on Twitter is that the\n", " folksonomies that organically emerge act as points of aggregation for\n", " common interests and provide a focused way to explore while still\n", " leaving open the possibility for nearly unbounded serendipity.

" ] }, { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Exploring Twitter's API" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now having a proper frame of reference for Twitter, let us now\n", " transition our attention to the problem of acquiring and analyzing Twitter\n", " data." ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Fundamental Twitter Terminology" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Twitter might be described as a real-time, highly social microblogging\n", " service that allows users to post short status updates, called tweets, that appear on timelines. Tweets may include one or more\n", " entities in their 140 characters of content and reference one or more\n", " places that map to locations in the real world. An understanding\n", " of users, tweets, and timelines is particularly essential to effective\n", " use of Twitter's API, so a\n", " brief introduction to these fundamental Twitter\n", " Platform objects is in order before we interact with the API to\n", " fetch some data. We've largely discussed Twitter users and Twitter's\n", " asymmetric following model for relationships thus far, so this section\n", " briefly introduces tweets and timelines in order to round out a general\n", " understanding of the Twitter platform." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Tweets are the essence of Twitter, and while they are notionally\n", " thought of as the 140 characters of text content associated with a\n", " user's status update, there's really quite a bit more metadata there than meets the eye. In addition to the\n", " textual content of a tweet itself, tweets come bundled with two\n", " additional pieces of metadata that are of particular note:\n", " entities and places. 
Tweet\n", " entities are essentially the user mentions, hashtags, URLs, and media that may be\n", " associated with a tweet, and places are locations in the real world that\n", " may be attached to a tweet. Note that a place may be the actual location\n", " in which a tweet was authored, but it might also be a reference to the\n", " place described in a tweet." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To make it all a bit more concrete, let's consider a sample tweet\n", " with the following text:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

@ptwobrussell is writing @SocialWebMining, 2nd Ed. from his home\n", " office in Franklin, TN. Be \\#social: http://on.fb.me/16WJAf9
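
As a rough illustration of what tweet entities are, the following sketch pulls the user mentions, the hashtag, and the URL out of the sample tweet above with a few simple regular expressions. The patterns and variable names are assumptions made for this example only; they are not Twitter's own extraction rules, and the API delivers entities as ready-made metadata, so you would not normally need to parse them yourself.

```python
import re

# The sample tweet text shown above
tweet = ('@ptwobrussell is writing @SocialWebMining, 2nd Ed. from his home '
         'office in Franklin, TN. Be #social: http://on.fb.me/16WJAf9')

# Simplified, illustrative patterns (not Twitter's actual entity rules)
mentions = re.findall('@([A-Za-z0-9_]+)', tweet)
hashtags = re.findall('#([A-Za-z0-9_]+)', tweet)
urls = re.findall('https?://[^ ]+', tweet)

print(len(tweet))  # character count of the tweet text
print(mentions)
print(hashtags)
print(urls)
```

Counting the items in the three lists gives the total number of entities that the sketch found in the text.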

\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The tweet is 124 characters long and contains four tweet entities:\n", " the user mentions @ptwobrussell and @SocialWebMining, the hashtag\n", " \\#social, and the URL http://on.fb.me/16WJAf9. Although there is a place called\n", " Franklin, Tennessee that's explicitly mentioned in the tweet, the\n", " places metadata associated with the tweet might\n", " include the location in which the tweet was authored, which may or may\n", " not be Franklin, Tennessee. That's a lot of metadata that's packed into\n", " fewer than 140 characters and illustrates just how potent a short quip\n", " can be: it can unambiguously refer to multiple other Twitter users, link\n", " to web pages, and cross-reference topics with hashtags that act as\n", " points of aggregation and horizontally slice through the entire Twitterverse in an easily searchable\n", " fashion." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, timelines are the chronologically\n", " sorted collections of tweets. Abstractly, you might say that a timeline\n", " is any particular collection of tweets displayed in chronological order;\n", " however, you'll commonly see a couple of timelines that are particularly\n", " noteworthy. From the perspective of an arbitrary Twitter user, the\n", " home timeline is the view that you see when you log into your account and look\n", " at all of the tweets from users that you are following, whereas a particular user timeline is a\n", " collection of tweets only from a certain user." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For example, when you log into your Twitter account, your home timeline is located at\n", " https://twitter.com. 
The URL for any\n", " particular user timeline, however, must be suffixed with a context that\n", " identifies the user, such as https://twitter.com/SocialWebMining.\n", " If you're interested in seeing what a particular user's home timeline\n", " looks like from that user's perspective, you can access it with the\n", " additional following suffix appended to the URL.\n", " For example, what Tim O'Reilly sees on his home timeline when he logs\n", " into Twitter is accessible at https://twitter.com/timoreilly/following." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "An application like TweetDeck provides several customizable views into the\n", " tumultuous landscape of tweets, as shown in Figure 1.1, “TweetDeck provides a highly customizable user interface that\n", " can be helpful for analyzing what is happening on Twitter and\n", " demonstrates the kind of data that you have access to through the\n", " Twitter API”, and is worth trying out if you haven't journeyed\n", " far beyond the Twitter.com user interface." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
Figure 1.1. TweetDeck provides a highly customizable user interface that\n", " can be helpful for analyzing what is happening on Twitter and\n", " demonstrates the kind of data that you have access to through the\n", " Twitter API
\n", "
\n", "
\n", " \"TweetDeck\n", "
\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Whereas timelines are collections of tweets with relatively low\n", " velocity, streams are samples of public tweets\n", " flowing through Twitter in realtime. The public\n", " firehose of all tweets has been known to peak at hundreds of thousands of tweets per\n", " minute during events with particularly wide interest, such as\n", " presidential debates. Twitter's public firehose emits far too much data\n", " to consider for the scope of this book and presents interesting\n", " engineering challenges, which is at least one of the reasons that\n", " various third-party commercial vendors have partnered with Twitter to\n", " bring the firehose to the masses in a more consumable fashion. That\n", " said, a small random sample of the\n", " public timeline is available that provides filterable access to enough public data for API\n", " developers to develop powerful applications." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The remainder of this chapter and Part II of this book assume that\n", " you have a Twitter account, which is required for API access. If you\n", " don't have an account already, take a moment to create onem and then review Twitter’s liberal terms of service, API documentation, and Developer Rules of the Road. The\n", " sample code for this chapter and Part II of the book generally don't\n", " require you to have any friends or followers of your own, but some of\n", " the examples in Part II will be a lot more interesting and fun if you\n", " have an active account with a handful of friends and followers that you\n", " can use as a basis for social web mining. If you don't have an active\n", " account, now would be a good time to get plugged in and start priming\n", " your account for the data mining fun to come." 
] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Creating a Twitter API Connection" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Twitter has taken great care to craft an elegantly simple RESTful\n", " API that is intuitive and easy to use. Even so, there are great\n", " libraries available to further mitigate the work involved in making API\n", " requests. A particularly beautiful Python package that wraps the Twitter API and\n", " mimics the public API semantics almost one-to-one is twitter. Like most other Python packages, you\n", " can install it with pip by typing pip install\n", " twitter in a terminal." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
Note:

See Appendix C, Python and IPython Notebook Tips & Tricks for instructions on how to install\n", " pip.

" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
Sidebar Discussion:
\n", "
\n", "
\n", "
Python Tip: Harnessing pydoc for Effective Help During\n", " Development
\n", "
\n", "
\n", "
\n", "

We’ll work through some examples that illustrate the use of the\n", " twitter package, but just in case\n", " you're ever in a situation where you need some help (and you will be),\n", " it's worth remembering that you can always skim the documentation for\n", " a package (its pydoc) in a few different ways. Outside of a Python shell, running\n", " pydoc in your terminal on a package\n", " in your PYTHONPATH is a\n", " nice option. For example, on a Linux or Mac system, you can simply\n", " type pydoc twitter in a\n", " terminal to get the package-level documentation, whereas pydoc twitter.Twitter provides\n", " documentation on the Twitter class included with that\n", " package. On Windows systems, you can get the same information, albeit\n", " in a slightly different way, by executing pydoc as a module. Typing python -m pydoc twitter.Twitter, for\n", " example, would provide information on the twitter.Twitter class. If you find yourself\n", " reviewing the documentation for certain modules often, you can elect\n", " to pass the -w option to\n", " pydoc and write out an HTML page that you can save and\n", " bookmark in your browser.

\n", "

However, more than likely, you'll be in the middle of a working\n", " session when you need some help. The built-in help\n", " function accepts a package or class name and is useful for an ordinary\n", " Python shell, whereas IPython users can suffix a package\n", " or class name with a question mark to view inline help. For example,\n", " you could type help(twitter) or\n", " help(twitter.Twitter) in a\n", " regular Python interpreter, while you could use the shortcut\n", " twitter? or twitter.Twitter? in IPython or IPython\n", " Notebook.
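
The same documentation can also be fetched from inside a script, because the standard library's pydoc module (which also backs the built-in help function) can be called directly. The sketch below is only an illustration: it uses the standard json module as a stand-in so that it runs even where the twitter package is not installed, and swapping in twitter or twitter.Twitter is the assumed real use.

```python
import json
import pydoc

# Render the text that 'pydoc json.JSONDecoder' would show in a terminal.
# pydoc.plain() strips the overstrike characters pydoc uses for bold text.
doc_text = pydoc.plain(pydoc.render_doc(json.JSONDecoder))

# Show only the first few lines rather than the whole page
for line in doc_text.splitlines()[:5]:
    print(line)
```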

\n", "

It is highly recommended that you adopt IPython as your standard\n", " Python shell when working outside of IPython Notebook because of the\n", " various convenience functions, such as tab completion, session\n", " history, and \"magic\n", " functions,\" that it offers. Recall that Appendix A, Information About This Book's Virtual Machine Experience provides minimal details on getting oriented with\n", " recommended developer tools such as IPython.

" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
Note:

We'll opt to make programmatic API requests with Python, because\n", " the twitter package so elegantly\n", " mimics the RESTful API. If you're interested in seeing the raw\n", " requests that you could make with HTTP or exploring the API in a more\n", " interactive manner, however, check out the developer console or the\n", " command-line tool Twurl.

" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before you can make any API requests to Twitter, you'll need to\n", " create an application at https://dev.twitter.com/apps.\n", " Creating an application is the standard way for developers to gain API\n", " access and for Twitter to monitor and interact with third-party platform\n", " developers as needed. The process for creating an application is pretty\n", " standard, and all that's needed is read-only access to the API." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the present context, you are creating an app that you are going to authorize to access\n", " your account data, so this might seem a bit\n", " roundabout; why not just plug in your username and password to access\n", " the API? While that approach might work fine for\n", " you, a third party such as a friend or colleague\n", " probably wouldn't feel comfortable forking over a username/password\n", " combination in order to enjoy the same insights from\n", " your app. Giving up credentials is never a sound\n", " practice. Fortunately, some smart people recognized this problem years ago, and now there's a\n", " standardized protocol called OAuth (short for Open Authorization)\n", " that works for these kinds of situations in a generalized way for the\n", " broader social web. The protocol is a social web standard at this\n", " point." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you remember nothing else from this tangent, just remember that\n", " OAuth is a means of allowing users to authorize third-party applications\n", " to access their account data without needing to share sensitive\n", " information like a password. 
Appendix B, OAuth Primer provides a slightly\n", " broader overview of how OAuth works if you're interested, and Twitter's OAuth documentation offers\n", " specific details about its particular implementation.[1]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For simplicity of development, the key pieces of information that\n", " you'll need to take away from your newly created application's settings\n", " are its consumer key, consumer secret, access token, and access\n", " token secret. In tandem, these four credentials provide everything that\n", " an application would ultimately be getting to authorize itself through a\n", " series of redirects involving the user granting authorization, so treat\n", " them with the same sensitivity that you would a password." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
Note:

See Appendix B, OAuth Primer for details on implementing an OAuth\n", " 2.0 flow that you would need to build an application that requires an\n", " arbitrary user to authorize it to access account data.
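
Because the four credentials deserve the same care as a password, one common precaution is to keep them out of source files altogether. The following is a minimal sketch of that idea, assuming the values have been placed in environment variables; the variable names and the `load_credentials` helper are hypothetical choices for this example and are not defined by Twitter or by the twitter package.

```python
import os

# Hypothetical environment variable names chosen for this example
CREDENTIAL_VARS = ('TWITTER_CONSUMER_KEY', 'TWITTER_CONSUMER_SECRET',
                   'TWITTER_OAUTH_TOKEN', 'TWITTER_OAUTH_TOKEN_SECRET')


def load_credentials(environ=os.environ):
    # Return the four values in order, failing early if any are missing
    missing = [name for name in CREDENTIAL_VARS if not environ.get(name)]
    if missing:
        raise KeyError('Missing credentials: ' + ', '.join(missing))
    return tuple(environ[name] for name in CREDENTIAL_VARS)
```

The returned tuple holds the consumer key, consumer secret, access token, and access token secret, in that order, so it can be unpacked into four variables before an authenticated connection is created.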

" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Figure 1.2, “Create a new Twitter application to get OAuth credentials and\n", " API access at https://dev.twitter.com/apps;\n", " the four (blurred) OAuth fields are what you'll use to make API calls\n", " to Twitter's API” shows the context of\n", " retrieving these credentials." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
Figure 1.2. Create a new Twitter application to get OAuth credentials and\n", " API access at https://dev.twitter.com/apps;\n", " the four (blurred) OAuth fields are what you'll use to make API calls\n", " to Twitter's API
\n", "
\n", "
\n", " \"Create\n", "
\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Without further ado, let\u2019s create an authenticated connection to\n", " Twitter's API and find out what people are talking about by inspecting\n", " the trends available to us through the GET trends/place\n", " resource. While you're at it, go ahead and bookmark the official API documentation as well\n", " as the REST API v1.1\n", " resources, because you'll be referencing them regularly as you\n", " learn the ropes of the developer-facing side of the Twitterverse." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
Note:

As of March 2013, Twitter's API operates at version 1.1 and is\n", " significantly different in a few areas from the previous v1 API that\n", " you may have encountered. Version 1 of the API passed through a\n", " deprecation cycle of approximately six months and is no longer\n", " operational. All sample code in this book presumes version 1.1 of\n", " the API.
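
To give a concrete picture of what a version 1.1 request looks like on the wire, the sketch below assembles the raw URL for the GET trends/place resource. The `raw_request_url` helper and the `API_BASE` constant are hypothetical names introduced for this illustration and are not part of the twitter package. Building the URL is only half of a real request, since calls to the API must also carry OAuth authorization.

```python
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode  # Python 2

# Base URL for version 1.1 of the REST API
API_BASE = 'https://api.twitter.com/1.1'


def raw_request_url(resource, **params):
    # resource is a path such as 'trends/place'; keyword arguments
    # become the query string (sorted so the output is predictable)
    url = '{0}/{1}.json'.format(API_BASE, resource)
    if params:
        url += '?' + urlencode(sorted(params.items()))
    return url


print(raw_request_url('trends/place', id=1))
```

Keeping the version number in a single constant means that only one line would need to change if the sketch were pointed at a different API version.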

" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let\u2019s fire up IPython Notebook and initiate a search. Follow along\n", " with Example 1.1, “Authorizing an application to access Twitter account\n", " data” by substituting your own\n", " account credentials into the variables at the beginning of the code\n", " example and execute the call to create an instance of the Twitter API.\n", " The code works by using your OAuth credentials to create an object\n", " called auth that represents your\n", " OAuth authorization, which can then be passed to a class called Twitter that is capable of issuing queries to\n", " Twitter's API." ] }, { "cell_type": "heading", "level": 4, "metadata": {}, "source": [ "Example 1.1. Authorizing an application to access Twitter account data" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import twitter\n", "\n", "# XXX: Go to http://dev.twitter.com/apps/new to create an app and get values\n", "# for these credentials, which you'll need to provide in place of these\n", "# empty string values that are defined as placeholders.\n", "# See https://dev.twitter.com/docs/auth/oauth for more information \n", "# on Twitter's OAuth implementation.\n", "\n", "CONSUMER_KEY = ''\n", "CONSUMER_SECRET = ''\n", "OAUTH_TOKEN = ''\n", "OAUTH_TOKEN_SECRET = ''\n", "\n", "auth = twitter.oauth.OAuth(OAUTH_TOKEN, OAUTH_TOKEN_SECRET,\n", " CONSUMER_KEY, CONSUMER_SECRET)\n", "\n", "twitter_api = twitter.Twitter(auth=auth)\n", "\n", "# Nothing to see by displaying twitter_api except that it's now a\n", "# defined variable\n", "\n", "print twitter_api" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The results of this example should simply display an unambiguous\n", " representation of the twitter\\_api\n", " object that we've constructed, such as:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "<twitter.api.Twitter object at 0x39d9b50>" ] }, { "cell_type": 
"markdown", "metadata": {}, "source": [ "This indicates that we've successfully used OAuth credentials to\n", " gain authorization to query Twitter's API." ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Exploring Trending Topics" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With an authorized API connection in place, you can now issue a\n", " request. Example 1.2, “Retrieving trends” demonstrates how to\n", " ask Twitter for the topics that are currently trending worldwide, but\n", " keep in mind that the API can easily be parameterized to constrain the\n", " topics to more specific locales if you feel inclined to try out some of\n", " the possibilities. The device for constraining queries is via Yahoo!\n", " GeoPlanet’s Where On Earth (WOE) ID system, which is an API unto\n", " itself that aims to provide a way to map a unique identifier to any\n", " named place on Earth (or theoretically, even in a virtual world). If you\n", " haven't already, go ahead and try out the example that collects a set of\n", " trends for both the entire world and just the United States." ] }, { "cell_type": "heading", "level": 4, "metadata": {}, "source": [ "Example 1.2. Retrieving trends" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# The Yahoo! 
Where On Earth ID for the entire world is 1.\n", "# See https://dev.twitter.com/docs/api/1.1/get/trends/place and\n", "# http://developer.yahoo.com/geo/geoplanet/\n", "\n", "WORLD_WOE_ID = 1\n", "US_WOE_ID = 23424977\n", "\n", "# Prefix ID with the underscore for query string parameterization.\n", "# Without the underscore, the twitter package appends the ID value\n", "# to the URL itself as a special case keyword argument.\n", "\n", "world_trends = twitter_api.trends.place(_id=WORLD_WOE_ID)\n", "us_trends = twitter_api.trends.place(_id=US_WOE_ID)\n", "\n", "print world_trends\n", "print\n", "print us_trends" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You should see a semireadable response that is a list of Python\n", " dictionaries from the API (as opposed to any kind of error message),\n", " such as the following truncated results, before proceeding further. (In\n", " just a moment, we'll reformat the response to be more easily\n", " readable.)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[{u'created\\_at': u'2013-03-27T11:50:40Z', u'trends': [{u'url': u'http://twitter.com/search?q=%23MentionSomeoneImportantForYou'..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that the sample result contains a URL for a trend\n", " represented as a search query that corresponds to the hashtag\n", " \\#MentionSomeoneImportantForYou, where %23 is the URL encoding for the\n", " hashtag symbol. We'll use this rather benign hashtag throughout the\n", " remainder of the chapter as a unifying theme for examples that follow.\n", " Although a sample data file containing tweets for this hashtag is\n", " available with the book's source code, you'll have much more fun\n", " exploring a topic that's trending at the time you read this as opposed\n", " to following along with a canned topic that is no longer\n", " trending." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The pattern for using the twitter module is simple\n", " and predictable: instantiate the Twitter class with an object\n", " chain corresponding to a base URL and then invoke methods on the object\n", " that correspond to URL contexts. For example,\n", " twitter\\_api.trends.place(\\_id=WORLD\\_WOE\\_ID) initiates an HTTP\n", " call to GET\n", " https://api.twitter.com/1.1/trends/place.json?id=1.\n", " Note the URL mapping to the object chain that's constructed with the\n", " twitter package to make the request\n", " and how query string parameters are passed in as keyword arguments. To\n", " use the twitter package for arbitrary\n", " API requests, you generally construct the request in that kind of\n", " straightforward manner, with just a couple of minor caveats that we'll\n", " encounter soon enough." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Twitter imposes rate limits on how many requests an application can make to any given API\n", " resource within a given time window. Twitter's rate limits are well documented, and\n", " each individual API resource also states its particular limits for your\n", " convenience. For example, the API request that we just issued for trends\n", " limits applications to 15 requests per 15-minute window (see Figure 1.3, “Rate limits for Twitter API resources are identified in the\n", " online documentation for each API call; the particular API resource\n", " shown here allows 15 requests per \"rate limit window,\" which is\n", " currently defined as 15 minutes”). For more nuanced information on\n", " how Twitter's rate limits work, see REST API Rate\n", " Limiting in v1.1. For the purposes of following along in this\n", " chapter, it's highly unlikely that you'll get rate\n", " limited. 
“Making Robust Twitter Requests” (Example 9.16, “Making robust Twitter requests”) will introduce some\n", " techniques demonstrating best practices while working with rate\n", " limits." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
Figure 1.3. Rate limits for Twitter API resources are identified in the\n", " online documentation for each API call; the particular API resource\n", " shown here allows 15 requests per \"rate limit window,\" which is\n", " currently defined as 15 minutes
\n", "
\n", "
\n", " \"Rate\n", "
\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
Note:

The developer documentation states that the results of a Trends\n", " API query are updated only once every five minutes, so it's not a\n", " judicious use of your efforts or API requests to ask for results more\n", " often than that.

" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Although it hasn't explicitly been stated yet, the semireadable\n", " output from Example 1.2, “Retrieving trends” is printed out as\n", " native Python data structures. While an IPython interpreter will \"pretty\n", " print\" the output for you automatically, IPython Notebook and a standard\n", " Python interpreter will not. If you find yourself in these\n", " circumstances, you may find it handy to use the built-in json\n", " package to force a nicer display, as illustrated in Example 1.3, “Displaying API responses as pretty-printed JSON”." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
Note:

JSON is a data\n", " exchange format that you will encounter on a regular\n", " basis. In a nutshell, JSON provides a way to arbitrarily store maps,\n", " lists, primitives such as numbers and strings, and combinations\n", " thereof. In other words, you can theoretically model just about\n", " anything with JSON should you desire to do so.

" ] }, { "cell_type": "heading", "level": 4, "metadata": {}, "source": [ "Example 1.3. Displaying API responses as pretty-printed JSON" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import json\n", "\n", "print json.dumps(world_trends, indent=1)\n", "print\n", "print json.dumps(us_trends, indent=1)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "An abbreviated sample response from the Trends API produced with\n", " json.dumps would look like the\n", " following:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
[\n",
      " {\n",
      "  \"created\\_at\": \"2013-03-27T11:50:40Z\", \n",
      "  \"trends\": [\n",
      "   {\n",
      "    \"url\": \"http://twitter.com/search?q=%23MentionSomeoneImportantForYou\", \n",
      "    \"query\": \"%23MentionSomeoneImportantForYou\", \n",
      "    \"name\": \"\\#MentionSomeoneImportantForYou\", \n",
      "    \"promoted\\_content\": null, \n",
      "    \"events\": null\n",
      "   },\n",
      "   ...\n",
      "  ]\n",
      " }\n",
      "]
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Although it's easy enough to skim the two sets of trends and look\n", " for commonality, let's use Python's set data\n", " structure to automatically compute this for us, because that's exactly\n", " the kind of thing that sets lend themselves to doing. In this instance,\n", " a set refers to the mathematical notion of a data\n", " structure that stores an unordered collection of unique items and can be\n", " computed upon with other sets of items and setwise operations. For\n", " example, a setwise intersection computes common items between sets, a\n", " setwise union combines all of the items from sets, and the setwise\n", " difference among sets acts sort of like a subtraction operation in which\n", " items from one set are removed from another." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Example 1.4, “Computing the intersection of two sets of trends” demonstrates how to use a\n", " Python list\n", " comprehension to parse out the names of the trending topics from\n", " the results that were previously queried, cast those lists to sets, and\n", " compute the setwise intersection to reveal the common items between\n", " them. Keep in mind that there may or may not be significant overlap\n", " between any given sets of trends, all depending on what's actually\n", " happening when you query for the trends. In other words, the results of\n", " your analysis will be entirely dependent upon your query and the data\n", " that is returned from it." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
Note:

Recall that Appendix C, Python and IPython Notebook Tips & Tricks provides a reference for\n", " some common Python idioms like list comprehensions that you may find\n", " useful to review.

" ] }, { "cell_type": "heading", "level": 4, "metadata": {}, "source": [ "Example 1.4. Computing the intersection of two sets of trends" ] }, { "cell_type": "code", "collapsed": false, "input": [ "world_trends_set = set([trend['name'] \n", " for trend in world_trends[0]['trends']])\n", "\n", "us_trends_set = set([trend['name'] \n", " for trend in us_trends[0]['trends']]) \n", "\n", "common_trends = world_trends_set.intersection(us_trends_set)\n", "\n", "print common_trends" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
Note:

You should complete Example 1.4, “Computing the intersection of two sets of trends”\n", " before moving on in this chapter to ensure that you are able to access\n", " and analyze Twitter data. Can you explain what, if any, correlation\n", " exists between trends in your country and the rest of the\n", " world?

" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
Sidebar Discussion:
\n", "
\n", "
\n", "
Set Theory, Intuition, and Countable Infinity
\n", "
\n", "
\n", "
\n", "

Computing setwise operations may seem a rather primitive form of\n", " analysis, but the ramifications of set theory for general mathematics\n", " are considerably more profound since it provides the foundation for\n", " many mathematical principles.

\n", "

Georg Cantor is generally credited with formalizing the mathematics\n", " behind set theory, and his paper “On a Characteristic Property of All\n", " Real Algebraic Numbers” (1874) formalized set theory as part of his\n", " work on answering questions related to the concept of infinity. To\n", " understand how it worked, consider the following question: is the set\n", " of positive integers larger in cardinality than the set of both\n", " positive and negative integers?

\n", "

Although common intuition may be that there are twice as many\n", " positive and negative integers than positive integers alone, Cantor’s\n", " work showed that the cardinalities of the sets are actually equal!\n", " Mathematically, he showed that you can map both sets of numbers such\n", " that they form a sequence with a definite starting point that extends\n", " forever in one direction like this: {1,\n", " –1, 2, –2, 3, –3, ...}.

\n", "

Because the numbers can be clearly enumerated but there is never an ending\n", " point, the cardinalities of the sets are said to be\n", " countably infinite. In other words, there is a\n", " definite sequence that could be followed deterministically if you\n", " simply had enough time to count them.

" ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Searching for Tweets" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One of the common items between the sets of trending topics turns out\n", " to be the hashtag \\#MentionSomeoneImportantForYou, so let's use it as the\n", " basis of a search query to fetch some tweets for further analysis. Example 1.5, “Collecting search results” illustrates how to exercise the GET search/tweets resource for a particular query of\n", " interest, including the ability to use a special field that's included\n", " in the metadata for the search results to easily make additional\n", " requests for more search results. Coverage of Twitter's Streaming API resources is out of\n", " scope for this chapter but is introduced in “Sampling the Twitter Firehose with the Streaming API” (Example 9.8, “Sampling the Twitter firehose with the Streaming API”)\n", " and may be more appropriate for many situations in which you want to\n", " maintain a constantly updated view of tweets." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
Note:

The use of \\*args and \\*\\*kwargs as illustrated in Example 1.5, “Collecting search results” as parameters to a function is a\n", " Python idiom for expressing arbitrary arguments and keyword arguments,\n", " respectively. See Appendix C, Python and IPython Notebook Tips & Tricks for a brief overview of this\n", " idiom.

" ] }, { "cell_type": "heading", "level": 4, "metadata": {}, "source": [ "Example 1.5. Collecting search results" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# Import unquote to prevent url encoding errors in next_results\n", "from urllib import unquote\n", "\n", "# XXX: Set this variable to a trending topic, \n", "# or anything else for that matter. The example query below\n", "# was a trending topic when this content was being developed\n", "# and is used throughout the remainder of this chapter.\n", "\n", "q = '#MentionSomeoneImportantForYou' \n", "\n", "count = 100\n", "\n", "# See https://dev.twitter.com/docs/api/1.1/get/search/tweets\n", "\n", "search_results = twitter_api.search.tweets(q=q, count=count)\n", "\n", "statuses = search_results['statuses']\n", "\n", "\n", "# Iterate through 5 more batches of results by following the cursor\n", "\n", "for _ in range(5):\n", " print \"Length of statuses\", len(statuses)\n", " try:\n", " next_results = search_results['search_metadata']['next_results']\n", " except KeyError, e: # No more results when next_results doesn't exist\n", " break\n", " \n", " # Create a dictionary from next_results, which has the following form:\n", " # ?max_id=313519052523986943&q=NCAA&include_entities=1\n", " kwargs = dict([kv.split('=') for kv in unquote(next_results[1:]).split(\"&\") ]) \n", "\n", " \n", " search_results = twitter_api.search.tweets(**kwargs)\n", " statuses += search_results['statuses']\n", "\n", "# Show one sample search result by slicing the list...\n", "print json.dumps(statuses[0], indent=1)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
Note:

Although we're just passing in a hashtag to the Search API at\n", " this point, it's well worth noting that it contains a number of powerful operators that allow you\n", " to filter queries according to the existence or nonexistence of\n", " various keywords, originator of the tweet, location associated with\n", " the tweet, etc.

" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In essence, all the code does is repeatedly make requests to the Search API.\n", " One thing that might initially catch you off guard if you've worked with\n", " other web APIs (including version 1 of Twitter's API) is that there's no\n", " explicit concept of pagination in the Search API\n", " itself. Reviewing the API documentation reveals that this is an\n", " intentional decision, and there are some good reasons for taking a\n", " cursoring approach instead, given the highly\n", " dynamic state of Twitter resources. The best practices for cursoring\n", " vary a bit throughout the Twitter developer platform, with the Search\n", " API providing a slightly simpler way of navigating search results than\n", " other resources such as timelines." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Search results contain a special search\\_metadata node that embeds a next\\_results field with a query string that\n", " provides the basis of a subsequent query. If we weren't using a library\n", " like twitter to make the HTTP\n", " requests for us, this preconstructed query string would just be appended\n", " to the Search API URL, and we'd update it with additional parameters for\n", " handling OAuth. However, since we are not making our HTTP requests\n", " directly, we must parse the query string into its constituent key/value\n", " pairs and provide them as keyword arguments." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In Python parlance, we are unpacking the\n", " values in a dictionary into keyword arguments that the function\n", " receives. 
In other words, the function call inside of the for loop in Example 1.5, “Collecting search results” ultimately invokes a function such as\n", " twitter\\_api.search.tweets(q='\\#MentionSomeoneImportantForYou',\n", " include\\_entities=1, max\\_id=313519052523986943) even though it appears in\n", " the source code as twitter\\_api.search.tweets(\\*\\*kwargs), with kwargs being a dictionary of key/value\n", " pairs. Note that the query value is passed in its decoded form because\n", " the example applies unquote to next\\_results before splitting it." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
Note:

The search\\_metadata field\n", " also contains a refresh\\_url value\n", " that can be used if you'd like to maintain and periodically update\n", " your collection of results with new information that's become\n", " available since the previous query.

" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The sample tweet that follows is one of the search results for the query\n", " \\#MentionSomeoneImportantForYou. Take a moment to peruse (all of) it. As\n", " I mentioned earlier, there's a lot more to a tweet than meets the eye.\n", " The particular tweet that follows is fairly representative and contains\n", " in excess of 5 KB of total content when represented in uncompressed\n", " JSON. That's more than 40 times the amount of data that makes up the 140\n", " characters of text that's normally thought of as a tweet!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
[\n",
      " {\n",
      "  \"contributors\": null, \n",
      "  \"truncated\": false, \n",
      "  \"text\": \"RT @hassanmusician: \\#MentionSomeoneImportantForYou God.\", \n",
      "  \"in\\_reply\\_to\\_status\\_id\": null, \n",
      "  \"id\": 316948241264549888, \n",
      "  \"favorite\\_count\": 0, \n",
      "  \"source\": \""
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Tweets are imbued with some of the richest metadata that you'll\n",
      "      find on the social web, and Chapter 9, Twitter Cookbook\n",
      "      elaborates on some of the many possibilities."
     ]
    },
    {
     "cell_type": "heading",
     "level": 1,
     "metadata": {},
     "source": [
      "Analyzing the 140 Characters"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The online documentation is always the definitive source for Twitter\n",
      "    platform objects, and it's worthwhile to bookmark the Tweets page, because it's one that\n",
      "    you'll refer to quite frequently as you get familiarized with the basic\n",
      "    anatomy of a tweet. No attempt is made here or elsewhere in the book to\n",
      "    regurgitate online documentation, but a few notes are of interest given\n",
      "    that you might still be a bit overwhelmed by the 5 KB of information that\n",
      "    a tweet comprises. For simplicity of nomenclature, let's assume that we've\n",
      "    extracted a single tweet from the search results and stored it in a\n",
      "    variable named t. For example, t.keys() returns the top-level fields for the\n",
      "    tweet and t['id'] accesses the\n",
      "    identifier of the tweet."
     ]
    },
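    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "As a minimal sketch of the access pattern just described, the next cell\n",
      "    builds a small stand-in dictionary from a few of the fields shown in the\n",
      "    sample tweet earlier in this chapter, lists its top-level fields, and\n",
      "    reads its identifier. The name t\\_sample is hypothetical and is used only\n",
      "    so that this sketch does not overwrite t; with live data you would\n",
      "    instead work with one element of the statuses list collected in Example 1.5."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Hypothetical stand-in for a single search result. The values are copied\n",
      "# from the sample tweet shown earlier; a real tweet has many more fields.\n",
      "t_sample = {\n",
      "    'id': 316948241264549888,\n",
      "    'text': 'RT @hassanmusician: #MentionSomeoneImportantForYou God.',\n",
      "    'favorite_count': 0,\n",
      "    'truncated': False,\n",
      "}\n",
      "\n",
      "# Top-level fields of the tweet, sorted for a stable display order\n",
      "print(sorted(t_sample.keys()))\n",
      "\n",
      "# The identifier of the tweet\n",
      "print(t_sample['id'])"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },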
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "
Note:

If you're following along with the IPython Notebook for this\n", " chapter, the exact tweet that's under scrutiny is stored in a variable\n", " named t so that you can\n", " interactively access its fields and explore more easily. The current\n", " discussion assumes the same nomenclature, so values should correspond\n", " one-for-one.

" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You should tinker around with the sample tweet and consult the\n", " documentation to clarify any lingering questions you might have before\n", " moving forward. A good working knowledge of a tweet's anatomy is critical\n", " to effectively mining Twitter data." ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Extracting Tweet Entities" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, let's distill the entities and the text of the tweets into a convenient data\n", " structure for further examination. Example 1.6, “Extracting text, screen names, and hashtags from tweets”\n", " extracts the text, screen names, and hashtags from the tweets that are\n", " collected and introduces a Python idiom called a double (or\n", " nested) list comprehension. If\n", " you understand a (single) list comprehension, the code formatting should\n", " illustrate the double list comprehension as simply a collection of\n", " values that are derived from a nested loop as opposed to the results of\n", " a single loop. List comprehensions are particularly powerful because\n", " they usually yield substantial performance gains over nested lists and\n", " provide an intuitive (once you’re familiar with them) yet terse\n", " syntax." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
Note:

List comprehensions are used frequently throughout this book,\n", " and it's worth consulting Appendix C, Python and IPython Notebook Tips & Tricks or the official Python tutorial for\n", " more details if you'd like additional context.

" ] }, { "cell_type": "heading", "level": 4, "metadata": {}, "source": [ "Example 1.6. Extracting text, screen names, and hashtags from tweets" ] }, { "cell_type": "code", "collapsed": false, "input": [ "status_texts = [ status['text'] \n", " for status in statuses ]\n", "\n", "screen_names = [ user_mention['screen_name'] \n", " for status in statuses\n", " for user_mention in status['entities']['user_mentions'] ]\n", "\n", "hashtags = [ hashtag['text'] \n", " for status in statuses\n", " for hashtag in status['entities']['hashtags'] ]\n", "\n", "# Compute a collection of all words from all tweets\n", "words = [ w \n", " for t in status_texts \n", " for w in t.split() ]\n", "\n", "# Explore the first 5 items for each...\n", "\n", "print json.dumps(status_texts[0:5], indent=1)\n", "print json.dumps(screen_names[0:5], indent=1) \n", "print json.dumps(hashtags[0:5], indent=1)\n", "print json.dumps(words[0:5], indent=1)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Sample output follows; it displays five status texts, screen\n", " names, and hashtags to provide a feel for what's in the data." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
Note:

In Python, syntax in which square brackets appear after a list\n", " or string value, such as status\\_texts[0:5], is indicative of\n", " slicing, whereby you can easily extract items\n", " from lists or substrings from strings. In this particular case,\n", " [0:5] indicates that you'd like the\n", " first five items in the list status\\_texts (corresponding to items at\n", " indices 0 through 4). See Appendix C, Python and IPython Notebook Tips & Tricks for a more extended\n", " description of slicing in Python.

" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[\n", " \"\\u201c@KathleenMariee\\_: \\#MentionSomeOneImportantForYou @AhhlicksCruise..., \n", " \"\\#MentionSomeoneImportantForYou My bf @Linkin\\_Sunrise.\", \n", " \"RT @hassanmusician: \\#MentionSomeoneImportantForYou God.\", \n", " \"\\#MentionSomeoneImportantForYou @Louis\\_Tomlinson\", \n", " \"\\#MentionSomeoneImportantForYou @Delta\\_Universe\"\n", "]\n", "[\n", " \"KathleenMariee\\_\", \n", " \"AhhlicksCruise\", \n", " \"itsravennn\\_cx\", \n", " \"kandykisses\\_13\", \n", " \"BMOLOGY\"\n", "]\n", "[\n", " \"MentionSomeOneImportantForYou\", \n", " \"MentionSomeoneImportantForYou\", \n", " \"MentionSomeoneImportantForYou\", \n", " \"MentionSomeoneImportantForYou\", \n", " \"MentionSomeoneImportantForYou\"\n", "]\n", "[\n", " \"\\u201c@KathleenMariee\\_:\", \n", " \"\\#MentionSomeOneImportantForYou\", \n", " \"@AhhlicksCruise\", \n", " \",\", \n", " \"@itsravennn\\_cx\"\n", "]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As expected, \\#MentionSomeoneImportantForYou dominates the hashtag\n", " output. The output also provides a few commonly occurring screen names\n", " that are worth investigating." ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Analyzing Tweets and Tweet Entities with Frequency Analysis" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Virtually all analysis boils down to the simple exercise of counting things on\n", " some level, and much of what we'll be doing in this book is manipulating\n", " data so that it can be counted and further manipulated in meaningful\n", " ways." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "From an empirical standpoint, counting observable things is the\n", " starting point for just about everything, and thus the starting point\n", " for any kind of statistical filtering or manipulation that strives to\n", " find what may be a faint signal in noisy data. 
Whereas we just extracted\n", " the first 5 items of each unranked list to get a feel for the data,\n", " let's now take a closer look at what's in the data by computing a\n", " frequency distribution and looking at the top 10 items in each\n", " list." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As of Python 2.7, the collections module provides a Counter class that makes computing a\n", " frequency distribution rather trivial. Example 1.7, “Creating a basic frequency distribution from the words in\n", " tweets” demonstrates how to use a Counter to compute frequency distributions as\n", " ranked lists of terms. One of the more compelling reasons for mining\n", " Twitter data is to try to answer the question of what people are talking\n", " about right now. One of the simplest techniques you\n", " could apply to answer this question is basic frequency analysis, just as\n", " we are performing here." ] }, { "cell_type": "heading", "level": 4, "metadata": {}, "source": [ "Example 1.7. Creating a basic frequency distribution from the words in tweets" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from collections import Counter\n", "\n", "for item in [words, screen_names, hashtags]:\n", "    c = Counter(item)\n", "    print c.most_common(10) # top 10, ranked by frequency\n", "    print" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here are some sample results from frequency analysis of\n", " tweets:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
[(u'\\#MentionSomeoneImportantForYou', 92), (u'RT', 34), (u'my', 10), \n",
      " (u',', 6), (u'@justinbieber', 6), (u'&lt;3', 6), (u'My', 5), (u'and', 4), \n",
      " (u'I', 4), (u'te', 3)]\n",
      "\n",
      "[(u'justinbieber', 6), (u'Kid\\_Charliej', 2), (u'Cavillafuerte', 2), \n",
      " (u'touchmestyles\\_', 1), (u'aliceorr96', 1), (u'gymleeam', 1), (u'fienas', 1), \n",
      " (u'nayely\\_1D', 1), (u'angelchute', 1)]\n",
      "\n",
      "[(u'MentionSomeoneImportantForYou', 94), (u'mentionsomeoneimportantforyou', 3), \n",
      " (u'NoHomo', 1), (u'Love', 1), (u'MentionSomeOneImportantForYou', 1), \n",
      " (u'MyHeart', 1),  (u'bebesito', 1)]
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The result of the frequency distribution is a map of key/value\n", " pairs corresponding to terms and their frequencies, so let's make\n", " reviewing the results a little easier on the eyes by emitting a tabular\n", " format. You can install a package called prettytable by typing pip install prettytable in a terminal; this\n", " package provides a convenient way to emit a fixed-width tabular format\n", " that can be easily copied-and-pasted." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Example 1.8, “Using prettytable to display tuples in a nice tabular\n", " format” shows how to use it to display the\n", " same results." ] }, { "cell_type": "heading", "level": 4, "metadata": {}, "source": [ "Example 1.8. Using prettytable to display tuples in a nice tabular format" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from prettytable import PrettyTable\n", "\n", "for label, data in (('Word', words), \n", " ('Screen Name', screen_names), \n", " ('Hashtag', hashtags)):\n", " pt = PrettyTable(field_names=[label, 'Count']) \n", " c = Counter(data)\n", " [ pt.add_row(kv) for kv in c.most_common()[:10] ]\n", " pt.align[label], pt.align['Count'] = 'l', 'r' # Set column alignment\n", " print pt" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The results from Example 1.8, “Using prettytable to display tuples in a nice tabular\n", " format” are displayed as a\n", " series of nicely formatted text-based tables that are easy to skim, as\n", " the following output demonstrates." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
+--------------------------------+-------+\n",
      "| Word                           | Count |\n",
      "+--------------------------------+-------+\n",
      "| \\#MentionSomeoneImportantForYou |    92 |\n",
      "| RT                             |    34 |\n",
      "| my                             |    10 |\n",
      "| ,                              |     6 |\n",
      "| @justinbieber                  |     6 |\n",
      "| &lt;3                          |     6 |\n",
      "| My                             |     5 |\n",
      "| and                            |     4 |\n",
      "| I                              |     4 |\n",
      "| te                             |     3 |\n",
      "+--------------------------------+-------+\n",
      "+----------------+-------+\n",
      "| Screen Name    | Count |\n",
      "+----------------+-------+\n",
      "| justinbieber   |     6 |\n",
      "| Kid\\_Charliej   |     2 |\n",
      "| Cavillafuerte  |     2 |\n",
      "| touchmestyles\\_ |     1 |\n",
      "| aliceorr96     |     1 |\n",
      "| gymleeam       |     1 |\n",
      "| fienas         |     1 |\n",
      "| nayely\\_1D      |     1 |\n",
      "| angelchute     |     1 |\n",
      "+----------------+-------+\n",
      "+-------------------------------+-------+\n",
      "| Hashtag                       | Count |\n",
      "+-------------------------------+-------+\n",
      "| MentionSomeoneImportantForYou |    94 |\n",
      "| mentionsomeoneimportantforyou |     3 |\n",
      "| NoHomo                        |     1 |\n",
      "| Love                          |     1 |\n",
      "| MentionSomeOneImportantForYou |     1 |\n",
      "| MyHeart                       |     1 |\n",
      "| bebesito                      |     1 |\n",
      "+-------------------------------+-------+
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A quick skim of the results reveals at least one marginally\n", " surprising thing: Justin Bieber is high on the list of entities for this\n", " small sample of data, and given his popularity with tweens on Twitter he\n", " may very well have been the \"most important someone\" for this trending\n", " topic, though the results here are inconclusive. The appearance of\n", " &lt;3 is also interesting because\n", " it is an escaped form of <3, which\n", " represents a heart shape (that's rotated 90 degrees, like other\n", " emoticons and smileys) and is a common abbreviation for \"loves.\" Given\n", " the nature of the query, it's not surprising to see a value like\n", " &lt;3, although it may initially\n", " seem like junk or noise." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Although the entities with a frequency greater than two are\n", " interesting, the broader results are also revealing in other ways. For\n", " example, \"RT\" was a very common token, implying that there were a\n", " significant number of retweets (we'll investigate this observation\n", " further in “Examining Patterns in Retweets”). Finally, as\n", " might be expected, the \\#MentionSomeoneImportantForYou hashtag and a\n", " couple of case-sensitive variations dominated the hashtags; a\n", " data-processing takeaway is that it would be worthwhile to normalize\n", " each word, screen name, and hashtag to lowercase when tabulating\n", " frequencies since there will inevitably be variation in tweets." ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Computing the Lexical Diversity of Tweets" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A slightly more advanced measurement that involves calculating simple\n", " frequencies and can be applied to unstructured text is a metric called\n", " lexical diversity. 
Mathematically, this is an\n", " expression of the number of unique tokens in the\n", " text divided by the total number of tokens in the\n", " text, which are both elementary yet important metrics in and of\n", " themselves. Lexical diversity is an interesting concept in the area of\n", " interpersonal communications because it provides a quantitative measure\n", " for the diversity of an individual's or group's vocabulary. For example,\n", " suppose you are listening to someone who repeatedly says \"and stuff\" to\n", " broadly generalize information as opposed to providing specific examples\n", " to reinforce points with more detail or clarity. Now, contrast that\n", " speaker to someone else who seldom uses the word \"stuff\" to generalize\n", " and instead reinforces points with concrete examples. The speaker who\n", " repeatedly says \"and stuff\" would have a lower lexical diversity than\n", " the speaker who uses a more diverse vocabulary, and chances are\n", " reasonably good that you'd walk away from the conversation feeling as\n", " though the speaker with the higher lexical diversity understands the\n", " subject matter better." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As applied to tweets or similar online communications, lexical\n", " diversity can be worth considering as a primitive statistic for\n", " answering a number of questions, such as how broad or narrow the subject\n", " matter is that an individual or group discusses. Although an overall\n", " assessment could be interesting, breaking down the analysis to specific\n", " time periods could yield additional insight, as could comparing\n", " different groups or individuals. 
For example, it would be interesting to\n", " measure whether or not there is a significant difference between the\n", " lexical diversity of tweets from two soft drink companies such as Coca-Cola and Pepsi as an entry point for\n", " exploration if you were comparing the effectiveness of their social\n", " media marketing campaigns on Twitter." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With a basic understanding of how to use a statistic like lexical\n", " diversity to analyze textual content such as tweets, let's now compute\n", " the lexical diversity for statuses, screen names, and hashtags for our\n", " working data set, as shown in Example 1.9, “Calculating lexical diversity for tweets”." ] }, { "cell_type": "heading", "level": 4, "metadata": {}, "source": [ "Example 1.9. Calculating lexical diversity for tweets" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# A function for computing lexical diversity:\n", "# the number of unique tokens divided by the total number of tokens\n", "def lexical_diversity(tokens):\n", "    return 1.0 * len(set(tokens)) / len(tokens)\n", "\n", "# A function for computing the average number of words per tweet\n", "def average_words(statuses):\n", "    total_words = sum(len(s.split()) for s in statuses)\n", "    return 1.0 * total_words / len(statuses)\n", "\n", "# words, screen_names, hashtags, and status_texts were computed earlier\n", "print(lexical_diversity(words))\n", "print(lexical_diversity(screen_names))\n", "print(lexical_diversity(hashtags))\n", "print(average_words(status_texts))" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The results of Example 1.9, “Calculating lexical diversity for tweets”\n", " follow:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
0.67610619469\n",
      "0.955414012739\n",
      "0.0686274509804\n",
      "5.76530612245
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are a few observations worth considering in the\n", " results:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What would be interesting at this point would be to zoom in on\n", " some of the data and see if there were any common responses or other\n", " insights that could come from a more qualitative analysis. Given an\n", " average number of words per tweet as low as 6, it's unlikely that users\n", " applied any abbreviations to stay within the 140 characters, so the\n", " amount of noise for the data should be remarkably low, and additional\n", " frequency analysis may reveal some fascinating things." ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Examining Patterns in Retweets" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Even though the user interface and many Twitter clients have long since adopted\n", " the native Retweet API used to populate status values such as retweet\\_count and retweeted\\_status, some Twitter users may prefer to quote a tweet, which entails a workflow involving copying and pasting the text\n", " and prepending \"RT @username\" or suffixing \"/via\n", " @username\" to provide attribution." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
Note:

When mining Twitter data, you'll probably want to both account\n", " for the tweet metadata and use heuristics to analyze the 140\n", " characters for conventions such as \"RT @username\"\n", " or \"/via @username\" when considering retweets, in\n", " order to maximize the efficacy of your analysis. See “Finding Users Who Have Retweeted a Status” for a more\n", " detailed discussion on retweeting with Twitter's native Retweet API\n", " versus \"quoting\" tweets and using conventions to apply\n", " attribution.
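
A rough way to apply such heuristics is sketched below. The pattern, the helper names `quoted_sources` and `looks_like_retweet`, and the sample statuses are all assumptions made up for illustration; real tweets vary more than this, so treat it as a starting point and not as a complete detector.

```python
import re

# Heuristic pattern (an assumption for illustration, not an official rule):
# the token RT or via, optional spaces, then @ and a screen name made of
# letters, digits, or underscores. The lookbehind avoids matching inside
# longer words such as ART.
ATTRIBUTION = re.compile('(?<![A-Za-z0-9_])(?:RT|via) *@([A-Za-z0-9_]+)',
                         re.IGNORECASE)

def quoted_sources(text):
    # Return every screen name credited through an RT or via convention
    return ATTRIBUTION.findall(text)

def looks_like_retweet(status):
    # A status counts as a retweet if the API metadata says so
    # or if its text follows one of the quoting conventions
    return 'retweeted_status' in status or bool(quoted_sources(status['text']))

# Made-up statuses for illustration
native = {'text': 'RT @alice: hello', 'retweeted_status': {}}
quoted = {'text': 'Great read /via @bob'}
plain = {'text': 'ART @carol is my favorite class'}

print(quoted_sources(native['text']))
print(looks_like_retweet(quoted))
print(looks_like_retweet(plain))
```

The check consults the metadata first and falls back to the text, so natively retweeted statuses and hand-quoted ones are both counted.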

" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A good exercise at this point would be to further analyze the data\n", " to determine if there was a particular tweet that was highly retweeted\n", " or if there were just lots of \"one-off\" retweets. The approach we'll\n", " take to find the most popular retweets is to simply iterate over each\n", " status update and store out the retweet count, originator of the\n", " retweet, and text of the retweet if the status update is a retweet.\n", " Example 1.10, “Finding the most popular retweets” demonstrates how to capture\n", " these values with a list comprehension and sort by the retweet count to\n", " display the top few results." ] }, { "cell_type": "heading", "level": 4, "metadata": {}, "source": [ "Example 1.10. Finding the most popular retweets" ] }, { "cell_type": "code", "collapsed": false, "input": [ "retweets = [\n", " # Store out a tuple of these three values ...\n", " (status['retweet_count'], \n", " status['retweeted_status']['user']['screen_name'],\n", " status['text']) \n", " \n", " # ... for each status ...\n", " for status in statuses \n", " \n", " # ... so long as the status meets this condition.\n", " if status.has_key('retweeted_status')\n", " ]\n", "\n", "# Slice off the first 5 from the sorted results and display each item in the tuple\n", "\n", "pt = PrettyTable(field_names=['Count', 'Screen Name', 'Text'])\n", "[ pt.add_row(row) for row in sorted(retweets, reverse=True)[:5] ]\n", "pt.max_width['Text'] = 50\n", "pt.align= 'l'\n", "print pt" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Results from Example 1.10, “Finding the most popular retweets” are\n", " interesting:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
+-------+----------------+----------------------------------------------------+\n",
      "| Count | Screen Name    | Text                                               |\n",
      "+-------+----------------+----------------------------------------------------+\n",
      "| 23    | hassanmusician | RT @hassanmusician: \\#MentionSomeoneImportantForYou |\n",
      "|       |                | God.                                               |\n",
      "| 21    | HSweethearts   | RT @HSweethearts: \\#MentionSomeoneImportantForYou   |\n",
      "|       |                | my high school sweetheart \u2764                        |\n",
      "| 15    | LosAlejandro\\_  | RT @LosAlejandro\\_: \u00bfNadie te menciono en           |\n",
      "|       |                | \"\\#MentionSomeoneImportantForYou\"? JAJAJAJAJAJAJAJA |\n",
      "|       |                | JAJAJAJAJAJAJAJAJAJAJAJAJAJAJAJAJAJAJAJA Ven, ...  |\n",
      "| 9     | SCOTTSUMME     | RT @SCOTTSUMME: \\#MentionSomeoneImportantForYou My  |\n",
      "|       |                | Mum. Shes loving, caring, strong, all in one. I    |\n",
      "|       |                | love her so much \u2764\u2764\u2764\u2764                            |\n",
      "| 7     | degrassihaha   | RT @degrassihaha: \\#MentionSomeoneImportantForYou I |\n",
      "|       |                | can't put every Degrassi cast member, crew member, |\n",
      "|       |                | and writer in just one tweet....                   |\n",
      "+-------+----------------+----------------------------------------------------+
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\"God\" tops the list, followed closely by \"my high school\n", " sweetheart,\" and coming in at number four on the list is \"My Mum.\" None\n", " of the top five items in the list correspond to Twitter user accounts,\n", " although we might have suspected this (with the exception of\n", " @justinbieber) from the previous analysis. Inspection of results further\n", " down the list does reveal particular user mentions, but the sample we\n", " have drawn from for this query is so small that no trends emerge.\n", " Searching for a larger sample of results would likely yield some user\n", " mentions with a frequency greater than one, which would be interesting\n", " to further analyze. The possibilities for further analysis are pretty\n", " open-ended, and by now, hopefully, you're itching to try out some custom\n", " queries of your own." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
Note:

Suggested exercises are at the end of this chapter. Be sure to\n", " also check out Chapter 9, Twitter Cookbook as a source of\n", " inspiration: it includes more than two dozen recipes presented in a\n", " cookbook-style format.
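
One portability detail about Example 1.10: `dict.has_key` exists only in Python 2, whereas the `in` operator works on both Python 2 and Python 3. The sketch below repeats the same filtering and sorting on made-up statuses (the `sample_statuses` data and the variable names are invented for illustration), and it sorts on the retweet count alone instead of comparing whole tuples.

```python
# Made-up statuses shaped like the fields used in Example 1.10
sample_statuses = [
    {'text': 'RT @a: hi', 'retweet_count': 3,
     'retweeted_status': {'user': {'screen_name': 'a'}}},
    {'text': 'just a plain tweet', 'retweet_count': 0},
    {'text': 'RT @b: yo', 'retweet_count': 7,
     'retweeted_status': {'user': {'screen_name': 'b'}}},
]

sample_retweets = [
    (s['retweet_count'], s['retweeted_status']['user']['screen_name'], s['text'])
    for s in sample_statuses
    # The in operator replaces has_key and works on Python 2 and 3
    if 'retweeted_status' in s
]

# Sort by retweet count only, highest first, and keep the top 5
top = sorted(sample_retweets, key=lambda row: row[0], reverse=True)[:5]

for count, screen_name, text in top:
    print('{0} {1} {2}'.format(count, screen_name, text))
```

Sorting with an explicit key keeps ties in their original order, which can be easier to reason about than the tuple comparison used when no key is given.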

" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before we move on, a subtlety worth noting is that it's quite\n", " possible (and probable, given the relatively low frequencies of the\n", " retweets observed in this section) that the original tweets that were\n", " retweeted may not exist in our sample search results set. For example,\n", " the most popular retweet in the sample results originated from a user\n", " with a screen name of @hassanmusician and was retweeted 23 times.\n", " However, closer inspection of the data reveals that we collected only 1\n", " of the 23 retweets in our search results. Neither the original tweet nor\n", " any of the other 22 retweets appears in the data set. This doesn't pose\n", " any particular problems, although it might beg the question of who the\n", " other 22 retweeters for this status were." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The answer to this kind of question is a valuable one because it\n", " allows us to take content that represents a concept, such as \"God\" in\n", " this case, and discover a group of other users who apparently share the\n", " same sentiment or common interest. As previously mentioned, a handy way\n", " to model data involving people and the things that they're interested in\n", " is called an interest graph; this is the\n", " primary data structure that supports analysis in Chapter 7, Mining GitHub: Inspecting Software Collaboration Habits, Building\n", " Interest Graphs, and More.\n", " Interpretative speculation about these users could suggest that they are\n", " spiritual or religious individuals, and further analysis of their\n", " particular tweets might corroborate that inference. Example 1.11, “Looking up users who have retweeted a status” shows how to find these individuals\n", " with the GET statuses/retweets/:id API." ] }, { "cell_type": "heading", "level": 4, "metadata": {}, "source": [ "Example 1.11. 
Looking up users who have retweeted a status" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# Get the original tweet id for a tweet from its retweeted_status node \n", "# and insert it here in place of the sample value that is provided\n", "# from the text of the book\n", "\n", "_retweets = twitter_api.statuses.retweets(id=317127304981667841)\n", "print [r['user']['screen_name'] for r in _retweets]" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Further analysis of the users who retweeted this particular status\n", " for any particular religious or spiritual affiliation is left as an\n", " independent exercise." ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Visualizing Frequency Data with Histograms" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A nice feature of IPython Notebook is its ability to generate and insert\n", " high-quality and customizable plots of data as part of an interactive\n", " workflow. In particular, the matplotlib package and\n", " other scientific computing tools that are available for IPython Notebook\n", " are quite powerful and capable of generating complex figures with very\n", " little effort once you understand the basic workflows." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To illustrate the use of matplotlib's plotting capabilities, let's plot\n", " some data for display. To get warmed up, we’ll consider a plot that\n", " displays the results from the words\n", " variable as defined in Example 1.9, “Calculating lexical diversity for tweets”. With\n", " the help of a Counter, it's easy to\n", " generate a sorted list of tuples where each tuple is a (word, frequency) pair; the x-axis value will\n", " correspond to the index of the tuple, and the y-axis will correspond to\n", " the frequency for the word in that tuple. 
It would generally be\n", " impractical to try to plot each word as a value on the x-axis, although\n", " that's what the x-axis is representing. Figure 1.4, “A plot displaying the sorted frequencies for the words computed\n", " by Example 1.8, “Using prettytable to display tuples in a nice tabular\n", " format””\n", " displays a plot for the same words data that we previously rendered as a\n", " table in Example 1.8, “Using prettytable to display tuples in a nice tabular\n", " format”. The y-axis values on the plot\n", " correspond to the number of times a word appeared. Although labels for\n", " each word are not provided, x-axis values have been sorted so that the\n", " relationship between word frequencies is more apparent. Each axis has\n", " been adjusted to a logarithmic scale to \"squash\" the curve being\n", " displayed. The plot can be generated directly in IPython Notebook\n", " with the code shown in Example 1.12, “Plotting frequencies of words”." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
Figure 1.4. A plot displaying the sorted frequencies for the words computed\n", " by Example 1.8, “Using prettytable to display tuples in a nice tabular\n", " format”
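
The data behind a plot like this one can be sketched with the standard library alone. The example below uses a made-up token list (`sample_words` is invented for illustration) and pairs each frequency with its rank; ranks start at 1 here because a logarithmic axis has no position for zero.

```python
from collections import Counter

# Made-up tokens standing in for the words list
sample_words = ['rt', 'love', 'rt', 'mum', 'rt', 'love']

# (word, frequency) pairs sorted from most to least frequent
pairs = Counter(sample_words).most_common()

# x is the rank (position in the sorted pairs), y is the frequency
ranks = list(range(1, len(pairs) + 1))
freqs = [freq for _, freq in pairs]

print(pairs)
print(list(zip(ranks, freqs)))
```

Passing `ranks` and `freqs` to a plotting call would then draw one point per distinct word, without needing a label for each word on the x-axis.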
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
Note:

If you are using the virtual machine, your IPython Notebooks\n", " should be configured to use plotting capabilities out of the box. If\n", " you are running in your own local environment, be sure to have started\n", " IPython Notebook with PyLab\n", " enabled as follows:

\n", "
ipython notebook --pylab=inline
" ] }, { "cell_type": "heading", "level": 4, "metadata": {}, "source": [ "Example 1.12. Plotting frequencies of words" ] }, { "cell_type": "code", "collapsed": false, "input": [ "word_counts = sorted(Counter(words).values(), reverse=True)\n", "\n", "plt.loglog(word_counts)\n", "plt.ylabel(\"Freq\")\n", "plt.xlabel(\"Word Rank\")" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A plot of frequency values is intuitive and convenient, but it can also be\n", " useful to group together data values into bins that correspond to a\n", " range of frequencies. For example, how many words have a frequency\n", " between 1 and 5, between 5 and 10, between 10 and 15, and so forth? A\n", " histogram is\n", " designed for precisely this purpose and provides a convenient\n", " visualization for displaying tabulated frequencies as adjacent\n", " rectangles, where the area of each rectangle is a measure of the data\n", " values that fall within that particular range of values. Figures 1.5 and 1.6 show histograms of\n", " the tabular data generated from Examples 1.8 and 1.10,\n", " respectively. Although the histograms don't have x-axis labels that show\n", " us which words have which frequencies, that's not really their purpose.\n", " A histogram gives us insight into the underlying frequency distribution,\n", " with the x-axis corresponding to a range for words that each have a\n", " frequency within that range and the y-axis corresponding to the total\n", " frequency of all words that appear within that range." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When interpreting Figure 1.5, “Histograms of tabulated frequency data for words, screen names,\n", " and hashtags, each displaying a particular kind of data that is\n", " grouped by frequency”, look back to the\n", " corresponding tabular data and consider that there are a large number of\n", " words, screen names, or hashtags that have low frequencies and appear only a few times in the text;\n", " however, when we combine all of these low-frequency terms and bin them\n", " together into a range of \"all words with frequency between 1 and 10,\" we\n", " see that the total number of these low-frequency words accounts for most\n", " of the text. More concretely, we see that there are approximately 10\n", " words that account for almost all of the frequencies as rendered by the\n", " area of the large blue rectangle, while there are just a couple of words\n", " with much higher frequencies: \"\\#MentionSomeoneImportantForYou\" and \"RT,\"\n", " with respective frequencies of 34 and 92 as given by our tabulated\n", " data." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Likewise, when interpreting Figure 1.6, “A histogram of retweet frequencies”, we see that\n", " there are a select few tweets that are retweeted with much higher\n", " frequencies than the bulk of the tweets, which are retweeted only once\n", " and account for the majority of the volume given by the largest blue\n", " rectangle on the left side of the histogram." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
Figure 1.5. Histograms of tabulated frequency data for words, screen names,\n", " and hashtags, each displaying a particular kind of data that is\n", " grouped by frequency
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
Figure 1.6. A histogram of retweet frequencies
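
To make the binning idea behind these histograms concrete, here is a plain-Python sketch of equal-width binning. It is not how matplotlib is implemented internally; the `bin_counts` helper and the sample tokens are hypothetical names made up for illustration, and the only assumption is that bins have equal width with the maximum value placed in the last bin.

```python
from collections import Counter

def bin_counts(values, bins=10):
    # Split the range [min, max] into equal-width bins and count how many
    # values land in each; the maximum value goes into the last bin
    lo, hi = min(values), max(values)
    if lo == hi:
        return [len(values)]
    width = float(hi - lo) / bins
    counts = [0] * bins
    for v in values:
        index = min(int((v - lo) / width), bins - 1)
        counts[index] += 1
    return counts

# Made-up tokens: most appear once, one appears often
sample_words = ['rt'] * 9 + ['love'] * 2 + ['mum', 'dad', 'god', 'bieber', 'friend']
frequencies = list(Counter(sample_words).values())

histogram = bin_counts(frequencies, bins=4)
print(histogram)
```

Each entry of `histogram` is the number of distinct words whose frequency falls in that bin, so the tall first bar reflects the many words that appear only once or twice.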
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The code for generating these histograms directly in IPython\n", " Notebook is given in Examples 1.13 and 1.14. Taking some time to explore the\n", " capabilities of matplotlib and other scientific computing tools is a\n", " worthwhile investment." ] }, { "cell_type": "heading", "level": 4, "metadata": {}, "source": [ "Example 1.13. Generating histograms of words, screen names, and hashtags" ] }, { "cell_type": "code", "collapsed": false, "input": [ "for label, data in (('Words', words), \n", " ('Screen Names', screen_names), \n", " ('Hashtags', hashtags)):\n", "\n", " # Build a frequency map for each set of data\n", " # and plot the values\n", " c = Counter(data)\n", " plt.hist(c.values())\n", " \n", " # Add a title and y-label ...\n", " plt.title(label)\n", " plt.ylabel(\"Number of items in bin\")\n", " plt.xlabel(\"Bins (number of times an item appeared)\")\n", " \n", " # ... and display as a new figure\n", " plt.figure()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 4, "metadata": {}, "source": [ "Example 1.14. 
Generating a histogram of retweet counts" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# Using underscores while unpacking values in\n", "# a tuple is idiomatic for discarding them\n", "\n", "counts = [count for count, _, _ in retweets]\n", "\n", "plt.hist(counts)\n", "plt.title(\"Retweets\")\n", "plt.xlabel('Bins (number of times retweeted)')\n", "plt.ylabel('Number of tweets in bin')\n", "\n", "print counts" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Closing Remarks" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This chapter introduced Twitter as a successful technology platform that\n", " has grown virally and become \"all the rage,\" given its ability to satisfy\n", " some fundamental human desires relating to communication, curiosity, and\n", " the self-organizing behavior that has emerged from its chaotic network\n", " dynamics. The example code in this chapter got you up and running with\n", " Twitter's API, illustrated how easy (and fun) it is to use Python to\n", " interactively explore and analyze Twitter data, and provided some starting\n", " templates that you can use for mining tweets. We started out the chapter\n", " by learning how to create an authenticated connection and then progressed\n", " through a series of examples that illustrated how to discover trending\n", " topics for particular locales, how to search for tweets that might be\n", " interesting, and how to analyze those tweets using some elementary but\n", " effective techniques based on frequency analysis and simple statistics.\n", " Even what seemed like a somewhat arbitrary trending topic turned out to\n", " lead us down worthwhile paths with lots of possibilities for additional\n", " analysis." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
Note:

Chapter 9, Twitter Cookbook contains a number of Twitter\n", " recipes covering a broad array of topics that range from tweet\n", " harvesting and analysis to the effective use of storage for archiving\n", " tweets to techniques for analyzing followers for insights.

" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One of the primary takeaways from this chapter from an analytical\n", " standpoint is that counting is generally the first step to any kind of\n", " meaningful quantitative analysis. Although basic frequency analysis is\n", " simple, it is a powerful tool for your repertoire that shouldn\u2019t be\n", " overlooked just because it\u2019s so obvious; besides, many other advanced\n", " statistics depend on it. On the contrary, frequency analysis and measures\n", " such as lexical diversity should be employed early and often, for\n", " precisely the reason that doing so is so obvious and simple. Oftentimes,\n", " but not always, the results from the simplest techniques can rival the\n", " quality of those from more sophisticated analytics. With respect to data\n", " in the Twitterverse, these modest techniques can usually get you quite a\n", " long way toward answering the question, \u201cWhat are people talking about\n", " right now?\u201d Now that's something we'd all like to know, isn't it?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
Note:

The source code outlined for this chapter and all other chapters\n", " is available at GitHub in a\n", " convenient IPython Notebook format that you're highly encouraged to try\n", " out from the comfort of your own web browser.
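
As a small recap of the counting techniques summarized in these closing remarks, the sketch below re-declares the two helpers from Example 1.9 and applies them to a tiny made-up sample (`sample_texts` is invented for illustration and is not taken from the search results). It also shows how lowercasing tokens first changes the ratio, which is the normalization suggested earlier for hashtags that differ only in case.

```python
# Made-up sample data; not taken from the chapter's search results
sample_texts = [
    'RT @alice: #MentionSomeoneImportantForYou my mum',
    '#mentionsomeoneimportantforyou My Mum',
]

def lexical_diversity(tokens):
    # Unique tokens divided by total tokens
    return 1.0 * len(set(tokens)) / len(tokens)

def average_words(texts):
    # Total whitespace-delimited words divided by the number of texts
    total_words = sum(len(t.split()) for t in texts)
    return 1.0 * total_words / len(texts)

tokens = [w for t in sample_texts for w in t.split()]
lowered = [w.lower() for w in tokens]

raw_diversity = lexical_diversity(tokens)
normalized_diversity = lexical_diversity(lowered)
avg_words = average_words(sample_texts)

print(raw_diversity)
print(normalized_diversity)
print(avg_words)
```

Without normalization every token in this sample looks unique; after lowercasing, the repeated hashtag and words are recognized as duplicates and the ratio drops.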

" ] }, { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Recommended Exercises" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Online Resources" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following list of links from this chapter may be useful for\n", " review:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

[1] Although it's an implementation detail, it may be worth noting\n", " that Twitter's v1.1 API still implements OAuth 1.0a, whereas many\n", " other social web properties have since upgraded to OAuth 2.0.

" ] } ], "metadata": {} } ] }