{ "metadata": { "name": "", "signature": "sha256:6e1c1ace4cc89d263cdbb868f9870b574bddd02e185cd4766b88b66df8595a48" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Collecting Twitter data from the API using Python\n", "\n", "Alex Hanna, University of Wisconsin-Madison
\n", "[alex-hanna.com](http://alex-hanna.com)
\n", "[@alexhanna](http://twitter.com/alexhanna)\n", "\n", "Today we're going to focus on collecting Twitter data using their API using Python. You can use another language to collect data but Python makes it rather straightforward without having to know many other details about the API.\n", "\n", "## Using APIs\n", "\n", "The way that researchers and other people who want to get large publically available Twitter datasets is through their API. API stands for Application Programming Interface and many services that want to start a developer community around their product usually releases one. Facebook has an API that is somewhat restrictive, while Klout has an API to let you automatically look up Klout scores and all their different facets.\n", "\n", "For instance, here are a list of different APIs of different services:\n", "\n", "Facebook - [https://developers.facebook.com/docs/reference/api/](https://developers.facebook.com/docs/reference/api/)\n", "\n", "YouTube - [https://developers.google.com/youtube/getting_started](https://developers.google.com/youtube/getting_started)\n", "\n", "Klout - [http://klout.com/s/developers/docs](http://klout.com/s/developers/docs)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Accessing the Twitter API\n", "\n", "The Twitter API has two different flavors: RESTful and Streaming. The RESTful API is useful for getting things like lists of followers and those who follow a particular user, and is what most Twitter clients are built off of. We are not going to deal with the RESTful API right now, but you can find more information on it here: [https://dev.twitter.com/docs/api]( https://dev.twitter.com/docs/api). Right now we are going to focus on the Streaming API (more info here: [https://dev.twitter.com/docs/streaming-api](https://dev.twitter.com/docs/streaming-api)). The Streaming API works by making a request for a specific type of data \u2014 filtered by keyword, user, geographic area, or a random sample \u2014 and then keeping the connection open as long as there are no errors in the connection.\n", "\n", "## Understanding Twitter Data\n", "\n", "Once you\u2019ve connected to the Twitter API, whether via the RESTful API or the Streaming API, you\u2019re going to start getting a bunch of data back. The data you get back will be encoded in JSON, or JavaScript Object Notation. JSON is a way to encode complicated information in a platform-independent way. It could be considered the lingua franca of information exchange on the Internet. When you click a snazzy Web 2.0 button on Facebook or Amazon and the page produces a lightbox (a box that hovers above a page without leaving the page you\u2019re on now), there was probably some JSON involved.\n", "\n", "JSON is a rather simplistic and elegant way to encode complex data structures. When a tweet comes back from the API, this is what it looks like (with a little bit of beautifying):" ] }, { "cell_type": "code", "collapsed": false, "input": [ "\"\"\"\n", "{\n", " \"contributors\": null, \n", " \"truncated\": false, \n", " \"text\": \"TeeMinus24's Shirt of the Day is Palpatine/Vader '12. Support the Sith. Change you can't stop. 
http://t.co/wFh1cCep\", \n", " \"in_reply_to_status_id\": null, \n", " \"id\": 175090352598945794, \n", " \"entities\": {\n", " \"user_mentions\": [], \n", " \"hashtags\": [], \n", " \"urls\": [\n", " {\n", " \"indices\": [\n", " 95, \n", " 115\n", " ], \n", " \"url\": \"http://t.co/wFh1cCep\", \n", " \"expanded_url\": \"http://fb.me/1isEdQJSq\", \n", " \"display_url\": \"fb.me/1isEdQJSq\"\n", " }\n", " ]\n", " }, \n", " \"retweeted\": false, \n", " \"coordinates\": null, \n", " \"source\": \"Facebook\", \n", " \"in_reply_to_screen_name\": null, \n", " \"id_str\": \"175090352598945794\", \n", " \"retweet_count\": 0, \n", " \"in_reply_to_user_id\": null, \n", " \"favorited\": false, \n", " \"user\": {\n", " \"follow_request_sent\": null, \n", " \"profile_use_background_image\": true, \n", " \"default_profile_image\": false, \n", " \"profile_background_image_url_https\": \"https://si0.twimg.com/images/themes/theme14/bg.gif\", \n", " \"verified\": false, \n", " \"profile_image_url_https\": \"https://si0.twimg.com/profile_images/1428484273/TeeMinus24_logo_normal.jpg\", \n", " \"profile_sidebar_fill_color\": \"efefef\", \n", " \"is_translator\": false, \n", " \"id\": 281077639, \n", " \"profile_text_color\": \"333333\", \n", " \"followers_count\": 43, \n", " \"protected\": false, \n", " \"location\": \"\", \n", " \"profile_background_color\": \"131516\", \n", " \"id_str\": \"281077639\", \n", " \"utc_offset\": -18000, \n", " \"statuses_count\": 461, \n", " \"description\": \"We are a limited edition t-shirt company. We make tees that are designed for the fan; movies, television shows, video games, sci-fi, web, and tech. We have it!\", \n", " \"friends_count\": 52, \n", " \"profile_link_color\": \"009999\", \n", " \"profile_image_url\": \"http://a0.twimg.com/profile_images/1428484273/TeeMinus24_logo_normal.jpg\", \n", " \"notifications\": null, \n", " \"show_all_inline_media\": false, \n", " \"geo_enabled\": false, \n", " \"profile_background_image_url\": \"http://a0.twimg.com/images/themes/theme14/bg.gif\", \n", " \"screen_name\": \"TeeMinus24\", \n", " \"lang\": \"en\", \n", " \"profile_background_tile\": true, \n", " \"favourites_count\": 0, \n", " \"name\": \"Vincent Genovese\", \n", " \"url\": \"http://www.teeminus24.com\", \n", " \"created_at\": \"Tue Apr 12 15:48:23 +0000 2011\", \n", " \"contributors_enabled\": false, \n", " \"time_zone\": \"Eastern Time (US & Canada)\", \n", " \"profile_sidebar_border_color\": \"eeeeee\", \n", " \"default_profile\": false, \n", " \"following\": null, \n", " \"listed_count\": 1\n", " }, \n", " \"geo\": null, \n", " \"in_reply_to_user_id_str\": null, \n", " \"possibly_sensitive\": false, \n", " \"created_at\": \"Thu Mar 01 05:29:27 +0000 2012\", \n", " \"possibly_sensitive_editable\": true, \n", " \"in_reply_to_status_id_str\": null, \n", " \"place\": null\n", "}\"\"\"" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let\u2019s move our focus now to the actual elements of the tweet. Most of the keys, that is, the words on the left of the colon, are self-explanatory. The most important ones are \u201ctext\u201d, \u201centities\u201d, and \u201cuser\u201d. \u201cText\u201d is the text of the tweet, \u201centities\u201d are the user mentions, hashtags, and links used in the tweet, separated out for easy access. 
\u201cUser\u201d contains a lot of information on the user, from the URL of their profile image to the date they joined Twitter.\n", "\n", "Now that you see what data you get with a tweet, you can envision interesting types of analysis that can emerge by analyzing a whole lot of them." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## A Disclaimer on Collecting Tweets\n", "\n", "Unfortunately, you do not have carte blanche to share the tweets you collect. Twitter restricts publicly releasing datasets according to their API Terms of Service (https://dev.twitter.com/terms/api-terms). This is unfortunate for collaboration when colleagues have collected unique datasets. However, you can share derivative analyses of tweets, such as content analysis and aggregate statistics." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Logging into the remote server\n", "\n", "The terminal (also called the command line) is how you access a UNIX-based machine remotely.\n", "\n", "For now, we\u2019re not going to access a computing cluster (given that one hasn\u2019t been set up yet, it\u2019s not possible). Instead we\u2019re going to connect to an old computer I have running out of my living room, so please be sympathetic to my bandwidth restrictions. Here is how you connect to it.\n", "\n", "For Windows:\n", "\n", "1. Download PuTTY at [http://www.chiark.greenend.org.uk/~sgtatham/putty/](http://www.chiark.greenend.org.uk/~sgtatham/putty/). Save and open it.\n", "2. In the \u201chost\u201d field, type ec2-54-225-7-147.compute-1.amazonaws.com.\n", "3. Use the credential information that I give you in the workshop.\n", "\n", "For Mac:\n", "\n", "1. Go to Applications -> Utilities -> Terminal and open Terminal.\n", "2. Type ssh <username>@ec2-54-225-7-147.compute-1.amazonaws.com, where <username> is the username I give you.\n", "3. Use the credential information I give you in the workshop.\n", "\n", "Once you've done that, you need to check out the git repository for this workshop. You can do that by typing this command:\n", "\n", "    [hse0@ip-10-196-55-224 ~]$ git clone https://github.com/raynach/hse-twitter\n", "\n", "Note: don't type [hse0@ip-10-196-55-224 ~]$. This is the command-line prompt that I'm showing so you know you're on the command line. Type everything after the $.\n", "\n", "This will create the hse-twitter directory in your home directory. A directory is another name for a folder. Within the hse-twitter directory will be two more: bin and data. Get into the hse-twitter directory by typing the following:\n", "\n", "    [hse0@ip-10-196-55-224 ~]$ cd hse-twitter\n", "\n", "Then get into the bin directory by typing:\n", "\n", "    [hse0@ip-10-196-55-224 hse-twitter]$ cd bin\n", "\n", "If you want to move up a level, you can type:\n", "\n", "    [hse0@ip-10-196-55-224 bin]$ cd ..\n", "\n", "To see what is in the current directory, use the ls command; to see where you are, use pwd. So for instance, once you are in the hse-twitter directory, do this:\n", "\n", "    [hse0@ip-10-196-55-224 hse-twitter]$ ls\n", "\n", "And you should get this as output:\n", "\n", "    bin  data  README.md\n", "\n", "Enter the bin directory and you can see where you are with pwd:\n", "\n", "    [hse0@ip-10-196-55-224 hse-twitter]$ cd bin/\n", "    [hse0@ip-10-196-55-224 bin]$ pwd\n", "    /home/hse0/hse-twitter/bin" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Editing and Writing Files\n", "\n", "First things first \u2014 get a text editor. 
Go to [http://www.jedit.org/](http://www.jedit.org/) and download jEdit. jEdit is a free, open-source text editor written in Java. It has a ton of cool features and makes editing files on remote servers a snap, which is why we are using it.\n", "\n", "Once you have downloaded and installed it, go to Plugins->Plugins Manager in the menu. You should get a screen that looks like this.\n", "\n", "*(screenshot of the jEdit plugin manager)*\n", "\n", "Click the \"Install\" tab and find the \"FTP\" plugin. Select the checkbox and click install. Once it installs, close the window. Now, select Plugins->FTP->Open from Secure FTP Server... from the menu. Type in all your information so it looks like the screenshot below.\n", "\n", "*(screenshot of the FTP connection dialog)*\n", "\n", "Now, once it loads, navigate to the \"hse-twitter/bin\" directory like you would with a GUI file manager. Open up streaming.py. This is the file we'll eventually get to editing.\n", "\n", "However, before that, I need to give a little background on Python. Python is an interpreted scripting language. This is different from languages like Java or C, which are traditionally compiled into a language that the computer can read directly. Python instead reads a file like a script, line by line, executing commands in a procedural fashion." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Collecting Data\n", "\n", "Collecting data is pretty straightforward with tweepy. The first thing to do is to create an instance of a tweepy StreamListener to handle the incoming data. The way that I have mine set up is that I start a new file for every 20,000 tweets, tagged with a prefix and a timestamp. I also keep another file open for the list of status IDs that have been deleted, which are handled differently from other tweet data. I call this file slistener.py. You should have a copy of it." ] }, { "cell_type": "code", "collapsed": false, "input": [ "from tweepy import API, StreamListener\n", "import json, time, sys\n", "\n", "class SListener(StreamListener):\n", "\n", "    def __init__(self, api = None, fprefix = 'streamer'):\n", "        self.api = api or API()\n", "        self.counter = 0\n", "        self.fprefix = fprefix\n", "        self.output = open('../data/' + fprefix + '.' \n", "            + time.strftime('%Y%m%d-%H%M%S') + '.json', 'w')\n", "        self.delout = open('../data/delete.txt', 'a')\n", "\n", "    def on_data(self, data):\n", "\n", "        if 'in_reply_to_status' in data:\n", "            self.on_status(data)\n", "        elif 'delete' in data:\n", "            delete = json.loads(data)['delete']['status']\n", "            if self.on_delete(delete['id'], delete['user_id']) is False:\n", "                return False\n", "        elif 'limit' in data:\n", "            if self.on_limit(json.loads(data)['limit']['track']) is False:\n", "                return False\n", "        elif 'warning' in data:\n", "            warning = json.loads(data)['warning']\n", "            print warning['message']\n", "            return False\n", "\n", "    def on_status(self, status):\n", "        self.output.write(status + \"\\n\")\n", "\n", "        self.counter += 1\n", "\n", "        if self.counter >= 20000:\n", "            self.output.close()\n", "            self.output = open('../data/' + self.fprefix + '.' 
\n", " + time.strftime('%Y%m%d-%H%M%S') + '.json', 'w')\n", " self.counter = 0\n", "\n", " return\n", "\n", " def on_delete(self, status_id, user_id):\n", " self.delout.write( str(status_id) + \"\\n\")\n", " return\n", "\n", " def on_limit(self, track):\n", " sys.stderr.write(track + \"\\n\")\n", " return\n", "\n", " def on_error(self, status_code):\n", " sys.stderr.write('Error: ' + str(status_code) + \"\\n\")\n", " return False\n", "\n", " def on_timeout(self):\n", " sys.stderr.write(\"Timeout, sleeping for 60 seconds...\\n\")\n", " time.sleep(60)\n", " return " ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we need the script that does the collecting itself. I call this file streaming.py. You can collect on users, keywords, or specific locations defined by bounding boxes. The API documentation has more information on this. For now, let\u2019s just track some popular keywords \u2014 \"obama\" and \"egypt\" (keywords are case-insensitive)." ] }, { "cell_type": "code", "collapsed": false, "input": [ "\n", "from slistener import SListener\n", "import time, tweepy, sys\n", "\n", "## auth. \n", "## TK: Edit the username and password fields to authenticate from Twitter.\n", "username = ''\n", "password = ''\n", "auth = tweepy.auth.BasicAuthHandler(username, password)\n", "api = tweepy.API(auth)\n", "\n", "## Eventually you'll need to use OAuth. Here's the code for it here.\n", "## You can learn more about OAuth here: https://dev.twitter.com/docs/auth/oauth\n", "#consumer_key = \"\"\n", "#consumer_secret = \"\"\n", "#access_token = \"\"\n", "#access_token_secret = \"\"\n", "\n", "# OAuth process, using the keys and tokens\n", "#auth = tweepy.OAuthHandler(consumer_key, consumer_secret)\n", "#auth.set_access_token(access_token, access_token_secret)\n", "\n", "def main( mode = 1 ):\n", " track = ['obama', 'egypt']\n", " follow = []\n", " \n", " listen = SListener(api, 'test')\n", " stream = tweepy.Stream(auth, listen)\n", "\n", " print \"Streaming started on %s users and %s keywords...\" % (len(track), len(follow))\n", "\n", " try: \n", " stream.filter(track = track, follow = follow)\n", " #stream.sample()\n", " except:\n", " print \"error!\"\n", " stream.disconnect()\n", "\n", "if __name__ == '__main__':\n", " main()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You need to change the username and password variables in the code above for it to work. You can do that by accessing the file through jEdit and making the changes there. Use your personal Twitter username and password, or create one for the workshop and use it there.\n", "\n", "Once you've changed the username and password, you can start collecting data. Run the following command:\n", "\n", " [hse0@ip-10-196-55-224 bin]$ python streaming.py\n", "\n", "To stop this script from running, press Ctrl + C.\n", "\n", "Note: if we run this all at once we may get an error from Twitter. If you can a 421 error, that is what is probably happening. So only run this for a second and then quit out of it.\n", "\n", "You can see the data collected by going to the data directory. If you are in bin, you can type cd ../data. 
Then type ls -l to see something like this.\n", "\n", "    [hse0@ip-10-196-55-224 bin]$ cd ../data\n", "    [hse0@ip-10-196-55-224 data]$ ls -l\n", "    total 1496\n", "    -rw-rw-r-- 1 hse0 hse0      0 Aug 16 00:12 delete.txt\n", "    -rw-rw-r-- 1 hse0 hse0      0 Aug 16 00:12 placeholder\n", "    -rw-rw-r-- 1 hse0 hse0 208409 Aug 16 00:14 test.20130816-001357.json\n", "\n", "The file that ends with \"json\" is the Twitter data. If you type \"more <filename>\" you can see some of the raw JSON." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Getting Network Data\n", "\n", "Once we've actually collected some Twitter data, we need to get network data out of it somehow. For the current analysis we are just going to look at mention networks. Because of the way tweets are structured, we can easily pull out mentions of other users.\n", "\n", "The networked part of Twitter JSON looks like this:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "\"\"\"{\n", "  \"id_str\":\"265631953204686848\", \n", "  \"text\":\"@AllenVaughan In all fairness that \\\"talented\\\" team was 5-6...\",\n", "  \"in_reply_to_user_id_str\":\"166318986\",\n", "  ...\n", "  \"entities\":{\n", "    \"urls\":[],\n", "    \"hashtags\":[],\n", "    \"user_mentions\":[ \n", "      {\n", "        \"id_str\":\"166318986\",\n", "        \"indices\":[0,13],\n", "        \"name\":\"Allen Vaughan\",\n", "        \"screen_name\":\"AllenVaughan\",\n", "        \"id\":166318986\n", "      }\n", "    ]}\n", "...\n", "}\"\"\"" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A potential pitfall of pulling out mentions this way is that retweets are also recorded like this, so we could be measuring both retweets and mentions. However, we can bracket this consideration right now. There are ways of filtering out retweets that we can cover in future modules. For now, we will focus on how to create mention networks.\n", "\n", "To process these mentions we will create a file called mentionMapper.py. The algorithm for producing these networks is pretty straightforward. It looks like the following pseudocode:\n", "\n", "    for each tweet:\n", "        user1 = current user\n", "        for each mention of another user (user2):\n", "            emit user1, user2, 1\n" ] }, { "cell_type": "code", "collapsed": false, "input": [ "#!/usr/bin/env python\n", "\n", "import json, sys\n", "\n", "def main():\n", "\n", "    for line in sys.stdin:\n", "        line = line.strip()\n", "\n", "        data = ''\n", "        try:\n", "            data = json.loads(line)\n", "        except ValueError as detail:\n", "            sys.stderr.write(str(detail) + \"\\n\")\n", "            continue\n", "\n", "        if 'entities' in data and len(data['entities']['user_mentions']) > 0:\n", "            user = data['user']\n", "            user_mentions = data['entities']['user_mentions']\n", "\n", "            for u2 in user_mentions: \n", "                print \"\\t\".join([\n", "                    user['id_str'],\n", "                    u2['id_str'],\n", "                    \"1\"\n", "                ])\n", "\n", "if __name__ == '__main__':\n", "    main()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To run this, we use a few UNIX commands to tie the output of one command to the input of another. First, make sure that you are in the bin directory. Next, we'll use the cat command and the | character to pipe the Twitter data into the Python file. 
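Pipes can be chained, too. Once the basic command below works, one way to collapse the repeated user1/user2 rows it prints into weighted edges is to append the standard sort and uniq tools to the end of the pipeline (a sketch, not required for the rest of the workshop):\n", "\n", "    [hse0@ip-10-196-55-224 bin]$ cat ../data/test.20130816-004221.json | python mentionMapper.py | sort | uniq -c\n", "\n", "Each line of that output is a count followed by the original row, i.e. how many times that first user mentioned the second. 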
You'll need to replace \"../data/test.20130816-004221.json\" with the name of the file that tweepy generated for you.\n", "\n", " [hse0@ip-10-196-55-224 bin]$ cat ../data/test.20130816-004221.json | python mentionMapper.py \n", " 634645657\t386260483\t1\n", " 228023890\t312739659\t1\n", " 34344598\t420452993\t1\n", " 539484750\t20909329\t1\n", " 1411163412\t1614375631\t1\n", " 152158738\t80330381\t1\n", " ...\n", "\n", "You'll probably get a lot of output here. To redirect this to a file, you can use the > character.\n", "\n", " [hse0@ip-10-196-55-224 bin]$ cat ../data/test.20130816-004221.json | python mentionMapper.py > network-data.csv\n", "\n", "**Updated, 2013-08-17**\n", "\n", "If you weren't able to collect your own tweets, you can grab a test dataset of 10,000 tweets by typing this:\n", "\n", " [hse0@ip-10-196-55-224 bin]$ wget http://ssc.wisc.edu/~ahanna/sampleTweets.json\n", "\n", "This will download a file which you can run the mentionMapper.py file on:\n", "\n", " [hse0@ip-10-196-55-224 bin]$ cat sampleTweets.json | python mentionMapper.py > network-data.csv\n", "\n", "**End Update**\n", "\n", "And voila! Now you have an edgelist that you can give to NodeXL, Gephi, or R. You can connect to the server using a tool like CyberDuck ([http://cyberduck.ch/](http://cyberduck.ch/)) or WinSCP ([http://winscp.net/eng/index.php](http://winscp.net/eng/index.php)) to download the file and plug it into your favorite statistical package.\n", "\n", "This is, of course, a very basic sort of method to gather connections. We can also filter user mentions on whether it is a user mention (e.g. @alexhanna hi!) or a retweet. We can also count how many interactions users have between each other within a given amount of time to quantify tie strength. But this analysis is a basic building block of this kind of analysis.\n", "\n", "Much of this workshop has been based off of [this tutorial](http://www.alex-hanna.com/tworkshops/lesson-7-mention-network-analysis/), which contains a few more details and displays a network graph surrounding the US presidential election." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Appendix: How to get your OAuth keys for tweepy\n", "\n", "[@jsajuria](www.twitter.com/jsajuria) put together this tutorial of how to get OAuth keys for the script above.\n", "\n", "First, go into [dev.twitter.com](dev.twitter.com) and login using your Twitter account details. You will now activate your account to work as a Twitter developer.\n", "\n", "Then, click on your Twitter avatar (upper right-hand corner) and a menu will display. Click on \"My Applications\"\n", "\n", "\n", "\n", "Once there, you will have to click the \"Create new application\" button\n", "\n", "\n", "\n", "You'll get a form with the basic data of your app. Create a name and a short description, and enter any URL (you won't need a custom URL for using tweepy)\n", "\n", "\n", "\n", "Now, you have your new app, and can ask Twitter to create your keys. You can access [this site](https://dev.twitter.com/docs/auth/tokens-devtwittercom) to get instructions on how to create them. Once you have them, open the streaming.py on jEdit and change the code in the following way:\n", "\n", "Remove the # signs on the code lines referring to the OAuth process and enter your details (consumer key, token, and their respective secrets), as shown below:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "## Eventually you'll need to use OAuth. 
Here's the code for it.\n", "## You can learn more about OAuth here: https://dev.twitter.com/docs/auth/oauth\n", "consumer_key = \"YOUR CONSUMER KEY HERE\"\n", "consumer_secret = \"YOUR CONSUMER SECRET HERE\"\n", "access_token = \"YOUR TOKEN HERE\"\n", "access_token_secret = \"YOUR TOKEN SECRET HERE\"\n", "\n", "# OAuth process, using the keys and tokens\n", "auth = tweepy.OAuthHandler(consumer_key, consumer_secret)\n", "auth.set_access_token(access_token, access_token_secret)\n", "\n" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Once finished, remember to save it, go back to the terminal, and type:\n", "\n", ">`python streaming.py`\n", "\n", "That should do it!" ] } ], "metadata": {} } ] }