{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Harvest Social media data (Twitter) \n", "## I REST my case\n", "### Learning outcomes\n", "\n", "Data from social media such as Twitter, Flickr, Instagram etc. potentially is a rich source of spatial information.\n", "\n", "During this exercise you learn to apply python to:\n", "- connect to Twitter using the Twython package\n", "- fire a query to Twitter\n", "- parse the results and write it to a file\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Connecting to Twitter\n", "\n", "Before we harvest tweets, we need to connect to the Twitter platform as a developer. To do this you have to:\n", "\n", "1. Log in using your Twitter account (create one if you don't have)\n", "2. Go to https://apps.twitter.com/ and go to `manage your apps` at the bottom of the page\n", "3. Create a new app by filling in the fields (see figure)\n", "4. The result of creating (actually registering) a new app is that keys (tokens) are generated which allow you to access the Twitter API (if you don't know what API means, google for it)\n", "5. You can find the tokens under the leaf `Keys and Access Tokens`. A consumer key is generated for you. \n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "After you have obtained your access code to twitter we can start coding a script in Python giving us access to Twitter.\n", "\n", "The first thing to do is import the necessary libraries. The most important libraries you need are Twython and json. Have at look at the Twython website and answer to your self what it does. \n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Run the code below: if you don't get an error message the Twython libraries are installed properly. If not, open a terminal and install using conda (conda install --name astrolab -c conda-forge twython), pip ( pip install twython) or easy intstall ( easy_install twython). More info [here](https://twython.readthedocs.org/en/latest/usage/install.html#pip-or-easy-install)\n" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "scrolled": true }, "outputs": [], "source": [ "from twython import Twython\n", "import json\n", "import datetime " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Json\n", "\n", "Before we start the real thing you have to understand something about JSON (if you do know JSON skip this part). JSON is an important data format for many web applications. Also Twitter makes use of JSON. It is a lightweight alternative for XML Make sure that you understand what JSON means and how it structures data: https://en.wikipedia.org/wiki/JSON\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Connecting to Twitter using Twython\n", "\n", "As mentioned Twython is the library that we use to connect to Twitter. Twitter offers two APIs: The REST api and the streaming API. This tutorial is on using the REST api. \n", "\n", "The REST APIs provide programmatic access to read and write Twitter data, author a new Tweet, read author profiles and follower data, and more. The responses are available in JSON. Have a look at https://dev.twitter.com/rest/public to have a glance of what is offered.\n", "\n", "If you want real-time (ok almost real-time) access to tweets you can use the STREAMING api. The Streaming APIs give developers low latency access to Twitter’s global stream of Tweet data. See for more information: https://dev.twitter.com/streaming/overview\n", "\n", "\n", "Exercise:\n", "- Go to https://dev.twitter.com/overview/api/tweets, explore the Twitter data structure and find out where and what types of geodata are stored in tweets.\n", "\n", "\n", "In this example we are going to use the REST api to fire some queries to Twitter, process the JSON response to extract from it what we need, and write it to a simple text file we can use to import in excel, a database, or even a GIS package. \n", "\n", "First of all instantiate a Twython object using the following:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "\n", "\n", "##codes to access twitter API. \n", "APP_KEY = \n", "APP_SECRET = \n", "OAUTH_TOKEN = \n", "OAUTH_TOKEN_SECRET = \n", "\n", "##initiating Twython object \n", "twitter = Twython(APP_KEY, APP_SECRET, OAUTH_TOKEN, OAUTH_TOKEN_SECRET)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Fire a question to Twitter\n", "\n", "Ok we have a connection (at least if you filled in your credentials correctly). Now lets ask Twitter a simple question. For that we use the Twitter `SEARCH` API (which is part of the REST api). You can find documentation at https://dev.twitter.com/rest/public/search.\n", "\n", "The results of the query we store in a variable called `search_results`.\n", "\n", "Exercise: \n", "- Implement a query that only includes tweets within 25 km from Amsterdam (or any other place). tip: have a look at the \"geocode\" parameter (see https://dev.twitter.com/rest/public/search)\n", "\n", "Have a look at the code below and answer the following questions:\n", "\n", "> **Question 1**: What is the meaning of 'q' and what does 'count' mean?\n", "\n", "> **Question 2**: What datastructure is `search-results`?\n", "\n", "> **Question 3**: Why do we code like: `search_results['statuses']`?\n", "\n", "> **Question 4**: What datastructure is `result`?\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{u'contributors': None, u'truncated': False, u'text': u'#Politie #Amsterdam Man zonder rijbewijs maakt brokken: [ Woensdag 20 januari 2016 | Politie ] Man zonder rijb... https://t.co/hxPT9DwM3v', u'is_quote_status': False, u'in_reply_to_status_id': None, u'id': 689743974907572224, u'favorite_count': 0, u'source': u'twitterfeed', u'retweeted': False, u'coordinates': None, u'entities': {u'symbols': [], u'user_mentions': [], u'hashtags': [{u'indices': [0, 8], u'text': u'Politie'}, {u'indices': [9, 19], u'text': u'Amsterdam'}], u'urls': [{u'url': u'https://t.co/hxPT9DwM3v', u'indices': [114, 137], u'expanded_url': u'http://bit.ly/1WtvURy', u'display_url': u'bit.ly/1WtvURy'}]}, u'in_reply_to_screen_name': None, u'in_reply_to_user_id': None, u'retweet_count': 0, u'id_str': u'689743974907572224', u'favorited': False, u'user': {u'follow_request_sent': False, u'has_extended_profile': False, u'profile_use_background_image': True, u'default_profile_image': False, u'id': 90489299, u'profile_background_image_url_https': u'https://pbs.twimg.com/profile_background_images/641610498/z7hhovezh374pfmh1okp.png', u'verified': False, u'profile_text_color': u'000000', u'profile_image_url_https': u'https://pbs.twimg.com/profile_images/557631377164607488/RvjbcG5x_normal.png', u'profile_sidebar_fill_color': u'FF0000', u'entities': {u'url': {u'urls': [{u'url': u'http://t.co/PVPg3sxYGB', u'indices': [0, 22], u'expanded_url': u'http://worldtv.com/police-station/web', u'display_url': u'worldtv.com/police-station\\u2026'}]}, u'description': {u'urls': [{u'url': u'http://t.co/Xoz1PYPkhq', u'indices': [112, 134], u'expanded_url': u'http://p2000.nl', u'display_url': u'p2000.nl'}, {u'url': u'http://t.co/NAJn2rjgsH', u'indices': [138, 160], u'expanded_url': u'http://oozo.nl/', u'display_url': u'oozo.nl'}]}}, u'followers_count': 1654, u'profile_sidebar_border_color': u'000000', u'id_str': u'90489299', u'profile_background_color': u'FFFFFF', u'listed_count': 57, u'is_translation_enabled': False, u'utc_offset': 3600, u'statuses_count': 68119, u'description': u'Thanks for Following !! #State #police #Messages - Look @ Favorites for More #Politie #FBI #CIA Info , Website http://t.co/Xoz1PYPkhq or http://t.co/NAJn2rjgsH', u'friends_count': 1565, u'location': u'Netherlands ', u'profile_link_color': u'000000', u'profile_image_url': u'http://pbs.twimg.com/profile_images/557631377164607488/RvjbcG5x_normal.png', u'following': False, u'geo_enabled': False, u'profile_banner_url': u'https://pbs.twimg.com/profile_banners/90489299/1411431556', u'profile_background_image_url': u'http://pbs.twimg.com/profile_background_images/641610498/z7hhovezh374pfmh1okp.png', u'screen_name': u'politieregio', u'lang': u'en', u'profile_background_tile': False, u'favourites_count': 39, u'name': u'police Scanner', u'notifications': False, u'url': u'http://t.co/PVPg3sxYGB', u'created_at': u'Mon Nov 16 21:31:47 +0000 2009', u'contributors_enabled': False, u'time_zone': u'Amsterdam', u'protected': False, u'default_profile': False, u'is_translator': False}, u'geo': None, u'in_reply_to_user_id_str': None, u'possibly_sensitive': False, u'lang': u'nl', u'created_at': u'Wed Jan 20 09:39:13 +0000 2016', u'in_reply_to_status_id_str': None, u'place': None, u'metadata': {u'iso_language_code': u'nl', u'result_type': u'recent'}}\n" ] } ], "source": [ " \n", "search_results = twitter.search(q='#amsterdam', count=1)\n", "\n", "\n", "for result in search_results['statuses']:\n", " print(result)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Get it out of JSON\n", "\n", "Nasty isn' t it? Just the plain output of the JSON containing a lot of information about the user, the location, the tweet etc. \n", "\n", "Have a look at the code below. Our task is to select the data we are interested in. The tweet information is available in `statuses`. So what we have to do is loop over all results, store each separate tweet in a variable (in this case called `tweet`) and extract specific JSON fields from the tweet. Not so difficult provided that you know what you are looking for and where it is located. The thing is the JSON is a nested structure having various levels. As mentioned the structure you can study at https://dev.twitter.com/overview/api/tweets. In the code below you see how to get it using Python. It helps to realize that the tweet is basically represented in Python as a dictionary with keys and values (thanks to the Twyton and json libraries we imported).\n", "\n", "Exercise:\n", "- Complete and test the code below by pulling the coordinates from the tweet.\n", "\n", "Exercise for the real die-hards (optional):\n", "- In the `place` object a boundingbox which encloses the place is stored. Write a function (using OGR for example) that calculates the centroid of this bounding box.\n", "\n", "Answer these questions while doing the exercise:\n", "> **Question 5**: Have a look at the `place` object at https://dev.twitter.com/overview/api/tweets. What does `Nullable` mean?\n", "\n", "> **Question 6**: What is the meaning of `if tweet['place'] != None:` and why do we need to code it like this?" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "politieregio\n", "1654\n", "#Politie #Amsterdam Man zonder rijbewijs maakt brokken: [ Woensdag 20 januari 2016 | Politie ] Man zonder rijb... https://t.co/hxPT9DwM3v\n", "===========================\n" ] } ], "source": [ "##parsing out \n", "twitter_data = [] # create empty array\n", "for tweet in search_results[\"statuses\"]:\n", " username = tweet['user']['screen_name']\n", " followers_count = tweet['user']['followers_count']\n", " tweettext = tweet['text']\n", " if tweet['place'] != None:\n", " full_place_name = tweet['place']['full_name']\n", " place_type = tweet['place']['place_type'] \n", " coordinates = tweet['coordinates']\n", " if coordinates != None:\n", " print(\"\")\n", " #do it yourself: enter code here to pull out coordinate \n", " tweet_data = [username, followers_count, tweettext] \n", " # add coordinates to array\n", " twitter_data += [tweet_data]\n", "\n", "print(twitter_data) # print harvested twitter data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Get it out of Python\n", "\n", "So ok, now you know how to harvest tweets. The only thing you have to do now is to make your data last. In other words writing it to a file or optionally to a database (for harvesting to database see [harvesting streaming data](http://nbviewer.jupyter.org/github/GeoScripting-WUR/PythonWeek/blob/gh-pages/HarvestingRealTimeTweets.ipynb)). The most simple thing to do is to adapt the code above and write the code to a delimited file (comma or tab). If you use a comma delimited file make sure that you replace existing commas in the tweet text by something else (or by nothing). Otherwise you will screw-up the datastructure of your file. The code sniplet below gives some hints. Try to interlace it yourself with the code above.\n" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import csv\n", "\n", "output_file = 'twitter_data.csv'\n", "\n", "f = open(output_file, 'w')\n", "\n", "with f:\n", " writer = csv.writer(f)\n", " for row in twitter_data:\n", " writer.writerow(row)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Check if your csv with the twitter data is there. If it is, hooray you just harvested tweets. \n", "\n", "Reading a CSV file can also be done with Python (see [here](http://zetcode.com/python/csv/) for more info). CSV files are a popular import and export data format used in spreadsheets and databases. CSV files have a very simple format and most (if not all) data analysis software can read and write CSV files.\n", "\n", "If you want to do queries on your twitter data, you don't need to go to excel or R, but you can use `Pandas` in Python ([Pandas tutorial](https://www.datacamp.com/community/tutorials/pandas-tutorial-dataframe-python)) to read your CSV files and do queries. The dataframe `Pandas` provides is similar to a dataframe in R, which makes it easy and intuitive to use for us. `Pandas` is an open source library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.\n", "\n", "After finishing this basic twitter harvest tutorial, you can continue with the next tutorial \"[Harvesting Real-Time Tweets](http://nbviewer.jupyter.org/github/GeoScripting-WUR/PythonWeek/blob/gh-pages/HarvestingRealTimeTweets.ipynb)\"." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.4" } }, "nbformat": 4, "nbformat_minor": 1 }