{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Using the Twitter STREAMING API\n", "### A basic example\n", "\n", "In the previous exercise you learned how to harvest tweets already posted by using the REST api. In this exercise we will continue harvesting tweets posted in (semi) real time. It is just a basic example to get your started \n", "\n", "The lines below you know already from the previous excersise:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from twython import TwythonStreamer\n", "import string, json, pprint\n", "import urllib\n", "from datetime import datetime\n", "from datetime import date\n", "from time import *\n", "import string, os, sys, subprocess, time\n", "import psycopg2\n", "\n", "# get access to the twitter API\n", "APP_KEY = \"\"\n", "APP_SECRET = \"\"\n", "OAUTH_TOKEN = \"\"\n", "OAUTH_TOKEN_SECRET = \"\"\n", "\n", "## just some date and time hack to generate an unique filename if needed\n", "output_file = 'result_' + datetime.now().strftime('%Y%m%d-%H%M%S') + '.csv' \n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Setting up a streaming class\n", "\n", "The new thing is that we are not going to use the `Twython` interface from the library but the `TwythonStreamer` interface. In the code below you see a Python class (MyStreamer) which inherits from the TwythonStreamer interface. \n", "\n", "This class has a number of functions. The main ones are: `on_succes` and `on_error`. The `on_succes` is called when data has been successfully recieved from the stream. The parameter `data` (a dictionary thanks to Twython) contains the tweet which you can parse-out the way you did previously.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Class to process JSON data comming from the twitter stream API. Extract relevant fields\n", "class MyStreamer(TwythonStreamer):\n", " def on_success(self, data):\n", " tweet_lat = 0.0\n", " tweet_lon = 0.0\n", " tweet_name = \"\"\n", " retweet_count = 0\n", "\n", " if 'id' in data:\n", " tweet_id = data['id']\n", " if 'text' in data:\n", " tweet_text = data['text'].replace(\"'\",\"''\").replace(';','')\n", " if 'coordinates' in data: \n", " geo = data['coordinates']\n", " if not geo is None:\n", " latlon = geo['coordinates']\n", " tweet_lon = latlon[0]\n", " tweet_lat= latlon[1]\n", " if 'created_at' in data:\n", " dt = data['created_at']\n", " tweet_datetime = datetime.strptime(dt, '%a %b %d %H:%M:%S +0000 %Y')\n", "\n", " if 'user' in data:\n", " users = data['user']\n", " tweet_name = users['screen_name']\n", "\n", " if 'retweet_count' in data:\n", " retweet_count = data['retweet_count']\n", " \n", " if tweet_lat != 0:\n", " #some elementary output to console \n", " string_to_write = str(tweet_datetime)+\", \"+str(tweet_lat)+\", \"+str(tweet_lon)+\": \"+str(tweet_text)\n", " print(string_to_write)\n", " #write_tweet(string_to_write)\n", " \n", " def on_error(self, status_code, data):\n", " print(\"OOPS Error: \" +str(status_code))\n", " #self.disconnect\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Fiitering the stream\n", "\n", "Ok. To do it nicely in a Pythonic way; below you see the main procedure where the `MyStreamer` class is instantiated (with all authentication tokens) and next only capture those tweets within a certain bounding box. Have a look at https://twython.readthedocs.org/en/latest/api.html#streaming-interface for more information on what and how to filter the incoming tweet stream\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Main procedure\n", "def main():\n", " try:\n", " stream = MyStreamer(APP_KEY, APP_SECRET,OAUTH_TOKEN, OAUTH_TOKEN_SECRET)\n", " print('Connecting to twitter: will take a minute')\n", " except ValueError:\n", " print('OOPS! that hurts, something went wrong while making connection with Twitter: '+str(ValueError))\n", " #global target\n", " \n", " \n", " # Filter based on bounding box see twitter api documentation for more info\n", " try:\n", " stream.statuses.filter(locations='3.00,50.00,7.35,53.65')\n", " except ValueError:\n", " print('OOPS! that hurts, something went wrong while getting the stream from Twitter: '+str(ValueError))\n", "\n", "\n", " \n", "if __name__ == '__main__':\n", " main()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Ok just for granted. A basic function to write tweets to a file but probably you figured that out yourself. You can add the basic write function to the Streamer class to write your tweets to a file." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def write_tweet(tweet, output_file):\n", " target = open(output_file, 'a')\n", " target.write(tweet)\n", " target.write('\\n')\n", " target.close()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Beyond the basics\n", "\n", "\n", "If you are bored and need a challenge, it would be nice not to write to a dull text file but to a real PostGIS database. \n", "\n", "PostGIS is installed in OSGEO life (see for a quick start: http://live.osgeo.org/en/quickstart/postgis_quickstart.html. The cool thing is that you directly can connect to QGIS and/or do spatial queries on the database. Another way to view your data is via [pgAdmin III](https://www.pgadmin.org/features/), which can be used to inspect and query your PostGis database.\n", "\n", "To be able to connect to a PostGIS database in your Python script you need to import the `psycopg2` library with Conda. Installing Python libraries is easy. Secondly, you need to [change a configuration in your postgres configurations](https://stackoverflow.com/questions/18664074/getting-error-peer-authentication-failed-for-user-postgres-when-trying-to-ge). Open your PostgreSQL settings with the bash command below with the nano text editor:\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "sudo nano /etc/postgresql/9.5/main/pg_hba.conf" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Be careful not to change anything else! Scroll down to the bottom of the text file and change:\n", "\n", "local all postgres peer\n", "\n", "into:\n", "\n", "local all postgres md5" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After changing the line press ctrl+x, press y for yes and press enter to save with same filename. Now you can locally connect to the PostGIS database." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Below you can see how to make a connection to the database. We already prepared a PostGIS database for you: dbname = geoscripting, user = geoscripting and passw = geoscripting. \n", "If setting up the connection to the database does not work, have a look at this [psycopg2 tutorial](https://wiki.postgresql.org/wiki/Psycopg2_Tutorial), this [psycopg2 tutorial with PostgreSQL](https://wiki.postgresql.org/wiki/Using_psycopg2_with_PostgreSQL) or this [Jupyter Notebook with PostgreSQL tutorial](https://www.compose.com/articles/getting-started-with-compose-postgresql-and-jupyter-notebooks/)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "## Create connection to PostGis database and create a cursor\n", "\n", "try:\n", " conn = psycopg2.connect(\"dbname=geoscripting user=geoscripting password=geoscripting host=localhost\")\n", " cur = conn.cursor()\n", "except:\n", " print(\"oops error\")\n", "# Optionally use host=/var/run/postgresql for running outside of conda/virtual environment\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Once you have a connection and a cursor to the database you can execute SQL queries ([some SQL examples](http://jazstudios.blogspot.nl/2010/06/postgresql-login-commands.html)) from Python: creating a table, inserting data into table, dropping tables, retrieving information from table, etc.. \n", "\n", "Using Python as a scripting language to do both harvesting via Twython and saving in PostGIS with psycopg2/SQL makes your life easy! Some benefits of PostGIS are: able to write and read features from database at same time, save spatial features in database based on simple features standard and do spatial processing in db based on third party applications [discussion about PostGIS vs MySQL](https://gis.stackexchange.com/questions/3435/would-postgis-offer-an-advantage-over-mysql-for-a-produce-farm-application).\n", "\n", "Go ahead and add a table to create a new table in the PostGIS database to store your tweet data and save some artifical tweet data in the database." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Create table in PostGis database\n", "insert_query = \"\"\"CREATE TABLE {table_name} \n", " (tweet_name varchar(50),\n", " tweet_lat varchar(50),\n", " tweet_lon varchar(50),\n", " tweet_text varchar(500));\"\"\".format(table_name = \"geoscripting\") \n", "# the format function formats your string to: \"\"\"CREATE TABLE geoscripting (etc.);\"\"\"\n", "\n", "# Execute and commit query\n", "cur.execute(insert_query)\n", "conn.commit() \n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can inspect your data in the PostGIS database in QGIS, pgAdmin III or via PSQL in the terminal. Give pgAdmin III a go. Go to Applications --> Programming --> pgAdmin III. Now the software opens. Then add a new connection to a server. Fill in:\n", "\n", "- name: geoscripting\n", "- host: localhost\n", "- maintenance db: postgres\n", "- username: geoscripting\n", "- password: geoscripting\n", "\n", "In the object browser you should see a server called geoscripting: open it --> databases --> geoscripting (name of database) --> schemas --> public --> tables --> geoscripting (name of table). You should see four columns.\n", "\n", "Add some data to the columns by firing another query." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Mock-up data\n", "tweet_name = \"Geoscripting\"\n", "tweet_lat = 52.1235\n", "tweet_lon = 5.1425\n", "tweet_text = \"#Geoscripting\"\n", "\n", "# Load into postgis database\n", "insert_query = \"\"\"INSERT INTO {table_name} VALUES(%s, %s, %s, %s)\"\"\".format(table_name = \"geoscripting\") \n", "#%s saves data as a string\n", "data = (str(tweet_name), str(tweet_lat), str(tweet_lon), str(tweet_text))\n", "\n", "# Execute and commit query\n", "cur.execute(insert_query, data)\n", "conn.commit()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Retrieve some of the data from your PostGIS database by letting the cursor fetch your data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Retrieve data from PostGIS database\n", "cur.execute(\"SELECT * FROM {table_name}\".format(table_name = \"geoscripting\")) \n", "twitter_data = cur.fetchall()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you don't need the connection to the database, close the cursor and connection to the database!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Close cursor that executed query\n", "cur.close()\n", "\n", "# Close connection to database\n", "conn.close()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We have seen real-time tweet harvesting and saving at work. Now you can get to work to add the script that saves your tweets in the PostGIS database to the Streamer class you used before. This could be the result of your exercise script or you can try something else.\n", "\n", "Don't forget to close the connection to the database when you don't need it anymore!!! " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.4" } }, "nbformat": 4, "nbformat_minor": 1 }