{ "metadata": { "name": "" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "GDELT Data Wrangler\n", "===================\n", "James Houghton; \n", "April 8th, 2014\n", "[houghton@mit.edu](mailto:houghton@mit.edu)\n", "\n", "\n", "\n", "One of the challenges of dealing with [GDELT](http://gdeltproject.org/) is that its size makes implementation in a SQL database challenging. It is easier (although slower) to extract a subset of the data that we wish to work with, and do our actual data investigation afterwards. If we're interested in data broken down by date, we have it easy: GDELT files are provided this way. If instead we want to look at a single country, we have some work to do. This notebook parses through each of the GDELT files, one at a time, extracts the relevant lines, and exports them again to a smaller set of csv files. It then gives the option to load these files into a Pandas DataFrame, pickle the result, and remove the intermediate files.\n", "\n", "There are certainly more computationally efficient methods for doing this, but this one works well enough." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Additional References\n", "\n", "- To export a single set of data, use this exporter: http://analysis.gdeltproject.org/module-event-exporter.html\n", "- Raw Data is available here: http://data.gdeltproject.org/events/index.html\n", "- A description of the data fields is here: http://data.gdeltproject.org/documentation/GDELT-Data_Format_Codebook.pdf\n", "- The event codes are described in this document: http://gdeltproject.org/data/documentation/CAMEO.Manual.1.1b3.pdf\n", "- This condensed version of the codes is easier to browse: http://cameocodes.wikispaces.com/EventCodes\n", "- CAMEO country codes are listed here: http://cameocodes.wikispaces.com/countrybyname\n", "- FIPS country codes are listed here: http://en.wikipedia.org/wiki/List_of_FIPS_country_codes\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##1. Identify the files we need to download\n", "\n", "Links to each of the datafiles are available on the GDELT website. We extract this list to help collect and process the data. We probably want to perform this task every time we rebuild a dataset, so that we know we have the most up-to-date set of files." ] }, { "cell_type": "code", "collapsed": false, "input": [ "import requests\n", "import lxml.html as lh\n", "\n", "gdelt_base_url = 'http://data.gdeltproject.org/events/'\n", "\n", "# get the list of all the links on the gdelt file page\n", "page = requests.get(gdelt_base_url+'index.html')\n", "doc = lh.fromstring(page.content)\n", "link_list = doc.xpath(\"//*/ul/li/a/@href\")\n", "\n", "# separate out those links that begin with four digits \n", "file_list = [x for x in link_list if str.isdigit(x[0:4])]" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Extract Relevant GDELT Rows to Intermediate Files\n", "\n", "In this example, we pull out all of the rows with a specific country code, and create a set of files which mirrors that of GDELT itself in quantity and format. 
{ "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Extract Relevant GDELT Rows to Intermediate Files\n", "\n", "In this example, we pull out all of the rows with a specific country code and create a set of files that mirrors GDELT itself in number and format; each output file is simply smaller than its corresponding input file.\n", "\n", "As we're pulling out rows based upon location, we'll use these GDELT fields (the numbers in parentheses are the positions used to index the tab-split row in the code below):\n", "\n", "- ActionGeo_CountryCode (51)\n", "- Actor1Geo_CountryCode (37)\n", "- Actor2Geo_CountryCode (44)\n", "\n", "These fields use [FIPS country codes](http://en.wikipedia.org/wiki/List_of_FIPS_country_codes) rather than CAMEO country codes, which is somewhat inconvenient." ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "The general algorithm we will follow can be loosely described as:\n", "\n", "- Get a list of the GDELT files\n", "- For each element in the list\n", "> - Check whether the file is available locally (we assume we can afford to keep a local copy of the compressed files, so that we don't have to continually ping the GDELT server)\n", "> > - If not, get the file from the web\n", "> - Unzip the compressed raw GDELT file\n", "> - For each resulting CSV file (probably only one)\n", "> > - Create and open an output file\n", "> > - For each line of the input file\n", "> > > - Read the line into a string\n", "> > > - Split the string on the tab delimiter\n", "> > > - Check for the desired values at the appropriate list indexes\n", "> > > > - If they are not there, continue the loop\n", "> > > > - If they are there, write the line to the output file\n", "> > - Increment a current-file counter\n", "> > - Close the input and output files\n", "> > - Delete the input file\n", "\n", "We reset infilecounter and outfilecounter outside the main algorithm cell so that if the algorithm encounters an error and quits, we can pick up where we left off.\n", "\n", "We choose to check for, and potentially download, each file just before we try to open it, rather than doing all of the file collection in advance. That way we can start running the rest of the code as soon as possible, which helps with debugging because we \"fail fast\".\n", "\n", "Remember to set local_path to somewhere you're happy to store large files." ] },
{ "cell_type": "code", "collapsed": false, "input": [ "infilecounter = 0\n", "outfilecounter = 0" ], "language": "python", "metadata": {}, "outputs": [] },
{ "cell_type": "code", "collapsed": false, "input": [ "import os\n", "import urllib\n", "import zipfile\n", "import glob\n", "import operator\n", "\n", "local_path = '/Users/me/Desktop/GDELT_Data/'\n", "\n", "fips_country_code = 'UK'\n", "\n", "# note: the output directory local_path+'country/' must already exist\n", "for compressed_file in file_list[infilecounter:]:\n", "    print compressed_file,\n", "\n", "    # if we don't have the compressed file stored locally, go get it; keep trying if necessary\n", "    while not os.path.isfile(local_path+compressed_file):\n", "        print 'downloading,',\n", "        urllib.urlretrieve(url=gdelt_base_url+compressed_file,\n", "                           filename=local_path+compressed_file)\n", "\n", "    # extract the contents of the compressed file to a temporary directory\n", "    print 'extracting,',\n", "    z = zipfile.ZipFile(file=local_path+compressed_file, mode='r')\n", "    z.extractall(path=local_path+'tmp/')\n", "\n", "    # parse each of the csv files in the temporary directory\n", "    print 'parsing,',\n", "    for infile_name in glob.glob(local_path+'tmp/*'):\n", "        outfile_name = local_path+'country/'+fips_country_code+'%04i.tsv'%outfilecounter\n", "\n", "        # open the infile and outfile\n", "        with open(infile_name, mode='r') as infile, open(outfile_name, mode='w') as outfile:\n", "            for line in infile:\n", "                # keep only the lines that mention our country of interest\n", "                if fips_country_code in operator.itemgetter(51, 37, 44)(line.split('\\t')):\n", "                    outfile.write(line)\n", "            outfilecounter += 1\n", "\n", "        # delete the temporary input file\n", "        os.remove(infile_name)\n", "    infilecounter += 1\n", "    print 'done'" ], "language": "python", "metadata": {}, "outputs": [] },
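{ "cell_type": "markdown", "metadata": {}, "source": [ "At this point it can be reassuring to spot-check the intermediate files before building anything on top of them. The cell below is only an illustrative sketch, assuming the extraction cell above has finished and that local_path and fips_country_code are still defined: it counts the intermediate files and the total number of extracted rows." ] },
{ "cell_type": "code", "collapsed": false, "input": [ "# illustrative spot-check: assumes local_path and fips_country_code are defined above\n", "import glob\n", "\n", "intermediate_files = glob.glob(local_path+'country/'+fips_country_code+'*')\n", "print '%i intermediate files' % len(intermediate_files)\n", "\n", "# count the rows we extracted across all of the intermediate files\n", "total_rows = 0\n", "for name in intermediate_files:\n", "    with open(name) as f:\n", "        for line in f:\n", "            total_rows += 1\n", "print '%i extracted rows in total' % total_rows" ], "language": "python", "metadata": {}, "outputs": [] },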
{ "cell_type": "markdown", "metadata": {}, "source": [ "## 3. Build the intermediate files into a Pandas DataFrame\n", "\n", "We may be content to use the data we just sampled into CSV files in its present state. However, if we are working in Python it is convenient to load the files into a DataFrame, save that DataFrame to a pickle, and delete the temporary files. This saves disk space and makes later analysis of the data simpler.\n", "\n", "The approach is straightforward: build a DataFrame from each of the temporary files, concatenate them into one large DataFrame, save that DataFrame, and delete the temporary files.\n", "\n", "We use a helper file which lists the GDELT column names. You can download it to your working directory from this link:\n", "\n", "http://gdeltproject.org/data/lookups/CSV.header.fieldids.xlsx" ] },
{ "cell_type": "code", "collapsed": false, "input": [ "import os\n", "import glob\n", "import pandas as pd\n", "\n", "# get the GDELT field names from the helper file\n", "colnames = pd.read_excel('CSV.header.fieldids.xlsx', sheetname='Sheet1',\n", "                         index_col='Column ID', parse_cols=1)['Field Name']\n", "\n", "# build a DataFrame from each of the intermediate files\n", "files = glob.glob(local_path+'country/'+fips_country_code+'*')\n", "DFlist = []\n", "for active_file in files:\n", "    print active_file\n", "    DFlist.append(pd.read_csv(active_file, sep='\\t', header=None, dtype=str,\n", "                              names=colnames, index_col=['GLOBALEVENTID']))\n", "\n", "# merge the file-based DataFrames and save a pickle\n", "DF = pd.concat(DFlist)\n", "DF.to_pickle(local_path+'backup'+fips_country_code+'.pickle')\n", "\n", "# once everything is safely stored away, remove the temporary files\n", "for active_file in files:\n", "    os.remove(active_file)" ], "language": "python", "metadata": {}, "outputs": [] }
], "metadata": {} } ] }