{ "metadata": { "name": "access2research.ipynb" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "Access to Research Map\n", "======================\n", "\n", "An IPython Notebook for generating post code data relating to the location of libraries participating in the [Access to Research](http://www.accesstoresearch.org.uk/) program. The code scrapes the Access to Research site for the URLs of participating libraries and then attempts to find postcodes at the library websites. [The map](https://mapsengine.google.com/map/edit?mid=zBC5pwaMebo8.kakfyX3YqeXs) made from this data is available at Google Maps Engine. More details on the motivation and process are available in [this post](http://cameronneylon.net/blog/improving-on-access-to-research)." ] }, { "cell_type": "code", "collapsed": false, "input": [ "#imports\n", "import requests\n", "import string\n", "from bs4 import BeautifulSoup\n", "import re\n", "import csv" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 15 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Grab the Access to Research page with the libraries information" ] }, { "cell_type": "code", "collapsed": false, "input": [ "page = requests.get('http://www.accesstoresearch.org.uk/libraries')" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 16 }, { "cell_type": "code", "collapsed": false, "input": [ "soup = BeautifulSoup(page.text)\n", "librarylist = soup.find('div', class_='col-lft')\n", "libraries = librarylist.find_all('ul')" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 17 }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can find the number of libraries and look at the structure of the information on each one. Once we know the structure of each library element we can easily pull out the name and URL for each one." ] }, { "cell_type": "code", "collapsed": false, "input": [ "print len(libraries)\n", "print libraries[0]" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "235\n", "\n" ] } ], "prompt_number": 18 }, { "cell_type": "code", "collapsed": false, "input": [ "libs = []\n", "for library in libraries:\n", " name = library.find('a').text\n", " url = library.find('a')['href']\n", " libs.append([name, url])\n", " \n" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 19 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Just check that we've got that working properly by printing out the first one." ] }, { "cell_type": "code", "collapsed": false, "input": [ "print libs[0]" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "[u'Abingdon', u'http://www.oxfordshire.gov.uk/cms/content/abingdon-library']\n" ] } ], "prompt_number": 20 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Actual position of postcode for various library systems\n", "-------------------------------------------------------\n", "\n", "For some County Council websites it seems there are multiple post codes on the library pages. For instance both the library postcode and the council postcode. So for those we need to identify more specific structures within the page where we find the correct postcode.\n", "\n", "* For Kent County Council in a span element with id \"ctl00__mainContent_uxPostcodeLabel\"\n", "* For Oxfordshire the correct postcode us in a div (no! sometimes a span!) with class \"postal-code\"\n", "* Surrey doesn't appear to give postcodes at all...\n", "* Calderdale is in an unordered list with class \"contactitem\"\n", "* West Sussex works fine, as does Lewisham, Buckinghamshire and East Sussex\n", "\n", "To deal with this we need both a way of recognising a valid UK postcode (via regex) and some specific processing for the irritating cases." ] }, { "cell_type": "code", "collapsed": false, "input": [ "# The UK Postcode REGEX came from\n", "# http://www.regxlib.com/REDetails.aspx?regexp_id=260\n", "ukpc = re.compile('([A-PR-UWYZ0-9][A-HK-Y0-9][AEHMNPRTVXY0-9]?[ABEHMNPRVWXY0-9]? {1,2}[0-9][ABD-HJLN-UW-Z]{2}|GIR 0AA)')\n", "\n", "def process(raw_postcode,page):\n", " \"\"\"\n", " Given a postcode from the 'doesn't work' list, process page correctly\n", " \"\"\"\n", " soup = BeautifulSoup(page)\n", " if raw_postcode == 'OX1 1ND':\n", " elem = soup.find(['span','div'], class_='postal-code')\n", "\n", " postcode = ukpc.search(elem.get_text()).group(0)\n", " \n", " elif raw_postcode == 'ME14 1LQ':\n", " elem = soup.find('span', id='ctl00__mainContent_uxPostcodeLabel')\n", " postcode = ukpc.search(elem.get_text()).group(0)\n", " \n", " elif raw_postcode == 'HX1 1UJ':\n", " elem = soup.find_all('ul', class_='contactitem')[1]\n", " postcode = ukpc.search(elem.get_text()).group(0)\n", " \n", " return postcode" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 21 }, { "cell_type": "markdown", "metadata": {}, "source": [ "With all of that in hand we can now process our list, grabbing each webpage, finding the postcode, and if necessary checking that we've got the correct one." ] }, { "cell_type": "code", "collapsed": false, "input": [ "tocorrect = ['OX1 1ND', 'ME14 1LQ', 'HX1 1UJ']\n", "for library in libs:\n", " page = requests.get(library[1])\n", " m = ukpc.search(page.text)\n", " if m:\n", " postcode = m.group(0)\n", " if postcode in tocorrect:\n", " postcode = process(postcode, page.text)\n", " library.append(postcode)\n" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 22 }, { "cell_type": "markdown", "metadata": {}, "source": [ "And finally we can write out the files. Because Google Mapsengine Lite (the free version) only allows 100 items in a given layer I've divided the list up into three as well as dumping the full set." ] }, { "cell_type": "code", "collapsed": false, "input": [ "filename = 'libraries.csv'\n", "with open(filename, 'w') as f:\n", " writer = csv.writer(f)\n", " writer.writerow(['Library', 'url', 'postcode'])\n", " writer.writerows(libs)\n", "\n", "for segment in [(0,100), (101, 200), (201,235)]:\n", " filename = 'libraries%s.csv' % str(segment[0])\n", " with open(filename, 'w') as f:\n", " writer = csv.writer(f)\n", " writer.writerow(['Library', 'url', 'postcode'])\n", " writer.writerows(libs[segment[0]:segment[1]])" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 23 }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can find the output data in the [github repository](https://github.com/cameronneylon/access2research) which also hosts a version of this IPython Notebook. The [map itself](https://mapsengine.google.com/map/edit?mid=zBC5pwaMebo8.kakfyX3YqeXs) is live at [Google Maps Engine](https://mapsengine.google.com/map/) where I used the free lite version. The post discussing this, and the irony that I've probably violated a whole range of text mining rules, not least [those imposed on users](http://www.accesstoresearch.org.uk/terms-and-conditions) of the Access to Research Service is available [on my blog](http://cameronneylon.net/blog/improving-on-access-to-research)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

\n", " \n", " \"CC0\"\n", " \n", "
\n", " To the extent possible under law,\n", " \n", " Cameron Neylon\n", " has waived all copyright and related or neighboring rights to\n", " Access to Research Map - IPython Notebook.\n", "

" ] }, { "cell_type": "code", "collapsed": false, "input": [], "language": "python", "metadata": {}, "outputs": [] } ], "metadata": {} } ] }