{
 "metadata": {
  "name": ""
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "heading",
     "level": 1,
     "metadata": {},
     "source": [
      "Scraping Recent Ofsted Reports"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "In the UK, schools are regularly inspected by Ofsted, the Office for Standards in Education, Children\u2019s Services and Skills. Inspection reports are made public and can be searched for via the Ofsted website. Searches can be made over different time periods and across different sectors.\n",
      "\n",
      "Here's one example of a particular investigation that required looking at reports published in the previous week (one of the available search limits) that referred to primary schools.\n",
      "\n",
      "!['Ofsted search form'](files/ofstedsearch.png)\n",
      "\n",
      "The search returns pages of two sorts:\n",
      "\n",
      "* results listings\n",
      "* individual school reports\n",
      "\n",
      "The results listings are themselves spread over several pages:\n",
      "\n",
      "!['Ofsted search results'](files/ofstedsearchresults.png)\n",
      "\n",
      "The indvidual school report pages are linked to from the results listing. The report pages are published according to a template, so they all have a similar look, and more importantly, a similar structure at the level of HTML, the language the pages are actually written in.\n",
      "\n",
      "!['Ofsted school report'](files/ofstedschoolreport.png)\n",
      "\n",
      "The brief was was capture the *Overall effectiveness* of each school. A further useful requirement was to obtain the information necessary to be able to pinpoint the location, even if only approximately, of the school on a map.\n",
      "\n",
      "This notebook describes a series of simple steps taken to construct a screenscraping/webscraping tool capable of:\n",
      "\n",
      "* getting a list of **all** the results of a search for primary school reports published in the last week, results that may be spread over several results pages\n",
      "* getting the name, identifier, and postcode (which provides enough information to crudely plot the location) of each school identified in the results\n",
      "\n",
      "Note that the scraper isn't necessarily as efficient as it could be. The intention as much as anything is to demonstrate one possible walkthrough of the sorts of problem solving you might engage in when trying to address this sort of task."
     ]
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Obtaining Information About Each Reported School"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The first step we will take is to look at a typical school report page. The information we primarily want to extract is the *Overall effectiveness*."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "#The first thing we need to do is bring in some helper libraries\n",
      "\n",
      "#We're going to load in web pages, so a tool for doing that\n",
      "import urllib\n",
      "\n",
      "#We may need to do some complex string matching, which typically require the use of regular expressions\n",
      "import re\n",
      "\n",
      "#There are various tools to help us extract information from web page defining HTML. I'm using BeautifulSoup\n",
      "#If you don't have BeautifulSoup installed, uncomment and execute the following shell command\n",
      "#!pip install beautifulsoup\n",
      "from BeautifulSoup import BeautifulSoup"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 131
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Let's start by trying to grab the information from a single report page.\n",
      "\n",
      "Here's what the structure of the page looks like - in the Chrome browser I can use the in-built *Developer Tools* to inspect the HTML structure of an element:\n",
      "\n",
      "* highlight the element in the page\n",
      "* raise the context sensitive menu by right-clicking\n",
      "* select *Inspect Element*\n",
      "* view the result in the developer tools area\n",
      "\n",
      "!['Chrome developer tools'](files/Chromedevelopertools.png)\n",
      "\n",
      "In this case I see that the *Overall effectiveness* result is contained within a `<span>` element that has a particular `class`, *ins-judgement ins-judgement-2*.\n",
      "\n",
      "Let's see if we can grab that."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "#First we need to load in the page from the target web address/URL\n",
      "#urllib2.urlopen(url) opens the connection\n",
      "#.read() reads in the HTML from the connection\n",
      "#BeautifulSoup() parses the HTML and puts it in a form we can work with\n",
      "url='http://www.ofsted.gov.uk/inspection-reports/find-inspection-report/provider/ELS/116734'\n",
      "soup = BeautifulSoup(urllib2.urlopen(url).read())"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 132
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "#We can search the soup to look for span elements with the specified class\n",
      "#A list of results is returned so we pick the first (in fact, only) result which has index value [0]\n",
      "#Then we want to look at the text that is contained within that span element\n",
      "print soup('span', {'class': 'ins-judgement ins-judgement-2'})[0].text"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Good\n"
       ]
      }
     ],
     "prompt_number": 133
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Inspecting some other pages, we notice that the second, numerically qualified *ins-judgement-N* element corresponds to the the overall outcome.\n",
      "\n",
      "This means that if we try to parse a page corresponding to a school with a different outcome, which leads to a class value of *ins-judgement-3* appearing, the scrape won't work - no match will be made.\n",
      "\n",
      "We can get round this by using a regular expression that will match the class on the first part, just the *ins-judgement*. In regular expression speak, the ^ means 'starting at the beginning of the string' and the .\\* means 'match any number (\\*) of any character (.)'."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "print soup('span', {'class': re.compile(r\"^ins-judgement.*\") })[0].text"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Good\n"
       ]
      }
     ],
     "prompt_number": 134
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "We can now make a little function out of this to grab the overall assessment from any report page:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "def stripper(url):\n",
      "    soup=BeautifulSoup(urllib2.urlopen(url).read())\n",
      "    return soup('span', {'class': re.compile(r\"ins-judgement.*\")})[0].text"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 135
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "url='http://www.ofsted.gov.uk/inspection-reports/find-inspection-report/provider/ELS/116734'\n",
      "outcome=stripper(url)\n",
      "print outcome"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Good\n"
       ]
      }
     ],
     "prompt_number": 136
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "So now we have a tool for getting the overall assessment from a report page.\n",
      "\n",
      "The next step is to find a way of getting a list of the results, and more importantly, the web addresses/URLs for the reports listed in those results."
     ]
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Parsing the Search Results Listing"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Recall that the results were presented over several pages, each of a common design. As we did with the individual report pages, let's see if we can grab results from one page first, then worry about how to cover all the pages later.\n",
      "\n",
      "As before, we can right click on an element in Chrome and select *Inspect Element* to see if there are any clues about how we can grab the element(s).\n",
      "\n",
      "!['Ofsted search results HTML'](files/ofstedsearchresultshtml.png)\n",
      "\n",
      "THe results appear to be contained within list items (`<li>`) within an unordered list (`<ul>`) of class `resultsList`.\n",
      "\n",
      "Let's see how to grab the list of items, then for each item the URL, the significant part of which at least is contined in the `href` attribute of the only anchor element (`<a>`) in the list item:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "#Identify a results page, then make some soup from it\n",
      "url='http://www.ofsted.gov.uk/inspection-reports/find-inspection-report/results/any/21/any/any/any/any/any/any/any/week/0/0?page=0'\n",
      "soup=BeautifulSoup(urllib2.urlopen(url).read())"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 137
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "for result in soup('ul',{'class':'resultsList'})[0].findAll('li'):\n",
      "    urlstub=result.find('a')['href']\n",
      "    print urlstub"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "/inspection-reports/find-inspection-report/provider/ELS/101417\n",
        "/inspection-reports/find-inspection-report/provider/ELS/122582\n",
        "/inspection-reports/find-inspection-report/provider/ELS/103970\n",
        "/inspection-reports/find-inspection-report/provider/ELS/116734\n",
        "/inspection-reports/find-inspection-report/provider/ELS/124196\n",
        "/inspection-reports/find-inspection-report/provider/ELS/112497\n",
        "/inspection-reports/find-inspection-report/provider/ELS/120778\n",
        "/inspection-reports/find-inspection-report/provider/ELS/121082\n",
        "/inspection-reports/find-inspection-report/provider/ELS/111939\n",
        "/inspection-reports/find-inspection-report/provider/ELS/112855\n"
       ]
      }
     ],
     "prompt_number": 138
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Let's make a quick function that can use this stub to generate the full web address of a report page, and then get the overal assessment result back from that page."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "def fullstripper(urlstub):\n",
      "    url='http://www.ofsted.gov.uk'+urlstub\n",
      "    return stripper(url)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 139
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Let's tweak the code that grabbed the partial URLs, and use the `fullstripper()` function to display the assessemnt outcome for the school corresponding to each one."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "for result in soup('ul',{'class':'resultsList'})[0].findAll('li'):\n",
      "    urlstub=result.find('a')['href']\n",
      "    print fullstripper(urlstub)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Requires Improvement\n",
        "Inadequate"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n",
        "Good"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n",
        "Good"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n",
        "Requires Improvement"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n",
        "Requires Improvement"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n",
        "Inadequate"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n",
        "Requires Improvement"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n",
        "Good"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n",
        "Requires Improvement"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n"
       ]
      }
     ],
     "prompt_number": 140
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "From the HTML of the results page, I notice that the first paragraph (`<p>`) element in each list item contains an address. Thiese appear to be in a standard, convention form: a posctode appears at the end of each address, preceded by a comma and a space.\n",
      "\n",
      "If we grab the first `<p>` element, split the string inside it at each comma, and pick the last split item, it should contain the postcode. Let's build on the previous bit of code to see if we can grab postcodes too: "
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "for result in soup('ul',{'class':'resultsList'})[0].findAll('li'):\n",
      "    urlstub=result.find('a')['href']\n",
      "    #Rather than print the outcome, pop it into a variable\n",
      "    outcome=fullstripper(urlstub)\n",
      "    #Find the first <p> element, split on commas, get last item\n",
      "    pc=result.find('p').text.split(',')[-1]\n",
      "    print outcome, pc"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Requires Improvement  DA7 6EQ\n",
        "Inadequate"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "  NG15 6EZ\n",
        "Good"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "  DY4 0RN\n",
        "Good"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "  HR6 9LX\n",
        "Requires Improvement"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "  ST16 1PW\n",
        "Requires Improvement"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "  DE55 4BW\n",
        "Inadequate"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "  NR11 8UG\n",
        "Requires Improvement"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "  NR10 3LF\n",
        "Good"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "  PL34 0DU\n",
        "Requires Improvement"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "  DE7 6FS\n"
       ]
      }
     ],
     "prompt_number": 141
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Remember the URL stubs? They had the form `/inspection-reports/find-inspection-report/provider/ELS/112855`. The last part is actually the URN identifier for the school - let's grab that too:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "for result in soup('ul',{'class':'resultsList'})[0].findAll('li'):\n",
      "    urlstub=result.find('a')['href']\n",
      "    #Grab the URN from the end of the URL stub\n",
      "    urn=urlstub.split('/')[-1]\n",
      "    #Rather than print the outcome, pop it into a variable\n",
      "    outcome=fullstripper(urlstub)\n",
      "    #Find the first <p> element, split on commas, get last item\n",
      "    pc=result.find('p').text.split(',')[-1]\n",
      "    print outcome, pc, urn"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Requires Improvement  DA7 6EQ 101417\n",
        "Inadequate"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "  NG15 6EZ 122582\n",
        "Good"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "  DY4 0RN 103970\n",
        "Good"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "  HR6 9LX 116734\n",
        "Requires Improvement"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "  ST16 1PW 124196\n",
        "Requires Improvement"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "  DE55 4BW 112497\n",
        "Inadequate"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "  NR11 8UG 120778\n",
        "Requires Improvement"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "  NR10 3LF 121082\n",
        "Good"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "  PL34 0DU 111939\n",
        "Requires Improvement"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "  DE7 6FS 112855\n"
       ]
      }
     ],
     "prompt_number": 142
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Having got the postcode, there is a service we can use to get a latitude and lonigitude from somewhere around the middle of that unit postcode area. I won't go into details of how this works, but you may be able to figure it out... (JSON is Javascript Object Notation, a popular format for moving data around the web.)"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "#We need another helper library\n",
      "import json\n",
      "#Define a function to get latitude and longitude for a given UK postcode\n",
      "def geoCodePostcode(postcode):\n",
      "    #No spaces allowed in the postcode we pass to the geocoding service\n",
      "    url='http://uk-postcodes.com/postcode/'+postcode.replace(' ','')+'.json'\n",
      "    data = json.load(urllib2.urlopen(url))  \n",
      "    return data['geo']['lat'],data['geo']['lng']"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 143
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "#Let's try it\n",
      "pc='MK7 6AA'\n",
      "lat,lng = geoCodePostcode(pc)\n",
      "print pc, lat, lng"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "MK7 6AA 52.0249148197 -0.709732906623\n"
       ]
      }
     ],
     "prompt_number": 144
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Let's work that in to where we were before:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "for result in soup('ul',{'class':'resultsList'})[0].findAll('li'):\n",
      "    urlstub=result.find('a')['href']\n",
      "    #Grab the URN from the end of the URL stub\n",
      "    urn=urlstub.split('/')[-1]\n",
      "    #Rather than print the outcome, pop it into a variable\n",
      "    outcome=fullstripper(urlstub)\n",
      "    #Find the first <p> element, split on commas, get last item\n",
      "    pc=result.find('p').text.split(',')[-1]\n",
      "    #Geocode the postcode\n",
      "    lat,lng = geoCodePostcode(pc)\n",
      "    print outcome, pc, urn, lat, lng"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Requires Improvement  DA7 6EQ 101417 51.4568773266 0.166706916506\n",
        "Inadequate"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "  NG15 6EZ 122582 53.0310189204 -1.21430274035\n",
        "Good"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "  DY4 0RN 103970 52.5337301944 -2.05587338671\n",
        "Good"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "  HR6 9LX 116734 52.251960633 -2.87982511819\n",
        "Requires Improvement"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "  ST16 1PW 124196 52.8219450955 -2.12846752628\n",
        "Requires Improvement"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "  DE55 4BW 112497 53.0698946677 -1.36432134803\n",
        "Inadequate"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "  NR11 8UG 120778 52.8664965086 1.35545908317\n",
        "Requires Improvement"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "  NR10 3LF 121082 52.6952250035 1.28249418915\n",
        "Good"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "  PL34 0DU 111939 50.6580313296 -4.74839945855\n",
        "Requires Improvement"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "  DE7 6FS 112855 52.9758021017 -1.38102662274\n"
       ]
      }
     ],
     "prompt_number": 145
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Looking good - but what's the name of the school? It's contained by the anchor element:\n",
      "    \n",
      "![Ofsted results school name](files/ofstedschoolname.png)\n",
      "\n",
      "Let's be quick and scruffy in how we pull this out...we're gong to catch the `1` from the comment, but then we'll strip it out..."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "for result in soup('ul',{'class':'resultsList'})[0].findAll('li'):\n",
      "    urlstub=result.find('a')['href']\n",
      "    #Grab the URN from the end of the URL stub\n",
      "    urn=urlstub.split('/')[-1]\n",
      "    #Rather than print the outcome, pop it into a variable\n",
      "    outcome=fullstripper(urlstub)\n",
      "    #Find the first <p> element, split on commas, get last item\n",
      "    pc=result.find('p').text.split(',')[-1]\n",
      "    #Geocode the postcode\n",
      "    lat,lng = geoCodePostcode(pc)\n",
      "    #Get the school name and strip out the cruft\n",
      "    name=result.find('a').text.strip('1')\n",
      "    print outcome, pc, urn, lat, lng, name"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Requires Improvement  DA7 6EQ 101417 51.4568773266 0.166706916506 Mayplace Primary School\n",
        "Inadequate"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "  NG15 6EZ 122582 53.0310189204 -1.21430274035 Annie Holgate Junior School\n",
        "Good"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "  DY4 0RN 103970 52.5337301944 -2.05587338671 Joseph Turner Primary School\n",
        "Good"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "  HR6 9LX 116734 52.251960633 -2.87982511819 Shobdon Primary School\n",
        "Requires Improvement"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "  ST16 1PW 124196 52.8219450955 -2.12846752628 Tillington Manor Primary School\n",
        "Requires Improvement"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "  DE55 4BW 112497 53.0698946677 -1.36432134803 Riddings Junior School\n",
        "Inadequate"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "  NR11 8UG 120778 52.8664965086 1.35545908317 Antingham and Southrepps Community Primary School\n",
        "Requires Improvement"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "  NR10 3LF 121082 52.6952250035 1.28249418915 St Faith&#039;s CofE Primary School\n",
        "Good"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "  PL34 0DU 111939 50.6580313296 -4.74839945855 Tintagel Primary School\n",
        "Requires Improvement"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "  DE7 6FS 112855 52.9758021017 -1.38102662274 Stanley Common CofE Primary School\n"
       ]
      }
     ],
     "prompt_number": 146
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "So we've managed to pull quite a bit of information from the results listing. It might have been tidier to extract the same information from the actual results page, in effect creating an API over the results page, but that can be left as an exercise for the reader... (i.e. just get the urlstb from the results listing then pull *all* the other information from an individual school report page).\n",
      "\n",
      "The next thing we need to address is getting the results list from each results page.\n",
      "\n",
      "![Ofsted results paging](files/ofstedresultspaging.png)\n",
      "\n",
      "There are two things to consider:\n",
      "\n",
      "* *how do we identify each page?* In this case, the results pages are numbered. We guess (and try it to check) that `page=0` also works for the first results page...\n",
      "* *how do we know when we're done?* We could just load the first results page (without using a page index value) and then look for the list of other results pages, grabbing those URLs and then cycling through them. Another technique is to make use of the *Displaying M to N of NN* statement. When *N==NN* we know we're done..."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "#Let's have a quick try at parsing the line that sayas which results are displayed...\n",
      "txt=soup('p',{'class':'resultsSummary'})[0].text\n",
      "print txt\n",
      "#We can use a regular expression to parse out the values of interest\n",
      "m = re.match(\".*Displaying \\d* to (\\d*) of (\\d*) matches.*\", txt)\n",
      "print m.group(1), m.group(2)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Displaying 1 to 10 of 200 matches\n",
        "10 200\n"
       ]
      }
     ],
     "prompt_number": 147
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Let's try to put the pieces together, by first considering the logic of it..."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "#Set the scraper running flag to True...\n",
      "running=True\n",
      "#Start with results page 0\n",
      "page=0\n",
      "\n",
      "#While we've still got results to fetch...\n",
      "while running:\n",
      "    #Create the results page URL\n",
      "    stub='http://www.ofsted.gov.uk/inspection-reports/find-inspection-report/results/any/21/any/any/any/any/any/any/any/week/0/0?page='\n",
      "    url=stub+str(page)\n",
      "    soup=BeautifulSoup(urllib2.urlopen(url).read())\n",
      "    page=page+1\n",
      "    print page,'...',\n",
      "    #Extracting results and then fetching info on each result would go here\n",
      "    bit=soup('p',{'class':'resultsSummary'})[0].text\n",
      "    m = re.match(\".*Displaying \\d* to (\\d*) of (\\d*) matches.*\", bit)\n",
      "    if m.group(1)==m.group(2):\n",
      "        running = False"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "1 ... "
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2 ... "
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "3 ... "
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "4 ... "
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "5 ... "
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "6 ... "
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "7 ... "
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "8 ... "
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "9 ... "
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "10 ... "
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "11 ... "
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "12 ... "
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "13 ... "
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "14 ... "
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "15 ... "
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "16 ... "
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "17 ... "
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "18 ... "
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "19 ... "
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "20 ...\n"
       ]
      }
     ],
     "prompt_number": 148
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "running =True\n",
      "page=0\n",
      "#I'm going to build a list of reports\n",
      "reports=[]\n",
      "while running:\n",
      "    stub='http://www.ofsted.gov.uk/inspection-reports/find-inspection-report/results/any/21/any/any/any/any/any/any/any/week/0/0?page='\n",
      "    soup=BeautifulSoup(urllib2.urlopen(stub+str(page)).read())\n",
      "    page=page+1\n",
      "    print page,'...',\n",
      "    for result in soup('ul',{'class':'resultsList'})[0].findAll('li'):\n",
      "        urlstub=result.find('a')['href']\n",
      "        #Grab the URN from the end of the URL stub\n",
      "        urn=urlstub.split('/')[-1]\n",
      "        #Rather than print the outcome, pop it into a variable\n",
      "        outcome=fullstripper(urlstub)\n",
      "        #Find the first <p> element, split on commas, get last item\n",
      "        pc=result.find('p').text.split(',')[-1]\n",
      "        #Geocode the postcode\n",
      "        lat,lng = geoCodePostcode(pc)\n",
      "        #Get the school name and strip out the cruft\n",
      "        name=result.find('a').text.strip('1')\n",
      "        #Rather than print the data, let's add it to the report list as a line item\n",
      "        #print outcome, pc, urn, lat, lng, name\n",
      "        reports.append([outcome, pc, urn, lat, lng, name])\n",
      "    bit=soup('p',{'class':'resultsSummary'})[0].text\n",
      "    m = re.match(\".*Displaying \\d* to (\\d*) of (\\d*) matches.*\", bit)\n",
      "    if m.group(1)==m.group(2):\n",
      "        running = False"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "1 ... "
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "2 ... "
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "3 ... "
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "4 ... "
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "5 ... "
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "6 ... "
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "7 ... "
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "8 ... "
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "9 ... "
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "10 ... "
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "11 ... "
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "12 ... "
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "13 ... "
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "14 ... "
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "15 ... "
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "16 ... "
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "17 ... "
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "18 ... "
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "19 ... "
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "20 ...\n"
       ]
      }
     ],
     "prompt_number": 149
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "#Preview the first few report lines\n",
      "for report in reports[:5]:\n",
      "    print report"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "[u'Requires Improvement', u' DA7 6EQ', u'101417', 51.4568773266355, 0.16670691650597483, u'Mayplace Primary School']\n",
        "[u'Inadequate', u' NG15 6EZ', u'122582', 53.03101892042224, -1.2143027403533455, u'Annie Holgate Junior School']\n",
        "[u'Good', u' DY4 0RN', u'103970', 52.53373019436544, -2.055873386710021, u'Joseph Turner Primary School']\n",
        "[u'Good', u' HR6 9LX', u'116734', 52.25196063299823, -2.8798251181915, u'Shobdon Primary School']\n",
        "[u'Requires Improvement', u' ST16 1PW', u'124196', 52.821945095483734, -2.128467526279911, u'Tillington Manor Primary School']\n"
       ]
      }
     ],
     "prompt_number": 150
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Outputting the Data"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Having grabbed all the data - and parsed it - we can write it out to a CSV file."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "#Get the CSV helper library that makes sure we write nice CSV out\n",
      "import csv"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 151
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "f = csv.writer(open('sampleReport.csv', 'wb+'))\n",
      "f.writerow(['outcome', 'pc', 'urn', 'lat', 'lng', 'name'])\n",
      "for report in reports:\n",
      "    f.writerow(report)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 152
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "#Use a commandline command to preview the head of the CSV file\n",
      "!head sampleReport.csv"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "outcome,pc,urn,lat,lng,name\r",
        "\r\n",
        "Requires Improvement, DA7 6EQ,101417,51.4568773266355,0.16670691650597483,Mayplace Primary School\r",
        "\r\n",
        "Inadequate, NG15 6EZ,122582,53.03101892042224,-1.2143027403533455,Annie Holgate Junior School\r",
        "\r\n",
        "Good, DY4 0RN,103970,52.53373019436544,-2.055873386710021,Joseph Turner Primary School\r",
        "\r\n",
        "Good, HR6 9LX,116734,52.25196063299823,-2.8798251181915,Shobdon Primary School\r",
        "\r\n",
        "Requires Improvement, ST16 1PW,124196,52.821945095483734,-2.128467526279911,Tillington Manor Primary School\r",
        "\r\n",
        "Requires Improvement, DE55 4BW,112497,53.0698946677246,-1.3643213480338026,Riddings Junior School\r",
        "\r\n",
        "Inadequate, NR11 8UG,120778,52.8664965085534,1.3554590831692763,Antingham and Southrepps Community Primary School\r",
        "\r\n",
        "Requires Improvement, NR10 3LF,121082,52.69522500353183,1.2824941891515342,St Faith&#039;s CofE Primary School\r",
        "\r\n",
        "Good, PL34 0DU,111939,50.65803132959249,-4.748399458554156,Tintagel Primary School\r",
        "\r\n"
       ]
      }
     ],
     "prompt_number": 153
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Summary"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "So that's all there is to it... break the problem down into small steps, and then piece them together..."
     ]
    }
   ],
   "metadata": {}
  }
 ]
}