{ "metadata": { "name": "" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Scraping Recent Ofsted Reports" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the UK, schools are regularly inspected by Ofsted, the Office for Standards in Education, Children\u2019s Services and Skills. Inspection reports are made public and can be searched for via the Ofsted website. Searches can be made over different time periods and across different sectors.\n", "\n", "Here's one example of a particular investigation that required looking at reports published in the previous week (one of the available search limits) that referred to primary schools.\n", "\n", "!['Ofsted search form'](files/ofstedsearch.png)\n", "\n", "The search returns pages of two sorts:\n", "\n", "* results listings\n", "* individual school reports\n", "\n", "The results listings are themselves spread over several pages:\n", "\n", "!['Ofsted search results'](files/ofstedsearchresults.png)\n", "\n", "The individual school report pages are linked to from the results listing. The report pages are published according to a template, so they all have a similar look, and more importantly, a similar structure at the level of HTML, the language the pages are actually written in.\n", "\n", "!['Ofsted school report'](files/ofstedschoolreport.png)\n", "\n", "The brief was to capture the *Overall effectiveness* of each school. 
A further useful requirement was to obtain the information necessary to be able to pinpoint the location, even if only approximately, of the school on a map.\n", "\n", "This notebook describes a series of simple steps taken to construct a screenscraping/webscraping tool capable of:\n", "\n", "* getting a list of **all** the results of a search for primary school reports published in the last week, results that may be spread over several results pages\n", "* getting the name, identifier, and postcode (which provides enough information to crudely plot the location) of each school identified in the results\n", "\n", "Note that the scraper isn't necessarily as efficient as it could be. The intention, as much as anything, is to demonstrate one possible walkthrough of the sorts of problem solving you might engage in when trying to address this sort of task." ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Obtaining Information About Each Reported School" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The first step we will take is to look at a typical school report page. The information we primarily want to extract is the *Overall effectiveness*." ] }, { "cell_type": "code", "collapsed": false, "input": [ "#The first thing we need to do is bring in some helper libraries\n", "\n", "#We're going to load in web pages, so a tool for doing that\n", "import urllib2\n", "\n", "#We may need to do some complex string matching, which typically requires the use of regular expressions\n", "import re\n", "\n", "#There are various tools to help us extract information from the HTML that defines a web page. 
I'm using BeautifulSoup\n", "#If you don't have BeautifulSoup installed, uncomment and execute the following shell command\n", "#!pip install BeautifulSoup\n", "from BeautifulSoup import BeautifulSoup" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 131 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's start by trying to grab the information from a single report page.\n", "\n", "Here's what the structure of the page looks like - in the Chrome browser I can use the in-built *Developer Tools* to inspect the HTML structure of an element:\n", "\n", "* highlight the element in the page\n", "* raise the context sensitive menu by right-clicking\n", "* select *Inspect Element*\n", "* view the result in the developer tools area\n", "\n", "!['Chrome developer tools'](files/Chromedevelopertools.png)\n", "\n", "In this case I see that the *Overall effectiveness* result is contained within a `<span>` element that has a particular `class`, *ins-judgement ins-judgement-2*.\n", "\n", "Let's see if we can grab that." 
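Before fetching a live page, it can help to check the extraction logic against a small inline HTML fragment. The sketch below is a stand-alone illustration using Python 3's built-in `html.parser` rather than BeautifulSoup (which the notebook itself uses), and the sample fragment is invented to mimic the shape of the report page markup, not copied from Ofsted:

```python
from html.parser import HTMLParser

class JudgementFinder(HTMLParser):
    """Collect the text of the first <span> whose class starts 'ins-judgement'."""
    def __init__(self):
        super().__init__()
        self.in_target = False   # are we currently inside a matching <span>?
        self.judgement = None    # the first judgement text found, if any

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get('class', '')
        if tag == 'span' and cls.startswith('ins-judgement'):
            self.in_target = True

    def handle_data(self, data):
        if self.in_target and self.judgement is None:
            self.judgement = data.strip()

    def handle_endtag(self, tag):
        if tag == 'span':
            self.in_target = False

# An invented fragment with the same shape as the report page markup
sample = '<div><span class="ins-judgement ins-judgement-2">Good</span></div>'
finder = JudgementFinder()
finder.feed(sample)
# finder.judgement == 'Good'
```

The same matching idea, expressed with BeautifulSoup against the real page, is what the next cell does.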
] }, { "cell_type": "code", "collapsed": false, "input": [ "#First we need to load in the page from the target web address/URL\n", "#urllib2.urlopen(url) opens the connection\n", "#.read() reads in the HTML from the connection\n", "#BeautifulSoup() parses the HTML and puts it in a form we can work with\n", "url='http://www.ofsted.gov.uk/inspection-reports/find-inspection-report/provider/ELS/116734'\n", "soup = BeautifulSoup(urllib2.urlopen(url).read())" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 132 }, { "cell_type": "code", "collapsed": false, "input": [ "#We can search the soup to look for span elements with the specified class\n", "#A list of results is returned so we pick the first (in fact, only) result which has index value [0]\n", "#Then we want to look at the text that is contained within that span element\n", "print soup('span', {'class': 'ins-judgement ins-judgement-2'})[0].text" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "Good\n" ] } ], "prompt_number": 133 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Inspecting some other pages, we notice that the second, numerically qualified part of the class, *ins-judgement-N*, corresponds to the overall outcome.\n", "\n", "This means that if we try to parse a page corresponding to a school with a different outcome, which leads to a class value of *ins-judgement-3* appearing, the scrape won't work - no match will be made.\n", "\n", "We can get round this by using a regular expression that will match just the first part of the class, the *ins-judgement*. In regular expression speak, the ^ means 'starting at the beginning of the string' and the .\* means 'match any number (\*) of any character (.)'." 
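As a quick sanity check of that pattern, here is how it behaves against a few sample class attribute values (invented for illustration; the third is a deliberate non-match):

```python
import re

# The pattern described above: ^ anchors the match at the start of the
# class string, and .* then matches whatever follows
pattern = re.compile(r"^ins-judgement.*")

# Invented sample class attribute values
samples = ["ins-judgement ins-judgement-2",
           "ins-judgement ins-judgement-3",
           "some-other-class"]

matches = [bool(pattern.match(s)) for s in samples]
# matches == [True, True, False]
```

Because `re.match` (which BeautifulSoup uses when you pass a compiled pattern as an attribute filter) already anchors at the start of the string, the `^` is strictly redundant here, but it makes the intent explicit.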
] }, { "cell_type": "code", "collapsed": false, "input": [ "print soup('span', {'class': re.compile(r\"^ins-judgement.*\") })[0].text" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "Good\n" ] } ], "prompt_number": 134 }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can now make a little function out of this to grab the overall assessment from any report page:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "def stripper(url):\n", "    soup=BeautifulSoup(urllib2.urlopen(url).read())\n", "    return soup('span', {'class': re.compile(r\"^ins-judgement.*\")})[0].text" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 135 }, { "cell_type": "code", "collapsed": false, "input": [ "url='http://www.ofsted.gov.uk/inspection-reports/find-inspection-report/provider/ELS/116734'\n", "outcome=stripper(url)\n", "print outcome" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "Good\n" ] } ], "prompt_number": 136 }, { "cell_type": "markdown", "metadata": {}, "source": [ "So now we have a tool for getting the overall assessment from a report page.\n", "\n", "The next step is to find a way of getting a list of the results, and more importantly, the web addresses/URLs for the reports listed in those results." ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Parsing the Search Results Listing" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Recall that the results were presented over several pages, each of a common design. 
As we did with the individual report pages, let's see if we can grab results from one page first, then worry about how to cover all the pages later.\n", "\n", "As before, we can right click on an element in Chrome and select *Inspect Element* to see if there are any clues about how we can grab the element(s).\n", "\n", "!['Ofsted search results HTML'](files/ofstedsearchresultshtml.png)\n", "\n", "The results appear to be contained within list items (`<li>`) within an unordered list (`