{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# OpenLearn XML Scraper\n", "\n", "OU OpenLearn materials are published in XML form, which allows some degree of structured access to the document contents.\n", "\n", "For example, we can construct a database of images used across OpenLearn unit material, or a list of quotes, or a list of activities.\n", "\n", "This notebook is a very first pass, just scraping images, and not adding as much metadata to the table (eg parent course) as it should. As and when I get time to tinker, I'll work on this... ;-)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Import some stuff..." ] }, { "cell_type": "code", "execution_count": 69, "metadata": {}, "outputs": [], "source": [ "#You will probably need to pip install requests_cache lxml scraperwikiimport requests\n", "import requests_cache\n", "requests_cache.install_cache('openlearn_cache')\n", "\n", "from urllib.parse import urlsplit, urlunsplit\n", "import unicodedata\n", "from lxml import etree\n", "\n", "import os\n", "\n", "os.environ['SCRAPERWIKI_DATABASE_NAME'] = 'sqlite:///openlearn.sqlite'\n", "#os.environ['SCRAPERWIKI_DATABASE_NAME'] = 'sqlite:///scraperwiki.sqlite'\n", "\n", "import scraperwiki" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "#Example page\n", "xmlurl='http://www.open.edu/openlearn/people-politics-law/politics-policy-people/sociology/the-politics-devolution/altformat-ouxml'\n", "c=requests.get(xmlurl)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## XML Parser\n", "\n", "Routines for parsing the OU XML." ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [], "source": [ "#===\n", "#via http://stackoverflow.com/questions/5757201/help-or-advice-me-get-started-with-lxml/5899005#5899005\n", "def flatten(el): \n", " if el is None: return\n", " result = [ (el.text or \"\") ]\n", " for sel in el:\n", " result.append(flatten(sel))\n", " result.append(sel.tail or \"\")\n", " return unicodedata.normalize(\"NFKD\", \"\".join(result)) or ' '\n", "#===\n", "\n", "\n", "def droptable(table):\n", " print(\"Trying to drop table '{}'\".format(table))\n", " try:\n", " scraperwiki.sqlite.execute('drop table if exists \"{}\"'.format(table))\n", " except:\n", " pass\n", " print('...{} dropped'.format(table))\n", " \n", " \n", "def _course_code(xml_content):\n", " root = etree.fromstring(xml_content)\n", " return flatten(root.find('.//CourseCode'))\n", " \n", "def _xml_figures(xml_content, coursecode='', pageurl='',dbsave=True):\n", " figdicts=[]\n", " try:\n", " root = etree.fromstring(xml_content)\n", " except: return False\n", " figures=root.findall('.//Figure')\n", " #??Note that acknowledgements to figures are provided at the end of the XML file with only informal free text/figure number identifers available for associating a particular acknowledgement/copyright assignment with a given image. It would be so much neater if this could be bundled up with the figure itself, or if the figure and the acknowledgement could share the same unique identifier?\n", " figdict={}\n", " for figure in figures:\n", " figdict = {'xpageurl':pageurl,'caption':'','src':'','coursecode':coursecode,\n", " 'desc':'','owner':'','item':'','itemack':''}\n", " img=figure.find('Image')\n", " #The image url as given does not resolve - we need to add in provided hash info\n", " figdict['srcurl']=img.get('src')\n", " figdict['x_folderhash']=img.get('x_folderhash')\n", " figdict['x_contenthash']=img.get('x_contenthash')\n", " if figdict['x_contenthash'] is not None and figdict['x_contenthash'] is not None:\n", " path = urlsplit(figdict['srcurl'])\n", " sp=path.path.split('/')\n", " path=path._replace( path='/'.join(sp[:-1]+[figdict['x_folderhash'],figdict['x_contenthash']]+sp[-1:]))\n", " figdict['imgurl']=urlunsplit(path)\n", " else:figdict['imgurl']=''\n", " xsrc=img.get('x_imagesrc')\n", " figdict['caption']=flatten(figure.find('Caption'))\n", " #in desc, need to find a way of stripping element from start of description\n", " figdict['desc']=flatten(figure.find('Description'))\n", " #\n", " ref=figure.find('SourceReference')\n", " if ref is not None:\n", " rights=ref.find('ItemRights')\n", " if rights is not None:\n", " figdict['owner']=flatten(rights.find('ItemRights'))\n", " figdict['item']=flatten(rights.find('ItemRights'))\n", " figdict['itemack']=flatten(rights.find('ItemAcknowledgement'))\n", " #print( 'figures',xsrc,caption,desc,src)\n", " figdicts.append(figdict)\n", " if dbsave : scraperwiki.sqlite.save(unique_keys=[],table_name='xmlfigures',data=figdict)\n", " return figdicts\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#http://www.open.edu/openlearn/ocw/pluginfile.php/100160/mod_oucontent/oucontent/859/dd203_1_001i.jpg\n", "#http://www.open.edu/openlearn/ocw/pluginfile.php/100160/mod_oucontent/oucontent/859/0c10275d/2c1a8d77/dd203_1_001i.jpg" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[{'caption': ' Figure 1 Presentation of the Treaty of the Union between England and Scotland to Queen Anne © National Galleries of Scotland',\n", " 'coursecode': '',\n", " 'desc': 'Figure 1',\n", " 'imgurl': 'http://www.open.edu/openlearn/ocw/pluginfile.php/100160/mod_oucontent/oucontent/859/0c10275d/2c1a8d77/dd203_1_001i.jpg',\n", " 'item': '',\n", " 'itemack': '',\n", " 'owner': '',\n", " 'src': '',\n", " 'srcurl': 'http://www.open.edu/openlearn/ocw/pluginfile.php/100160/mod_oucontent/oucontent/859/dd203_1_001i.jpg',\n", " 'x_contenthash': '2c1a8d77',\n", " 'x_folderhash': '0c10275d',\n", " 'xpageurl': ''},\n", " {'caption': ' Figure 2 Illustration of Owain Glyndwr, by permission of Llyfrgell Genedlaethol Cymru/The National Library of Wales',\n", " 'coursecode': '',\n", " 'desc': ' Figure 2',\n", " 'imgurl': 'http://www.open.edu/openlearn/ocw/pluginfile.php/100160/mod_oucontent/oucontent/859/0c10275d/3d1e1b88/dd203_1_002i.jpg',\n", " 'item': '',\n", " 'itemack': '',\n", " 'owner': '',\n", " 'src': '',\n", " 'srcurl': 'http://www.open.edu/openlearn/ocw/pluginfile.php/100160/mod_oucontent/oucontent/859/dd203_1_002i.jpg',\n", " 'x_contenthash': '3d1e1b88',\n", " 'x_folderhash': '0c10275d',\n", " 'xpageurl': ''},\n", " {'caption': 'Figure 3 Contemporary wall mural of the Battle of the Boyne in Northern Ireland © Nezumi Dumousseau',\n", " 'coursecode': '',\n", " 'desc': 'Figure 3',\n", " 'imgurl': 'http://www.open.edu/openlearn/ocw/pluginfile.php/100160/mod_oucontent/oucontent/859/0c10275d/a64b5f52/dd203_1_003i.jpg',\n", " 'item': '',\n", " 'itemack': '',\n", " 'owner': '',\n", " 'src': '',\n", " 'srcurl': 'http://www.open.edu/openlearn/ocw/pluginfile.php/100160/mod_oucontent/oucontent/859/dd203_1_003i.jpg',\n", " 'x_contenthash': 'a64b5f52',\n", " 'x_folderhash': '0c10275d',\n", " 'xpageurl': ''},\n", " {'caption': ' Figure 4 Referendum campaigners for devolution in Scotland, 1997 © Sutton-Hibbert/Rex Features',\n", " 'coursecode': '',\n", " 'desc': 'Figure 4',\n", " 'imgurl': 'http://www.open.edu/openlearn/ocw/pluginfile.php/100160/mod_oucontent/oucontent/859/0c10275d/ddcfba15/dd203_1_004i.jpg',\n", " 'item': '',\n", " 'itemack': '',\n", " 'owner': '',\n", " 'src': '',\n", " 'srcurl': 'http://www.open.edu/openlearn/ocw/pluginfile.php/100160/mod_oucontent/oucontent/859/dd203_1_004i.jpg',\n", " 'x_contenthash': 'ddcfba15',\n", " 'x_folderhash': '0c10275d',\n", " 'xpageurl': ''},\n", " {'caption': None,\n", " 'coursecode': '',\n", " 'desc': None,\n", " 'imgurl': 'http://www.open.edu/openlearn/ocw/pluginfile.php/100160/mod_oucontent/oucontent/859/1b9129f0/d3c986e6/ol_skeleton_keeponlearning_image.jpg',\n", " 'item': '',\n", " 'itemack': '',\n", " 'owner': '',\n", " 'src': '',\n", " 'srcurl': 'http://www.open.edu/openlearn/ocw/pluginfile.php/100160/mod_oucontent/oucontent/859/ol_skeleton_keeponlearning_image.jpg',\n", " 'x_contenthash': 'd3c986e6',\n", " 'x_folderhash': '1b9129f0',\n", " 'xpageurl': ''}]" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Here's an exampl of the figure data as parsed\n", "_xml_figures(c.content,dbsave=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Grab Unit Locations\n", "\n", "OpenLearn publish an OPML feed of units. It used to be hierarchical, grouping unitis in to topics, now it seems to be flat with links to units as well as topic feeds. At some point, I'll grab the topic feeds and use it to generate lookup tables from topics to units." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def getUnitLocations():\n", " #The OPML file lists all OpenLearn units by topic area\n", " srcUrl='http://openlearn.open.ac.uk/rss/file.php/stdfeed/1/full_opml.xml'\n", " tree = etree.parse(srcUrl)\n", " root = tree.getroot()\n", " items=root.findall('.//body/outline')\n", " #Handle each topic area separately?\n", " #The OPML is linear and mixes links to content twith links to topic feeds\n", " #Need to harvest by topic?\n", " for item in items:\n", " tt = item.get('text')\n", " #print( tt)\n", " it=item.get('text')\n", " if it.startswith('Unit content for'):\n", " it=it.replace('Unit content for','')\n", " url=item.get('htmlUrl')\n", " rssurl=item.get('xmlUrl')\n", " #print(url)\n", " xmlurl=url.replace('content-section-0','altformat-ouxml')\n", " #print(xmlurl)\n", " c=requests.get(xmlurl)\n", " _xml_figures(c.content)\n", "\n", "#droptable('xmlfigures')\n", "getUnitLocations()\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## TO DO\n", "\n", "There are other things we can scrape data about as well as images:\n", "\n", "- quotes (`...`)\n", "- activities (``)\n", "- box (` ...`)\n", "- OU coursecode and title (`` and ``)\n", "- identifying references is unstructed in some units, structured in others (``)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Example Query" ] }, { "cell_type": "code", "execution_count": 70, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
xpageurlcaptionsrccoursecodedescowneritemitemacksrcurlx_folderhashx_contenthashimgurl
0NoneNonehttp://www.open.edu/openlearn/ocw/pluginfile.p...1b9129f0d3c986e6http://www.open.edu/openlearn/ocw/pluginfile.p...
1Figure 1 Presentation of the Treaty of the Un...Figure 1http://www.open.edu/openlearn/ocw/pluginfile.p...0c10275d2c1a8d77http://www.open.edu/openlearn/ocw/pluginfile.p...
2Figure 2 Illustration of Owain Glyndwr, by pe...Figure 2http://www.open.edu/openlearn/ocw/pluginfile.p...0c10275d3d1e1b88http://www.open.edu/openlearn/ocw/pluginfile.p...
\n", "
" ], "text/plain": [ " xpageurl caption src coursecode \\\n", "0 None \n", "1 Figure 1 Presentation of the Treaty of the Un... \n", "2 Figure 2 Illustration of Owain Glyndwr, by pe... \n", "\n", " desc owner item itemack \\\n", "0 None \n", "1 Figure 1 \n", "2 Figure 2 \n", "\n", " srcurl x_folderhash \\\n", "0 http://www.open.edu/openlearn/ocw/pluginfile.p... 1b9129f0 \n", "1 http://www.open.edu/openlearn/ocw/pluginfile.p... 0c10275d \n", "2 http://www.open.edu/openlearn/ocw/pluginfile.p... 0c10275d \n", "\n", " x_contenthash imgurl \n", "0 d3c986e6 http://www.open.edu/openlearn/ocw/pluginfile.p... \n", "1 2c1a8d77 http://www.open.edu/openlearn/ocw/pluginfile.p... \n", "2 3d1e1b88 http://www.open.edu/openlearn/ocw/pluginfile.p... " ] }, "execution_count": 70, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "\n", "import sqlite3\n", "conn = sqlite3.connect('openlearn.sqlite')\n", "\n", "pd.read_sql('SELECT * FROM xmlfigures LIMIT 3', conn)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Database API\n", "\n", "We can use Simon Willison's rather wonderful [datasette](https://github.com/simonw/datasette) package to create a server that provides a query interface to the sqlite database. For example, here's a temporary one for demo purposes (this URL is subject to change and my Heroku limits may also max out)...\n", "\n", "https://sheltered-journey-73156.herokuapp.com/openlearn-29f1575/xmlfigures\n", "\n", "It would be nice if it was as easy to push a datesette to Reclaim Hosting as it is to push one to Heroku or Zeit Now..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Scripted Forms" ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [], "source": [ "#! /Users/ajh59/anaconda3/bin/pip install scriptedforms" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Tidy Up" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "os.environ['SCRAPERWIKI_DATABASE_NAME'] = 'sqlite:///scraperwiki.sqlite'" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.4" } }, "nbformat": 4, "nbformat_minor": 2 }