{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "#
How To Retrieve Unstructured Web Data In a Structured Manner with Riko
\n", "##
A Riveting 15-688 Tutorial

*by* Ahmet Emre Unal ([aemreunal](https://github.com/aemreunal))
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You might have heard about [Google Reader](https://en.wikipedia.org/wiki/Google_Reader). It was a free [RSS](https://en.wikipedia.org/wiki/RSS) reader that brought RSS reading to the masses. It was a great product and I, personally, was a very heavy user. Google Reader allowed me to follow many websites that publish things infrequently. This, though, was only possible through the RSS feeds published by those websites.\n", "\n", "It's great when a website admin takes the time to create the necessary RSS feeds (or implements the tool that does it), but every so often you come across a website that you want to follow but that doesn't have an RSS feed. How, then, can you make use of this beautiful system? Can you somehow parse the plain HTML web page to retrieve data in an ordered fashion?\n", "\n", "[Riko](https://github.com/nerevu/riko/) is a library that allows you to do exactly that. By using Riko, we can parse the plain HTML of a website and retrieve its elements in an orderly fashion, like iterating through ```<li>``` elements with a for-loop.
\n", "\n", "I personally believe in walking through examples to learn something, so let's jump right in (if you would like to follow along, you can [install Riko](https://github.com/nerevu/riko/blob/master/README.rst#installation) in your local environment):" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import os\n", "import itertools\n", "from riko.collections.sync import SyncPipe\n", "\n", "# Builds a file:// URL that points at one of the test sites created below\n", "def get_test_site_url(test_site_name):\n", "    return 'file://' + os.getcwd() + '/test_sites/' + test_site_name" ] },
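{ "cell_type": "markdown", "metadata": {}, "source": [ "Since Riko fetches pages through URLs, a ```file://``` URL lets us point it at local files. Just to illustrate what the helper above returns (the exact path depends on where you run this notebook; the one below is made up):\n", "```python\n", "print get_test_site_url('test1.html')\n", "# file:///path/to/notebook/test_sites/test1.html\n", "```" ] },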

{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "##########################################################################################\n", "#\n", "# Note: You can use the following section to create the test sites' files:\n", "#\n", "##########################################################################################\n", "\n", "test_site_1_contents = '''<html>\\n<head>\\n<title>This is a simple example</title>\\n</head>\\n<body>\\n<ul>\\n    <li>Item 1</li>\\n    <li>Item 2</li>\\n</ul>\\n</body>\\n</html>'''\n", "\n", "test_site_2_contents = '''<html>\\n<head>\\n<title>This is a slightly more complex example</title>\\n</head>\\n<body>\\n<div>\\n<ul>\\n    <li>Item 1</li>\\n    <li>Item 2</li>\\n    <li>Item 3</li>\\n</ul>\\n</div>\\n</body>\\n</html>
'''\n", "\n", "# The following code creates the test sites' files if they don't already exist:\n", "\n", "path = os.getcwd() + '/test_sites/'\n", "\n", "# Check if the 'test_sites' folder exists\n", "if not os.path.exists(path):\n", "    os.mkdir(path)  # Create the 'test_sites' folder\n", "\n", "# Check if the 'test1.html' file exists\n", "if not os.path.exists(path + 'test1.html'):\n", "    with open(path + 'test1.html', \"w\") as test_site_1:\n", "        test_site_1.write(test_site_1_contents)\n", "\n", "# Check if the 'test2.html' file exists\n", "if not os.path.exists(path + 'test2.html'):\n", "    with open(path + 'test2.html', \"w\") as test_site_2:\n", "        test_site_2.write(test_site_2_contents)\n", "\n", "##########################################################################################" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the ```test_sites``` folder, you will now find a number of HTML files that serve as simple example websites. The first one, ```test1.html```, is as follows:\n", "```html\n", "<html>\n", "<head>\n", "<title>This is a simple example</title>\n", "</head>\n", "<body>\n", "<ul>\n", "    <li>Item 1</li>\n", "    <li>Item 2</li>\n", "</ul>\n", "</body>\n", "</html>\n", "```\n", "
Riko sees things through what's called a 'pipe'. By fetching a webpage through a URL and pointing Riko to the appropriate part of said webpage, we can obtain 'streams' from those pipes, which we can then iterate over. Let's start with the very simple act of retrieving the webpage in its entirety. We can achieve this with the [```fetchpage```](https://github.com/nerevu/riko/blob/master/riko/modules/fetchpage.py) module, which will literally just fetch a page:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "url = get_test_site_url('test1.html')  # The URL of our test website\n", "fetch_conf = {'url': url}  # A configuration dictionary for Riko\n", "pipe = SyncPipe('fetchpage', conf=fetch_conf)  # A pipe that streams 'test1.html'\n", "stream = pipe.output  # The stream being output from the pipe" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What we did was tell Riko to create a synchronous pipe (using the [SyncPipe class](https://github.com/nerevu/riko/blob/master/riko/collections/sync.py)) that uses the webpage-fetching module (called ```fetchpage```) to fetch the URL specified in the ```fetch_conf``` configuration dictionary.\n", "\n", "We could've created the stream by using the ```fetchpage``` module directly:\n", "```python\n", "from riko.modules import fetchpage\n", "stream = fetchpage.pipe(conf=fetch_conf)\n", "```\n", "but we'll see in a bit why we're using the ```SyncPipe``` class.\n", "\n", "You might've wondered: when did Riko even have the time to go fetch the page? Well, pipes in Riko are *lazy*. That means a pipe won't start fetching (or processing) a URL before we start iterating over its stream.
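\n", "\n", "Just to see this laziness in action, here's a quick sketch (assuming the stream behaves like a standard Python iterator; ```lazy_stream``` is just an illustrative name, not part of Riko):\n", "```python\n", "lazy_stream = SyncPipe('fetchpage', conf=fetch_conf).output  # nothing has been fetched yet\n", "first_item = next(iter(lazy_stream))  # the actual fetch happens here\n", "```\n", "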

So let's iterate:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "for item in stream:\n", "    print item" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I told you it would literally just fetch the entire page:\n", "```python\n", "{u'content': '<html>\\n<head>\\n<title>This is a simple example</title>\\n</head>\\n<body>\\n<ul>\\n    <li>Item 1</li>\\n    <li>Item 2</li>\\n</ul>\\n</body>\\n</html>'}\n", "```\n", "The whole webpage being printed is not really that useful; there is nothing special about this.
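\n", "\n", "Each item in the stream is a plain dictionary, though, so the raw markup is at least programmatically accessible under the ```content``` key (the key name comes from the output above). Note that a stream is consumed once iterated over, so this sketch builds a fresh pipe:\n", "```python\n", "stream = SyncPipe('fetchpage', conf=fetch_conf).output\n", "for item in stream:\n", "    print item['content'][:20]  # the first 20 characters of the fetched page\n", "```\n", "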
We could've at least specified a start and end tag for Riko to fetch only that part:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "fetch_conf = {  # The same config as above, but with the start and end tags to fetch specified\n", "    'url': url,\n", "    'start': '<html>',\n", "    'end': '</html>'\n", "}\n", "pipe = SyncPipe('fetchpage', conf=fetch_conf)  # A pipe that streams 'test1.html' according to the config above\n", "stream = pipe.output  # The stream being output from the pipe\n", "\n", "for item in stream:\n", "    print item" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This isn't very useful either, honestly:

```python\n", "{u'content': '\\n<head>\\n<title>This is a simple example</title>\\n</head>\\n<body>\\n<ul>\\n    <li>Item 1</li>\\n    <li>Item 2</li>\\n</ul>\\n</body>\\n'}\n", "```\n", "To get to the list items we want, we'd need to do some weird string processing.
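\n", "\n", "For instance, pulling the items out of the raw string by hand might look something like this sketch (it reuses the last ```item``` printed above; real-world markup would make this far messier):\n", "```python\n", "content = item['content']\n", "items = []\n", "chunk = content\n", "# Slice out the text between each <li> and </li> pair:\n", "while '<li>' in chunk:\n", "    start = chunk.index('<li>') + len('<li>')\n", "    end = chunk.index('</li>', start)\n", "    items.append(chunk[start:end])\n", "    chunk = chunk[end:]\n", "\n", "print items  # ['Item 1', 'Item 2']\n", "```\n", "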
We don't want to do that, and that's why we have Riko!\n", "***\n", "Let's take a side step and ask ourselves a question: a URL is a string that points to a webpage (or a file in the filesystem), but what could point to an element *inside* a webpage? The answer is [XPath](https://en.wikipedia.org/wiki/XPath). XPath is very similar to a URL, except that it denotes a path inside a markup file. For example, the XPath of the ```