{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "We are going to use [Scrapy](https://scrapy.org/), which is a Python framework for automating data extraction from webpages. In this notebook we are following along with a very good introduction to web scraping in general (along with some other tools that you might wish to explore) at the [Library Carpentry, \"Webscraping with Python\"\n", "October 2016](http://labs.timtom.ch/library-webscraping) site.\n", "\n", "Scrapy has already been installed in this notebook for you; but for reference, you can install it with `pip install scrapy`. " ] },
{ "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Scrapy 1.5.1\r\n" ] } ], "source": [ "!scrapy version" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "We're going to create a spider that scrapes the search results page of the [finds.org.uk database](https://finds.org.uk/database) run by the Portable Antiquities Scheme. Our example search will be for Roman stamped bricks. " ] },
{ "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "New Scrapy project 'romanbrick', using template directory '/srv/conda/lib/python3.6/site-packages/scrapy/templates/project', created in:\r\n", " /home/jovyan/romanbrick\r\n", "\r\n", "You can start your first spider with:\r\n", " cd romanbrick\r\n", " scrapy genspider example example.com\r\n" ] } ], "source": [ "!scrapy startproject romanbrick" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "If you go to the Home page (right-click on the Jupyter logo at top left and open it in a new tab) you'll see the new folder called `romanbrick`. _Inside_ that folder there is a new folder called (again) `romanbrick` and a file called `scrapy.cfg`. The second `romanbrick` folder contains all of the plumbing for our spider. Your entire Scrapy project then looks like this:\n", "\n", "---\n", "```\n", "romanbrick/          # the root project directory\n", "    scrapy.cfg       # deploy configuration file\n", "\n", "    romanbrick/      # the project's Python module; you'll import your code from here\n", "        __init__.py\n", "\n", "        items.py     # holds the structure of the data we want to collect (see the sketch below)\n", "\n", "        middlewares.py   # we aren't going to use this today -\n", "                         # a file to manipulate how spiders process input, in case pretending to be a normal HTTP browser doesn't work\n", "\n", "        pipelines.py     # we aren't going to use this today, either -\n", "                         # once our spider writes things to the \"item,\" we can use pipelines to do additional processing before we export it\n", "\n", "        settings.py      # project settings file\n", "\n", "        spiders/         # a directory where you'll later put your spiders\n", "            __init__.py\n", "            ...\n", "```\n", "\n", "---" ] },
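{ "cell_type": "markdown", "metadata": {}, "source": [ "We won't edit `items.py` in this tutorial, but to give you a sense of what \"the structure of the data\" means, a minimal item definition might look something like the sketch below. The class name follows Scrapy's usual `<ProjectName>Item` pattern; the field names (`description`, `link`) are purely illustrative assumptions, not fields we define later on.\n", "\n", "```\n", "# romanbrick/romanbrick/items.py - a sketch, not required for this tutorial\n", "import scrapy\n", "\n", "\n", "class RomanbrickItem(scrapy.Item):\n", "    description = scrapy.Field()   # e.g. the text describing a find\n", "    link = scrapy.Field()          # e.g. the URL of the record\n", "```" ] },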
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Spiders\n", "\n", "The `spiders` directory is where we're going to put our crawler (spiders...crawlers, geddit?). Spiders have three parts:\n", "\n", "1. the starting URL(s)\n", "2. a list of allowed domains, or places the crawler is allowed to go, for fear of downloading the entire internet (not an idle fear, actually)\n", "3. a way of parsing the results, to get the data you're after.\n", "\n", "Scrapy makes this easy for us. Say we're interested in Roman stamped bricks. At the finds.org.uk database search page, we enter 'stamped brick' and end up with the following url:\n", "\n", "`https://finds.org.uk/database/search/results/q/stamped+brick`\n", "\n", "To create the spider, we tell Scrapy `scrapy genspider <name> <url>`, so:" ] },
{ "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Created spider 'rbrickspider' using template 'basic' \r\n" ] } ], "source": [ "!scrapy genspider rbrickspider \"finds.org.uk/database/search/results/q/stamped+brick\"" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "**Note** Open the file list (right-click on the Jupyter logo and open it in a new tab so you don't close this notebook), and see where your `rbrickspider.py` file has been created. Tick its checkbox, select 'move', and put it in the `spiders` folder, e.g. `romanbrick/romanbrick/spiders`. Once it's moved, you can click on the file and it will open in the text editor - it'll look like this:\n", "\n", "```\n", "# -*- coding: utf-8 -*-\n", "import scrapy\n", "\n", "\n", "class RbrickspiderSpider(scrapy.Spider):\n", "    name = 'rbrickspider'\n", "    allowed_domains = ['finds.org.uk/database/search/results/q/stamped+brick']\n", "    start_urls = ['http://finds.org.uk/database/search/results/q/stamped+brick/']\n", "\n", "    def parse(self, response):\n", "        pass\n", "```\n", "\n", "Look at the `allowed_domains`. That's a bit too restrictive - only pages that exactly match that pattern would be allowed. We want all the materials, so let's let the spider grab everything inside the top-level domain. Change the pattern for `allowed_domains` to just the top level, then save the file. (psst: `finds.org.uk`)." ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Let's run it! " ] },
{ "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "/home/jovyan/romanbrick\n" ] } ], "source": [ "%cd romanbrick" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!scrapy crawl rbrickspider -s DEPTH_LIMIT=1" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "It worked! You can tell by the `downloader/response_status_count/200` line in the log output - the scraper was able to reach the webpage and interact with it. However, because we haven't told the spider _what_ to parse, it hasn't grabbed anything." ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Replace the stub `parse` method in your `rbrickspider.py` file (which is in `romanbrick/romanbrick/spiders`, remember) with the following:\n", "\n", "```\n", "    def parse(self, response):\n", "        with open(\"test.html\", 'wb') as file:\n", "            file.write(response.body)\n", "```\n", "\n", "This writes the html of the page to a file. Save that, then run the spider again in the block below. Be careful with the copying and pasting; Python is very fussy about tabs versus spaces. If you get an 'unexpected indent' error, make sure that you haven't mixed spaces with tabs." ] },
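{ "cell_type": "markdown", "metadata": {}, "source": [ "If you want to check your work before running it, the whole `rbrickspider.py` file should now look roughly like the sketch below - this is just the pieces from above assembled in one place (your `start_urls` line will be whatever `genspider` wrote for you):\n", "\n", "```\n", "# -*- coding: utf-8 -*-\n", "import scrapy\n", "\n", "\n", "class RbrickspiderSpider(scrapy.Spider):\n", "    name = 'rbrickspider'\n", "    allowed_domains = ['finds.org.uk']\n", "    start_urls = ['http://finds.org.uk/database/search/results/q/stamped+brick/']\n", "\n", "    def parse(self, response):\n", "        # save the raw html of the search results page to disk\n", "        with open(\"test.html\", 'wb') as file:\n", "            file.write(response.body)\n", "```" ] },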
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# paste your crawl command in this block and run it." ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "If you check the contents of this directory with the `ls` command, you should see a new file, `test.html`." ] },
{ "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "romanbrick scrapy.cfg\ttest.html\r\n" ] } ], "source": [ "!ls" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Let's look at the first few lines of the file with the `head` command. Have we actually scraped anything?" ] },
{ "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\r\n", "\r\n", "\r\n", "\r\n", "\r\n", " Search results from the database Page: 1\r\n", " \r\n", "\r\n", "\r\n", "\r\n" ] } ], "source": [ "!head test.html" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "But saving webpages isn't the point of scraping; rather, we want the data itself. In the file list screen (accessed by clicking on the Jupyter logo at top left) tick the box beside `test.html` and then select `edit`. Let's scroll through the results. Examine the html, paying attention to the `<div>` and `<p>` elements that wrap the information we want, and how these elements nest one within the other.\n", "\n", "We can tell Scrapy which of those bits of information we want to collect using [Xpath Selectors](https://resbazsql.github.io/lc-webscraping/02-dom/). Many websites are built by pulling data out of a database and then displaying it through a 'document object model'; this model tells the webpage where to put each piece of information and how to display it. An 'xpath' lets us specify which object in this model we want. We tell Scrapy the xpath to the info, and it duly collects all the data found at that path. \n", "\n", "There are a variety of ways of finding the correct xpath; some people recommend this [scraper extension for Chrome](https://chrome.google.com/webstore/detail/scraper/mbigbapnjcgaffohmbkdlecaccepngjd).\n", "\n", "If you open a new terminal in Jupyter, type `cd romanbrick` and then\n", "\n", "`scrapy shell \"https://finds.org.uk/database/search/results/q/stamped+brick\"`\n", "\n", "you can try different xpath combinations and see what results you get using the `response.xpath` command:\n", "\n", "`>>> response.xpath(\"//your-xpath-selector-goes-here\")` (make sure any quotation marks inside the selector are single quotes)\n",
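"\n", "For instance - just as an illustration, any valid xpath will do - you can check that the shell is working by asking for the page title, which we already saw near the top of `test.html`:\n", "\n", "```\n", ">>> response.xpath(\"//title/text()\").extract()\n", "```\n",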
"\n", "After some examination, we see that `//*[@id='preview']/p/text()` will get the information we want.\n", "\n", "Edit `rbrickspider.py` so that you're parsing for just that data:\n", "\n", "```\n", "    def parse(self, response):\n", "        print(response.xpath(\"//*[@id='preview']/p/text()\").extract())\n", "```\n", "\n", "Now run\n", "```\n", "!scrapy crawl rbrickspider -s DEPTH_LIMIT=1\n", "```\n", "\n", "in the code block below. Check with `!pwd` first to make sure you're in `/home/jovyan/romanbrick`. \n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# paste the code here" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "We're getting there! But all of the data is coming out in one big mess. Edit `rbrickspider.py` again so that you loop over the results and print each one on its own line:\n", "\n", "```\n", "    def parse(self, response):\n", "        for resource in response.xpath(\"//*[@id='preview']/p/text()\"):\n", "            print(resource.extract())\n", "```\n", "\n", "Now run\n", "```\n", "!scrapy crawl rbrickspider -s DEPTH_LIMIT=1\n", "```\n", "\n", "(By the way - if you ever find that nothing happens when you run the scrapy crawl command, try refreshing the browser window and re-running the `%cd romanbrick` command first.)" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# put your code here" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "When you look at the underlying html of the finds.org.uk search results page using Chrome's inspector, you'll notice that the images are all contained within a `ul` tag. You can right-click on that html and select 'Copy XPath'. You should get `//*[@id='preview']/ul/li/div/a`. Swap that into your spider and run it." ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "At this point, you'll be getting the hang of it. For the final step of writing your data to csv, please scroll down to the [Writing to a csv](https://resbazsql.github.io/lc-webscraping/04-scrapy/) section of the Library Carpentry tutorial - but make sure you right-click on that link and open it in a new tab! You now know how to grab two different sets of data from the finds page.\n",
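"\n", "If you'd like a head start on that last step, one possible shape for the parse method is sketched below: yield one dictionary per result instead of printing it, and Scrapy's feed export can then write those dictionaries out for you. (The field name `description` here is just an illustrative choice, not something the tutorial prescribes.)\n", "\n", "```\n", "    def parse(self, response):\n", "        # yield one dictionary per result paragraph\n", "        for resource in response.xpath(\"//*[@id='preview']/p/text()\"):\n", "            yield {'description': resource.extract()}\n", "```\n", "\n", "With that in place, running `scrapy crawl rbrickspider -o results.csv` writes each yielded dictionary as a row of `results.csv`.\n",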
"\n", "Complete the tutorial so that you end up with a table of results from the finds.org.uk webpage.\n", "\n", "Save your work, and remember to File -> Download as -> Notebook (.ipynb) when you're done; this Binder will time out after ten minutes." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.5" } }, "nbformat": 4, "nbformat_minor": 2 }