{ "cells": [ { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "# Intro to Scraping" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "Very often there is data on the internet that we would love to use as digital humanists. But, perhaps because it is humanities data, the people publishing it online might not have made it available in a format that you can easily use. In a perfect world, everyone would publish clearly described dumps of their data in machine-readable formats. In reality, people often just put things on a web page and call it a day. Web scraping refers to using a computer program to pull down the content of a web page (or, often, many web pages). Scraping is very powerful: once you get the hang of it, your potential objects of study will increase dramatically, as you'll no longer be limited to the data that others make available to you. You can start building your own corpora using real-world information.\n", "\n", "This lesson will call on your knowledge of HTML and CSS, which we covered earlier in the week. If you need a refresher, don't hesitate to ask! A little bit goes a long way when it comes to scraping. To get started, we need to import a few packages, but before we can import them we have to install them!" 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "$ pip3 install bs4\n", "\n", "$ pip3 install lxml" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we'll import the packages in Python:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "from bs4 import BeautifulSoup\n", "from urllib import request" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "Each of these takes care of a certain aspect of the process. The `request` module, part of Python's built-in `urllib` package, fetches pages from the web. The main one to know here is [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/), which is the Python library that allows us to process HTML we've pulled down from the web. The name comes from \"Alice in Wonderland,\" which is a fun fact you can throw around at parties. We'll need a base link to scrape from. I've set up a number of texts at the following GitHub repository:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": true, "deletable": true, "editable": true }, "outputs": [], "source": [ "url = \"https://github.com/humanitiesprogramming/scraping-corpus\"" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "Now that we have that link saved as a variable, we can call it up again later. " ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "https://github.com/humanitiesprogramming/scraping-corpus\n" ] } ], "source": [ "print(url)" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "We can also build on the URL if we want to use it as a base for a variation." 
] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "https://github.com/humanitiesprogramming/scraping-corpus/subdomain\n" ] } ], "source": [ "print(url + \"/subdomain\")" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "We will use that URL to grab the basic HTML for a number of pages underneath it in the page structure. But first we need to figure out what those links are. Going to [the page](https://github.com/humanitiesprogramming/scraping-corpus) makes it pretty clear that there are a number of links that we want to grab, each of which pertains to a particular text. We could just copy and paste all those links ourselves to make a to-do list: \n", "\n", "* link one\n", "* link two\n", "* link three\n", "\n", "and so on, and then pull in the contents from each page. But we can also get the list of links for the pages we want to scrape by scraping them as well! This is usually a quicker way of grabbing the contents of a large number of pages on a site.\n", "\n", "The following code uses `request`, a module in Python's built-in `urllib` package, to go out and visit that webpage. The following two lines say, \"Take the link stored in the variable 'url'. Visit it, read back to me what you find, and store that result in a new variable named HTML.\"" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "b'\\n\\n\\n\\n\\n\\n\\n\\n
\\n \\n\\n\\n\\n \\n \\n \\n \\n \\n \\n\\n \\n \\n