{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Web scraping the President's lies in 16 lines of Python\n", "\n", "*Created by Kevin Markham of [Data School](http://www.dataschool.io/). Hosted on [GitHub](https://github.com/justmarkham/trump-lies).*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Summary\n", "\n", "This is an introductory tutorial on web scraping in Python. All that is required to follow along is a basic understanding of the Python programming language.\n", "\n", "By the end of this tutorial, you will be able to scrape data from a static web page using the **requests** and **Beautiful Soup** libraries, and export that data into a structured text file using the **pandas** library." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Outline\n", "\n", "- What is web scraping?\n", "- Examining the New York Times article\n", "    - Examining the HTML\n", "    - Fact 1: HTML consists of tags\n", "    - Fact 2: Tags can have attributes\n", "    - Fact 3: Tags can be nested\n", "- Reading the web page into Python\n", "- Parsing the HTML using Beautiful Soup\n", "    - Collecting all of the records\n", "    - Extracting the date\n", "    - Extracting the lie\n", "    - Extracting the explanation\n", "    - Extracting the URL\n", "    - Recap: Beautiful Soup methods and attributes\n", "- Building the dataset\n", "    - Applying a tabular data structure\n", "    - Exporting the dataset to a CSV file\n", "- Summary: 16 lines of Python code\n", "    - Appendix A: Web scraping advice\n", "    - Appendix B: Web scraping resources\n", "    - Appendix C: Alternative syntax for Beautiful Soup" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## What is web scraping?\n", "\n", "On July 21, 2017, the New York Times updated an opinion article called [Trump's Lies](https://www.nytimes.com/interactive/2017/06/23/opinion/trumps-lies.html), detailing every public lie the President has told since taking office. 
Because this is a newspaper, the information was (of course) published as a block of text. This is a great format for human consumption, but it can't easily be understood by a computer. **In this tutorial, we'll extract the President's lies from the New York Times article and store them in a structured dataset.**\n", "\n", "This is a common scenario: You find a web page that contains data you want to analyze, but it's not presented in a format that you can easily download and read into your favorite data analysis tool. You might imagine manually copying and pasting the data into a spreadsheet, but in most cases, that is far too time-consuming. A technique called **web scraping** is a useful way to automate this process.\n", "\n", "What is web scraping? It's the process of extracting information from a web page **by taking advantage of patterns** in the web page's underlying code. Let's start looking for these patterns!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Examining the New York Times article\n", "\n", "Here's the way the article presented the information:\n", "\n", "![Screenshot of the article](images/article_1.png)\n", "\n", "When converting this into a dataset, **you can think of each lie as a \"record\" with four fields:**\n", "\n", "1. The date of the lie.\n", "2. The lie itself (as a quotation).\n", "3. The writer's brief explanation of why it was a lie.\n", "4. The URL of an article that substantiates the claim that it was a lie.\n", "\n", "Importantly, those fields have different formatting, which is consistent throughout the article: the date is bold red text, the lie is \"regular\" text, the explanation is gray italic text, and the URL is linked from the gray italic text.\n", "\n", "**Why does the formatting matter?** Because it's very likely that the code underlying the web page \"tags\" those fields differently, and we can take advantage of that pattern when scraping the page. 
Let's take a look at the source code, known as HTML:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Examining the HTML\n", "\n", "To view the HTML code that generates a web page, right-click on the page and select \"View Page Source\" in Chrome or Firefox, \"View Source\" in Internet Explorer, or \"Show Page Source\" in Safari. (If that option doesn't appear in Safari, just open Safari Preferences, select the Advanced tab, and check \"Show Develop menu in menu bar\".)\n", "\n", "Here are the first few lines you will see if you view the source of the New York Times article:\n", "\n", "![Screenshot of the source](images/source_1.png)\n", "\n", "Let's locate the **first lie** by searching the HTML for the text \"iraq\":\n", "\n", "![Screenshot of the source](images/source_2.png)\n", "\n", "Thankfully, you only have to understand **three basic facts** about HTML in order to get started with web scraping!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Fact 1: HTML consists of tags\n", "\n", "You can see that the HTML contains the article text, along with \"tags\" (specified using angle brackets) that \"mark up\" the text. (\"HTML\" stands for HyperText Markup Language.)\n", "\n", "For example, one tag is `<strong>`, which means \"use bold formatting\". There is a `<strong>` tag before \"Jan. 21\" and a `</strong>` tag after it. The first is an \"opening tag\" and the second is a \"closing tag\" (denoted by the `/`), which indicates to the web browser **where to start and stop applying the formatting.** In other words, this tag tells the web browser to make the text \"Jan. 21\" bold. (Don't worry about the `&nbsp;` - we'll deal with that later.)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Fact 2: Tags can have attributes\n", "\n", "HTML tags can have \"attributes\", which are specified in the opening tag. 
For example, `<span class=\"short-desc\">` indicates that this particular `<span>` tag has a `class` attribute with a value of `short-desc`.\n", "\n", "For the purpose of web scraping, **you don't actually need to understand** the meaning of `<span>`, `class`, or `short-desc`. Instead, you just need to recognize that tags can have attributes, and that they are specified in this particular way." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Fact 3: Tags can be nested\n", "\n", "Let's pretend my HTML code said:\n", "\n", "`Hello <strong><em>Data School</em> students</strong>`\n", "\n", "The text **Data School students** would be bold, because all of that text is between the opening `<strong>` tag and the closing `</strong>` tag. The text ***Data School*** would also be in italics, because the `<em>` tag means \"use italics\". The text \"Hello\" would be neither bold nor italic, because it's not within either the `<strong>` or `<em>` tags. Thus, it would appear as follows:\n", "\n", "Hello ***Data School* students**\n", "\n", "The central point to take away from this example is that **tags \"mark up\" text from wherever they open to wherever they close,** regardless of whether they are nested within other tags.\n", "\n", "Got it? You now know enough about HTML to start web scraping!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Reading the web page into Python\n", "\n", "The first thing we need to do is read the HTML for this article into Python, which we'll do using the [requests](http://docs.python-requests.org/en/master/) library. (If you don't have it, you can `pip install requests` from the command line.)" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import requests\n", "r = requests.get('https://www.nytimes.com/interactive/2017/06/23/opinion/trumps-lies.html')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The code above fetches our web page from the URL, and stores the result in a \"response\" object called `r`. 
That response object has a `text` attribute, which contains the same HTML code we saw when viewing the source from our web browser:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n", "