{ "metadata": { "name": "", "signature": "sha256:4c05d403de092669880ff0942cd89e435c3082e3670be5050c3d8087621c3bd5" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "#Getting data from markup languages\n", "\n", "So far we've discussed a number of sources for data: CSV files, web APIs, and unstructured text. There's a lot of data on the internet locked up in one of two \"markup\" languages: XML and HTML. Our goal today is to discuss and put into practice a few methods for extracting data from documents written in these languages.\n", "\n", "##HTML\n", "\n", "HTML stands for \"hypertext markup language.\" Most of the documents you see when you're browsing the web are written in this format. In most browsers, there's a \"View Source\" option that allows you to see the HTML source code for any page you're looking at. For example, in Chrome, you can CTRL-click anywhere on the page, or go to `View > Developer > View Source`:\n", "\n", "\"nytimes-view-source\"/\n", "\n", "You'll see something that looks like this, a mish-mash of angle brackets and quotes and slashes and text. This is HTML.\n", "\n", "\"nytimes-source\"/\n", "\n", "###What HTML looks like\n", "\n", "HTML consists of a series of *tags*. Tags have a *name*, a series of key/value pairs called *attributes*, and some textual *content*. Attributes are optional. Here's a simple example, using the HTML `
<p>` tag (`p` means \"paragraph\"):\n", "\n", "    <p>Mother said there'd be days like these.</p>\n", " \n", "This example has just one tag in it: a `<p>` tag. The source code for a tag has two parts, its opening tag (`<p>`) and its closing tag (`</p>`). In between the opening and closing tag, you see the tag's contents (in this case, the text `Mother said there'd be days like these.`).\n", "\n", "Here's another example, using the HTML `
<div>` tag:\n", "\n", "    <div class=\"header\" style=\"background: blue;\">Mammoth Falls</div>\n", " \n", "In this example, the tag's name is `div`. The tag has two attributes: `class`, with value `header`, and `style`, with value `background: blue;`. The content of this tag is `Mammoth Falls`.\n", "\n", "Tags can contain other tags, in a hierarchical relationship. For example, here's some HTML to make a bulleted list:\n", "\n", " \n", "\n", "The `