{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Scraping and Parsing a Wikipedia List into Pandas\n", "\n", "- **Author:** [Chris Albon](http://www.chrisalbon.com/), [@ChrisAlbon](https://twitter.com/chrisalbon)\n", "- **Date:** -\n", "- **Repo:** [Python 3 code snippets for data science](https://github.com/chrisalbon/code_py)\n", "- **Note:** -" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Preliminaries" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Import required modules\n", "import requests\n", "from bs4 import BeautifulSoup\n", "import pandas as pd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create a Beautiful Soup object from the website" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Create a variable with the URL of the Wikipedia page to scrape\n", "url = 'http://en.wikipedia.org/wiki/List_of_airship_accidents'\n", "\n", "# Scrape the HTML at the url\n", "r = requests.get(url)\n", "\n", "# Turn the HTML into a Beautiful Soup object, naming the parser explicitly\n", "soup = BeautifulSoup(r.text, 'html.parser')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Parse the HTML into a list" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Create a list for the scraping results\n", "disasters = []" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The structure of the page is such that if we take all the *li* items **without** a class attribute, and then ignore the last three, we will have a clean list of all the disasters." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Find every li element without a class (except for the last three)\n", "# Then, for each row, append the text to disasters.
\n", "for row in soup.find_all('li', class_=False)[:-3]:\n", "    disasters.append(row.text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Wrangle the list into a DataFrame" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Create a DataFrame from the list\n", "df = pd.DataFrame(disasters, columns=['raw'])" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Take everything before the first \":\" and call that the date variable\n", "df['date'] = df['raw'].str.extract('^([^:]+):', expand=False)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Take everything after the first \":\" and call that the description variable\n", "df['description'] = df['raw'].str.extract(':(.*)', expand=False)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Convert the date variable to datetime\n", "df['date'] = pd.to_datetime(df['date'])" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Set the date variable to be the DataFrame's index\n", "df.index = df['date']" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Drop the variables we no longer need\n", "df = df.drop(['raw', 'date'], axis=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### View the results" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", " | description | \n", "
---|---|
date | \n", "\n", " |
2 May 1902 | \n", "Semi-rigid airship Pax explodes over Paris, k... | \n", "
13 October 1902 | \n", "Separation of gondola from envelope over Pari... | \n", "
30 November 1907 | \n", "Loss of the French Army's Patrie - no fatalit... | \n", "
23 May 1908 | \n", "Morrell airship falls over Berkeley, Californ... | \n", "
4 August 1908 | \n", "Zeppelin LZ 4 caught fire near Echterdingen a... | \n", "