{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Scraping And Parsing A Wikipedia List into Pandas\n", "\n", "- **Author:** [Chris Albon](http://www.chrisalbon.com/), [@ChrisAlbon](https://twitter.com/chrisalbon)\n", "- **Date:** -\n", "- **Repo:** [Python 3 code snippets for data science](https://github.com/chrisalbon/code_py)\n", "- **Note:** -" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Preliminaries" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Import required modules\n", "import requests\n", "from bs4 import BeautifulSoup\n", "import pandas as pd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create a Beautiful Soup object from the website" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Create a variable with the URL of the Wikipedia page to scrape\n", "url = 'http://en.wikipedia.org/wiki/List_of_airship_accidents'\n", "\n", "# Scrape the HTML at the url\n", "r = requests.get(url)\n", "\n", "# Turn the HTML into a Beautiful Soup object (naming the parser explicitly avoids a warning)\n", "soup = BeautifulSoup(r.text, 'html.parser')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Parse the HTML into a list" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Create a list for the scraping results\n", "disasters = []" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The structure of the page is such that if we take all the *li* items **without** a class attribute, and then ignore the last three, we will have a clean list of all the disasters." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Select every li element without a class (except for the last three)\n", "# Then, for each row, append the text to disasters. 
\n", "for row in soup.find_all('li', class_=False)[:-3]:\n", "    disasters.append(row.text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Wrangle the list into a DataFrame" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Create a DataFrame from the list\n", "df = pd.DataFrame(disasters, columns=['raw'])" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Take everything before the first \":\" and call that the date variable\n", "df['date'] = df['raw'].str.extract('^([^:]+):', expand=False)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Take everything after the first \":\" and call that the description variable\n", "df['description'] = df['raw'].str.extract(':(.*)', expand=False)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Convert the date variable to datetime\n", "df['date'] = pd.to_datetime(df['date'])" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Set the date variable to be the DataFrame's index\n", "df.index = df['date']" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Drop the variables we no longer need\n", "df = df.drop(['raw', 'date'], axis=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### View the results" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
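<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n",
"    <tr style=\"text-align: right;\">\n", "      <th></th>\n", "      <th>description</th>\n", "    </tr>\n",
"    <tr>\n", "      <th>date</th>\n", "      <th></th>\n", "    </tr>\n", "  </thead>\n", "  <tbody>\n",
"    <tr>\n", "      <th>2 May 1902</th>\n", "      <td> Semi-rigid airship Pax explodes over Paris, k...</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>13 October 1902</th>\n", "      <td> Separation of gondola from envelope over Pari...</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>30 November 1907</th>\n", "      <td> Loss of the French Army's Patrie - no fatalit...</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>23 May 1908</th>\n", "      <td> Morrell airship falls over Berkeley, Californ...</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>4 August 1908</th>\n", "      <td> Zeppelin LZ 4 caught fire near Echterdingen a...</td>\n", "    </tr>\n",
"  </tbody>\n", "</table>\n", "</div>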
" ], "text/plain": [ " description\n", "date \n", "2 May 1902 Semi-rigid airship Pax explodes over Paris, k...\n", "13 October 1902 Separation of gondola from envelope over Pari...\n", "30 November 1907 Loss of the French Army's Patrie - no fatalit...\n", "23 May 1908 Morrell airship falls over Berkeley, Californ...\n", "4 August 1908 Zeppelin LZ 4 caught fire near Echterdingen a..." ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# View the top of the dataframe\n", "df.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.3.5" } }, "nbformat": 4, "nbformat_minor": 0 }