{ "metadata": { "name": "", "signature": "sha256:5491ae613b8a8f5a563e044ded85b86438a58ddecf86fbc97f69869ae908f083" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Drilling Down With Beautiful Soup\n", "\n", "- **Author:** [Chris Albon](http://www.chrisalbon.com/), [@ChrisAlbon](https://twitter.com/chrisalbon)\n", "- **Date:** -\n", "- **Repo:** [Python 3 code snippets for data science](https://github.com/chrisalbon/code_py)\n", "- **Note:** -" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# Import required modules\n", "import requests\n", "from bs4 import BeautifulSoup\n", "import pandas as pd" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 1 }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Download the HTML and create a Beautiful Soup object" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# Create a variable with the URL to this tutorial\n", "url = 'http://en.wikipedia.org/wiki/List_of_A_Song_of_Ice_and_Fire_characters'\n", "\n", "# Scrape the HTML at the url\n", "r = requests.get(url)\n", "\n", "# Turn the HTML into a Beautiful Soup object\n", "soup = BeautifulSoup(r.text)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 2 }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we looked at the soup object, we'd see that the names we want are in a heirarchical list. In psuedo-code, it looks like:\n", "\n", "- class=toclevel-1 span=toctext\n", " - class=toclevel-2 span=toctext CHARACTER NAMES\n", " - class=toclevel-2 span=toctext CHARACTER NAMES\n", " - class=toclevel-2 span=toctext CHARACTER NAMES\n", " - class=toclevel-2 span=toctext CHARACTER NAMES\n", " - class=toclevel-2 span=toctext CHARACTER NAMES\n", " \n", "To get the CHARACTER NAMES, we are going to need to drill down to grap into loclevel-2 and grab the toctext" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Setting up where to put the results" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# Create a variable to score the scraped data in\n", "character_name = []" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 3 }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Drilling down with a forloop" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# for each item in all the toclevel-2 li items\n", "# (except the last three because they are not character names), \n", "for item in soup.find_all('li',{'class':'toclevel-2'})[:-3]: \n", " # find each span with class=toctext,\n", " for post in item.find_all('span',{'class':'toctext'}): \n", " # add the stripped string of each to character_name, one by one\n", " character_name.append(post.string.strip())" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 4 }, { "cell_type": "markdown", "metadata": {}, "source": [ "### The results" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# View all the character names\n", "character_name" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 5, "text": [ "['Eddard Stark',\n", " 'Catelyn Stark',\n", " 'Robb Stark',\n", " 'Sansa Stark',\n", " 'Arya Stark',\n", " 'Bran Stark',\n", " 'Rickon Stark',\n", " 'Jon Snow',\n", " 'Lyanna Stark',\n", " 'Roose Bolton',\n", " 'Ramsay Snow',\n", " 'Hodor',\n", " 'Osha',\n", " 'Jeyne Poole',\n", " 'Jojen and Meera Reed',\n", " 'Jeyne Westerling',\n", " 'Daenerys Targaryen',\n", " 'Viserys Targaryen',\n", " 'Rhaegar Targaryen',\n", " 'Aegon V Targaryen',\n", " 'Aerys II Targaryen',\n", " 'Aegon Targaryen',\n", " 'Jon Connington',\n", " 'Jorah Mormont',\n", " 'Brynden Rivers',\n", " 'Jon Arryn',\n", " 'Lysa Arryn',\n", " 'Robert Arryn',\n", " 'Tywin Lannister',\n", " 'Cersei Lannister',\n", " 'Jaime Lannister',\n", " 'Joffrey Baratheon',\n", " 'Tyrion Lannister',\n", " 'Kevan Lannister',\n", " 'Lancel Lannister',\n", " 'Bronn',\n", " 'Gregor Clegane',\n", " 'Sandor Clegane',\n", " 'Podrick Payne',\n", " 'Robert Baratheon',\n", " 'Myrcella Baratheon',\n", " 'Tommen Baratheon',\n", " 'Stannis Baratheon',\n", " 'Melisandre',\n", " 'Davos Seaworth',\n", " 'Renly Baratheon',\n", " 'Brienne of Tarth',\n", " 'Beric Dondarrion',\n", " 'Gendry',\n", " 'Balon Greyjoy',\n", " 'Asha Greyjoy',\n", " 'Theon Greyjoy',\n", " 'Euron Greyjoy',\n", " 'Victarion Greyjoy',\n", " 'Aeron Greyjoy',\n", " 'Doran Martell',\n", " 'Arianne Martell',\n", " 'Quentyn Martell',\n", " 'Elia Martell',\n", " 'Oberyn Martell',\n", " 'The Sand Snakes',\n", " 'Areo Hotah',\n", " 'Hoster Tully',\n", " 'Edmure Tully',\n", " 'Brynden Tully',\n", " 'Walder Frey',\n", " 'Mace Tyrell',\n", " 'Willas Tyrell',\n", " 'Garlan Tyrell',\n", " 'Loras Tyrell',\n", " 'Margaery Tyrell',\n", " 'Olenna Tyrell',\n", " 'Jeor Mormont',\n", " 'Maester Aemon',\n", " 'Yoren',\n", " 'Samwell Tarly',\n", " 'Janos Slynt',\n", " 'Mance Rayder',\n", " 'Ygritte',\n", " 'Val',\n", " 'Petyr Baelish',\n", " 'Varys',\n", " 'Pycelle',\n", " 'Barristan Selmy',\n", " 'Arys Oakheart',\n", " 'Ilyn Payne',\n", " 'Qyburn',\n", " 'Balon Swann',\n", " 'Khal Drogo',\n", " 'Syrio Forel',\n", " \"Jaqen H'ghar\",\n", " 'Illyrio Mopatis',\n", " 'Thoros of Myr',\n", " 'Ser Duncan the Tall']" ] } ], "prompt_number": 5 }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Quick analysis: Which house has the most main characters?" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# Create a list object where to store the for loop results\n", "houses = []" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 6 }, { "cell_type": "code", "collapsed": false, "input": [ "# For each element in the character_name list,\n", "for name in character_name:\n", " # split up the names by a blank space and select the last element\n", " # this works because it is the last name if they are a house, \n", " # but the first name if they only have one name,\n", " # Then append each last name to the houses list\n", " houses.append(name.split(' ')[-1])" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 7 }, { "cell_type": "code", "collapsed": false, "input": [ "# Convert houses into a pandas series (so we can use value_counts())\n", "houses = pd.Series(houses)\n", "\n", "# Count the number of times each name/house name appears\n", "houses.value_counts()" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 12, "text": [ "Stark 8\n", "Tyrell 6\n", "Targaryen 6\n", "Greyjoy 6\n", "Baratheon 6\n", "Lannister 6\n", "Martell 5\n", "Tully 3\n", "Arryn 3\n", "Payne 2\n", "Clegane 2\n", "Mormont 2\n", "Snow 2\n", "Bolton 1\n", "Aemon 1\n", "Val 1\n", "Frey 1\n", "Rivers 1\n", "Varys 1\n", "Bronn 1\n", "Hotah 1\n", "Tarly 1\n", "Osha 1\n", "Tall 1\n", "Yoren 1\n", "Rayder 1\n", "Snakes 1\n", "Myr 1\n", "Seaworth 1\n", "Qyburn 1\n", "Forel 1\n", "Baelish 1\n", "Poole 1\n", "H'ghar 1\n", "Drogo 1\n", "Swann 1\n", "Selmy 1\n", "Gendry 1\n", "Tarth 1\n", "Slynt 1\n", "Westerling 1\n", "Hodor 1\n", "Mopatis 1\n", "Pycelle 1\n", "Ygritte 1\n", "Reed 1\n", "Dondarrion 1\n", "Connington 1\n", "Oakheart 1\n", "Melisandre 1\n", "dtype: int64" ] } ], "prompt_number": 12 }, { "cell_type": "code", "collapsed": false, "input": [], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 9 } ], "metadata": {} } ] }