{ "metadata": { "name": "00-BeautifulSoup" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Beautiful Soup tutorial\n", "Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.\n", "## Beautiful Soup Urls\n", "* [Beautiful Soup Website](http://www.crummy.com/software/BeautifulSoup/)\n", "* [Beautiful Soup Documentation](http://www.crummy.com/software/BeautifulSoup/bs4/doc/)\n", "\n", "## Installation\n", "Open bash shell and run:\n", "\n", " pip install beautifulsoup4" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from bs4 import BeautifulSoup\n", "\n", "html_doc = \"\"\"\n", "The Dormouse's story\n", "\n", "

The Dormouse's story

\n", "\n", "

Once upon a time there were three little sisters; and their names were\n", "Elsie,\n", "Lacie and\n", "Tillie;\n", "and they lived at the bottom of a well.

\n", "\n", "

...

\n", "\"\"\"\n", "\n", "soup = BeautifulSoup(html_doc)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Web scraping disclaimer\n", "\n", "Read the terms of service for the web site before attempting to scrape data. Here is an example from [Tropicos](http://www.tropicos.org/) that forbids scraping of their web site that degrades service.\n", "\n", "* [Tropicos terms of service](http://www.tropicos.org/TermsOfUse.aspx)\n", "\n", "If the site provides a web service or api, make use of that instead of trying to parse the interface designed for web browsers.\n", "\n", "* [Taxonomic Name Resolution Service API](http://tnrs.iplantcollaborative.org/api.html)\n", "\n", "A recent example of the legalities around web scraping is the JSTOR/Aaron Swartz prosecution\n", "\n", "* http://en.wikipedia.org/wiki/Aaron_Swartz#JSTOR\n", "\n", "## Web scraping with Beautiful Soup\n", "\n", "Suppose we want to get of list of scientific names for a common name from FishBase.\n", "\n", "[FishBase search page](http://www.fishbase.org/search.php)\n", "\n", "\n", "\n" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import urllib2\n", "\n", "search_url = 'http://fishbase.se/search.php'\n", "search_page = urllib2.urlopen(search_url)\n", "\n", "print search_page.read(200)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can now use Beautiful Soup to parse and explore the search page" ] }, { "cell_type": "code", "collapsed": false, "input": [ "soup = BeautifulSoup(search_page)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Explore the search page in Beautiful Soup but avoid printing form tag contents.\n", "\n", "Bad:\n", "\n", " print soup.find_all('form')\n", "\n", "Good:\n", "\n", " for form in soup.find_all('form'):\n", " print form.text" ] }, { "cell_type": "code", "collapsed": false, "input": [], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The search page contains a form to query Fish Base on common name. The form does a form post request to relative url\n", "\n", " /ComNames/CommonNameSearchList.php\n", "\n", "with the common name in the id\n", "\n", " CommonName\n", "\n", "Post requests appends form-data inside the body of the HTTP request (data is not shown is in URL). Get requests appends form-data into the URL in name/value pairs and user is able to bookmark the page. See if we can do a get request with the CommonName query.\n", "\n", "* http://www.w3schools.com/tags/att_form_method.asp\n", "\n", "Print out all the form actions urls contained on the search page " ] }, { "cell_type": "code", "collapsed": false, "input": [], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We want to get the results of a common name search. We can attempt to do a http get request to the url that is in the action attribute since we can see the get request syntax at the bottom of the search result page.\n", "\n", " http://fishbase.se/ComNames/CommonNameSearchList.php?resultPage=2&CommonName=Tuna" ] }, { "cell_type": "code", "collapsed": false, "input": [ "result_url = 'http://fishbase.se/ComNames/CommonNameSearchList.php?CommonName=Tuna'\n", "result_page = urllib2.urlopen(result_url)\n", "\n", "soup = BeautifulSoup(result_page)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There is a problem however. Does this really give us the full list of names for Tuna?" ] }, { "cell_type": "code", "collapsed": false, "input": [], "language": "python", "metadata": {}, "outputs": [] } ], "metadata": {} } ] }