{ "metadata": { "name": "", "signature": "sha256:539ed8e0f367a34000541f2bbd1e9f78193c38c08c58cd560e1278e79dcd95b3" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "#Homework assignment #5\n", "\n", "These problem sets focus on using the Beautiful Soup library to scrape web pages.\n", "\n", "##Problem Set #1: Basic scraping\n", "\n", "I've made a web page for you to scrape. It's available [here](http://static.decontextualize.com/widgets.html). The page concerns the catalog of a famous [widget](http://en.wikipedia.org/wiki/Widget) company. You'll be answering several questions about this web page. First off, in the cell below, write some code so that you end up with a variable called `html_str` that contains the HTML source code of the page. I've pre-filled the cell with some code; your job is to write the missing line. When you run the cell, it should print out `2801` (the number of characters in the HTML source code for `widgets.html`)." ] }, { "cell_type": "code", "collapsed": false, "input": [ "import urllib\n", "# YOUR CODE HERE\n", "\n", "print len(html_str)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Excellent. Now, in the cell below, use Beautiful Soup to write an expression that evaluates to the number of `
A soft cheese made in the Camembert region of France.
\n", "\n", "A yellow cheese made in the Cheddar region of... France, probably, idk whatevs.
\n", "\"\"\"" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If our task were to create a dictionary that maps the name of each cheese to the description that follows in the `<p>` tag directly afterward, we'd be out of luck. Fortunately, Beautiful Soup has a `.find_next_sibling()` method, which lets you search for the next tag that is a sibling of the tag you're calling it on (i.e., the two tags share a parent) and that also matches particular criteria. So, for example, to accomplish the task outlined above:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "example_doc = BeautifulSoup(example_html)\n", "cheese_dict = {}\n", "for h2_tag in example_doc.find_all('h2'):\n", "    cheese_name = h2_tag.string\n", "    cheese_desc_tag = h2_tag.find_next_sibling('p')\n", "    cheese_dict[cheese_name] = cheese_desc_tag.string\n", "\n", "cheese_dict" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With that knowledge in mind, let's go back to our widgets. In the cell below, write code that uses Beautiful Soup, and in particular the `.find_next_sibling()` method, to find out how many widgets are in the table *just beneath* the header \"Hallowed Widgets.\" (You can tell by looking at the page that there are four such widgets. But this is a programming class, so we have to write a program to do it.)" ] }, { "cell_type": "code", "collapsed": false, "input": [], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Okay, now, the final task. If you can accomplish this, you are truly an expert web scraper. I'll have little web scraper certificates made up and I'll give you one, if you manage to do this thing. And I know you can do it!\n", "\n", "In the cell below, I've created a variable `category_counts` and assigned to it an empty dictionary. 
Write code to populate this dictionary so that its keys are \"categories\" of widgets (e.g., the contents of the `