{
 "metadata": {
  "name": "",
  "signature": "sha256:4c05d403de092669880ff0942cd89e435c3082e3670be5050c3d8087621c3bd5"
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "#Getting data from markup languages\n",
      "\n",
      "So far we've discussed a number of sources for data: CSV files, web APIs, and unstructured text. There's a lot of data on the internet locked up in one of two \"markup\" languages: XML and HTML. Our goal today is to discuss and put into practice a few methods for extracting data from documents written in these languages.\n",
      "\n",
      "##HTML\n",
      "\n",
      "HTML stands for \"hypertext markup language.\" Most of the documents you see when you're browsing the web are written in this format. In most browsers, there's a \"View Source\" option that allows you to see the HTML source code for any page you're looking at. For example, in Chrome, you can CTRL-click anywhere on the page, or go to `View > Developer > View Source`:\n",
      "\n",
      "<a href=\"http://static.decontextualize.com/snaps/nytimes-view-source.png\"><img src=\"http://static.decontextualize.com/snaps/nytimes-view-source.png\" alt=\"nytimes-view-source\"/></a>\n",
      "\n",
      "You'll see something that looks like this, a mish-mash of angle brackets and quotes and slashes and text. This is HTML.\n",
      "\n",
      "<a href=\"http://static.decontextualize.com/snaps/nytimes-source.png\"><img src=\"http://static.decontextualize.com/snaps/nytimes-source.png\" alt=\"nytimes-source\"/></a>\n",
      "\n",
      "###What HTML looks like\n",
      "\n",
      "HTML consists of a series of *tags*. Tags have a *name*, a series of key/value pairs called *attributes*, and some textual *content*. Attributes are optional. Here's a simple example, using the HTML `<p>` tag (`p` means \"paragraph\"):\n",
      "\n",
      "    <p>Mother said there'd be days like these.</p>\n",
      "    \n",
      "This example has just one tag in it: a `<p>` tag. The source code for a tag has two parts, its opening tag (`<p>`) and its closing tag (`</p>`). In between the opening and closing tag, you see the tag's contents (in this case, the text `Mother said there'd be days like these.`).\n",
      "\n",
      "Here's another example, using the HTML `<div>` tag:\n",
      "\n",
      "    <div class=\"header\" style=\"background: blue;\">Mammoth Falls</div>\n",
      "    \n",
      "In this example, the tag's name is `div`. The tag has two attributes: `class`, with value `header`, and `style`, with value `background: blue;`. The contents of this tag is `Mammoth Falls`.\n",
      "\n",
      "Tags can contain other tags, in a hierarchical relationship. For example, here's some HTML to make a bulletted list:\n",
      "\n",
      "    <ul>\n",
      "      <li>Item one</li>\n",
      "      <li>Item two</li>\n",
      "      <li>Item three</li>\n",
      "    </ul>\n",
      "\n",
      "The `<ul>` tag (`ul` stands for \"unordered list\") in this example has three other `<li>` tags inside of it (`li` stands for \"list item\"). The `<ul>` tag is said to be the \"parent\" of the `<li>` tags, and the `<li>` tags are the \"children\" of the `<ul>` tag. All tags grouped under a particular parent tag are called \"siblings.\"\n",
      "\n",
      "###HTML's shortcomings\n",
      "\n",
      "HTML documents are intended to add \"markup\" to text to add information that allows browsers to display the text in different ways---e.g., HTML markup might tell the browser to make the font of the text a particular size, or to position it in a particular place on the screen.\n",
      "\n",
      "Because the primary purpose of HTML is to change the appearance of text, HTML markup usually does *not* tell us anything useful about what the text means, or what kind of data it contains. When you look at a web page in the browser, it might appear to contain a list of newspaper articles, or a table with birth rates, or a series of names with associated biographies, or whatever. But that's information that we get, as humans, from reading the page. There's (usually) no easy way to extract this information with a computer program.\n",
      "\n",
      "HTML is also notoriously messy---web browsers are very forgiving of syntax errors and other irregularities in HTML (like mismatched or unclosed tags). (This is in contrast to data formats like JSON, where even small errors will fail to be parsed.) For this reason, we need special libraries to parse HTML into data structures that our Python programs can use, libraries that can make a \"good guess\" about what the structure of an HTML document is, even when that structure is written incorrectly or inconsistently.\n",
      "\n",
      "[Beautiful Soup](http://www.crummy.com/software/BeautifulSoup/) is a Python library that parses HTML (even if it's poorly formatted) and allows us to extract and manipulate its contents. We'll be using this library in the examples that follow.\n",
      "\n",
      "Keep in mind that there are only sketchy rules for what HTML elements \"mean\"---semantic information you figure out for one web page might not apply to the next. Values for `class` attributes especially are meaningful only in the context of a single page.\n",
      "\n",
      "> Note: There's an effort to add semantic information to HTML markup called [HTML Microformats](http://microformats.org/). If sites added microformats to their markup, you'd be able to write code that could more reliably extract information from web pages, because there would be a common language for what tags with particular classes and attributes mean. Alas, microformats remain unpopular, and until the anarcho-collectivists win a greater mindshare, we can count only on our own individual readings of individual HTML documents.\n",
      "\n",
      "###Inspecting HTML's anatomy with Developer Tools\n",
      "\n",
      "I've crafted a very simple example of HTML for us to work with. It concerns kittens. [Here's the rendered version](http://static.decontextualize.com/kittens.html), and [here's the HTML source code](https://raw.githubusercontent.com/ledeprogram/courses/master/databases/data/kittens.html).\n",
      "\n",
      "Now we're going to use Developer Tools in Chrome to take a look at how `kittens.html` is organized. Click on the \"rendered version\" link above. In Chrome, ctrl-click (or right click) anywhere on the page and select \"Inspect Element.\" This will open Chrome's Developer Tools. Your screen should look (something) like this:\n",
      "\n",
      "<a href=\"http://static.decontextualize.com/snaps/kittens-dev-tools.png\"><img src=\"http://static.decontextualize.com/snaps/kittens-dev-tools.png\" alt=\"kittens-dev-tools\"/></a>\n",
      "\n",
      "In the upper panel, you see the web page you're inspecting. In the lower panel, you see a version of the HTML source code, with little arrows next to some of the lines. (The little arrows allow you to collapse parts of the HTML source that are hierarchically related.) As you move your mouse over the elements in the top panel, different parts of the source code will be highlighted. Chrome is showing you which parts of the source code are causing which parts of the page to show up. Pretty spiffy!\n",
      "\n",
      "This relationship also works in reverse: you can move your mouse over some part of the source code in the lower panel, which will highlight in the top panel what that source code corresponds to on the page. We'll be using this later to visually identify the parts of the page that are interesting to us, so we can write code that extracts the contents of those parts automatically.\n",
      "\n",
      "###Characterizing the structure of kittens\n",
      "\n",
      "Here's what the source code of kittens.html looks like:\n",
      "\n",
      "\t<!doctype html>\n",
      "\t<html>\n",
      "\t  <head>\n",
      "\t    <title>Kittens!</title>\n",
      "\t  </head>\n",
      "\t  <body>\n",
      "\t    <h1>Kittens and the TV Shows They Love</h1>\n",
      "\t    <div class=\"kitten\">\n",
      "\t      <h2>Fluffy</h2>\n",
      "\t      <div><img src=\"http://placekitten.com/100/100\"></div>\n",
      "\t      <ul class=\"tvshows\">\n",
      "\t        <li><a href=\"http://www.imdb.com/title/tt0106145/\">Deep Space Nine</a></li>\n",
      "\t        <li><a href=\"http://www.imdb.com/title/tt0088576/\">Mr. Belvedere</a></li>\n",
      "\t      </ul>\n",
      "\t      Last check-up: <span class=\"lastcheckup\">2014-01-17</span>\n",
      "\t    </div>\n",
      "\t    <div class=\"kitten\">\n",
      "\t      <h2>Monsieur Whiskeurs</h2>\n",
      "\t      <div><img src=\"http://placekitten.com/150/100\"></div>\n",
      "\t      <ul class=\"tvshows\">\n",
      "\t        <li><a href=\"http://www.imdb.com/title/tt0106179/\">The X-Files</a></li>\n",
      "\t        <li><a href=\"http://www.imdb.com/title/tt0098800/\">Fresh Prince</a></li>\n",
      "\t      </ul>\n",
      "\t      Last check-up: <span class=\"lastcheckup\">2013-11-02</span>\n",
      "\t    </div>\n",
      "\t  </body>\n",
      "\t</html>\n",
      "\n",
      "This is pretty well organized HTML, but if you don't know how to read HTML, it will still look like a big jumble. Here's how I would characterize the structure of this HTML, reading in my own idea of what the meaning of the elements are.\n",
      "\n",
      "* We have two \"kittens,\" both of which are contained in `<div>` tags with class `kitten`.\n",
      "* Each \"kitten\" `<div>` has an `<h2>` tag with that kitten's name.\n",
      "* There's an image for each kitten, specified with an `<img>` tag.\n",
      "* Each kitten has a list (a `<ul>` with class `tvshows`) of television shows, contained within `<li>` tags.\n",
      "* Those list items themselves have links (`<a>` tags) with an `href` attribute that contains a link to an IMDB entry for that show.\n",
      "\n",
      "> BONUS QUIZ: What's the parent tag of `<a href=\"http://www.imdb.com/title/tt0088576/\">Mr. Belvedere</a>`? Both `<div class=\"kitten\">` tags share a parent tag---what is it? What attributes are present on both `<img>` tags?\n",
      "\n",
      "###Scraping kittens with Beautiful Soup\n",
      "\n",
      "We've examined `kittens.html` a bit now. What we'd like to do is write some code that is going to extract information from the HTML, like \"what is the last checkup date for each of these kittens?\" or \"what are Monsieur Whiskeur's favorite TV shows?\" To do so, we need to *parse* the HTML, and create a representation of it in our program that we can manipulate with Python.\n",
      "\n",
      "As mentioned above, HTML is hard to parse by hand. (Don't even try it. In particular, [don't parse HTML with regular expressions](http://stackoverflow.com/a/1732454).)\n",
      "\n",
      "Beautiful Soup is a Python library that will parse the HTML for us, and give us some Python objects that we can call methods on to poke at the data contained therein.\n",
      "\n",
      "The first thing we need to do is fetch the source code of that page. We can do that with our old friend `urllib.urlopen()`:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import urllib\n",
      "\n",
      "html_str = urllib.urlopen(\"https://raw.githubusercontent.com/ledeprogram/courses/master/databases/data/kittens.html\").read()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 2
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Now `html_str` is a string that contains the HTML source code of the page in question:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": true,
     "input": [
      "print html_str"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "<!doctype html>\n",
        "<html>\n",
        "\t<head>\n",
        "\t\t<title>Kittens!</title>\n",
        "\t</head>\n",
        "\t<body>\n",
        "\t\t<h1>Kittens and the TV Shows They Love</h1>\n",
        "\t\t<div class=\"kitten\">\n",
        "\t\t\t<h2>Fluffy</h2>\n",
        "\t\t\t<div><img src=\"http://placekitten.com/100/100\"></div>\n",
        "\t\t\t<ul class=\"tvshows\">\n",
        "\t\t\t\t<li>\n",
        "\t\t\t\t\t<a href=\"http://www.imdb.com/title/tt0106145/\">Deep Space Nine</a>\n",
        "\t\t\t\t</li>\n",
        "\t\t\t\t<li>\n",
        "\t\t\t\t\t<a href=\"http://www.imdb.com/title/tt0088576/\">Mr. Belvedere</a>\n",
        "\t\t\t\t</li>\n",
        "\t\t\t</ul>\n",
        "\t\t\tLast check-up: <span class=\"lastcheckup\">2014-01-17</span>\n",
        "\t\t</div>\n",
        "\t\t<div class=\"kitten\">\n",
        "\t\t\t<h2>Monsieur Whiskeurs</h2>\n",
        "\t\t\t<div><img src=\"http://placekitten.com/150/100\"></div>\n",
        "\t\t\t<ul class=\"tvshows\">\n",
        "\t\t\t\t<li>\n",
        "\t\t\t\t\t<a href=\"http://www.imdb.com/title/tt0106179/\">The X-Files</a>\n",
        "\t\t\t\t</li>\n",
        "\t\t\t\t<li>\n",
        "\t\t\t\t\t<a href=\"http://www.imdb.com/title/tt0098800/\">Fresh Prince</a>\n",
        "\t\t\t\t</li>\n",
        "\t\t\t</ul>\n",
        "\t\t\tLast check-up: <span class=\"lastcheckup\">2013-11-02</span>\n",
        "\t\t</div>\n",
        "\t</body>\n",
        "</html>\n",
        "\n",
        "\n"
       ]
      }
     ],
     "prompt_number": 3
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Rad. Now we want to be able to ask questions about what's in the HTML. To do so, we're going to give the string to Beautiful Soup to parse."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from bs4 import BeautifulSoup\n",
      "\n",
      "document = BeautifulSoup(html_str)\n",
      "print type(document)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "<class 'bs4.BeautifulSoup'>\n"
       ]
      }
     ],
     "prompt_number": 5
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "We've created a `BeautifulSoup` object and assigned it to a variable `document`. This object supports a number of interesting methods. We'll focus on just a few.\n",
      "\n",
      "###Finding a tag\n",
      "\n",
      "HTML documents are composed of tags. To represent this, Beautiful Soup has a type of value that represents tags. We can use the `.find()` method of the `BeautifulSoup` object to find a tag that matches a particular tag name. For example:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "h1_tag = document.find('h1')\n",
      "print type(h1_tag)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "<class 'bs4.element.Tag'>\n"
       ]
      }
     ],
     "prompt_number": 9
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "A `Tag` object has several interesting attributes and methods. The `string` attribute of a `Tag` object, for example, returns a string representing that tag's contents:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "print h1_tag.string"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Kittens and the TV Shows They Love\n"
       ]
      }
     ],
     "prompt_number": 16
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "You can access the attributes of a tag by treating the tag object as though it were a dictionary, using the square-bracket index syntax, with the name of the attribute whose value you want as a string inside the brackets. For example, to print out the `src` attribute of the first `<img>` tag in the document:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "img_tag = document.find('img')\n",
      "print img_tag['src']"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "http://placekitten.com/100/100\n"
       ]
      }
     ],
     "prompt_number": 19
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "> Note: You might have noticed that there is more than one `<img>` tag in `kittens.html`! If more than one tag matches the name you pass to `.find()`, it returns only the *first* matching tag. (A better name for `.find()` might be `find_first`.)"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "###Finding multiple tags\n",
      "\n",
      "It's often the case that we want to find not just one tag that matches particular criteria, but ALL tags matching those criteria. For that, we use the `.find_all()` method of the `BeautifulSoup` object. For example, to find all `h2` tags in the document:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "h2_tags = document.find_all('h2')\n",
      "print type(h2_tags)\n",
      "[tag.string for tag in h2_tags]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "<class 'bs4.element.ResultSet'>\n"
       ]
      },
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 13,
       "text": [
        "[u'Fluffy', u'Monsieur Whiskeurs']"
       ]
      }
     ],
     "prompt_number": 13
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Both the `.find()` and `.find_all()` methods can search not just for tags with particular names, but also for tags that have particular attributes. For that, we use the `attrs` keyword argument, giving it a dictionary that associates attribute names as keys and the desired attribute value as values. For example, to find all `span` tags with a `class` attribute of `lastcheckup`:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "checkup_tags = document.find_all('span', attrs={'class': 'lastcheckup'})\n",
      "[tag.string for tag in checkup_tags]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 15,
       "text": [
        "[u'2014-01-17', u'2013-11-02']"
       ]
      }
     ],
     "prompt_number": 15
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "> Note: Beautiful Soup's `.find()` and `.find_all()` methods are actually more powerful than we're letting on here. [Check out the details in the official Beautiful Soup documentation.](http://www.crummy.com/software/BeautifulSoup/bs4/doc/#find-all)"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "###Finding tags within tags\n",
      "\n",
      "Let's say that we wanted to print out a list of the name of each kitten, along with a list of the names of that kitten's favorite TV shows. In other words, we want to print out something that looks like this:\n",
      "\n",
      "    Fluffy: Deep Space Nine, Mr. Belvedere\n",
      "    Monsieur Whiskeurs: The X-Files, Fresh Prince\n",
      "    \n",
      "In order to do this, we need to find *not just* tags with particular names, but tags with *particular hierarchical relationships* with other tags. I.e., we need to identify all of the kittens, and then find the shows that belong to that kitten. This kind of search is made easy by the fact tht you can use `.find()` and `.find_all()` methods not just on the entire document, but on individual tags. When you use these methods on tags, they search for matching tags that are specifically *children of* the tag that you call them on.\n",
      "\n",
      "In our kittens example, we can see that information about individual kittens is grouped together under `<div>` tags with a `class` attribute of `kitten`. So, to find a list of all `<div>` tags with `class` set to `kitten`, we might do this:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "kitten_tags = document.find_all(\"div\", attrs={\"class\": \"kitten\"})"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 20
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Now, we'll loop over that list of tags and find, inside each of them, the `<h2>` tag that is its child:\n"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "for kitten_tag in kitten_tags:\n",
      "    h2_tag = kitten_tag.find('h2')\n",
      "    print h2_tag.string"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Fluffy\n",
        "Monsieur Whiskeurs\n"
       ]
      }
     ],
     "prompt_number": 21
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Now, we'll go one extra step. Looping over all of the kitten tags, we'll find not just the `<h2>` tag with the kitten's name, but all `<a>` tags (which contain the names of the TV shows that we were looking for):"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "for kitten_tag in kitten_tags:\n",
      "    h2_tag = kitten_tag.find('h2')\n",
      "    a_tags = kitten_tag.find_all('a')\n",
      "    a_tag_strings = [tag.string for tag in a_tags]\n",
      "    a_tag_strings_joined = \", \".join(a_tag_strings)\n",
      "    print h2_tag.string + \": \" + a_tag_strings_joined"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Fluffy: Deep Space Nine, Mr. Belvedere\n",
        "Monsieur Whiskeurs: The X-Files, Fresh Prince\n"
       ]
      }
     ],
     "prompt_number": 25
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "> EXERCISE: Modify the code above to print out a list of kitten names along with the last check-up date for that kitten."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "> EXTRA FUN EXERCISE: Rewrite the code above to create a dictionary that maps kitten names to a list of links to that kitten's favorite shows. I.e., you should end up with a dictionary that looks like this:\n",
      "\n",
      "    {u'Fluffy': [u'http://www.imdb.com/title/tt0106145/',\n",
      "      u'http://www.imdb.com/title/tt0088576/'],\n",
      "     u'Monsieur Whiskeurs': [u'http://www.imdb.com/title/tt0106179/',\n",
      "      u'http://www.imdb.com/title/tt0098800/']}"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "###Finding sibling tags\n",
      "\n",
      "Often, the tags we're looking for don't have a distinguishing characteristic, like a `class` attribute, that allows us to find them using `.find()` and `.find_all()`, and the tags also aren't in a parent-child relationship. This can be tricky! Take the following HTML snippet, for example:    "
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "cheese_html = \"\"\"\n",
      "<h2>Camembert</h2>\n",
      "<p>A soft cheese made in the Camembert region of France.</p>\n",
      "\n",
      "<h2>Cheddar</h2>\n",
      "<p>A yellow cheese made in the Cheddar region of... France, probably, idk whatevs.</p>\n",
      "\"\"\""
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 29
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "If our task was to create a dictionary that maps the name of the cheese to the description that follows in the `<p>` tag directly afterward, we'd be out of luck. Fortunately, Beautiful Soup has a `.find_next_sibling()` method, which allows us to search for the next tag that is a *sibling* of the tag you're calling it on (i.e., the two tags share a parent), that also matches particular criteria. So, for example, to accomplish the task outlined above:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "document = BeautifulSoup(cheese_html)\n",
      "cheese_dict = {}\n",
      "for h2_tag in document.find_all('h2'):\n",
      "    cheese_name = h2_tag.string\n",
      "    cheese_desc_tag = h2_tag.find_next_sibling('p')\n",
      "    cheese_dict[cheese_name] = cheese_desc_tag.string\n",
      "\n",
      "cheese_dict"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 33,
       "text": [
        "{u'Camembert': u'A soft cheese made in the Camembert region of France.',\n",
        " u'Cheddar': u'A yellow cheese made in the Cheddar region of... France, probably, idk whatevs.'}"
       ]
      }
     ],
     "prompt_number": 33
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "You now know most of what you need to know to scrape web pages effectively. Good job!"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "###When things go wrong with Beautiful Soup\n",
      "\n",
      "A number of things might go wrong with Beautiful Soup. You might, for example, search for a tag that doesn't exist in the document:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "footer_tag = document.find(\"footer\")"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 35
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Beautiful Soup doesn't return an error if it can't find the tag you want. Instead, it returns `None`:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "print footer_tag"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "None\n"
       ]
      }
     ],
     "prompt_number": 37
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "If you try to call a method on the object that Beautiful Soup returned anyway, you might end up with an error like this:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "footer_tag.find(\"p\")"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "ename": "AttributeError",
       "evalue": "'NoneType' object has no attribute 'find'",
       "output_type": "pyerr",
       "traceback": [
        "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m\n\u001b[0;31mAttributeError\u001b[0m                            Traceback (most recent call last)",
        "\u001b[0;32m<ipython-input-49-68ae6cd15b29>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mfooter_tag\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mfind\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"p\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
        "\u001b[0;31mAttributeError\u001b[0m: 'NoneType' object has no attribute 'find'"
       ]
      }
     ],
     "prompt_number": 49
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "You might also inadvertently try to get an attribute of a tag that wasn't actually found. You'll get a similar error in that case:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "footer_tag['title']"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "ename": "TypeError",
       "evalue": "'NoneType' object has no attribute '__getitem__'",
       "output_type": "pyerr",
       "traceback": [
        "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m\n\u001b[0;31mTypeError\u001b[0m                                 Traceback (most recent call last)",
        "\u001b[0;32m<ipython-input-50-914072fc1329>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mfooter_tag\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'title'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
        "\u001b[0;31mTypeError\u001b[0m: 'NoneType' object has no attribute '__getitem__'"
       ]
      }
     ],
     "prompt_number": 50
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Whenever you see something like `AttributeError: 'NoneType' object has no attribute 'find'` or `TypeError: 'NoneType' object has no attribute '__getitem__'`, it's a good idea to check to see whether your method calls are indeed finding the thing you were looking for.\n",
      "\n",
      "However, the `.find_all()` method will return an empty list if it doesn't find any of the tags you wanted:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "footer_tags = document.find_all(\"footer\")\n",
      "print footer_tags"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "[]\n"
       ]
      }
     ],
     "prompt_number": 40
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "If you attempt to access one of the elements of this regardless..."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "print footer_tags[0].string"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "ename": "IndexError",
       "evalue": "list index out of range",
       "output_type": "pyerr",
       "traceback": [
        "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m\n\u001b[0;31mIndexError\u001b[0m                                Traceback (most recent call last)",
        "\u001b[0;32m<ipython-input-44-26f61605c5b0>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0;32mprint\u001b[0m \u001b[0mfooter_tags\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;36m0\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mstring\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
        "\u001b[0;31mIndexError\u001b[0m: list index out of range"
       ]
      }
     ],
     "prompt_number": 44
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "...you'll get an `IndexError`."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "##A real-world example, sort of\n",
      "\n",
      "The rocky blue golf ball we live on is simply filthy with HTML pages ripe for scraping. As an example today, we're going to take a look at [this one](http://www.journalism.columbia.edu/page/10/10?category_ids%5B%5D=2&category_ids%5B%5D=3&category_ids%5B%5D=37). That's right, the faculty list page for the Columbia Graduate School of Journalism. Whoa.\n",
      "\n",
      "Our task is to create a list of every faculty member, along with their title, and a link to their photo (if any). Let's say that, in the end, we want to end up with a list of dictionaries that looks like this:\n",
      "\n",
      "    [\n",
      "        {'name': 'Adkison, Abbey', 'title': 'Digital Media Coordinator', 'img_src': None,\n",
      "        {'name': 'Barclay, Dolores', 'title': 'Adjunct Faculty', 'img_src': 'http://www.journalism.columbia.edu/system/photos/1943/default/Dolores-Barclay.gif?1365711292',\n",
      "        {'name': 'Baum, Geraldine', 'title': 'Adjunct Faculty', 'img_src': None}\n",
      "        ...\n",
      "    ]\n",
      "\n",
      "(We could then do fun things with this data, like turn it into a PANDAS table or whatever.)\n",
      "\n",
      "We're going to make some mistakes along the way, so you know what it looks like when mistakes happen and so you'll learn a little bit about how to recover from them.\n",
      "\n",
      "The first step in this task? Use Developer Tools to find the elements we're looking for. In this screenshot, I'm mousing over the unit on the page that seems to contain the information I'm looking for. I've \"expanded\" some of the collapsed code sections to make it easier to see the hierarchy of tags.\n",
      "\n",
      "<a href=\"http://static.decontextualize.com/snaps/Google%20Chrome.png\"><img src=\"http://static.decontextualize.com/snaps/Google%20Chrome.png\" alt=\"Google Chrome\"/></a>\n",
      "\n",
      "Based on what I'm seeing here, I can start to formulate a plan to scrape the document. Here's what I came up with:\n",
      "\n",
      "* It looks like each faculty member has an `<li>` tag, so I'll find all of those.\n",
      "* For each `<li>` tag, I need to find an `<img>` tag---specifically, I need to grab the `src` attribute from that tag.\n",
      "* The faculty member's name is inside an `<a>` tag---specifically, an `<a>` tag inside of an `<h4>` tag.\n",
      "* The faculty member's title seems to be located inside a `<p>` tag with `class` attribute `description`.\n",
      "\n",
      "Let's write some code to do that!"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "html_str = urllib.urlopen(\"http://www.journalism.columbia.edu/page/10/10?category_ids%5B%5D=2&category_ids%5B%5D=3&category_ids%5B%5D=37\").read()\n",
      "document = BeautifulSoup(html_str)\n",
      "faculty_list = []\n",
      "for faculty_tag in document.find_all('li'):\n",
      "    # create empty dictionary to store this faculty member\n",
      "    faculty_dict = {}\n",
      "    # faculty name\n",
      "    h4_tag = faculty_tag.find('h4')\n",
      "    a_tag = h4_tag.find('a')\n",
      "    faculty_dict['name'] = a_tag.string\n",
      "    # image URL\n",
      "    img_tag = faculty_tag.find('img')\n",
      "    faculty_dict['img_src'] = img_tag['src']\n",
      "    # title\n",
      "    p_tag = faculty_tag.find('p', attrs={'class': 'description'})\n",
      "    faculty_dict['title'] = p_tag.string\n",
      "    # append to list\n",
      "    faculty_list.append(faculty_dict)\n",
      "faculty_list"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "ename": "AttributeError",
       "evalue": "'NoneType' object has no attribute 'find'",
       "output_type": "pyerr",
       "traceback": [
        "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m\n\u001b[0;31mAttributeError\u001b[0m                            Traceback (most recent call last)",
        "\u001b[0;32m<ipython-input-69-80e2b0320d95>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m()\u001b[0m\n\u001b[1;32m      7\u001b[0m     \u001b[0;31m# faculty name\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m      8\u001b[0m     \u001b[0mh4_tag\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mfaculty_tag\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mfind\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'h4'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 9\u001b[0;31m     \u001b[0ma_tag\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mh4_tag\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mfind\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'a'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m     10\u001b[0m     \u001b[0mfaculty_dict\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'name'\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0ma_tag\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mstring\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m     11\u001b[0m     \u001b[0;31m# image URL\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
        "\u001b[0;31mAttributeError\u001b[0m: 'NoneType' object has no attribute 'find'"
       ]
      }
     ],
     "prompt_number": 69
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Looks good, but wait! When we run it, we get an error: `AttributeError: 'NoneType' object has no attribute 'find'`. Specifically, the `h4_tag`, whose `.find()` method we're trying to use, seems to be `None`... which means that there wasn't an `h4` tag where we expected there to be one. To diagnose the problem, let's simplify our code a little bit to see where the problem is. (I used the range syntax `[:5]` in the `for` loop here, so that we'll grab just the first five results---that should be enough to diagnose our problem.)"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "html_str = urllib.urlopen(\"http://www.journalism.columbia.edu/page/10/10?category_ids%5B%5D=2&category_ids%5B%5D=3&category_ids%5B%5D=37\").read()\n",
      "document = BeautifulSoup(html_str)\n",
      "for faculty_tag in document.find_all('li')[:5]:\n",
      "    print faculty_tag"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "<li><a href=\"/page/1-about-the-school/1\">About the School</a> </li>\n",
        "<li><a href=\"/page/7-our-programs/7\">Academic Programs</a> </li>\n",
        "<li><a href=\"/page/13-career-services/13\">Career Services</a> </li>\n",
        "<li><a href=\"/page/14-events-calendar/14\">Events</a> </li>\n",
        "<li><a class=\"active\" href=\"/page/10-full-time-adjunct-visiting-faculty/10\">Faculty</a></li>\n"
       ]
      }
     ],
     "prompt_number": 60
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "When you `print` a `Tag` object, BeautifulSoup displays the source code for those tags. Here we can see that the `<li>` tags we've found aren't quite the `<li>` tags we were looking for---these seem to be `<li>` tags from another part of the page! Whoops. I guess we need to be more specific about which `<li>` tags we want. How do we do that, though? Let's go back to Developer Tools.\n",
      "\n",
      "<a href=\"http://static.decontextualize.com/snaps/Google%20Chrome.png\"><img src=\"http://static.decontextualize.com/snaps/Google%20Chrome.png\" alt=\"Google Chrome\"/></a>\n",
      "\n",
      "Now it *looks* like all of the relevant `<li>` tags have a single parent tag---`<ul class=\"experts-list\">`. So what we need to do is find not *all `<li>` tags on the page*, but *only those `<li>` tags that are children of this particular `<ul>` tag*. Here's some revised code to do just that:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "experts_ul_tag = document.find('ul', attrs={'class': 'experts-list'})\n",
      "for faculty_tag in experts_ul_tag.find_all('li')[:5]:\n",
      "    print faculty_tag"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "<li class=\"label\" id=\"goto-a\"></li>\n",
        "<li>\n",
        "<div class=\"content\">\n",
        "<h4><a href=\"/profile/228-abbey-adkison/10\">Adkison, Abbey </a></h4>\n",
        "<p class=\"description\">Digital Media Coordinator</p>\n",
        "</div>\n",
        "</li>\n",
        "<li class=\"label\" id=\"goto-b\"></li>\n",
        "<li>\n",
        "<a href=\"/profile/387-dolores-barclay/10\"><img alt=\"Dolores-barclay\" border=\"1\" height=\"90\" src=\"/system/photos/1943/default/Dolores-Barclay.gif?1365711292\" width=\"110\"/></a>\n",
        "<div class=\"content\">\n",
        "<h4><a href=\"/profile/387-dolores-barclay/10\">Barclay, Dolores </a></h4>\n",
        "<p class=\"description\">Adjunct Faculty</p>\n",
        "</div>\n",
        "</li>\n",
        "<li>\n",
        "<div class=\"content\">\n",
        "<h4><a href=\"/profile/341-geraldine-baum/10\">Baum, Geraldine</a></h4>\n",
        "<p class=\"description\">Adjunct Faculty</p>\n",
        "</div>\n",
        "</li>\n"
       ]
      }
     ],
     "prompt_number": 61
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "This looks a little bit better, but we've still got a few weird things: namely, there are some `<li>` tags, (those with a `class` attribute of `label`) which don't seem to contain `h4` tags---or any other content we're interested in at all. We need to put in a check so our code will disregard any `<li>` tags like this. Here's another attempt:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "experts_ul_tag = document.find('ul', attrs={'class': 'experts-list'})\n",
      "for faculty_tag in experts_ul_tag.find_all('li')[:5]:\n",
      "    h4_tag = faculty_tag.find('h4')\n",
      "    if h4_tag is None:\n",
      "        continue\n",
      "    print h4_tag"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "<h4><a href=\"/profile/228-abbey-adkison/10\">Adkison, Abbey </a></h4>\n",
        "<h4><a href=\"/profile/387-dolores-barclay/10\">Barclay, Dolores </a></h4>\n",
        "<h4><a href=\"/profile/341-geraldine-baum/10\">Baum, Geraldine</a></h4>\n"
       ]
      }
     ],
     "prompt_number": 68
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Okay, now we're in business. At last. Let's put this code together with the previous example."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "html_str = urllib.urlopen(\"http://www.journalism.columbia.edu/page/10/10?category_ids%5B%5D=2&category_ids%5B%5D=3&category_ids%5B%5D=37\").read()\n",
      "document = BeautifulSoup(html_str)\n",
      "faculty_list = []\n",
      "experts_ul_tag = document.find('ul', attrs={'class': 'experts-list'})\n",
      "for faculty_tag in experts_ul_tag.find_all('li'):\n",
      "    # create empty dictionary to store this faculty member\n",
      "    faculty_dict = {}\n",
      "    # faculty name\n",
      "    h4_tag = faculty_tag.find('h4')\n",
      "    if h4_tag is None:\n",
      "        continue\n",
      "    a_tag = h4_tag.find('a')\n",
      "    faculty_dict['name'] = a_tag.string\n",
      "    # image URL\n",
      "    img_tag = faculty_tag.find('img')\n",
      "    faculty_dict['img_src'] = img_tag['src']\n",
      "    # title\n",
      "    p_tag = faculty_tag.find('p', attrs={'class': 'description'})\n",
      "    faculty_dict['title'] = p_tag.string\n",
      "    # append to list\n",
      "    faculty_list.append(faculty_dict)\n",
      "faculty_list"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "ename": "TypeError",
       "evalue": "'NoneType' object has no attribute '__getitem__'",
       "output_type": "pyerr",
       "traceback": [
        "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m\n\u001b[0;31mTypeError\u001b[0m                                 Traceback (most recent call last)",
        "\u001b[0;32m<ipython-input-85-5d4df53d970e>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m()\u001b[0m\n\u001b[1;32m     14\u001b[0m     \u001b[0;31m# image URL\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m     15\u001b[0m     \u001b[0mimg_tag\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mfaculty_tag\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mfind\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'img'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 16\u001b[0;31m     \u001b[0mfaculty_dict\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'img_src'\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mimg_tag\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'src'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m     17\u001b[0m     \u001b[0;31m# title\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m     18\u001b[0m     \u001b[0mp_tag\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mfaculty_tag\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mfind\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'p'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mattrs\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m{\u001b[0m\u001b[0;34m'class'\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0;34m'description'\u001b[0m\u001b[0;34m}\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
        "\u001b[0;31mTypeError\u001b[0m: 'NoneType' object has no attribute '__getitem__'"
       ]
      }
     ],
     "prompt_number": 85
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "ARGH! Another `'NoneType' object has no attribute '__getitem__'`! It looks like the problem this time is with the `img_tag` variable. In particular, if we examine the source code, we find that faculty members without head shots don't even have an `<img>` tag in their `<div>`s. So let's add one more thing to fix that---we'll check to see if the `<img>` tag is present, and only then will we attemt to get its `src` attribute:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "html_str = urllib.urlopen(\"http://www.journalism.columbia.edu/page/10/10?category_ids%5B%5D=2&category_ids%5B%5D=3&category_ids%5B%5D=37\").read()\n",
      "document = BeautifulSoup(html_str)\n",
      "faculty_list = []\n",
      "experts_ul_tag = document.find('ul', attrs={'class': 'experts-list'})\n",
      "for faculty_tag in experts_ul_tag.find_all('li'):\n",
      "    # create empty dictionary to store this faculty member\n",
      "    faculty_dict = {}\n",
      "    # faculty name\n",
      "    h4_tag = faculty_tag.find('h4')\n",
      "    if h4_tag is None:\n",
      "        continue\n",
      "    a_tag = h4_tag.find('a')\n",
      "    faculty_dict['name'] = a_tag.string\n",
      "    # image URL: if <img> tag found, grab its src. if not, use None\n",
      "    img_tag = faculty_tag.find('img')\n",
      "    if img_tag is None:\n",
      "        faculty_dict['img_src'] = None\n",
      "    else:\n",
      "        faculty_dict['img_src'] = img_tag['src']\n",
      "    # title\n",
      "    p_tag = faculty_tag.find('p', attrs={'class': 'description'})\n",
      "    faculty_dict['title'] = p_tag.string\n",
      "    # append to list\n",
      "    faculty_list.append(faculty_dict)\n",
      "faculty_list"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 86,
       "text": [
        "[{'img_src': None,\n",
        "  'name': u'Adkison, Abbey ',\n",
        "  'title': u'Digital Media Coordinator'},\n",
        " {'img_src': u'/system/photos/1943/default/Dolores-Barclay.gif?1365711292',\n",
        "  'name': u'Barclay, Dolores ',\n",
        "  'title': u'Adjunct Faculty'},\n",
        " {'img_src': None, 'name': u'Baum, Geraldine', 'title': u'Adjunct Faculty'},\n",
        " {'img_src': u'/system/photos/2056/default/EBell_112811.jpg?1322508884',\n",
        "  'name': u'Bell, Emily',\n",
        "  'title': u'Professor of Professional Practice & Director, Tow Center for Digital Journalism'},\n",
        " {'img_src': u'/system/photos/2057/default/HBenedict_112811.jpg?1322509591',\n",
        "  'name': u'Benedict, Helen ',\n",
        "  'title': u'Professor'},\n",
        " {'img_src': u'/system/photos/2982/default/Bennet_John.gif?1365697019',\n",
        "  'name': u'Bennet, John ',\n",
        "  'title': u'Adjunct Faculty'},\n",
        " {'img_src': u'/system/photos/2984/default/Bennett_Rob.gif?1365706134',\n",
        "  'name': u'Bennett, Rob',\n",
        "  'title': u'Adjunct Faculty'},\n",
        " {'img_src': u'/system/photos/2725/default/Nina-Berman.gif?1365711635',\n",
        "  'name': u'Berman, Nina',\n",
        "  'title': u'Associate Professor'},\n",
        " {'img_src': None, 'name': u'Blair, Gwenda ', 'title': u'Adjunct Faculty'},\n",
        " {'img_src': u'/system/photos/2985/default/Blum_David.gif?1365706164',\n",
        "  'name': u'Blum, David ',\n",
        "  'title': u'Adjunct Faculty'},\n",
        " {'img_src': u'/system/photos/2692/default/GeorgeBodarky.gif?1365714064',\n",
        "  'name': u'Bodarky, George',\n",
        "  'title': u'Adjunct Assistant Professor '},\n",
        " {'img_src': u'/system/photos/150/default/Walt-Bogdanich.gif?1365714085',\n",
        "  'name': u'Bogdanich, Walt ',\n",
        "  'title': u'Adjunct Faculty'},\n",
        " {'img_src': u'/system/photos/3055/default/Lennart-Bourin.jpg?1368456160',\n",
        "  'name': u'Bourin, Lennart',\n",
        "  'title': u'Adjunct Faculty'},\n",
        " {'img_src': u'/system/photos/1912/default/CurtisBrainard.gif?1365714245',\n",
        "  'name': u'Brainard, Curtis ',\n",
        "  'title': u'Staff Writer'},\n",
        " {'img_src': u'/system/photos/842/default/bruder.jpg?1392672045',\n",
        "  'name': u'Bruder, Jessica',\n",
        "  'title': u'Adjunct Faculty'},\n",
        " {'img_src': u'/system/photos/864/default/burford.jpg?1392672030',\n",
        "  'name': u'Burford, Melanie ',\n",
        "  'title': u'Adjunct Faculty'},\n",
        " {'img_src': u'/system/photos/2986/default/Burleigh_Nina.gif?1365706179',\n",
        "  'name': u'Burleigh, Nina ',\n",
        "  'title': u'Adjunct Faculty'},\n",
        " {'img_src': u'/system/photos/2713/default/HeatherCabot.gif?1365714437',\n",
        "  'name': u'Cabot, Heather',\n",
        "  'title': u'Adjunct Professor'},\n",
        " {'img_src': u'/system/photos/3203/default/Elena.gif?1375217143',\n",
        "  'name': u'Cabral, Elena ',\n",
        "  'title': u'Adjunct Faculty & Assistant Director, Student Services'},\n",
        " {'img_src': u'/system/photos/2987/default/Canipe_Chris.gif?1365706198',\n",
        "  'name': u'Canipe, Chris',\n",
        "  'title': None},\n",
        " {'img_src': u'/system/photos/2988/default/Charnas_Dan.gif?1365706213',\n",
        "  'name': u'Charnas, Dan ',\n",
        "  'title': u'Adjunct Faculty'},\n",
        " {'img_src': u'/system/photos/3412/default/Cohen_Julie.jpg?1384536342',\n",
        "  'name': u'Cohen, Julie',\n",
        "  'title': u'Adjunct Faculty'},\n",
        " {'img_src': u'/system/photos/3262/default/Lisa-Cohen.gif?1376423474',\n",
        "  'name': u'Cohen, Lisa R.',\n",
        "  'title': u'Adjunct Faculty'},\n",
        " {'img_src': u'/system/photos/2989/default/Cohen_Sarah.gif?1365706228',\n",
        "  'name': u'Cohen, Sarah',\n",
        "  'title': u'Adjunct Faculty'},\n",
        " {'img_src': u'/system/photos/3278/default/Coll-web.gif?1377281425',\n",
        "  'name': u'Coll, Steve',\n",
        "  'title': u'Dean & Henry R. Luce Professor of Journalism'},\n",
        " {'img_src': u'/system/photos/147/default/AnnCooper2.jpg?1276009818',\n",
        "  'name': u'Cooper, Ann',\n",
        "  'title': u'CBS Professor of Professional Practice in International Journalism'},\n",
        " {'img_src': u'/system/photos/2990/default/Coronel_Sheila.gif?1365706241',\n",
        "  'name': u'Coronel, Sheila ',\n",
        "  'title': u'Dean of Academic Affairs'},\n",
        " {'img_src': u'/system/photos/160/default/Unknown-1.jpeg?1378227266',\n",
        "  'name': u'Coyne , Kevin ',\n",
        "  'title': u'Adjunct Faculty'},\n",
        " {'img_src': u'/system/photos/2993/default/Hockenberry_-Alison-Craiglow.gif?1365706287',\n",
        "  'name': u'Craiglow Hockenberry, Alison',\n",
        "  'title': None},\n",
        " {'img_src': u'/system/photos/2059/default/JCross_112811.jpg?1322510850',\n",
        "  'name': u'Cross, June ',\n",
        "  'title': u'Professor '},\n",
        " {'img_src': u'/system/photos/861/default/Brent-Cunningham.gif?1365714937',\n",
        "  'name': u'Cunningham, Brent ',\n",
        "  'title': u'Deputy Editor'},\n",
        " {'img_src': u'/system/photos/1265/default/ADepalma.jpg?1291223442',\n",
        "  'name': u'DePalma, Anthony',\n",
        "  'title': u'Adjunct Faculty'},\n",
        " {'img_src': u'/system/photos/1973/default/BruceDeSilva.gif?1365715036',\n",
        "  'name': u'DeSilva, Bruce ',\n",
        "  'title': u'Adjunct Faculty'},\n",
        " {'img_src': u'/system/photos/3256/default/deitsch_.jpg?1376325514',\n",
        "  'name': u'Deitsch, Richard',\n",
        "  'title': u'Adjunct Faculty'},\n",
        " {'img_src': u'/system/photos/2060/default/JDinges_112811.jpg?1322511090',\n",
        "  'name': u'Dinges, John',\n",
        "  'title': u'Godfrey Lowell Cabot Professor of Journalism'},\n",
        " {'img_src': u'/system/photos/3075/default/SDodd_horiz.gif?1369166642',\n",
        "  'name': u'Dodd, Scott',\n",
        "  'title': u'Adjunct Faculty'},\n",
        " {'img_src': u'/system/photos/1811/default/donahue.jpg?1392672057',\n",
        "  'name': u'Donahue, Kerry ',\n",
        "  'title': u'Adjunct Faculty'},\n",
        " {'img_src': None, 'name': u'Drew, Christopher ', 'title': u'Adjunct Faculty'},\n",
        " {'img_src': u'/system/photos/936/default/edsall.jpg?1304373008',\n",
        "  'name': u'Edsall, Thomas B. ',\n",
        "  'title': None},\n",
        " {'img_src': u'/system/photos/882/default/Epstein.jpg?1280954937',\n",
        "  'name': u'Epstein, Randi Hutter ',\n",
        "  'title': u'Adjunct Faculty '},\n",
        " {'img_src': None, 'name': u'Evans, Farrell ', 'title': u'Adjunct Faculty'},\n",
        " {'img_src': u'/system/photos/1977/default/TysonEvans.gif?1365715311',\n",
        "  'name': u'Evans , Tyson ',\n",
        "  'title': u'Adjunct Faculty'},\n",
        " {'img_src': u'/system/photos/3013/default/Fishman.gif?1365724719',\n",
        "  'name': u'Fishman, Elizabeth Weinreb',\n",
        "  'title': u'Associate Dean for Communications'},\n",
        " {'img_src': u'/system/photos/3488/default/ford.jpg?1392672068',\n",
        "  'name': u'Ford, Constance Mitchell ',\n",
        "  'title': u'Adjunct Faculty'},\n",
        " {'img_src': u'/system/photos/839/default/frederick.jpg?1392672079',\n",
        "  'name': u'Frederick, Pamela Platt',\n",
        "  'title': u'Adjunct Faculty'},\n",
        " {'img_src': u'/system/photos/2061/default/SFreedman_112811.jpg?1322511767',\n",
        "  'name': u'Freedman, Samuel ',\n",
        "  'title': u'Professor'},\n",
        " {'img_src': None, 'name': u'Freeman, George ', 'title': u'Adjunct Faculty'},\n",
        " {'img_src': u'/system/photos/2991/default/Freeman_John.gif?1365706256',\n",
        "  'name': u'Freeman, John',\n",
        "  'title': u'Adjunct Faculty'},\n",
        " {'img_src': u'/system/photos/93/default/HFrench.jpg?1291237303',\n",
        "  'name': u'French, Howard ',\n",
        "  'title': u'Associate Professor'},\n",
        " {'img_src': u'/system/photos/162/default/Stephen_Fried.gif?1365716551',\n",
        "  'name': u'Fried, Stephen ',\n",
        "  'title': u'Adjunct Faculty'},\n",
        " {'img_src': u'/system/photos/3119/default/Vanessa.gif?1371498827',\n",
        "  'name': u'Gezari, Vanessa',\n",
        "  'title': None},\n",
        " {'img_src': None, 'name': u'Gilderman, Greg', 'title': u'Adjunct Faculty'},\n",
        " {'img_src': u'/system/photos/837/default/SigGissler.gif?1365716632',\n",
        "  'name': u'Gissler, Sig',\n",
        "  'title': u'Administrator'},\n",
        " {'img_src': u'/system/photos/113/default/TGitlin.jpg?1291237356',\n",
        "  'name': u'Gitlin, Todd',\n",
        "  'title': u'Professor & Chair, Ph.D. Program'},\n",
        " {'img_src': None, 'name': u'Giudice, Barbara ', 'title': u'Adjunct Faculty'},\n",
        " {'img_src': u'/system/photos/836/default/MartyGoldensohn.gif?1365716789',\n",
        "  'name': u'Goldensohn, Marty',\n",
        "  'title': u'Adjunct Faculty'},\n",
        " {'img_src': u'/system/photos/86/default/AGoldman.jpg?1291237401',\n",
        "  'name': u'Goldman, Ari ',\n",
        "  'title': u'Professor'},\n",
        " {'img_src': None, 'name': u'Goldstein, Jacob', 'title': u'Adjunct Professor'},\n",
        " {'img_src': u'/system/photos/88/default/WGrueskin.jpg?1291236298',\n",
        "  'name': u'Grueskin, Bill',\n",
        "  'title': u'Professor of Professional Practice '},\n",
        " {'img_src': u'/system/photos/1512/default/AHaburchak.gif?1365716878',\n",
        "  'name': u'Haburchak, Alan',\n",
        "  'title': u'Adjunct Faculty'},\n",
        " {'img_src': u'/system/photos/2992/default/Hajdu_David.gif?1365706270',\n",
        "  'name': u'Hajdu, David ',\n",
        "  'title': u'Associate Professor '},\n",
        " {'img_src': None, 'name': u'Hall, Stephen', 'title': u'Adjunct Faculty'},\n",
        " {'img_src': u'/system/photos/97/default/LynNell-faculty.jpg?1368455738',\n",
        "  'name': u'Hancock, LynNell',\n",
        "  'title': u'H. Gordon Garbedian Professor of Journalism & Director, Spencer Fellowship Program'},\n",
        " {'img_src': u'/system/photos/3545/default/hansen.jpg?1392670367',\n",
        "  'name': u'Hansen, Mark',\n",
        "  'title': u'Director, David and Helen Gurley Brown Institute for Media Innovation & Professor of Journalism '},\n",
        " {'img_src': None, 'name': u'Harris, Mark', 'title': u'Adjunct Faculty'},\n",
        " {'img_src': u'/system/photos/3210/default/Julie.gif?1375217224',\n",
        "  'name': u'Hartenstein, Julie',\n",
        "  'title': u'Associate Dean'},\n",
        " {'img_src': u'/system/photos/1530/default/LarryHeinzerling.gif?1365717071',\n",
        "  'name': u'Heinzerling, Larry',\n",
        "  'title': u'Adjunct Faculty'},\n",
        " {'img_src': u'/system/photos/154/default/TomHerman.gif?1365718880',\n",
        "  'name': u'Herman, Tom ',\n",
        "  'title': u'Adjunct Faculty'},\n",
        " {'img_src': None, 'name': u'Hickey, Neil ', 'title': u'Adjunct Faculty'},\n",
        " {'img_src': u'/system/photos/3059/default/Lars-head-shot-v2.jpg?1368463497',\n",
        "  'name': u'Hoel, Lars ',\n",
        "  'title': None},\n",
        " {'img_src': u'/system/photos/3529/default/hogan.jpg?1392672092',\n",
        "  'name': u'Hogan, Pamela',\n",
        "  'title': u'Adjunct Faculty'},\n",
        " {'img_src': u'/system/photos/155/default/Marguerite_Holloway2.jpg?1276019659',\n",
        "  'name': u'Holloway, Marguerite ',\n",
        "  'title': u'Associate Professor & Director, Science and Environmental Journalism'},\n",
        " {'img_src': u'/system/photos/2994/default/Hoyt_-Mike.gif?1365706301',\n",
        "  'name': u'Hoyt, Michael ',\n",
        "  'title': u'Adjunct Faculty'},\n",
        " {'img_src': u'/system/photos/2064/default/SIsaacs_112811.jpg?1322512995',\n",
        "  'name': u'Isaacs, Stephen',\n",
        "  'title': u'Professor Emeritus of Journalism'},\n",
        " {'img_src': u'/system/photos/101/default/RJohn.jpg?1291236478',\n",
        "  'name': u'John, Richard R. ',\n",
        "  'title': u'Professor '},\n",
        " {'img_src': None,\n",
        "  'name': u'Jones, Matthew L. ',\n",
        "  'title': u'Instructor, The Lede Program'},\n",
        " {'img_src': None, 'name': u'Kalita, S. Mitra ', 'title': u'Adjunct Faculty '},\n",
        " {'img_src': None, 'name': u'Kann, Peter R. ', 'title': u'Adjunct Faculty'},\n",
        " {'img_src': u'/system/photos/833/default/kantrowitz.jpg?1392672104',\n",
        "  'name': u'Kantrowitz, Barbara',\n",
        "  'title': u'Adjunct Faculty & Associate Director, Continuing Education'},\n",
        " {'img_src': u'/system/photos/2995/default/Karle_-Stuart.gif?1365706445',\n",
        "  'name': u'Karle, Stuart',\n",
        "  'title': u'Adjunct Faculty; William J. Brennan Jr. Visiting Professor of First Amendment Issues'},\n",
        " {'img_src': u'/system/photos/858/default/RickKArr.gif?1365717241',\n",
        "  'name': u'Karr, Rick',\n",
        "  'title': u'Adjunct Faculty'},\n",
        " {'img_src': u'/system/photos/158/default/ThomasKent.gif?1365717319',\n",
        "  'name': u'Kent, Thomas ',\n",
        "  'title': u'Adjunct Faculty'},\n",
        " {'img_src': u'/system/photos/1259/default/DKlatell.jpg?1291217138',\n",
        "  'name': u'Klatell, David',\n",
        "  'title': u'Professor of Professional Practice & Chair, International Studies'},\n",
        " {'img_src': None, 'name': u'Klein, Adam', 'title': u'Adjunct Professor'},\n",
        " {'img_src': u'/system/photos/3056/default/Kim-Kleman-1.jpg?1368463019',\n",
        "  'name': u'Kleman, Kim ',\n",
        "  'title': u'Adjunct Faculty'},\n",
        " {'img_src': u'/system/photos/832/default/knee.jpg?1392672119',\n",
        "  'name': u'Knee, Jonathan',\n",
        "  'title': u'Adjunct Professor'},\n",
        " {'img_src': None, 'name': u'Konner, Joan', 'title': u'Dean Emerita'},\n",
        " {'img_src': u'/system/photos/2507/default/MKottler.jpg?1336577232',\n",
        "  'name': u'Kottler, Mark',\n",
        "  'title': u'Adjunct Faculty'},\n",
        " {'img_src': u'/system/photos/857/default/landis.jpg?1392672129',\n",
        "  'name': u'Landis, Peter',\n",
        "  'title': u'Adjunct Faculty'},\n",
        " {'img_src': u'/system/photos/2650/default/DLee_New.jpg?1345145608',\n",
        "  'name': u'Lee, Deborah ',\n",
        "  'title': u'Adjunct Faculty'},\n",
        " {'img_src': u'/system/photos/156/default/haupt.jpg?1392672140',\n",
        "  'name': u'Lehmann-Haupt, Christopher ',\n",
        "  'title': u'Adjunct Faculty'},\n",
        " {'img_src': u'/system/photos/2065/default/NLemann_112811.jpg?1322513292',\n",
        "  'name': u'Lemann, Nicholas',\n",
        "  'title': u'Joseph Pulitzer II and Edith Pulitzer Moore Professor of Journalism; Dean Emeritus'},\n",
        " {'img_src': None, 'name': u'Levenson, Jacob ', 'title': u'Adjunct Faculty'},\n",
        " {'img_src': u'/system/photos/157/default/SethLipsky.gif?1365717822',\n",
        "  'name': u'Lipsky, Seth ',\n",
        "  'title': u'Adjunct Faculty'},\n",
        " {'img_src': u'/system/photos/2066/default/RLipton_112811.jpg?1322513465',\n",
        "  'name': u'Lipton, Rhoda ',\n",
        "  'title': u'Senior Lecturer in Discipline'},\n",
        " {'img_src': None, 'name': u'Lombardi, Kristen', 'title': u'Adjunct Faculty'},\n",
        " {'img_src': u'/system/photos/1391/default/TamiLuhby.gif?1365717895',\n",
        "  'name': u'Luhby, Tami',\n",
        "  'title': u'Adjunct Faculty'},\n",
        " {'img_src': u'/system/photos/3263/default/Tony-Maciulius.gif?1376424088',\n",
        "  'name': u'Maciulis, Tony',\n",
        "  'title': u'Adjunct Faculty'},\n",
        " {'img_src': u'/system/photos/2996/default/Maharidge_-Dale.gif?1365706460',\n",
        "  'name': u'Maharidge, Dale ',\n",
        "  'title': u'Professor '},\n",
        " {'img_src': u'/system/photos/2690/default/TomMason.gif?1365718118',\n",
        "  'name': u'Mason, Tom',\n",
        "  'title': None},\n",
        " {'img_src': u'/system/photos/2997/default/Matloff_-Judith.gif?1365706478',\n",
        "  'name': u'Matloff, Judith ',\n",
        "  'title': u'Adjunct faculty'},\n",
        " {'img_src': None, 'name': u'Maytal, Itai', 'title': u'Adjunct Faculty'},\n",
        " {'img_src': None, 'name': u'McCormick, David ', 'title': u'Adjunct Faculty'},\n",
        " {'img_src': None, 'name': u'McCray, Melvin', 'title': u'Adjunct Faculty'},\n",
        " {'img_src': u'/system/photos/2998/default/McDonald_-Erica.gif?1365706494',\n",
        "  'name': u'McDonald, Erica',\n",
        "  'title': None},\n",
        " {'img_src': u'/system/photos/2068/default/SMcGregor_112811.jpg?1322514173',\n",
        "  'name': u'McGregor, Susan E.',\n",
        "  'title': u'Assistant Professor & Assistant Director, Tow Center for Digital Journalism'},\n",
        " {'img_src': None, 'name': u'McMasters, Kelly', 'title': u'Adjunct Faculty'},\n",
        " {'img_src': None, 'name': u'Mencher, Melvin', 'title': u'Professor Emeritus'},\n",
        " {'img_src': u'/system/photos/2999/default/Merchant_-Preston.gif?1365706509',\n",
        "  'name': u'Merchant, Preston',\n",
        "  'title': None},\n",
        " {'img_src': None,\n",
        "  'name': u'Miller , Stephen C.',\n",
        "  'title': u'Adjunct Faculty'},\n",
        " {'img_src': u'/system/photos/3000/default/Mintz_-Jim.gif?1365706529',\n",
        "  'name': u'Mintz, James',\n",
        "  'title': u'Adjunct Faculty'},\n",
        " {'img_src': u'/system/photos/2069/default/SNasar_112811.jpg?1322514324',\n",
        "  'name': u'Nasar, Sylvia ',\n",
        "  'title': u'John S. and James L. Knight Professor of Business Journalism'},\n",
        " {'img_src': u'/system/photos/2070/default/VNavasky_112811.jpg?1322514777',\n",
        "  'name': u'Navasky, Victor ',\n",
        "  'title': u'George T. Delacorte Professor in Magazine Journalism; Director, Delacorte Center for Magazine Journalism; Chair, Columbia Journalism Review '},\n",
        " {'img_src': None, 'name': u'Newman, Maria', 'title': u'Adjunct Faculty'},\n",
        " {'img_src': u'/system/photos/2714/default/nisenholtz.jpg?1392672182',\n",
        "  'name': u'Nisenholtz, Martin',\n",
        "  'title': u'Adjunct Professor '},\n",
        " {'img_src': u'/system/photos/854/default/nocera.jpg?1392672191',\n",
        "  'name': u'Nocera, Joseph',\n",
        "  'title': u'Adjunct Faculty'},\n",
        " {'img_src': u'/system/photos/2659/default/RNorton.jpg?1345495471',\n",
        "  'name': u'Norton, Rob',\n",
        "  'title': u'Adjunct Faculty'},\n",
        " {'img_src': u'/system/photos/3288/default/habibanosheen2.jpg?1378324386',\n",
        "  'name': u'Nosheen, Habiba',\n",
        "  'title': u'Adjunct Professor'},\n",
        " {'img_src': u'/system/photos/827/default/AmyNutt.gif?1365718410',\n",
        "  'name': u'Nutt, Amy',\n",
        "  'title': u'Adjunct Faculty'},\n",
        " {'img_src': u'/system/photos/3074/default/Bridget-O_Brian-headshot.gif?1369166593',\n",
        "  'name': u\"O'Brian, Bridget \",\n",
        "  'title': None},\n",
        " {'img_src': u'/system/photos/1993/default/CharlesOrnstein.gif?1365718296',\n",
        "  'name': u'Ornstein, Charles',\n",
        "  'title': u'Adjunct Faculty'},\n",
        " {'img_src': u'/system/photos/826/default/padawer.jpg?1392672201',\n",
        "  'name': u'Padawer , Ruth',\n",
        "  'title': u'Adjunct Professor'},\n",
        " {'img_src': u'/system/photos/2073/default/SPadwe_112811.jpg?1322516163',\n",
        "  'name': u'Padwe, Sandy ',\n",
        "  'title': u'Special Lecturer'},\n",
        " {'img_src': None, 'name': u'Parker, Diantha', 'title': u'Adjunct Faculty'},\n",
        " {'img_src': None,\n",
        "  'name': u'Parrish, Adam ',\n",
        "  'title': u'Instructor, The Lede Program'},\n",
        " {'img_src': u'/system/photos/1914/default/Patel_Headshot2.jpg?1319040513',\n",
        "  'name': u'Patel, Samir S.',\n",
        "  'title': u'Adjunct Faculty'},\n",
        " {'img_src': u'/system/photos/3001/default/Perlman_-Merrill.gif?1365706552',\n",
        "  'name': u'Perlman, Merrill',\n",
        "  'title': u'Adjunct Faculty'},\n",
        " {'img_src': u'/system/photos/852/default/pool-eckert.jpg?1392672210',\n",
        "  'name': u'Pool-Eckert, Marquita',\n",
        "  'title': u'Adjunct Faculty'},\n",
        " {'img_src': None, 'name': u'Quinn, T.J. ', 'title': u'Adjunct Faculty'},\n",
        " {'img_src': None, 'name': u'Rate, Betsy', 'title': u'Adjunct Faculty'},\n",
        " {'img_src': None, 'name': u'Richardson, Lynda ', 'title': u'Adjunct Faculty'},\n",
        " {'img_src': u'/system/photos/824/default/richmn.jpg?1392672219',\n",
        "  'name': u'Richman, Joe',\n",
        "  'title': u'Adjunct Faculty'},\n",
        " {'img_src': u'/system/photos/3377/default/robbins.jpg?1392672228',\n",
        "  'name': u'Robbins, Ed',\n",
        "  'title': u'Adjunct Faculty'},\n",
        " {'img_src': None, 'name': u'Roberts, Fletcher', 'title': u'Adjunct Faculty'},\n",
        " {'img_src': u'/system/photos/3002/default/Rubinstein_-Julian.gif?1365706572',\n",
        "  'name': u'Rubinstein, Julian',\n",
        "  'title': u'Web Editor'},\n",
        " {'img_src': u'/system/photos/851/default/Sacha.jpg?1280952529',\n",
        "  'name': u'Sacha, Bob ',\n",
        "  'title': u'Adjunct Faculty'},\n",
        " {'img_src': u'/system/photos/823/default/RichSchapiro.gif?1365718608',\n",
        "  'name': u'Schapiro, Rich',\n",
        "  'title': u'Adjunct Faculty'},\n",
        " {'img_src': None, 'name': u'Schatz, Robin', 'title': u'Adjunct Faculty'},\n",
        " {'img_src': u'/system/photos/1913/default/BJSchechter.gif?1365718488',\n",
        "  'name': u'Schecter, B.J.',\n",
        "  'title': u'Adjunct Faculty'},\n",
        " {'img_src': u'/system/photos/3584/default/HilkeSchellmann_final.jpg?1395348920',\n",
        "  'name': u'Schellmann, Hilke',\n",
        "  'title': u'Adjunct Faculty'},\n",
        " {'img_src': u'/system/photos/3432/default/schoen.jpg?1392672237',\n",
        "  'name': u'Schoen, John',\n",
        "  'title': u'Adjunct Faculty'},\n",
        " {'img_src': None,\n",
        "  'name': u'Schoonmaker, Mary Ellen',\n",
        "  'title': u'Adjunct Faculty'},\n",
        " {'img_src': u'/system/photos/2075/default/MSchudson_112811.jpg?1322517107',\n",
        "  'name': u'Schudson, Michael ',\n",
        "  'title': u'Professor '},\n",
        " {'img_src': u'/system/photos/2076/default/ESchumacher_112811.jpg?1322517858',\n",
        "  'name': u'Schumacher-Matos, Ed ',\n",
        "  'title': u'Adjunct Faculty'},\n",
        " {'img_src': None, 'name': u'Schwartz, Jack ', 'title': None},\n",
        " {'img_src': u'/system/photos/872/default/Seave.jpg?1280954557',\n",
        "  'name': u'Seave, Ava ',\n",
        "  'title': u'Adjunct Faculty'},\n",
        " {'img_src': u'/system/photos/849/default/Seidman.gif?1365718701',\n",
        "  'name': u'Seideman, David',\n",
        "  'title': u'Adjunct Faculty'},\n",
        " {'img_src': None,\n",
        "  'name': u'Shanor, Donald',\n",
        "  'title': u'G. L. Cabot Professor Emeritus '},\n",
        " {'img_src': u'/system/photos/3003/default/Shapiro_-Bruce.gif?1365706590',\n",
        "  'name': u'Shapiro, Bruce',\n",
        "  'title': u'Executive Director'},\n",
        " {'img_src': u'/system/photos/2077/default/MShapiro_112811.jpg?1322518042',\n",
        "  'name': u'Shapiro, Michael ',\n",
        "  'title': u'Professor '},\n",
        " {'img_src': u'/system/photos/2682/default/ahmed-shihab-eldin.gif?1365717716',\n",
        "  'name': u'Shihab-Eldin, Ahmed ',\n",
        "  'title': u'Adjunct Assistant Professor '},\n",
        " {'img_src': u'/system/photos/3235/default/Siegel_-Lloyd-2012.jpg?1375710566',\n",
        "  'name': u'Siegel, Lloyd',\n",
        "  'title': u'Adjunct Faculty'},\n",
        " {'img_src': u'/system/photos/871/default/singer.jpg?1392672245',\n",
        "  'name': u'Singer, Amy ',\n",
        "  'title': u'Adjunct Faculty'},\n",
        " {'img_src': u'/system/photos/1953/default/MariaSliwa.gif?1365717479',\n",
        "  'name': u'Sliwa, Maria',\n",
        "  'title': u'Adjunct Faculty'},\n",
        " {'img_src': u'/system/photos/3004/default/Solomon_-Alisa.gif?1365706611',\n",
        "  'name': u'Solomon, Alisa',\n",
        "  'title': u'Professor & Director, Arts Concentration, M.A. Program'},\n",
        " {'img_src': u'/system/photos/3664/default/jonathan-soma.jpg?1399473825',\n",
        "  'name': u'Soma, Jonathan',\n",
        "  'title': u'Instructor, The Lede Program'},\n",
        " {'img_src': u'/system/photos/3204/default/Ernie.gif?1375217153',\n",
        "  'name': u'Sotomayor, Ernest',\n",
        "  'title': u'Dean of Student Affairs'},\n",
        " {'img_src': u'/system/photos/848/default/PaulaSpan.gif?1365717399',\n",
        "  'name': u'Span, Paula ',\n",
        "  'title': u'Adjunct Professor'},\n",
        " {'img_src': u'/system/photos/2079/default/SSreenavisan_112811.jpg?1322518253',\n",
        "  'name': u'Sreenivasan , Sree ',\n",
        "  'title': u'Adjunct Faculty'},\n",
        " {'img_src': u'/system/photos/3236/default/Karen-Stabiner.jpg?1375717512',\n",
        "  'name': u'Stabiner, Karen',\n",
        "  'title': u'Adjunct Faculty'},\n",
        " {'img_src': u'/system/photos/2080/default/JStewart_112811.jpg?1322518368',\n",
        "  'name': u'Stewart, James',\n",
        "  'title': u'Bloomberg Professor of Business Journalism'},\n",
        " {'img_src': u'/system/photos/82/default/Stille.gif?1365718788',\n",
        "  'name': u'Stille, Alexander',\n",
        "  'title': u'San Paolo Professor of International Journalism'},\n",
        " {'img_src': u'/system/photos/3523/default/stivers.jpg?1392672253',\n",
        "  'name': u'Stivers, Cyndi',\n",
        "  'title': u'Adjunct Faculty'},\n",
        " {'img_src': u'/system/photos/3005/default/Subramanian_-Sushma.gif?1365706634',\n",
        "  'name': u'Subramanian, Sushma',\n",
        "  'title': u'Adjunct Faculty'},\n",
        " {'img_src': u'/system/photos/3434/default/Mike_Sullivan.jpg?1386103502',\n",
        "  'name': u'Sullivan, Michael',\n",
        "  'title': u'Adjunct Faculty'},\n",
        " {'img_src': u'/system/photos/847/default/Surowicz.jpg?1280952446',\n",
        "  'name': u'Surowicz, Simon',\n",
        "  'title': None},\n",
        " {'img_src': u'/system/photos/3006/default/Tamman_-Maurice.gif?1365706648',\n",
        "  'name': u'Tamman, Maurice',\n",
        "  'title': u'Adjunct Faculty'},\n",
        " {'img_src': None,\n",
        "  'name': u'Tenen, Dennis',\n",
        "  'title': u'Instructor, The Lede Program'},\n",
        " {'img_src': u'/system/photos/105/default/topping.jpg?1392672489',\n",
        "  'name': u'Topping, Seymour  ',\n",
        "  'title': u'San Paolo Professor of International Journalism Emeritus'},\n",
        " {'img_src': u'/system/photos/3057/default/Dody.jpg?1368463129',\n",
        "  'name': u'Tsiantar, Dody ',\n",
        "  'title': u'Adjunct Faculty'},\n",
        " {'img_src': u'/system/photos/2082/default/DLinhTu_112811.jpg?1322583281',\n",
        "  'name': u'Tu, Duy Linh',\n",
        "  'title': u'Assistant Professor of Professional Practice & Director, Digital Media Program '},\n",
        " {'img_src': u'/system/photos/125/default/Andie_Tucher2.jpg?1275665800',\n",
        "  'name': u'Tucher, Andie ',\n",
        "  'title': u'Associate Professor; Director, Ph.D. Program'},\n",
        " {'img_src': u'/system/photos/3188/default/Mike-Ventura.gif?1374078031',\n",
        "  'name': u'Ventura, Michael',\n",
        "  'title': u'Adjunct Faculty'},\n",
        " {'img_src': u'/system/photos/3007/default/Wald_-Jonathan.gif?1365706666',\n",
        "  'name': u'Wald, Jonathan',\n",
        "  'title': u'Adjunt Faculty'},\n",
        " {'img_src': u'/system/photos/126/default/Richard_Wald2.jpg?1275665983',\n",
        "  'name': u'Wald, Richard',\n",
        "  'title': u'Fred W. Friendly Professor of Professional Practice in Media and Society'},\n",
        " {'img_src': u'/system/photos/1261/default/wayne.jpg?1392672262',\n",
        "  'name': u'Wayne, Leslie',\n",
        "  'title': u'Adjunct Faculty'},\n",
        " {'img_src': u'/system/photos/127/default/JWeiner.jpg?1291237982',\n",
        "  'name': u'Weiner, Jonathan ',\n",
        "  'title': u'Maxwell M. Geffen Professor of Medical and Scientific Journalism '},\n",
        " {'img_src': u'/system/photos/2269/default/weiss.jpg?1392672270',\n",
        "  'name': u'Weiss, Gary',\n",
        "  'title': u'Adjunct Faculty'},\n",
        " {'img_src': u'/system/photos/819/default/welby.jpg?1392672278',\n",
        "  'name': u'Welby, Julianne ',\n",
        "  'title': u'Adjunct Faculty'},\n",
        " {'img_src': u'/system/photos/128/default/Betsy_West2.jpg?1275668385',\n",
        "  'name': u'West, Betsy ',\n",
        "  'title': u'Associate Professor of Professional Practice'},\n",
        " {'img_src': u'/system/photos/3008/default/Wheatley_-Bill.gif?1365706683',\n",
        "  'name': u'Wheatley, Jr., William',\n",
        "  'title': u'Adjunct Faculty'},\n",
        " {'img_src': u'/system/photos/3665/default/chris-wiggins.jpg?1399473834',\n",
        "  'name': u'Wiggins, Chris',\n",
        "  'title': u'Instructor, The Lede Program'},\n",
        " {'img_src': u'/system/photos/3009/default/Williams_-Josh---high-res.gif?1365706700',\n",
        "  'name': u'Williams, Josh',\n",
        "  'title': u'Adjunct Faculty'},\n",
        " {'img_src': None, 'name': u'Wilson, Duff', 'title': u'Adjunct Faculty'},\n",
        " {'img_src': None, 'name': u'Wolk, Joshua', 'title': u'Adjunct Faculty'},\n",
        " {'img_src': u'/system/photos/3010/default/woodward.jpg?1392672287',\n",
        "  'name': u'Woodward, Tali ',\n",
        "  'title': u'Adjunct Faculty & Director, M.A. Program'},\n",
        " {'img_src': None,\n",
        "  'name': u'Yu, Frederick T C.',\n",
        "  'title': u'CBS Professor Emeritus International Journalism'},\n",
        " {'img_src': None, 'name': u'Zucker, John', 'title': u'Adjunct Faculty'},\n",
        " {'img_src': u'/system/photos/3058/default/zuckerman.jpg?1392672295',\n",
        "  'name': u'Zuckerman, Jocelyn Craugh ',\n",
        "  'title': u'Adjunct Faculty'}]"
       ]
      }
     ],
     "prompt_number": 86
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "It worked! So good. Now we can do fun stuff with the data, like making a pandas data frame and seeing how many are listed as \"Adjunct Faculty\":"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import pandas as pd\n",
      "\n",
      "faculty_frame = pd.DataFrame(faculty_list)\n",
      "faculty_frame[faculty_frame[\"title\"]==\"Adjunct Faculty\"]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "html": [
        "<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
        "<table border=\"1\" class=\"dataframe\">\n",
        "  <thead>\n",
        "    <tr style=\"text-align: right;\">\n",
        "      <th></th>\n",
        "      <th>img_src</th>\n",
        "      <th>name</th>\n",
        "      <th>title</th>\n",
        "    </tr>\n",
        "  </thead>\n",
        "  <tbody>\n",
        "    <tr>\n",
        "      <th>1  </th>\n",
        "      <td> /system/photos/1943/default/Dolores-Barclay.gi...</td>\n",
        "      <td>          Barclay, Dolores </td>\n",
        "      <td> Adjunct Faculty</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>2  </th>\n",
        "      <td>                                              None</td>\n",
        "      <td>            Baum, Geraldine</td>\n",
        "      <td> Adjunct Faculty</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>5  </th>\n",
        "      <td> /system/photos/2982/default/Bennet_John.gif?13...</td>\n",
        "      <td>              Bennet, John </td>\n",
        "      <td> Adjunct Faculty</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>6  </th>\n",
        "      <td> /system/photos/2984/default/Bennett_Rob.gif?13...</td>\n",
        "      <td>               Bennett, Rob</td>\n",
        "      <td> Adjunct Faculty</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>8  </th>\n",
        "      <td>                                              None</td>\n",
        "      <td>             Blair, Gwenda </td>\n",
        "      <td> Adjunct Faculty</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>9  </th>\n",
        "      <td> /system/photos/2985/default/Blum_David.gif?136...</td>\n",
        "      <td>               Blum, David </td>\n",
        "      <td> Adjunct Faculty</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>11 </th>\n",
        "      <td> /system/photos/150/default/Walt-Bogdanich.gif?...</td>\n",
        "      <td>           Bogdanich, Walt </td>\n",
        "      <td> Adjunct Faculty</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>12 </th>\n",
        "      <td> /system/photos/3055/default/Lennart-Bourin.jpg...</td>\n",
        "      <td>            Bourin, Lennart</td>\n",
        "      <td> Adjunct Faculty</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>14 </th>\n",
        "      <td>  /system/photos/842/default/bruder.jpg?1392672045</td>\n",
        "      <td>            Bruder, Jessica</td>\n",
        "      <td> Adjunct Faculty</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>15 </th>\n",
        "      <td> /system/photos/864/default/burford.jpg?1392672030</td>\n",
        "      <td>          Burford, Melanie </td>\n",
        "      <td> Adjunct Faculty</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>16 </th>\n",
        "      <td> /system/photos/2986/default/Burleigh_Nina.gif?...</td>\n",
        "      <td>            Burleigh, Nina </td>\n",
        "      <td> Adjunct Faculty</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>20 </th>\n",
        "      <td> /system/photos/2988/default/Charnas_Dan.gif?13...</td>\n",
        "      <td>              Charnas, Dan </td>\n",
        "      <td> Adjunct Faculty</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>21 </th>\n",
        "      <td> /system/photos/3412/default/Cohen_Julie.jpg?13...</td>\n",
        "      <td>               Cohen, Julie</td>\n",
        "      <td> Adjunct Faculty</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>22 </th>\n",
        "      <td> /system/photos/3262/default/Lisa-Cohen.gif?137...</td>\n",
        "      <td>             Cohen, Lisa R.</td>\n",
        "      <td> Adjunct Faculty</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>23 </th>\n",
        "      <td> /system/photos/2989/default/Cohen_Sarah.gif?13...</td>\n",
        "      <td>               Cohen, Sarah</td>\n",
        "      <td> Adjunct Faculty</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>27 </th>\n",
        "      <td> /system/photos/160/default/Unknown-1.jpeg?1378...</td>\n",
        "      <td>             Coyne , Kevin </td>\n",
        "      <td> Adjunct Faculty</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>31 </th>\n",
        "      <td> /system/photos/1265/default/ADepalma.jpg?12912...</td>\n",
        "      <td>           DePalma, Anthony</td>\n",
        "      <td> Adjunct Faculty</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>32 </th>\n",
        "      <td> /system/photos/1973/default/BruceDeSilva.gif?1...</td>\n",
        "      <td>            DeSilva, Bruce </td>\n",
        "      <td> Adjunct Faculty</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>33 </th>\n",
        "      <td> /system/photos/3256/default/deitsch_.jpg?13763...</td>\n",
        "      <td>           Deitsch, Richard</td>\n",
        "      <td> Adjunct Faculty</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>35 </th>\n",
        "      <td> /system/photos/3075/default/SDodd_horiz.gif?13...</td>\n",
        "      <td>                Dodd, Scott</td>\n",
        "      <td> Adjunct Faculty</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>36 </th>\n",
        "      <td> /system/photos/1811/default/donahue.jpg?139267...</td>\n",
        "      <td>            Donahue, Kerry </td>\n",
        "      <td> Adjunct Faculty</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>37 </th>\n",
        "      <td>                                              None</td>\n",
        "      <td>         Drew, Christopher </td>\n",
        "      <td> Adjunct Faculty</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>40 </th>\n",
        "      <td>                                              None</td>\n",
        "      <td>            Evans, Farrell </td>\n",
        "      <td> Adjunct Faculty</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>41 </th>\n",
        "      <td> /system/photos/1977/default/TysonEvans.gif?136...</td>\n",
        "      <td>             Evans , Tyson </td>\n",
        "      <td> Adjunct Faculty</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>43 </th>\n",
        "      <td>   /system/photos/3488/default/ford.jpg?1392672068</td>\n",
        "      <td>  Ford, Constance Mitchell </td>\n",
        "      <td> Adjunct Faculty</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>44 </th>\n",
        "      <td> /system/photos/839/default/frederick.jpg?13926...</td>\n",
        "      <td>    Frederick, Pamela Platt</td>\n",
        "      <td> Adjunct Faculty</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>46 </th>\n",
        "      <td>                                              None</td>\n",
        "      <td>           Freeman, George </td>\n",
        "      <td> Adjunct Faculty</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>47 </th>\n",
        "      <td> /system/photos/2991/default/Freeman_John.gif?1...</td>\n",
        "      <td>              Freeman, John</td>\n",
        "      <td> Adjunct Faculty</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>49 </th>\n",
        "      <td> /system/photos/162/default/Stephen_Fried.gif?1...</td>\n",
        "      <td>            Fried, Stephen </td>\n",
        "      <td> Adjunct Faculty</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>51 </th>\n",
        "      <td>                                              None</td>\n",
        "      <td>            Gilderman, Greg</td>\n",
        "      <td> Adjunct Faculty</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>...</th>\n",
        "      <td>...</td>\n",
        "      <td>...</td>\n",
        "      <td>...</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>135</th>\n",
        "      <td>   /system/photos/851/default/Sacha.jpg?1280952529</td>\n",
        "      <td>                Sacha, Bob </td>\n",
        "      <td> Adjunct Faculty</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>136</th>\n",
        "      <td> /system/photos/823/default/RichSchapiro.gif?13...</td>\n",
        "      <td>             Schapiro, Rich</td>\n",
        "      <td> Adjunct Faculty</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>137</th>\n",
        "      <td>                                              None</td>\n",
        "      <td>              Schatz, Robin</td>\n",
        "      <td> Adjunct Faculty</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>138</th>\n",
        "      <td> /system/photos/1913/default/BJSchechter.gif?13...</td>\n",
        "      <td>             Schecter, B.J.</td>\n",
        "      <td> Adjunct Faculty</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>139</th>\n",
        "      <td> /system/photos/3584/default/HilkeSchellmann_fi...</td>\n",
        "      <td>          Schellmann, Hilke</td>\n",
        "      <td> Adjunct Faculty</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>140</th>\n",
        "      <td> /system/photos/3432/default/schoen.jpg?1392672237</td>\n",
        "      <td>               Schoen, John</td>\n",
        "      <td> Adjunct Faculty</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>141</th>\n",
        "      <td>                                              None</td>\n",
        "      <td>    Schoonmaker, Mary Ellen</td>\n",
        "      <td> Adjunct Faculty</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>143</th>\n",
        "      <td> /system/photos/2076/default/ESchumacher_112811...</td>\n",
        "      <td>      Schumacher-Matos, Ed </td>\n",
        "      <td> Adjunct Faculty</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>145</th>\n",
        "      <td>   /system/photos/872/default/Seave.jpg?1280954557</td>\n",
        "      <td>                Seave, Ava </td>\n",
        "      <td> Adjunct Faculty</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>146</th>\n",
        "      <td> /system/photos/849/default/Seidman.gif?1365718701</td>\n",
        "      <td>            Seideman, David</td>\n",
        "      <td> Adjunct Faculty</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>151</th>\n",
        "      <td> /system/photos/3235/default/Siegel_-Lloyd-2012...</td>\n",
        "      <td>              Siegel, Lloyd</td>\n",
        "      <td> Adjunct Faculty</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>152</th>\n",
        "      <td>  /system/photos/871/default/singer.jpg?1392672245</td>\n",
        "      <td>               Singer, Amy </td>\n",
        "      <td> Adjunct Faculty</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>153</th>\n",
        "      <td> /system/photos/1953/default/MariaSliwa.gif?136...</td>\n",
        "      <td>               Sliwa, Maria</td>\n",
        "      <td> Adjunct Faculty</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>158</th>\n",
        "      <td> /system/photos/2079/default/SSreenavisan_11281...</td>\n",
        "      <td>        Sreenivasan , Sree </td>\n",
        "      <td> Adjunct Faculty</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>159</th>\n",
        "      <td> /system/photos/3236/default/Karen-Stabiner.jpg...</td>\n",
        "      <td>            Stabiner, Karen</td>\n",
        "      <td> Adjunct Faculty</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>162</th>\n",
        "      <td> /system/photos/3523/default/stivers.jpg?139267...</td>\n",
        "      <td>             Stivers, Cyndi</td>\n",
        "      <td> Adjunct Faculty</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>163</th>\n",
        "      <td> /system/photos/3005/default/Subramanian_-Sushm...</td>\n",
        "      <td>        Subramanian, Sushma</td>\n",
        "      <td> Adjunct Faculty</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>164</th>\n",
        "      <td> /system/photos/3434/default/Mike_Sullivan.jpg?...</td>\n",
        "      <td>          Sullivan, Michael</td>\n",
        "      <td> Adjunct Faculty</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>166</th>\n",
        "      <td> /system/photos/3006/default/Tamman_-Maurice.gi...</td>\n",
        "      <td>            Tamman, Maurice</td>\n",
        "      <td> Adjunct Faculty</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>169</th>\n",
        "      <td>   /system/photos/3057/default/Dody.jpg?1368463129</td>\n",
        "      <td>            Tsiantar, Dody </td>\n",
        "      <td> Adjunct Faculty</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>172</th>\n",
        "      <td> /system/photos/3188/default/Mike-Ventura.gif?1...</td>\n",
        "      <td>           Ventura, Michael</td>\n",
        "      <td> Adjunct Faculty</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>175</th>\n",
        "      <td>  /system/photos/1261/default/wayne.jpg?1392672262</td>\n",
        "      <td>              Wayne, Leslie</td>\n",
        "      <td> Adjunct Faculty</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>177</th>\n",
        "      <td>  /system/photos/2269/default/weiss.jpg?1392672270</td>\n",
        "      <td>                Weiss, Gary</td>\n",
        "      <td> Adjunct Faculty</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>178</th>\n",
        "      <td>              /system/photos/819/default/welby.jpg</td>\n",
        "      <td>           Welby, Julianne </td>\n",
        "      <td> Adjunct Faculty</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>180</th>\n",
        "      <td> /system/photos/3008/default/Wheatley_-Bill.gif...</td>\n",
        "      <td>     Wheatley, Jr., William</td>\n",
        "      <td> Adjunct Faculty</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>182</th>\n",
        "      <td> /system/photos/3009/default/Williams_-Josh---h...</td>\n",
        "      <td>             Williams, Josh</td>\n",
        "      <td> Adjunct Faculty</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>183</th>\n",
        "      <td>                                              None</td>\n",
        "      <td>               Wilson, Duff</td>\n",
        "      <td> Adjunct Faculty</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>184</th>\n",
        "      <td>                                              None</td>\n",
        "      <td>               Wolk, Joshua</td>\n",
        "      <td> Adjunct Faculty</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>187</th>\n",
        "      <td>                                              None</td>\n",
        "      <td>               Zucker, John</td>\n",
        "      <td> Adjunct Faculty</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>188</th>\n",
        "      <td> /system/photos/3058/default/zuckerman.jpg?1392...</td>\n",
        "      <td> Zuckerman, Jocelyn Craugh </td>\n",
        "      <td> Adjunct Faculty</td>\n",
        "    </tr>\n",
        "  </tbody>\n",
        "</table>\n",
        "<p>104 rows \u00d7 3 columns</p>\n",
        "</div>"
       ],
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 84,
       "text": [
        "                                               img_src  \\\n",
        "1    /system/photos/1943/default/Dolores-Barclay.gi...   \n",
        "2                                                 None   \n",
        "5    /system/photos/2982/default/Bennet_John.gif?13...   \n",
        "6    /system/photos/2984/default/Bennett_Rob.gif?13...   \n",
        "8                                                 None   \n",
        "9    /system/photos/2985/default/Blum_David.gif?136...   \n",
        "11   /system/photos/150/default/Walt-Bogdanich.gif?...   \n",
        "12   /system/photos/3055/default/Lennart-Bourin.jpg...   \n",
        "14    /system/photos/842/default/bruder.jpg?1392672045   \n",
        "15   /system/photos/864/default/burford.jpg?1392672030   \n",
        "16   /system/photos/2986/default/Burleigh_Nina.gif?...   \n",
        "20   /system/photos/2988/default/Charnas_Dan.gif?13...   \n",
        "21   /system/photos/3412/default/Cohen_Julie.jpg?13...   \n",
        "22   /system/photos/3262/default/Lisa-Cohen.gif?137...   \n",
        "23   /system/photos/2989/default/Cohen_Sarah.gif?13...   \n",
        "27   /system/photos/160/default/Unknown-1.jpeg?1378...   \n",
        "31   /system/photos/1265/default/ADepalma.jpg?12912...   \n",
        "32   /system/photos/1973/default/BruceDeSilva.gif?1...   \n",
        "33   /system/photos/3256/default/deitsch_.jpg?13763...   \n",
        "35   /system/photos/3075/default/SDodd_horiz.gif?13...   \n",
        "36   /system/photos/1811/default/donahue.jpg?139267...   \n",
        "37                                                None   \n",
        "40                                                None   \n",
        "41   /system/photos/1977/default/TysonEvans.gif?136...   \n",
        "43     /system/photos/3488/default/ford.jpg?1392672068   \n",
        "44   /system/photos/839/default/frederick.jpg?13926...   \n",
        "46                                                None   \n",
        "47   /system/photos/2991/default/Freeman_John.gif?1...   \n",
        "49   /system/photos/162/default/Stephen_Fried.gif?1...   \n",
        "51                                                None   \n",
        "..                                                 ...   \n",
        "135    /system/photos/851/default/Sacha.jpg?1280952529   \n",
        "136  /system/photos/823/default/RichSchapiro.gif?13...   \n",
        "137                                               None   \n",
        "138  /system/photos/1913/default/BJSchechter.gif?13...   \n",
        "139  /system/photos/3584/default/HilkeSchellmann_fi...   \n",
        "140  /system/photos/3432/default/schoen.jpg?1392672237   \n",
        "141                                               None   \n",
        "143  /system/photos/2076/default/ESchumacher_112811...   \n",
        "145    /system/photos/872/default/Seave.jpg?1280954557   \n",
        "146  /system/photos/849/default/Seidman.gif?1365718701   \n",
        "151  /system/photos/3235/default/Siegel_-Lloyd-2012...   \n",
        "152   /system/photos/871/default/singer.jpg?1392672245   \n",
        "153  /system/photos/1953/default/MariaSliwa.gif?136...   \n",
        "158  /system/photos/2079/default/SSreenavisan_11281...   \n",
        "159  /system/photos/3236/default/Karen-Stabiner.jpg...   \n",
        "162  /system/photos/3523/default/stivers.jpg?139267...   \n",
        "163  /system/photos/3005/default/Subramanian_-Sushm...   \n",
        "164  /system/photos/3434/default/Mike_Sullivan.jpg?...   \n",
        "166  /system/photos/3006/default/Tamman_-Maurice.gi...   \n",
        "169    /system/photos/3057/default/Dody.jpg?1368463129   \n",
        "172  /system/photos/3188/default/Mike-Ventura.gif?1...   \n",
        "175   /system/photos/1261/default/wayne.jpg?1392672262   \n",
        "177   /system/photos/2269/default/weiss.jpg?1392672270   \n",
        "178               /system/photos/819/default/welby.jpg   \n",
        "180  /system/photos/3008/default/Wheatley_-Bill.gif...   \n",
        "182  /system/photos/3009/default/Williams_-Josh---h...   \n",
        "183                                               None   \n",
        "184                                               None   \n",
        "187                                               None   \n",
        "188  /system/photos/3058/default/zuckerman.jpg?1392...   \n",
        "\n",
        "                           name            title  \n",
        "1             Barclay, Dolores   Adjunct Faculty  \n",
        "2               Baum, Geraldine  Adjunct Faculty  \n",
        "5                 Bennet, John   Adjunct Faculty  \n",
        "6                  Bennett, Rob  Adjunct Faculty  \n",
        "8                Blair, Gwenda   Adjunct Faculty  \n",
        "9                  Blum, David   Adjunct Faculty  \n",
        "11             Bogdanich, Walt   Adjunct Faculty  \n",
        "12              Bourin, Lennart  Adjunct Faculty  \n",
        "14              Bruder, Jessica  Adjunct Faculty  \n",
        "15            Burford, Melanie   Adjunct Faculty  \n",
        "16              Burleigh, Nina   Adjunct Faculty  \n",
        "20                Charnas, Dan   Adjunct Faculty  \n",
        "21                 Cohen, Julie  Adjunct Faculty  \n",
        "22               Cohen, Lisa R.  Adjunct Faculty  \n",
        "23                 Cohen, Sarah  Adjunct Faculty  \n",
        "27               Coyne , Kevin   Adjunct Faculty  \n",
        "31             DePalma, Anthony  Adjunct Faculty  \n",
        "32              DeSilva, Bruce   Adjunct Faculty  \n",
        "33             Deitsch, Richard  Adjunct Faculty  \n",
        "35                  Dodd, Scott  Adjunct Faculty  \n",
        "36              Donahue, Kerry   Adjunct Faculty  \n",
        "37           Drew, Christopher   Adjunct Faculty  \n",
        "40              Evans, Farrell   Adjunct Faculty  \n",
        "41               Evans , Tyson   Adjunct Faculty  \n",
        "43    Ford, Constance Mitchell   Adjunct Faculty  \n",
        "44      Frederick, Pamela Platt  Adjunct Faculty  \n",
        "46             Freeman, George   Adjunct Faculty  \n",
        "47                Freeman, John  Adjunct Faculty  \n",
        "49              Fried, Stephen   Adjunct Faculty  \n",
        "51              Gilderman, Greg  Adjunct Faculty  \n",
        "..                          ...              ...  \n",
        "135                 Sacha, Bob   Adjunct Faculty  \n",
        "136              Schapiro, Rich  Adjunct Faculty  \n",
        "137               Schatz, Robin  Adjunct Faculty  \n",
        "138              Schecter, B.J.  Adjunct Faculty  \n",
        "139           Schellmann, Hilke  Adjunct Faculty  \n",
        "140                Schoen, John  Adjunct Faculty  \n",
        "141     Schoonmaker, Mary Ellen  Adjunct Faculty  \n",
        "143       Schumacher-Matos, Ed   Adjunct Faculty  \n",
        "145                 Seave, Ava   Adjunct Faculty  \n",
        "146             Seideman, David  Adjunct Faculty  \n",
        "151               Siegel, Lloyd  Adjunct Faculty  \n",
        "152                Singer, Amy   Adjunct Faculty  \n",
        "153                Sliwa, Maria  Adjunct Faculty  \n",
        "158         Sreenivasan , Sree   Adjunct Faculty  \n",
        "159             Stabiner, Karen  Adjunct Faculty  \n",
        "162              Stivers, Cyndi  Adjunct Faculty  \n",
        "163         Subramanian, Sushma  Adjunct Faculty  \n",
        "164           Sullivan, Michael  Adjunct Faculty  \n",
        "166             Tamman, Maurice  Adjunct Faculty  \n",
        "169             Tsiantar, Dody   Adjunct Faculty  \n",
        "172            Ventura, Michael  Adjunct Faculty  \n",
        "175               Wayne, Leslie  Adjunct Faculty  \n",
        "177                 Weiss, Gary  Adjunct Faculty  \n",
        "178            Welby, Julianne   Adjunct Faculty  \n",
        "180      Wheatley, Jr., William  Adjunct Faculty  \n",
        "182              Williams, Josh  Adjunct Faculty  \n",
        "183                Wilson, Duff  Adjunct Faculty  \n",
        "184                Wolk, Joshua  Adjunct Faculty  \n",
        "187                Zucker, John  Adjunct Faculty  \n",
        "188  Zuckerman, Jocelyn Craugh   Adjunct Faculty  \n",
        "\n",
        "[104 rows x 3 columns]"
       ]
      }
     ],
     "prompt_number": 84
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "##XML\n",
      "\n",
      "XML (eXtensible Markup Language) is a markup language very much like HTML. Here's what XML looks like :\n",
      "\n",
      "    <?xml version=\"1.0\" encoding=\"utf-8\"?>\n",
      "    <feed xmlns=\"http://www.w3.org/2005/Atom\">\n",
      "      <title>Example Feed</title>\n",
      "\t  <subtitle>A subtitle.</subtitle>\n",
      "\t  <link href=\"http://example.org/feed/\" rel=\"self\" />\n",
      "\t  <link href=\"http://example.org/\" />\n",
      "\t  <id>urn:uuid:60a76c80-d399-11d9-b91C-0003939e0af6</id>\n",
      "\t  <updated>2003-12-13T18:30:02Z</updated>\n",
      "      <entry>\n",
      "        <title>Atom-Powered Robots Run Amok</title>\n",
      "        <link href=\"http://example.org/2003/12/13/atom03\" />\n",
      "        <link rel=\"alternate\" type=\"text/html\" href=\"http://example.org/2003/12/13/atom03.html\"/>\n",
      "        <link rel=\"edit\" href=\"http://example.org/2003/12/13/atom03/edit\"/>\n",
      "        <id>urn:uuid:1225c695-cfb8-4ebb-aaaa-80da344efa6a</id>\n",
      "        <updated>2003-12-13T18:30:02Z</updated>\n",
      "        <summary>Some text.</summary>\n",
      "        <content type=\"xhtml\">\n",
      "          <div xmlns=\"http://www.w3.org/1999/xhtml\">\n",
      "             <p>This is the entry content.</p>\n",
      "          </div>\n",
      "        </content>\n",
      "        <author>\n",
      "          <name>John Doe</name>\n",
      "          <email>johndoe@example.com</email>\n",
      "        </author>\n",
      "\t  </entry>\n",
      "    </feed>\n",
      "\n",
      "As you can see, XML looks a lot like HTML: tags, with attributes and contents, exist in a hierarchical relationship with other tags. The main difference is that in XML, there isn't a pre-defined list of \"valid\" tag names---when you create a document, you can use whatever tag and attribute names you want. As you can see in the example above, there are tags called `feed` and `entry` that aren't a part of the HTML standard, but are valid XML.\n",
      "\n",
      "The second important difference between XML and HTML is that in XML, all tags must consist of both an opening tag AND a closing tag. HTML doesn't have this restriction (as we saw with `<img>` tags in the HTML examples above). Also, in general, tools that work with XML are much more strict about syntax than tools that work with HTML. Browsers tend to be very forgiving of errors in HTML, but will immediately reject XML that isn't well-formed.\n",
      "\n",
      "XML documents generally conform to a \"standard\" or \"format,\"---that is, a pre-defined list of tag names and attribute names and rules for which tags can have which attributes and which tags can contain which other tags. For example, the document in the above is in the Atom XML format, [which you can find out more about here](http://en.wikipedia.org/wiki/Atom_(standard)). XML standards also give you some idea of what the document *means*---a consistent mapping between the document's structure and its semantics.\n",
      "\n",
      "In sum: XML documents conform to standards, they must be syntactically valid, and they have agreed-upon semantics. For these reasons, XML documents are considered to be much more friendly for computers to read than HTML documents. \n",
      "\n",
      "> CLEVER PEOPLE NOTE: XML and HTML work similarly enough, and XML documents can have standards, so why not just make an XML standard that defines all of the tags and attributes in HTML, and have the best of both worlds? [It's been tried before](http://en.wikipedia.org/wiki/XHTML), and there are several drawbacks, [enumerated here](http://stackoverflow.com/questions/5558502/is-html5-valid-xml), but mostly having to do with backwards compatibility."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "###Dealing with XML data\n",
      "\n",
      "Now, you *can* parse XML data with Beautiful Soup ([with one important caveat](http://www.crummy.com/software/BeautifulSoup/bs4/doc/#id17)). But one of the benefits of data in XML is that there are many pre-existing libraries for Python that are purpose-built for working with data in whichever XML standard. These libraries will save you the effort of having to figure out how documents in that particular standard are put together.\n",
      "\n",
      "There are [a truly bewildering number of XML standards](http://en.wikipedia.org/wiki/Category:XML-based_standards), each devised for more or less domain-specific tasks. (There is even [a truly bewildering number of XML standards for writing documents that define XML standards](http://en.wikipedia.org/wiki/XML_schema#XML_schema_languages)). Listed below are a few standards of interest to journalists, along with links to Python libraries for dealing with documents using those standards:\n",
      "\n",
      "* [Keyhole Markup Language](http://en.wikipedia.org/wiki/Keyhole_Markup_Language) (KML), used for geographic data: [fastkml](https://pypi.python.org/pypi/fastkml/)\n",
      "* [Scalable Vector Graphics](http://en.wikipedia.org/wiki/Scalable_Vector_Graphics) (SVG), used for images and drawings: [pySVG](http://codeboje.de/pysvg/)\n",
      "* [SOAP](http://en.wikipedia.org/wiki/SOAP_(protocol)), used for some web services: [pysimplesoap](https://code.google.com/p/pysimplesoap/)\n",
      "* [Atom](http://en.wikipedia.org/wiki/Atom_(standard)), a set of standards used for web publishing and services: [feedparser](https://pypi.python.org/pypi/feedparser). (The `feedparser` library also helps to parse all manner of other web syndication formats.)\n",
      "* \n",
      "\n",
      "###An example: RSS feeds\n",
      "\n",
      "One of the first tasks many students set themselves to after learning about web scraping is to scrape the front page of the New York Times. *DON'T DO THIS* if you can avoid it. You're inviting disaster, as the NYTimes is free at any moment to change the way their HTML is structured, and your scraper will break. Instead, try using the New York Times RSS feed!\n",
      "\n",
      "RSS is a format that many websites use to publish their articles in computer-readable formats. (RSS support used to be all the rage back in the Internet days, and fewer sites now support it than used to, and some web sites---like the New York Times---support it but don't advertise that fact.) It's an XML format. Here's a link to the New York Times RSS feed for their front-page articles:\n",
      "\n",
      "http://rss.nytimes.com/services/xml/rss/nyt/HomePage.xml\n",
      "\n",
      "Click on that link, and you'll see a big mess of XML that doesn't make any sense. We're going to use the `feedparser` library mentioned above to parse this RSS and get back a list of all of the article titles. The `feedparser` library essentially takes a big ball of RSS XML and turns it into a Python data structure (to be specific, a list of dictionaries, where each dictionary represents an article in the feed).\n",
      "\n",
      "First, check to see if you have `feedparser` installed."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import feedparser"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 8
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "If you get `ImportError: No module named feedparser`, try running this line (this will work ONLY on your AWS instances):"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "!sudo pip install feedparser"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Password:"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\r\n"
       ]
      }
     ],
     "prompt_number": 7
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Otherwise, you can use your `pip` skills to install feedparser however you'd like.\n",
      "\n",
      "Once you have `feedparser` installed, we can use it to read in a remote RSS file:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import feedparser\n",
      "\n",
      "feed = feedparser.parse(\"http://rss.nytimes.com/services/xml/rss/nyt/HomePage.xml\")\n",
      "print type(feed.entries)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "<type 'list'>\n"
       ]
      }
     ],
     "prompt_number": 15
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The `feedparser.parse` function returns a `feedparser` object, which has an attribute `entries` that is a list of articles in the feed. Let's take a look at one of them:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "feed.entries[0]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 18,
       "text": [
        "{'author': u'By JACK HEALY',\n",
        " 'author_detail': {'name': u'By JACK HEALY'},\n",
        " 'authors': [{}],\n",
        " 'guidislink': False,\n",
        " 'id': u'http://www.nytimes.com/2014/06/24/us/Tom-Tancredos-Colorado-Governor-Puts-His-Party-on-Alert-.html',\n",
        " 'link': u'http://rss.nytimes.com/c/34625/f/640350/s/3bcb5250/sc/7/l/0L0Snytimes0N0C20A140C0A60C240Cus0CTom0ETancredos0EColorado0EGovernor0EPuts0EHis0EParty0Eon0EAlert0E0Bhtml0Dpartner0Frss0Gemc0Frss/story01.htm',\n",
        " 'links': [{'href': u'http://www.nytimes.com/2014/06/24/us/Tom-Tancredos-Colorado-Governor-Puts-His-Party-on-Alert-.html?partner=rss&emc=rss',\n",
        "   'rel': u'standout',\n",
        "   'type': u'text/html'},\n",
        "  {'href': u'http://rss.nytimes.com/c/34625/f/640350/s/3bcb5250/sc/7/l/0L0Snytimes0N0C20A140C0A60C240Cus0CTom0ETancredos0EColorado0EGovernor0EPuts0EHis0EParty0Eon0EAlert0E0Bhtml0Dpartner0Frss0Gemc0Frss/story01.htm',\n",
        "   'rel': u'alternate',\n",
        "   'type': u'text/html'}],\n",
        " 'media_content': [{'height': u'151',\n",
        "   'lang': u'',\n",
        "   'url': u'http://graphics8.nytimes.com/images/2014/06/24/us/TANCREDO1/TANCREDO1-moth.jpg',\n",
        "   'width': u'151'}],\n",
        " 'media_credit': {'scheme': u'urn:ebu'},\n",
        " 'media_description': u'A volunteer campaigned on a bridge in\\xa0Littleton, Colo.',\n",
        " 'published': u'Mon, 23 Jun 2014 17:09:14 GMT',\n",
        " 'published_parsed': time.struct_time(tm_year=2014, tm_mon=6, tm_mday=23, tm_hour=17, tm_min=9, tm_sec=14, tm_wday=0, tm_yday=174, tm_isdst=0),\n",
        " 'summary': u'Some Republicans in Colorado say the views of Mr. Tancredo, a former congressman, could energize the state\\u2019s Democrats while alienating moderate Republicans and unaffiliated voters.<img border=\"0\" height=\"1\" src=\"http://rss.nytimes.com/c/34625/f/640350/s/3bcb5250/sc/7/mf.gif\" width=\"1\" /><br clear=\"all\" /><br /><br /><a href=\"http://da.feedsportal.com/r/199108753875/u/57/f/640350/c/34625/s/3bcb5250/sc/7/rc/1/rc.htm\" rel=\"nofollow\"><img border=\"0\" src=\"http://da.feedsportal.com/r/199108753875/u/57/f/640350/c/34625/s/3bcb5250/sc/7/rc/1/rc.img\" /></a><br /><a href=\"http://da.feedsportal.com/r/199108753875/u/57/f/640350/c/34625/s/3bcb5250/sc/7/rc/2/rc.htm\" rel=\"nofollow\"><img border=\"0\" src=\"http://da.feedsportal.com/r/199108753875/u/57/f/640350/c/34625/s/3bcb5250/sc/7/rc/2/rc.img\" /></a><br /><a href=\"http://da.feedsportal.com/r/199108753875/u/57/f/640350/c/34625/s/3bcb5250/sc/7/rc/3/rc.htm\" rel=\"nofollow\"><img border=\"0\" src=\"http://da.feedsportal.com/r/199108753875/u/57/f/640350/c/34625/s/3bcb5250/sc/7/rc/3/rc.img\" /></a><br /><br /><a href=\"http://da.feedsportal.com/r/199108753875/u/57/f/640350/c/34625/s/3bcb5250/sc/7/a2.htm\"><img border=\"0\" src=\"http://da.feedsportal.com/r/199108753875/u/57/f/640350/c/34625/s/3bcb5250/sc/7/a2.img\" /></a><img border=\"0\" height=\"1\" src=\"http://pi.feedsportal.com/r/199108753875/u/57/f/640350/c/34625/s/3bcb5250/sc/7/a2t.img\" width=\"1\" />',\n",
        " 'summary_detail': {'base': u'http://rss.nytimes.com/services/xml/rss/nyt/HomePage.xml',\n",
        "  'language': None,\n",
        "  'type': u'text/html',\n",
        "  'value': u'Some Republicans in Colorado say the views of Mr. Tancredo, a former congressman, could energize the state\\u2019s Democrats while alienating moderate Republicans and unaffiliated voters.<img border=\"0\" height=\"1\" src=\"http://rss.nytimes.com/c/34625/f/640350/s/3bcb5250/sc/7/mf.gif\" width=\"1\" /><br clear=\"all\" /><br /><br /><a href=\"http://da.feedsportal.com/r/199108753875/u/57/f/640350/c/34625/s/3bcb5250/sc/7/rc/1/rc.htm\" rel=\"nofollow\"><img border=\"0\" src=\"http://da.feedsportal.com/r/199108753875/u/57/f/640350/c/34625/s/3bcb5250/sc/7/rc/1/rc.img\" /></a><br /><a href=\"http://da.feedsportal.com/r/199108753875/u/57/f/640350/c/34625/s/3bcb5250/sc/7/rc/2/rc.htm\" rel=\"nofollow\"><img border=\"0\" src=\"http://da.feedsportal.com/r/199108753875/u/57/f/640350/c/34625/s/3bcb5250/sc/7/rc/2/rc.img\" /></a><br /><a href=\"http://da.feedsportal.com/r/199108753875/u/57/f/640350/c/34625/s/3bcb5250/sc/7/rc/3/rc.htm\" rel=\"nofollow\"><img border=\"0\" src=\"http://da.feedsportal.com/r/199108753875/u/57/f/640350/c/34625/s/3bcb5250/sc/7/rc/3/rc.img\" /></a><br /><br /><a href=\"http://da.feedsportal.com/r/199108753875/u/57/f/640350/c/34625/s/3bcb5250/sc/7/a2.htm\"><img border=\"0\" src=\"http://da.feedsportal.com/r/199108753875/u/57/f/640350/c/34625/s/3bcb5250/sc/7/a2.img\" /></a><img border=\"0\" height=\"1\" src=\"http://pi.feedsportal.com/r/199108753875/u/57/f/640350/c/34625/s/3bcb5250/sc/7/a2t.img\" width=\"1\" />'},\n",
        " 'tags': [{'label': None,\n",
        "   'scheme': u'http://www.nytimes.com/namespaces/keywords/nyt_geo',\n",
        "   'term': u'Colorado'},\n",
        "  {'label': None,\n",
        "   'scheme': u'http://www.nytimes.com/namespaces/nyt_org_all',\n",
        "   'term': u'Facebook Inc|FB|NASDAQ'},\n",
        "  {'label': None,\n",
        "   'scheme': u'http://www.nytimes.com/namespaces/keywords/nyt_per',\n",
        "   'term': u'Tancredo, Tom'},\n",
        "  {'label': None,\n",
        "   'scheme': u'http://www.nytimes.com/namespaces/nyt_org_all',\n",
        "   'term': u'Harley-Davidson Inc|HOG|NYSE'},\n",
        "  {'label': None,\n",
        "   'scheme': u'http://www.nytimes.com/namespaces/keywords/des',\n",
        "   'term': u'Midterm Elections (2014)'},\n",
        "  {'label': None,\n",
        "   'scheme': u'http://www.nytimes.com/namespaces/keywords/nyt_org_all',\n",
        "   'term': u'Republican Party'}],\n",
        " 'title': u'Tom Tancredo\\u2019s Bid for Colorado Governor Puts His Party on Alert',\n",
        " 'title_detail': {'base': u'http://rss.nytimes.com/services/xml/rss/nyt/HomePage.xml',\n",
        "  'language': None,\n",
        "  'type': u'text/plain',\n",
        "  'value': u'Tom Tancredo\\u2019s Bid for Colorado Governor Puts His Party on Alert'}}"
       ]
      }
     ],
     "prompt_number": 18
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Okay, cool. Looking over this data structure, it looks like we have a dictionary, and the thing we want---the title of the article---is the value for the `title` key. Let's make a list comprehension to pull them out:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "[article['title'] for article in feed.entries]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 20,
       "text": [
        "[u'Tom Tancredo\\u2019s Bid for Colorado Governor Puts His Party on Alert',\n",
        " u'Justices, With Limits, Let E.P.A. Curb Power-Plant Gases',\n",
        " u'Kerry Says ISIS Threat Could Hasten Military Action',\n",
        " u'Top Afghan Election Official Resigns Amid Candidate\\u2019s Claims of Vote Fraud',\n",
        " u'Here Lies Progress: Asian Actors Fill the Playbill',\n",
        " u'Justice Department Found It Lawful to Target Anwar al-Awlaki',\n",
        " u'The Global Game: Drawing Lots at World Cup? There Must Be a Better Way',\n",
        " u'Top Investigator Has Blistering Criticism for V.A. Response to Whistle-Blowers',\n",
        " u'ArtsBeat: Annie Missing? No Worries, Dick Tracy Is on the Case',\n",
        " u'Last of Syria\\u2019s Declared Chemical Arms Shipped Abroad',\n",
        " u'City Room: New York Today: Fire in the Dark',\n",
        " u'Afghan Official Quits in Bid to End Crisis',\n",
        " u'Report: Pennsylvania Governor Did Not Deliberately Delay Sandusky Case',\n",
        " u'Steve Rossi, Singer Who Found Fame in Comedy Duo, Dies at 82',\n",
        " u'Sunni Militants Seize Crossing on Iraq-Jordan Border',\n",
        " u'ArtsBeat: \\u2018True Blood\\u2019 Recap: Back to the Beginning',\n",
        " u'New Search Plan for Flight 370 Is Based on Farther, Controlled Flying',\n",
        " u'Vice Has Many Media Giants Salivating, but Its Terms Will Be Rich',\n",
        " u'Egyptian Court Convicts 3 Al Jazeera Journalists',\n",
        " u'Egyptian Court Convicts 3 Al Jazeera Journalists',\n",
        " u'Baptism by Fire: A New York Firefighter Confronts His First Test',\n",
        " u'Soldier Accused of Killing 5 Is Captured in South Korea',\n",
        " u'DealBook: An Employee Dies, and the Company Collects the Insurance',\n",
        " u'A Survey Says: Poll Shows No Consensus in U.S. for Helping in Iraq',\n",
        " u'Netherlands and Chile Will Fight to Win Group B']"
       ]
      }
     ],
     "prompt_number": 20
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "##Conclusion\n",
      "\n",
      "By the end of this tutorial, you should feel confident in your ability to extract information from HTML and XML documents. There are a lot of subtleties we didn't go over, but you're well on your way! Here are some further links to aid in your exploration.\n",
      "\n",
      "* [A Gentle Introduction to XML](http://www.tei-c.org/release/doc/tei-p5-doc/en/html/SG.html), from [TEI](http://www.tei-c.org/index.xml).\n",
      "* [Intro to Beautiful Soup](http://programminghistorian.org/lessons/intro-to-beautiful-soup)"
     ]
    }
   ],
   "metadata": {}
  }
 ]
}