{
"metadata": {
"name": "",
"signature": "sha256:32f3e29e6aeeeec29ae4f62fe4257393d79d9767a6153e319a2d9a4b23348e90"
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#Getting data from markup languages\n",
"\n",
"So far we've discussed a number of sources for data: CSV files, web APIs, and unstructured text. There's a lot of data on the internet locked up in one of two \"markup\" languages: XML and HTML. Our goal today is to discuss and put into practice a few methods for extracting data from documents written in these languages.\n",
"\n",
"##HTML\n",
"\n",
"HTML stands for \"hypertext markup language.\" Most of the documents you see when you're browsing the web are written in this format. In most browsers, there's a \"View Source\" option that allows you to see the HTML source code for any page you're looking at. For example, in Chrome, you can CTRL-click anywhere on the page, or go to `View > Developer > View Source`:\n",
"\n",
"\n",
"\n",
"You'll see something that looks like this, a mish-mash of angle brackets and quotes and slashes and text. This is HTML.\n",
"\n",
"
\n",
"\n",
"###What HTML looks like\n",
"\n",
"HTML consists of a series of *tags*. Tags have a *name*, a series of key/value pairs called *attributes*, and some textual *content*. Attributes are optional. Here's a simple example, using the HTML `
` tag (`p` means \"paragraph\"):\n", "\n", "
Mother said there'd be days like these.
\n", " \n", "This example has just one tag in it: a `` tag. The source code for a tag has two parts, its opening tag (`
`) and its closing tag (`
`). In between the opening and closing tag, you see the tag's contents (in this case, the text `Mother said there'd be days like these.`).\n", "\n", "Here's another example, using the HTML `A soft cheese made in the Camembert region of France.
\n", "\n", "A yellow cheese made in the Cheddar region of... France, probably, idk whatevs.
\n", "\"\"\"" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 13 }, { "cell_type": "markdown", "metadata": {}, "source": [ "If our task was to create a dictionary that maps the name of the cheese to the description that follows in the `` tag directly afterward, we'd be out of luck. Fortunately, Beautiful Soup has a `.find_next_sibling()` method, which allows us to search for the next tag that is a *sibling* of the tag you're calling it on (i.e., the two tags share a parent), that also matches particular criteria. So, for example, to accomplish the task outlined above:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"document = BeautifulSoup(cheese_html)\n",
"cheese_dict = {}\n",
"for h2_tag in document.find_all('h2'):\n",
" cheese_name = h2_tag.string\n",
" cheese_desc_tag = h2_tag.find_next_sibling('p')\n",
" cheese_dict[cheese_name] = cheese_desc_tag.string\n",
"\n",
"cheese_dict"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 14,
"text": [
"{u'Camembert': u'A soft cheese made in the Camembert region of France.',\n",
" u'Cheddar': u'A yellow cheese made in the Cheddar region of... France, probably, idk whatevs.'}"
]
}
],
"prompt_number": 14
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You now know most of what you need to know to scrape web pages effectively. Good job!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"###When things go wrong with Beautiful Soup\n",
"\n",
"A number of things might go wrong with Beautiful Soup. You might, for example, search for a tag that doesn't exist in the document:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"footer_tag = document.find(\"footer\")"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 15
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Beautiful Soup doesn't return an error if it can't find the tag you want. Instead, it returns `None`:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"print footer_tag"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"None\n"
]
}
],
"prompt_number": 16
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you try to call a method on the object that Beautiful Soup returned anyway, you might end up with an error like this:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"footer_tag.find(\"p\")"
],
"language": "python",
"metadata": {},
"outputs": [
{
"ename": "AttributeError",
"evalue": "'NoneType' object has no attribute 'find'",
"output_type": "pyerr",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m\n\u001b[0;31mAttributeError\u001b[0m Traceback (most recent call last)",
"\u001b[0;32m ` tag with `class` attribute `description`.\n",
"\n",
"Let's write some code to do that!"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"html_str = urllib.urlopen(\"http://www.journalism.columbia.edu/page/10/10?category_ids%5B%5D=2&category_ids%5B%5D=3&category_ids%5B%5D=37\").read()\n",
"document = BeautifulSoup(html_str)\n",
"faculty_list = []\n",
"for faculty_tag in document.find_all('li'):\n",
" # create empty dictionary to store this faculty member\n",
" faculty_dict = {}\n",
" # faculty name\n",
" h4_tag = faculty_tag.find('h4')\n",
" a_tag = h4_tag.find('a')\n",
" faculty_dict['name'] = a_tag.string\n",
" # image URL\n",
" img_tag = faculty_tag.find('img')\n",
" faculty_dict['img_src'] = img_tag['src']\n",
" # title\n",
" p_tag = faculty_tag.find('p', attrs={'class': 'description'})\n",
" faculty_dict['title'] = p_tag.string\n",
" # append to list\n",
" faculty_list.append(faculty_dict)\n",
"faculty_list"
],
"language": "python",
"metadata": {},
"outputs": [
{
"ename": "AttributeError",
"evalue": "'NoneType' object has no attribute 'find'",
"output_type": "pyerr",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m\n\u001b[0;31mAttributeError\u001b[0m Traceback (most recent call last)",
"\u001b[0;32m Digital Media Coordinator Assistant Professor of Broadcast Journalism Expertise: BROADCAST Adjunct Faculty 83 rows \u00d7 3 columns This is the entry content.\n",
"\n",
"Based on what I'm seeing here, I can start to formulate a plan to scrape the document. Here's what I came up with:\n",
"\n",
"* It looks like each faculty member has an `
` tag---specifically, I need to grab the `src` attribute from that tag.\n",
"* The faculty member's name is inside an `` tag---specifically, an `` tag inside of an `
` tag.\n",
"* The faculty member's title seems to be located inside a `
\n",
"\n",
"Now it *looks* like all of the relevant `
`. So what we need to do is find not *all `
` tag*. Here's some revised code to do just that:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"experts_ul_tag = document.find('ul', attrs={'class': 'experts-list'})\n",
"for faculty_tag in experts_ul_tag.find_all('li')[:5]:\n",
" print faculty_tag"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\n",
"
Adkison, Abbey
\n",
"\n",
"
Alarc\u00f3n, Daniel
\n",
"\n",
"
Barclay, Dolores
\n",
"Adkison, Abbey
\n",
"Alarc\u00f3n, Daniel
\n",
"Barclay, Dolores
\n"
]
}
],
"prompt_number": 26
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Okay, now we're in business. At last. Let's put this code together with the previous example."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"html_str = urllib.urlopen(\"http://www.journalism.columbia.edu/page/10/10?category_ids%5B%5D=2&category_ids%5B%5D=3&category_ids%5B%5D=37\").read()\n",
"document = BeautifulSoup(html_str)\n",
"faculty_list = []\n",
"experts_ul_tag = document.find('ul', attrs={'class': 'experts-list'})\n",
"for faculty_tag in experts_ul_tag.find_all('li'):\n",
" # create empty dictionary to store this faculty member\n",
" faculty_dict = {}\n",
" # faculty name\n",
" h4_tag = faculty_tag.find('h4')\n",
" if h4_tag is None:\n",
" continue\n",
" a_tag = h4_tag.find('a')\n",
" faculty_dict['name'] = a_tag.string\n",
" # image URL\n",
" img_tag = faculty_tag.find('img')\n",
" faculty_dict['img_src'] = img_tag['src']\n",
" # title\n",
" p_tag = faculty_tag.find('p', attrs={'class': 'description'})\n",
" faculty_dict['title'] = p_tag.string\n",
" # append to list\n",
" faculty_list.append(faculty_dict)\n",
"faculty_list"
],
"language": "python",
"metadata": {},
"outputs": [
{
"ename": "TypeError",
"evalue": "'NoneType' object has no attribute '__getitem__'",
"output_type": "pyerr",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m\n\u001b[0;31mTypeError\u001b[0m Traceback (most recent call last)",
"\u001b[0;32m` tag in their `
` tag is present, and only then will we attemt to get its `src` attribute:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"html_str = urllib.urlopen(\"http://www.journalism.columbia.edu/page/10/10?category_ids%5B%5D=2&category_ids%5B%5D=3&category_ids%5B%5D=37\").read()\n",
"document = BeautifulSoup(html_str)\n",
"faculty_list = []\n",
"experts_ul_tag = document.find('ul', attrs={'class': 'experts-list'})\n",
"for faculty_tag in experts_ul_tag.find_all('li'):\n",
" # create empty dictionary to store this faculty member\n",
" faculty_dict = {}\n",
" # faculty name\n",
" h4_tag = faculty_tag.find('h4')\n",
" if h4_tag is None:\n",
" continue\n",
" a_tag = h4_tag.find('a')\n",
" faculty_dict['name'] = a_tag.string\n",
" # image URL: if
tag found, grab its src. if not, use None\n",
" img_tag = faculty_tag.find('img')\n",
" if img_tag is None:\n",
" faculty_dict['img_src'] = None\n",
" else:\n",
" faculty_dict['img_src'] = img_tag['src']\n",
" # title\n",
" p_tag = faculty_tag.find('p', attrs={'class': 'description'})\n",
" faculty_dict['title'] = p_tag.string\n",
" # append to list\n",
" faculty_list.append(faculty_dict)\n",
"faculty_list"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 28,
"text": [
"[{'img_src': None,\n",
" 'name': u'Adkison, Abbey ',\n",
" 'title': u'Digital Media Coordinator'},\n",
" {'img_src': u'/system/photos/3771/default/daniel_a.jpg?1408652577',\n",
" 'name': u'Alarc\\xf3n, Daniel',\n",
" 'title': u'Assistant Professor of Broadcast Journalism'},\n",
" {'img_src': u'/system/photos/1943/default/Dolores-Barclay.gif?1365711292',\n",
" 'name': u'Barclay, Dolores ',\n",
" 'title': u'Adjunct Faculty'},\n",
" {'img_src': None, 'name': u'Baum, Geraldine', 'title': u'Adjunct Faculty'},\n",
" {'img_src': u'/system/photos/2056/default/EBell_112811.jpg?1322508884',\n",
" 'name': u'Bell, Emily',\n",
" 'title': u'Professor of Professional Practice & Director, Tow Center for Digital Journalism'},\n",
" {'img_src': u'/system/photos/2057/default/HBenedict_112811.jpg?1322509591',\n",
" 'name': u'Benedict, Helen ',\n",
" 'title': u'Professor'},\n",
" {'img_src': u'/system/photos/2982/default/Bennet_John.gif?1365697019',\n",
" 'name': u'Bennet, John ',\n",
" 'title': u'Adjunct Faculty'},\n",
" {'img_src': u'/system/photos/2984/default/Bennett_Rob.gif?1365706134',\n",
" 'name': u'Bennett, Rob',\n",
" 'title': u'Adjunct Faculty'},\n",
" {'img_src': u'/system/photos/2725/default/Nina-Berman.gif?1365711635',\n",
" 'name': u'Berman, Nina',\n",
" 'title': u'Associate Professor'},\n",
" {'img_src': None, 'name': u'Blair, Gwenda ', 'title': u'Adjunct Faculty'},\n",
" {'img_src': u'/system/photos/2985/default/Blum_David.gif?1365706164',\n",
" 'name': u'Blum, David ',\n",
" 'title': u'Adjunct Faculty'},\n",
" {'img_src': u'/system/photos/2692/default/GeorgeBodarky.gif?1365714064',\n",
" 'name': u'Bodarky, George',\n",
" 'title': u'Adjunct Assistant Professor '},\n",
" {'img_src': u'/system/photos/150/default/Walt-Bogdanich.gif?1365714085',\n",
" 'name': u'Bogdanich, Walt ',\n",
" 'title': u'Adjunct Faculty'},\n",
" {'img_src': u'/system/photos/3055/default/Lennart-Bourin.jpg?1368456160',\n",
" 'name': u'Bourin, Lennart',\n",
" 'title': u'Adjunct Faculty'},\n",
" {'img_src': None, 'name': u'Bradley, Theresa', 'title': u'Adjunct Faculty'},\n",
" {'img_src': u'/system/photos/1912/default/CurtisBrainard.gif?1365714245',\n",
" 'name': u'Brainard, Curtis ',\n",
" 'title': u'Staff Writer'},\n",
" {'img_src': u'/system/photos/842/default/bruder.jpg?1392672045',\n",
" 'name': u'Bruder, Jessica',\n",
" 'title': u'Adjunct Faculty'},\n",
" {'img_src': u'/system/photos/864/default/burford.jpg?1392672030',\n",
" 'name': u'Burford, Melanie ',\n",
" 'title': u'Adjunct Faculty'},\n",
" {'img_src': u'/system/photos/2986/default/Burleigh_Nina.gif?1365706179',\n",
" 'name': u'Burleigh, Nina ',\n",
" 'title': u'Adjunct Faculty'},\n",
" {'img_src': u'/system/photos/2713/default/HeatherCabot.gif?1365714437',\n",
" 'name': u'Cabot, Heather',\n",
" 'title': u'Adjunct Professor'},\n",
" {'img_src': u'/system/photos/3203/default/Elena.gif?1375217143',\n",
" 'name': u'Cabral, Elena ',\n",
" 'title': u'Adjunct Faculty & Assistant Director, Student Services'},\n",
" {'img_src': u'/system/photos/2987/default/Canipe_Chris.gif?1365706198',\n",
" 'name': u'Canipe, Chris',\n",
" 'title': None},\n",
" {'img_src': u'/system/photos/3412/default/Cohen_Julie.jpg?1384536342',\n",
" 'name': u'Cohen, Julie',\n",
" 'title': u'Adjunct Faculty'},\n",
" {'img_src': u'/system/photos/3262/default/LIsaRCohenPic.jpg?1425311022',\n",
" 'name': u'Cohen, Lisa R.',\n",
" 'title': u'Adjunct Associate Professor; Director, Professional Prizes'},\n",
" {'img_src': u'/system/photos/2989/default/Cohen_Sarah.gif?1365706228',\n",
" 'name': u'Cohen, Sarah',\n",
" 'title': u'Adjunct Faculty'},\n",
" {'img_src': u'/system/photos/3278/default/Coll-web.gif?1377281425',\n",
" 'name': u'Coll, Steve',\n",
" 'title': u'Dean & Henry R. Luce Professor of Journalism'},\n",
" {'img_src': u'/system/photos/147/default/AnnCooper2.jpg?1276009818',\n",
" 'name': u'Cooper, Ann',\n",
" 'title': u'CBS Professor of Professional Practice in International Journalism'},\n",
" {'img_src': u'/system/photos/2990/default/Coronel_Sheila.gif?1365706241',\n",
" 'name': u'Coronel, Sheila ',\n",
" 'title': u'Toni Stabile Professor of Professional Practice in Investigative Journalism; Director, Toni Stabile Center for Investigative Journalism, and Dean of Academic Affairs'},\n",
" {'img_src': u'/system/photos/160/default/Unknown-1.jpeg?1378227266',\n",
" 'name': u'Coyne , Kevin ',\n",
" 'title': u'Adjunct Faculty'},\n",
" {'img_src': u'/system/photos/2059/default/JCross_112811.jpg?1322510850',\n",
" 'name': u'Cross, June ',\n",
" 'title': u'Professor '},\n",
" {'img_src': u'/system/photos/861/default/Brent-Cunningham.gif?1365714937',\n",
" 'name': u'Cunningham, Brent ',\n",
" 'title': u'Deputy Editor'},\n",
" {'img_src': u'/system/photos/1265/default/ADepalma.jpg?1291223442',\n",
" 'name': u'DePalma, Anthony',\n",
" 'title': u'Adjunct Faculty'},\n",
" {'img_src': u'/system/photos/3256/default/deitsch_.jpg?1376325514',\n",
" 'name': u'Deitsch, Richard',\n",
" 'title': u'Adjunct Faculty'},\n",
" {'img_src': u'/system/photos/3998/default/IMG_1046.JPG?1416323809',\n",
" 'name': u'Diamond, Becky',\n",
" 'title': None},\n",
" {'img_src': u'/system/photos/2060/default/JDinges_112811.jpg?1322511090',\n",
" 'name': u'Dinges, John',\n",
" 'title': u'Godfrey Lowell Cabot Professor of Journalism'},\n",
" {'img_src': u'/system/photos/1811/default/Donahue03.jpg?1413472417',\n",
" 'name': u'Donahue, Kerry ',\n",
" 'title': u'Adjunct Faculty & Director, Radio Program'},\n",
" {'img_src': None, 'name': u'Drew, Christopher ', 'title': u'Adjunct Faculty'},\n",
" {'img_src': u'/system/photos/936/default/edsall.jpg?1304373008',\n",
" 'name': u'Edsall, Thomas B. ',\n",
" 'title': None},\n",
" {'img_src': u'/system/photos/4036/default/Cheryl_Einhorn.jpg?1417630376',\n",
" 'name': u'Einhorn, Cheryl',\n",
" 'title': u'Adjunct Faculty'},\n",
" {'img_src': u'/system/photos/3746/default/Justin_Elliott.png?1407529630',\n",
" 'name': u'Elliott, Justin ',\n",
" 'title': u'Adjunct Assistant Professor'},\n",
" {'img_src': u'/system/photos/882/default/Epstein.jpg?1280954937',\n",
" 'name': u'Epstein, Randi Hutter ',\n",
" 'title': u'Adjunct Faculty '},\n",
" {'img_src': None, 'name': u'Evans, Farrell ', 'title': u'Adjunct Faculty'},\n",
" {'img_src': u'/system/photos/3488/default/ford.jpg?1392672068',\n",
" 'name': u'Ford, Constance Mitchell ',\n",
" 'title': u'Adjunct Faculty'},\n",
" {'img_src': u'/system/photos/2061/default/SFreedman_112811.jpg?1322511767',\n",
" 'name': u'Freedman, Samuel ',\n",
" 'title': u'Professor'},\n",
" {'img_src': u'/system/photos/645/default/Freeman.jpg?1279731376',\n",
" 'name': u'Freeman, George ',\n",
" 'title': u'Adjunct Faculty'},\n",
" {'img_src': u'/system/photos/93/default/HFrench.jpg?1291237303',\n",
" 'name': u'French, Howard ',\n",
" 'title': u'Associate Professor'},\n",
" {'img_src': u'/system/photos/162/default/Stephen_Fried.gif?1365716551',\n",
" 'name': u'Fried, Stephen ',\n",
" 'title': u'Adjunct Faculty'},\n",
" {'img_src': u'/system/photos/3489/default/mario2015b.jpg?1420472366',\n",
" 'name': u'Garcia, Mario',\n",
" 'title': u'Senior Adviser on News Design/Adjunct Professor'},\n",
" {'img_src': u'/system/photos/3119/default/Vanessa.gif?1371498827',\n",
" 'name': u'Gezari, Vanessa',\n",
" 'title': None},\n",
" {'img_src': None, 'name': u'Gilderman, Greg', 'title': u'Adjunct Faculty'},\n",
" {'img_src': u'/system/photos/113/default/TGitlin.jpg?1291237356',\n",
" 'name': u'Gitlin, Todd',\n",
" 'title': u'Professor & Chair, Ph.D. Program'},\n",
" {'img_src': None, 'name': u'Giudice, Barbara ', 'title': u'Adjunct Faculty'},\n",
" {'img_src': u'/system/photos/836/default/MartyGoldensohn.gif?1365716789',\n",
" 'name': u'Goldensohn, Marty',\n",
" 'title': u'Adjunct Faculty'},\n",
" {'img_src': u'/system/photos/86/default/AGoldman.jpg?1291237401',\n",
" 'name': u'Goldman, Ari ',\n",
" 'title': u'Professor'},\n",
" {'img_src': None, 'name': u'Goldstein, Jacob', 'title': u'Adjunct Professor'},\n",
" {'img_src': u'/system/photos/88/default/WGrueskin.jpg?1291236298',\n",
" 'name': u'Grueskin, Bill',\n",
" 'title': u'Professor of Professional Practice '},\n",
" {'img_src': u'/system/photos/1512/default/AHaburchak.gif?1365716878',\n",
" 'name': u'Haburchak, Alan',\n",
" 'title': u'Adjunct Faculty'},\n",
" {'img_src': u'/system/photos/2992/default/Hajdu_David.gif?1365706270',\n",
" 'name': u'Hajdu, David ',\n",
" 'title': u'Associate Professor '},\n",
" {'img_src': u'/system/photos/97/default/LynNell-faculty.jpg?1368455738',\n",
" 'name': u'Hancock, LynNell',\n",
" 'title': u'H. Gordon Garbedian Professor of Journalism & Director, Spencer Fellowship Program'},\n",
" {'img_src': u'/system/photos/3545/default/hansen.jpg?1392670367',\n",
" 'name': u'Hansen, Mark',\n",
" 'title': u'Director, David and Helen Gurley Brown Institute for Media Innovation & Professor of Journalism '},\n",
" {'img_src': None, 'name': u'Harris, Mark', 'title': u'Adjunct Faculty'},\n",
" {'img_src': u'/system/photos/3210/default/Julie.gif?1375217224',\n",
" 'name': u'Hartenstein, Julie',\n",
" 'title': u'Associate Dean'},\n",
" {'img_src': u'/system/photos/1530/default/LarryHeinzerling.gif?1365717071',\n",
" 'name': u'Heinzerling, Larry',\n",
" 'title': u'Adjunct Faculty'},\n",
" {'img_src': u'/system/photos/154/default/TomHerman.gif?1365718880',\n",
" 'name': u'Herman, Tom ',\n",
" 'title': u'Adjunct Faculty'},\n",
" {'img_src': None, 'name': u'Hickey, Neil ', 'title': u'Adjunct Faculty'},\n",
" {'img_src': u'/system/photos/3059/default/Lars-head-shot-v2.jpg?1368463497',\n",
" 'name': u'Hoel, Lars ',\n",
" 'title': None},\n",
" {'img_src': u'/system/photos/3529/default/hogan.jpg?1392672092',\n",
" 'name': u'Hogan, Pamela',\n",
" 'title': u'Adjunct Faculty'},\n",
" {'img_src': u'/system/photos/155/default/Marguerite_Holloway2.jpg?1276019659',\n",
" 'name': u'Holloway, Marguerite ',\n",
" 'title': u'Associate Professor & Director, Science and Environmental Journalism'},\n",
" {'img_src': u'/system/photos/2994/default/Hoyt_-Mike.gif?1365706301',\n",
" 'name': u'Hoyt, Michael ',\n",
" 'title': u'Adjunct Faculty'},\n",
" {'img_src': u'/system/photos/101/default/RJohn.jpg?1291236478',\n",
" 'name': u'John, Richard R. ',\n",
" 'title': u'Professor of History and Communications'},\n",
" {'img_src': None,\n",
" 'name': u'Jones, Matthew L. ',\n",
" 'title': u'Instructor, The Lede Program'},\n",
" {'img_src': None, 'name': u'Kann, Peter R. ', 'title': u'Adjunct Faculty'},\n",
" {'img_src': u'/system/photos/2995/default/Karle_-Stuart.gif?1365706445',\n",
" 'name': u'Karle, Stuart',\n",
" 'title': u'Adjunct Faculty; William J. Brennan Jr. Visiting Professor of First Amendment Issues'},\n",
" {'img_src': u'/system/photos/858/default/RickKArr.gif?1365717241',\n",
" 'name': u'Karr, Rick',\n",
" 'title': u'Adjunct Faculty'},\n",
" {'img_src': None, 'name': u'Kellogg, David', 'title': u'Adjunct Faculty'},\n",
" {'img_src': u'/system/photos/158/default/ThomasKent.gif?1365717319',\n",
" 'name': u'Kent, Thomas ',\n",
" 'title': u'Adjunct Faculty'},\n",
" {'img_src': u'/system/photos/1259/default/DKlatell.jpg?1291217138',\n",
" 'name': u'Klatell, David',\n",
" 'title': u'Professor of Professional Practice & Chair, International Studies'},\n",
" {'img_src': None, 'name': u'Klein, Adam', 'title': u'Adjunct Professor'},\n",
" {'img_src': u'/system/photos/3056/default/Kim-Kleman-1.jpg?1368463019',\n",
" 'name': u'Kleman, Kim ',\n",
" 'title': u'Adjunct Faculty'},\n",
" {'img_src': u'/system/photos/832/default/knee.jpg?1392672119',\n",
" 'name': u'Knee, Jonathan',\n",
" 'title': u'Adjunct Professor'},\n",
" {'img_src': None, 'name': u'Konner, Joan', 'title': u'Dean Emerita'},\n",
" {'img_src': u'/system/photos/4171/default/Matt_Kozar.jpg?1425569363',\n",
" 'name': u'Kozar, Matt',\n",
" 'title': u'Adjunct Faculty'},\n",
" {'img_src': u'/system/photos/156/default/haupt.jpg?1392672140',\n",
" 'name': u'Lehmann-Haupt, Christopher ',\n",
" 'title': u'Adjunct Faculty'},\n",
" {'img_src': u'/system/photos/2065/default/NLemann_112811.jpg?1322513292',\n",
" 'name': u'Lemann, Nicholas',\n",
" 'title': u'Joseph Pulitzer II and Edith Pulitzer Moore Professor of Journalism; Dean Emeritus'},\n",
" {'img_src': None, 'name': u'Levenson, Jacob ', 'title': u'Adjunct Faculty'},\n",
" {'img_src': u'/system/photos/157/default/SethLipsky.gif?1365717822',\n",
" 'name': u'Lipsky, Seth ',\n",
" 'title': u'Adjunct Faculty'},\n",
" {'img_src': None, 'name': u'Lombardi, Kristen', 'title': u'Adjunct Faculty'},\n",
" {'img_src': u'/system/photos/1391/default/TamiLuhby.gif?1365717895',\n",
" 'name': u'Luhby, Tami',\n",
" 'title': u'Adjunct Faculty'},\n",
" {'img_src': u'/system/photos/3263/default/Tony-Maciulius.gif?1376424088',\n",
" 'name': u'Maciulis, Tony',\n",
" 'title': u'Adjunct Faculty'},\n",
" {'img_src': u'/system/photos/2996/default/Maharidge_-Dale.gif?1365706460',\n",
" 'name': u'Maharidge, Dale ',\n",
" 'title': u'Professor '},\n",
" {'img_src': u'/system/photos/2690/default/TomMason.gif?1365718118',\n",
" 'name': u'Mason, Tom',\n",
" 'title': None},\n",
" {'img_src': u'/system/photos/2997/default/Matloff_-Judith.gif?1365706478',\n",
" 'name': u'Matloff, Judith ',\n",
" 'title': u'Adjunct faculty'},\n",
" {'img_src': None, 'name': u'Maytal, Itai', 'title': u'Adjunct Faculty'},\n",
" {'img_src': None, 'name': u'McCormick, David ', 'title': u'Adjunct Faculty'},\n",
" {'img_src': None, 'name': u'McCray, Melvin', 'title': u'Adjunct Faculty'},\n",
" {'img_src': u'/system/photos/2998/default/McDonald_-Erica.gif?1365706494',\n",
" 'name': u'McDonald, Erica',\n",
" 'title': None},\n",
" {'img_src': u'/system/photos/2068/default/SMcGregor_112811.jpg?1322514173',\n",
" 'name': u'McGregor, Susan E.',\n",
" 'title': u'Assistant Professor & Assistant Director, Tow Center for Digital Journalism'},\n",
" {'img_src': None, 'name': u'Mencher, Melvin', 'title': u'Professor Emeritus'},\n",
" {'img_src': u'/system/photos/2999/default/Merchant_-Preston.gif?1365706509',\n",
" 'name': u'Merchant, Preston',\n",
" 'title': None},\n",
" {'img_src': u'/system/photos/3000/default/Mintz_-Jim.gif?1365706529',\n",
" 'name': u'Mintz, James',\n",
" 'title': u'Adjunct Faculty'},\n",
" {'img_src': None, 'name': u'Morais, Betsy', 'title': u'Adjunct Faculty'},\n",
" {'img_src': u'/system/photos/2069/default/SNasar_112811.jpg?1322514324',\n",
" 'name': u'Nasar, Sylvia ',\n",
" 'title': u'John S. and James L. Knight Professor of Business Journalism'},\n",
" {'img_src': u'/system/photos/2070/default/VNavasky_112811.jpg?1322514777',\n",
" 'name': u'Navasky, Victor ',\n",
" 'title': u'George T. Delacorte Professor in Magazine Journalism; Director, Delacorte Center for Magazine Journalism; Chair, Columbia Journalism Review '},\n",
" {'img_src': None, 'name': u'Newman, Maria', 'title': u'Adjunct Faculty'},\n",
" {'img_src': u'/system/photos/2714/default/nisenholtz.jpg?1392672182',\n",
" 'name': u'Nisenholtz, Martin',\n",
" 'title': u'Adjunct Professor '},\n",
" {'img_src': u'/system/photos/2659/default/RNorton.jpg?1345495471',\n",
" 'name': u'Norton, Rob',\n",
" 'title': u'Adjunct Faculty'},\n",
" {'img_src': u'/system/photos/3288/default/habibanosheen2.jpg?1378324386',\n",
" 'name': u'Nosheen, Habiba',\n",
" 'title': u'Adjunct Professor'},\n",
" {'img_src': u'/system/photos/1993/default/CharlesOrnstein.gif?1365718296',\n",
" 'name': u'Ornstein, Charles',\n",
" 'title': u'Adjunct Faculty'},\n",
" {'img_src': u'/system/photos/826/default/padawer.jpg?1392672201',\n",
" 'name': u'Padawer , Ruth',\n",
" 'title': u'Adjunct Professor'},\n",
" {'img_src': u'/system/photos/2073/default/SPadwe_112811.jpg?1322516163',\n",
" 'name': u'Padwe, Sandy ',\n",
" 'title': u'Special Lecturer'},\n",
" {'img_src': u'/system/photos/3706/default/Diantha_Parker.jpg?1404750842',\n",
" 'name': u'Parker, Diantha',\n",
" 'title': u'Adjunct Faculty'},\n",
" {'img_src': None,\n",
" 'name': u'Parrish, Allison ',\n",
" 'title': u'Instructor, The Lede Program'},\n",
" {'img_src': u'/system/photos/1914/default/Patel_Headshot2.jpg?1319040513',\n",
" 'name': u'Patel, Samir S.',\n",
" 'title': u'Adjunct Faculty'},\n",
" {'img_src': u'/system/photos/3001/default/Perlman_-Merrill.gif?1365706552',\n",
" 'name': u'Perlman, Merrill',\n",
" 'title': u'Adjunct Faculty'},\n",
" {'img_src': u'/system/photos/3708/default/Lisa_Pollak_Photo.jpg?1405027267',\n",
" 'name': u'Pollak, Lisa',\n",
" 'title': u'Adjunct Faculty'},\n",
" {'img_src': u'/system/photos/852/default/pool-eckert.jpg?1392672210',\n",
" 'name': u'Pool-Eckert, Marquita',\n",
" 'title': u'Adjunct Faculty'},\n",
" {'img_src': None, 'name': u'Richardson, Lynda ', 'title': u'Adjunct Faculty'},\n",
" {'img_src': u'/system/photos/824/default/richmn.jpg?1392672219',\n",
" 'name': u'Richman, Joe',\n",
" 'title': u'Adjunct Faculty'},\n",
" {'img_src': u'/system/photos/3377/default/robbins.jpg?1392672228',\n",
" 'name': u'Robbins, Ed',\n",
" 'title': u'Adjunct Faculty'},\n",
" {'img_src': None, 'name': u'Roberts, Fletcher', 'title': u'Adjunct Faculty'},\n",
" {'img_src': u'/system/photos/851/default/Sacha.jpg?1280952529',\n",
" 'name': u'Sacha, Bob ',\n",
" 'title': u'Adjunct Faculty'},\n",
" {'img_src': u'/system/photos/823/default/RichSchapiro.gif?1365718608',\n",
" 'name': u'Schapiro, Rich',\n",
" 'title': u'Adjunct Faculty'},\n",
" {'img_src': None, 'name': u'Schatz, Robin', 'title': u'Adjunct Faculty'},\n",
" {'img_src': u'/system/photos/1913/default/BJSchechter.gif?1365718488',\n",
" 'name': u'Schecter, B.J.',\n",
" 'title': u'Adjunct Faculty'},\n",
" {'img_src': u'/system/photos/3584/default/HilkeSchellmann_final.jpg?1395348920',\n",
" 'name': u'Schellmann, Hilke',\n",
" 'title': u'Adjunct Faculty'},\n",
" {'img_src': u'/system/photos/3432/default/schoen.jpg?1392672237',\n",
" 'name': u'Schoen, John',\n",
" 'title': u'Adjunct Faculty'},\n",
" {'img_src': None,\n",
" 'name': u'Schoonmaker, Mary Ellen',\n",
" 'title': u'Adjunct Faculty'},\n",
" {'img_src': u'/system/photos/2075/default/MSchudson_112811.jpg?1322517107',\n",
" 'name': u'Schudson, Michael ',\n",
" 'title': u'Professor '},\n",
" {'img_src': u'/system/photos/2076/default/ESchumacher_112811.jpg?1322517858',\n",
" 'name': u'Schumacher-Matos, Ed ',\n",
" 'title': u'Adjunct Faculty'},\n",
" {'img_src': None, 'name': u'Schwartz, Jack ', 'title': None},\n",
" {'img_src': u'/system/photos/872/default/Seave.jpg?1280954557',\n",
" 'name': u'Seave, Ava ',\n",
" 'title': u'Adjunct Faculty'},\n",
" {'img_src': u'/system/photos/3718/default/Giannina_Segnini.jpg?1406325332',\n",
" 'name': u'Segnini, Giannina ',\n",
" 'title': u'James Madison Visiting Professor on First Amendment Issues'},\n",
" {'img_src': None,\n",
" 'name': u'Shanor, Donald',\n",
" 'title': u'G. L. Cabot Professor Emeritus '},\n",
" {'img_src': u'/system/photos/3003/default/Shapiro_-Bruce.gif?1365706590',\n",
" 'name': u'Shapiro, Bruce',\n",
" 'title': u'Executive Director for the Dart Center, Senior Executive Director for Professional Programs'},\n",
" {'img_src': u'/system/photos/2077/default/MShapiro_112811.jpg?1322518042',\n",
" 'name': u'Shapiro, Michael ',\n",
" 'title': u'Professor '},\n",
" {'img_src': u'/system/photos/2682/default/ahmed-shihab-eldin.gif?1365717716',\n",
" 'name': u'Shihab-Eldin, Ahmed ',\n",
" 'title': u'Adjunct Assistant Professor '},\n",
" {'img_src': u'/system/photos/4039/default/SICHA_headshot.jpg?1417546936',\n",
" 'name': u'Sicha, Choire',\n",
" 'title': u'Adjunct Professor'},\n",
" {'img_src': u'/system/photos/3235/default/Siegel_-Lloyd-2012.jpg?1375710566',\n",
" 'name': u'Siegel, Lloyd',\n",
" 'title': u'Adjunct Faculty'},\n",
" {'img_src': u'/system/photos/871/default/singer.jpg?1392672245',\n",
" 'name': u'Singer, Amy ',\n",
" 'title': u'Adjunct Faculty'},\n",
" {'img_src': u'/system/photos/1953/default/MariaSliwa.gif?1365717479',\n",
" 'name': u'Sliwa, Maria',\n",
" 'title': u'Adjunct Faculty'},\n",
" {'img_src': u'/system/photos/3004/default/Solomon_-Alisa.gif?1365706611',\n",
" 'name': u'Solomon, Alisa',\n",
" 'title': u'Professor & Director, Arts Concentration, M.A. Program'},\n",
" {'img_src': u'/system/photos/3664/default/jonathan-soma.jpg?1399473825',\n",
" 'name': u'Soma, Jonathan',\n",
" 'title': u'Director, The Lede Program'},\n",
" {'img_src': u'/system/photos/3204/default/Ernie.gif?1375217153',\n",
" 'name': u'Sotomayor, Ernest',\n",
" 'title': u'Dean of Student Affairs'},\n",
" {'img_src': u'/system/photos/848/default/PaulaSpan.gif?1365717399',\n",
" 'name': u'Span, Paula ',\n",
" 'title': u'Adjunct Professor'},\n",
" {'img_src': u'/system/photos/3236/default/Karen-Stabiner.jpg?1375717512',\n",
" 'name': u'Stabiner, Karen',\n",
" 'title': u'Adjunct Faculty'},\n",
" {'img_src': u'/system/photos/2080/default/JStewart_112811.jpg?1322518368',\n",
" 'name': u'Stewart, James',\n",
" 'title': u'Bloomberg Professor of Business Journalism'},\n",
" {'img_src': u'/system/photos/82/default/Stille.gif?1365718788',\n",
" 'name': u'Stille, Alexander',\n",
" 'title': u'San Paolo Professor of International Journalism'},\n",
" {'img_src': u'/system/photos/3005/default/Subramanian_-Sushma.gif?1365706634',\n",
" 'name': u'Subramanian, Sushma',\n",
" 'title': u'Adjunct Faculty'},\n",
" {'img_src': u'/system/photos/847/default/Surowicz.jpg?1280952446',\n",
" 'name': u'Surowicz, Simon',\n",
" 'title': None},\n",
" {'img_src': None,\n",
" 'name': u'Tenen, Dennis',\n",
" 'title': u'Instructor, The Lede Program'},\n",
" {'img_src': u'/system/photos/105/default/topping.jpg?1392672489',\n",
" 'name': u'Topping, Seymour ',\n",
" 'title': u'San Paolo Professor of International Journalism Emeritus'},\n",
" {'img_src': u'/system/photos/4167/default/Yogi_Trivedi.jpg?1425308161',\n",
" 'name': u'Trivedi, Yogi ',\n",
" 'title': u'Adjunct Professor'},\n",
" {'img_src': u'/system/photos/3057/default/Dody.jpg?1368463129',\n",
" 'name': u'Tsiantar, Dody ',\n",
" 'title': u'Adjunct Faculty'},\n",
" {'img_src': u'/system/photos/91/default/Duy_headshot-2web.jpg?1423507352',\n",
" 'name': u'Tu, Duy Linh',\n",
" 'title': u'Assistant Professor of Professional Practice & Director, Digital Media Program '},\n",
" {'img_src': u'/system/photos/125/default/Andie_Tucher2.jpg?1275665800',\n",
" 'name': u'Tucher, Andie ',\n",
" 'title': u'Associate Professor; Director, Ph.D. Program'},\n",
" {'img_src': u'/system/photos/3188/default/Mike-Ventura.gif?1374078031',\n",
" 'name': u'Ventura, Michael',\n",
" 'title': u'Adjunct Faculty'},\n",
" {'img_src': u'/system/photos/3007/default/Wald_-Jonathan.gif?1365706666',\n",
" 'name': u'Wald, Jonathan',\n",
" 'title': u'Adjunt Faculty'},\n",
" {'img_src': u'/system/photos/126/default/Richard_Wald2.jpg?1275665983',\n",
" 'name': u'Wald, Richard',\n",
" 'title': u'Fred W. Friendly Professor of Professional Practice in Media and Society'},\n",
" {'img_src': u'/system/photos/1261/default/wayne.jpg?1392672262',\n",
" 'name': u'Wayne, Leslie',\n",
" 'title': u'Adjunct Faculty'},\n",
" {'img_src': u'/system/photos/127/default/JWeiner.jpg?1291237982',\n",
" 'name': u'Weiner, Jonathan ',\n",
" 'title': u'Maxwell M. Geffen Professor of Medical and Scientific Journalism '},\n",
" {'img_src': u'/system/photos/128/default/Betsy_West2.jpg?1275668385',\n",
" 'name': u'West, Betsy ',\n",
" 'title': u'Associate Professor of Professional Practice'},\n",
" {'img_src': u'/system/photos/3008/default/Wheatley_-Bill.gif?1365706683',\n",
" 'name': u'Wheatley, Jr., William',\n",
" 'title': u'Adjunct Faculty'},\n",
" {'img_src': u'/system/photos/3665/default/chris-wiggins.jpg?1399473834',\n",
" 'name': u'Wiggins, Chris',\n",
" 'title': u'Instructor, The Lede Program'},\n",
" {'img_src': None, 'name': u'Wilson, Duff', 'title': u'Adjunct Faculty'},\n",
" {'img_src': u'/system/photos/2604/default/Tali_Web.jpg?1339446488',\n",
" 'name': u'Woodward, Tali ',\n",
" 'title': u'Director, M.A. Program'},\n",
" {'img_src': u'/system/photos/3983/default/wu_faculty.jpg?1415814909',\n",
" 'name': u'Wu, Tim',\n",
" 'title': u'Director of the Saul and Janice Poliak Center for the Study of First Amendment Issues'},\n",
" {'img_src': None,\n",
" 'name': u'Yu, Frederick T C.',\n",
" 'title': u'CBS Professor Emeritus International Journalism'},\n",
" {'img_src': None, 'name': u'Zucker, John', 'title': u'Adjunct Faculty'},\n",
" {'img_src': u'/system/photos/3058/default/zuckerman.jpg?1392672295',\n",
" 'name': u'Zuckerman, Jocelyn Craugh ',\n",
" 'title': u'Adjunct Faculty'}]"
]
}
],
"prompt_number": 28
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It worked! So good. Now we can do fun stuff with the data, like making a pandas data frame and picking out everyone listed as \"Adjunct Faculty\":"
]
},
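{
"cell_type": "code",
"collapsed": false,
"input": [
"import pandas as pd\n",
"\n",
"# Build a data frame from the list of dictionaries scraped above\n",
"faculty_frame = pd.DataFrame(faculty_list)\n",
"# Keep only the rows whose title is exactly \"Adjunct Faculty\"\n",
"faculty_frame[faculty_frame[\"title\"] == \"Adjunct Faculty\"]"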
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th></th><th>img_src</th><th>name</th><th>title</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>2</th><td>/system/photos/1943/default/Dolores-Barclay.gi...</td><td>Barclay, Dolores</td><td>Adjunct Faculty</td></tr>\n",
"    <tr><th>3</th><td>None</td><td>Baum, Geraldine</td><td>Adjunct Faculty</td></tr>\n",
"    <tr><th>6</th><td>/system/photos/2982/default/Bennet_John.gif?13...</td><td>Bennet, John</td><td>Adjunct Faculty</td></tr>\n",
"    <tr><th>7</th><td>/system/photos/2984/default/Bennett_Rob.gif?13...</td><td>Bennett, Rob</td><td>Adjunct Faculty</td></tr>\n",
"    <tr><th>9</th><td>None</td><td>Blair, Gwenda</td><td>Adjunct Faculty</td></tr>\n",
"    <tr><th>10</th><td>/system/photos/2985/default/Blum_David.gif?136...</td><td>Blum, David</td><td>Adjunct Faculty</td></tr>\n",
"    <tr><th>12</th><td>/system/photos/150/default/Walt-Bogdanich.gif?...</td><td>Bogdanich, Walt</td><td>Adjunct Faculty</td></tr>\n",
"    <tr><th>13</th><td>/system/photos/3055/default/Lennart-Bourin.jpg...</td><td>Bourin, Lennart</td><td>Adjunct Faculty</td></tr>\n",
"    <tr><th>14</th><td>None</td><td>Bradley, Theresa</td><td>Adjunct Faculty</td></tr>\n",
"    <tr><th>16</th><td>/system/photos/842/default/bruder.jpg?1392672045</td><td>Bruder, Jessica</td><td>Adjunct Faculty</td></tr>\n",
"    <tr><th>17</th><td>/system/photos/864/default/burford.jpg?1392672030</td><td>Burford, Melanie</td><td>Adjunct Faculty</td></tr>\n",
"    <tr><th>18</th><td>/system/photos/2986/default/Burleigh_Nina.gif?...</td><td>Burleigh, Nina</td><td>Adjunct Faculty</td></tr>\n",
"    <tr><th>22</th><td>/system/photos/3412/default/Cohen_Julie.jpg?13...</td><td>Cohen, Julie</td><td>Adjunct Faculty</td></tr>\n",
"    <tr><th>24</th><td>/system/photos/2989/default/Cohen_Sarah.gif?13...</td><td>Cohen, Sarah</td><td>Adjunct Faculty</td></tr>\n",
"    <tr><th>28</th><td>/system/photos/160/default/Unknown-1.jpeg?1378...</td><td>Coyne , Kevin</td><td>Adjunct Faculty</td></tr>\n",
"    <tr><th>31</th><td>/system/photos/1265/default/ADepalma.jpg?12912...</td><td>DePalma, Anthony</td><td>Adjunct Faculty</td></tr>\n",
"    <tr><th>32</th><td>/system/photos/3256/default/deitsch_.jpg?13763...</td><td>Deitsch, Richard</td><td>Adjunct Faculty</td></tr>\n",
"    <tr><th>36</th><td>None</td><td>Drew, Christopher</td><td>Adjunct Faculty</td></tr>\n",
"    <tr><th>38</th><td>/system/photos/4036/default/Cheryl_Einhorn.jpg...</td><td>Einhorn, Cheryl</td><td>Adjunct Faculty</td></tr>\n",
"    <tr><th>41</th><td>None</td><td>Evans, Farrell</td><td>Adjunct Faculty</td></tr>\n",
"    <tr><th>42</th><td>/system/photos/3488/default/ford.jpg?1392672068</td><td>Ford, Constance Mitchell</td><td>Adjunct Faculty</td></tr>\n",
"    <tr><th>44</th><td>/system/photos/645/default/Freeman.jpg?1279731376</td><td>Freeman, George</td><td>Adjunct Faculty</td></tr>\n",
"    <tr><th>46</th><td>/system/photos/162/default/Stephen_Fried.gif?1...</td><td>Fried, Stephen</td><td>Adjunct Faculty</td></tr>\n",
"    <tr><th>49</th><td>None</td><td>Gilderman, Greg</td><td>Adjunct Faculty</td></tr>\n",
"    <tr><th>51</th><td>None</td><td>Giudice, Barbara</td><td>Adjunct Faculty</td></tr>\n",
"    <tr><th>52</th><td>/system/photos/836/default/MartyGoldensohn.gif...</td><td>Goldensohn, Marty</td><td>Adjunct Faculty</td></tr>\n",
"    <tr><th>56</th><td>/system/photos/1512/default/AHaburchak.gif?136...</td><td>Haburchak, Alan</td><td>Adjunct Faculty</td></tr>\n",
"    <tr><th>60</th><td>None</td><td>Harris, Mark</td><td>Adjunct Faculty</td></tr>\n",
"    <tr><th>62</th><td>/system/photos/1530/default/LarryHeinzerling.g...</td><td>Heinzerling, Larry</td><td>Adjunct Faculty</td></tr>\n",
"    <tr><th>63</th><td>/system/photos/154/default/TomHerman.gif?13657...</td><td>Herman, Tom</td><td>Adjunct Faculty</td></tr>\n",
"    <tr><th>...</th><td>...</td><td>...</td><td>...</td></tr>\n",
"    <tr><th>110</th><td>/system/photos/3706/default/Diantha_Parker.jpg...</td><td>Parker, Diantha</td><td>Adjunct Faculty</td></tr>\n",
"    <tr><th>112</th><td>/system/photos/1914/default/Patel_Headshot2.jp...</td><td>Patel, Samir S.</td><td>Adjunct Faculty</td></tr>\n",
"    <tr><th>113</th><td>/system/photos/3001/default/Perlman_-Merrill.g...</td><td>Perlman, Merrill</td><td>Adjunct Faculty</td></tr>\n",
"    <tr><th>114</th><td>/system/photos/3708/default/Lisa_Pollak_Photo....</td><td>Pollak, Lisa</td><td>Adjunct Faculty</td></tr>\n",
"    <tr><th>115</th><td>/system/photos/852/default/pool-eckert.jpg?139...</td><td>Pool-Eckert, Marquita</td><td>Adjunct Faculty</td></tr>\n",
"    <tr><th>116</th><td>None</td><td>Richardson, Lynda</td><td>Adjunct Faculty</td></tr>\n",
"    <tr><th>117</th><td>/system/photos/824/default/richmn.jpg?1392672219</td><td>Richman, Joe</td><td>Adjunct Faculty</td></tr>\n",
"    <tr><th>118</th><td>/system/photos/3377/default/robbins.jpg?139267...</td><td>Robbins, Ed</td><td>Adjunct Faculty</td></tr>\n",
"    <tr><th>119</th><td>None</td><td>Roberts, Fletcher</td><td>Adjunct Faculty</td></tr>\n",
"    <tr><th>120</th><td>/system/photos/851/default/Sacha.jpg?1280952529</td><td>Sacha, Bob</td><td>Adjunct Faculty</td></tr>\n",
"    <tr><th>121</th><td>/system/photos/823/default/RichSchapiro.gif?13...</td><td>Schapiro, Rich</td><td>Adjunct Faculty</td></tr>\n",
"    <tr><th>122</th><td>None</td><td>Schatz, Robin</td><td>Adjunct Faculty</td></tr>\n",
"    <tr><th>123</th><td>/system/photos/1913/default/BJSchechter.gif?13...</td><td>Schecter, B.J.</td><td>Adjunct Faculty</td></tr>\n",
"    <tr><th>124</th><td>/system/photos/3584/default/HilkeSchellmann_fi...</td><td>Schellmann, Hilke</td><td>Adjunct Faculty</td></tr>\n",
"    <tr><th>125</th><td>/system/photos/3432/default/schoen.jpg?1392672237</td><td>Schoen, John</td><td>Adjunct Faculty</td></tr>\n",
"    <tr><th>126</th><td>None</td><td>Schoonmaker, Mary Ellen</td><td>Adjunct Faculty</td></tr>\n",
"    <tr><th>128</th><td>/system/photos/2076/default/ESchumacher_112811...</td><td>Schumacher-Matos, Ed</td><td>Adjunct Faculty</td></tr>\n",
"    <tr><th>130</th><td>/system/photos/872/default/Seave.jpg?1280954557</td><td>Seave, Ava</td><td>Adjunct Faculty</td></tr>\n",
"    <tr><th>137</th><td>/system/photos/3235/default/Siegel_-Lloyd-2012...</td><td>Siegel, Lloyd</td><td>Adjunct Faculty</td></tr>\n",
"    <tr><th>138</th><td>/system/photos/871/default/singer.jpg?1392672245</td><td>Singer, Amy</td><td>Adjunct Faculty</td></tr>\n",
"    <tr><th>139</th><td>/system/photos/1953/default/MariaSliwa.gif?136...</td><td>Sliwa, Maria</td><td>Adjunct Faculty</td></tr>\n",
"    <tr><th>144</th><td>/system/photos/3236/default/Karen-Stabiner.jpg...</td><td>Stabiner, Karen</td><td>Adjunct Faculty</td></tr>\n",
"    <tr><th>147</th><td>/system/photos/3005/default/Subramanian_-Sushm...</td><td>Subramanian, Sushma</td><td>Adjunct Faculty</td></tr>\n",
"    <tr><th>152</th><td>/system/photos/3057/default/Dody.jpg?1368463129</td><td>Tsiantar, Dody</td><td>Adjunct Faculty</td></tr>\n",
"    <tr><th>155</th><td>/system/photos/3188/default/Mike-Ventura.gif?1...</td><td>Ventura, Michael</td><td>Adjunct Faculty</td></tr>\n",
"    <tr><th>158</th><td>/system/photos/1261/default/wayne.jpg?1392672262</td><td>Wayne, Leslie</td><td>Adjunct Faculty</td></tr>\n",
"    <tr><th>161</th><td>/system/photos/3008/default/Wheatley_-Bill.gif...</td><td>Wheatley, Jr., William</td><td>Adjunct Faculty</td></tr>\n",
"    <tr><th>163</th><td>None</td><td>Wilson, Duff</td><td>Adjunct Faculty</td></tr>\n",
"    <tr><th>167</th><td>None</td><td>Zucker, John</td><td>Adjunct Faculty</td></tr>\n",
"    <tr><th>168</th><td>/system/photos/3058/default/zuckerman.jpg?1392...</td><td>Zuckerman, Jocelyn Craugh</td><td>Adjunct Faculty</td></tr>\n",
"  </tbody>\n",
"</table>"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 29
}
],
"prompt_number": 29
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##XML\n",
"\n",
"In general, tools that work with XML are much stricter about syntax than tools that work with HTML. Browsers tend to be very forgiving of errors in HTML, but will immediately reject XML that isn't well-formed.\n",
"\n",
"XML documents generally conform to a \"standard\" or \"format\": a pre-defined list of tag names and attribute names, plus rules for which tags can have which attributes and which tags can contain which other tags. For example, the document above is in the Atom XML format, [which you can find out more about here](http://en.wikipedia.org/wiki/Atom_(standard)). XML standards also give you some idea of what the document *means*: a consistent mapping between the document's structure and its semantics.\n",
"\n",
"In sum: XML documents conform to standards, they must be syntactically valid, and they have agreed-upon semantics. For these reasons, XML documents are considered to be much more friendly for computers to read than HTML documents. \n",
"\n",
"> CLEVER PEOPLE NOTE: XML and HTML work similarly enough, and XML documents can have standards, so why not just make an XML standard that defines all of the tags and attributes in HTML, and have the best of both worlds? [It's been tried before](http://en.wikipedia.org/wiki/XHTML), and there are several drawbacks, [enumerated here](http://stackoverflow.com/questions/5558502/is-html5-valid-xml), but mostly having to do with backwards compatibility."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"###Dealing with XML data\n",
"\n",
"Now, you *can* parse XML data with Beautiful Soup ([with one important caveat](http://www.crummy.com/software/BeautifulSoup/bs4/doc/#id17)). But one of the benefits of data in XML is that there are many pre-existing libraries for Python that are purpose-built for working with data in a particular XML standard. These libraries will save you the effort of having to figure out how documents in that standard are put together.\n",
"\n",
"There are [a truly bewildering number of XML standards](http://en.wikipedia.org/wiki/Category:XML-based_standards), each devised for more or less domain-specific tasks. (There are even [a truly bewildering number of XML standards for writing documents that define XML standards](http://en.wikipedia.org/wiki/XML_schema#XML_schema_languages).) Listed below are a few standards of interest to journalists, along with links to Python libraries for dealing with documents using those standards:\n",
"\n",
"* [Keyhole Markup Language](http://en.wikipedia.org/wiki/Keyhole_Markup_Language) (KML), used for geographic data: [fastkml](https://pypi.python.org/pypi/fastkml/)\n",
"* [Scalable Vector Graphics](http://en.wikipedia.org/wiki/Scalable_Vector_Graphics) (SVG), used for images and drawings: [pySVG](http://codeboje.de/pysvg/)\n",
"* [SOAP](http://en.wikipedia.org/wiki/SOAP_(protocol)), used for some web services: [pysimplesoap](https://code.google.com/p/pysimplesoap/)\n",
"* [Atom](http://en.wikipedia.org/wiki/Atom_(standard)), a set of standards used for web publishing and services: [feedparser](https://pypi.python.org/pypi/feedparser). (The `feedparser` library also helps to parse all manner of other web syndication formats.)\n",
"\n",
"###An example: RSS feeds\n",
"\n",
"One of the first tasks many students set themselves after learning about web scraping is to scrape the front page of the New York Times. *DON'T DO THIS* if you can avoid it. You're inviting disaster: the New York Times is free to change the way its HTML is structured at any moment, and when it does, your scraper will break. Instead, try using the New York Times RSS feed!\n",
"\n",
"RSS is an XML format that many websites use to publish their articles in a computer-readable form. (RSS support used to be all the rage back in the Internet days; fewer sites support it now than used to, and some web sites, like the New York Times, support it but don't advertise that fact.) Here's a link to the New York Times RSS feed for their front-page articles:\n",
"\n",
"http://rss.nytimes.com/services/xml/rss/nyt/HomePage.xml\n",
"\n",
"Click on that link, and you'll see a big mess of XML that's hard to make sense of by eye. We're going to use the `feedparser` library mentioned above to parse this RSS and get back a list of all of the article titles. The `feedparser` library essentially takes a big ball of RSS XML and turns it into a Python data structure (to be specific, an object whose `entries` attribute is a list of dictionaries, where each dictionary represents an article in the feed).\n",
"\n",
"First, check to see if you have `feedparser` installed."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import feedparser"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 30
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you get `ImportError: No module named feedparser`, try running this line (this will work ONLY on your AWS instances):"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"!sudo pip install feedparser"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"Password:"
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\r\n"
]
}
],
"prompt_number": 7
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Otherwise, you can use your `pip` skills to install feedparser however you'd like.\n",
"\n",
"Once you have `feedparser` installed, we can use it to read in a remote RSS file:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import feedparser\n",
"\n",
"feed = feedparser.parse(\"http://rss.nytimes.com/services/xml/rss/nyt/HomePage.xml\")\n",
"print type(feed.entries)"
],
"language": "python",
"metadata": {},
"outputs": [
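{
"output_type": "stream",
"stream": "stdout",
"text": [
"<type 'list'>\n"
]
}
],
"prompt_number": 31
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"feed.entries[0]"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 32,
"text": [
"{...\n",
" 'summary': u'Researchers have little experience with raises of such magnitude. The effects, especially in areas with low median incomes, could be profound.',\n",
" 'summary_detail': {'base': u'http://rss.nytimes.com/services/xml/rss/nyt/HomePage.xml',\n",
" 'language': None,\n",
" 'type': u'text/html',\n",
" 'value': u'Researchers have little experience with raises of such magnitude. The effects, especially in areas with low median incomes, could be profound.'},\n",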
" 'tags': [{'label': None,\n",
" 'scheme': u'http://www.nytimes.com/namespaces/keywords/nyt_geo',\n",
" 'term': u'New York City'},\n",
" {'label': None,\n",
" 'scheme': u'http://www.nytimes.com/namespaces/keywords/mdes',\n",
" 'term': u'Layoffs and Job Reductions'},\n",
" {'label': None,\n",
" 'scheme': u'http://www.nytimes.com/namespaces/keywords/des',\n",
" 'term': u'United States Economy'},\n",
" {'label': None,\n",
" 'scheme': u'http://www.nytimes.com/namespaces/keywords/mdes',\n",
" 'term': u'Research'},\n",
" {'label': None,\n",
" 'scheme': u'http://www.nytimes.com/namespaces/keywords/nyt_geo',\n",
" 'term': u'San Francisco (Calif)'},\n",
" {'label': None,\n",
" 'scheme': u'http://www.nytimes.com/namespaces/keywords/des',\n",
" 'term': u'Minimum Wage'},\n",
" {'label': None,\n",
" 'scheme': u'http://www.nytimes.com/namespaces/keywords/nyt_geo',\n",
" 'term': u'New York State'},\n",
" {'label': None,\n",
" 'scheme': u'http://www.nytimes.com/namespaces/nyt_org_all',\n",
" 'term': u'Amazon.com Inc|AMZN|NASDAQ'},\n",
" {'label': None,\n",
" 'scheme': u'http://www.nytimes.com/namespaces/keywords/des',\n",
" 'term': u'Wages and Salaries'},\n",
" {'label': None,\n",
" 'scheme': u'http://www.nytimes.com/namespaces/keywords/des',\n",
" 'term': u'Labor and Jobs'}],\n",
" 'title': u'Scale of Minimum Wage Rise Has Experts Guessing at Effect',\n",
" 'title_detail': {'base': u'http://rss.nytimes.com/services/xml/rss/nyt/HomePage.xml',\n",
" 'language': None,\n",
" 'type': u'text/plain',\n",
" 'value': u'Scale of Minimum Wage Rise Has Experts Guessing at Effect'}}"
]
}
],
"prompt_number": 32
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Okay, cool. Looking over this data structure, we can see that each entry is a dictionary, and the thing we want (the title of the article) is the value for the `title` key. Let's write a list comprehension to pull the titles out:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"[article['title'] for article in feed.entries]"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 33,
"text": [
"[u'Scale of Minimum Wage Rise Has Experts Guessing at Effect',\n",
" u'Mets 3, Dodgers 2, 10 Innings: Mets Break Zack Greinke\\u2019s Scoreless Streak in Win Over Dodgers',\n",
" u'Republicans Alter Script on Abortion, Seeking to Shift Debate',\n",
" u'Bobby Jindal Calls for States to Follow Louisiana\\u2019s Example in Toughening Gun Laws',\n",
" u'Fiat Chrysler Gets Record $105 Million Fine for Safety Issues',\n",
" u'Ethiopia\\u2019s Human Rights Activists See Scant Hope in Obama\\u2019s Visit',\n",
" u'Senate Resurrection of Export-Import Bank Goes to Divided House',\n",
" u'Spelman College Terminates Professorship Endowed by Bill Cosby',\n",
" u'Racial Divide Persists in Texas County Where Sandra Bland Died',\n",
" u'Hologram Performance by Chief Keef Is Shut Down by Police',\n",
" u'Lynch Says Death in Police Custody Highlights Fears Among Blacks',\n",
" u'Obama Delivers Tough-Love Message to End Kenya Trip',\n",
" u'Kitty Genovese Killing Is Retold in the Film \\u201837\\u2019',\n",
" u'12 Are Killed in Bombing Outside Hotel in Somalia',\n",
" u'Gawker\\u2019s Future: A Conversation With Nick Denton',\n",
" u'British Lord Resigns Posts in Scandal Over Drugs',\n",
" u'ArtsBeat: Adam Sandler\\u2019s \\u2018Pixels\\u2019 Can\\u2019t Topple \\u2018Ant-Man\\u2019 at Box Office',\n",
" u'Grace Notes: A Manhattan Project Veteran Had a Unique View of Atomic Bomb Work',\n",
" u'Your Weekend Briefing',\n",
" u'The Working Life: Proposed Raise for Fast-Food Employees Divides Low-Wage Workers']"
]
}
],
"prompt_number": 33
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##Conclusion\n",
"\n",
"By the end of this tutorial, you should feel confident in your ability to extract information from HTML and XML documents. There are a lot of subtleties we didn't go over, but you're well on your way! Here are some further links to aid in your exploration.\n",
"\n",
"* [A Gentle Introduction to XML](http://www.tei-c.org/release/doc/tei-p5-doc/en/html/SG.html), from [TEI](http://www.tei-c.org/index.xml).\n",
"* [Intro to Beautiful Soup](http://programminghistorian.org/lessons/intro-to-beautiful-soup)"
]
}
],
"metadata": {}
}
]
}