<a>` tags (which contain the names of the TV shows that we were looking for):"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Fluffy: Deep Space Nine, Mr. Belvedere\n",
"Monsieur Whiskeurs: The X-Files, Fresh Prince\n"
]
}
],
"source": [
"for kitten_tag in kitten_tags:\n",
" h2_tag = kitten_tag.find('h2')\n",
" a_tags = kitten_tag.find_all('a')\n",
" a_tag_strings = [tag.string for tag in a_tags]\n",
" a_tag_strings_joined = \", \".join(a_tag_strings)\n",
" print(h2_tag.string + \": \" + a_tag_strings_joined)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"> EXERCISE: Modify the code above to print out a list of kitten names along with the last check-up date for that kitten."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"> EXTRA FUN EXERCISE: Rewrite the code above to create a dictionary that maps kitten names to a list of links to that kitten's favorite shows. I.e., you should end up with a dictionary that looks like this:\n",
"\n",
" {u'Fluffy': [u'http://www.imdb.com/title/tt0106145/',\n",
" u'http://www.imdb.com/title/tt0088576/'],\n",
" u'Monsieur Whiskeurs': [u'http://www.imdb.com/title/tt0106179/',\n",
" u'http://www.imdb.com/title/tt0098800/']}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Finding sibling tags\n",
"\n",
"Often, the tags we're looking for don't have a distinguishing characteristic, like a `class` attribute, that would allow us to find them with `.find()` and `.find_all()`, and they aren't in a parent-child relationship with one another either. This can be tricky! Take the following HTML snippet, for example:"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"cheese_html = \"\"\"\n",
"<h2>Camembert</h2>\n",
"<p>A soft cheese made in the Camembert region of France.</p>\n",
"\n",
"<h2>Cheddar</h2>\n",
"<p>A yellow cheese made in the Cheddar region of... France, probably, idk whatevs.</p>\n",
"\"\"\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If our task were to create a dictionary that maps the name of each cheese to the description in the `<p>` tag that directly follows it, we'd be out of luck with `.find()` and `.find_all()` alone. Fortunately, Beautiful Soup has a `.find_next_sibling()` method, which searches the siblings that follow the tag you call it on (i.e., tags that share the same parent) and returns the next one matching the criteria you give it. So, for example, to accomplish the task outlined above:"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'Camembert': 'A soft cheese made in the Camembert region of France.',\n",
" 'Cheddar': 'A yellow cheese made in the Cheddar region of... France, probably, idk whatevs.'}"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"document = BeautifulSoup(cheese_html, \"html.parser\")\n",
"cheese_dict = {}\n",
"for h2_tag in document.find_all('h2'):\n",
" cheese_name = h2_tag.string\n",
" cheese_desc_tag = h2_tag.find_next_sibling('p')\n",
" cheese_dict[cheese_name] = cheese_desc_tag.string\n",
"\n",
"cheese_dict"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You now know most of what you need to know to scrape web pages effectively. Good job!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### When things go wrong with Beautiful Soup\n",
"\n",
"A number of things might go wrong with Beautiful Soup. You might, for example, search for a tag that doesn't exist in the document:"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"footer_tag = document.find(\"footer\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Beautiful Soup doesn't return an error if it can't find the tag you want. Instead, it returns `None`:"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"NoneType"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"type(footer_tag)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you try to call a method on this `None` object anyway, you'll get an error like this:"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"ename": "AttributeError",
"evalue": "'NoneType' object has no attribute 'find'",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mAttributeError\u001b[0m Traceback (most recent call last)",
"\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mfooter_tag\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mfind\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"p\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
"\u001b[0;31mAttributeError\u001b[0m: 'NoneType' object has no attribute 'find'"
]
}
],
"source": [
"footer_tag.find(\"p\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You might also inadvertently try to get an attribute of a tag that wasn't actually found. You'll get a similar error in that case:"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"ename": "TypeError",
"evalue": "'NoneType' object is not subscriptable",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mTypeError\u001b[0m Traceback (most recent call last)",
"\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mfooter_tag\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'title'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
"\u001b[0;31mTypeError\u001b[0m: 'NoneType' object is not subscriptable"
]
}
],
"source": [
"footer_tag['title']"
]
},
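{
"cell_type": "markdown",
"metadata": {},
"source": [
"One way to guard against both of these errors (a minimal sketch, not the only approach) is to check whether `.find()` actually returned a tag before doing anything with the result:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"footer_tag = document.find(\"footer\")\n",
"if footer_tag is not None:\n",
"    # only safe to call methods or get attributes when a tag was found\n",
"    print(footer_tag.string)\n",
"else:\n",
"    print(\"no footer tag found in this document\")"
]
},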
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Whenever you see an error like `TypeError: 'NoneType' object is not subscriptable`, it's a good idea to check whether your method calls actually found the thing you were looking for.\n",
"\n",
"The `.find_all()` method, by contrast, returns an empty list if it doesn't find any of the tags you wanted:"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[]\n"
]
}
],
"source": [
"footer_tags = document.find_all(\"footer\")\n",
"print(footer_tags)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you attempt to access one of the elements of this empty list anyway..."
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"ename": "IndexError",
"evalue": "list index out of range",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mIndexError\u001b[0m Traceback (most recent call last)",
"\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mprint\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mfooter_tags\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;36m0\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mstring\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
"\u001b[0;31mIndexError\u001b[0m: list index out of range"
]
}
],
"source": [
"print(footer_tags[0].string)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"...you'll get an `IndexError`."
]
},
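{
"cell_type": "markdown",
"metadata": {},
"source": [
"A simple guard (one sketch among many) is to check whether the list is empty before indexing into it:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"footer_tags = document.find_all(\"footer\")\n",
"if len(footer_tags) > 0:\n",
"    print(footer_tags[0].string)\n",
"else:\n",
"    print(\"no footer tags found in this document\")"
]
},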
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Further reading\n",
"\n",
"* [Chapter 11](https://automatetheboringstuff.com/chapter11/) from Al Sweigart's [Automate the Boring Stuff with Python](https://automatetheboringstuff.com/) is another good take on this material (and discusses a wider range of techniques).\n",
"* [The official Beautiful Soup documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) provides a systematic walkthrough of the library's functionality. If you find yourself thinking, \"it really should be easy to do the thing that I want to do, why isn't it easier?\" then check the documentation! Leonard's probably already thought of a way to make it easier and implemented a feature in the code to help you out.\n",
"* Beautiful Soup is the best scraping library out there for quick jobs, but if you have a larger site that you need to scrape, you might look into [Scrapy](http://scrapy.org/), which bundles a good parser with a framework for writing web \"spiders\" (i.e., programs that parse web pages and follow the links found there, in order to make a catalog of an entire web site, not just a single web page)."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.1"
}
},
"nbformat": 4,
"nbformat_minor": 1
}