{ "metadata": { "name": "", "signature": "sha256:bd871b050722ec78607d72ddb284b5d7b3ab9d510a9183b7a2f4210003f450f6" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Beautiful Soup Basic HTML Scraping\n", "\n", "- **Author:** [Chris Albon](http://www.chrisalbon.com/), [@ChrisAlbon](https://twitter.com/chrisalbon)\n", "- **Date:** -\n", "- **Repo:** [Python 3 code snippets for data science](https://github.com/chrisalbon/code_py)\n", "- **Note:** -" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Import the modules" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# Import required modules\n", "import requests\n", "from bs4 import BeautifulSoup" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 61 }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Scrape the HTML and turn it into a Beautiful Soup object" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# Create a variable with the URL\n", "url = 'http://chrisralbon.com'\n", "\n", "# Use requests to get the contents\n", "r = requests.get(url)\n", "\n", "# Get the text of the contents\n", "html_content = r.text\n", "\n", "# Convert the HTML content into a Beautiful Soup object, naming the parser explicitly\n", "soup = BeautifulSoup(html_content, 'html.parser')" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Select the website's title" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# View the title tag of the soup object\n", "soup.title" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 63, "text": [ "Chris R. 
Albon" ] } ], "prompt_number": 63 }, { "cell_type": "markdown", "metadata": {}, "source": [ "### The name of the title tag" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# View the name of the title tag\n", "soup.title.name" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 64, "text": [ "'title'" ] } ], "prompt_number": 64 }, { "cell_type": "markdown", "metadata": {}, "source": [ "### The string inside the title tag" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# View the string within the title tag\n", "soup.title.string" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 65, "text": [ "'Chris R. Albon'" ] } ], "prompt_number": 65 }, { "cell_type": "markdown", "metadata": {}, "source": [ "### The first paragraph tag" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# View the first paragraph tag of the soup\n", "soup.p" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 67, "text": [ "

Data for social good.

" ] } ], "prompt_number": 67 }, { "cell_type": "markdown", "metadata": {}, "source": [ "### The parent of the title tag" ] }, { "cell_type": "code", "collapsed": false, "input": [ "soup.title.parent.name" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 68, "text": [ "'head'" ] } ], "prompt_number": 68 }, { "cell_type": "markdown", "metadata": {}, "source": [ "### The class of the first paragraph tag" ] }, { "cell_type": "code", "collapsed": false, "input": [ "soup.p['class']" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 69, "text": [ "['site-description']" ] } ], "prompt_number": 69 }, { "cell_type": "markdown", "metadata": {}, "source": [ "### The first link tag" ] }, { "cell_type": "code", "collapsed": false, "input": [ "soup.a" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 70, "text": [ "\"Blog" ] } ], "prompt_number": 70 }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Find all the link tags" ] }, { "cell_type": "code", "collapsed": false, "input": [ "soup.find_all('a')" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 71, "text": [ "[\"Blog,\n", " Chris R. Albon,\n", " Conflict Health Has Shut Down,\n", " About Chris Albon,\n", " About,\n", " Twitter,\n", " GitHub,\n", " Pinboard]" ] } ], "prompt_number": 71 }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Get all the text on the page" ] }, { "cell_type": "code", "collapsed": false, "input": [ "soup.get_text()" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 79, "text": [ "'\\n\\n\\n\\nChris R. 
Albon\\n\\n\\n\\n\\n\\n\\n\\n\\n\\r\\n\\t\\tvar infinite_conf = {\"button_text\":\"Older posts\",\"no_more_post\":\"No More Post\",\"enable_infinite\":\"1\"};\\r\\n\\t\\t\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\nChris R. Albon\\n\\nData for social good.\\n\\n\\n\\n\\n\\n\\n\\nFeb 16, 2014\\n\\nConflict Health Has Shut Down\\n\\n In 2008, I launched the blog Conflict Health to investigate and defend the role of health workers during political violence and armed conflicts. Four years later, I had written almost 500 posts on Conflict Health\u2026\\n\\n\\nFeb 14, 2014\\n\\nAbout Chris Albon\\n\\nShort version: I use data for social good. I also write about it. Longer version: I am the Director of a new crisis data project at Ushahidi, leading our work around the use of data\u2026\\n\\n\\n\\nPage 1 of 1\\n\\n\\n\\n\\n\\n\\nAbout | Twitter | GitHub | Pinboard\\n\\n\\n\\n\\n\\n\\n\\n\\n'" ] } ], "prompt_number": 79 }, { "cell_type": "markdown", "metadata": {}, "source": [ "### The string inside the first paragraph tag" ] }, { "cell_type": "code", "collapsed": false, "input": [ "soup.p.string" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 88, "text": [ "'Data for social good.'" ] } ], "prompt_number": 88 }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Find all the h2 tags" ] }, { "cell_type": "code", "collapsed": false, "input": [ "soup.find_all('h2')" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 121, "text": [ "[

\n", " Conflict Health Has Shut Down\n", "

,

\n", " About Chris Albon\n", "

]" ] } ], "prompt_number": 121 }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Find all the links on the page" ] }, { "cell_type": "code", "collapsed": false, "input": [ "soup.find_all('a')" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 97, "text": [ "[\"Blog,\n", " Chris R. Albon,\n", " Conflict Health Has Shut Down,\n", " About Chris Albon,\n", " About,\n", " Twitter,\n", " GitHub,\n", " Pinboard]" ] } ], "prompt_number": 97 }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Find all the tags with class=logo" ] }, { "cell_type": "code", "collapsed": false, "input": [ "soup.find_all(class_='logo')" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 100, "text": [ "[\"Blog]" ] } ], "prompt_number": 100 }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Select the string of the first link nested directly inside an h2 tag" ] }, { "cell_type": "code", "collapsed": false, "input": [ "posts = soup.select(\"h2 > a\")\n", "posts[0].string" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 120, "text": [ "'Conflict Health Has Shut Down'" ] } ], "prompt_number": 120 }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Print the pretty, nested version of the Beautiful Soup object" ] }, { "cell_type": "code", "collapsed": false, "input": [ "print(soup.prettify())" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "\n", "\n", " \n", " \n", " \n", " \n", " Chris R. Albon\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n", "
\n", " \n", " \"Blog\n", " \n", "
\n", "

\n", " \n", " Chris R. Albon\n", " \n", "

\n", "

\n", " Data for social good.\n", "

\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", " \n", "

\n", " \n", " Conflict Health Has Shut Down\n", " \n", "

\n", "

\n", " In 2008, I launched the blog Conflict Health to investigate and defend the role of health workers during political violence and armed conflicts. Four years later, I had written almost 500 posts on Conflict Health\u2026\n", "

\n", "
\n", "
\n", " \n", "

\n", " \n", " About Chris Albon\n", " \n", "

\n", "

\n", " Short version: I use data for social good. I also write about it. Longer version: I am the Director of a new crisis data project at Ushahidi, leading our work around the use of data\u2026\n", "

\n", "
\n", "
\n", " \n", "
\n", "
\n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", "\n" ] } ], "prompt_number": 114 } ], "metadata": {} } ] }