{
 "metadata": {
  "name": "",
  "signature": "sha256:bd871b050722ec78607d72ddb284b5d7b3ab9d510a9183b7a2f4210003f450f6"
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "# Beautiful Soup Basic HTML Scraping\n",
      "\n",
      "- **Author:** [Chris Albon](http://www.chrisalbon.com/), [@ChrisAlbon](https://twitter.com/chrisalbon)\n",
      "- **Date:** -\n",
      "- **Repo:** [Python 3 code snippets for data science](https://github.com/chrisalbon/code_py)\n",
      "- **Note:** -"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### Import the modules"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Import required modules\n",
      "import requests\n",
      "from bs4 import BeautifulSoup"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 61
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### Scrape the HTML and turn it into a Beautiful Soup object"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Create a variable with the url\n",
      "url = 'http://chrisralbon.com'\n",
      "\n",
      "# Use requests to get the contents\n",
      "r = requests.get(url)\n",
      "\n",
      "# Get the text of the contents\n",
      "html_content = r.text\n",
      "\n",
      "# Convert the HTML content into a Beautiful Soup object, naming the parser explicitly\n",
      "soup = BeautifulSoup(html_content, 'html.parser')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
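    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### A more defensive way to fetch the page (sketch)\n",
      "\n",
      "The request above assumes the server answers quickly and successfully. As a minimal sketch, the cell below adds a timeout and a status check before parsing. The 10-second timeout and the names `response` and `checked_soup` are illustrative assumptions, not part of the original notebook."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Sketch only: the timeout value is an assumption, tune it for your use case\n",
      "response = requests.get(url, timeout=10)\n",
      "\n",
      "# Raise requests.exceptions.HTTPError if the server returned a 4xx or 5xx status\n",
      "response.raise_for_status()\n",
      "\n",
      "# Parse with Python's built-in parser so the result does not depend on optional parsers\n",
      "checked_soup = BeautifulSoup(response.text, 'html.parser')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },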
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### Select the website's title"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# View the title tag of the soup object\n",
      "soup.title"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 63,
       "text": [
        "<title>Chris R. Albon</title>"
       ]
      }
     ],
     "prompt_number": 63
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### The name of the title tag"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# View the name of the title tag\n",
      "soup.title.name"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 64,
       "text": [
        "'title'"
       ]
      }
     ],
     "prompt_number": 64
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### The string inside the title tag"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# View the string within the title tag\n",
      "soup.title.string"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 65,
       "text": [
        "'Chris R. Albon'"
       ]
      }
     ],
     "prompt_number": 65
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### First paragraph tag"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# View the first paragraph tag of the soup\n",
      "soup.p"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 67,
       "text": [
        "<p class=\"site-description\">Data for social good.</p>"
       ]
      }
     ],
     "prompt_number": 67
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### The parent of the title tag"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# View the name of the tag that contains the title tag\n",
      "soup.title.parent.name"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 68,
       "text": [
        "'head'"
       ]
      }
     ],
     "prompt_number": 68
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### The class of the first paragraph tag"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# View the class attribute of the first paragraph tag (returned as a list)\n",
      "soup.p['class']"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 69,
       "text": [
        "['site-description']"
       ]
      }
     ],
     "prompt_number": 69
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### The first link tag"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# View the first link (a) tag of the soup\n",
      "soup.a"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 70,
       "text": [
        "<a class=\"logo\" href=\"http://www.chrisralbon.com/about-chris-albon/\"><img alt=\"Blog Logo\" src=\"/content/images/2014/Feb/chrisalbon_radial-16.png\"/></a>"
       ]
      }
     ],
     "prompt_number": 70
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### Find all the link tags"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Find every link (a) tag in the soup\n",
      "soup.find_all('a')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 71,
       "text": [
        "[<a class=\"logo\" href=\"http://www.chrisralbon.com/about-chris-albon/\"><img alt=\"Blog Logo\" src=\"/content/images/2014/Feb/chrisalbon_radial-16.png\"/></a>,\n",
        " <a href=\"http://www.chrisralbon.com\">Chris R. Albon</a>,\n",
        " <a href=\"/conflict-health/\">Conflict Health Has Shut Down</a>,\n",
        " <a href=\"/about-chris-albon/\">About Chris Albon</a>,\n",
        " <a href=\"http://www.chrisralbon.com/about-chris-albon/\">About</a>,\n",
        " <a href=\"https://twitter.com/chrisalbon\">Twitter</a>,\n",
        " <a href=\"https://github.com/chrisalbon\">GitHub</a>,\n",
        " <a href=\"https://pinboard.in/u:chrisalbon\">Pinboard</a>]"
       ]
      }
     ],
     "prompt_number": 71
    },
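    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### Pull the URL out of each link tag (sketch)\n",
      "\n",
      "`find_all` returns whole tags, while a scraper often only needs the addresses. The cell below is a minimal sketch of one way to collect them and to resolve relative links against the page URL. The names `hrefs` and `absolute_links` are illustrative, and `urljoin` from the standard library is an addition that the original notebook does not use."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from urllib.parse import urljoin\n",
      "\n",
      "# Collect the href of every link, skipping a tags that have no href attribute\n",
      "hrefs = [a['href'] for a in soup.find_all('a', href=True)]\n",
      "\n",
      "# Resolve relative links such as '/conflict-health/' against the page URL\n",
      "absolute_links = [urljoin(url, href) for href in hrefs]\n",
      "absolute_links"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },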
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### Get all the text on the page"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Extract all the text in the soup, without the tags\n",
      "soup.get_text()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 79,
       "text": [
        "'\\n\\n\\n\\nChris R. Albon\\n\\n\\n\\n\\n\\n\\n\\n\\n\\r\\n\\t\\tvar infinite_conf = {\"button_text\":\"Older posts\",\"no_more_post\":\"No More Post\",\"enable_infinite\":\"1\"};\\r\\n\\t\\t\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\nChris R. Albon\\n\\nData for social good.\\n\\n\\n\\n\\n\\n\\n\\nFeb 16, 2014\\n\\nConflict Health Has Shut Down\\n\\n In 2008, I launched the blog Conflict Health to investigate and defend the role of health workers during political violence and armed conflicts. Four years later, I had written almost 500 posts on Conflict Health\u2026\\n\\n\\nFeb 14, 2014\\n\\nAbout Chris Albon\\n\\nShort version: I use data for social good. I also write about it. Longer version: I am the Director of a new crisis data project at Ushahidi, leading our work around the use of data\u2026\\n\\n\\n\\nPage 1 of 1\\n\\n\\n\\n\\n\\n\\nAbout | Twitter | GitHub | Pinboard\\n\\n\\n\\n\\n\\n\\n\\n\\n'"
       ]
      }
     ],
     "prompt_number": 79
    },
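    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### Get the text with the whitespace cleaned up (sketch)\n",
      "\n",
      "The raw text above is dominated by newlines and tabs left over from the page layout. As a minimal sketch, `get_text` also accepts `separator` and `strip` arguments, which can be used to strip each text fragment and join the fragments with a single space. The name `clean_text` is illustrative."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Strip whitespace from each text fragment and join the fragments with one space\n",
      "clean_text = soup.get_text(separator=' ', strip=True)\n",
      "clean_text"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },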
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### The string inside the first paragraph tag"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# View the string within the first paragraph tag\n",
      "soup.p.string"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 88,
       "text": [
        "'Data for social good.'"
       ]
      }
     ],
     "prompt_number": 88
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### Find all the h2 tags"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Find every h2 tag in the soup\n",
      "soup.find_all('h2')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 121,
       "text": [
        "[<h2 class=\"entry-title\">\n",
        " <a href=\"/conflict-health/\">Conflict Health Has Shut Down</a>\n",
        " </h2>, <h2 class=\"entry-title\">\n",
        " <a href=\"/about-chris-albon/\">About Chris Albon</a>\n",
        " </h2>]"
       ]
      }
     ],
     "prompt_number": 121
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### Find all the tags with the class `logo`"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Find every tag whose class attribute includes 'logo'\n",
      "soup.find_all(class_='logo')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 100,
       "text": [
        "[<a class=\"logo\" href=\"http://www.chrisralbon.com/about-chris-albon/\"><img alt=\"Blog Logo\" src=\"/content/images/2014/Feb/chrisalbon_radial-16.png\"/></a>]"
       ]
      }
     ],
     "prompt_number": 100
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### Select the string inside the first link nested directly inside an h2 tag"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
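      "# Select every link (a) tag that is a direct child of an h2 tag\n",
      "posts = soup.select(\"h2 > a\")\n",
      "\n",
      "# View the string within the first of those links\n",
      "posts[0].string"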
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 120,
       "text": [
        "'Conflict Health Has Shut Down'"
       ]
      }
     ],
     "prompt_number": 120
    },
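    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### Collect the title and link of every post (sketch)\n",
      "\n",
      "The cell above reads only the first match. As a minimal sketch, the same `h2 > a` selector can be looped over to pair each post title with its link. The name `post_links` is illustrative and not part of the original notebook."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Build a list of (title, href) pairs, one for each link that is a direct child of an h2 tag\n",
      "post_links = [(a.get_text(strip=True), a.get('href')) for a in soup.select('h2 > a')]\n",
      "post_links"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },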
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### Print the pretty, nested version of the Beautiful Soup object"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Print the soup as indented, nested HTML\n",
      "print(soup.prettify())"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "<!DOCTYPE html>\n",
        "<html>\n",
        " <head>\n",
        "  <meta charset=\"utf-8\" content=\"text/html\" http-equiv=\"Content-Type\"/>\n",
        "  <meta content=\"IE=edge,chrome=1\" http-equiv=\"X-UA-Compatible\"/>\n",
        "  <title>\n",
        "   Chris R. Albon\n",
        "  </title>\n",
        "  <meta content=\"Data for social good.\" name=\"description\"/>\n",
        "  <!--[if lt IE 9]>\r\n",
        "\t\t\t<script type=\"text/javascript\" src=\"/assets/js/html5shiv.js\"></script>\r\n",
        "\t\t<![endif]-->\n",
        "  <meta content=\"True\" name=\"HandheldFriendly\"/>\n",
        "  <meta content=\"320\" name=\"MobileOptimized\"/>\n",
        "  <meta content=\"width=device-width, initial-scale=1.0\" name=\"viewport\"/>\n",
        "  <link href=\"http://fonts.googleapis.com/css?family=Lato|Oswald|Open+Sans:300|Merriweather:400,400italic,700,700italic\" rel=\"stylesheet\" type=\"text/css\"/>\n",
        "  <link href=\"/assets/css/framework.css\" rel=\"stylesheet\" type=\"text/css\"/>\n",
        "  <link href=\"/assets/css/style.css\" rel=\"stylesheet\" type=\"text/css\"/>\n",
        "  <script type=\"text/javascript\">\n",
        "   var infinite_conf = {\"button_text\":\"Older posts\",\"no_more_post\":\"No More Post\",\"enable_infinite\":\"1\"};\n",
        "  </script>\n",
        "  <meta content=\"Ghost 0.4\" name=\"generator\"/>\n",
        "  <link href=\"/rss/\" rel=\"alternate\" title=\"Chris R. Albon\" type=\"application/rss+xml\"/>\n",
        "  <link href=\"http://www.chrisralbon.com/\" rel=\"canonical\"/>\n",
        " </head>\n",
        " <body class=\"home-template\">\n",
        "  <header class=\"header-section container\" style=\"background-image: url(/content/images/2014/Feb/orange_bg.png)\">\n",
        "   <div class=\"row\">\n",
        "    <a class=\"logo\" href=\"http://www.chrisralbon.com/about-chris-albon/\">\n",
        "     <img alt=\"Blog Logo\" src=\"/content/images/2014/Feb/chrisalbon_radial-16.png\"/>\n",
        "    </a>\n",
        "    <div class=\"branding\">\n",
        "     <h1 class=\"site-title\">\n",
        "      <a href=\"http://www.chrisralbon.com\">\n",
        "       Chris R. Albon\n",
        "      </a>\n",
        "     </h1>\n",
        "     <p class=\"site-description\">\n",
        "      Data for social good.\n",
        "     </p>\n",
        "    </div>\n",
        "   </div>\n",
        "  </header>\n",
        "  <section class=\"main-content container\">\n",
        "   <div class=\"row\">\n",
        "    <div class=\"post-list\">\n",
        "     <article class=\"entry-post post\">\n",
        "      <time class=\"entry-date\">\n",
        "       Feb 16, 2014\n",
        "      </time>\n",
        "      <h2 class=\"entry-title\">\n",
        "       <a href=\"/conflict-health/\">\n",
        "        Conflict Health Has Shut Down\n",
        "       </a>\n",
        "      </h2>\n",
        "      <p>\n",
        "       In 2008, I launched the blog Conflict Health to investigate and defend the role of health workers during political violence and armed conflicts. Four years later, I had written almost 500 posts on Conflict Health\u2026\n",
        "      </p>\n",
        "     </article>\n",
        "     <article class=\"entry-post post\">\n",
        "      <time class=\"entry-date\">\n",
        "       Feb 14, 2014\n",
        "      </time>\n",
        "      <h2 class=\"entry-title\">\n",
        "       <a href=\"/about-chris-albon/\">\n",
        "        About Chris Albon\n",
        "       </a>\n",
        "      </h2>\n",
        "      <p>\n",
        "       Short version: I use data for social good. I also write about it. Longer version: I am the Director of a new crisis data project at Ushahidi, leading our work around the use of data\u2026\n",
        "      </p>\n",
        "     </article>\n",
        "    </div>\n",
        "    <nav class=\"pagination clearfix\" role=\"pagination\">\n",
        "     <span class=\"page-number\">\n",
        "      Page 1 of 1\n",
        "     </span>\n",
        "    </nav>\n",
        "   </div>\n",
        "  </section>\n",
        "  <footer class=\"footer-section container\">\n",
        "   <div class=\"row\">\n",
        "    <div class=\"signature\">\n",
        "     <a href=\"http://www.chrisralbon.com/about-chris-albon/\">\n",
        "      About\n",
        "     </a>\n",
        "     |\n",
        "     <a href=\"https://twitter.com/chrisalbon\">\n",
        "      Twitter\n",
        "     </a>\n",
        "     |\n",
        "     <a href=\"https://github.com/chrisalbon\">\n",
        "      GitHub\n",
        "     </a>\n",
        "     |\n",
        "     <a href=\"https://pinboard.in/u:chrisalbon\">\n",
        "      Pinboard\n",
        "     </a>\n",
        "    </div>\n",
        "   </div>\n",
        "  </footer>\n",
        "  <!-- .footer-section -->\n",
        "  <script src=\"/public/jquery.js?v=821a9ed878\">\n",
        "  </script>\n",
        "  <script src=\"https://google-code-prettify.googlecode.com/svn/loader/run_prettify.js\">\n",
        "  </script>\n",
        "  <img alt=\"\" hidden=\"\" src=\"/view.gif?page=/\" style=\"display:none\"/>\n",
        " </body>\n",
        "</html>\n",
        "\n"
       ]
      }
     ],
     "prompt_number": 114
    }
   ],
   "metadata": {}
  }
 ]
}