{
 "metadata": {
  "name": ""
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "#Mining the Social Web, 2nd Edition\n",
      "\n",
      "##Chapter 8: Mining the Semantically Marked-Up Web: Extracting Microformats, Inferencing Over RDF, and More\n",
      "\n",
      "This IPython Notebook provides an interactive way to follow along with and explore the numbered examples from [_Mining the Social Web (2nd Edition)_](http://bit.ly/135dHfs). The intent behind this notebook is to reinforce the concepts from the sample code in a fun, convenient, and effective way. This notebook assumes that you are reading along with the book and have the context of the discussion as you work through these exercises.\n",
      "\n",
      "In the somewhat unlikely event that you've somehow stumbled across this notebook outside of its context on GitHub, [you can find the full source code repository here](http://bit.ly/16kGNyb).\n",
      "\n",
      "## Copyright and Licensing\n",
      "\n",
      "You are free to use or adapt this notebook for any purpose you'd like. However, please respect the [Simplified BSD License](https://github.com/ptwobrussell/Mining-the-Social-Web-2nd-Edition/blob/master/LICENSE.txt) that governs its use."
     ]
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Example 1. Extracting geo-microformatted data from a Wikipedia page"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import requests # pip install requests\n",
      "from BeautifulSoup import BeautifulSoup # pip install BeautifulSoup\n",
      "\n",
      "# XXX: Any URL containing a geo microformat...\n",
      "\n",
      "URL = 'http://en.wikipedia.org/wiki/Franklin,_Tennessee'\n",
      "\n",
      "# In the case of extracting content from Wikipedia, be sure to\n",
      "# review its \"Bot Policy,\" which is defined at\n",
      "# http://meta.wikimedia.org/wiki/Bot_policy#Unacceptable_usage\n",
      "\n",
      "req = requests.get(URL, headers={'User-Agent' : \"Mining the Social Web\"})\n",
      "soup = BeautifulSoup(req.text)\n",
      "\n",
      "geoTag = soup.find(True, 'geo')\n",
      "\n",
      "if geoTag and len(geoTag) > 1:\n",
      "    lat = geoTag.find(True, 'latitude').string\n",
      "    lon = geoTag.find(True, 'longitude').string\n",
      "    print 'Location is at', lat, lon\n",
      "elif geoTag and len(geoTag) == 1:\n",
      "    (lat, lon) = geoTag.string.split(';')\n",
      "    (lat, lon) = (lat.strip(), lon.strip())\n",
      "    print 'Location is at', lat, lon\n",
      "else:\n",
      "    print 'No location found'"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Example 2. Displaying geo-microformats with Google Maps in IPython Notebook"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from IPython.display import IFrame\n",
      "from IPython.core.display import display\n",
      "\n",
      "# Google Maps URL template for an iframe\n",
      "\n",
      "google_maps_url = \"http://maps.google.com/maps?q={0}+{1}&\" + \\\n",
      "  \"ie=UTF8&t=h&z=14&{0},{1}&output=embed\".format(lat, lon)\n",
      "\n",
      "display(IFrame(google_maps_url, '425px', '350px'))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Example 3. Extracting hRecipe data from a web page"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import sys\n",
      "import requests\n",
      "import json\n",
      "import BeautifulSoup\n",
      "\n",
      "# Pass in a URL containing hRecipe...\n",
      "\n",
      "URL = 'http://britishfood.about.com/od/recipeindex/r/applepie.htm'\n",
      "\n",
      "# Parse out some of the pertinent information for a recipe.\n",
      "# See http://microformats.org/wiki/hrecipe.\n",
      "\n",
      "\n",
      "def parse_hrecipe(url):\n",
      "    req = requests.get(URL)\n",
      "    \n",
      "    soup = BeautifulSoup.BeautifulSoup(req.text)\n",
      "    \n",
      "    hrecipe = soup.find(True, 'hrecipe')\n",
      "\n",
      "    if hrecipe and len(hrecipe) > 1:\n",
      "        fn = hrecipe.find(True, 'fn').string\n",
      "        author = hrecipe.find(True, 'author').find(text=True)\n",
      "        ingredients = [i.string \n",
      "                            for i in hrecipe.findAll(True, 'ingredient') \n",
      "                                if i.string is not None]\n",
      "\n",
      "        instructions = []\n",
      "        for i in hrecipe.find(True, 'instructions'):\n",
      "            if type(i) == BeautifulSoup.Tag:\n",
      "                s = ''.join(i.findAll(text=True)).strip()\n",
      "            elif type(i) == BeautifulSoup.NavigableString:\n",
      "                s = i.string.strip()\n",
      "            else:\n",
      "                continue\n",
      "\n",
      "            if s != '': \n",
      "                instructions += [s]\n",
      "\n",
      "        return {\n",
      "            'name': fn,\n",
      "            'author': author,\n",
      "            'ingredients': ingredients,\n",
      "            'instructions': instructions,\n",
      "            }\n",
      "    else:\n",
      "        return {}\n",
      "\n",
      "\n",
      "recipe = parse_hrecipe(URL)\n",
      "print json.dumps(recipe, indent=4)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Example 4. Parsing hReview-aggregate microformat data for a recipe"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import requests\n",
      "import json\n",
      "from BeautifulSoup import BeautifulSoup\n",
      "\n",
      "# Pass in a URL that contains hReview-aggregate info...\n",
      "\n",
      "URL = 'http://britishfood.about.com/od/recipeindex/r/applepie.htm'\n",
      "\n",
      "def parse_hreview_aggregate(url, item_type):\n",
      "    \n",
      "    req = requests.get(URL)\n",
      "    \n",
      "    soup = BeautifulSoup(req.text)\n",
      "    \n",
      "    # Find the hRecipe or whatever other kind of parent item encapsulates\n",
      "    # the hReview (a required field).\n",
      "    \n",
      "    item_element = soup.find(True, item_type)\n",
      "    item = item_element.find(True, 'item').find(True, 'fn').text\n",
      "        \n",
      "    # And now parse out the hReview\n",
      "    \n",
      "    hreview = soup.find(True, 'hreview-aggregate')\n",
      "    \n",
      "    # Required field\n",
      "    \n",
      "    rating = hreview.find(True, 'rating').find(True, 'value-title')['title']\n",
      "    \n",
      "    # Optional fields\n",
      "    \n",
      "    try:\n",
      "        count = hreview.find(True, 'count').text\n",
      "    except AttributeError: # optional\n",
      "        count = None\n",
      "    try:\n",
      "        votes = hreview.find(True, 'votes').text\n",
      "    except AttributeError: # optional\n",
      "        votes = None\n",
      "\n",
      "    try:\n",
      "        summary = hreview.find(True, 'summary').text\n",
      "    except AttributeError: # optional\n",
      "        summary = None\n",
      "\n",
      "    return {\n",
      "        'item': item,\n",
      "        'rating': rating,\n",
      "        'count': count,\n",
      "        'votes': votes,\n",
      "        'summary' : summary\n",
      "    }\n",
      "\n",
      "# Find hReview aggregate information for an hRecipe\n",
      "\n",
      "reviews = parse_hreview_aggregate(URL, 'hrecipe')\n",
      "\n",
      "print json.dumps(reviews, indent=4)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "**Note: You may also want to try Google's [structured data testing tool](http://www.google.com/webmasters/tools/richsnippets) to extract semantic markup from a webpage**"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "**Note: You can use bash cell magic as shown below to invoke FuXi on the [sample data file](files/resources/ch08-semanticweb/chuck-norris.n3) introduced at the end of the chapter as follows:**"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "%%bash\n",
      "\n",
      "FuXi --rules=resources/ch08-semanticweb/chuck-norris.n3 --ruleFacts --naive"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
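     {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
       "**Note: As an optional aside (not one of the book's numbered examples), you can also load the same N3 file with [rdflib](https://github.com/RDFLib/rdflib), the RDF library that FuXi builds on, to inspect its raw triples before any inferencing is applied. This sketch assumes rdflib is installed (`pip install rdflib`):**"
      ]
     },
     {
      "cell_type": "code",
      "collapsed": false,
      "input": [
       "import rdflib # pip install rdflib\n",
       "\n",
       "# Parse the sample N3 file into an in-memory graph\n",
       "\n",
       "g = rdflib.Graph()\n",
       "g.parse('resources/ch08-semanticweb/chuck-norris.n3', format='n3')\n",
       "\n",
       "print 'Parsed', len(g), 'triples'\n",
       "\n",
       "# Print each (subject, predicate, object) triple in the graph\n",
       "\n",
       "for (s, p, o) in g:\n",
       "    print s, p, o"
      ],
      "language": "python",
      "metadata": {},
      "outputs": []
     },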
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "**You can explore other options for FuXi by invoking its --help command**"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "%%bash\n",
      "\n",
      "FuXi --help"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    }
   ],
   "metadata": {}
  }
 ]
}