{ "metadata": { "name": "Collecting web networks with scraping" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Collecting web networks with scraping\n", "\n", "Alex Hanna, University of Wisconsin-Madison
\n", "[alex-hanna.com](http://alex-hanna.com)
\n", "[@alexhanna](http://twitter.com/alexhanna)\n", "\n", "Yesterday we focused on collecting data via APIs (application program interfaces). While APIs are great, sometimes they don't give us the data which we want or need. We're bound by what they give us and somewhat limited in that sense. \n", "\n", "As an alternative, one way to get Internet data is to scrape the websites themselves for connections. Say we want to get a sense of how several political candidates are connected to each other, or to understand how particular online communities organize through their websites. Those are some examples in which scrapers can thrive." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The intuition behind scraping\n", "\n", "The idea behind scraping is that we're looking at all the links on a webpage and listing each of the links as a connection. We are then looking at all the links on the pages linked from the original page. And so on and so on. In computer science, this is known as a [*breadth first search*](http://en.wikipedia.org/wiki/Breadth-first_search).\n", "\n", "So, for instance, on my blog [Bad Hessian](badhessian.org), there are a set of links that can be traversed.\n", "\n", "\n", "\n", "\n", "\n", "This is a post drawn from [this post](http://badhessian.org/2013/07/institutionalizing-computational-social-scienc). \n", "\n", "So the list of pages that we may traverse may look like this list when we visit Bad Hessian first:\n", "\n", "$G = {B_1, B_2, B_3... B_N}$\n", "\n", "where $B_i$ is a link on the Bad Hessian article. Then we add on, say, the original [OrgTheory article](http://orgtheory.wordpress.com/2013/07/03/institutionalizing-computational-social-science/) which is the link $B_1$, and then the list looks like this:\n", "\n", "$G = {B_2, B_3... B_N, O_1, O_2, O_3... O_N}$\n", "\n", "where $O_i$ is a link on the OrgTheory article. Note that $B_1$ has been removed from the list because we've already visited the link.\n", "\n", "Blogs tend to link to many different things. So this blog post links to different types of pages -- to other blogs, to social media, to academic citations and journal articles, and to itself. The network map might look something like this:\n", "\n", "\n", "\n", "If we are looking into a particular kind of phenomenon, like a blogging community, maybe we want to restrict our links to those that link to other blogs. How do we do that?\n", "\n", "Luckily, blogs tend to shared the same sorts of URLs. Blogspot, Wordpress, and Livejournal all have similar URL structures. For this exercise, we're going to focus on collecting network data on blogging networks." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Starting up\n", "\n", "Like yesterday, you first need to connect to the Amazon EC2 server. The hostname is ec2-54-225-7-147.compute-1.amazonaws.com. If you are in Windows you need to log in with PuTTy, and if you are using Mac, you would be logging on with the Terminal.\n", "\n", "Once you have logged in, you need to grab the GitHub repository that contains the files we're working from today. 
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Starting up\n", "\n", "Like yesterday, you first need to connect to the Amazon EC2 server. The hostname is ec2-54-225-7-147.compute-1.amazonaws.com. If you are on Windows, log in with PuTTY; if you are on a Mac, log in with the Terminal.\n", "\n", "Once you have logged in, you need to grab the GitHub repository that contains the files we're working with today. Run this command:\n", "\n", "    [hse0@ip-10-196-55-224 ~]$ git clone https://github.com/raynach/hse-scraping\n", "\n", "If you're successful you should see this:\n", "\n", "    Cloning into 'hse-scraping'...\n", "    remote: Counting objects: 44, done.\n", "    remote: Compressing objects: 100% (37/37), done.\n", "    remote: Total 44 (delta 11), reused 35 (delta 5)\n", "    Unpacking objects: 100% (44/44), done.\n", "\n", "This sets up the basic framework for the scrapy Python package; we won't get much into the complicated internals of the package. Once you've done that, change into the hse-scraping/blogcrawler/ directory.\n", "\n", "    [hse0@ip-10-196-55-224 ~]$ cd hse-scraping/blogcrawler/\n", "    [hse0@ip-10-196-55-224 blogcrawler]$ \n", "\n", "Now that we're there, the main file I want to focus on is blogcrawler/spiders/blogspider.py:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "#!/usr/bin/python\n", "\n", "from scrapy.contrib.spiders import CrawlSpider, Rule\n", "from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor\n", "from blogcrawler.items import BlogcrawlerItem\n", "from urlparse import urlparse\n", "\n", "class BlogSpider(CrawlSpider):\n", "    name = \"blogspider\"\n", "    allowed_domains = [\n", "        \"wordpress.org\",\n", "        \"blogspot.com\",\n", "        \"blogger.com\",\n", "        \"livejournal.com\",\n", "        \"typepad.com\",\n", "        \"tumblr.com\"]\n", "\n", "    start_urls = [\"http://badhessian.org\"]\n", "\n", "    rules = (\n", "        Rule(SgmlLinkExtractor(\n", "            allow=('/', ),\n", "            deny=('www\\\\.blogger\\\\.com', 'profile\\\\.typepad\\\\.com',\n", "                  'http:\\\\/\\\\/wordpress\\\\.com', '.+\\\\.trac\\\\.wordpress\\\\.org',\n", "                  '.+\\\\.wordpress\\\\.org', 'wordpress\\\\.org', 'www\\\\.tumblr\\\\.com',\n", "                  'en\\\\..+\\\\.wordpress\\\\.com', 'vip\\\\.wordpress\\\\.com'),\n", "            ), callback = \"parse_item\", follow = True),\n", "    )\n", "\n", "    def parse_item(self, response):\n", "        item = BlogcrawlerItem()\n", "\n", "        item['url1'] = urlparse(response.request.headers.get('Referer'))[1]\n", "        item['url2'] = urlparse(response.url)[1]\n", "\n", "        yield item" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The intuition behind this file is in the allowed_domains list and the parse_item function. We are only trying to get blogs, so there is a list of blogging services from which we will choose to gather links. The downside is that we don't get blogs that are not hosted on one of these services.\n", "\n", "Next, look at the parse_item function." ] }, { "cell_type": "code", "collapsed": false, "input": [ "    def parse_item(self, response):\n", "        item = BlogcrawlerItem()\n", "\n", "        item['url1'] = urlparse(response.request.headers.get('Referer'))[1]\n", "        item['url2'] = urlparse(response.url)[1]\n", "\n", "        yield item" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This function is called every time the crawler visits a page. The \"item\" here can be thought of as a network edge: url1, the domain of the referring page, is the source node, while url2, the domain of the page being visited, is the destination node.\n", "\n", "## Running scrapy\n", "\n", "To actually run scrapy, type the following:\n", "\n", "    [hse0@ip-10-196-55-224 blogcrawler]$ scrapy crawl blogspider -o output.csv -t csv\n", "\n", "This writes the data from the crawler into a CSV file that represents an edgelist. You'll see a lot of log output scroll by while this is happening.\n", "\n", "The crawl could go on indefinitely. To stop the process, press Ctrl + C; it should close itself, though give it a few seconds to do so.\n", "\n", "If you want to suppress all the log messages that come along with it, use this command instead:\n", "\n", "    [hse0@ip-10-196-55-224 blogcrawler]$ scrapy crawl blogspider -o output.csv -t csv --nolog\n", "\n", "You'll get output that looks like this:\n", "\n", "    [hse0@ip-10-196-55-224 blogcrawler]$ more output.csv \n", "    url1,url2\n", "    badhessian.org,orgtheory.wordpress.com\n", "    badhessian.org,scatter.wordpress.com\n", "    badhessian.org,orgtheory.wordpress.com\n", "    badhessian.org,orgtheory.wordpress.com\n", "    badhessian.org,scatter.wordpress.com\n", "    badhessian.org,permut.wordpress.com\n", "    badhessian.org,mobilizingideas.wordpress.com\n", "    badhessian.org,asecondmouse.wordpress.com\n", "    badhessian.org,codeandculture.wordpress.com\n", "    badhessian.org,dartthrowingchimp.wordpress.com\n", "    badhessian.org,exploringpossibilityspace.blogspot.com\n", "    orgtheory.wordpress.com,orgtheory.wordpress.com\n", "    orgtheory.wordpress.com,orgtheory.wordpress.com\n", "    orgtheory.wordpress.com,orgtheory.wordpress.com\n", "    orgtheory.wordpress.com,orgtheory.wordpress.com\n", "    ...\n", "\n", "If you look at the Blog Roll on the side of the Bad Hessian page, you'll notice that this closely matches the list of blogs there. You'll also note that many rows have the same blog in both columns -- that means the blog is linking to itself. As in the diagram above, those nodes have self-loops." ] },
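{ "cell_type": "markdown", "metadata": {}, "source": [ "Once you have an edgelist like this, you can pull it into a network analysis package. Below is a minimal sketch of doing that with [networkx](http://networkx.github.io/). Note that networkx is an assumption here -- it may not be installed on the server -- and reading the CSV this way is just one option; you could equally well load output.csv into R or Gephi." ] }, { "cell_type": "code", "collapsed": false, "input": [ "# A minimal sketch: read the scraped edgelist into a directed graph.\n", "# Assumes output.csv exists in the current directory and that the\n", "# networkx package is installed -- neither is guaranteed.\n", "import csv\n", "import networkx as nx\n", "\n", "graph = nx.DiGraph()\n", "with open('output.csv') as f:\n", "    reader = csv.DictReader(f)  # columns are url1, url2\n", "    for row in reader:\n", "        graph.add_edge(row['url1'], row['url2'])\n", "\n", "print graph.number_of_nodes(), \"blogs\"\n", "print graph.number_of_edges(), \"unique links\"\n", "print sum(1 for u, v in graph.edges() if u == v), \"self-loops\"" ], "language": "python", "metadata": {}, "outputs": [] },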
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Additional materials\n", "\n", "There are a lot more options available in scrapy. If you want to learn more about scrapy and the various types of tasks you can accomplish with it, you can check out the [documentation](http://scrapy.org/doc/)." ] } ], "metadata": {} } ] }