{ "metadata": { "name": "" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Scraping headlines" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "![](https://raw.github.com/nealcaren/workshop_2014/master/notebooks/images/upworth_2.png)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "How I scrape a page:\n", "\n", "1. Download the page.\n", "2. Look at the source code of a sample page.\n", "3. Find the thing that you want, and the stuff around that thing.\n", "4. Write a regular expression that matches what you want.\n", "5. Write a regular expression that actually matches what you want.\n", "6. Production!" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import requests" ], "language": "python", "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "prompt_number": 52 }, { "cell_type": "code", "collapsed": false, "input": [ "url = 'http://www.upworthy.com/page/2'" ], "language": "python", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "prompt_number": 53 }, { "cell_type": "code", "collapsed": false, "input": [ "page = requests.get(url)" ], "language": "python", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "prompt_number": 54 }, { "cell_type": "code", "collapsed": false, "input": [ "print(page.headers)" ], "language": "python", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "CaseInsensitiveDict({'status': '200 OK', 'x-request-id': 'a9cf8301-b932-4294-b957-7df4300cc778', 'via': '1.1 varnish, 1.1 varnish', 'x-cache': 'MISS, HIT', 'content-encoding': 'gzip', 'accept-ranges': 'bytes', 'x-timer': 'S1391631280.295958757,VS0,VE0', 'vary': 'Accept-Encoding', 'content-length': '10333', 
'connection': 'keep-alive', 'etag': '\"985228e196db36102f8d3ffb894952b5\"', 'x-cache-hits': '0, 25', 'x-ua-compatible': 'IE=Edge,chrome=1', 'x-served-by': 'cache-v44-ASH, cache-jfk1027-JFK', 'cache-control': 'max-age=5, public', 'date': 'Wed, 05 Feb 2014 20:14:40 GMT', 'content-type': 'text/html; charset=utf-8', 'age': '4750', 'x-runtime': '0.250320'})\n" ] } ], "prompt_number": 55 }, { "cell_type": "code", "collapsed": false, "input": [ "page.headers['status']" ], "language": "python", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 56, "text": [ "'200 OK'" ] } ], "prompt_number": 56 }, { "cell_type": "code", "collapsed": false, "input": [ "page.text" ], "language": "python", "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 57, "text": [ "u'\\n\\n\\n\\n\\n\\n \\n
\\n \\n \\n \\n\\n \\n \\n \\n\\n\\n \\n \\n\\n\\n A special Upworthy series about global health and\\npoverty.
Check it out!\\n
\\n A special Upworthy series about\\n work and the economy.
Check it out!\\n