{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# ScrapyDo Overview" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[ScrapyDo](https://github.com/darkrho/scrapydo) is a [crochet](https://github.com/itamarst/crochet)-based blocking API for [Scrapy](http://scrapy.org). It lets you use Scrapy as a library and is mainly aimed at spider prototyping and data exploration in [IPython notebooks](http://ipython.org/notebook.html)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This notebook shows how to use `scrapydo` and how it helps you crawl and explore data quickly. The main premise is that we crawl the internet as a means to analyze data, not as an end in itself." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Initialization" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `setup` function must be called before any other `scrapydo` function." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import scrapydo\n", "scrapydo.setup()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The `fetch` function and highlight helper" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `fetch` function returns a `scrapy.http.Response` object for a given URL." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "<200 http://httpbin.org/get?show_env=1>" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "response = scrapydo.fetch(\"http://httpbin.org/get?show_env=1\")\n", "response" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `highlight` function is a helper that syntax-highlights text content using the [pygments](http://pygments.org) module. It is very useful for inspecting text content." 
] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "
\n", "{\n", " "args": {\n", " "show_env": "1"\n", " }, \n", " "headers": {\n", " "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", \n", " "Accept-Encoding": "gzip,deflate", \n", " "Accept-Language": "en", \n", " "Host": "httpbin.org", \n", " "Runscope-Service": "httpbin", \n", " "User-Agent": "Scrapy/1.0.1 (+http://scrapy.org)", \n", " "X-Forwarded-For": "181.114.87.105", \n", " "X-Real-Ip": "181.114.87.105"\n", " }, \n", " "origin": "181.114.87.105", \n", " "url": "http://httpbin.org/get?show_env=1"\n", "}\n", "
<!DOCTYPE html>\n", "<html>\n", "<head>\n", " <meta http-equiv='content-type' value='text/html;charset=utf8'>\n", " <meta name='generator' value='Ronn/v0.7.3 (http://github.com/rtomayko/ronn/tree/0.7.3)'>\n", " <title>httpbin(1): HTTP Client Testing Service</title>\n", " <style type='text/css' media='all'>\n", " /* style: man */\n", "
[u'<p>Freely hosted in <a href="http://httpbin.org">HTTP</a>, <a href="https://httpbin.org">HTTPS</a> & <a href="http://eu.httpbin.org/">EU</a> flavors by <a href="https://www.runscope.com/">Runscope</a></p>',\n", " u'<p>Testing an HTTP Library can become difficult sometimes. <a href="http://requestb.in">RequestBin</a> is fantastic for testing POST requests, but doesn\\'t let you control the response. This exists to cover all kinds of HTTP scenarios. Additional endpoints are being considered.</p>',\n", " u'<p>All endpoint responses are JSON-encoded.</p>',\n", " u'<p>You can install httpbin as a library from PyPI and run it as a WSGI app. For example, using Gunicorn:</p>',\n", " u'<p>A <a href="https://www.runscope.com/community">Runscope Community Project</a>.</p>',\n", " u'<p>Originally created by <a href="http://kennethreitz.com/">Kenneth Reitz</a>.</p>',\n", " u'<p><a href="https://hurl.it">Hurl.it</a> - Make HTTP requests.</p>',\n", " u'<p><a href="http://requestb.in">RequestBin</a> - Inspect HTTP requests.</p>',\n", " u'<p><a href="http://python-requests.org" data-bare-link="true">http://python-requests.org</a></p>']\n", "
{'Access-Control-Allow-Credentials': ['true'],\n", " 'Access-Control-Allow-Origin': ['*'],\n", " 'Content-Type': ['text/html; charset=utf-8'],\n", " 'Date': ['Mon, 27 Jul 2015 04:27:22 GMT'],\n", " 'Server': ['nginx']}\n", "
|   | title | length |
|---|---|---|
| 0 | EuroPython 2015 on | 18 |
| 1 | StartupChats Remote Working Q&A on | 34 |
| 2 | PyCon Philippines 2015 on | 25 |
| 3 | Why MongoDB Is a Bad Choice for Storing Our Sc... | 59 |
| 4 | Introducing Crawlera, a Smart Page Downloader on | 48 |
[{'crawled': datetime.datetime(2015, 7, 27, 4, 27, 55, 80723),\n", " 'description': u'- A remote debugger and IDE that can also be used for local debugging.',\n", " 'name': u'Hap Python Remote Debugger',\n", " 'spider': 'dmoz',\n", " 'url': u'http://hapdebugger.sourceforge.net/'},\n", " {'crawled': datetime.datetime(2015, 7, 27, 4, 27, 55, 86720),\n", " 'description': u'- An enhanced interactive Python shell with many features for object introspection, system shell access, and its own special command system for adding functionality when working interactively. [Open Source, LGPL]',\n", " 'name': u'IPython',\n", " 'spider': 'dmoz',\n", " 'url': u'http://ipython.scipy.org/'},\n", " {'crawled': datetime.datetime(2015, 7, 27, 4, 27, 55, 87918),\n", " 'description': u'- An interactive, graphical Python shell written in Python using wxPython.',\n", " 'name': u'PyCrust - The Flakiest Python Shell',\n", " 'spider': 'dmoz',\n", " 'url': u'http://sourceforge.net/projects/pycrust/'}]\n", "