{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Easy Scraping\n", "\n", "* Author: [Pili Hu](http://hupili.net/)\n", "* Repo: [Easy Scraping in Python](https://github.com/hupili/workshop-easy-scraping)\n", "* Demo: scrapely, python-readability, pyQuery, httpie\n", "\n", "Prerequisites:\n", "\n", "* Python3\n", "* `pip install -r reuiqrements.txt`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Useful trick in IPython notebook" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import pprint\n", "from IPython.core.display import HTML" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "Logo of Initium Lab: " ], "text/plain": [ "" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "HTML('Logo of Initium Lab: ' % 'http://initiumlab.com/favicon-32x32.png')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A small hack to allow longer output area" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "data": { "application/javascript": [ "//IPython.OutputArea.auto_scroll_threshold = 9999;\n", "IPython.OutputArea.prototype._should_scroll = function(){return false;}" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "%%javascript\n", "//IPython.OutputArea.auto_scroll_threshold = 9999;\n", "IPython.OutputArea.prototype._should_scroll = function(){return false;}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Readability\n", "\n", "We use a version ported to Python3:\n", "\n", "(already included in the `reuqirements.txt` file)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from readability.readability import Document\n", "import requests\n", "html = requests.get('http://initiumlab.com/').content\n", "readable_article = Document(html).summary()\n", "readable_title = Document(html).short_title()" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "
\n", "\n", " \n", " \n", "\n", " \n", " \n", " \n", "\n", "

The video is also available on YouTube and Youku.

\n", "

What did we do?

Jackathon is short for “Journalism-Hackathon”. At Initium Lab, we aim to push limits of Journalism with Technology. We hold regular Jackathons to advance our knowledge and skills in using new technology for obtaining data, analysing information, and reporting.

\n", "

This week, the goal for each participant is to read one the the 60 Data Science books collected by KDnuggets within 8 hours.
Participants could pick one or two books to finish reading in 8 hours and present findings / insights to the others.

\n", " \n", " \n", " \n", "
\n", "\n", "
\n" ] } ], "source": [ "print(readable_article)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", "\n", " \n", " \n", " \n", "\n", "

The video is also available on YouTube and Youku.

\n", "

What did we do?

Jackathon is short for “Journalism-Hackathon”. At Initium Lab, we aim to push limits of Journalism with Technology. We hold regular Jackathons to advance our knowledge and skills in using new technology for obtaining data, analysing information, and reporting.

\n", "

This week, the goal for each participant is to read one the the 60 Data Science books collected by KDnuggets within 8 hours.
Participants could pick one or two books to finish reading in 8 hours and present findings / insights to the others.

\n", " \n", " \n", " \n", "
\n", "\n", "
" ], "text/plain": [ "" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "HTML(readable_article)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## PyQuery\n", "\n", "Let's fix the above URL problems" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[

,

,

]" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pyquery\n", "r = pyquery.PyQuery(readable_article)\n", "r('p')" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "'./blog/20150922-jackathon3-review/jackathon3-timelapse-poster.png'" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "r('video').attr('poster')" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "'./blog/20150922-jackathon3-review/jackathon3-timelapse.mp4'" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "r('video source').attr('src')" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[

\\n\\n \\n \\n\\n \\n \\n \\n\\n

The video is also available on YouTube and Youku.

\\n

What did we do?

Jackathon is short for “Journalism-Hackathon”. At Initium Lab, we aim to push limits of Journalism with Technology. We hold regular Jackathons to advance our knowledge and skills in using new technology for obtaining data, analysing information, and reporting.

\\n

This week, the goal for each participant is to read one the the 60 Data Science books collected by KDnuggets within 8 hours.
Participants could pick one or two books to finish reading in 8 hours and present findings / insights to the others.

\\n \\n \\n \\n
\\n\\n
'" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "r.html()" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": false }, "outputs": [ { "data": { "application/javascript": [ "//IPython.OutputArea.auto_scroll_threshold = 9999;\n", "IPython.OutputArea.prototype._should_scroll = function(){return false;}" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "%%javascript\n", "//IPython.OutputArea.auto_scroll_threshold = 9999;\n", "IPython.OutputArea.prototype._should_scroll = function(){return false;}" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", "\n", " \n", " \n", " \n", "\n", "

The video is also available on YouTube and Youku.

\n", "

What did we do?

Jackathon is short for “Journalism-Hackathon”. At Initium Lab, we aim to push limits of Journalism with Technology. We hold regular Jackathons to advance our knowledge and skills in using new technology for obtaining data, analysing information, and reporting.

\n", "

This week, the goal for each participant is to read one the the 60 Data Science books collected by KDnuggets within 8 hours.
Participants could pick one or two books to finish reading in 8 hours and present findings / insights to the others.

\n", " \n", " \n", " \n", "
\n", "\n", "
" ], "text/plain": [ "" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "HTML(r.html())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Scrapely" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from scrapely import Scraper\n", "s = Scraper()" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Help on method train in module scrapely:\n", "\n", "train(url, data, encoding=None) method of scrapely.Scraper instance\n", "\n" ] } ], "source": [ "help(s.train)" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from urllib import parse\n", "def get_localhost_url(url):\n", " filename = parse.quote_plus(url)\n", " fullpath = 'tmp/%s' % filename\n", " html = requests.get(url).content\n", " open(fullpath, 'wb').write(html)\n", " return 'http://localhost:8888/files/%s?download=1' % parse.quote_plus(fullpath)" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": false }, "outputs": [], "source": [ "training_url = 'http://initiumlab.com/blog/20150916-legco-eng/'\n", "training_data = {'title': 'Legco Matrix Brief (English)', \n", " 'author': 'Initium Lab', \n", " 'date': '2015-09-16'}\n", "s.train(get_localhost_url(training_url), training_data)" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[{'author': ['Andy Shu'],\n", " 'date': ['\\n 2015-09-01\\n '],\n", " 'title': ['\\n \\n \\n \\n 可視化火了 盲人怎麼辦\\n \\n \\n ']}]" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "testing_url = 'http://initiumlab.com/blog/20150901-data-journalism-for-the-blind/'\n", "s.scrape(get_localhost_url(testing_url))" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[{'author': ['Initium Lab'],\n", " 'date': ['\\n 2015-09-22\\n '],\n", " 'title': ['\\n \\n \\n \\n Jackathon #3 -- Read a data science book in 8 hours\\n \\n \\n ']}]" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "testing_url = 'http://initiumlab.com/blog/20150922-jackathon3-review/'\n", "s.scrape(get_localhost_url(testing_url))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## HTTPie & pQuery\n", "\n", "* Demo repo: https://github.com/hupili/60-data-science-book-visualisation\n", "* HTTPie: https://github.com/jkbrzt/httpie\n", "* pquery: https://github.com/hupili/pquery (CLI wrapper of PyQuery)" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Easy Scraping.html\r\n", "Easy Scraping.ipynb\r\n", "README.md\r\n", "requirements.txt\r\n", "tmp\r\n", "venv\r\n" ] } ], "source": [ "!ls -1" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "collapsed": false }, "outputs": [], "source": [ "a = !ls -1" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "['Easy Scraping.html',\n", " 'Easy Scraping.ipynb',\n", " 'README.md',\n", " 'requirements.txt',\n", " 'tmp',\n", " 'venv']" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{\n", " \"args\": {\n", " \"at\": \"Hardcore scraping workshop!\", \n", " \"name\": \"hupili\"\n", " }, \n", " \"headers\": {\n", " \"Accept\": \"*/*\", \n", " \"Accept-Encoding\": \"gzip, deflate\", \n", " \"Host\": \"httpbin.org\", \n", " \"User-Agent\": \"HTTPie/0.9.2\"\n", " }, \n", " \"origin\": \"118.140.67.2\", \n", " \"url\": \"http://httpbin.org/get?at=Hardcore+scraping+workshop!&name=hupili\"\n", "}\n" ] } ], "source": [ "%%sh\n", "http get 'http://httpbin.org/get' name==hupili at=='Hardcore scraping workshop!'" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{\n", " \"args\": {\n", " \"name\": \"hupili\"\n", " }, \n", " \"headers\": {\n", " \"Accept\": \"*/*\", \n", " \"Accept-Encoding\": \"gzip, deflate\", \n", " \"Host\": \"httpbin.org\", \n", " \"User-Agent\": \"Arbitrarily name your user agent!\"\n", " }, \n", " \"origin\": \"118.140.67.2\", \n", " \"url\": \"http://httpbin.org/get?name=hupili\"\n", "}\n" ] } ], "source": [ "%%sh\n", "http get 'http://httpbin.org/get' name==hupili 'User-Agent: Arbitrarily name your user agent!'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "HTTPie request construction. From `http --help`\n", "\n", "```\n", " ':' HTTP headers:\n", " Referer:http://httpie.org Cookie:foo=bar User-Agent:bacon/1.0\n", " \n", " '==' URL parameters to be appended to the request URI:\n", " search==httpie\n", " \n", " '=' Data fields to be serialized into a JSON object (with --json, -j)\n", " or form data (with --form, -f):\n", " name=HTTPie language=Python description='CLI HTTP client'\n", " \n", " ':=' Non-string JSON data fields (only with --json, -j):\n", " awesome:=true amount:=42 colors:='[\"red\", \"green\", \"blue\"]'\n", " \n", " '@' Form file fields (only with --form, -f):\n", " cs@~/Documents/CV.pdf\n", " \n", " '=@' A data field like '=', but takes a file path and embeds its content:\n", " essay=@Documents/essay.txt\n", " \n", " ':=@' A raw JSON field like ':=', but takes a file path and embeds its content:\n", " package:=@./package.json\n", " \n", " You can use a backslash to escape a colliding separator in the field name:\n", " field-name-with\\:colon=value\n", "```" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n", "\n", "\n", "\n" ] } ], "source": [ "%%sh\n", "http --body 'http://www.kdnuggets.com/2015/09/free-data-science-books.html' | head -n 5" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "An Introduction to Data Science\n", "School of Data Handbook\n", "Data Jujitsu: The Art of Turning Data into Product\n", "The Data Science Handbook\n", "The Data Analytics Handbook\n" ] } ], "source": [ "%%sh\n", "http --body 'http://www.kdnuggets.com/2015/09/free-data-science-books.html' |\\\n", "pquery '.three_ul li strong a' -p text |\\\n", "head -n 5" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "https://docs.google.com/file/d/0B6iefdnF22XQeVZDSkxjZ0Z5VUE/edit?pli=1\n", "http://schoolofdata.org/handbook/\n", "http://www.oreilly.com/data/free/data-jujitsu.csp\n", "http://www.thedatasciencehandbook.com/#get-the-book\n", "https://www.teamleada.com/handbook\n" ] } ], "source": [ "%%sh\n", "http --body 'http://www.kdnuggets.com/2015/09/free-data-science-books.html' |\\\n", "pquery '.three_ul li strong a' -p href |\\\n", "head -n 5" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\"An Introduction to Data Science\",https://docs.google.com/file/d/0B6iefdnF22XQeVZDSkxjZ0Z5VUE/edit?pli=1\n", "\"School of Data Handbook\",http://schoolofdata.org/handbook/\n", "\"Data Jujitsu: The Art of Turning Data into Product\",http://www.oreilly.com/data/free/data-jujitsu.csp\n", "\"The Data Science Handbook\",http://www.thedatasciencehandbook.com/#get-the-book\n", "\"The Data Analytics Handbook\",https://www.teamleada.com/handbook\n" ] } ], "source": [ "%%sh\n", "http --body 'http://www.kdnuggets.com/2015/09/free-data-science-books.html' |\\\n", "pquery '.three_ul li strong a' -f '\"{text}\",{href}' |\\\n", "head -n 5" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.4.3" } }, "nbformat": 4, "nbformat_minor": 0 }