{ "metadata": { "celltoolbar": "Slideshow", "name": "", "signature": "sha256:071596ad2dfe26bcb054ffbce55702d3c14627e9a3d90ac47e89600822f15089" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "code", "collapsed": false, "input": [ "## all imports\n", "from IPython.display import HTML\n", "import numpy as np\n", "import urllib2\n", "import bs4 #this is beautiful soup\n", "\n", "from pandas import Series\n", "import pandas as pd\n", "from pandas import DataFrame\n", "\n", "import matplotlib\n", "import matplotlib.pyplot as plt\n", "%matplotlib inline\n", "\n", "import seaborn as sns\n", "sns.set_context(\"talk\")\n", "sns.set_style(\"white\")" ], "language": "python", "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "prompt_number": 4 }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "CS109\n", "=====\n", "\n", "Verena Kaynig-Fittkau\n", "\n", "* vkaynig@seas.harvard.edu\n", "* staff@cs109.org" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Announcements\n", "==============\n", "\n", "* Nice page to promote your projects\n", "\n", "http://sites.fas.harvard.edu/~huit-apps/archive/index.html" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Announcements\n", "==============\n", "\n", "* HW1 solutions are online\n", "* Awesome TFs are trying to get the grading done this week\n", "* be proactive:\n", " - start early\n", " - use office hours and Piazza\n", " - if resources are not enough: let us know!" 
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Announcements\n", "==============\n", "\n", "* homework submission format\n", " - create a folder lastname_firstinitial_hw# \n", " - place the notebook and any others files into that folder\n", " - notebooks should be executed\n", " - compress the folder\n", " - submit to iSites" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Todays lecture:\n", "===============\n", "\n", "* all about data scraping\n", "* ***What is it? ***\n", "* How to do it:\n", " - from a website\n", " - with an API\n", "* Plus: Some more SVD!\n", "\n", "Answer: Data scraping is about obtaining data from webpages. There is low level scraping where you parse the data out of the html code of the webpage. There also is scraping over APIs from websites who try to make your life a bit easier.\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "API registrations\n", "=================\n", "\n", "* Rotten Tomatoes\n", "\n", "http://developer.rottentomatoes.com/member/register\n", "\n", "* Twitter\n", "\n", "https://apps.twitter.com/app/new\n", "\n", "* Twitter instructions\n", "\n", "https://twittercommunity.com/t/how-to-get-my-api-key/7033" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Python data scraping\n", "====================\n", "\n", "* Why scrape the web?\n", " - vast source of information\n", " - automate tasks\n", " - keep up with sites\n", " - fun!\n", "\n", "*** Can you think of examples ? ***\n", " \n", "Answer: Some examples we had were stock market monitoring, sports data, or airline prices." 
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Read and Tweet!\n", "=================\n", "\n", "![ReadTweet](http://developer.nytimes.com/files/readtweet.jpg\n", " \"We read we tweet\")\n", "\n", "* by Justin Blinder\n", "* http://projects.justinblinder.com/We-Read-We-Tweet" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Twitter Sentiments\n", "=================\n", "\n", "![TwitterSentiments](http://www.csc.ncsu.edu/faculty/healey/tweet_viz/figs/tweet-viz-ex.png\n", " \"Twitter Sentiments\")\n", "\n", "* by Healey and Ramaswamy\n", "* http://www.csc.ncsu.edu/faculty/healey/tweet_viz/tweet_app/" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "L.A. Happy Hours\n", "===============\n", "\n", "* http://www.downtownla.com/3_10_happyHours.asp?action=ALL\n", "\n", "* by Katharine Jarmul" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Python data scraping\n", "====================\n", "\n", "* copyrights and permission:\n", " - be careful and polite\n", " - give credit\n", " - care about media law\n", " - don't be evil (no spam, overloading sites, etc.)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Robots.txt\n", "==========\n", "\n", "![Robots.txt](images/robots_txt.jpg \"Robots.txt\")" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Robots.txt\n", "==========\n", "\n", "* specified by web site owner\n", "* gives instructions to web robots (aka your script)\n", "* is located at the top-level directory of the web server\n", "\n", "http://www.example.com/robots.txt\n", "\n", "If you want you can also have a look at\n", "\n", "http://google.com/robots.txt" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ 
"Robots.txt\n", "==========\n", "\n", "*** What does this one do? ***\n", "\n", "Answer: This file allows google to search through everything on the server, while all others should stay completely away." ] }, { "cell_type": "raw", "metadata": {}, "source": [ "\n", "User-agent: Google\n", "Disallow:\n", "\n", "User-agent: *\n", "Disallow: /" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Things to consider:\n", "-------------------\n", "\n", "* can be just ignored\n", "* can be a security risk - *** Why? ***\n", "\n", "Answer: You are basically telling everybody who cares to look into the file where you have stored sensitive information.\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Scraping with Python:\n", "=====================\n", "\n", "* scraping is all about HTML tags\n", "* bad news: \n", " - need to learn about tags\n", " - websites can be ugly" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "HTML\n", "=====\n", "\n", "* HyperText Markup Language\n", "\n", "* standard for creating webpages\n", "\n", "* HTML tags \n", " - have angle brackets\n", " - typically come in pairs" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is an example for a minimal webpage defined in HTML tags. The root tag is `` and then you have the `
` tag. This part of the page typically includes the title of the page and might also have other meta information like the author or keywords that are important for search engines. The `<body>` tag marks the actual content of the page. You can play around with the example in the next cell to get a feeling for the individual tags." ] }, { "cell_type": "code", "collapsed": false, "input": [ "s = \"\"\"<html>\n", "  <head>\n", "  </head>\n", "  <body>\n", "    <h3> Test </h3>\n", "    Hello world!\n", "  </body>\n", "</html>\n", "\"\"\"\n", "\n", "h = HTML(s)\n", "h" ], "language": "python", "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "html": [ "<html>\n", "  <head>\n", "  </head>\n", "  <body>\n", "    <h3> Test </h3>\n", "    Hello world!\n", "  </body>\n", "</html>" ], "metadata": {}, "output_type": "pyout", "prompt_number": 5, "text": [ "<html>\n", "  <head>\n", "  </head>\n", "  <body>\n", "    <h3> Test </h3>\n", "    Hello world!\n", "  </body>\n", "</html>" ] } ], "prompt_number": 5 }, { "cell_type": "code", "collapsed": false, "input": [ "## same page as before\n", "s = \"\"\"<html>\n", "  <head>\n", "  </head>\n", "  <body>\n", "    <h3> Test </h3>\n", "    Hello world!\n", "  </body>\n", "</html>\n", "
\"\"\"\n", "## get bs4 object\n", "tree = bs4.BeautifulSoup(s)\n", "\n", "## get html root node\n", "root_node = tree.html\n", "\n", "## get head from root using contents\n", "head = root_node.contents[0]\n", "\n", "## get body from root\n", "body = root_node.contents[1]\n", "\n", "## could directly access body\n", "tree.body" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 67, "text": [ "Hello world!
" ] } ], "prompt_number": 67 }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Quiz:\n", "=====\n", "\n", "* Find the `h3` tag by parsing the tree starting at `body`\n", "* Create a list of all __Hall of Fame__ entries listed on the Beautiful Soup webpage\n", " - hint: it is the only unordered list in the page (tag `ul`)\n" ] }, { "cell_type": "code", "collapsed": false, "input": [ "## get h3 tag from body\n", "body.contents[0]" ], "language": "python", "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 68, "text": [ "\n", " | critics | \n", "audience | \n", "
---|---|---|
Captain America: The Winter Soldier | \n", "89 | \n", "93 | \n", "
The Amazing Spider-Man 2 | \n", "53 | \n", "68 | \n", "
Godzilla | \n", "73 | \n", "69 | \n", "
\n", " | created_at | \n", "favorite_count | \n", "favorited | \n", "hashtags | \n", "id | \n", "in_reply_to_screen_name | \n", "in_reply_to_status_id | \n", "in_reply_to_user_id | \n", "lang | \n", "retweet_count | \n", "retweeted | \n", "retweeted_status | \n", "source | \n", "text | \n", "truncated | \n", "urls | \n", "user | \n", "user_mentions | \n", "
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | \n", "Wed Sep 24 13:56:29 +0000 2014 | \n", "7 | \n", "False | \n", "NaN | \n", "514775379941548033 | \n", "rstudio | \n", "NaN | \n", "235261861 | \n", "en | \n", "1 | \n", "False | \n", "NaN | \n", "<a href=\"http://twitter.com\" rel=\"nofollow\">Tw... | \n", "@rstudio spell checker is suggesting I correct... | \n", "False | \n", "NaN | \n", "{u'id': 177729631, u'profile_sidebar_fill_colo... | \n", "[{u'screen_name': u'rstudio', u'id': 235261861... | \n", "
1 | \n", "Mon Sep 22 13:09:38 +0000 2014 | \n", "14 | \n", "False | \n", "[gi2014] | \n", "514038816387387392 | \n", "NaN | \n", "NaN | \n", "NaN | \n", "en | \n", "10 | \n", "False | \n", "NaN | \n", "<a href=\"http://twitter.com\" rel=\"nofollow\">Tw... | \n", "#gi2014 if you want more info on how to deal w... | \n", "False | \n", "{u'http://t.co/YdYAv8DLqo': u'http://genomicsc... | \n", "{u'id': 177729631, u'profile_sidebar_fill_colo... | \n", "NaN | \n", "