{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## 05 Scraping data with Requests and Beautiful Soup\n", "To now, we've covered means of grabbing data that are formatted to grab. The term 'web scraping' refers to the messier means of pulling material from web sites that were really meant for people, not for computers. Web sites, of course, can include a variety of objects: text, images, video, flash, etc., and your success at scraping what you want will vary. In other words, scraping involves a bit of [MacGyvering](https://en.wiktionary.org/wiki/MacGyver). \n", "\n", "Useful packages for scraping are `requests` and `bs4`/`BeautifulSoup`, which code is included to install these below.\n", "\n", "We'll run through a few quick examples, but for more on this topic, I recommend:\n", "* http://www.pythonforbeginners.com/python-on-the-web/beautifulsoup-4-python/\n", "* http://web.stanford.edu/~zlotnick/TextAsData/Web_Scraping_with_Beautiful_Soup.html\n", "* http://stanford.edu/~mgorkove/cgi-bin/rpython_tutorials/webscraping_with_lxml.php" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Import the requests and beautiful soup\n", "import requests\n", "try:\n", " from bs4 import BeautifulSoup\n", "except:\n", " !pip install beautifulsoup4\n", " from bs4 import BeautifulSoup" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Import re, a package for using regular expressions\n", "import re" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `requests` package works a lot like the urllib package in that it sends a request to a server and stores the servers response in a variable, here named `response`. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Send a request to a web page\n", "response = requests.get('https://xkcd.com/869')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# The response object simply has the contents of the web page at the address provided\n", "print(response.text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`BeautifulSoup` is designed to intelligently read raw HTML code, i.e., what is stored in the `response` variable generated above. The command below reads in the raw HTML and parses it into logical components that we can command. \n", "\n", "The `lxml` in the command specifies a particular parser for deconstructing the HTML... " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# BeautifulSoup \n", "soup = BeautifulSoup(response.text, 'lxml')\n", "type(soup)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we search the text of the web page's body for any instances of `https://....png`, that is any link to a PNG image embedded in the page. This is done using `re` and implementing regular expressions (see https://developers.google.com/edu/python/regular-expressions for more info on this useful module...) \n", "\n", "The `match` object returned by `search(`) holds information about the nature of the match, including the original input string, the regular expression used, and the location within the original string where the pattern occurs. The `group` property of the match is the full string that's returned" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Search the page for emebedded links to PNG files\n", "match = re.search('https://.*\\.png', soup.body.text)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#What was found in the search\n", "print(match.group())" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#And here is some Juptyer code to display the picture resulting from it\n", "from IPython.display import Image\n", "Image(url=match.group())" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.13" } }, "nbformat": 4, "nbformat_minor": 2 }