{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Python 101 \n", "## Part VII.\n", "\n", "---\n", "\n", "## Web scraping\n", "\n", "### Prerequisites\n", "\n", "We'll use the __`requests`__ and the __`BeautfulSoup`__ libraries for web scraping, let's install them:\n", "```bash\n", "pip install -U requests beautifulsoup4\n", "```\n", "\n", "### 0. Easy file sharing\n", "Start your own web-server:\n", "- in command prompt change your directory to the notebook directory\n", "- start the server with the `python -m http.server` command\n", "\n", "### 1. Obtain a webpage\n", "\n", "The easiest way is to use a third party library called __`requests`__. Let's import it right away!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import requests" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And then we simply ask a server to give us an html document by requesting it through an url." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "existing_url = 'http://localhost:8000/data/test.html'\n", "response = requests.get(existing_url)\n", "print(response.status_code) # hopefully 200 -> successful download" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "not_existing_url = 'http://localhost:8000/test1.html'\n", "response = requests.get(not_existing_url)\n", "print(response.status_code) # unfortunately 404 -> not exists" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__Common status codes:__\n", "- 200: success\n", "- 301: permanent redirect\n", "- 303: redirect\n", "- 400: bad request\n", "- 401: unauthorized\n", "- 404: not exists\n", "- 500: internal server error" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "response = requests.get(existing_url)\n", "print(response.content.decode('utf-8'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Jupyter can render the page if it was successfully downloaded." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from IPython.display import HTML\n", "if response.status_code == 200:\n", " result = HTML(response.content.decode('utf-8'))\n", "else:\n", " result = 'Nah, let\\'s have a beer instead!'" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "result" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2. Process HTML\n", "\n", "#### Story time: The skeleton of a html document\n", "\n", "__HTML__ is a markup language, its basic build blocks are the ``s.
\n", "(Almost) every `` has two parts:\n", "\n", "- Opening `` \n", "- Closing `` \n", "\n", "Important html ``s:\n", "\n", "- ``\n", "- ``\n", "- ``\n", "- `

`, ..., `
`\n", "- `
`\n", "- `

`\n", "- ``\n", "- `
`\n", "- ``\n", "- ``\n", "- `
`\n", "- ```\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " ...\n", " \n", " \n", "
\n", " ```\n", "- `
    ` / `
      ` + `
    1. `\n", " \n", "Tags can have different attributes:\n", "- ``: href\n", "- ``: src\n", "- id\n", "- class\n", "- anything that is not a html keyword\n", " \n", "\n", "#### Let's parse it!\n", "\n", "We have a third party module for this purpose as well, the __`BeautifulSoup`__. \n", "Let's import it!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from bs4 import BeautifulSoup" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then create a soup from the downloaded document." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "document = response.content\n", "soup = BeautifulSoup(document, 'html.parser')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(soup.prettify())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With the created soup (which is a parsed document) we can easily access any part of the document. \n", "Let's try to:\n", "- get the title of the document" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(soup.title)\n", "print(type(soup.title))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- get the title text" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(soup.title.get_text())\n", "print(type(soup.title.get_text()))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- get the text-only version of the page" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(soup.get_text())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- get all the links from the document" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "soup.find_all('a')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- get the actual urls from the tags" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for url in soup.find_all('a'):\n", " print(url.get('href'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "During scraping, there are a lot of different tasks that must be solve in order to get the data we need. \n", "In this case this demo document has important and unimportant parts. We only need the important parts. \n", "#### a) Let's find the important links!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "important_urls = []\n", "for url in soup.find_all('a'):\n", " if 'important_part' in url.get('href'):\n", " important_urls.append(url.get('href'))\n", "print(important_urls)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### b) Let's find the important text in the document\n", "- select every paragraph which has \"important\" class" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "soup.find_all('p', {'class': 'important'})\n", "# or:\n", "soup.find_all('p', class_='important')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Whooops, something's going on! Let's investigate!" 
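, { "cell_type": "markdown", "metadata": {}, "source": [ "The same filter can also be written as a list comprehension, and relative links can be resolved against the page URL with the standard library's `urllib.parse.urljoin`. A small sketch (the `absolute_urls` name is our own):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from urllib.parse import urljoin\n", "\n", "# Same filter as above, as a list comprehension;\n", "# href can be None, hence the extra guard.\n", "absolute_urls = [\n", "    urljoin(existing_url, a.get('href'))\n", "    for a in soup.find_all('a')\n", "    if a.get('href') and 'important_part' in a.get('href')\n", "]\n", "print(absolute_urls)" ] }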
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "important_paragraphs = soup.find_all('p', class_='important')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- print the text in the tags, and tags' parent's id attribute" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for p in important_paragraphs:\n", " print(p.get_text(), '>', p.parent.get('id'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- We can see, that the \"fake\" result is from somewhere else" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(soup.find(id='not_main_section'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- We have a hidden fake section! Let's modify our search!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "soup.find(id='main_content').find_all('p', class_='important')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### c) Let's find the pictures of our interest\n", "- Let's have the \"nice\" pictures from the div with random_images_1 class!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "(\n", " soup\n", " .find(id='main_content')\n", " .find('div', class_='random_images_1')\n", " .find_all('img', class_='nice')\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Whoops again. Filter out the result we don't like." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "imgs = (\n", " soup\n", " .find(id='main_content')\n", " .find('div', class_='random_images_1')\n", " .find_all('img', class_='nice')\n", ")\n", "nice_imgs = []\n", "for img in imgs:\n", " if 'not' not in img.get('class'):\n", " nice_imgs.append(img.get('src'))\n", "print(nice_imgs)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We have one more tool we can use to simplify this situations: the `select` function, which allows the usage of the CSS selectors." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "(\n", " soup\n", " .find(id='main_content')\n", " .find('div', class_='random_images_1')\n", " .select('img[class=\"nice\"]')\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Most important methods:\n", "- `.find(tag, id, class_, attrs)`\n", "- `.find_all(tag, id, class_, attrs)`\n", "- `.select(selector expression string)`\n", "- `.get(attribute)`\n", "- `.get_text()`\n", "\n", "#### Exercise:\n", "- Find every **visible** headlines (`h1`...`h6`) texts and subtitles" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "## Let's do some...\n", "\n", "\n", "
      \n", "\n", "### Cool library of the week, part I: gTTS\n", "#### Create your own audiobook\n", "\n", "- install gTTS with:\n", " ```bash\n", " pip install gtts\n", " ```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- make it talk" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from gtts import gTTS\n", "en_hello = gTTS('Hello!', lang='en')\n", "hu_hello = gTTS('Szia!', lang='hu')\n", "\n", "en_hello.save('./data/en_hello.mp3')\n", "hu_hello.save('./data/hu_hello.mp3')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- play it within the notebook" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import IPython\n", "\n", "IPython.display.Audio(\"./data/en_hello.mp3\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "IPython.display.Audio(\"./data/hu_hello.mp3\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "### Cool library of the week, part II: NLTK\n", "#### Analyze texts in a few lines\n", "\n", "- install it with:\n", " ```bash\n", " pip install nltk\n", " ```\n", "- download required assets" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import nltk\n", "nltk.download(['punkt', 'stopwords'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- download and extract the first LOTR book" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import requests\n", "\n", "base_url ='https://github.com/ganesh-k13/shell/raw/refs/heads/master/test_search/www.glozman.com/TextPages/{book}'\n", "book_names = {\n", " 1: '01%20-%20The%20Fellowship%20Of%20The%20Ring.txt',\n", " 2: '02%20-%20The%20Two%20Towers.txt',\n", " 3: '03%20-%20The%20Return%20Of%20The%20King.txt',\n", "} \n", "LOTR = requests.get(base_url.format(book=book_names[1])).text" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- write a stopword and punctuation filter" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def needed(token):\n", " stopword = token not in nltk.corpus.stopwords.words('english')\n", " number = not token.isnumeric()\n", " length = len(token) > 1 # can be 2 as well\n", " return stopword and number and length\n", "\n", "list(filter(needed, u'I am the number 1 Elephant in the world'.split()))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- tokenize words\n", "- filter out stopwords and punctuations" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tokens = nltk.word_tokenize(LOTR.lower())\n", "tokens = filter(needed, tokens)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- compute word frequencies\n", "- show the top 25 words" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "wordcount = nltk.FreqDist(tokens)\n", "wordcount.most_common(25)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Let's play a little! \n", "Check how did the top25 words change through the trilogy!" 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "wordcounts = []\n", "for book in range(1, 4):\n", " print('Processing book {}'.format(book), end='')\n", " LOTR = requests.get(base_url.format(book=book_names[book])).text\n", " print('.', end='')\n", " tokens = filter(needed, nltk.word_tokenize(LOTR.lower()))\n", " print('.', end=' ')\n", " wordcounts.append(nltk.FreqDist(tokens).most_common(25))\n", " print('Done.')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "wordcounts[0]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "wordcounts[1]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "wordcounts[2]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## It's your turn - write the missing code snippets!\n", "\n", "#### 1. Save every important link to a file from the example page" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "BASE_URI = './data/'\n", "filename = 'important_urls.txt'\n", "# your code goes here\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 2. Let's get a random post from bash.hu!\n", "- get the page from http://bash.hu/random\n", "- posts are contained in __`div`__ tags with __`qtxt`__ class\n", "- print the text" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "URI = \"http://bash.hu/random\"\n", "# your code goes here\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 3. Put the previous code into a function with two arguments: number of posts, and output filename" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def i_want_fun(output, times=5):\n", " pass # your code goes here\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "i_want_fun(BASE_URI+'fun.txt')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 4. Create a class from the previous function. \n", "The class should store all of the post texts.\n", "The class should have a method:\n", " - called `crawl` which crawls one random bash.hu post\n", " - called `crawl_multiple` which crawls a number (given as argument) of bash.hu posts\n", " - called `show_posts` which prints out the crawled posts\n", " - called `export` which saves the posts into a file (filename is given as argument)\n", " - called `reset` which empties the posts\n", "\n", "I already created the class' skeleton for you. Write your code in place of the `pass` statements." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "class IWantFun(object):\n", "\n", " URI = \"http://bash.hu/random\"\n", "\n", " def __init__(self):\n", " pass\n", "\n", " def crawl(self):\n", " pass\n", "\n", " def crawl_multiple(self, times=5):\n", " pass\n", "\n", " def show_urls(self):\n", " pass\n", "\n", " def export(self, output):\n", " pass\n", "\n", " def reset(self):\n", " pass\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "nine = IWantFun()\n", "nine.crawl()\n", "nine.show_urls()\n", "nine.crawl_multiple(5)\n", "nine.show_posts()\n", "nine.export(BASE_URI + 'fun.txt')\n", "nine.reset()\n", "nine.show_urls()\n" ] } ], "metadata": { "kernelspec": { "display_name": "szisz_python_2025_automn", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.13.7" } }, "nbformat": 4, "nbformat_minor": 1 }