{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Online Data\n", "\n", "Webscraping is the activity of downloading, manipulating, and using information obtained online. Webscraping can get very complicated, and we won't do much of it in this course. This set of lecture notes can help you get started on the basics. We'll look into this a bit more when we get to regular expressions in a few lectures. \n", "\n", "## Downloading Files\n", "\n", "There are several modules for downloading files from the internet. We'll use `urllib`: " ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# note: `import urllib` alone does not load the `request` submodule,\n", "# so we import it explicitly\n", "import urllib.request" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "url = \"https://philchodrow.github.io/PIC16A/content/IO_and_modules/IO/palmer_penguins.csv\"\n", "\n", "# open a connection to the url and read its contents as bytes\n", "filedata = urllib.request.urlopen(url)\n", "to_write = filedata.read()\n", "\n", "# write the bytes to a local file\n", "with open(\"downloaded_penguins.csv\", \"wb\") as f:\n", "    f.write(to_write)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Having run this code, you can check in your file explorer that a file called `downloaded_penguins.csv` now lives in the same directory as this notebook. We used the somewhat unusual flag `\"wb\"` to `open()` in order to indicate that we need to write a *binary* file, rather than the usual text file. This is because `to_write`, the return value of `filedata.read()`, is by default binary data. We might ask you in assignments to use this pattern, but we won't evaluate you on it in any timed or closed-book contexts. \n", "\n", "The module `wget` is another popular tool for downloading files from the internet. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Data from Websites\n", "\n", "Often, we want to access the contents of a webpage. In this case, the `urlopen` function from the `urllib.request` submodule can help us easily access the contents of a desired URL. 
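" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For example, here is a small sketch (using the penguins URL from above) of how `urlopen` gives us a file-like object whose contents we can `read()` and then decode into an ordinary string: " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# open the same url as above and read its contents\n", "page = urllib.request.urlopen(\"https://philchodrow.github.io/PIC16A/content/IO_and_modules/IO/palmer_penguins.csv\")\n", "\n", "# read() returns bytes; decode them into a string\n", "text = page.read().decode(\"utf-8\")\n", "\n", "# the result is an ordinary string -- inspect the first 100 characters\n", "text[:100]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "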
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# download the raw HTML of a webpage as a string\n", "# (the original notes printed the PIC16A course schedule (Fall 2020) page;\n", "# the exact url was lost, so the one below is a best guess)\n", "url = \"https://philchodrow.github.io/PIC16A/schedule/\"\n", "html = urllib.request.urlopen(url).read().decode(\"utf-8\")\n", "print(html)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import re\n", "\n", "# grab the target of each link on the page (everything after <a href=\")\n", "urls = re.findall(r'<a href=\"([^>]+)', html)\n", "\n", "# keep only the full web addresses\n", "[url for url in urls if \"http\" in url]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you can imagine, *parsing* HTML in order to extract useful content is a difficult problem. We will revisit this problem when we learn regular expressions in a few lectures. Here's an example of the kind of thing we'll be able to do: " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.3" } }, "nbformat": 4, "nbformat_minor": 4 }