# Online Data

Web scraping is the activity of downloading, manipulating, and using information obtained online. Web scraping can get very complicated, and we won't do much of it in this course. These lecture notes will help you get started with the basics. We'll look into this a bit more when we get to regular expressions in a few lectures.

## Downloading Files

There are several modules for downloading files from the internet. We'll use `urllib`: 

In [1]:
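# import the submodule explicitly: a bare `import urllib` is not guaranteed to load `urllib.request`
import urllib.request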

In [3]:
url = "https://philchodrow.github.io/PIC16A/content/IO_and_modules/IO/palmer_penguins.csv"

filedata = urllib.request.urlopen(url)
to_write = filedata.read()

with open("downloaded_penguins.csv", "wb") as f:
    f.write(to_write)

Having run this code, you can check in your file explorer that a file called `downloaded_penguins.csv` now lives in the same directory as this notebook. We passed the somewhat unusual mode `"wb"` to `open()` in order to indicate that we need to write a *binary* file, rather than the usual text file. This is because `to_write`, the return value of `filedata.read()`, is a `bytes` object rather than a string. We might ask you in assignments to use this pattern, but we won't evaluate you on it in any timed or closed-book contexts.
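The same download pattern can be wrapped in a small reusable function. This is only a sketch: the name `download_file` and its parameters are made up for illustration and are not part of `urllib`. It assumes the response object can be used in a `with` statement, which lets Python close the connection for us, and it copies the data in chunks with `shutil.copyfileobj` instead of reading everything into memory first.

```python
import shutil
from urllib.request import urlopen


def download_file(url, destination):
    """Save the bytes found at `url` to the file `destination` (hypothetical helper)."""
    # Both the response and the output file are closed automatically
    # when the `with` block ends, even if an error occurs while writing.
    with urlopen(url) as response, open(destination, "wb") as f:
        # "wb" again: the response yields bytes, not text.
        shutil.copyfileobj(response, f)
    return destination
```

For example, `download_file(url, "downloaded_penguins.csv")` with the `url` defined above should produce the same file as the cell above.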

The third-party module `wget`, which is not part of the standard library and must be installed separately, is another popular tool for downloading files from the internet.

## Data from Websites

Often, we want to access the contents of a webpage. In this case, the `urlopen` function from the `urllib.request` submodule gives us easy access to the contents of a desired URL.

In [11]:
from urllib.request import urlopen

url = "https://philchodrow.github.io/PIC16A/schedule/"

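page = urlopen(url)
html_bits = page.read()  # raw bytes, not yet a string

# decode the bytes into a string so that we can treat the page as text
html = html_bits.decode("utf-8")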

print(html[0:500])

<!DOCTYPE html>
<html>

  <head>
  <meta charset="utf-8">
  <meta http-equiv="X-UA-Compatible" content="IE=edge">
  <meta name="viewport" content="width=device-width, initial-scale=1">

  <title>PIC16A: Course Schedule (Fall 2020)</title>
  <meta name="description" content="Course materials for PIC16A at UCLA">

  <link rel="stylesheet" href="/PIC16A//_css/main.css">
  <link rel="canonical" href="http://philchodrow.github.io/PIC16A/PIC16A//schedule/">
  <link rel="alternate" type="application/rs
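The cell above hard-codes `"utf-8"` when decoding. As a hedged sketch, the steps can be bundled into a helper that first asks the response which character set it declares and only falls back to UTF-8 when none is given. The function name `fetch_html` and the `fallback_encoding` parameter are hypothetical names chosen for this example.

```python
from urllib.request import urlopen


def fetch_html(url, fallback_encoding="utf-8"):
    """Return the page at `url` as a string (hypothetical helper, not from the lecture)."""
    with urlopen(url) as page:
        raw = page.read()  # bytes, exactly as received
        # Use the charset declared in the Content-Type header when there is one;
        # otherwise fall back to an assumed default encoding.
        encoding = page.headers.get_content_charset() or fallback_encoding
    return raw.decode(encoding)
```

With this helper, `html = fetch_html(url)` would play the role of the three lines that open, read, and decode the page.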


In [13]:
import re

urls = re.findall(r'href=[\'"]?([^\'">]+)', html)

urls

[url for url in urls if "http" in url]

['http://philchodrow.github.io/PIC16A/PIC16A//schedule/',
 'http://philchodrow.github.io/PIC16A/PIC16A//feed.xml',
 'https://fonts.googleapis.com/css?family=Titillium+Web:600italic,600,400,400italic',
 'https://fonts.googleapis.com/css2?family=Lato&display=swap',
 'https://fonts.googleapis.com/css2?family=Lato:ital,wght@0,400;0,700;1,400&display=swap',
 'https://fonts.googleapis.com/css2?family=Raleway&display=swap',
 'https://use.fontawesome.com/releases/v5.2.0/css/all.css',
 'http://philchodrow.github.io/PIC16A/syllabus/',
 'http://philchodrow.github.io/PIC16A/schedule/',
 'http://philchodrow.github.io/PIC16A/materials/',
 'https://github.com/philchodrow/PIC16A',
 'http://www.philchodrow.com',
 'https://docs.anaconda.com/anaconda/install/',
 'https://docs.python.org/3/tutorial/appetite.html',
 'https://nbviewer.jupyter.org/github/PhilChodrow/PIC16A/blob/master/content/basics/numbers.ipynb',
 'https://nbviewer.jupyter.org/github/PhilChodrow/PIC16A/blob/master/content/basics/strings.ip
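To see what the pattern in the cell above actually matches, it can help to run it on a tiny, made-up piece of HTML. The fragment below is invented for illustration and is not taken from the course site. The comments break the pattern into its three parts.

```python
import re

# A made-up HTML fragment (an assumption for illustration only).
sample_html = '''
<a href="https://example.com/a">A</a>
<a href='/relative/path'>B</a>
<link rel="stylesheet" href=style.css>
'''

# href=        the literal text href=
# [\'"]?       an optional opening quote, single or double
# ([^\'">]+)   capture one or more characters, stopping at a quote or at >
pattern = r'href=[\'"]?([^\'">]+)'

links = re.findall(pattern, sample_html)
print(links)
# ['https://example.com/a', '/relative/path', 'style.css']

# Keep only the links that start with "http". This is slightly stricter than
# the `"http" in url` check used above, which would also keep a relative
# link that merely contains "http" somewhere in the middle.
absolute_links = [link for link in links if link.startswith("http")]
print(absolute_links)
# ['https://example.com/a']
```

Because `re.findall` returns only the contents of the capturing group, the quotes and the `href=` prefix are not part of the results.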

As you can imagine, *parsing* HTML in order to extract useful content is a difficult problem. We will revisit this problem when we learn regular expressions in a few lectures. Here's an example of the kind of thing we'll be able to do: 