Getting data from the web: scraping

MACS 30500 University of Chicago

November 15, 2017

Web scraping

Data on a website with no API
Still want a programmatic, reproducible way to obtain data
Ability to scrape depends on the quality of the website

HTML

How a browser retrieves and renders HTML

The web browser sends a request to the server that hosts the website
The server sends the browser an HTML document
The browser uses instructions in the HTML to render the website

Components of HTML code

<html>
  <head>
    <title>Title</title>
    <link rel="icon" type="icon" href="http://a" />
    <link rel="icon" type="icon" href="http://b" />
    <script src="https://c.js"></script>
  </head>
  <body>
    <div>
      <p>Click <b>here</b> now.</p>
      <span>Frozen</span>
    </div>
    <table style="width:100%">
      <tr>
        <td>Kristen</td>
        <td>Bell</td>
      </tr>
      <tr>
        <td>Idina</td>
        <td>Menzel</td>
      </tr>
    </table>
    <img src="http://ia.media-imdb.com/images.png"/>
  </body>
</html>

Components of HTML code

<a href="http://github.com">GitHub</a>

<a></a> - tag name
href - attribute (name)
"http://github.com" - attribute (value)
GitHub - content
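
The same pieces can be pulled out of the tag in R. A minimal sketch, assuming the rvest package (introduced later in these slides) is installed; the HTML is parsed from a literal string, so nothing is downloaded:

```r
library(rvest)

# Parse a literal HTML string instead of downloading a page
page <- read_html('<a href="http://github.com">GitHub</a>')

# Select the <a> element
link <- html_nodes(page, "a")

tag_name <- html_name(link)          # tag name
href     <- html_attr(link, "href")  # value of the href attribute
content  <- html_text(link)          # text between the opening and closing tags

tag_name
href
content
```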

Nested structure of HTML

html
- head
  - title
  - link
  - link
  - script
- body
  - div
    - p
      - b
    - span
  - table
    - tr
      - td
      - td
    - tr
      - td
      - td
  - img

Find the content “here”

html
- body
  - div
    - p
      - b
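
The nesting can be followed in code as well. A rough sketch, assuming rvest is installed and using a trimmed copy of the example document; the `>` in the selector means "direct child of":

```r
library(rvest)

# Trimmed copy of the example document, parsed from a string
doc <- read_html('
<html>
  <body>
    <div>
      <p>Click <b>here</b> now.</p>
      <span>Frozen</span>
    </div>
  </body>
</html>')

# Walk down the tree one level at a time: body, div, p, b
bold <- html_nodes(doc, "body > div > p > b")
bold_text <- html_text(bold)

# The parent <p> contains the text of all of its children
para_text <- html_text(html_nodes(doc, "p"))

bold_text
para_text
```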

Find the source code

IMDB page for Frozen

HTML tag for Kristen Bell

HTML only

HTML + CSS

CSS code

span {
  color: #ffffff;
}

.num {
  color: #a8660d;
}

table.data {
  width: auto;
}

#firstname {
  background-color: yellow;
}

CSS code

<span class="bigname" id="shiny">Shiny</span>

<span></span> - tag name
bigname - class (optional)
shiny - id (optional)

CSS selectors

span

.bigname

span.bigname

#shiny

CSS selectors

Prefix	Matches
none	tag
.	class
#	id
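
To see what each selector matches, here is a small sketch assuming rvest is installed. The page is made up for illustration: the first span comes from the slide above, and the other two elements are added only to show the contrast.

```r
library(rvest)

# Made-up page: only the first <span> comes from the slides
page <- read_html('
<body>
  <span class="bigname" id="shiny">Shiny</span>
  <span class="num">42</span>
  <p class="bigname">Paragraph</p>
</body>')

by_tag       <- html_text(html_nodes(page, "span"))          # every <span>
by_class     <- html_text(html_nodes(page, ".bigname"))      # any tag with class "bigname"
by_tag_class <- html_text(html_nodes(page, "span.bigname"))  # <span> tags with class "bigname"
by_id        <- html_text(html_nodes(page, "#shiny"))        # the element with id "shiny"

by_tag
by_class
by_tag_class
by_id
```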

CSS diner

Find the CSS selector

IMDB page for Frozen

`rvest`

Download the HTML and parse it into an XML document with read_html()
Extract specific nodes with html_nodes()
Extract content from nodes with various functions
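
The three steps can be tried offline first. A minimal sketch, assuming rvest is installed; read_html() also accepts a literal HTML string, so the table from the earlier example stands in for a downloaded page:

```r
library(rvest)

# Step 1: parse the HTML (here a string rather than a URL)
page <- read_html('
<table>
  <tr><td>Kristen</td><td>Bell</td></tr>
  <tr><td>Idina</td><td>Menzel</td></tr>
</table>')

# Step 2: extract the nodes of interest
cells <- html_nodes(page, "td")

# Step 3: extract content from those nodes
cell_text <- html_text(cells)
cell_text
```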

Download the HTML

library(rvest)
frozen <- read_html("http://www.imdb.com/title/tt2294629/")
frozen

## {xml_document}
## <html xmlns:og="http://ogp.me/ns#" xmlns:fb="http://www.facebook.com/2008/fbml">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset= ...
## [2] <body id="styleguide-v2" class="fixed">\n<script>\n    if (typeof ue ...

Extract nodes

itals <- html_nodes(frozen, "em")
itals

## {xml_nodeset (1)}
## [1] <em class="nobr">Written by\n<a href="/search/title?plot_author=DeAl ...

Extract content from nodes

itals

## {xml_nodeset (1)}
## [1] <em class="nobr">Written by\n<a href="/search/title?plot_author=DeAl ...

html_text(itals)

## [1] "Written by\nDeAlan Wilson for ComedyE.com"

html_name(itals)

## [1] "em"

html_children(itals)

## {xml_nodeset (1)}
## [1] <a href="/search/title?plot_author=DeAlan%20Wilson%20for%20ComedyE.c ...

html_attr(itals, "class")

## [1] "nobr"

html_attrs(itals)

## [[1]]
##  class 
## "nobr"
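
These functions work on every node in a node set at once. A small sketch, assuming rvest is installed; the list, the href values, and the class names are invented for illustration:

```r
library(rvest)

# Invented page: the links and classes are placeholders
page <- read_html('
<ul>
  <li><a href="/name/a" class="actor">Kristen Bell</a></li>
  <li><a href="/name/b" class="actor">Idina Menzel</a></li>
  <li><a href="/name/c">Josh Gad</a></li>
</ul>')

links <- html_nodes(page, "a")

link_text  <- html_text(links)           # one string per node
link_href  <- html_attr(links, "href")   # one attribute value per node
link_class <- html_attr(links, "class")  # NA where the attribute is missing

link_text
link_href
link_class
```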

Extract content

Read in the Frozen HTML
Select the span nodes that have class "itemprop"
Extract the text from the nodes

Extract content

library(rvest)
frozen <- read_html("http://www.imdb.com/title/tt2294629/")
cast <- html_nodes(frozen, "span.itemprop")
html_text(cast)

##  [1] "Animation"                     "Adventure"                    
##  [3] "Comedy"                        "Chris Buck"                   
##  [5] "Jennifer Lee"                  "Jennifer Lee"                 
##  [7] "Hans Christian Andersen"       "Kristen Bell"                 
##  [9] "Idina Menzel"                  "Jonathan Groff"               
## [11] "Kristen Bell"                  "Idina Menzel"                 
## [13] "Jonathan Groff"                "Josh Gad"                     
## [15] "Santino Fontana"               "Alan Tudyk"                   
## [17] "Ciarán Hinds"                  "Chris Williams"               
## [19] "Stephen J. Anderson"           "Maia Wilson"                  
## [21] "Edie McClurg"                  "Robert Pine"                  
## [23] "Maurice LaMarche"              "Livvy Stubenrauch"            
## [25] "Eva Bella"                     "female protagonist"           
## [27] "sister sister relationship"    "snowman"                      
## [29] "sister love"                   "magic"                        
## [31] "Walt Disney Animation Studios" "Walt Disney Pictures"

SelectorGadget

GUI tool used to identify CSS selector combinations from a webpage.

Run vignette("selectorgadget")
Drag SelectorGadget link into your browser’s bookmark bar

Using SelectorGadget

Navigate to a webpage
Open the SelectorGadget bookmark
Click on the item to scrape
Click on yellow items you do not want to scrape
Click on additional items that you do want to scrape
Rinse and repeat until only the items you want to scrape are highlighted in yellow
Copy the selector to use with html_nodes()

Practice using SelectorGadget

Use SelectorGadget to find a CSS selector combination that identifies just the cast member names

Practice using SelectorGadget

cast2 <- html_nodes(frozen, "#titleCast span.itemprop")
html_text(cast2)

##  [1] "Kristen Bell"        "Idina Menzel"        "Jonathan Groff"     
##  [4] "Josh Gad"            "Santino Fontana"     "Alan Tudyk"         
##  [7] "Ciarán Hinds"        "Chris Williams"      "Stephen J. Anderson"
## [10] "Maia Wilson"         "Edie McClurg"        "Robert Pine"        
## [13] "Maurice LaMarche"    "Livvy Stubenrauch"   "Eva Bella"

cast3 <- html_nodes(frozen, ".itemprop .itemprop")
html_text(cast3)

##  [1] "Kristen Bell"        "Idina Menzel"        "Jonathan Groff"     
##  [4] "Josh Gad"            "Santino Fontana"     "Alan Tudyk"         
##  [7] "Ciarán Hinds"        "Chris Williams"      "Stephen J. Anderson"
## [10] "Maia Wilson"         "Edie McClurg"        "Robert Pine"        
## [13] "Maurice LaMarche"    "Livvy Stubenrauch"   "Eva Bella"
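
Why the longer selectors return fewer results: a space between two selectors means "inside of". A sketch on an invented page, assuming rvest is installed; the id titleGenres is hypothetical, while titleCast and itemprop are taken from the selectors above:

```r
library(rvest)

# Invented page: "titleGenres" is a hypothetical id
page <- read_html('
<div id="titleGenres">
  <span class="itemprop">Animation</span>
</div>
<div id="titleCast">
  <span class="itemprop">Kristen Bell</span>
  <span class="itemprop">Idina Menzel</span>
</div>')

# Every span with class "itemprop", wherever it sits
all_items <- html_text(html_nodes(page, "span.itemprop"))

# Only those nested inside the element with id "titleCast"
cast_only <- html_text(html_nodes(page, "#titleCast span.itemprop"))

all_items
cast_only
```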

Practice scraping data

Look up the cost of living for your hometown on Sperling’s Best Places
Extract it with html_nodes() and html_text()

Practice scraping data

sterling <- read_html("http://www.bestplaces.net/cost_of_living/city/virginia/sterling")

col <- html_nodes(sterling, css = "#mainContent_dgCostOfLiving tr:nth-child(2) td:nth-child(2)")
html_text(col)

## [1] "136"

# or use a piped operation
sterling %>%
  html_nodes(css = "#mainContent_dgCostOfLiving tr:nth-child(2) td:nth-child(2)") %>%
  html_text()

## [1] "136"
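
How the nth-child() selector picks a single cell can be checked offline. A sketch assuming rvest is installed; the table mimics the one on the page, and the id costs is made up:

```r
library(rvest)

# Made-up stand-in for the cost of living table
page <- read_html('
<table id="costs">
  <tr><th>COST OF LIVING</th><th>Sterling, Virginia</th><th>United States</th></tr>
  <tr><td>Overall</td><td>136</td><td>100</td></tr>
  <tr><td>Grocery</td><td>113.9</td><td>100</td></tr>
</table>')

# Second row of the table, then the second cell within that row
overall <- page %>%
  html_nodes("#costs tr:nth-child(2) td:nth-child(2)") %>%
  html_text()

# html_text() always returns character, so convert before doing arithmetic
overall_num <- as.numeric(overall)

overall
overall_num
```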

Tables

tables <- html_nodes(sterling, css = "table")

tables %>%
  # get the second table (base R indexing; nth() would require dplyr)
  .[[2]] %>%
  # convert to data frame
  html_table(header = TRUE)

##   COST OF LIVING Sterling, Virginia United States
## 1        Overall              136.0           100
## 2        Grocery              113.9           100
## 3         Health              101.0           100
## 4        Housing              203.0           100
## 5      Utilities              107.0           100
## 6 Transportation              108.0           100
## 7  Miscellaneous               98.0           100
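
The table conversion can be tried without the live site. A sketch on a made-up page, assuming rvest is installed, that picks a table by position with base R indexing:

```r
library(rvest)

# A made-up page holding two tables; only the second is of interest
page <- read_html('
<table>
  <tr><th>Menu</th></tr>
  <tr><td>Home</td></tr>
</table>
<table>
  <tr><th>COST OF LIVING</th><th>Sterling, Virginia</th><th>United States</th></tr>
  <tr><td>Overall</td><td>136</td><td>100</td></tr>
  <tr><td>Grocery</td><td>113.9</td><td>100</td></tr>
</table>')

tables <- html_nodes(page, "table")

# [[2]] picks the second node in the node set
costs <- html_table(tables[[2]], header = TRUE)
costs
```

The first row supplies the column names, and columns that contain only numbers are converted to numeric.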

Extract climate statistics

Extract the climate statistics of your hometown as a data frame with useful column names

Extract climate statistics

sterling_climate <- read_html("http://www.bestplaces.net/climate/city/virginia/sterling")

climate <- html_nodes(sterling_climate, css = "table")
html_table(climate, header = TRUE, fill = TRUE)[[2]]

##                         CLIMATE Sterling, Virginia United States
## 1                Rainfall (in.)            42.0447          39.2
## 2                Snowfall (in.)            21.5351          25.8
## 3            Precipitation Days            74.1000           102
## 4                    Sunny Days           197.0000           205
## 5                Avg. July High            87.4170          86.1
## 6                 Avg. Jan. Low            23.9660          22.6
## 7 Comfort Index (higher=better)            47.0000            54
## 8                      UV Index             4.0000           4.3
## 9                 Elevation ft.           457.0000         1,443

sterling_climate %>%
  html_nodes(css = "table") %>%
  # get the second table (base R indexing; nth() would require dplyr)
  .[[2]] %>%
  html_table(header = TRUE)

##                         CLIMATE Sterling, Virginia United States
## 1                Rainfall (in.)            42.0447          39.2
## 2                Snowfall (in.)            21.5351          25.8
## 3            Precipitation Days            74.1000           102
## 4                    Sunny Days           197.0000           205
## 5                Avg. July High            87.4170          86.1
## 6                 Avg. Jan. Low            23.9660          22.6
## 7 Comfort Index (higher=better)            47.0000            54
## 8                      UV Index             4.0000           4.3
## 9                 Elevation ft.           457.0000         1,443
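
One possible way to get useful column names, sketched on a cut-down copy of the table and assuming rvest is installed. The new names (statistic, sterling_va, united_states) are only a suggestion. The comma in "1,443" keeps the United States column as text, so it is stripped before converting:

```r
library(rvest)

# Cut-down copy of the climate table
page <- read_html('
<table>
  <tr><th>CLIMATE</th><th>Sterling, Virginia</th><th>United States</th></tr>
  <tr><td>Rainfall (in.)</td><td>42.0447</td><td>39.2</td></tr>
  <tr><td>Elevation ft.</td><td>457</td><td>1,443</td></tr>
</table>')

climate <- html_table(html_node(page, "table"), header = TRUE)

# Replace the scraped headers with names that are easier to type
names(climate) <- c("statistic", "sterling_va", "united_states")

# Remove thousands separators, then convert the column to numeric
climate$united_states <- as.numeric(gsub(",", "", climate$united_states))

climate
```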

Random observations on scraping

Make sure you’ve obtained only what you want
If you are having trouble parsing, try selecting a smaller part of the page first
Before scraping, confirm that no R package or API already provides the data