Getting data from the web: scraping

MACS 30500 University of Chicago

November 15, 2017

Web scraping

  • Data on a website with no API
  • Still want a programmatic, reproducible way to obtain data
  • Ability to scrape depends on the quality of the website, especially how consistently its HTML is structured

HTML

tags

Process of HTML

  1. The web browser sends a request to the server that hosts the website
  2. The server sends the browser an HTML document
  3. The browser uses instructions in the HTML to render the website

Components of HTML code

<html>
  <head>
    <title>Title</title>
    <link rel="icon" type="icon" href="http://a" />
    <link rel="icon" type="icon" href="http://b" />
    <script src="https://c.js"></script>
  </head>
  <body>
    <div>
      <p>Click <b>here</b> now.</p>
      <span>Frozen</span>
    </div>
    <table style="width:100%">
      <tr>
        <td>Kristen</td>
        <td>Bell</td>
      </tr>
      <tr>
        <td>Idina</td>
        <td>Menzel</td>
      </tr>
    </table>
    <img src="http://ia.media-imdb.com/images.png"/>
  </body>
</html>

Components of HTML code

<a href="http://github.com">GitHub</a>
  • <a></a> - tag name
  • href - attribute (name)
  • "http://github.com" - attribute (value)
  • GitHub - content
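
The same four pieces can be pulled out of the tag in R. This is a minimal sketch that assumes the rvest package (introduced later in these slides) is installed; the variable names are illustrative only.

```r
library(rvest)

# read_html() also accepts a literal HTML string, so no download is needed
doc <- read_html('<a href="http://github.com">GitHub</a>')

# Select the <a> node
link <- html_nodes(doc, "a")

tag_name <- html_name(link)          # tag name
href     <- html_attr(link, "href")  # value of the href attribute
content  <- html_text(link)          # content between the opening and closing tags

tag_name
href
content
```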

Nested structure of HTML

  • html
    • head
      • title
      • link
      • link
      • script
    • body
      • div
        • p
          • b
        • span
      • table
        • tr
          • td
          • td
        • tr
          • td
          • td
      • img

Find the content “here”

  • html
    • body
      • div
        • p
          • b (contains “here”)
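
In the example page, the content “here” sits inside a b tag, which is nested in p, div, and body. One way to express that path is with CSS combinators, as in the sketch below. It assumes rvest is installed and uses a trimmed copy of the example page; the object names are illustrative.

```r
library(rvest)

# Trimmed copy of the example page from the earlier slide
page <- read_html('<html>
  <head><title>Title</title></head>
  <body>
    <div>
      <p>Click <b>here</b> now.</p>
      <span>Frozen</span>
    </div>
  </body>
</html>')

# Child combinator (>): each tag must be a direct child of the one before it
strict <- html_text(html_nodes(page, "body > div > p > b"))

# Descendant combinator (a space): b anywhere inside a div
loose <- html_text(html_nodes(page, "div b"))

strict
loose
```

Both selectors reach the same node here; the stricter one is less likely to pick up unrelated b tags on a larger page.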

Find the source code

IMDB page for Frozen

Find the source code

HTML tag for Kristen Bell

HTML only

HTML + CSS

CSS code

span {
  color: #ffffff;
}

.num {
  color: #a8660d;
}

table.data {
  width: auto;
}

#firstname {
  background-color: yellow;
}

CSS code

<span class="bigname" id="shiny">Shiny</span>
  • <span></span> - tag name
  • bigname - class (optional)
  • shiny - id (optional)

CSS selectors

span
.bigname
span.bigname
#shiny

CSS selectors

Prefix   Matches
(none)   tag
.        class
#        id
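
A small sketch of how the three prefixes behave, assuming rvest is installed. The HTML string is a made-up fragment built around the span from the earlier slide; the extra p and span tags are hypothetical.

```r
library(rvest)

# Hypothetical fragment: one <p> and two <span> tags
page <- read_html(paste0(
  '<p class="bigname">Paragraph</p>',
  '<span class="bigname" id="shiny">Shiny</span>',
  '<span>Plain</span>'
))

by_tag   <- html_text(html_nodes(page, "span"))          # no prefix: every <span>
by_class <- html_text(html_nodes(page, ".bigname"))      # . prefix: any tag with class "bigname"
by_both  <- html_text(html_nodes(page, "span.bigname"))  # tag plus class: <span> tags with that class
by_id    <- html_text(html_nodes(page, "#shiny"))        # # prefix: the element with id "shiny"

by_tag
by_class
by_both
by_id
```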

CSS diner

Find the CSS selector

IMDB page for Frozen

rvest

  1. Download the HTML and parse it into an XML document with read_html()
  2. Extract specific nodes with html_nodes()
  3. Extract content from nodes with various functions
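
The three steps can be tried offline before touching a real site. The sketch below assumes rvest is installed and parses a literal HTML string instead of a URL; it mainly shows what kind of object each step returns.

```r
library(rvest)

# Step 1: parse HTML (a literal string here, a URL in real use)
page <- read_html("<table><tr><td>Kristen</td><td>Bell</td></tr></table>")

# Step 2: extract the nodes that match a CSS selector
cells <- html_nodes(page, "td")

# Step 3: extract content from those nodes
cell_text <- html_text(cells)

class(page)    # includes "xml_document"
class(cells)   # "xml_nodeset"
cell_text      # a plain character vector
```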

Download the HTML

library(rvest)
frozen <- read_html("http://www.imdb.com/title/tt2294629/")
frozen
## {xml_document}
## <html xmlns:og="http://ogp.me/ns#" xmlns:fb="http://www.facebook.com/2008/fbml">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset= ...
## [2] <body id="styleguide-v2" class="fixed">\n<script>\n    if (typeof ue ...

Extract nodes

itals <- html_nodes(frozen, "em")
itals
## {xml_nodeset (1)}
## [1] <em class="nobr">Written by\n<a href="/search/title?plot_author=DeAl ...

Extract content from nodes

itals
## {xml_nodeset (1)}
## [1] <em class="nobr">Written by\n<a href="/search/title?plot_author=DeAl ...
html_text(itals)
## [1] "Written by\nDeAlan Wilson for ComedyE.com"
html_name(itals)
## [1] "em"
html_children(itals)
## {xml_nodeset (1)}
## [1] <a href="/search/title?plot_author=DeAlan%20Wilson%20for%20ComedyE.c ...
html_attr(itals, "class")
## [1] "nobr"
html_attrs(itals)
## [[1]]
##  class 
## "nobr"

Extract content

  1. Read in the Frozen HTML
  2. Select the nodes that are both spans and class = "itemprop"
  3. Extract the text from the nodes

Extract content

library(rvest)
frozen <- read_html("http://www.imdb.com/title/tt2294629/")
cast <- html_nodes(frozen, "span.itemprop")
html_text(cast)
##  [1] "Animation"                     "Adventure"                    
##  [3] "Comedy"                        "Chris Buck"                   
##  [5] "Jennifer Lee"                  "Jennifer Lee"                 
##  [7] "Hans Christian Andersen"       "Kristen Bell"                 
##  [9] "Idina Menzel"                  "Jonathan Groff"               
## [11] "Kristen Bell"                  "Idina Menzel"                 
## [13] "Jonathan Groff"                "Josh Gad"                     
## [15] "Santino Fontana"               "Alan Tudyk"                   
## [17] "Ciarán Hinds"                  "Chris Williams"               
## [19] "Stephen J. Anderson"           "Maia Wilson"                  
## [21] "Edie McClurg"                  "Robert Pine"                  
## [23] "Maurice LaMarche"              "Livvy Stubenrauch"            
## [25] "Eva Bella"                     "female protagonist"           
## [27] "sister sister relationship"    "snowman"                      
## [29] "sister love"                   "magic"                        
## [31] "Walt Disney Animation Studios" "Walt Disney Pictures"

SelectorGadget

  • GUI tool used to identify CSS selector combinations from a webpage.
  1. Run vignette("selectorgadget")
  2. Drag SelectorGadget link into your browser’s bookmark bar

Using SelectorGadget

  1. Navigate to a webpage
  2. Open the SelectorGadget bookmark
  3. Click on the item to scrape
  4. Click on yellow items you do not want to scrape
  5. Click on additional items that you do want to scrape
  6. Rinse and repeat until only the items you want to scrape are highlighted in yellow
  7. Copy the selector to use with html_nodes()

Practice using SelectorGadget

Use SelectorGadget to find a CSS selector combination that identifies just the cast member names

Practice using SelectorGadget

cast2 <- html_nodes(frozen, "#titleCast span.itemprop")
html_text(cast2)
##  [1] "Kristen Bell"        "Idina Menzel"        "Jonathan Groff"     
##  [4] "Josh Gad"            "Santino Fontana"     "Alan Tudyk"         
##  [7] "Ciarán Hinds"        "Chris Williams"      "Stephen J. Anderson"
## [10] "Maia Wilson"         "Edie McClurg"        "Robert Pine"        
## [13] "Maurice LaMarche"    "Livvy Stubenrauch"   "Eva Bella"
cast3 <- html_nodes(frozen, ".itemprop .itemprop")
html_text(cast3)
##  [1] "Kristen Bell"        "Idina Menzel"        "Jonathan Groff"     
##  [4] "Josh Gad"            "Santino Fontana"     "Alan Tudyk"         
##  [7] "Ciarán Hinds"        "Chris Williams"      "Stephen J. Anderson"
## [10] "Maia Wilson"         "Edie McClurg"        "Robert Pine"        
## [13] "Maurice LaMarche"    "Livvy Stubenrauch"   "Eva Bella"
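
Why both selectors drop the genres, directors, and keywords can be seen on a small mock-up. The HTML below is hypothetical: it only imitates the nesting that these selectors rely on (the titleGenres id in particular is invented) and is not the actual IMDB markup. rvest is assumed to be installed.

```r
library(rvest)

# Hypothetical markup: one genre outside the cast table, two names inside it
page <- read_html('<div id="titleGenres"><span class="itemprop">Animation</span></div>
<table id="titleCast">
  <tr><td class="itemprop"><a><span class="itemprop">Kristen Bell</span></a></td></tr>
  <tr><td class="itemprop"><a><span class="itemprop">Idina Menzel</span></a></td></tr>
</table>')

# Every span with class "itemprop", wherever it appears
all_spans <- html_text(html_nodes(page, "span.itemprop"))

# Only spans located inside the element with id "titleCast"
cast_by_id <- html_text(html_nodes(page, "#titleCast span.itemprop"))

# Only "itemprop" elements nested inside another "itemprop" element
cast_nested <- html_text(html_nodes(page, ".itemprop .itemprop"))

all_spans
cast_by_id
cast_nested
```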

Practice scraping data

  1. Look up the cost of living for your hometown on Sperling’s Best Places
  2. Extract it with html_nodes() and html_text()

Practice scraping data

sterling <- read_html("http://www.bestplaces.net/cost_of_living/city/virginia/sterling")

col <- html_nodes(sterling, css = "#mainContent_dgCostOfLiving tr:nth-child(2) td:nth-child(2)")
html_text(col)
## [1] "136"
# or use a piped operation
sterling %>%
  html_nodes(css = "#mainContent_dgCostOfLiving tr:nth-child(2) td:nth-child(2)") %>%
  html_text()
## [1] "136"

Tables

library(dplyr)  # provides nth(), used below

tables <- html_nodes(sterling, css = "table")

tables %>%
  # get the second table
  nth(2) %>%
  # convert to data frame
  html_table(header = TRUE)
##   COST OF LIVING Sterling, Virginia United States
## 1        Overall              136.0           100
## 2        Grocery              113.9           100
## 3         Health              101.0           100
## 4        Housing              203.0           100
## 5      Utilities              107.0           100
## 6 Transportation              108.0           100
## 7  Miscellaneous               98.0           100
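
html_table() can also be tried on a literal HTML string. The sketch below assumes rvest is installed and reuses the two-row table from the HTML example earlier in the slides; the header row (First, Last) is added here for illustration and is not part of that example. Base R's `[[` is used in place of nth(), so dplyr is not needed.

```r
library(rvest)

# Table from the earlier HTML example, plus a hypothetical header row
page <- read_html('<table>
  <tr><th>First</th><th>Last</th></tr>
  <tr><td>Kristen</td><td>Bell</td></tr>
  <tr><td>Idina</td><td>Menzel</td></tr>
</table>')

tables <- html_nodes(page, "table")

# Pick the first table and use its first row as column names
cast <- html_table(tables[[1]], header = TRUE)

names(cast)
cast$First
cast$Last
```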

Extract climate statistics

Extract the climate statistics of your hometown as a data frame with useful column names

Extract climate statistics

sterling_climate <- read_html("http://www.bestplaces.net/climate/city/virginia/sterling")

climate <- html_nodes(sterling_climate, css = "table")
html_table(climate, header = TRUE, fill = TRUE)[[2]]
##                         CLIMATE Sterling, Virginia United States
## 1                Rainfall (in.)            42.0447          39.2
## 2                Snowfall (in.)            21.5351          25.8
## 3            Precipitation Days            74.1000           102
## 4                    Sunny Days           197.0000           205
## 5                Avg. July High            87.4170          86.1
## 6                 Avg. Jan. Low            23.9660          22.6
## 7 Comfort Index (higher=better)            47.0000            54
## 8                      UV Index             4.0000           4.3
## 9                 Elevation ft.           457.0000         1,443
# or use a piped operation; nth() comes from dplyr
sterling_climate %>%
  html_nodes(css = "table") %>%
  nth(2) %>%
  html_table(header = TRUE)
##                         CLIMATE Sterling, Virginia United States
## 1                Rainfall (in.)            42.0447          39.2
## 2                Snowfall (in.)            21.5351          25.8
## 3            Precipitation Days            74.1000           102
## 4                    Sunny Days           197.0000           205
## 5                Avg. July High            87.4170          86.1
## 6                 Avg. Jan. Low            23.9660          22.6
## 7 Comfort Index (higher=better)            47.0000            54
## 8                      UV Index             4.0000           4.3
## 9                 Elevation ft.           457.0000         1,443
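
The scraped table still has the site's column headers, and the United States column appears to be read as text because of the comma in "1,443". One possible cleanup is sketched below in base R. To keep it runnable offline, it rebuilds three rows of the output above by hand; the new column names are only a suggestion.

```r
# Three rows copied from the output above; the last column is kept as text,
# which is how a column containing "1,443" would typically arrive
climate <- data.frame(
  CLIMATE = c("Rainfall (in.)", "Sunny Days", "Elevation ft."),
  `Sterling, Virginia` = c(42.0447, 197, 457),
  `United States` = c("39.2", "205", "1,443"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Replace the scraped headers with short, syntactic column names
names(climate) <- c("statistic", "sterling_va", "united_states")

# Strip thousands separators, then convert the text column to numeric
climate$united_states <- as.numeric(gsub(",", "", climate$united_states))

str(climate)
```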

Random observations on scraping

  • Make sure you’ve obtained only what you want
  • If you are having trouble parsing, try selecting a smaller subset of the content you want
  • Before scraping, confirm that no R package or API already provides the data
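
One habit that helps with the first point: a selector that matches nothing returns an empty result rather than an error, so it is worth checking how many nodes came back. A minimal sketch, assuming rvest is installed; the HTML and the misspelled selector are made up for illustration.

```r
library(rvest)

page <- read_html('<p class="name">Kristen Bell</p><p class="name">Idina Menzel</p>')

found   <- html_nodes(page, ".name")
missing <- html_nodes(page, ".nmae")  # typo in the selector: no match, but no error either

n_found   <- length(found)
n_missing <- length(missing)

# Fail early if the number of nodes is not what was expected
stopifnot(n_found == 2)

n_found
n_missing
html_text(missing)  # an empty character vector
```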