10/29/2019

Functions

Showcase

Problems

  • Unanticipated/unexpected (debugging)
    • browser(): pause execution and inspect the interactive state inside the function
    • traceback(): the call stack, i.e. the sequence of calls that led up to an error
  • Anticipated (condition handling)
    • stop(): fatal errors, force all execution to terminate (no way for a function to continue).
    • warning(): display potential problems (e.g., log(-1:2)).
    • message(): give informative output (e.g., let the user know what value the function has chosen for an important missing argument).
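
A minimal sketch of how these tools fit together; the function check_log() and its checks are illustrative, not from the lecture:

  check_log <- function(x, base = exp(1)) {
    if (!is.numeric(x)) {
      stop("`x` must be numeric.")                     # fatal: execution terminates here
    }
    if (any(x <= 0, na.rm = TRUE)) {
      warning("non-positive values give NaN or -Inf")  # potential problem, execution continues
    }
    message("Using base = ", base)                     # informative output about a chosen default
    # browser()  # uncomment to pause here and inspect the state interactively
    log(x, base = base)
  }

  check_log(-1:2)   # warns and messages, then returns c(NaN, -Inf, 0, log(2))
  # check_log("a")  # throws an error; call traceback() afterwards to see the call stack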

Web scraping

Story

  • Dinner to meet your [girl/boy]friend’s parents
  • A good impression
  • His/Her father is a big fan of wine (Chasselas?)
  • You have no clue about wine
  • No boring wine encyclopedia
  • But you are a data scientist!
  • Vivino

Web scraping

Web scraping is the process of collecting data from the World Wide Web and transforming it into a structured format.

Types of web scraping:

  • Human manual copy-and-paste
  • Text pattern matching
  • Using an API or SDK (socket programming)
  • HTML parsing
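
A small sketch contrasting text pattern matching with HTML parsing on an in-memory HTML snippet (the snippet and the "wine" class are made up for illustration):

  library(rvest)

  html <- '<ul><li class="wine">Chasselas</li><li class="wine">Pinot Noir</li></ul>'

  # Text pattern matching: a regular expression tied to the exact markup (brittle).
  regmatches(html, gregexpr('(?<=class="wine">)[^<]+', html, perl = TRUE))[[1]]

  # HTML parsing: query the document structure with a CSS selector instead.
  read_html(html) %>% html_nodes(".wine") %>% html_text()

  # Both return "Chasselas" "Pinot Noir", but only the parser survives changes in the markup.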

Denial-of-service (DoS), crawl rate, and robots.txt

  • What is denial-of-service?
  • What is web crawling? Crawl rate?
  • What is robots.txt and what is the story behind it?
  • How is it related to web scraping?
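
A brief sketch of how to inspect these pieces from R with the robotstxt package (en.wikipedia.org is only an illustrative domain):

  library(robotstxt)

  # The raw file: for each User-agent it lists Disallow-ed paths and sometimes a
  # Crawl-delay (in seconds) that a well-behaved crawler should honour.
  cat(get_robotstxt("en.wikipedia.org"))

  # Parsed view of the same file.
  rt <- robotstxt(domain = "en.wikipedia.org")
  rt$permissions   # which bots may access which paths
  rt$crawl_delay   # requested delay between requests, if any

  # Is a specific path allowed for a generic bot?
  paths_allowed("https://en.wikipedia.org/wiki/Chasselas")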

Why can web scraping be seen negatively?

  • Badly behaved bots: crawl rate too high, accessing parts they should not
  • Unethical use of data: The Social Network (2010)

How to avoid troubles?

  1. Read the Terms of Service carefully.

  2. Respect robots.txt:

    • Identify which parts can be accessed.
    • Do not be too aggressive with too frequent requests. If the crawl rate is not mentioned in robots.txt (the Crawl-delay: directive), use something reasonable, e.g. one request per 10 seconds (see the sketch after this list).
  3. Make sure that your idea is ethical.
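
A minimal sketch of the pacing idea from point 2; the URLs and the 10-second delay are placeholders:

  library(robotstxt)

  urls <- c("https://example.com/page-1", "https://example.com/page-2")

  for (url in urls) {
    if (isTRUE(paths_allowed(url))) {   # only touch paths that robots.txt allows
      # ... fetch and process the page here ...
      Sys.sleep(10)                     # throttle: at most one request every 10 seconds
    }
  }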

How to web scrape?

0. Make sure that scraping is legal and ethical for the given web page (robotstxt).

1. Load (fetch) and parse the HTML file (xml2).

2. Identify the CSS selector for the elements of interest (selectorgadget).

3. Extract the data from the web page (rvest).
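
Putting the four steps together, a minimal sketch; the Wikipedia page on Chasselas stands in for the real target, and the CSS selectors are assumptions that may change as the page's HTML evolves:

  library(robotstxt)   # step 0
  library(xml2)        # step 1
  library(rvest)       # steps 2-3

  url <- "https://en.wikipedia.org/wiki/Chasselas"

  # 0. Check that robots.txt allows this path.
  paths_allowed(url)

  # 1. Fetch and parse the HTML document.
  page <- read_html(url)

  # 2.-3. Extract the elements matched by a CSS selector (here, the title and the section headings).
  page %>% html_node("h1") %>% html_text()
  page %>% html_nodes("h2") %>% html_text()

SelectorGadget is a browser tool that lets you click on page elements to discover a CSS selector for them, which you then pass to html_nodes().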

Your turn!