10/29/2019

Functions

Showcase

Problems

  • Unanticipated/unexpected (debugging)
    • browser(): pause execution and inspect the interactive state inside the function
    • traceback(): the call stack, i.e. the sequence of calls that led up to an error
  • Anticipated (condition handling)
    • stop(): fatal errors, force all execution to terminate (no way for a function to continue).
    • warning(): display potential problems (e.g., log(-1:2)).
    • message(): give informative output (e.g., let the user know what value the function has chosen for an important missing argument).
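
A minimal sketch of how these tools fit together; the function check_log() and its checks are illustrative, not from the lecture:

  check_log <- function(x, base = exp(1)) {
    if (!is.numeric(x)) {
      stop("`x` must be numeric.")                     # fatal: execution terminates here
    }
    if (any(x <= 0, na.rm = TRUE)) {
      warning("non-positive values give NaN or -Inf")  # potential problem, execution continues
    }
    message("Using base = ", base)                     # informative output about a chosen default
    # browser()  # uncomment to pause here and inspect the state interactively
    log(x, base = base)
  }

  check_log(-1:2)   # warns and messages, then returns c(NaN, -Inf, 0, log(2))
  # check_log("a")  # throws an error; call traceback() afterwards to see the call stack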

Web scraping

Story

  • Dinner to meet your [girl/boy]friend’s parents
  • A good impression
  • His/Her father is a big fan of wine (Chasselas?)
  • You have no clue about wine
  • No boring wine encyclopedia
  • But you are a data scientist!
  • Vivino

Web scraping

Web scraping is the process of collecting data from the World Wide Web and transforming it into a structured format.

Types of web scraping:

  • Human manual copy-and-paste
  • Text pattern matching
  • Using an API or SDK (socket programming)
  • HTML parsing
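
A small sketch contrasting text pattern matching with HTML parsing on an in-memory HTML snippet (the snippet and the "wine" class are made up for illustration):

  library(rvest)

  html <- '<ul><li class="wine">Chasselas</li><li class="wine">Pinot Noir</li></ul>'

  # Text pattern matching: a regular expression tied to the exact markup (brittle).
  regmatches(html, gregexpr('(?<=class="wine">)[^<]+', html, perl = TRUE))[[1]]

  # HTML parsing: query the document structure with a CSS selector instead.
  read_html(html) %>% html_nodes(".wine") %>% html_text()

  # Both return "Chasselas" "Pinot Noir", but only the parser survives changes in the markup.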

Denial-of-service (DoS), crawl rate, and robots.txt

  • What is denial-of-service?
  • What is web crawling? Crawl rate?
  • What is robots.txt and what is the story behind it?
  • How is it related to web scraping?
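
A brief sketch of how to inspect these pieces from R with the robotstxt package (en.wikipedia.org is only an illustrative domain):

  library(robotstxt)

  # The raw file: for each User-agent it lists Disallow-ed paths and sometimes a
  # Crawl-delay (in seconds) that a well-behaved crawler should honour.
  cat(get_robotstxt("en.wikipedia.org"))

  # Parsed view of the same file.
  rt <- robotstxt(domain = "en.wikipedia.org")
  rt$permissions   # which bots may access which paths
  rt$crawl_delay   # requested delay between requests, if any

  # Is a specific path allowed for a generic bot?
  paths_allowed("https://en.wikipedia.org/wiki/Chasselas")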

Why can web scraping be seen negatively?

  • Badly behaved bots: crawl rate too high, accessing parts they should not
  • Unethical use of data: The Social Network (2010)

How to avoid troubles?

  1. Read the Terms of Service carefully.

  2. Respect robots.txt:

    • Identify which parts can be accessed.
    • Do not be too aggressive with too frequent requests. If the crawl rate is not mentioned in robots.txt (the Crawl-delay: directive), use something reasonable, e.g. one request per 10 seconds (see the sketch after this list).
  3. Make sure that your idea is ethical.
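
A minimal sketch of the pacing idea from point 2; the URLs and the 10-second delay are placeholders:

  library(robotstxt)

  urls <- c("https://example.com/page-1", "https://example.com/page-2")

  for (url in urls) {
    if (isTRUE(paths_allowed(url))) {   # only touch paths that robots.txt allows
      # ... fetch and process the page here ...
      Sys.sleep(10)                     # throttle: at most one request every 10 seconds
    }
  }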

How to web scrape?

0. Make sure that scraping is legal and ethical for the given web page (robotstxt).

1. Load (fetch) and parse the HTML file (xml2).

2. Identify the CSS selector for the elements of interest (selectorgadget).

3. Extract the data from the web page (rvest).
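
Putting the four steps together, a minimal sketch; the Wikipedia page on Chasselas stands in for the real target, and the CSS selectors are assumptions that may change as the page's HTML evolves:

  library(robotstxt)   # step 0
  library(xml2)        # step 1
  library(rvest)       # steps 2-3

  url <- "https://en.wikipedia.org/wiki/Chasselas"

  # 0. Check that robots.txt allows this path.
  paths_allowed(url)

  # 1. Fetch and parse the HTML document.
  page <- read_html(url)

  # 2.-3. Extract the elements matched by a CSS selector (here, the title and the section headings).
  page %>% html_node("h1") %>% html_text()
  page %>% html_nodes("h2") %>% html_text()

SelectorGadget is a browser tool that lets you click on page elements to discover a CSS selector for them, which you then pass to html_nodes().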

Your turn!