```{r setup, include=FALSE}
opts_chunk$set(cache=TRUE)
```

Web Scraping part 2: Digging in
========================================================
width: 1200
author: Rolf Fredheim and Aiora Zabala
date: University of Cambridge
font-family: 'Rockwell'
25/02/2014

Today we will:
========================================================
- Become good at writing bad* functions
- Learn how to access information in web-pages (HTML, XML, etc)

Get the docs:
http://fredheir.github.io/WebScraping/Lecture2/p2.html

https://raw.github.com/fredheir/WebScraping/master/Lecture2/p2.Rpres

http://fredheir.github.io/WebScraping/Lecture2/p2.R


*maybe 'utilitarian', 'expedient', and 'functional' would be more accurate, if more pompous, descriptions of this practice

Digital data collection
=======================
- Devise a means of accessing data
- Retrieve that data
- Tabulate and store the data

Last week step two involved JSON.

Today we work with HTML.

Steps 1 and 3 stay the same

Revisiting the programming basics
================
type:section

Good functions
========================================================
What makes a function good?
- Clear name
- Instructions to user
- Short
- Performs a single task
- Is efficient
- Can handle errors
- Is predictable
- Does not use **global variables** (someone explain?)

Bad functions
========================================================
... break the rules/guidelines above. But they can be useful to:
- hide a script behind a function
- get an overview
- move on to the next task
- stop worrying about methods and error handling: this simplifies the process

If writing functions for your own use, it's ok* to write bad functions.

*But basic notes throughout the code reminding yourself what you did will be invaluable

Revision
============
What is a variable?

What are they for?

Variables
=================
type:sq

Two main purposes:

quicker to write
```{r}
uni= "The University of Cambridge"
uni
```

***

quicker to change the code.

It is good practice to declare variables near the start of your code

Paying tax: 9440 tax free
```{r}
(20000-9440)*20/100
#OR:
wage=20000
taxFree=9440
rate=20
(wage-taxFree)*rate/100
```

Functions without variables
================
```{r}
printName <- function(){
  print ("My name is Rolf Fredheim")
}

printName()
```

This is a useless function. But sometimes, if we have many lines of code requiring no particular input, it can be useful to file them away like this.

e.g. 
for simulations
============
```{r}
sillySimulation <- function(){
  x1 <- runif(500,80,100)
  x2 <- runif(500,0,100)
  v1 <- c(x1,x2)

  x3 <- runif(1000,0,100)

  df <- data.frame(v1,x3)
  require(ggplot2)

  print(ggplot(df, aes(v1,x3))+geom_point()+ggtitle("simulation of some sort"))
}
```

=====
Just as this slide hides the code on the previous slide, so the function hides the underlying code.
```{r}
sillySimulation()
```

Inserting variables
=========
Let's hammer home how to use variables

what variables could we add to the function below?
```{r}
desperateTimes <- function(){
  print(paste0("Rolf is struggling to finish his PhD on time. Time remaining: 6 months"))
}
```

Name
===========
```{r}
desperateTimes <- function(name){
  print(paste0(name ," is struggling to finish his PhD on time. Time remaining: 6 months"))
}

desperateTimes(name="Tom")
```

Gender
===========
type:sq

we specify a default value
```{r}
desperateTimes <- function(name,gender="m"){
  if(gender=="m"){
    pronoun="his"
  }else{
    pronoun="her"
  }

  print(paste0(name ," is struggling to finish ",pronoun," PhD on time. Time remaining: 6 months"))
}
desperateTimes(name="Tanya",gender="f")
```
Is this a good function? Why (not)?

degree
==============
```{r}
desperateTimes <- function(name,gender="m",degree){
  if(gender=="m"){
    pronoun="his"
  }else{
    pronoun="her"
  }

  print(paste0(name ," is struggling to finish ",pronoun," ",degree," on time. Time remaining: 6 months"))
}
desperateTimes(name="Rolf",gender="m","Mphil")
```

Days til deadline
============
type:sq1
```{r}
require(lubridate)
require(ggplot2)
deadline=as.Date("2014-09-01")
daysLeft <- deadline-Sys.Date()
totDays <- deadline-as.Date("2011-10-01")
print(daysLeft)
print(paste0("Rolf is struggling to finish his PhD on time. Days remaining: ", as.numeric(daysLeft)))
```

part2
==========
type:sq
```{r}
print(paste0("Percentage to go: ",round(as.numeric(daysLeft)/as.numeric(totDays)*100)))
df <- data.frame(days=c(daysLeft,totDays-daysLeft),lab=c("to go","completed"))
ggplot(df,aes(1,days,fill=lab))+geom_bar(stat="identity",position="fill")
```

===========
type:sq1

We could put all this code in a function, and forget about it
```{r}
timeToWorry <- function(){
  require(lubridate)
  deadline=as.Date("2014-09-01")
  daysLeft <- deadline-Sys.Date()
  totDays <- deadline-as.Date("2011-10-01")
  print(daysLeft)
  print(paste0("Rolf is struggling to finish his PhD on time. Days remaining: ", as.numeric(daysLeft)))
  print(paste0("Percentage to go: ",round(as.numeric(daysLeft)/as.numeric(totDays)*100)))
  df <- data.frame(days=c(daysLeft,totDays-daysLeft),lab=c("to go","completed"))
  ggplot(df,aes(1,days,fill=lab))+geom_bar(stat="identity",position="fill")
}
```

File it away until in need of a reminder
======
```{r}
timeToWorry()
```

Finishing up last week's material
=============================
type:section

What does this have to do with webscraping?
============
Bad functions like this will help us to break the task into bitesize chunks

Rather than working with long unruly scripts, we write a little script that works, identify any necessary variables, and file it away. A typical structure might be:

- Load packages, set working directory
- Download one example
- Extract the necessary information
- Store the information
- Repeat.
-> either by looping, or by completing one step at a time

Last week's code
=================
type:sq

Example and explanation of downloading data

Check the code is correct
```{r}
require(rjson)
url <- "http://stats.grok.se/json/en/201201/web_scraping"
raw.data <- readLines(url, warn=FALSE)
rd <- fromJSON(raw.data)
rd.views <- rd$daily_views
rd.views <- unlist(rd.views)
rd <- as.data.frame(rd.views)
rd$date <- rownames(rd)
rownames(rd) <- NULL
rd
```

Turn it into a function
========================
type:sq

"url" is the only thing that changes. Thus we have one variable

At the end we "return" the data to the user
```{r}
getData <- function(url){
  require(rjson)
  raw.data <- readLines(url, warn=FALSE)
  rd <- fromJSON(raw.data)
  rd.views <- rd$daily_views
  rd.views <- unlist(rd.views)
  rd <- as.data.frame(rd.views)
  rd$date <- rownames(rd)
  rownames(rd) <- NULL
  rd$date <- as.Date(rd$date)
  return(rd)
}
```
Now we can forget about *how* we download data, after checking the code works:

getData("http://stats.grok.se/json/en/201201/web_scraping")

============
The script and the function achieve exactly the same thing.
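One way to convince yourself of this without going online: a minimal sketch, assuming a made-up list `parsed` that stands in for what `fromJSON` returns, and a hypothetical helper `tabulateViews` holding only the tabulation steps. Neither name is part of last week's code, and the view counts are invented.

```r
# Hypothetical stand-in for the parsed JSON (made-up numbers)
parsed <- list(daily_views = list("2012-01-01" = 283,
                                  "2012-01-02" = 381,
                                  "2012-01-03" = 301))

# Script version: tabulate step by step
rd.views <- unlist(parsed$daily_views)
script.result <- as.data.frame(rd.views)
script.result$date <- as.Date(rownames(script.result))
rownames(script.result) <- NULL

# Function version: the same steps, filed away (hypothetical helper)
tabulateViews <- function(parsed){
  rd.views <- unlist(parsed$daily_views)
  rd <- as.data.frame(rd.views)
  rd$date <- as.Date(rownames(rd))
  rownames(rd) <- NULL
  return(rd)
}

# Compare the two results
same <- identical(script.result, tabulateViews(parsed))
same
```

Comparing the two data frames with `identical` is a cheap check to run before filing a function away.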
But: compressing the code to a single function is good to relieve the brain, and to de-clutter your code

Creating the URLs
=========
type:sq1

```{r}
getUrls <- function(y1,y2,term){
  root="http://stats.grok.se/json/en/"
  urls <- NULL
  for (year in y1:y2){
    for (month in 1:9){
      urls <- c(urls,(paste(root,year,0,month,"/",term,sep="")))
    }

    for (month in 10:12){
      urls <- c(urls,(paste(root,year,month,"/",term,sep="")))
    }
  }
  return(urls)
}
```

Put it together
======
type:sq2
```{r}
#create some URLs
urls <- getUrls(y1=2013,y2=2014,"Euromaidan")

#get data for each of them and store that data
results=NULL
for (url in urls){
  results <- rbind(results,getData(url))
}
head(results)
```

Inspect the data, visualise it

***

```{r}
ggplot(tail(results,100),aes(date,rd.views))+geom_line()
```

OK, let's move on
=============
type:section

- HTML
- XPath
- CSS and attributes

Getting to know HTML structure
==============================
http://en.wikipedia.org/wiki/Euromaidan

Let's look at this webpage

- Headings
- Images
- links
- references
- tables

To look at the code (in Google Chrome), right-click somewhere on the page and select 'inspect element'

Tree-structure (parents, siblings)

Back to Wikipedia
====================
HTML tags.

They come in pairs and are surrounded by these guys:
<>

e.g. a heading might look like this:

\<h1\>MY HEADING\</h1\>

<h1>MY HEADING</h1>

Which others do you know or can you find?

HTML tags
======================
- \<html>: starts html code
- \<head> : contains meta data etc
- \