--- title: "Web Scraping" author: "Dr. Hua Zhou @ UCLA" date: "Feb 11, 2020" output: # ioslides_presentation: default html_document: toc: true toc_depth: 4 subtitle: Biostat 203B --- ```{r setup, include=FALSE} knitr::opts_chunk$set(fig.align = 'center', cache = TRUE) ``` ```{r} sessionInfo() ``` Load tidyverse and other packages for this lecture: ```{r} library("tidyverse") library("rvest") library("quantmod") ``` ## Web scraping There is a wealth of data on internet. How to scrape them and analyze them? ## rvest [rvest](https://github.com/hadley/rvest) is an R package written by Hadley Wickham which makes web scraping easy. ## Example: Scraping from webpage - We follow instructions in a [Blog by SAURAV KAUSHIK](https://www.analyticsvidhya.com/blog/2017/03/beginners-guide-on-web-scraping-in-r-using-rvest-with-hands-on-knowledge/) to find the most popular feature films of 2019. - Install the [SelectorGadget](https://selectorgadget.com/) extension for Chrome. - The 100 most popular feature films released in 2019 can be accessed at page . ```{r} #Loading the rvest and tidyverse package #Specifying the url for desired website to be scraped url <- "http://www.imdb.com/search/title?count=100&release_date=2019,2019&title_type=feature" #Reading the HTML code from the website (webpage <- read_html(url)) ``` - Suppose we want to scrape following 11 features from this page: - Rank - Title - Description - Runtime - Genre - Rating - Metascore - Votes - Gross_Eerning_in_Mil - Director - Actor

### Rank

- Use SelectorGadget to highlight the element we want to scrape.

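SelectorGadget suggests the selector `.text-primary` for the ranking numbers. As a quick sanity check (our addition, not in the original blog), confirm that the selector matches exactly one node per film:
```{r}
# The selector should match 100 nodes, one rank per film
webpage %>% html_nodes(".text-primary") %>% length()
```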
- Use the CSS selector `.text-primary` to get the rankings:
```{r}
# Use CSS selectors to scrape the rankings section
(rank_data_html <- html_nodes(webpage, '.text-primary'))
# Convert the ranking data to text
(rank_data <- html_text(rank_data_html))
# Turn into numerical values
(rank_data <- as.integer(rank_data))
```

### Title

- Use SelectorGadget to find the CSS selector `.lister-item-header a`.
```{r}
# Use CSS selectors to scrape the title section
(title_data_html <- html_nodes(webpage, '.lister-item-header a'))
# Convert the title data to text
(title_data <- html_text(title_data_html))
```

### Description

- Scrape each movie's one-line description, then strip the leading whitespace:
```{r}
# Use CSS selectors to scrape the description section
(description_data_html <- html_nodes(webpage, '.ratings-bar+ .text-muted'))
# Convert the description data to text
description_data <- html_text(description_data_html)
# Take a look at the first few
head(description_data)
# Strip the leading '\n' and spaces
description_data <- str_replace(description_data, "^\\n\\s+", "")
head(description_data)
```
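- rvest's `html_text()` can also strip surrounding whitespace for us via its `trim` argument, a compact alternative to the `str_replace()` step above:
```{r}
# Alternative: html_text(trim = TRUE) strips leading/trailing whitespace
webpage %>%
  html_nodes('.ratings-bar+ .text-muted') %>%
  html_text(trim = TRUE) %>%
  head()
```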
### Runtime

- A compact pipe version:
```{r}
# Use CSS selectors to scrape the movie runtime section
(runtime_data <- webpage %>%
  html_nodes('.runtime') %>%
  html_text() %>%
  str_replace(" min", "") %>%
  as.integer())
```

- The same steps, one at a time:
```{r}
# Use CSS selectors to scrape the movie runtime section
runtime_data_html <- html_nodes(webpage, '.runtime')
# Convert the runtime data to text
runtime_data <- html_text(runtime_data_html)
# Let's have a look at the runtime
head(runtime_data)
# Data preprocessing: remove " min" and convert to numerical
runtime_data <- str_replace(runtime_data, " min", "")
runtime_data <- as.numeric(runtime_data)
# Let's have another look at the runtime data
head(runtime_data)
```

### Genre

- Collect the (first) genre of each movie:
```{r}
# Use CSS selectors to scrape the movie genre section
genre_data_html <- html_nodes(webpage, '.genre')
# Convert the genre data to text
genre_data <- html_text(genre_data_html)
# Let's have a look at the genre data
head(genre_data)
# Data preprocessing: retrieve the first word
genre_data <- str_extract(genre_data, "[:alpha:]+")
# Converting each genre from text to factor
#genre_data <- as.factor(genre_data)
# Let's have another look at the genre data
head(genre_data)
```

### Rating

- Scrape the IMDB ratings:
```{r}
# Use CSS selectors to scrape the IMDB rating section
rating_data_html <- html_nodes(webpage, '.ratings-imdb-rating strong')
# Convert the ratings data to text
rating_data <- html_text(rating_data_html)
# Let's have a look at the ratings
head(rating_data)
# Data preprocessing: convert ratings to numerical
rating_data <- as.numeric(rating_data)
# Let's have another look at the ratings data
rating_data
```

### Votes

- Scrape the vote counts; note that the numbers contain commas:
```{r}
# Use CSS selectors to scrape the votes section
votes_data_html <- html_nodes(webpage, '.sort-num_votes-visible span:nth-child(2)')
# Convert the votes data to text
votes_data <- html_text(votes_data_html)
# Let's have a look at the votes data
head(votes_data)
# Data preprocessing: remove all commas (str_replace_all, not str_replace,
# so counts with more than one comma are handled too)
votes_data <- str_replace_all(votes_data, ",", "")
# Data preprocessing: convert votes to numerical
votes_data <- as.numeric(votes_data)
# Let's have another look at the votes data
votes_data
```

### Director

- Scrape the (first) director of each movie:
```{r}
# Use CSS selectors to scrape the directors section
(directors_data_html <- html_nodes(webpage, '.text-muted+ p a:nth-child(1)'))
# Convert the directors data to text
directors_data <- html_text(directors_data_html)
# Let's have a look at the directors data
directors_data
```

### Actor

- Scrape the (first) actor of each movie:
```{r}
# Use CSS selectors to scrape the actors section
(actors_data_html <- html_nodes(webpage, '.lister-item-content .ghost+ a'))
# Convert the actors data to text
actors_data <- html_text(actors_data_html)
# Let's have a look at the actors data
head(actors_data)
```

### Metascore

- Be careful with missing data.
```{r}
# Use CSS selectors to scrape the metascore section
metascore_data_html <- html_nodes(webpage, '.metascore')
# Convert the metascore data to text
metascore_data <- html_text(metascore_data_html)
# Let's have a look at the metascore
head(metascore_data)
# Data preprocessing: remove trailing spaces in metascore
metascore_data <- str_replace(metascore_data, "\\s*$", "")
metascore_data <- as.numeric(metascore_data)
metascore_data
# Let's check the length of metascore data
length(metascore_data)
# Visual inspection finds movies 24, 85, 100 don't have a metascore
ms <- rep(NA, 100)
ms[-c(24, 85, 100)] <- metascore_data
(metascore_data <- ms)
```

### Gross

- Be careful with missing data.
```{r}
# Use CSS selectors to scrape the gross revenue section
gross_data_html <- html_nodes(webpage, '.ghost~ .text-muted+ span')
# Convert the gross revenue data to text
gross_data <- html_text(gross_data_html)
# Let's have a look at the gross data
head(gross_data)
# Data preprocessing: remove the '$' and 'M' signs
gross_data <- str_replace(gross_data, "M", "")
gross_data <- str_sub(gross_data, 2, 10)
#(gross_data <- str_extract(gross_data, "[:digit:]+.[:digit:]+"))
gross_data <- as.numeric(gross_data)
# Let's check the length of gross data
length(gross_data)
# Visual inspection of missing entries does not scale:
#gs_data <- rep(NA, 100)
#gs_data[-c(1, 2, 3, 5, 61, 69, 71, 74, 78, 82, 84:87, 90)] <- gross_data
#(gross_data <- gs_data)
```
60 (out of 100) movies don't have gross data yet! We need a better way to figure out the missing entries. Scrape the ranks and gross earnings together:
```{r}
(rank_and_gross <- webpage %>%
  html_nodes('.ghost~ .text-muted+ span , .text-primary') %>%
  html_text() %>%
  str_replace("\\s+", "") %>%
  str_replace_all("[$M]", ""))
```
A rank followed by another rank means the gross for the first rank is missing:
```{r}
isrank <- str_detect(rank_and_gross, "\\.$")
ismissing <- isrank[1:(length(rank_and_gross) - 1)] & isrank[2:length(rank_and_gross)]
ismissing[length(ismissing) + 1] <- isrank[length(isrank)]
missingpos <- as.integer(rank_and_gross[ismissing])
gs_data <- rep(NA, 100)
gs_data[-missingpos] <- gross_data
(gross_data <- gs_data)
```

### Missing entries - a more reproducible way

- The following code programmatically figures out the missing entries for metascore.
```{r}
# Use CSS selectors to scrape the rankings and metascores together
(rank_metascore_data_html <- html_nodes(webpage, '.unfavorable , .favorable , .mixed , .text-primary'))
# Convert to text
(rank_metascore_data <- html_text(rank_metascore_data_html))
# Strip spaces
(rank_metascore_data <- str_replace(rank_metascore_data, "\\s+", ""))
# A rank followed by another rank means the metascore for the 1st rank is missing
(isrank <- str_detect(rank_metascore_data, "\\.$"))
ismissing <- isrank[1:(length(rank_metascore_data) - 1)] & isrank[2:length(rank_metascore_data)]
ismissing[length(ismissing) + 1] <- isrank[length(isrank)]
(missingpos <- as.integer(rank_metascore_data[ismissing]))
#(rank_metascore_data <- as.integer(rank_metascore_data))
```

- You (students) should work out the code for finding the missing positions for gross.
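The rank-followed-by-rank trick can also be wrapped into a reusable function. Below is a sketch of our own (the name `fill_missing` is not from the blog); it assumes `x` is the cleaned, interleaved rank/value text vector, with ranks ending in a period:
```{r}
# Generic helper (our own sketch): fill a length-n vector from interleaved
# rank/value text, putting NA wherever a rank has no value after it
fill_missing <- function(x, n = 100) {
  isrank <- str_detect(x, "\\.$")            # ranks end with a period
  # a rank followed by another rank (or by nothing) has a missing value
  ismissing <- isrank & c(isrank[-1], TRUE)
  missingpos <- as.integer(x[ismissing])
  out <- rep(NA_real_, n)
  if (length(missingpos) > 0) {
    out[-missingpos] <- as.numeric(x[!isrank])
  } else {
    out <- as.numeric(x[!isrank])
  }
  out
}
# Reproduce the gross data with the helper
fill_missing(rank_and_gross)
```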
### Visualizing movie data

- Form a tibble:
```{r}
# Combine all the vectors into a data frame
movies <- tibble(Rank = rank_data, 
                 Title = title_data,
                 Description = description_data, 
                 Runtime = runtime_data,
                 Genre = genre_data, 
                 Rating = rating_data,
                 Metascore = metascore_data, 
                 Votes = votes_data,
                 Gross_Earning_in_Mil = gross_data,
                 Director = directors_data, 
                 Actor = actors_data)
movies %>% print(width = Inf)
```

- How many of the top 100 movies are in each genre? (Be careful with the interpretation: only the first listed genre of each movie is counted.)
```{r}
movies %>%
  ggplot() +
  geom_bar(mapping = aes(x = Genre))
```

- Which genre is most profitable in terms of average gross earnings?
```{r}
movies %>%
  group_by(Genre) %>%
  summarise(avg_earning = mean(Gross_Earning_in_Mil, na.rm = TRUE)) %>%
  ggplot() +
  geom_col(mapping = aes(x = Genre, y = avg_earning)) + 
  labs(y = "avg earning in millions")
```
```{r}
ggplot(data = movies) +
  geom_boxplot(mapping = aes(x = Genre, y = Gross_Earning_in_Mil)) +
  labs(y = "Gross earning in millions")
```

- Is there a relationship between gross earning and rating? Label the best selling movie (by gross earning) in each genre:
```{r}
library("ggrepel")
(best_in_genre <- movies %>%
  group_by(Genre) %>%
  filter(row_number(desc(Gross_Earning_in_Mil)) == 1))
ggplot(movies, mapping = aes(x = Rating, y = Gross_Earning_in_Mil)) +
  geom_point(mapping = aes(size = Votes, color = Genre)) + 
  ggrepel::geom_label_repel(aes(label = Title), data = best_in_genre) +
  labs(y = "Gross earning in millions")
```

## Example: Scraping image data from Google

Complete search operators are described at .
```{r}
searchTerm <- "ucla"
# tbm=isch (images), app (apps), bks (books), nws (news), pts (patents), vid (videos)
# tbs=isz:m (medium images)
(url <- str_c("https://www.google.com/search?q=", searchTerm, 
              "&source=lnms&tbm=isch&sa=X&tbs=isz:m"))
webpage <- read_html(url)
(imageurl <- webpage %>%
  html_nodes("img") %>%
  html_attr("src"))
```
The following code is still not working...
```{r, error = TRUE}
downloadImages <- function(files, brand, outPath = "images") {
  # keep only proper http(s) URLs; many thumbnails are inlined as base64
  # data URIs, which download.file() cannot fetch
  files <- files[str_detect(files, "^https?://")]
  for (i in seq_along(files)) {
    download.file(files[i], 
                  destfile = paste0(outPath, "/", brand, "_", i, ".jpg"),
                  mode = 'wb')
  }
}
downloadImages(imageurl, "ucla")
```
```{bash}
ls images/
```

## Example: Scraping finance data

- The `quantmod` package contains many utility functions for retrieving and plotting finance data. E.g.,
```{r}
library(quantmod)
stock <- getSymbols("GOOG", src = "yahoo", auto.assign = FALSE, from = "2010-02-11")
head(stock)
chartSeries(stock, theme = chartTheme("white"), 
            type = "line", log.scale = FALSE, TA = NULL)
```
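- quantmod also ships helpers for turning a price series into returns. A quick sketch using the `stock` object above (`dailyReturn()` and `monthlyReturn()` are quantmod functions):
```{r}
# Daily and monthly returns computed from the xts price series
head(dailyReturn(stock))
head(monthlyReturn(stock))
```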
## Example: Pull tweets into R

- Read blog: 

- The [`twitteR` package](https://www.rdocumentation.org/packages/twitteR) is useful for pulling tweet text data into R.
```{r}
library(twitteR) # load package
```

- Step 1: Apply for a [Twitter developer account](https://developer.twitter.com). It takes some time to get approved.

- Step 2: Generate and copy the Twitter App keys.
```
consumer_key <- 'XXXXXXXXXX'
consumer_secret <- 'XXXXXXXXXX'
access_token <- 'XXXXXXXXXX'
access_secret <- 'XXXXXXXXXX'
```
```{r include=FALSE}
consumer_key <- 'P952Y45yrM1Xu9ez3r00AUH57'
consumer_secret <- 'KVmvwfL3dmc0TsY81fo3CPwCMTRldS5CzceM2JhyO3xSjZpwEH'
access_token <- '783517197012504576-B5Q1D0whnX2KzpHnJJwUHErU8ttGtGQ'
access_secret <- 'oZoUWWZ7JD1aT3BOAc92zC98dYa1nVETbBHnUoquaKjvO'
```

- Step 3: Set up authentication.
```{r}
setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)
```

- Step 4: Pull tweets.
```{r}
virus <- searchTwitter('#China + #Coronavirus', n = 1000, since = '2020-01-01', 
                       retryOnRateLimit = 1e3)
virus_df <- as_tibble(twListToDF(virus))
virus_df %>% print(width = Inf)
```

## Example: Import data from Google sheets

See HW3.
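As a preview, a minimal sketch using the [googlesheets4](https://googlesheets4.tidyverse.org) package; the sheet URL below is a placeholder, and `gs4_deauth()` assumes the sheet is shared publicly (function names are from recent googlesheets4 releases):
```{r, eval = FALSE}
library(googlesheets4)
# skip OAuth for a public ("anyone with the link") sheet
gs4_deauth()
# placeholder URL: replace with the sheet used in HW3
sheet_url <- "https://docs.google.com/spreadsheets/d/XXXXXXXXXX"
(dat <- read_sheet(sheet_url))
```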