--- title: "Lecture 11: Working with Strings and Date/Time Variables" # potentially push to header subtitle: "Managing and Manipulating Data Using R" author: date: classoption: dvipsnames # for colors fontsize: 8pt urlcolor: blue output: beamer_presentation: keep_tex: true toc: false slide_level: 3 theme: default # AnnArbor # push to header? #colortheme: "dolphin" # push to header? #fonttheme: "structurebold" highlight: tango # Supported styles include "default", "tango", "pygments", "kate", "monochrome", "espresso", "zenburn", and "haddock" (specify null to prevent syntax highlighting); push to header df_print: tibble #default # tibble # push to header? latex_engine: xelatex # Available engines are pdflatex [default], xelatex, and lualatex; The main reasons you may want to use xelatex or lualatex are: (1) They support Unicode better; (2) It is easier to make use of system fonts. includes: in_header: ../beamer_header.tex #after_body: table-of-contents.txt --- ```{r, echo=FALSE, include=FALSE} knitr::opts_chunk$set(collapse = TRUE, comment = "#>", highlight = TRUE) #knitr::opts_chunk$set(collapse = TRUE, comment = "#>", highlight = TRUE) #comment = "#>" makes it so results from a code chunk start with "#>"; default is "##" ``` # Introduction ### Logistics __Reading__ - GW chapter 14 (Strings) - GW chapter 16 (Dates and Times) __No class next week (11/13)__ - Problem set for strings + date/times still due on 11/13 - No problem set due on 11/20 ### What we will do today \tableofcontents ```{r, eval=FALSE, echo=FALSE} #Use this if you want TOC to show level 2 headings \tableofcontents #Use this if you don't want TOC to show level 2 headings \tableofcontents[subsectionstyle=hide/hide/hide] ``` ### Load the packages we will use today (output omitted) - __you must run this code chunk after installing these packages__ ```{r warning=FALSE, message=FALSE} library(tidyverse) library(stringr) library(lubridate) library(nycflights13) ``` __If package not yet installed__, then must install before you load. Install in "console" rather than .Rmd file - Generic syntax: `install.packages("package_name")` - Install "tidyverse": `install.packages("stringr")` Note: when we load package, name of package is not in quotes; but when we install package, name of package is in quotes: - `install.packages("tidyverse")` - `library(tidyverse)` ### Load data we will use today - Western Washington University student list data ```{r} load(url("https://github.com/ozanj/rclass/raw/master/data/prospect_list/wwlist_merged.RData")) ``` - Jeopardy text data [Link](https://blog.cambridgespark.com/50-free-machine-learning-datasets-natural-language-processing-d88fb9c5c8da) ```{r} jeopardy <- read_csv("https://raw.githubusercontent.com/ozanj/rclass/master/data/text_data/JEOPARDY_CSV.csv") ``` # Working with Strings ## String basics ### What are strings? String refers to a "data type" used in programming to represent text rather than numbers (although it can include numbers) - Strings have `character`types ```{r} string1<- "Apple" typeof(string1) #type is charater ``` - Create strings using `" "` ```{r} string2 <- "This is a string" ``` - If string contains a quotation, use `' " " '` ```{r} string3 <- 'example of a "quote" within a string' ``` - To print a string, use `writeLines()` ```{r} print(string3) #will print using \ writeLines(string3) ``` ### Common uses of strings Basic uses: - Names of files and directories ```{r, results='hide'} acs_tract <- read_csv("https://raw.githubusercontent.com/ozanj/rclass/master/data/acs/tracts.csv") ``` - Names of elements in data objects ```{r} num_vec <- 1:5 names(num_vec) <- c('uno', 'dos', 'tres', 'cuatro', 'cinco') num_vec ``` ### Common uses of strings - Text elements displayed in plots, graphs, maps ```{r} plot(iris$Petal.Length, iris$Petal.Width, main = 'Title', sub = 'Subtitle', xlab = 'x-axis label', ylab = 'y-axis label', col = 'red') ``` ### String basics We will use the `stringr` library for working with strings, rather than Base R - `stringr` functions have intuitive names and all begin with `str_` - Base R functions for working with strings can be inconsistent (avoid using them) __Basic functions:__ - String length using `str_length()` ```{r} #example 1 string2 <- "This is a string" str_length(string2) #example 2 str_length(c("a", "strings are fun", NA)) ``` ### Combining strings - Combining strings using `str_c()` ```{r} #example 1 x_var <- "x" y_var <- "y" str_c(x_var, y_var) #example 2 str_c("x", "y") ``` - Use `sep` argument to control how strings are seperated when combined ```{r} str_c("x", "y", sep= ", ") ``` - `NA` are still contagious, if you want a string "NA" rather than `NA` use `str_replace_na()` ```{r} street_dir<- c("East", "West", NA) str_c("Direction: ", street_dir) str_c("Direction: ", str_replace_na(street_dir)) ``` ### Subsetting strings -Extract parts of a string using `str_sub()`, which uses `start` and `end` arguments to extract the position of the substring wanted ```{r} fruits<- c("Apple", "Banana", "Orange") #first three elements str_sub(fruits, 1, 3) #end argument in inclusive #last three elements str_sub(fruits, -3, -1) #neg nums count backwards from end ``` - __Task: extract 6-digit zip code from `zip9` in `wwlist`__ ```{r, results="hide"} wwlist %>% mutate( zip=str_sub(zip9, 1, 5) ) ``` ### Lower-case and Upper-case functions - Changing strings to lower or upper case ```{r} str_to_lower("HELLO") str_to_upper("hello") ``` - __Task: lower-case `hs_name` in `wwlist`__ ```{r} wwlist %>% select(receive_date, hs_name) %>% mutate( hs_name_lwr=str_to_lower(hs_name), ) ``` ### Student Exercises 1. Combine `school_type` and `school_category` in the `wwlist` dataframe to create one school type + category varibale. Be sure to seperate type and category using a comma AND deal with contagious NAs by using string "NA" if `school_type` and/or `school_category` are `NA`. 1. The last four digits of `zip9` indicate the delivery route within the 5-digit zip code area. Create a new `route` variable that extracts the last four digits from `zip9`. ### Student Excercises (Solutions) 1. Combine `school_type` and `school_category` in the `wwlist` dataframe to create one school type + category varibale. Be sure to seperate type and category using a comma AND deal with contagious NAs by using string "NA" if `school_type` and/or `school_category` are `NA`. ```{r} wwlist %>% select(school_type, school_category) %>% mutate( type_cat= str_c(str_replace_na(school_type), str_replace_na(school_category), sep= ", ") ) ``` ### Student Excercises (Solutions) 1. The last four digits of `zip9` indicate the delivery route within the 5-digit zip code area. Create a new `route` variable that extracts the last four digits from `zip9`. ```{r} wwlist %>% select(zip9) %>% mutate( route=str_sub(zip9, -4, -1) ) ``` ## Regular Expressions ### What are regular expressions (e.g., regex)? __Regular expressions are an entirely different and concise "language" used to describe patterns in strings__ - One of the most powerful and sophisticated data science tools! - They have a wide range of uses - They are universal: can be used and are consistent across any programming language (e.g., R, Python, JavaScript) - __BUT__ they take a while to wrap your head around and can get really complex really quickly! __I will attempt to give an approachable introduction to regular expressions__ - I still stuggle with regular expression tasks! - My favorite tool for building, testing, debugging regular expressions: [web regex app](https://regex101.com/) ### Basic Matches The simplest patterns match exact (sub)strings! - `str_view()` shows the first match; `str_view_all` shows all the matches ```{r} x <- c("apple", "banana", "pear") # str_view(x, "an") #uncomment to view outside of beamer presentation ``` - To detect matches in a column of a dataframe, use `str_detect` and `filter()` - `str_detect` determines a match and returns a logical vector the same length as the input __Task:__ Detect whether high school names abbreviate "high school" as "HS"? ```{r} wwlist %>% select(hs_name) %>% filter(str_detect(hs_name, "HS")) ``` ### Basic Matches The next step-up in complexity is using `.` which matches any character (including white space but except a newline) __Task:__ Detect whether there are any "HS" abbreviations that have _any_ character before and after the abbreviation? ```{r} wwlist %>% select(hs_name) %>% filter(str_detect(hs_name, ".HS.")) ``` ### Anchors Regular expressions will match any part of a string. Sometimes it's useful to _anchor_ the regular expression so that it matches from the start or the end of the string: - `^` will match the start of the string - `$` will match the end of the string ```{r} x <- c("apple", "banana", "pear") str_detect(x, "^a") str_detect(x, "a$") ``` ### Escapes Regular expressions use special characters (i.e., `.`, `\`) to match patterns in strings. If you're trying to match an actual character exactly rather than use it's special behavior, you need to use an _escape_ - The double backslash `\\` is used to _escape_ special behavior for these characters ```{r} # To create the regular expression, we need \\ dot <- "\\." # But the expression itself only contains one: writeLines(dot) # And this tells R to look for an explicit . str_detect(c("abc", "a.c", "bef"), "a\\.c") ``` - for a full list of escape characters type the following into your console: `?"'" ` ### Common special patterns There are other "special patterns" that will match more than one character and can be really useful. __Matching characters__ - `.` matches any character - `\d` matches any digit (or `\D` for non-digits) - `\s` matches any whitespace such as space, tab, newline (or `\S` fir non-whitepsace) __Matching alternates__ - `[abe]` matches one of a, b, or e - `[^ab3]` matches anything but a, b, e - `[a-f]` matches range ### Tools: other `stringr` Functions I only highlighted a few `stringr` functions in this lecture. But there are many functions that are helpful in applying regular expressions to real data problems (i.e., determining match, finding positions of matches, extracting context of matches, replacing values based on matches) - [`stringr` cheat sheet](http://edrub.in/CheatSheets/cheatSheetStringr.pdf) - Some common functions: | **Task** | **Function** | | ------ | -------- | | Detect matches | `str_detect`, `str_which`, `str_count`, `str_locate` | | Subset strings | `str_sub`, `str_subset`, `str_extract`, `str_match` | | Mutate strings | `str_sub`, `str_replace`, `str_to_lower`, `str_to_upper` | | Join or split strings | `str_c`, `str_dup`, `str_plit_fixed`| ### Examples __Task:__ Look for any Jeopardy categories that begin with a digit ```{r} jeopardy %>% select(Category) %>% filter(str_detect(Category, "^\\d")) ``` ### Examples __Task:__ Look for any Jeopardy categories that may contain a year variable as a sequence of 4 digits. Create a new `year_category` indicator variable if the Jeopardy category involves a year. ```{r} jeopardy1 <- jeopardy %>% select(Category) %>% mutate( year_category=str_detect(Category, "[0-9][0-9][0-9][0-9]") ) ``` - Print some observations ```{r} jeopardy1 %>% select(Category, year_category) %>% filter(year_category==TRUE) rm(jeopardy1) ``` ### Why are string manipulations and regular expressions useful? Basic examples: - Dealing with identification numbers (leading or trailing zeros) ```{r} typeof(acs_tract$fips_county_code) acs_tract <- acs_tract %>% mutate(char_county= str_pad(as.character(fips_county_code), side = "left" ,3, pad="0")) ``` - Complex reshaping (tidying) of data - Problem: multiple variables crammed into the column names - new_ prefix = new cases - sp/rel/sp/ep describe how the case was diagnosed - m/f gives the gender - digits are age ranges ```{r, eval=FALSE, echo=FALSE} # explore WHO data # tuberculosis cases broken down by year, country, age, gender, and diagnosis method #View(who) #names(who) ``` ```{r, results="hide"} who %>% pivot_longer( cols = new_sp_m014:newrel_f65, names_to = c("diagnosis", "gender", "age"), names_pattern = "new_?(.*)_(.)(.*)", values_to = "count" ) ``` ### Why are string manipulations and regular expressions useful? Advanced examples: - Web-scraping - Find and scrape all linked pages of recruiters assigned by states: (https://gobama.ua.edu/staff/) - Parsing raw HTML to convert it into tabular data - Natural Language Processing - Analyzing university president speeches for promotion of interdisciplinary research (IDR) - Predict sentiment of promotion of IDR # Working with Dates and Times ### Working with date/time variables Working with dates and times in data management seems simpler than it really is! - Does every year have 365 days? - Does every day have 24 hours? - Does every minute have 60 seconds? These details matter for: - Calculating changes over time - Analyzing longitudinal data - Predicting the occurance/timing of events There are three ways you're likely to create a date/time variable: - From a string (most common) - From date and time individual components - From an existing date/time object ## Creating Date/Times ### Creating Date/Times from strings The most common way you're likely to create Date/Time variables is from primary/secondary data where dates and times are recorded and/or stores as strings. For Dates: - Use `lubridate` "helpers" to identify the order of year/month/day ```{r} ymd("2017/01/31") mdy("January 31st, 2017") dmy("31-01-2017") ymd(20170131) ``` For Dates: - Use `lubridate` "helpers" to identify the order of year/month/day __AND__ hours/minutes/seconds ```{r} ymd_hms("2017-01-31 20:11:59") mdy_hm("01/31/2017 08:01") ``` ### Creating Date/Times from individual variables What if your dates and times are recorded across multiple columns/variables? EX: NYC flights data ```{r} flights %>% select(year, month, day, hour, minute) ``` - Create a date variable using `make_date()` ```{r} flights1<- flights %>% select(year, month, day) %>% mutate( depart= make_date(year, month, day) ) typeof(flights1$depart) class(flights1$depart) ``` - Create a date variable using `make_datetime()` ```{r} flights1<- flights %>% select(year, month, day, hour, minute) %>% mutate( depart= make_datetime(year, month, day, hour, minute ) ) typeof(flights1$depart) class(flights1$depart) ``` ## Using Date/Time Variables ### Time spans Arithmatic with dates works differently than with any numeric type! There are three date/time classes that represent time spans: - Durations: represent the duration of time to an exact number of seconds - Periods: represent the period of time such as weeks/months/years - Intervals: represent a starting and end point in time __Durations__ When you subtract two dates, the result is a `difftime` object - A `difftime` object records time span as seconds (not intuitive) - Use `as.duration` to make the difftime object more intuitive (but records time span in seconds) ```{r} # How old is Karina? k_age <- today() - ymd(19890321) k_age typeof(k_age) class(k_age) as.duration(k_age) ``` - Durations uses "constructers" ```{r} dseconds(30) dminutes(10) dweeks(3) dyears(1) ``` Because durations are recorded in seconds, this can sometimes create some odd results ```{r} one_pm <- ymd_hms("2016-03-12 13:00:00", tz = "America/New_York") one_pm # 1pm one_pm + ddays(1) #2pm (DST) ``` ### Time spans __Periods__ Periods don't record time spans in exact seconds and are more intuitive to the way we think about time! ```{r} one_pm <- ymd_hms("2016-03-12 13:00:00", tz = "America/New_York") one_pm # 1pm one_pm + days(1) #1pm one_pm + years(1) ```