---
title: "Introduction to strings, dates, and time"
subtitle: "EDUC 260A: Managing and Manipulating Data Using R"
author:
date:
urlcolor: blue
output:
html_document:
toc: true
toc_depth: 2
toc_float: true # toc_float option to float the table of contents to the left of the main document content. floating table of contents will always be visible even when the document is scrolled
#collapsed: false # collapsed (defaults to TRUE) controls whether the TOC appears with only the top-level (e.g., H2) headers. If collapsed initially, the TOC is automatically expanded inline when necessary
#smooth_scroll: true # smooth_scroll (defaults to TRUE) controls whether page scrolls are animated when TOC items are navigated to via mouse clicks
number_sections: true
fig_caption: true # ? this option doesn't seem to be working for figure inserted below outside of r code chunk
highlight: tango # Supported styles include "default", "tango", "pygments", "kate", "monochrome", "espresso", "zenburn", and "haddock" (specify null to prevent syntax highlighting)
theme: default # theme specifies the Bootstrap theme to use for the page. Valid themes include default, cerulean, journal, flatly, readable, spacelab, united, cosmo, lumen, paper, sandstone, simplex, and yeti.
df_print: tibble #options: default, tibble, paged
---
```{r, echo=FALSE, include=FALSE}
knitr::opts_chunk$set(collapse = TRUE, comment = "#>", highlight = TRUE)
#comment = "#>" makes it so results from a code chunk start with "#>"; default is "##"
```
# Introduction
Load packages:
```{r, message=FALSE}
library(tidyverse)
library(stringr) # package for manipulating strings (part of tidyverse)
library(lubridate) # package for working with dates and times
#library(rvest) # package for reading and manipulating HTML
```
Resources used to create this lecture:
- https://r4ds.had.co.nz/strings.html
- https://www.tutorialspoint.com/r/r_strings.htm
- https://swcarpentry.github.io/r-novice-inflammation/13-supp-data-structures/
- https://www.statmethods.net/input/datatypes.html
- https://www.stat.berkeley.edu/~s133/dates.html
## Dataset we will use
We will use `rtweet` to pull Twitter data from the PAC-12 universities. We will use the university admissions Twitter handle if there is one, or the main Twitter handle for the university if there isn't one:
- We wrote a short tutorial on using `rtweet` in the Fall 2020 version of this class:
- [LINK](https://anyone-can-cook.github.io/rclass1/in_class/rtweet/intro_to_rtweet.html) to html file
- [LINK](https://anyone-can-cook.github.io/rclass1/in_class/rtweet/intro_to_rtweet.Rmd) to .Rmd file
```{r}
# library(rtweet)
#
# p12 <- c("uaadmissions", "FutureSunDevils", "caladmissions", "UCLAAdmission",
# "futurebuffs", "uoregon", "BeaverVIP", "USCAdmission",
# "engagestanford", "UtahAdmissions", "UW", "WSUPullman")
# p12_full_df <- search_tweets(paste0("from:", p12, collapse = " OR "), n = 500)
#
# saveRDS(p12_full_df, "p12_dataset.RDS")
# Load previously pulled Twitter data
# p12_full_df <- readRDS("p12_dataset.RDS")
p12_full_df <- readRDS(url("https://github.com/anyone-can-cook/rclass1/raw/master/data/twitter/p12_dataset.RDS", "rb"))
glimpse(p12_full_df)
p12_df <- p12_full_df %>% select("user_id", "created_at", "screen_name", "text", "location")
head(p12_df)
```
# Data structures and types
What is an **object**?
- Everything in R is an object
- We can classify objects based on their **class** and **type**
- `class()`: What kind of object is it (high-level)?
- The class of the object determines what kind of functions we can apply to it
- `typeof()`: What is the object's data type (low-level)?
- Objects may be combined to form **data structures**
[![](https://d33wubrfki0l68.cloudfront.net/1d1b4e1cf0dc5f6e80f621b0225354b0addb9578/6ee1c/diagrams/data-structures-overview.png){width=400px}](https://r4ds.had.co.nz/vectors.html)
*Credit: [R for Data Science](https://r4ds.had.co.nz/vectors.html)*
Basic **data types**:
- Logical (`TRUE`, `FALSE`)
- Numeric (e.g., `5`, `2.5`)
- Integer (e.g., `1L`, `4L`, where `L` tells R to store as `integer` type)
- Character (e.g., `"R is fun"`)
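As a quick check of the types listed above, the following sketch calls `typeof()` on one literal of each kind. Note that `typeof()` reports `double` for numeric values such as `5` and `2.5`:

```{r}
# One literal of each basic data type
typeof(TRUE)        # logical
typeof(2.5)         # numeric values are stored as type "double"
typeof(5)           # also "double": without the `L` suffix, 5 is not an integer
typeof(5L)          # integer
typeof("R is fun")  # character
```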
Basic **data structures**:
- [Atomic vectors](#atomic-vectors)
- [Lists](#lists)
- [Dataframes](#dataframes)
## Atomic vectors
What are **atomic vectors**?
- **Atomic vectors** are objects that contain elements
- Elements must be of the same data type (i.e., _homogeneous_)
- The `class()` and `typeof()` of a vector describe the elements it contains
**Example**: Investigating logical vectors
```{r}
v <- c(TRUE, FALSE, FALSE, TRUE)
str(v)
class(v)
typeof(v)
```
**Example**: Investigating numeric vectors
```{r}
v <- c(1, 3, 5, 7)
str(v)
class(v)
typeof(v)
```
**Example**: Investigating integer vectors
```{r}
v <- c(1L, 3L, 5L, 7L)
str(v)
class(v)
typeof(v)
```
**Example**: Investigating character vectors
Each element in a `character` vector is a **string** (covered in next section):
```{r}
v <- c("a", "b", "c", "d")
str(v)
class(v)
typeof(v)
```
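Because all elements of an atomic vector must share one type, `c()` silently coerces mixed inputs to the most flexible type present (logical, then integer, then double, then character). A minimal sketch, using illustrative object names (`mixed_*`) that are not used elsewhere in this lecture:

```{r}
mixed_lgl_int <- c(TRUE, 2L)       # logical + integer
mixed_int_dbl <- c(1L, 2.5)        # integer + double
mixed_all     <- c(1, "a", TRUE)   # anything + character

typeof(mixed_lgl_int)  # integer
typeof(mixed_int_dbl)  # double
typeof(mixed_all)      # character
mixed_all              # every element is now a string
```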
## Lists
What are **lists**?
- **Lists** are objects that contain elements
- Elements do not need to be of the same type (i.e., _heterogeneous_)
- Elements can be atomic vectors or even other lists
- The `class()` and `typeof()` of a list are both `list`
**Example**: Investigating heterogeneous lists
```{r}
l <- list(2.5, "abc", TRUE, c(1L, 2L, 3L))
str(l)
class(l)
typeof(l)
```
**Example**: Investigating nested lists
```{r}
l <- list(list(TRUE, c(1, 2, 3), list(c("a", "b", "c"))), FALSE, 10L)
str(l)
class(l)
typeof(l)
```
## Dataframes
What are **dataframes**?
- **Dataframes** are a special kind of **list** with the following characteristics:
- Each element is a **vector** (i.e., _a column in the dataframe_)
- Each element should be named (i.e., _column name in the dataframe_)
- Each of the vectors must be the same length (i.e., _same number of rows in the dataframe_)
- The data type of each vector may be different
- Dataframes can be created using the function `data.frame()`
- The `class()` of a dataframe is `data.frame`
- The `typeof()` of a dataframe is `list`
**Example**: Investigating dataframe
```{r}
df <- data.frame(
colA = c(1, 2, 3),
colB = c("a", "b", "c"),
colC = c(TRUE, FALSE, TRUE),
stringsAsFactors = FALSE
)
df
str(df)
class(df)
typeof(df)
```
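Since a dataframe is a list of equal-length vectors, the usual list tools apply to it. The sketch below uses a small illustrative dataframe (`df_demo`, a name chosen only for this example) to show the list behavior directly:

```{r}
df_demo <- data.frame(
  colA = c(1, 2, 3),
  colB = c("a", "b", "c"),
  stringsAsFactors = FALSE
)

is.list(df_demo)      # a dataframe is a list under the hood
length(df_demo)       # number of list elements = number of columns
names(df_demo)        # element names = column names
df_demo[["colB"]]     # extracting one element returns the column vector
typeof(df_demo$colA)  # each column keeps its own data type
```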
## Converting between types
Functions for converting between types:
- `as.logical()`: Convert to `logical`
- `as.numeric()`: Convert to `numeric`
- `as.integer()`: Convert to `integer`
- `as.character()`: Convert to `character`
- `as.list()`: Convert to `list`
- `as.data.frame()`: Convert to `data.frame`
**Example**: Using `as.logical()` to convert to `logical`
Character vector coerced to logical vector:
```{r}
# Only "TRUE"/"FALSE", "True"/"False", "T"/"F", "true"/"false" are able to be coerced to logical type
as.logical(c("TRUE", "FALSE", "True", "False", "true", "false", "T", "F", "t", "f", ""))
```
Numeric vector coerced to logical vector:
```{r}
# 0 is treated as FALSE, while all other numeric values are treated as TRUE
as.logical(c(0, 0.0, 1, -1, 20, 5.5))
```
**Example**: Using `as.numeric()` to convert to `numeric`
Logical vector coerced to numeric vector:
```{r}
# FALSE is mapped to 0 and TRUE is mapped to 1
as.numeric(c(FALSE, TRUE))
```
Character vector coerced to numeric vector:
```{r, warning = FALSE}
# Strings containing numeric values can be coerced to numeric (leading 0's are dropped)
# All other characters become NA
as.numeric(c("0", "007", "2.5", "abc", "."))
```
**Example**: Using `as.integer()` to convert to `integer`
Logical vector coerced to integer vector:
```{r}
# FALSE is mapped to 0 and TRUE is mapped to 1
as.integer(c(FALSE, TRUE))
```
Character vector coerced to integer vector:
```{r, warning = FALSE}
# Strings containing numeric values can be coerced to integer (leading 0's are dropped, decimals are truncated)
# All other characters become NA
as.integer(c("0", "007", "2.5", "abc", "."))
```
Numeric vector coerced to integer vector:
```{r, warning = FALSE}
# All decimal places are truncated
as.integer(c(0, 2.1, 10.5, 8.8, -1.8))
```
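Truncation is not the same as rounding: `as.integer()` drops the decimal part (moving toward zero), while base R's `round()` goes to the nearest whole number. A brief comparison on an illustrative vector:

```{r}
x_dec <- c(2.1, 8.8, -1.8)

as.integer(x_dec)  # drops decimals: 2, 8, -1
round(x_dec)       # nearest whole number: 2, 9, -2
```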
**Example**: Using `as.character()` to convert to `character`
Logical vector coerced to character vector:
```{r}
as.character(c(FALSE, TRUE))
```
Numeric vector coerced to character vector:
```{r, warning = FALSE}
as.character(c(-5, 0, 2.5))
```
Integer vector coerced to character vector:
```{r, warning = FALSE}
as.character(c(-2L, 0L, 10L))
```
**Example**: Using `as.list()` to convert to `list`
Atomic vectors coerced to list:
```{r}
# Logical vector
as.list(c(TRUE, FALSE))
# Character vector
as.list(c("a", "b", "c"))
# Integer vector (`1:3` creates integers)
as.list(1:3)
```
**Example**: Using `as.data.frame()` to convert to `data.frame`
Lists coerced to dataframe:
```{r}
# Create a list
l <- list(A = c("x", "y", "z"), B = c(1, 2, 3))
str(l)
# Convert to class `data.frame`
df <- as.data.frame(l, stringsAsFactors = FALSE)
str(df)
```
**Example**: Practical example of converting type
When working with data, it may be helpful to label values for certain variables. Data files often come with a codebook that defines how values are coded. Let's look at an example of labeling values and how converting data type may come into play.
We'll look at the `FIPS` variable from the [Integrated Postsecondary Education Data System (IPEDS)](https://nces.ed.gov/ipeds/use-the-data) data. The [state FIPS code](https://en.wikipedia.org/wiki/Federal_Information_Processing_Standard_state_code#FIPS_state_codes) is a numeric code that identifies a state. For example, `1` is the FIPS code for `Alabama`, `2` is the FIPS code for `Alaska`, etc. We'll want to label each numeric value in the `FIPS` column with the corresponding state name.
```{r}
# Library for labeling variables and values in a dataframe
library(labelled)
# Read in IPEDS data and codebook
ipeds_df <- read.csv('https://raw.githubusercontent.com/cyouh95/recruiting-chapter/master/data/ipeds_hd2017.csv', header = TRUE, na.strings = c('', 'NA'), stringsAsFactors = FALSE)
ipeds_values <- read.csv('https://raw.githubusercontent.com/cyouh95/recruiting-chapter/master/data/ipeds_hd2017_values.csv', header = TRUE, na.strings = c('', 'NA'), stringsAsFactors = FALSE)
# The codebook defines how variables are coded, such as STABBR, FIPS, and other variables
head(ipeds_values)
# Filter codebook for just the values for the FIPS variable
fips_values <- ipeds_values %>% filter(varname == 'FIPS') %>% select(varname, codevalue, valuelabel)
head(fips_values)
```
When we read in the data from the CSV files, R automatically tries to determine the data type of each variable. As seen below, the `FIPS` column from the `ipeds_df` that we want to label is of type `integer`, while the `codevalue` column from the codebook is of type `character` (since not all values are numeric):
```{r}
# Type of `FIPS` column
str(ipeds_df$FIPS)
# Type of `codevalue` column
str(fips_values$codevalue)
```
This discrepancy becomes a problem when we try to label the value using the `labelled` library:
```{r, eval=F}
# Error: `x` and `labels` must be same type
val_label(ipeds_df$FIPS, fips_values[1, 'codevalue']) <- fips_values[1, 'valuelabel']
```
To resolve this, we can use `as.integer()` to convert the `codevalue` from `character` type to `integer` before trying to label the value:
```{r}
# This now works
val_label(ipeds_df$FIPS, as.integer(fips_values[1, 'codevalue'])) <- fips_values[1, 'valuelabel']
# Check value labels
val_labels(ipeds_df$FIPS)
# We can use as.integer() to convert the entire vector (i.e., the codevalue column) to integer
fips_values$codevalue <- as.integer(fips_values$codevalue)
# Type of `codevalue` column
str(fips_values$codevalue)
# Use loop to label the rest of the values
for (i in seq_len(nrow(fips_values))) {
val_label(ipeds_df$FIPS, fips_values[i, 'codevalue']) <- fips_values[i, 'valuelabel']
}
# Check value labels
val_labels(ipeds_df$FIPS)
```
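The same type-matching concern shows up outside of `labelled`. As a minimal base R sketch (not the approach used above), a named character vector can act as a lookup table from codes to labels. The objects `toy_codebook` and `toy_fips` are hypothetical stand-ins for the codebook and the `FIPS` column:

```{r}
# Hypothetical mini codebook: codes were read in as character strings
toy_codebook <- data.frame(
  codevalue  = c("1", "2", "4"),
  valuelabel = c("Alabama", "Alaska", "Arizona"),
  stringsAsFactors = FALSE
)

# Hypothetical data column stored as integer
toy_fips <- c(2L, 1L, 4L, 1L)

# Names of a vector are always character, so convert the integer codes
# to character before looking up labels by name
# (indexing with the integer 4 would instead ask for the 4th element)
fips_lookup <- setNames(toy_codebook$valuelabel, toy_codebook$codevalue)
toy_labels <- unname(fips_lookup[as.character(toy_fips)])
toy_labels
```

Here the conversion goes in the opposite direction (`integer` to `character`), but the principle is the same: the codes in the data and the codes in the codebook need to be of the same type before they can be matched.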
# String basics
What are **strings**?
- String is a type of data in R
- You can create strings using either single quotes (`'`) or double quotes (`"`)
- R prints strings using double quotes, regardless of which type of quotes was used to create them
- The `class()` and `typeof()` of a string are both `character`
**Example**: Creating string using single quotes
Notice how R prints the string using double quotes:
```{r}
my_string <- 'This is a string'
my_string
```
**Example**: Creating string using double quotes
```{r}
my_string <- "Strings can also contain numbers: 123"
my_string
```
**Example**: Checking class and type of strings
```{r}
class(my_string)
typeof(my_string)
```
**Note**: To include quotes as part of the string, we can either use the other type of quotes to surround the string (i.e., `'` or `"`) or escape the quote using a backslash (`\`). _We won't be going in-depth into escaping characters for this class, but see appendix for more details if you are interested._
```{r}
# Include quote by using the other type of quotes to surround the string
my_string <- "There's no issues with this string."
my_string
# Include quote of the same type by escaping it with a backslash
my_string <- 'There\'s no issues with this string.'
my_string
```
```{r, eval=F}
# This would not work
my_string <- 'There's an issue with this string.'
my_string
```
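When a string contains an escaped quote, printing it shows the escape, whereas `writeLines()` shows the characters the string actually holds. A small sketch using base R only (the object names are illustrative):

```{r}
quote_a <- "There's no issues with this string."
quote_b <- 'There\'s no issues with this string.'

identical(quote_a, quote_b)  # both ways create the same string

nested <- "She said \"hi\""
print(nested)       # printed form shows the backslash escapes
writeLines(nested)  # actual content of the string
nchar(nested)       # the backslashes are not counted as characters
```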
# `stringr` package
> "A consistent, simple and easy to use set of wrappers around the fantastic `stringi` package. All function and argument names (and positions) are consistent, all functions deal with `NA`'s and zero length vectors in the same way, and the output from one function is easy to feed into the input of another."
*Credit: `stringr` [R documentation](https://www.rdocumentation.org/packages/stringr/versions/1.4.0)*
The `stringr` package:
- The `stringr` package is based off the `stringi` package and is part of __Tidyverse__
- `stringr` contains functions to work with strings
- For many functions in the `stringr` package, there are equivalent "base R" functions
- But `stringr` functions all follow the same rules, while rules often differ across different "base R" string functions, so we will focus exclusively on `stringr` functions
- Most `stringr` functions start with `str_` (e.g., `str_length`)
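To make the comparison with base R concrete, the sketch below pairs a few `stringr` functions with base R counterparts on an illustrative vector (`demo_words`). It also shows one case where the rules differ: how missing values are treated when concatenating.

```{r}
library(stringr)

demo_words <- c("cats", "in", "hat")

# Same results from stringr and the corresponding base R functions
str_length(demo_words)     # stringr
nchar(demo_words)          # base R

str_to_upper(demo_words)   # stringr
toupper(demo_words)        # base R

str_sub(demo_words, 1, 2)  # stringr
substr(demo_words, 1, 2)   # base R

# One place where the rules differ: missing values when concatenating
with_na <- c("cats", NA)
str_c(with_na, "!")        # the missing value stays missing
paste0(with_na, "!")       # the missing value becomes the string "NA!"
```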
## `str_length()`
__The `str_length()` function__:
```{r, eval = FALSE}
?str_length
# SYNTAX
str_length(string)
```
- Function: Find string length
- Arguments:
- `string`: Character vector (or vector coercible to character)
- Note that `str_length()` calculates the length of a string, whereas the `length()` function (which is not part of `stringr` package) calculates the number of elements in an object
**Example**: Using `str_length()` on string
```{r}
str_length("cats")
```
Compare to `length()`, which treats the string as a single object:
```{r}
length("cats")
```
**Example**: Using `str_length()` on character vector
```{r}
str_length(c("cats", "in", "hat"))
```
Compare to `length()`, which finds the number of elements in the vector:
```{r}
length(c("cats", "in", "hat"))
```
**Example**: Using `str_length()` on other vectors coercible to character
Logical vectors can be coerced to character vectors:
```{r}
str_length(c(TRUE, FALSE))
```
Numeric vectors can be coerced to character vectors:
```{r}
str_length(c(1, 2.5, 3000))
```
Integer vectors can be coerced to character vectors:
```{r}
str_length(c(2L, 100L))
```
**Example**: Using `str_length()` on dataframe column
Recall that the columns in a dataframe are just vectors, so we can use `str_length()` as long as the vector is coercible to character type. Let's look at the `screen_name` column from the `p12_df`:
```{r}
# `p12_df` is a dataframe object
str(p12_df)
# `screen_name` column is a character vector
str(p12_df$screen_name)
```
**[Base R method]** Use `str_length()` to calculate the length of each `screen_name`:
```{r}
# Let's focus on just the unique screen names
unique(p12_df$screen_name)
str_length(unique(p12_df$screen_name))
```
**[Tidyverse method]** Use `str_length()` to calculate the length of each `screen_name`:
```{r}
# Let's focus on just the unique screen names
p12_df %>% select(screen_name) %>% unique()
#p12_df %>% select(screen_name) %>% unique() %>% str_length()
```
Notice that the commented-out line above does not work as expected because it passes a dataframe to `str_length()`, which then tries to coerce the whole dataframe to character:
```{r}
class(p12_df %>% select(screen_name) %>% unique())
```
An alternative way is to add a column to the dataframe that contains the result of applying `str_length()` to the `screen_name` vector:
```{r}
p12_df %>% select(screen_name) %>% unique() %>%
mutate(screen_name_len = str_length(screen_name))
```
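Another option is to extract the column as a vector before calling `str_length()`. The sketch below uses `dplyr::pull()` on a small stand-in tibble; the handles in `toy_df` are made up for illustration and are not taken from `p12_df`:

```{r}
library(dplyr)
library(stringr)

# Hypothetical stand-in for the `screen_name` column of `p12_df`
toy_df <- tibble(screen_name = c("handle_one", "handle_two", "handle_one", "hi"))

# select() returns a dataframe, so str_length() cannot be applied directly;
# pull() extracts the column as a character vector instead
toy_lengths <- toy_df %>% pull(screen_name) %>% unique() %>% str_length()
toy_lengths
```

The trade-off is that `pull()` returns a bare vector, while the `mutate()` approach keeps the lengths alongside the screen names in a dataframe.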
## `str_c()`
__The `str_c()` function__:
```{r, eval = FALSE}
?str_c
# SYNTAX AND DEFAULT VALUES
str_c(..., sep = "", collapse = NULL)
```
- Function: Concatenate strings between vectors (element-wise)
- Arguments:
- The input is one or more character vectors (or vectors coercible to character)
- Zero length arguments are removed
- Short arguments are recycled to the length of the longest
- `sep`: String to insert between input vectors
- `collapse`: Optional string used to combine input vectors into single string
**Example**: Using `str_c()` on one vector
Since we only provided one input vector, it has nothing to concatenate with, so `str_c()` will just return the same vector:
```{r}
str_c(c("a", "b", "c"))
```
Note that specifying the `sep` argument will also not have any effect because we only have one input vector, and `sep` is the separator between multiple vectors:
```{r}
str_c(c("a", "b", "c"), sep = "~")
# Check length: Output is the original vector of 3 elements
str_c(c("a", "b", "c")) %>% length()
```
As seen above, `str_c()` returns a vector by default (because the default value for the `collapse` argument is `NULL`). But we can specify a string for `collapse` in order to collapse the elements of the output vector into a single string:
```{r}
str_c(c("a", "b", "c"), collapse = "|")
# Check length: Output vector of length 3 is collapsed into a single string
str_c(c("a", "b", "c"), collapse = "|") %>% length()
# Check str_length: This gives the length of the collapsed string, which is 5 characters long
str_c(c("a", "b", "c"), collapse = "|") %>% str_length()
```
**Example**: Using `str_c()` on more than one vector
When we provide multiple input vectors, the vectors are concatenated element-wise (i.e., the 1st elements of each vector are concatenated, the 2nd elements of each vector are concatenated, etc.):
```{r}
str_c(c("a", "b", "c"), c("x", "y", "z"), c("!", "?", ";"))
```
The default separator for each element-wise concatenation is an empty string (`""`), but we can customize that by specifying the `sep` argument:
```{r}
str_c(c("a", "b", "c"), c("x", "y", "z"), c("!", "?", ";"), sep = "~")
# Check length: Output vector is same length as input vectors
str_c(c("a", "b", "c"), c("x", "y", "z"), c("!", "?", ";"), sep = "~") %>% length()
```
Again, we can specify the `collapse` argument in order to collapse the elements of the output vector into a single string:
```{r}
str_c(c("a", "b", "c"), c("x", "y", "z"), c("!", "?", ";"), collapse = "|")
# Check length: Output vector of length 3 is collapsed into a single string
str_c(c("a", "b", "c"), c("x", "y", "z"), c("!", "?", ";"), collapse = "|") %>% length()
# Specifying both `sep` and `collapse`
str_c(c("a", "b", "c"), c("x", "y", "z"), c("!", "?", ";"), sep = "~", collapse = "|")
```
**Example**: Using `str_c()` on "strings"
What do we mean by "strings"?
- Informally, we can think of a "string" as a character vector with `length()` equal to 1 (i.e., one element)
- Another way to think of it: a "string" is anything you put between quotes
- Loosely, we can also think of individual elements within a character vector as strings
Below, passing 3 strings into `str_c()` is like passing in 3 vectors of size 1 each.
- Remember that vectors are concatenated element-wise, so these strings will be joined like this:
```{r}
str_c("a", "b", "c")
# Again, we can think of strings as being character vectors of size 1
str_c(c("a"), c("b"), c("c"))
```
We can use `sep` to specify how the elements are separated:
```{r}
str_c("a", "b", "c", sep = "~")
```
Since we only have 1 element in each vector, the output from `str_c()` is a vector of length 1. Thus, `collapse` will not be useful here since it works to collapse multiple elements in the output vector into a single string:
```{r}
str_c("a", "b", "c", collapse = "|")
```
**Example**: Using `str_c()` on types other than character
When we provide a non-character vector (such as a numeric or logical vector), it will get coerced into a character vector:
```{r}
str_c(c("a", "b", "c"), c(1, 2, 3), c(TRUE, FALSE, FALSE))
# Specifying both `sep` and `collapse`
str_c(c("a", "b", "c"), c(1, 2, 3), c(TRUE, FALSE, FALSE), sep = "~", collapse = "|")
```
Note that we can also use any other single element input (other than string) that can be coerced to character:
```{r}
str_c(TRUE, 1.5, 2L, "X")
```
**Example**: Using `str_c()` on vectors of different lengths
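When multiple vectors are provided, they are joined together element-wise, recycling the elements of the shorter vectors. This reflects `stringr` 1.4.0, the version documented in the link above; as far as we are aware, `stringr` 1.5.0 and later recycle only length-1 vectors, so the mixed-length calls below may raise an error there: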
```{r, warning = FALSE}
str_c("#", c("a", "b", "c", "d"), c(1, 2, 3), c(TRUE, FALSE))
# Specifying both `sep` and `collapse`
str_c("#", c("a", "b", "c", "d"), c(1, 2, 3), c(TRUE, FALSE), sep = "~", collapse = "|")
```
**Example**: Using `str_c()` on dataframe columns
Let's combine the `user_id` and `screen_name` columns from `p12_df`. We'll focus on unique Twitter handles:
```{r}
p12_unique_df <- p12_df %>% select(user_id, screen_name) %>% unique()
p12_unique_df
```
**[Base R method]** Use `str_c()` to combine `user_id` and `screen_name`:
```{r}
str_c(p12_unique_df$user_id, "=", p12_unique_df$screen_name, sep = " ", collapse = ", ")
str_c(p12_unique_df$user_id, "=", p12_unique_df$screen_name, sep = " ") # without collapsing to one element
```
**[Tidyverse method]** Use `str_c()` to combine `user_id` and `screen_name`:
```{r}
p12_unique_df %>% mutate(twitter_handle = str_c(user_id,screen_name))
p12_unique_df %>% mutate(twitter_handle = str_c("User #", user_id, " is @", screen_name))
```
## `str_sub()`
__The `str_sub()` function__:
```{r, eval = FALSE}
?str_sub
# SYNTAX AND DEFAULT VALUES
str_sub(string, start = 1L, end = -1L)
str_sub(string, start = 1L, end = -1L, omit_na = FALSE) <- value
```
- Function: Subset strings
- Arguments:
- `string`: Character vector (or vector coercible to character)
- `start`: Position of first character to be included in substring (default: `1`)
- `end`: Position of last character to be included in substring (default: `-1`)
- Negative index means counting backwards from the end of the string
- If an element in the vector is shorter than the specified `end`, it will just include all the available characters that it does have
- `omit_na`: If `TRUE`, missing values in any of the arguments provided will result in an unchanged input
- When `str_sub()` is used in the assignment form, you can replace the subsetted part of the string with a `value` of your choice
- If an element in the vector is too short to meet the subset specification, the replacement `value` will be concatenated to the end of that element
- Note that this modifies your input vector directly, so you must have the vector saved to a variable (see example below)
**Example**: Using `str_sub()` to subset strings
If no `start` and `end` positions are specified, `str_sub()` will by default return the entire (original) string:
```{r}
str_sub(string = c("abcdefg", 123, TRUE))
```
Note that if an element is shorter than the specified `end` (i.e., `123` in the example below), it will just include all the available characters that it does have:
```{r}
str_sub(string = c("abcdefg", 123, TRUE), start = 2, end = 4)
```
Remember that we can also use a negative index to count positions from the end of the string:
```{r}
str_sub(c("abcdefg", 123, TRUE), start = 2, end = -2)
```
**Example**: Using `str_sub()` to replace strings
With the default positions `start = 1` and `end = -1` (written out explicitly below), `str_sub()` selects the entire original string, so the entire string is replaced:
```{r}
v <- c("A", "AB", "ABC", "ABCD", "ABCDE")
str_sub(v, start = 1, end = -1)
str_sub(v, start = 1, end = -1) <- "*"
v
```
If an element in the vector is too short to meet the subset specification, the replacement `value` will be concatenated to the end of that element:
```{r}
v <- c("A", "AB", "ABC", "ABCD", "ABCDE")
v
str_sub(v, start = 2, end = 3)
str_sub(v, start = 2, end = 3) <- "*"
v
```
Note that because the replacement form of `str_sub()` modifies the input vector directly, we need to save it in a variable first. Directly passing in the vector to `str_sub()` would give us an error:
```{r, eval = FALSE}
# Does not work
str_sub(c("A", "AB", "ABC", "ABCD", "ABCDE")) <- "*"
```
**Example**: Using `str_sub()` on dataframe column
We can use `as.character()` to turn the `created_at` value to a string, then use `str_sub()` to extract out various date/time components from the string:
```{r}
p12_datetime_df <- p12_df %>% select(created_at) %>%
mutate(
dt_chr = as.character(created_at),
date_chr = str_sub(dt_chr, 1, 10),
yr_chr = str_sub(dt_chr, 1, 4),
mth_chr = str_sub(dt_chr, 6, 7),
day_chr = str_sub(dt_chr, 9, 10),
hr_chr = str_sub(dt_chr, -8, -7),
min_chr = str_sub(dt_chr, -5, -4),
sec_chr = str_sub(dt_chr, -2, -1)
)
p12_datetime_df
```
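The positions used above depend on the timestamp string having a fixed `YYYY-MM-DD HH:MM:SS` layout. The sketch below applies the same positions to a single made-up timestamp (`ts_chr`) written in that layout, so the effect of each pair of indices is easy to see:

```{r}
library(stringr)

# One hypothetical timestamp in the "YYYY-MM-DD HH:MM:SS" layout (19 characters)
ts_chr <- "2020-03-15 14:30:05"

str_sub(ts_chr, 1, 10)   # date:   positions 1 to 10
str_sub(ts_chr, 6, 7)    # month:  positions 6 to 7
str_sub(ts_chr, -8, -7)  # hour:   8th-to-last and 7th-to-last characters
str_sub(ts_chr, -2, -1)  # second: last two characters
```

If the strings were formatted differently (for example, with a time zone suffix), the negative positions would pick up different characters, so it is worth checking the layout before subsetting by position.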
## Other `stringr` functions
Other useful `stringr` functions:
- `str_to_upper()`: Turn strings to uppercase
- `str_to_lower()`: Turn strings to lowercase
- `str_sort()`: Sort a character vector
- `str_trim()`: Trim whitespace from strings (including `\n`, `\t`, etc.)
- `str_pad()`: Pad strings with specified character
**Example**: Using `str_to_upper()` to turn strings to uppercase
Turn column names of `p12_df` to uppercase:
```{r}
# Column names are originally lowercase
names(p12_df)
# Turn column names to uppercase
names(p12_df) <- str_to_upper(names(p12_df))
names(p12_df)
```
**Example**: Using `str_to_lower()` to turn strings to lowercase
Turn column names of `p12_df` to lowercase:
```{r}
# Column names are originally uppercase
names(p12_df)
# Turn column names to lowercase
names(p12_df) <- str_to_lower(names(p12_df))
names(p12_df)
```
**Example**: Using `str_sort()` to sort character vector
Sort the vector of `p12_df` column names:
```{r}
# Before sort
names(p12_df)
# Sort alphabetically (default)
str_sort(names(p12_df))
# Sort reverse alphabetically
str_sort(names(p12_df), decreasing = TRUE)
```
**Example**: Using `str_trim()` to trim whitespace from string
```{r}
# Trim whitespace from both left and right sides (default)
str_trim(c("\nABC ", " XYZ\t"))
# Trim whitespace from left side
str_trim(c("\nABC ", " XYZ\t"), side = "left")
# Trim whitespace from right side
str_trim(c("\nABC ", " XYZ\t"), side = "right")
```
**Example**: Using `str_pad()` to pad string with character
Let's say we have a vector of zip codes that have lost their leading 0's. We can use `str_pad()` to add them back:
```{r}
# Pad the left side of strings with "0" until width of 5 is reached
str_pad(c(95035, 90024, 5009, 5030), width = 5, side = "left", pad = "0")
```
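As a small follow-up sketch, the example below pads a few hypothetical zip codes (`zip_num`, converted to character explicitly) and confirms the resulting widths. It also shows that `str_pad()` only adds characters: a string already longer than `width` is returned unchanged.

```{r}
library(stringr)

# Hypothetical zip codes that were read in as numbers
zip_num <- c(95035, 5009, 501)

zip_chr <- str_pad(as.character(zip_num), width = 5, side = "left", pad = "0")
zip_chr
str_length(zip_chr)  # every element now has 5 characters

# str_pad() never truncates: longer strings are left as they are
str_pad("123456", width = 5, pad = "0")
```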
# Dates and times
> "Date-time data can be frustrating to work with in R. R commands for date-times are generally unintuitive and change depending on the type of date-time object being used. Moreover, the methods we use with date-times must be robust to time zones, leap days, daylight savings times, and other time related quirks, and R lacks these capabilities in some situations. Lubridate makes it easier to do the things R does with date-times and possible to do the things R does not."
*Credit: `lubridate` [documentation](https://lubridate.tidyverse.org/)*
How are dates and times stored in R? (From [Dates and Times in R](https://www.stat.berkeley.edu/~s133/dates.html))
- The `Date` class is used for storing dates
- "Internally, `Date` objects are stored as the number of days since January 1, 1970, using negative numbers for earlier dates. The `as.numeric()` function can be used to convert a `Date` object to its internal form."
- POSIX classes can be used for storing date plus times
- "The `POSIXct` class stores date/time values as the number of seconds since January 1, 1970"
- "The `POSIXlt` class stores date/time values as a list of components (hour, min, sec, mon, etc.) making it easy to extract these parts"
- There is no native R class for storing only time
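The storage formats described above can be inspected with base R alone. The sketch below uses the base functions `as.Date()`, `as.POSIXct()`, and `as.POSIXlt()` on a few illustrative values (the object names `ct_demo` and `lt_demo` are chosen only for this example):

```{r}
# Date: number of days since 1970-01-01
as.numeric(as.Date("1970-01-02"))  # one day after the origin
as.numeric(as.Date("1969-12-31"))  # negative for earlier dates

# POSIXct: number of seconds since 1970-01-01 00:00:00 UTC
ct_demo <- as.POSIXct("1970-01-01 00:01:00", tz = "UTC")
as.numeric(ct_demo)                # one minute after the origin

# POSIXlt: a list of components
lt_demo <- as.POSIXlt("2020-01-01 13:45:10", tz = "UTC")
typeof(lt_demo)
lt_demo$hour
lt_demo$min
```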
Why use date/time objects?
- Using date/time objects makes it easier to fetch or modify various date/time components (e.g., year, month, day, day of the week)
- When the date/time is stored only as a string, these components are not as readily accessible and must be parsed out
- You can perform arithmetic with date/time objects (e.g., find the "difference" between two date/time points)
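A minimal base R sketch of these advantages, using two illustrative dates. None of these operations would work if the dates were stored as plain strings:

```{r}
start_date <- as.Date("2020-02-01")
end_date   <- as.Date("2020-03-01")

end_date - start_date     # difference between two dates (a `difftime`)
start_date + 30           # adding a number moves the date forward by days
weekdays(start_date)      # day of the week, no string parsing needed
format(start_date, "%Y")  # year component returned as a string
```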
## Creating date/time objects
### Creating date/time objects by parsing input
Functions that create date/time objects **by parsing character or numeric input**:
- Create `Date` object: `ymd()`, `ydm()`, `mdy()`, `myd()`, `dmy()`, `dym()`
- `y` stands for year, `m` stands for month, `d` stands for day
- Select the function that represents the order in which your date input is formatted, and the function will be able to parse your input and create a `Date` object
- Create `POSIXct` object: `ymd_h()`, `ymd_hm()`, `ymd_hms()`, etc.
- `h` stands for hour, `m` stands for minute, `s` stands for second
- For any of the previous 6 date functions, you can append `h`, `hm`, or `hms` if you want to provide additional time information in order to create a `POSIXct` object
- To force a `POSIXct` object without providing any time information, you can just provide a timezone (using `tz`) to one of the date functions and it will assume midnight as the time
- You can use `Sys.timezone()` to get the timezone for your location
**Example**: Creating `Date` object from character or numeric input
The `lubridate` functions are flexible and can parse dates in various formats:
```{r}
d <- mdy("1/1/2020")
d
d <- mdy("1-1-2020")
d
d <- mdy("Jan. 1, 2020")
d
d <- ymd(20200101)
d
```
Investigate the `Date` object:
```{r}
class(d)
typeof(d)
# Number of days since January 1, 1970
as.numeric(d)
```
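Choosing the function that matches the order of the input matters most when the input is ambiguous. In the sketch below, the same illustrative string is read as two different dates depending on which function parses it:

```{r}
library(lubridate)

ambiguous <- "01/02/2020"

mdy(ambiguous)  # read as month/day/year: January 2, 2020
dmy(ambiguous)  # read as day/month/year: February 1, 2020
```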
**Example**: Creating `POSIXct` object from character or numeric input
The `lubridate` functions are flexible and can parse AM/PM in various formats:
```{r}
dt <- mdy_h("12/31/2019 11pm")
dt
dt <- mdy_hm("12/31/2019 11:59 pm")
dt
dt <- mdy_hms("12/31/2019 11:59:59 PM")
dt
dt <- ymd_hms(20191231235959)
dt
```
Investigate the `POSIXct` object:
```{r}
class(dt)
typeof(dt)
# Number of seconds since January 1, 1970
as.numeric(dt)
```
We can also create a `POSIXct` object from a date function by providing a timezone. The time would default to midnight:
```{r}
dt <- mdy("1/1/2020", tz = "UTC")
dt
# Number of seconds since January 1, 1970
as.numeric(dt) # Note that this is indeed 1 sec after the previous example
```
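The timezone supplied through `tz` also determines which moment "midnight" refers to. The sketch below compares the same calendar date parsed with no timezone, with `"UTC"`, and with `"America/Los_Angeles"` (an arbitrary timezone picked for illustration):

```{r}
library(lubridate)

d_only <- ymd("2020-01-01")                              # no tz: Date
dt_utc <- ymd("2020-01-01", tz = "UTC")                  # tz given: POSIXct
dt_la  <- ymd("2020-01-01", tz = "America/Los_Angeles")  # tz given: POSIXct

class(d_only)
class(dt_utc)

# Midnight in Los Angeles occurs 8 hours after midnight UTC in January,
# so the two objects differ by 8 * 3600 seconds
as.numeric(dt_la) - as.numeric(dt_utc)
```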
**Example**: Creating `Date` objects from dataframe column
Using the `p12_datetime_df` we created earlier, we can create `Date` objects from the `date_chr` column:
```{r}
# Use `ymd()` to parse the string stored in the `date_chr` column
p12_datetime_df %>% select(created_at, dt_chr, date_chr) %>%
mutate(date_ymd = ymd(date_chr))
```
**Example**: Creating `POSIXct` objects from dataframe column
Using the `p12_datetime_df` we created earlier, we can recreate the `created_at` column (class `POSIXct`) from the `dt_chr` column (class `character`):
```{r}
# Use `ymd_hms()` to parse the string stored in the `dt_chr` column
p12_datetime_df %>% select(created_at, dt_chr) %>%
mutate(datetime_ymd_hms = ymd_hms(dt_chr))
```
### Creating date/time objects from individual components
Functions that create date/time objects **from various date/time components**:
- Create `Date` object: `make_date()`
- Syntax and default values: `make_date(year = 1970L, month = 1L, day = 1L)`
- All inputs are coerced to integer
- Create `POSIXct` object: `make_datetime()`
- Syntax and default values: `make_datetime(year = 1970L, month = 1L, day = 1L, hour = 0L, min = 0L, sec = 0, tz = "UTC")`
**Example**: Creating `Date` object from individual components
There are various ways to pass in the inputs to create the same `Date` object:
```{r}
d <- make_date(2020, 1, 1)
d
# Characters can be coerced to integers
d <- make_date("2020", "01", "01")
d
# Remember that the default values for month and day would be 1L
d <- make_date(2020)
d
```
**Example**: Creating `POSIXct` object from individual components
```{r}
# Inputs should be numeric
d <- make_datetime(2019, 12, 31, 23, 59, 59)
d
```
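Two further behaviors are worth knowing, sketched here under the assumption of a recent `lubridate` version: the inputs are vectorized (shorter inputs are recycled), and an impossible combination of components should give `NA` rather than an error. The names `first_days` and `impossible` are illustrative:

```{r}
library(lubridate)

# One year and one day are recycled across three months
first_days <- make_date(2020, 1:3, 1)
first_days

# February 30 does not exist, so the result is NA
impossible <- make_date(2019, 2, 30)
impossible
```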
**Example**: Creating `Date` objects from dataframe columns
Using the `p12_datetime_df` we created earlier, we can create `Date` objects from the various date component columns:
```{r}
# Use `make_date()` to create a `Date` object from the `yr_chr`, `mth_chr`, `day_chr` fields
p12_datetime_df %>% select(created_at, dt_chr, yr_chr, mth_chr, day_chr) %>%
mutate(date_make_date = make_date(year = yr_chr, month = mth_chr, day = day_chr))
```
**Example**: Creating `POSIXct` objects from dataframe columns
Using the `p12_datetime_df` we created earlier, we can recreate the `created_at` column (class `POSIXct`) from the various date and time component columns (class `character`):
```{r}
# Use `make_datetime()` to create a `POSIXct` object from the `yr_chr`, `mth_chr`, `day_chr`, `hr_chr`, `min_chr`, `sec_chr` fields
# Convert inputs to integers first
p12_datetime_df %>%
mutate(datetime_make_datetime = make_datetime(
as.integer(yr_chr), as.integer(mth_chr), as.integer(day_chr),
as.integer(hr_chr), as.integer(min_chr), as.integer(sec_chr)
)) %>%
select(datetime_make_datetime, yr_chr, mth_chr, day_chr, hr_chr, min_chr, sec_chr)
```
## Date/time object components
Storing data using date/time objects makes it easier to **get and set** the various date/time components.
- Basic accessor functions:
- `date()`: Date component
- `year()`: Year
- `month()`: Month
- `day()`: Day
- `hour()`: Hour
- `minute()`: Minute
- `second()`: Second
- `week()`: Week of the year
- `wday()`: Day of the week (`1` for Sunday to `7` for Saturday)
- `am()`: Is it in the am? (returns `TRUE` or `FALSE`)
- `pm()`: Is it in the pm? (returns `TRUE` or `FALSE`)
- To **get** a date/time component, you can simply pass a date/time object to the function
- Syntax: `accessor_function(datetime)`
- To **set** a date/time component, you can assign into the accessor function to change the component
- Syntax: `accessor_function(datetime) <- "new_component"`
- Note that `am()` and `pm()` can't be set. Modify the time components instead.
**Example**: Getting date/time components
```{r}
# Create datetime for New Year's Eve
dt <- make_datetime(2019, 12, 31, 23, 59, 59)
dt
dt %>% class()
# Get date
date(dt)
# Get hour
hour(dt)
# Is it pm?
pm(dt)
# Day of the week (3 = Tuesday)
wday(dt)
year(dt)
```
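Some accessors take optional arguments that change what is returned. The sketch below shows `week_start`, `label`, and `abbr`; the name `dt_nye` is illustrative, and the label text depends on your locale, so it may differ from English on some systems:

```{r}
library(lubridate)

dt_nye <- make_datetime(2019, 12, 31, 23, 59, 59)

# Default numbering starts the week on Sunday, so Tuesday is 3
wday(dt_nye)

# With the week starting on Monday, Tuesday is 2
wday(dt_nye, week_start = 1)

# Labels instead of numbers (returned as ordered factors)
wday(dt_nye, label = TRUE)
month(dt_nye, label = TRUE, abbr = FALSE)
```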
**Example**: Setting date/time components
```{r}
# Create datetime for New Year's Eve
dt <- make_datetime(2019, 12, 31, 23, 59, 59)
dt
# Get week of year
week(dt)
# Set week of year (move back 1 week)
week(dt) <- week(dt) - 1
# Date now moved from New Year's Eve to Christmas Eve
dt
# Set day to Christmas Day
day(dt) <- 25
# Date now moved from Christmas Eve to Christmas Day
dt
```
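Assigning into an accessor modifies the object in place. If you would rather keep the original and change several components at once, `update()` is one alternative. A minimal sketch (the names `dt_nye` and `dt_new_year` are illustrative):

```{r}
library(lubridate)

dt_nye <- make_datetime(2019, 12, 31, 23, 59, 59)

# `update()` returns a modified copy with the named components replaced
dt_new_year <- update(dt_nye, year = 2020, month = 1, mday = 1,
                      hour = 0, minute = 0, second = 0)
dt_new_year

# The original object is unchanged
dt_nye
```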
**Example**: Getting date/time components from dataframe column
Using the `p12_datetime_df` we created earlier, we can isolate the various date/time components from the `POSIXct` object in the `created_at` column:
```{r}
# The extracted date/time components will be of numeric type
p12_datetime_df %>% select(created_at) %>%
mutate(
yr_num = year(created_at),
mth_num = month(created_at),
day_num = day(created_at),
hr_num = hour(created_at),
min_num = minute(created_at),
sec_num = second(created_at),
ampm = ifelse(am(created_at), 'AM', 'PM') # am()/pm() returns TRUE/FALSE
)
```
## Time spans
![](https://raw.githubusercontent.com/anyone-can-cook/rclass2/master/assets/images/time_spans.png)
3 ways to represent time spans (From [lubridate cheatsheet](https://rawgit.com/rstudio/cheatsheets/master/lubridate.pdf))
- **Intervals** represent specific intervals of the timeline, bounded by start and end date-times
- Example: People with birthdays between the **interval** October 23 to November 22 are Scorpios
- **Periods** track changes in clock times, which ignore time line irregularities
- Example: Daylight savings time ends at the beginning of November and we gain an hour - this extra hour is _ignored_ when determining the **period** between October 23 and November 22
- **Durations** track the passage of physical time, which deviates from clock time when irregularities occur
- Example: Daylight savings time ends at the beginning of November and we gain an hour - this extra hour is _added_ when determining the **duration** between October 23 and November 22
### Time spans using `lubridate`
Using the `lubridate` package for time spans:
- **Interval**
- Create an interval using `interval()` or `%--%`
- Syntax: `interval(start, end)` or `start %--% end`
- **Periods**
- "Periods are time spans but don’t have a fixed length in seconds, instead they work with '_human_' times, like days and months." (From [R for Data Science](https://r4ds.had.co.nz/dates-and-times.html#periods))
- Create periods using functions whose name is the time unit pluralized (e.g., `years()`, `months()`, `weeks()`, `days()`, `hours()`, `minutes()`, `seconds()`)
- Example: `days(1)` creates a period of 1 day - it does not matter if this day happened to have an extra hour due to daylight savings ending, since periods do not have a physical length
```{r}
days(1)
```
- You can add and subtract periods
- You can also use `as.period()` to get period of an interval
- **Durations**
- Durations keep track of the physical amount of time elapsed, so they are "stored as seconds, the only time unit with a consistent length" (From [lubridate cheatsheet](https://rawgit.com/rstudio/cheatsheets/master/lubridate.pdf))
- Create durations using functions whose name is the time unit prefixed with a `d` (e.g., `dyears()`, `dweeks()`, `ddays()`, `dhours()`, `dminutes()`, `dseconds()`)
- Example: `ddays(1)` creates a duration of `86400s`, using the standard conversion of `60` seconds in a minute, `60` minutes in an hour, and `24` hours in a day:
```{r}
ddays(1)
```
Notice that the output says this is equivalent to _approximately_ `1` day, since it acknowledges that not all days have `24` hours. In the case of daylight savings, one particular day may have `25` hours, so the duration of that day should be represented as:
```{r}
ddays(1) + dhours(1)
```
- You can add and subtract durations
- You can also use `as.duration()` to get duration of an interval
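The difference between periods and durations is easiest to see across a daylight savings change. The sketch below fixes the time zone to `"America/Los_Angeles"` (an assumption made so the result does not depend on your computer's settings) and starts at noon on the day before daylight savings ended in 2019. The name `before_dst_end` is illustrative:

```{r}
library(lubridate)

# Noon on November 2, 2019 (daylight savings ended at 2am on November 3)
before_dst_end <- ymd_hms("2019-11-02 12:00:00", tz = "America/Los_Angeles")

# Adding a period of 1 day keeps the clock time at noon
before_dst_end + days(1)

# Adding a duration of 1 day adds exactly 86400 seconds, landing at 11am
before_dst_end + ddays(1)
```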
**Example**: Working with interval
```{r}
# Use `Sys.timezone()` to get timezone for your location (time is midnight by default)
scorpio_start <- ymd("2019-10-23", tz = Sys.timezone())
scorpio_end <- ymd("2019-11-22", tz = Sys.timezone())
scorpio_start
# These datetime objects have class `POSIXct`
class(scorpio_start)
# Create interval for the datetimes
scorpio_interval <- scorpio_start %--% scorpio_end # or `interval(scorpio_start, scorpio_end)`
scorpio_interval <- interval(scorpio_start, scorpio_end)
scorpio_interval
# The object has class `Interval`
class(scorpio_interval)
# Length of the interval in seconds
as.numeric(scorpio_interval)
```
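Intervals also come with helper functions for inspecting them and for testing whether a datetime falls inside them. The sketch below rebuilds the Scorpio interval with an explicit time zone (`"America/Los_Angeles"` is an assumption chosen for reproducibility; `scorpio_iv` and `halloween` are illustrative names):

```{r}
library(lubridate)

scorpio_iv <- interval(
  ymd("2019-10-23", tz = "America/Los_Angeles"),
  ymd("2019-11-22", tz = "America/Los_Angeles")
)

# Start and end points of the interval
int_start(scorpio_iv)
int_end(scorpio_iv)

# Length of the interval in seconds
int_length(scorpio_iv)

# Does a given datetime fall within the interval?
halloween <- ymd("2019-10-31", tz = "America/Los_Angeles")
halloween %within% scorpio_iv
```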
**Example**: Working with period
If we use `as.period()` to get the period of `scorpio_interval`, we see that it is a period of `30` days. We do not worry about the extra `1` hour gained due to daylight savings ending:
```{r}
# Period is 30 days
scorpio_period <- as.period(scorpio_interval)
scorpio_period
# The object has class `Period`
class(scorpio_period)
```
Because periods work with "human" times like days, they are more intuitive. For example, if we add a period of `30` days to the `scorpio_start` datetime object, we get the expected end datetime that is `30` days later:
```{r}
# Start datetime for Scorpio birthdays (time is midnight)
scorpio_start
# After adding 30 day period, we get the expected end datetime (time is midnight)
scorpio_start + days(30)
```
**Example**: Working with duration
If we use `as.duration()` to get the duration of `scorpio_interval`, we see that it is a duration of `2595600` seconds (assuming your time zone, like most of the US, ends daylight savings within this interval). It takes into account the extra `1` hour gained due to daylight savings ending:
```{r}
# Duration is 2595600 seconds, which is equivalent to 30 24-hr days + 1 additional hour
scorpio_duration <- as.duration(scorpio_interval)
scorpio_duration
# The object has class `Duration`
class(scorpio_duration)
# Using the standard 60s/min, 60min/hr, 24hr/day conversion,
# confirm duration is slightly more than 30 "standard" (i.e., 24-hr) days
2595600 / (60 * 60 * 24)
# Specifically, it is 30 days + 1 hour, if we define a day to have 24 hours
seconds_to_period(scorpio_duration)
```
Because durations work with physical time, when we add a duration of `30` days to the `scorpio_start` datetime object, we do not get the end datetime we'd expect:
```{r}
# Start datetime for Scorpio birthdays (time is midnight)
scorpio_start
# After adding 30 day duration, we do not get the expected end datetime
# `ddays(30)` adds the number of seconds in 30 standard 24-hr days, but one of the days has 25 hours
scorpio_start + ddays(30)
# We need to add the additional 1 hour of physical time that elapsed during this time span
scorpio_start + ddays(30) + dhours(1)
```
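Another way to compare the two views of the same time span is to divide an interval by a duration or by a period. This is a sketch with an explicit time zone (`"America/Los_Angeles"` is an assumption for reproducibility; the names `scorpio_iv`, `n_duration_days`, and `n_period_days` are illustrative):

```{r}
library(lubridate)

scorpio_iv <- interval(
  ymd("2019-10-23", tz = "America/Los_Angeles"),
  ymd("2019-11-22", tz = "America/Los_Angeles")
)

# Number of 24-hour durations in the interval: slightly more than 30
n_duration_days <- scorpio_iv / ddays(1)
n_duration_days

# Number of whole calendar-day periods in the interval
n_period_days <- scorpio_iv %/% days(1)
n_period_days
```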
# Appendix
## Special Characters
> "A sequence in a string that starts with a `\` is called an **escape sequence** and allows us to include special characters in our strings."
*Credit: [Escape sequences](https://campus.datacamp.com/courses/string-manipulation-with-stringr-in-r/string-basics?ex=4) from DataCamp*
**Special characters** are characters that will not be interpreted literally.
Common **special characters**:
- `\n`: newline
- `\t`: tab
- `\`: used for escaping purposes
- `\'`: literal single quote
- `\"`: literal double quote
- `\\`: literal backslash
These characters take on a new meaning when preceded by a backslash `\`. The `n` by itself is just an `n`. When you put a backslash in front of the `n`, you are escaping it and making it a special character: `\n` now represents a newline.
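One way to convince yourself that an escape sequence is a single character is to count characters with `nchar()`. A short sketch using base R only:

```{r}
# An escape sequence is typed as two characters but stored as one
nchar("n")     # the letter n
nchar("\n")    # a single newline character
nchar("\\")    # a single backslash
nchar("a\tb")  # a, a tab, and b
```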
__The `writeLines()` function__:
```{r, eval = FALSE}
?writeLines
# SYNTAX AND DEFAULT VALUES
writeLines(text, con = stdout(), sep = "\n", useBytes = FALSE)
```
- "`writeLines()` displays quotes and backslashes as they would be read, rather than as R stores them." (From [writeLines](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/writeLines) documentation)
- When we include **escape sequences** in the string, it is helpful to use `writeLines()` to see how the escaped string looks
- `writeLines()` will also output the string without the outer pair of double quotes that R adds when printing it, so we only see the content of the string
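The points above can be sketched by printing the same string both ways. The string `quote_string` is made up for illustration:

```{r}
quote_string <- "She said \"hi\"\tand left"

# print() shows the string as you would type it in R code, escapes included
print(quote_string)

# writeLines() shows the content: the quote is bare and the tab is rendered
writeLines(quote_string)

# Each element of a character vector is written on its own line
writeLines(c("first line", "second line"))
```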
**Example**: Escaping single quotes
```{r}
my_string <- 'Escaping single quote \' within single quotes'
my_string
```
Alternatively, we could've just created the string using double quotes:
```{r}
my_string <- "Single quote ' within double quotes does not need escaping"
my_string
```
Using `writeLines()` shows us only the content of the string without the outer pair of double quotes that R adds when printing strings:
```{r}
writeLines(my_string)
```
**Example**: Escaping double quotes
```{r}
my_string <- "Escaping double quote \" within double quotes"
my_string
```
Alternatively, we could've just created the string using single quotes:
```{r}
my_string <- 'Double quote " within single quotes does not need escaping'
my_string
```
Notice that the backslash still shows up in the output above: R prints the string wrapped in double quotes, so the inner double quote has to be escaped. This is no longer an issue if we use `writeLines()` to only show the string content:
```{r}
writeLines(my_string)
```
**Example**: Escaping double quotes within double quotes
```{r}
my_string <- "I called my mom and she said \"Echale ganas!\""
my_string
```
Using `writeLines()` shows us only the content of the string without the backslashes:
```{r}
writeLines(my_string)
```
**Example**: Escaping backslashes
To include a literal backslash in the string, we need to escape the backslash with another backslash:
```{r}
my_string <- "The executable is located in C:\\Program Files\\Git\\bin"
my_string
```
Use `writeLines()` to see the escaped string:
```{r}
writeLines(my_string)
```
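If you are running R 4.0.0 or later (an assumption, since older versions will not parse this syntax), raw strings offer a way to write backslashes without doubling them. A minimal sketch, with illustrative names `raw_path` and `escaped_path`:

```{r}
# Raw string syntax: r"(...)" treats backslashes literally
raw_path <- r"(C:\Program Files\Git\bin)"
escaped_path <- "C:\\Program Files\\Git\\bin"

# Both produce the same string
identical(raw_path, escaped_path)
writeLines(raw_path)
```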
**Example**: Other special characters
```{r}
my_string <- "A\tB\nC\tD"
my_string
```
Use `writeLines()` to see the escaped string:
```{r}
writeLines(my_string)
```
### Escape special characters using Twitter data
Let's take a look at some tweets from our PAC-12 universities.
- Let's start by grabbing observations 1-3 from the `text` column.
```{r}
#Twitter example of \n newline special characters
p12_df$text[1:3]
```
- Using `writeLines()` we can see the contents of the strings as they would be read, rather than as R stores them.
```{r}
writeLines(p12_df$text[1:3])
```
**Example**: Escaping double quotes using Twitter data
- Using Twitter data you may encounter a lot of strings with double quotes.
- In the example below, our string includes the escape sequences `\"` (a literal double quote) and `\n` (a newline).
```{r}
#Twitter example of \" double quotes special characters
p12_df$text[24]
```
- Using `writeLines()` we can see the contents of the strings as they would be read, rather than as R stores them.
- We no longer see the escaped characters `\"` or `\n`
```{r}
writeLines(p12_df$text[24])
```
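Once you know a string contains newline characters, you may want to count or replace them. The sketch below uses a made-up tweet (`tweet` is hypothetical, not taken from `p12_df`) with `str_count()` and `str_replace_all()` from `stringr`:

```{r}
library(stringr)

# Hypothetical tweet text containing escaped double quotes and newlines
tweet <- "Big win today!\n\"Go team\"\nSee you next week"

# Count the newline characters in the string
str_count(tweet, "\n")

# Replace each newline with a space to put the tweet on one line
tweet_one_line <- str_replace_all(tweet, "\n", " ")
writeLines(tweet_one_line)
```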