---
title: "STAT 302 Statistical Computing"
subtitle: "Lecture 2: Data Structures in R"
author: "Yikun Zhang (_Winter 2024_)"
date: ""
output:
  xaringan::moon_reader:
    css: ["uw.css", "fonts.css"]
    lib_dir: libs
    nature:
      highlightStyle: tomorrow-night-bright
      highlightLines: true
      countIncrementalSlides: false
      titleSlideClass: ["center","top"]
---

```{r setup, include=FALSE, purl=FALSE}
options(htmltools.dir.version = FALSE)
knitr::opts_chunk$set(comment = "##")
library(kableExtra)
```

# Outline

1. Vectors in R.

2. Arrays and Matrices in R.

3. Lists in R.

4. Data Frames.

5. R Coding Style Guide.

* Acknowledgement: Parts of the slides are modified from the course materials by Prof. Ryan Tibshirani, Prof. Yen-Chi Chen, Prof. Deborah Nolan, Bryan Martin, and Andrea Boskovic.

---
class: inverse

# Part 1: Vectors in R

---

# First Data Structure: Vectors

* A **data structure** is a grouping of related data values into an object.

* A **vector** is a sequence of values with the _same_ data type.

--

```{r}
# Create a numeric vector
x = c(7, 8, 10, 45, 2)
is.vector(x)
class(x)
```

The function `c()` combines all its arguments into a vector (or a list).

---

# Vectors in R

* A **vector** is a sequence of values with the _same_ data type.

```{r}
y = c(7.5, as.integer(8), 10+4i, "c")
y
class(y)
```

* If any element of a vector is of character type, R will coerce all the elements into characters.

--

```{r}
1:6
```

`1:6` is shorthand for `c(1,2,3,4,5,6)`.

---

# Vectors in R

* We can also generate vectors using functions such as `rep()` and `seq()`.

```{r}
# Sequence from 1 to 20, incrementing by a step of 5
seq(1, 20, by = 5)
# Repeat each element of a vector 3 times
rep(c(1, 2), each = 3)
# Repeat an entire vector 3 times
rep(c(1, 2), times = 3)
```

---

# Subsetting Vectors in R

* We subset a vector using `[index]` after the vector name.
```{r}
x = c(7, 8, 10, 45, 2)
# Subset the second element
x[2]
# Subset the first, second, and fourth elements
x[c(1,2,4)]
```

--

* If we use a negative index, we return the vector with that element removed.

```{r}
x[-3]
```

---

# Subsetting Vectors in R

* We can also subset a vector by a logical statement (or equivalently, a logical vector of the same length).

```{r}
x = c(7, 8, 10, 45, 2)
x[x > 9]
# Return the indices of those elements > 9
which(x > 9)
# Same output, but the code is redundant
x[which(x > 9)]
```

---

# Naming the Elements of Vectors in R

* We can give names to elements/components of vectors, and index vectors accordingly.
  - Note: Names are labels attached to the elements, not additional components of the vector.

```{r}
z = c(3, 2, 31, 10)
names(z) = c("v1","v2","v3","fred")
z
z["fred"]
z[c("v1", "fred")]
```

---

# Naming the Elements of Vectors in R

What if we only name one element of `z` in the first place?

```{r}
z = c(3, 2, 31, 10)
names(z[2]) = "b"
z
```

--

We cannot change the name of a single element of `z` this way either: the name is attached to a temporary copy of `z[2]` and is dropped when the value is assigned back. (Indexing the names instead, as in `names(z)[2] = "b"`, does work.)

```{r}
names(z[2]) = "b"
z
```

---

# Vector Arithmetic

Arithmetic operators apply to vectors in a "componentwise" fashion.

```{r}
x = c(7, 8, 10, 45, 2)
y = -1:-5
x + y
```

--

```{r}
z = c("a", "6", "7", "2", "5")
x - as.numeric(z)
```

Note: Arithmetic operations only work for numeric vectors.

---

# Vector Recycling

What if we apply arithmetic operators to two numeric vectors of different lengths?

```{r size='tiny', warning=FALSE}
x = c(7, 8, 10, 45, 2)
p = c(2, 3)
x^p
```

--

**Recycling** in R repeats the elements of the shorter vector to match the length of the longer one.

- This is useful when done on purpose, but could also lead to hard-to-catch bugs in our code!

```{r}
2*x
```

---

# Comparative and Logical Operations on Vectors

We can also do componentwise comparisons and logical operations with vectors.
```{r warning=FALSE}
x = c(7, 8, 10, 45, 2)
x > 9
(x > 9) | (x < 6)
x == c(10, 2)
sum(x > 9)
```

---

# Built-in Functions for Vectors

Many built-in functions can take vectors as arguments:

* `mean(), median(), sd(), var(), max(), min(), length()`, and `sum()` return single numbers.

* `cumsum(), cumprod(), cummax(), cummin()` return the cumulative sums, products, maxima, or minima of the elements of a vector.

* `sort()` returns the sorted vector.

* `order()` returns the indices that put the vector in sorted order.

* `hist()` takes a vector of numbers and produces a histogram, a highly structured object, with the side effect of making a plot.

* `ecdf()` similarly produces an empirical cumulative distribution function object.

* `summary()` gives the summary statistics of numerical vectors.

* `any()` and `all()` are useful on Boolean vectors.

---
class: inverse

# Part 2: Arrays and Matrices in R

---

# Second Data Structure: Arrays

* An **array** is a multi-dimensional generalization of vectors.

```{r}
x = c(7, 8, 10, 45, 20, 1)
# Create a 3-by-2 array using the elements in `x`
x_arr = array(x, dim = c(3, 2))
x_arr
```

--

```{r}
dim(x_arr)
```

For a 2-dim array, the function `dim()` tells us the numbers of rows and columns. In general, its output is a vector with one entry per dimension.

---

# Arrays in R

We can also create a 3-dim array (known as a tensor in Python).

```{r}
y = c(7, 8, 10, 45, 20, 1, 4, 2, 188, 32, 12, 34)
# Create a 3-by-2-by-2 array using the elements in `y`
y_arr = array(y, dim = c(3, 2, 2))
y_arr
```

---

# Subsetting/Indexing An Array in R

We can access a 2-dim array either using `[index,index]` or by the underlying vector (column-major order).

```{r}
is.array(x_arr)
x_arr[1,2]
y_arr[3,1,2]
x_arr[c(1,2),2]
```

---

# Subsetting/Indexing An Array in R

We can access a 2-dim array either using `[index,index]` or by the underlying vector (column-major order).
```{r}
x_arr
# View an array as a vector in a column-major order
x_arr[4]
as.vector(x_arr)
```

---

# Matrices in R

A **matrix** is the special case of an array with two dimensions.

```{r}
z_mat = matrix(c(40, 1, 60, 3, 4, 2), nrow = 3)
z_mat
is.matrix(z_mat)
is.array(z_mat)
```

* We could also specify `ncol` for the number of columns.

---

# Matrices in R

We can also generate matrices by column binding (`cbind()`) and row binding (`rbind()`) vectors.

```{r}
y = c(2, 3, 4)
arr1 = cbind(x_arr, y)
arr1
rbind(x_arr, x_arr[c(1,2),])
```

---

# Matrices in R

* We can subset a matrix in the same way as we did for an array.

* Matrices, like vectors, can only hold entries of the same data type.

```{r}
rbind(c(1, 2, 3), c("a", "b", "c"))
```

--

* We can also apply (built-in) functions to matrices as we do to vectors.

```{r}
mean(arr1)
```

---

# Matrix Multiplication

The usual multiplication `*` can only do component-wise/element-wise multiplication between two matrices.

```{r}
x_arr * y_arr[,,1]
```

--

The matrix multiplication in R is achieved by `%*%`.

```{r}
z_mat = matrix(data = c(1,2,3,12), ncol = 2)
x_arr %*% z_mat
```

---

# Other Matrix Operations

* Row/Column sum and mean:

```{r}
rowSums(x_arr)
colMeans(x_arr)
```

* Matrix transpose:

```{r}
t(x_arr)
```

---

# Other Matrix Operations

* The determinant of a square matrix:

```{r}
print(z_mat)
# The determinant of a square matrix
det(z_mat)
```

* The inverse of a matrix:

```{r}
solve(z_mat)
```

---

# Other Matrix Operations

* The `diag()` function can extract the diagonal entries of a matrix:

```{r}
diag(z_mat)
```

--

* The `diag()` function can also be used to create a diagonal matrix:

```{r}
diag(c(1,4,3))
```

---

# Names in Matrices

* We can name either rows or columns or both, with `rownames()` and `colnames()`. The rules are the same as for naming vectors.

```{r}
colnames(z_mat) = c("a", "b")
z_mat
```

--

Similarly to `names()` for vectors, we then access them by calling the function again.
```{r}
colnames(z_mat)
```

Note: Names help us understand what we are working with.

---
class: inverse

# Part 3: Lists in R

---

# Third Data Structure: Lists

A **list** is a collection of objects that are not necessarily all of the same data type and can even have different lengths.

```{r}
my_list = list("exponential", 7, FALSE, c(1,6,2))
my_list
```

---

# Subsetting a List

* We can use `[index]` as with vectors, and it will return a list.

```{r}
my_list[4]
class(my_list[4])
```

--

* If we want to extract one element of a list, we have to use `[[index]]`.

```{r}
my_list[[4]]
```

---

# Subsetting and Expanding a List

```{r}
# Subset the second sub-element in the fourth element of the list
my_list[[4]][2]
```

We can also use `[[index]]` to expand the list.

```{r}
my_list[[5]] = c("a", "3", "UW STAT")
my_list
```

---

# Contracting a List

We can also shorten the list by setting its length to something smaller (this also works for vectors).

```{r}
length(my_list)
length(my_list) = 3
my_list
```

---

# Naming a List

We can also name the elements of a list:

```{r}
names(my_list) = c("first", "num", "logical")
my_list
```

---

# Naming a List

The names of the elements of a list can be given when we initialize the list.

```{r}
my_list = list(func = "exponential", num = 7, logi = FALSE, vec = c(1,6,2))
my_list
```

---

# Subsetting a List By Name

There are two different ways to subset an element from the list by name.

```{r}
my_list[["num"]]
my_list$num
```

We will also use `$` to access a column of the data frame later...

---

# Advantage of Lists

* Lists give us a natural way to store and look up data by name, rather than by position.

--

* Lists implement a useful programming concept called **key-value pairs**, i.e., dictionaries in Python.
  - If we need to know the value of a component, we can look that up by name without caring where it is (in what position it lies) in the list.

--

* Lists are generally used when a function returns multiple results...
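---

# Example: Returning Multiple Results in a List

As a minimal sketch of the last point, a function can bundle several results into one named list, and the caller can then look each result up by name. The helper `summarize_vec()` and its component names are hypothetical and only serve as an illustration.

```{r}
# Hypothetical helper: return several summaries of a numeric vector at once
summarize_vec = function(v) {
  list(avg = mean(v), spread = sd(v), extremes = range(v))
}

res = summarize_vec(c(7, 8, 10, 45, 2))
# Look up components by name, without caring about their positions
res$avg
res[["extremes"]]
```

Because the components are named, adding another component to the returned list later would not break code that accesses `res$avg`.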
---
class: inverse

# Part 4: Data Frames in R

---

# Fourth Data Structure: Data Frames

* A **data frame** is a classic data table with $n$ rows for cases and $p$ columns for variables.

--

* A data frame can be viewed as a generalization of a named array.

--

* In principle, a data frame is a special list, with the restriction that all its components are vectors of the same length.
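---

# Example: A Data Frame Is a List of Columns

The "special list" view can be checked directly. This is a small sketch; `toy_df` and its column names are made up for illustration.

```{r}
toy_df = data.frame(id = 1:3, label = c("a", "b", "c"))
# A data frame is stored as a list whose components are the columns
is.list(toy_df)
# So list-style subsetting with `[[ ]]` and `$` extracts one column
toy_df[["label"]]
toy_df$id
# Every column has the same length, which equals the number of rows
sapply(toy_df, length)
nrow(toy_df)
```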

---

# Data Frames in R

We start by creating a matrix (2-dim array).

```{r}
a_mat = matrix(c(35, 8, 10, 4, 12, 20, 10, 11, 1, 2), ncol=2)
colnames(a_mat) = c("v1","v2")
a_mat
class(a_mat)
```

---

# Data Frames in R

We can add columns to a data frame or coerce a matrix/array to the data frame type using `data.frame()`.

```{r}
a_df = data.frame(a_mat, Date=as.Date("1965/5/15") + 1:5)
a_df
```

Note: Check what the function `as.Date()` is for. Why can we add a numeric vector to it?

---

# Data Frames in R

The functions `cbind()` and `rbind()` also work for data frames.

```{r}
a_df = cbind(a_df, binary=rbinom(5, size = 1, prob = 0.3))
a_df
```

Note: `rbinom()` generates random samples from the binomial distribution. Run `?rbinom` to check the documentation.

---

# Data Frames in R

* However, when using `rbind()`, the data type of each column in the new data frame should match that of the corresponding column in the original data frame.

```{r}
rbind(a_df, data.frame(v1=1, v2=32, Date=as.Date("2023/09/27"), binary=-1.1))
```

---

# Subset Rows/Columns of A Data Frame

```{r}
a_df$v2
a_df$Date[1:3]
a_df[,2]
a_df[-(3:4),2]
```

---

# Read Tables into R

So far, we have only created our data frames manually in R.

--

In practice, it is more common to read existing tabular data into R and carry out our analysis.

There are many different ways to read tables into R. Here are two possible ways:

```{r}
family_df = read.table(url("https://github.com/zhangyk8/zhangyk8.github.io/raw/master/_teaching/file_stat302/Data/family.txt"),
                       sep = "\t", header = TRUE)
family_df2 = read.csv(url("https://github.com/zhangyk8/zhangyk8.github.io/raw/master/_teaching/file_stat302/Data/family.txt"),
                      sep = "\t", header = TRUE)
all(family_df == family_df2)
```

The data `family.txt` can be downloaded through the link [https://github.com/zhangyk8/zhangyk8.github.io/raw/master/_teaching/file_stat302/Data/family.txt](https://github.com/zhangyk8/zhangyk8.github.io/raw/master/_teaching/file_stat302/Data/family.txt).
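---

# Example: Reading a Small Table from Text

To experiment with `sep` and `header` without downloading anything, one option is to pass a short string through the `text` argument of `read.table()`. This is only a sketch: the table contents and the names `toy_txt`, `scores_df1`, and `scores_df2` are invented and are unrelated to `family.txt`.

```{r}
# A tiny tab-separated table stored as a character string (made-up values)
toy_txt = "name\tscore\nAnn\t90\nBob\t85"
# `text` lets read.table() parse a string instead of a file or URL
scores_df1 = read.table(text = toy_txt, sep = "\t", header = TRUE)
# read.csv() forwards `text` to read.table(); we override its default sep = ","
scores_df2 = read.csv(text = toy_txt, sep = "\t", header = TRUE)
scores_df1
# Both calls should produce the same data frame here
all(scores_df1 == scores_df2)
```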
---

# Read Tables into R

If the data file is in our current working directory, then we do not have to use the function `url()` to access it.

```{r}
family_df3 = read.table("family.txt", sep = "\t", header = TRUE)
class(family_df3)
head(family_df3)
```

---

# Post-Analysis After Reading the Tables

```{r out.width="40%", fig.align='center'}
# Find all the unique first names
unique(family_df$firstName)
# Histogram of BMIs for all individuals
hist(family_df$bmi, xlab = "BMIs", main = "Histogram of BMIs for all individuals")
```

---

# Working Directory in R

A working directory is the file path that R uses to save and look for data.

* We can check our current working directory using `getwd()`.

```{r}
getwd()
```

* We can change our working directory using `setwd()`.

--

```{r}
setwd("/media/yikun/Disk_D1/Graduate School/STAT 302/Lectures")
```

Note: Do not change the working directory inside R Markdown files! By default, R Markdown uses the directory containing the `.Rmd` file as the working directory.

---

# Saving Tables in R

We can save a single R object as an `.rds` file using `saveRDS()`, and multiple R objects as an `.RData` or `.rda` file using `save()`.

```{r}
object1 = 1:5
object2 = c("a", "b", "c")
# save only object1
saveRDS(object1, file = "object1_only.rds")
# save object1 and object2
save(object1, object2, file = "both_objects.RData")
```

--

If we want to save a data frame, it is recommended to write it as a `.csv` or `.txt` file.

```{r}
write.table(family_df, file = "family_newsave.txt", sep = "\t")
write.csv(family_df, file = "family_newsave.csv")
```

---
class: inverse

# Part 5: R Coding Style Guide

---

# Object Names

Use either underscores (`_`) or big camel case (`BigCamelCase`) to separate words within an object/variable name. Try to avoid using dots `.` to separate words in R functions!

```{r, eval = FALSE}
# Good
day_one
day_1
DayOne

# Bad
dayone
```

---

# Object Names

Names should be concise, meaningful, and (generally) nouns.
```{r, eval = FALSE}
# Good
day_one

# Bad
first_day_of_the_month
djm1
```

---

# Object Names

It is **very important** that object names do not overlap with common functions!

```{r, eval = FALSE}
# Very extra super bad
c = 7
t = 23
T = FALSE
mean = "something"
```

Note: `T` and `F` are R shorthand for `TRUE` and `FALSE`, respectively. In general, we should spell them out for clarity.

```{r}
mean(c(1, 2))
```

---

# Spacing

Put a space after every comma, just as in written English.

```{r, eval = FALSE}
# Good
x[, 1]

# Bad
x[,1]
x[ ,1]
x[ , 1]
```

Do not put spaces inside or outside parentheses for regular function calls.

```{r, eval = FALSE}
# Good
mean(x, na.rm = TRUE)

# Bad
mean (x, na.rm = TRUE)
mean( x, na.rm = TRUE )
```

---

# Spacing with Operators

Most of the time when we are doing math, conditionals, logicals, or assignments, our operators should be surrounded by spaces (e.g. for `==`, `+`, `-`, `<-`, etc.).

```{r, eval = FALSE}
# Good
height = (feet * 12) + inches
mean(x, na.rm = 10)

# Bad
height=feet*12+inches
mean(x, na.rm=10)
```

There are some exceptions we will learn more about later, such as the power symbol `^`. See the [Tidyverse Style Guide](https://style.tidyverse.org/) for more details!

---

# Extra Spacing

Adding extra spaces is fine if it improves alignment of `=` or `<-`.

```{r, eval = FALSE}
# Good
list(
  total = a + b + c,
  mean  = (a + b + c) / n
)

# Also fine
list(
  total = a + b + c,
  mean = (a + b + c) / n
)
```

---

# Long Lines of Code

Strive to limit our code to 80 characters per line. This fits comfortably on a printed page with a reasonably sized font.

If a function call is too long to fit on a single line, use one line each for the function name, each argument, and the closing `)`. This makes the code easier to read and to modify later.
```{r, eval = FALSE}
# Good
do_something_very_complicated(
  something = "that",
  requires = many,
  arguments = "some of which may be long"
)

# Bad
do_something_very_complicated("that", requires, many, arguments,
                              "some of which may be long"
                              )
```

*Tip! Try RStudio > Preferences > Code > Display > Show Margin with Margin column 80 to give us a visual cue!*

---

# Semicolons

In R, semi-colons (`;`) are used to execute multiple pieces of R code on a single line.

* In general, this is bad practice and should be avoided. Also, we never need to end lines of code with semi-colons!

```{r, eval = FALSE}
# Bad
a = 2; b = 3

# Also bad
a = 2;
b = 3;

# Good
a = 2
b = 3
```

---

# Quotes and Strings

Use `"`, not `'`, for quoting text. The only exception is when the text already contains double quotes and no single quotes.

```{r, eval = FALSE}
# Bad
'Text'
'Text with "double" and \'single\' quotes'

# Good
"Text"
'Text with "quotes"'
'A link'
```

--

### Useful References for R Coding Style Guide

* [Tidyverse Style Guide](https://style.tidyverse.org/) by Hadley Wickham.

* [Google Style Guide](https://google.github.io/styleguide/Rguide.html).

These style guides help other people understand our code!

---

# Tidy Data Principles

There are three rules required for data (or a data frame/table) to be considered tidy:

1. Each variable must have its own column.

2. Each observation must have its own row.

3. Each value must have its own cell.

---

# Tidy Data Principles (Example 1)

The rules seem simple, but using them can be tricky! Let's consider the following example. What is untidy about the following data frame?

```{r, echo = FALSE}
library(kableExtra)
untidy_dat = data.frame("Hospital" = c("A", "B", "C", "D"),
                        "Diseased" = c(10, 15, 12, 5),
                        "Healthy" = c(14, 18, 13, 16))
kable_styling(kable(untidy_dat, align = "c"))
```

--

* **Variables:** hospital, disease status, and counts.

--

* **Observations:** the number of individuals at a given hospital and of a given disease status.
--

* **Values:** Hospital A, Hospital B, Hospital C, Hospital D, individual count values, *Disease Status "Healthy"*, and *Disease Status "Diseased"*.

---

# Tidy Data Principles (Example 1)

```{r, echo = FALSE}
library(kableExtra)
untidy_dat = data.frame("Hospital" = c("A", "B", "C", "D"),
                        "Diseased" = c(10, 15, 12, 5),
                        "Healthy" = c(14, 18, 13, 16))
kable_styling(kable(untidy_dat, align = "c"))
```

The main problem is that the column headers are values, not variables! How can we tidy it up?

--

```{r, echo = FALSE}
tidy_dat = data.frame("Hospital" = rep(c("A", "B", "C", "D"), each = 2),
                      "Status" = rep(c("Diseased", "Healthy"), 4),
                      "Count" = c(10, 14, 15, 18, 12, 13, 5, 16))
kable_styling(kable(tidy_dat, align = "c"))
```

---

# Tidy Data Principles (Example 2)

Let's consider another example:

```{r, echo = FALSE}
untidy_dat2 = data.frame("Country" = c("A", "B"),
                         "Year" = rep(2018, 2),
                         "m_16_24" = c(49, 34),
                         "m_25_34" = c(55, 33),
                         "f_16_24" = c(47, 50),
                         "f_25_34" = c(41, 43))
kable_styling(kable(untidy_dat2))
```

--

* **Variables:** Country, year, gender, age group, and counts.

* **Observations:** the number of individuals in a given country, in a given year, of a given gender, and in a given age group.

* **Values:** Country A, Country B, Year 2018, Gender "m", Gender "f", Age Group "16_24", Age Group "25_34", and individual counts.

---

# Tidy Data Principles (Example 2)

The tidy version is as follows:

```{r, echo = FALSE}
tidy_dat2 = data.frame("Country" = rep(c("A", "B"), each = 4),
                       "Year" = rep(2018, 8),
                       "Gender" = rep(c("m", "m", "f", "f"), 2),
                       "Age_Group" = rep(c("16_24", "25_34"), 4),
                       "Counts" = c(49, 55, 47, 41, 34, 33, 50, 43))
kable_styling(kable(tidy_dat2, align = "c"))
```

Note: In R, this can be done via the `pivot_longer()` function in the `tidyr` package. We will discuss this in detail later...

---

# Guidelines for Making Data Tidy

1. Identify the observations, variables, and values.

2. Ensure that each observation has its own row.
    * Be careful about individual observations spreading over multiple tables, Excel files, etc., or multiple types of observations within a single table (this would result in many empty cells).

3. Ensure that each variable has its own column.
    * Be careful about variables spreading over two columns and multiple variables within a single column.

4. Ensure that each value has its own cell.
    * Be careful about values as column headers.

---

# Why Do We Need Tidy Data?

* Easier to read and understand the data.

* More intuitive to analyze and plot the data using R (required for `ggplot2`).

* Fewer issues with missing values.

### Useful References for Tidy Data Principles

Here is a [fantastic reference](https://vita.had.co.nz/papers/tidy-data.pdf) written by Hadley Wickham going through all these principles in detail and with more examples.

---

# Summary

- Data structures allow us to group related values together.

- Vectors group together values with the same data type.

- Arrays add multi-dimensional structure to vectors, while matrices are two-dimensional arrays.

- Lists allow us to combine data of different types and lengths.

- Data frames are hybrids of matrices and lists, allowing each column to have a different data type but the same length.

- Tidy data principles help us better analyze and visualize data tables in R.

Submit Lab 2 on Gradescope by the end of Tuesday (January 23)!!