--- title: "Class 8: Introduction to matricies" author: - name: "Kristen Wells" url: https://github.com/kwells4 affiliation: "RNA Bioscience Iniative, Barbara Davis Diabetes Center" orcid_id: 0000-0002-7466-8164 output: distill::distill_article: self_contained: false toc: true draft: false preview: img/matrix_image.jpg --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE) ``` *The Rmarkdown for this class is [on github](https://raw.githubusercontent.com/rnabioco/bmsc-7810-pbda/main/_posts/2023-12-08-class-8-matricies/class8_matricies.Rmd)* ## Goals for this class * Learn what is a matrix * Describe difference between matrix and data frame * Perform mathematical functions * Convert between matrix and data frames ## Load packages ```{r load-packages} library(tidyverse) library(here) ``` ## Download files Before we get started, let's download all of the files you will need for the next three classes. ```{r} # conditionally download all of the files used in rmarkdown from github source("https://raw.githubusercontent.com/rnabioco/bmsc-7810-pbda/main/_posts/2023-12-08-class-8-matricies/download_files.R") ``` ## What is a matrix? A Matrix is an 2 dimensional object in R. We create a matrix using the `matrix` function ```{r} M <- matrix(c(10:21), nrow = 4, byrow = TRUE) M ``` We can also use `as.matrix` on an existing dataframe ```{r} df <- data.frame("A" = c(10:13), "B" = c(14:17), "C" = (18:21)) df ``` ```{r} new_mat <- as.matrix(df) new_mat ``` Just like data frames, we can name the rows and columns of the Matrix ```{r} rownames(new_mat) <- c("first", "second", "third", "forth") colnames(new_mat) <- c("D", "E", "F") new_mat ``` We can look at the structure of the matrix using `str` ```{r} str(new_mat) ``` Here you can see that the type of this structure is `int` because it is a matrix consisting of integers. We can also see the row names and column names. As with data frames, we can check the size of the matrix using `nrow`, `ncol` and `dim` ```{r} nrow(new_mat) ncol(new_mat) dim(new_mat) ``` We can also access data using brackets`[` Selecting a single value: ```{r} new_mat[1,2] ``` Selecting a section of the matrix: ```{r} new_mat[1:3,2] ``` If we don't provide an index for the row, R will return all rows: ```{r} new_mat[, 3] ``` The same is true for the columns ```{r} new_mat[3,] ``` Because this matrix has row and column names, we can also pull out data based on those ```{r} new_mat["second", "D"] ``` **Exercise** What value is in row 2 and column 3 of `new_mat`? ```{r} # TODO find the value in the matrix at row 2 and column 3 ``` **If we can make a matrix from a data frame, what's the difference?** Matrices can only have values of one type --> integer, boolean, character, while a dataframe can be a mix of types: ```{r} df <- data.frame("A" = c(10:12), "B" = c("cat", "dog", "fish"), "C" = c(TRUE, TRUE, FALSE)) df ``` ```{r} M <- as.matrix(df) M ``` ```{r} typeof(df[,1]) typeof(M[,1]) ``` But Matrices can take any type of input ```{r} M <- matrix(rep(c(TRUE, FALSE), 4), nrow = 4, byrow = TRUE) M ``` ```{r} typeof(M[,1]) ``` ## Matrix opearions If you've taken linear algebra, you've probably worked with matrices before. These same matrix operations can be done in R ### Basic operations We can do any of the mathematical operations for a matrix and one value. For example, we can add 5 to all values in a matrix, or subtract 2, or divide by 10 ```{r} M <- matrix(c(10:21), nrow = 4, byrow = TRUE) M M + 1 M + 2 M - 5 M / 3 M * 10 ``` We can also provide a vector or another matrix to perform element-wise functions with. ```{r} vector <- c(2, 3, 4, 5) M + vector ``` Here you can see that each element of the vector is added to a row ie element 1 is added to row 1, element 2 is added to row 2, etc. The same is true for subtraction ```{r} M - vector ``` And multiplication and division ```{r} M / vector M * vector ``` What happens if there are a different number of rows as elements in the vector? ```{r} vector <- c(2, 3, 4) M M + vector ``` Note how the vector just gets reused, no error is thrown. We can also perform these operations on two matrices ```{r} M1 <- matrix(c(10:21), nrow = 4, byrow = TRUE) M2 <- matrix(c(110:121), nrow =4, byrow = TRUE) M1 M2 M1 + M2 ``` Note how elements in the same position of each matrix are added together *Note this also is true of vectors* ```{r} v1 <- c(1,2,3) v2 <- c(4) v1 + v2 ``` ```{r} v3 <- c(5, 6) v1 + v3 ``` ```{r} v4 <- c(10, 11, 12, 13, 14, 15) v1 + v4 ``` **Exercise** Multiply, subtract, and divide the two matrices `M1` and `M2` ```{r} # TODO multiply, subtract and divide M1 and M2 ``` ### Matrix multiplication We will only briefly touch on matrix multiplication, but one reason matrices are very important in R is that you can perform multiplication with them. Exactly how this is done is explained nicely in a [math is fun](https://www.mathsisfun.com/algebra/matrix-multiplying.html) tutorial. Let's try to multiply two matrices together. Remember our first matrix has 4 rows and 3 columns: ```{r} dim(M) ``` So our new matrix must have 3 rows ```{r} M2 <- matrix(c(5:19), nrow = 3, byrow = TRUE) M2 ``` Let's perform matrix multiplication with these ```{r} M %*% M2 ``` ## Performing functions on a matrix So far, a matrix has looked a lot like a dataframe with some limitations. One of the places where matrices become the most useful is performing statistical functions because all items in a matrix are of the same type. For this next section, let's use some data that I downloaded from the social security office. This has the top 100 boy and girl names by state for 2020. We can now read in the data and convert it to a matrix ```{r} names_mat <- read.csv(here("class_8-10_data", "boy_name_counts.csv"), row.names = 1) %>% as.matrix() names_mat[1:5, 1:5] ``` Above you can see that we have the number of males with each name in each state. Looking at the structure, we can see that it is an integer matrix. ```{r} str(names_mat) ``` We can now explore this data set using many of the functions you have already learned such as `rowSums` and `colSums` ### Basic functions First, lets find the sum for all of the rows - how many total babies were named each name? ```{r} rowSums(names_mat) ``` And then the columns - how many babies were included from each state? ```{r} colSums(names_mat) ``` What if we want to find the percent of children with a given name across all states (divide the value by the row sum * 100) - what percent of total babies for each name came from each state: ```{r} percent_mat <- names_mat / rowSums(names_mat) * 100 percent_mat[1:5, 1:5] ``` *Remember from above that division using a vector will divide every element of a row by one value, so we can only do this using `rowSums`. In a few minutes we will discuss how do do this on the columns.* ### Summary functions We can also find the minimum, maximum, mean, and median values of the whole matrix and any column. First, lets get summary data for the whole matrix using `summary` ```{r} summary(names_mat)[ , 1:3] ``` You can see that this calculates the min, max, mean, median, and quartiles for the columns. What if we just want the minimum value for the "Alabama" names? We can run `min` while subsetting to just the column of interest ```{r} min(names_mat[, "Alabama"]) ``` We can do the same for the rows Lets try this for "William" ```{r} min(names_mat["William",]) ``` What if we wanted to find the smallest value in the whole matrix? ```{r} min(names_mat) ``` `max` works the same as min **Exercise** Find the maximum value in the for "Noah" ```{r} # TODO Find the maximum value for Noah and the whole matrix ``` We can also find the mean, median, and standard deviation of any part of the matrix By row: ```{r} mean(names_mat["William", ]) ``` ```{r} median(names_mat["William", ]) ``` ```{r} sd(names_mat["William", ]) ``` By column: ```{r} mean(names_mat[ , "Alabama"]) ``` ```{r} median(names_mat[ , "Alabama"]) ``` ```{r} sd(names_mat[ , "Alabama"]) ``` ### Transposition One important quality of a matrix is being able to transpose it to interchange the rows and columns - here the rows become columns and columns become rows. We transpose using `t()` to the matrix. Let's first look at this using the matrix we started with ```{r} M <- matrix(c(10:21), nrow = 4, byrow = TRUE) M ``` ```{r} t(M) ``` Note that the output of transposing either a matrix or a data frame will be a matrix (because the type within a column of a data frame must be the same). ```{r} df t(df) str(df) str(t(df)) ``` Note how after the transposition, all items in the original df are now characters and we no longer have a dataframe. Now let's try this transposition on the names matrix we've been working with ```{r} transposed_mat <- t(names_mat) transposed_mat[1:3,1:3] ``` Note how the columns are now names and the rows are now states. Remember the note above where we could only divide by the `rowSums`? Now we can use this transposition to figure out the percent of children in each state with a given name (divide the value by the column sum * 100) ```{r} state_percents <- transposed_mat / rowSums(transposed_mat) * 100 state_percents <- t(state_percents) state_percents[1:3, 1:3] ``` Above we did this in several steps, but we can also do in in one step: ```{r} state_percents_2 <- t(t(names_mat) / colSums(names_mat)) * 100 identical(state_percents, state_percents_2) ``` ### Statistical tests We can also use matrices to perform statistical tests, like t-tests. For instance, are the names Oliver and Noah, or Oliver and Thomas used different amounts? First, let's normalize the data to account for the fact that each state reported different numbers of births. To do this normalization, let's first divide each value by the total number of children reported for that state. Remember, we need to first transpose the matrix to be able to divide by the `colSums` ```{r} normalized_mat <- t(t(names_mat) / colSums(names_mat)) ``` Now that we have normalized values, we can do a t-test. ```{r} normalized_mat["Oliver", 1:3] normalized_mat["Noah", 1:3] normalized_mat["Thomas", 1:3] ``` ```{r} t.test(normalized_mat["Oliver",], normalized_mat["Noah",]) ``` Between Oliver and Noah, there does not seem to be a difference with the data we have. What about Oliver and Thomas? ```{r} t.test(normalized_mat["Oliver",], normalized_mat["Thomas",]) ``` Here we can see that there is a difference between the mean values for Oliver and Thomas using a `t.test` ## Using dataframes and matricies For many of the `tidyverse` functions you've learned so far, a data frame is required. Fortunately, it is very easy to change between a data frame and a matrix. ```{r} normalized_dat <- data.frame(normalized_mat) str(normalized_dat) ``` Once we can move between matrices and data frames, we can start to tidy our data for plotting purposes. Let's plot the distribution of name usage as a violin plot. Here we want the counts to be the y axis and the names to be the y axis. The first thing we need to do is make our matrix into a data frame ```{r} names_dat <- data.frame(names_mat) ``` Next, we will want the names to be a column rather than the row names. We can do this using `$` or `tibble::rownames_to_column` ```{r} names_dat <- rownames_to_column(names_dat, "name") names_dat[1:3,1:3] # To set using $ # names_dat$name <- rownames(names_dat) ``` Next, we need to `pivot_longer` from `tidyr`. We want to take everything but the names column ```{r} pivot_columns <- colnames(names_dat)[colnames(names_dat) != "name"] names_dat <- pivot_longer(names_dat, cols = all_of(pivot_columns), names_to = "state", values_to = "count") ``` Note, we can use the pipe `%>%` from `dplyr` to put all of this into one statement. ```{r} # Here we will specify the columns to keep first pivot_columns <- colnames(names_mat) names_dat <- names_mat %>% data.frame %>% rownames_to_column("name") %>% pivot_longer(cols = all_of(pivot_columns), names_to = "state", values_to = "count") ``` With this new data frame, we can now plot the distribution of names ```{r} # I first set the theme theme_set(theme_classic(base_size = 10)) ggplot(names_dat, aes(x = name, y = count, fill = name)) + geom_violin() + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) # rotate x axis ``` There are a few outliers here, almost certainly California. As we discussed above, normalizing the data helps put everything onto the same scale. **Exercise** Can you make the same plot as above but use our normalized values? ```{r} # TODO make plot above with normalized values # Hint start with normalized_dat ``` ## I'm a biologist, why should I care? * Many of your datasets can be analyzed using matrices * If you analyze RNA-seq data, all of the processing and differential testing will be done in a matrix through `DESeq2` * If you analyze protein data, that is best done in a matrix * Single cell RNA-seq data is analyzed using matrices, especially sparse matrices * Even just many measurements can be analyzed using a matrix * Matrices are efficient object types for most types of analysis you would want to do ## Acknowldgements and additional references The content of this class borrows heavily from previous tutorials: * Matrices * [R example](https://bookdown.org/manishpatwal/bookdown-demo/matrices-in-r.html) * [Quick tips](https://monashbioinformaticsplatform.github.io/r-intro/matrices.html)