--- title: "Lesson 6: Joining tables" date: 2024-06-04 format: html: standalone: true embed-resources: true number-sections: true toc: true editor: visual --- # Prepare the R environment ```{r} #| output: false library(here) library(readr) library(dplyr) library(tidyr) ``` If you're looking at the quarto doc for this listen, you might notice we're using the `#| output: false` option for this code chunk. This hides all of the output of the code in this chunk when we render this quarto document into an HTML file. The *here* library lists the full path to the current project directory on your local computer. While this is useful to see when you're running the code, we may not want to include this in the final rendered HTML report (particularly if we're publishing it on the internet). # Load RNA-seq data from the circadian gene expression atlas For this lesson we'll be working with gene expression data from an atlas of circadian transcription in the mouse. We'll be focusing specifically on core clock gene expression across the liver, kidney, and skeletal muscle. We prepared these data files using the *recount3* package. The recount project, and its associated R package, provide pre-processed expression data from over 10,000 mouse datasets. We will eventually add the code we used to prepare these files to the ITMAT office houRs website (look in "Notes & Nibbles" section on the Lessons page). We'll start by using functions from the *readr* package to load these tsv files into R. ```{r} clock_gene_expression <- read_tsv(here("DATA/bhtc_rnaseq_gene-expression_core-clock-genes_long-format.txt")) ``` ```{r} sample_metadata <- read_tsv(here("DATA/bhtc_rnaseq_sample-metadata.txt")) ``` ```{r} gene_metadata <- read_tsv(here("DATA/bhtc_rnaseq_gene-metadata_gencode-v23.txt")) ``` Now that we've loaded the data from these files, examine their contents. Compare what information each data table contains. Is there any information shared across data tables? ```{r} clock_gene_expression ``` ```{r} sample_metadata ``` ```{r} gene_metadata ``` # Join data from different tables We have one table that contains the read counts for every core clock gene, across several tissues. ```{r} head(clock_gene_expression) ``` This is not particularly useful. If we wanted to look at *Bmal1* expression in muscle tissue, how would we identify those samples? We could use information from the other tables, in combinations with the `filter()`{.r} function to access our samples of interest. ```{r} ``` While this gives us the samples we want, we're still missing a lot of the rich information contained in those other data tables (e.g. timepoints, gene name). This is where the `join_*`{.r} family of functions comes in. Use the `left_join()`{.r} command to combine information from the gene expression and gene metadata tables. ```{r} clock_gene_expression |> left_join(gene_metadata) ``` The `left_join()`{.r} function identified the 'gene_id' column as common between the two tables (called the "key"). It then used the values in the 'gene_id' columns to line up the rows in each column before joining them into a single table. We can tell this from the status message the `left_join()`{.r} function prints. Take note of the number of rows in the two input tables, compared to the table created by the `left_join()`{.r} function. ```{r} paste("Clock gene table:", dim(clock_gene_expression)[1], "rows") paste("Gene metadata table:", dim(gene_metadata)[1], "rows") paste("Left join result table:", dim(left_join(clock_gene_expression, gene_metadata, ))[1], "rows") ``` ## Different types of join Let's examine the help documentation for the `left_join()`{.r} function. ```{r} #| eval: false help("left_join") ``` Note that the first argument (*x*) is considered the "left table", and the second argument (*y*) is considered the "right table". In a `left_join()`{.r}, we start with *x* as the main table and then add columns from all of the matching rows in *y*. For comparison, let's see what happens if we use the `right_join()`{.r} function (pay attention to the number of rows in the result). ```{r} clock_gene_expression |> ``` In this case, the `right_join()`{.r} function is adding all of the columns in *x* ("left table") that match rows in *y* ("right table"), and joining them to the *y* table. This keeps all of the rows in *y*. Though we're not going to examine them here, there are other types of "mutating joins" (i.e. joins adding new columns to an existing table) offered by the *dplyr* package. The section on joins from Hadley Wickham's *R for Data Science* book has a [helpful figure](https://r4ds.hadley.nz/joins.html#fig-join-venn) summarizing the different types of joins. ![](https://r4ds.hadley.nz/diagrams/join/venn.png) ## Controlling the join Now let's join the sample metadata table to the gene expression table. ```{r} clock_gene_expression |> left_join(sample_metadata) ``` What do you think happened during the join? Hint: take a look at the contents of the two input tables, paying close attention to the column from each table that the `left_join()`{.r} function chose as the key. Looking at the two input tables, can we find a different set of columns to use as the key to join the tables? We can use the 'by' argument from any join function, in combination with the `join_by()`{.r} function, to control which columns from each table serve as the keys. ```{r} clock_gene_expression |> left_join(sample_metadata, by = ) ``` Note, you may see code that uses the 'by' argument without the `join_by()`{.r} function. ```{r} clock_gene_expression |> left_join(sample_metadata, by = ) ``` This is the older way of defining keys, before the introduction of the `join_by()`{.r} function. While this older convention is still supported by *dplyr*, the `join_by()`{.r} function offers more flexibility when defining key relationships between tables. ## Combining joins In closing, let's combine what we learned above to create a table that combines expression data with helpful gene metadata (like gene names) and sample metadata (like tissues and timepoint). We'll construct this table in stages. Let's start by joining the gene expression table with the gene metadata table. Note, when we specify a table in the `join_by()`{.r} function, we can include additional tidyverse operations to modify either table we're joining. ```{r} ``` In this case, we reduced the gene metadata table just two columns: 'gene_id' (the key for joining to the gene expression table) and 'gene_name' (the information we want to add to the gene expression table). Now let's use another join to add the sample metadata. Again, we'll use some additional operations to modify the sample metadata prior to the join. ```{r} ``` Now we have table that would be useful for generating plots or including with a publication. For quick reference, the *dplyr* cheatsheet contains a section summarizing the different join functions. Again, Hadley Wickham's *R for Data Science* also has an excellent [section on the join operations](https://r4ds.hadley.nz/joins.html) that dives more deeply into the various types of joins. # A primer on databases The three files we used above are an example of a simple database. We stored the data across multiple tables, which we can connect to each other using "keys" (i.e. columns with unique identifiers shared between tables). This general style of constructing a database is called the "relational database model". Inside R, we can use a series of functions/operations (e.g. `join()`{.r}, `select()`{.r}, `filter()`{.r}) to extract different subsets of the data across the different tables. In a database system, we use analogous operations to create a "query" that extracts the data we want. *All of Us* stores data using a relational model, so it's useful for us to be familiar with these operations. We'll need to use joins and filters to when working with *All of Us* data spread across multiple tables. The *All of Us* [data dictionary](https://support.researchallofus.org/hc/en-us/articles/360033200232-Data-Dictionaries-for-the-Curated-Data-Repositories-CDRs) lists the tables and columns in the database. This will serve as a guide when we construct queries to extract our datasets of interest and plan our analyses. # R session information Here we report the version number for R and the package versions we used to perform the analyses in this document. ```{r} sessionInfo() ```