# 1. Set up the environment ----

# We will need to load these 3 libraries
library(visdat)    # for initial inspection of data
library(tidyverse) # for manipulation of data
library(plotly)    # for building interactive plots

# Set your working directory to the course folder
## using setwd()


# 2. Read the data ----

# Please note: here we are using `read_tsv()` because the original file is
# tab-separated, not comma-separated ('csv')
mydata <- read_tsv("../data/TPM-light-WT-17c-27c-RNA-seq-average-rep1-rep2_misexpressed.tsv")


# 3. Check that dataset does not contain obvious problems ----

# Guide questions:
# - how many dimensions does the data have?
# - what are the column names?
# - what kind of data do these variables contain?
# - what are the variable types?
# - are there any missing values?

# Hint: check the functions that we have discussed in module02:
# View(), vis_dat(), dim(), glimpse(), names()


# 4. Data manipulation ----

# Now, let's reshape our data. We will do it in several steps and
# ultimately we will generate a tidy dataset, suitable for applying ggplot
# functions.

# 4.1 Reshape dataset from wide to long format ----
# Hint: one of two: gather() or spread() should be able to help you.

# Check dimensions of the transformed dataset. Do they match your expectations?
# Hint: compare with dimensions of the original dataset with dim() function.

# 4.2 Tidy variables ----
# You might have noticed that now you have a column that contains several
# variables crammed together, including time and temperature. Let's split it
# into 4 columns: "units", "genotype", "time" and "temperature" using the
# separate() function.

# Does the transformed dataset look as expected? Use View() or head() to make sure.

# 4.3 Drop "units" and "genotype" columns ----
# The values in these columns are always the same.
# Hint: use select() function

# 4.4 Filter rows with low expression values ----
# When dealing with expression data we often have lots of genes that are barely
# expressed. Let's get rid of them; this should slightly reduce the size of our
# dataset.
# Hint: use filter() to select rows with expression values greater than or equal to 1 TPM.

# Don't forget to check the dimensions. Can you tell how many rows have been
# dropped?


# 5. Build plots ----

# 5.1 Initial visualization ----
# After all the transformations our dataset is ready to be plotted.
# For example: what is the distribution of TPM values across the 3 time points
# and 2 temperatures?
## Hint: initialize plot with ggplot(), specify axes and plot aesthetics inside aes(),
## select a suitable geom_*.
## When in doubt, have a look at the examples provided in the module02 materials.

# 5.2 Look at your favourite genes ----
# Say, now we are interested in 10 specific genes and we want to visualize their
# expression in different temperatures over time.

# Here are our genes of interest:
genes_of_interest <- c("AT1G67090", "AT5G19240", "AT1G31580", "AT3G12580",
                       "AT1G80920", "AT3G54660", "AT2G25110", "AT1G19530",
                       "AT3G23990", "AT3G30775")

# Now, let's subset expression data (data_long) for only our genes_of_interest.
# Hint: use filter() and %in%

# Build line plots for every gene of interest, facet by gene_id. Can you put
# measurements for both temperatures in a single plot?
## Hint: you will need to use the "group" aesthetic to tell geom_line which
## points should be connected to each other

# 5.3 Save transformed data to file ----


# Bonus challenge for superheroes ----
## Now do all the same steps, from data transformation to plotting, with pipes
## (excluding exploratory stages)!
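
# Worked sketch on toy data (not the course data) ----
# The reshaping steps above (wide to long, separate, select, filter) can be
# chained with pipes. Below is a minimal sketch of what such a chain might look
# like, run on a tiny made-up table. Everything here is an assumption for
# illustration only: the object names (toy_wide, toy_long, toy_plot), the gene
# IDs, the TPM values, and the "TPM_WT_<time>_<temperature>" column naming
# scheme. Check names(mydata) for the real column names and adjust `sep` and
# `into` accordingly.
library(tidyverse)

# Hypothetical wide table: one row per gene, one column per sample
toy_wide <- tibble(
  gene_id       = c("geneA", "geneB"),
  TPM_WT_1h_17c = c(12.0, 1.2),
  TPM_WT_1h_27c = c(30.5, 0.0),
  TPM_WT_2h_17c = c(8.1, 1.5),
  TPM_WT_2h_27c = c(25.0, 0.4)
)

toy_long <- toy_wide %>%
  # wide to long: every column except gene_id becomes a (sample, tpm) pair
  gather(key = "sample", value = "tpm", -gene_id) %>%
  # split the sample label into its four components
  separate(sample, into = c("units", "genotype", "time", "temperature"), sep = "_") %>%
  # drop the two columns that are constant in this toy table
  select(-units, -genotype) %>%
  # keep only rows with expression of at least 1 TPM
  filter(tpm >= 1)

# Compare dimensions before and after the transformation
dim(toy_wide)
dim(toy_long)

# One line per temperature, one panel per gene; print(toy_plot) to display it
toy_plot <- ggplot(toy_long,
                   aes(x = time, y = tpm, colour = temperature, group = temperature)) +
  geom_line() +
  geom_point() +
  facet_wrap(~ gene_id)

# gather() is used here only because the hint in 4.1 mentions it;
# tidyr::pivot_longer() is the newer equivalent.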