### R script for the hands-on examples
### Week 7

## The `apply()` Function ------------------------------------------------------------

## 1. Import the gene expression data file "read-counts.csv" into RStudio
##    and name it `counts`.

## 2. Calculate the following metrics for each row (gene) and
##    each column (sample) using the `apply()` function:
##    - Mean (`mean()`)
##    - Variance (`var()`)
##    - Minimum and maximum (`min()`, `max()` or use `range()`)
##    What is the expected length or dimension of the outputs?
##    Attention: exclude the 1st column from the calculation.


## The `lapply()` Function ------------------------------------------------------------

### Check GC Content ------------------------------------------------------------

## The GC content is the percentage of guanine (G) and cytosine (C) in a DNA or RNA sequence.
## It is one of the metrics that can be used in sequencing quality control
## (e.g., to detect contamination).
## Here we will use a small example to see how this metric is calculated.

# Generate a list of DNA sequences
dna_sequences <- list(
  seq1 = "ATGCGTAGCTAGGCTATCCGA",
  seq2 = "CGCGTTAGGCAAGTGTTACG",
  seq3 = "GGTACGATCGATGCGCGTAA",
  seq4 = "TTTAAACCCGGGATATAAAA"
)

dna_sequences[["seq1"]] # the 1st DNA sequence

## 1. Split the 1st DNA sequence into individual bases using the function `strsplit()`. (See `?strsplit`)
seq1 <- strsplit(dna_sequences[["seq1"]], split = "")
seq1

## 2. Convert the result of the split to a vector.

## 3. Count the number of G and C bases among all bases.

## 4. Calculate the percentage of GC in the whole sequence.

## 5. Write a function which takes one sequence as input and calculates the GC content.
##    Test your function on the 1st sequence of the list.

## 6. Use `lapply()` or `sapply()` to apply the created function to the list of sequences.

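
## A minimal sketch of steps 2-6 above, offered as one possible approach rather
## than the official solution. The function name `gc_content` and the toy list
## `toy_seqs` are hypothetical; the sketch assumes upper-case sequences made of
## A, C, G and T only.
gc_content <- function(sequence) {
  # strsplit() returns a list, so unlist() turns it into a character vector
  bases <- unlist(strsplit(sequence, split = ""))
  # count the bases that are either G or C
  n_gc <- sum(bases %in% c("G", "C"))
  # GC content as a percentage of all bases
  100 * n_gc / length(bases)
}

# Toy sequences chosen so that the results are easy to check by hand
toy_seqs <- list(all_gc = "GGCC", no_gc = "ATAT", half_gc = "ATGC")

gc_content(toy_seqs[["all_gc"]])  # test on a single sequence
sapply(toy_seqs, gc_content)      # named numeric vector, one value per sequence
lapply(toy_seqs, gc_content)      # same values, returned as a list

## The same calls should work on `dna_sequences`,
## e.g. `sapply(dna_sequences, gc_content)`.
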
### Automate Tasks for a List of Genes ------------------------------------------------------------

## Based on the differential gene expression analysis (SET1 *vs.* WT) results,
## draw a boxplot for each of the top 3 genes with the smallest adjusted p-value,
## add the individual data points on the boxplot and
## show the p-value above the boxes with a horizontal bar
## (with the help of the ggsignif package).

## 1. Import the differential gene expression analysis (SET1 *vs.* WT) results file "toy_DEanalysis.csv"
##    into RStudio and name it `de_res`.

## 2. Extract the genes of interest, i.e., the 3 genes with the smallest adjusted p-value.

## 3. Based on the `counts` data, build a data frame for the *LOH1* gene for the boxplot.
##    This data frame should contain:
##    - a column of counts for the SET1 and WT samples,
##    - and a column for the corresponding sample group.
##    Attention: in order to avoid hardcoding the gene name or sample names,
##    use a variable instead.

## 4. Draw the boxplot and show the individual data points on the same figure.

## 5. Add the p-value with a horizontal bar on the figure:
##    - install the [`ggsignif`](https://const-ae.github.io/ggsignif/) package,
##    - use the function `geom_signif()` to add a layer showing the p-value on the figure.
##    Check the documentation of `geom_signif()`: what do we need to provide to add a p-value?

## 6. Generalise the previous steps with a function which takes the name of a gene as input.
##    Test the function with another gene.

## 7. Apply the function to the targeted genes.
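
## A minimal sketch of steps 2-7 above, offered as one possible approach rather
## than the official solution. The layout of "read-counts.csv" and
## "toy_DEanalysis.csv" is not shown in this script, so the sketch runs on
## made-up toy data. Everything below is an assumption: the column names
## (`gene`, `WT_1`, ..., `padj`), the sample naming scheme "<group>_<replicate>",
## the toy values, and the helper name `plot_gene`.
library(ggplot2)
library(ggsignif)

# Hypothetical stand-ins for `counts` and `de_res`
toy_counts <- data.frame(
  gene   = c("geneA", "geneB", "geneC", "geneD"),
  WT_1   = c(10, 200, 35, 50),   WT_2   = c(12, 210, 30, 55),
  WT_3   = c(11, 190, 33, 52),   SET1_1 = c(40, 100, 34, 51),
  SET1_2 = c(45, 90, 31, 49),    SET1_3 = c(42, 95, 36, 53)
)
toy_de <- data.frame(
  gene = c("geneA", "geneB", "geneC", "geneD"),
  padj = c(0.001, 0.004, 0.8, 0.6)
)

# Step 2: the 3 genes with the smallest adjusted p-value
top_genes <- head(toy_de$gene[order(toy_de$padj)], 3)

# Steps 3-6: build the data frame and the plot for one gene
plot_gene <- function(gene_name, counts, de_res) {
  sample_cols <- setdiff(colnames(counts), "gene")
  gene_df <- data.frame(
    count = unlist(counts[counts$gene == gene_name, sample_cols], use.names = FALSE),
    # the group is derived from the sample name by dropping the replicate suffix
    group = factor(sub("_[0-9]+$", "", sample_cols), levels = c("WT", "SET1"))
  )
  # adjusted p-value taken from the DE results and shown as the annotation
  padj <- de_res$padj[de_res$gene == gene_name]

  ggplot(gene_df, aes(x = group, y = count)) +
    geom_boxplot(outlier.shape = NA) +          # hide outliers, points are drawn below
    geom_jitter(width = 0.1, height = 0) +      # individual data points
    geom_signif(
      annotations = format(padj, digits = 2),   # text shown above the bar
      y_position = max(gene_df$count) * 1.05,   # height of the horizontal bar
      xmin = 1, xmax = 2                        # bar spans the two groups
    ) +
    labs(title = gene_name, x = NULL, y = "Read count")
}

# Step 7: one plot per targeted gene, returned as a list
plots <- lapply(top_genes, plot_gene, counts = toy_counts, de_res = toy_de)
names(plots) <- top_genes

## Here the bar position and label are given manually through `annotations`,
## `y_position`, `xmin` and `xmax`. Passing `comparisons = list(c("WT", "SET1"))`
## instead would let `geom_signif()` run its own test on the plotted counts,
## which is not the adjusted p-value from the DE analysis.
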