--- title: "HW06 — RNA-Seq Analysis" sidebar: assignments --- Source code downloads: [[ .qmd ]](https://raw.githubusercontent.com/tgirke/GEN242/refs/heads/main/assignments/homework/hw06/hw06.qmd) ## Overview This homework covers read counting strategies and differential expression analysis (DEG) from the [RNA-Seq Workflow](https://girke.bioinformatics.ucr.edu/GEN242/tutorials/sprnaseq/sprnaseq_index.html) tutorial. It has three parts: - **Part A** — Compare unstranded and strand-specific read counting with `featureCounts` - **Part B** — Count reads over different genomic feature types with `featureCounts` - **Part C** — DEG analysis with `edgeR` and Venn diagram visualization All tasks build on the RNA-Seq workflow environment generated with `systemPipeRdata::genWorkenvir(workflow = "rnaseq")`. Complete the workflow through the **HISAT2 mapping step** before starting this homework — the sorted BAM files and `sal` object it produces are required as input throughout. ### Required packages ```{r} #| eval: false library(Rsubread) # featureCounts library(GenomicFeatures) # loadDb, exonsBy, intronsByTranscript, fiveUTRsByTranscript library(systemPipeR) # run_edgeR, filterDEGs, overLapper, vennPlot library(edgeR) ``` ### Required input objects Run the following once inside your `rnaseq` workflow directory to prepare shared inputs used across all tasks. The `TxDb` object (`tair10.sqlite`) is created from the GFF annotation file during the RNA-Seq workflow and saved to disk — load it with `loadDb()`. All annotation ranges for `featureCounts` are derived from this `TxDb`. ```{r} #| eval: false ## Load TxDb created during the RNA-Seq workflow txdb <- loadDb("./data/tair10.sqlite") ## Exons grouped by gene — used as annotation for featureCounts ## in Parts A and B (gene/exon feature types) eByg <- exonsBy(txdb, by = "gene") ## Paths to sorted BAM files produced by HISAT2 mapping step bam_files <- getColumn(sal, step = "hisat2_mapping", "outfiles", column = "samtools_sort_bam") ## Alternative if sal object is not available: # bam_files <- list.files("results/hisat2_mapping/", # pattern = "sorted.bam$", full.names = TRUE) ``` --- ## A. Unstranded and Strand-Specific Read Counting ### Background `featureCounts` from the `Rsubread` package counts reads overlapping genomic features. When a `GRangesList` object such as `eByg` (exons grouped by gene) is passed via `annot.ext`, `featureCounts` treats each list element as one meta-feature (gene). This is equivalent to the `exonsBy(txdb, by="gene")` approach used throughout the RNA-Seq workflow and is more flexible than relying on a GTF file. Set `isGTFAnnotationFile = FALSE` when passing a range object. The `strandSpecific` argument controls strand awareness: | Mode | `strandSpecific` | Counts reads on | |---|---|---| | Unstranded | `0` | Both strands | | Sense (positive) strand | `1` | Same strand as feature | | Antisense (negative) strand | `2` | Opposite strand to feature | For paired-end RNA-Seq data always set `isPairedEnd = TRUE`. The count matrix is accessed from the returned list via `$counts`. The `nthreads` argument controls parallelization — adjust to the number of available cores. ### Task A1 — Generate three count tables Using `bam_files` as input and `eByg` as the annotation, generate three count tables — one per strand mode. Since `eByg` is already a `GRangesList` of exons grouped by gene, set `useMetaFeatures = TRUE` and `isGTFAnnotationFile = FALSE`. **1. Unstranded** (`strandSpecific = 0`) ```{r} #| eval: false fc_unstranded <- featureCounts( files = bam_files, annot.ext = eByg, isGTFAnnotationFile = FALSE, useMetaFeatures = TRUE, strandSpecific = 0, isPairedEnd = TRUE, countMultiMappingReads = FALSE, nthreads = 4 ) unstranded <- fc_unstranded$counts unstranded[1:4, ] ``` **2. Sense (positive) strand** (`strandSpecific = 1`) ```{r} #| eval: false fc_stranded_pos <- featureCounts( files = bam_files, annot.ext = eByg, isGTFAnnotationFile = FALSE, useMetaFeatures = TRUE, strandSpecific = 1, isPairedEnd = TRUE, countMultiMappingReads = FALSE, nthreads = 4 ) stranded_pos <- fc_stranded_pos$counts stranded_pos[1:4, ] ``` **3. Antisense (negative) strand** (`strandSpecific = 2`) ```{r} #| eval: false fc_stranded_neg <- featureCounts( files = bam_files, annot.ext = eByg, isGTFAnnotationFile = FALSE, useMetaFeatures = TRUE, strandSpecific = 2, isPairedEnd = TRUE, countMultiMappingReads = FALSE, nthreads = 4 ) stranded_neg <- fc_stranded_neg$counts stranded_neg[1:4, ] ``` Store the count matrices in variables named exactly **`unstranded`**, **`stranded_pos`**, and **`stranded_neg`**. ### Task A2 — Compare strand-specific counts to unstranded Test whether the two strand-specific count tables sum to approximately the same values as the unstranded count table. Calculate for each strand-specific result the mean mapping frequency as a percentage of the unstranded counts: ```{r} #| eval: false perc_pos <- mean((stranded_pos / unstranded) * 100, na.rm = TRUE) perc_neg <- mean((stranded_neg / unstranded) * 100, na.rm = TRUE) perc_pos perc_neg ``` Store results in variables named exactly **`perc_pos`** and **`perc_neg`**. **Interpretation:** if both values are close to 50%, the RNA-Seq library is very likely **unstranded**. If one value dominates (e.g. ~100% sense, ~0% antisense), the library is strand-specific. For paired-end data expect very similar but not necessarily identical results between the sum of stranded counts and the unstranded total. ### Task A3 — Biological interpretation of strand-specific counting Include comment text in your script addressing the following four points: 1. When should strand-specific counting be used instead of unstranded counting? 2. How does strand-specific counting help resolve reads from overlapping genes encoded on opposite strands? 3. What biological information can antisense strand counting reveal? 4. How can comparing stranded and unstranded results determine whether an RNA-Seq experiment used a strand-specific library preparation protocol? --- ## B. Read Counting for Different Feature Types ### Background By using different range extraction functions on `txdb`, reads can be counted over different genomic features (genes, exons, introns, UTRs). All annotation ranges are derived from the same `TxDb` object loaded from `tair10.sqlite`, keeping everything consistent with the RNA-Seq workflow. `featureCounts` accepts `GRanges` and `GRangesList` objects directly via `annot.ext` — set `isGTFAnnotationFile = FALSE` for all range-based inputs. ### Task B — Count reads over five feature types Compute **sense strand** (`strandSpecific = 1`) count tables for the following five feature types. Use `isPairedEnd = TRUE` for gene and exon features. For introns and UTRs use `isPairedEnd = FALSE` (single-end mode since these are sub-gene features). **1. Genes** — exons grouped by gene (`eByg` from `exonsBy`) ```{r} #| eval: false ## eByg already available from the input setup above fc_genes <- featureCounts( files = bam_files, annot.ext = eByg, isGTFAnnotationFile = FALSE, useMetaFeatures = TRUE, strandSpecific = 1, isPairedEnd = TRUE, countMultiMappingReads = FALSE, nthreads = 4 ) gene_ranges_countDF <- fc_genes$counts gene_ranges_countDF[1:4, ] ``` **2. Exons** — individual exon level using `exons(txdb)` ```{r} #| eval: false exon_ranges <- exons(txdb) fc_exons <- featureCounts( files = bam_files, annot.ext = exon_ranges, isGTFAnnotationFile = FALSE, useMetaFeatures = FALSE, strandSpecific = 1, isPairedEnd = TRUE, countMultiMappingReads = FALSE, nthreads = 4 ) exon_ranges_countDF <- fc_exons$counts exon_ranges_countDF[1:4, ] ``` **3. Exons by genes** — `eByg` is the direct equivalent of `exonsBy(txdb, by="gene")`; reuse it here for clarity ```{r} #| eval: false ## eByg groups exons by gene — identical to the exonsBy approach exonByg_ranges_countDF <- gene_ranges_countDF # reuses result from Task B1 exonByg_ranges_countDF[1:4, ] ``` **4. Introns by transcripts** — `intronsByTranscript(txdb)` returns a `GRangesList` that is passed directly to `featureCounts` ```{r} #| eval: false intron_ranges <- intronsByTranscript(txdb, use.names = TRUE) fc_introns <- featureCounts( files = bam_files, annot.ext = intron_ranges, isGTFAnnotationFile = FALSE, strandSpecific = 1, isPairedEnd = FALSE, countMultiMappingReads = FALSE, nthreads = 4 ) intron_ranges_countDF <- fc_introns$counts intron_ranges_countDF[1:4, ] ``` **5. 5'-UTRs by transcripts** — `fiveUTRsByTranscript(txdb)` passed directly ```{r} #| eval: false fiveUTR_ranges <- fiveUTRsByTranscript(txdb, use.names = TRUE) fc_fiveUTR <- featureCounts( files = bam_files, annot.ext = fiveUTR_ranges, isGTFAnnotationFile = FALSE, strandSpecific = 1, isPairedEnd = FALSE, countMultiMappingReads = FALSE, nthreads = 4 ) fiveUTR_ranges_countDF <- fc_fiveUTR$counts fiveUTR_ranges_countDF[1:4, ] ``` Use the **exact variable names** shown above: `gene_ranges_countDF`, `exon_ranges_countDF`, `exonByg_ranges_countDF`, `intron_ranges_countDF`, `fiveUTR_ranges_countDF`. --- ## C. DEG Analysis with edgeR ### Background `run_edgeR()` from `systemPipeR` wraps the standard edgeR DEG workflow. Here you will run it on two count tables (unstranded vs sense strand) and compare the resulting DEG sets visually using 4-way Venn diagrams. ### Task C — DEG analysis and Venn diagram comparison **1. Run edgeR on both count tables** ```{r} #| eval: false cmp <- readComp(stepsWF(sal)[["hisat2_mapping"]], format = "matrix", delim = "-") ## Alternative if sal is not available: # cmp <- readComp(file = "targetsPE.txt", format = "matrix", delim = "-") ## DEG analysis with unstranded count table edgeDF_unstranded <- run_edgeR( countDF = unstranded, targets = targetsWF(sal)[["hisat2_mapping"]], cmp = cmp[[1]], independent = FALSE, mdsplot = "" ) ## DEG analysis with sense strand count table edgeDF_pos <- run_edgeR( countDF = stranded_pos, targets = targetsWF(sal)[["hisat2_mapping"]], cmp = cmp[[1]], independent = FALSE, mdsplot = "" ) ``` **2. Generate and save 4-way Venn diagrams** ```{r} #| eval: false ## (1) Venn diagram — unstranded count table DEG_list_unstranded <- filterDEGs( degDF = edgeDF_unstranded, filter = c(Fold = 2, FDR = 40), plot = FALSE ) vennsetup <- overLapper(DEG_list_unstranded$Up[6:9], type = "vennsets") vennsetdown <- overLapper(DEG_list_unstranded$Down[6:9], type = "vennsets") pdf("results/DEGcount_unstranded.pdf") vennPlot(list(vennsetup, vennsetdown), mymain = "", mysub = "", colmode = 2, ccol = c("blue", "red")) dev.off() ## (2) Venn diagram — sense strand count table DEG_list_pos <- filterDEGs( degDF = edgeDF_pos, filter = c(Fold = 2, FDR = 40), plot = FALSE ) vennsetup <- overLapper(DEG_list_pos$Up[6:9], type = "vennsets") vennsetdown <- overLapper(DEG_list_pos$Down[6:9], type = "vennsets") pdf("results/DEGcount_pos.pdf") vennPlot(list(vennsetup, vennsetdown), mymain = "", mysub = "", colmode = 2, ccol = c("blue", "red")) dev.off() ``` Both PDF files must be committed to the repository at: - `results/DEGcount_unstranded.pdf` - `results/DEGcount_pos.pdf` --- ## D. Bonus Challenge — edgeR in Native Mode with Time Course Design ::: {.callout-note} ## Bonus: +1 extra point This section is optional. Completing it earns **1 bonus point** on top of the 5-point maximum, for a total possible score of 6/5. It is intentionally more challenging and is meant for students who want a deeper understanding of linear model design in RNA-Seq analysis. ::: ### Background The `run_edgeR()` wrapper used in Part C is convenient but hides the underlying statistical machinery. Internally, it builds a simple pairwise design matrix and fits a generalized linear model (GLM). For the Howard et al. (2013) data set used throughout this workflow, a **pairwise design is not the optimal approach** — the experiment is a **two-factor time course** that deserves a more informative model. The experimental design of Howard et al. (2013) consists of 18 samples: | Factor | Meaning | Time points | Replicates | |---|---|---|---| | `M` | Mock-inoculated (*susceptible*) | 1h, 2h, 12h | A, B | | `A` | *Pseudomonas*-infected (*resistant*) | 1h, 2h, 12h | A, B | The sample names in `targetsPE.txt` encode this as: `M1A`, `M1B`, `M2A`, `M2B`, `M12A`, `M12B` (mock) and `A1A`, `A1B`, `A2A`, `A2B`, `A12A`, `A12B` (infected). The two factors are **treatment** (mock vs infected) and **time** (1h, 2h, 12h). An appropriate model for this design captures: - The main effect of **time** - The main effect of **treatment** (mock vs infected) - The **interaction** between time and treatment — i.e. whether the treatment effect changes over time, which is the biologically interesting question ### Task D — Run edgeR natively with an interaction model Using the `unstranded` count table from Part A as input, implement the full native edgeR workflow with a `~ treatment + time + treatment:time` interaction design. **Step 1 — Prepare sample metadata** ```{r} #| eval: false library(edgeR) ## Load targets to get sample information targets <- read.delim("targetsPE.txt", comment.char = "#") ## Extract treatment and time factors from the Factor column ## Factor levels: M1, M2, M12 (mock) and A1, A2, A12 (infected) targets$treatment <- ifelse(grepl("^M", targets$Factor), "mock", "infected") targets$time <- gsub("^[MA]", "", targets$Factor) # extracts "1", "2", "12" targets$time <- factor(targets$time, levels = c("1", "2", "12")) targets$treatment <- factor(targets$treatment, levels = c("mock", "infected")) ## Confirm the design targets[, c("SampleName", "Factor", "treatment", "time")] ``` **Step 2 — Create DGEList and filter low-count genes** ```{r} #| eval: false ## Match count table columns to sample order in targets countDF <- unstranded[, targets$SampleName] ## Create DGEList object dge <- DGEList(counts = countDF, group = targets$Factor) ## Filter genes with very low counts across all samples ## Keep genes with at least 1 count per million (CPM) in at least 2 samples keep <- rowSums(cpm(dge) >= 1) >= 2 dge <- dge[keep, , keep.lib.sizes = FALSE] message(sum(keep), " genes retained after low-count filtering") ## Normalize library sizes with TMM dge <- calcNormFactors(dge, method = "TMM") ``` **Step 3 — Build the interaction design matrix** ```{r} #| eval: false ## Interaction model: ~ treatment + time + treatment:time ## This tests whether the treatment effect differs across time points design_tc <- model.matrix(~ treatment + time + treatment:time, data = targets) colnames(design_tc) ## Should show: Intercept, treatmentinfected, time2, time12, ## treatmentinfected:time2, treatmentinfected:time12 ``` **Step 4 — Estimate dispersion and fit the GLM** ```{r} #| eval: false ## Estimate common, trended and tagwise dispersions dge <- estimateDisp(dge, design_tc, robust = TRUE) plotBCV(dge) # optional: inspect biological coefficient of variation ## Fit the negative binomial GLM fit_tc <- glmQLFit(dge, design_tc, robust = TRUE) ``` **Step 5 — Test for genes responding differently across time in infected vs mock** Test the interaction terms — these identify genes where the treatment effect changes significantly over time (the core time course question): ```{r} #| eval: false ## Test interaction coefficients: ## treatmentinfected:time2 and treatmentinfected:time12 interaction_cols <- grep("treatmentinfected:time", colnames(design_tc)) qlf_interaction <- glmQLFTest(fit_tc, coef = interaction_cols) ## Extract top DEGs topTags(qlf_interaction, n = 20) ## Summary of significant genes at FDR < 0.05 summary(decideTests(qlf_interaction, p.value = 0.05)) ## Save full results table tc_results <- as.data.frame(topTags(qlf_interaction, n = Inf)) write.csv(tc_results, "results/edgeR_timecourse_interaction.csv") ``` **Step 6 — Compare with pairwise result from Part C** Briefly compare (in comment text) the number of DEGs identified by the interaction model versus the pairwise approach from Part C. Consider: - Which approach identified more DEGs? - Which design better captures the biology of a time course experiment and why? - What biological question does each design answer? Store your interaction results in a variable named exactly **`tc_results`** and save the CSV to `results/edgeR_timecourse_interaction.csv`. --- ## Homework Submission Submit **one R script** named **`HW6.R`** and the **two Venn diagram PDFs** to your private GitHub homework repository at the following paths: ``` Homework/HW6/HW6.R results/DEGcount_unstranded.pdf results/DEGcount_pos.pdf results/edgeR_timecourse_interaction.csv # bonus only ``` ### Requirements for full credit 1. Use **`featureCounts()`** from `Rsubread` for all read counting (Parts A and B) 2. Define all variables with the **exact names** specified throughout 3. Include **comment text** for all four points of Task A3 4. Commit both **Venn diagram PDF files** alongside `HW6.R` 5. Script must be well structured following the recommended layout below ### Bonus requirement (+1 point) To earn the bonus point, additionally include in `HW6.R`: - The full native edgeR workflow for the interaction model (`~ treatment + time + treatment:time`) - Variable `tc_results` containing the full DEG table from `topTags(..., n = Inf)` - File `results/edgeR_timecourse_interaction.csv` committed to the repo - Comment text comparing the interaction model result with the pairwise result from Part C ### Recommended script structure ```{r} #| eval: false ## ───────────────────────────────────────────────────────────── ## HW6: RNA-Seq Analysis ## GEN242, Spring 2026 ## Author: ## ───────────────────────────────────────────────────────────── library(Rsubread); library(GenomicFeatures) library(systemPipeR); library(edgeR) ## Input txdb <- loadDb("./data/tair10.sqlite") eByg <- exonsBy(txdb, by = "gene") bam_files <- getColumn(sal, step = "hisat2_mapping", "outfiles", column = "samtools_sort_bam") ## ── Part A: Strand-specific read counting ───────────────────── ## Task A1 — unstranded (strandSpecific = 0) fc_unstranded <- featureCounts(files = bam_files, annot.ext = eByg, isGTFAnnotationFile = FALSE, useMetaFeatures = TRUE, strandSpecific = 0, isPairedEnd = TRUE, nthreads = 4) unstranded <- fc_unstranded$counts ## Task A1 — sense strand (strandSpecific = 1) fc_stranded_pos <- featureCounts( ) stranded_pos <- fc_stranded_pos$counts ## Task A1 — antisense strand (strandSpecific = 2) fc_stranded_neg <- featureCounts( ) stranded_neg <- fc_stranded_neg$counts ## Task A2 — percentage comparison perc_pos <- mean((stranded_pos / unstranded) * 100, na.rm = TRUE) perc_neg <- mean((stranded_neg / unstranded) * 100, na.rm = TRUE) ## Task A3 — biological interpretation (replace ... with your answers) ## 1. Strand-specific counting should be used when the library is stranded because ... ## 2. For overlapping genes on opposite strands, strand-specific counting allows ... ## 3. Antisense strand counting can reveal biological events such as ... ## 4. Comparing stranded and unstranded results determines strandedness because ... ## ── Part B: Feature type counting ──────────────────────────── ## B1: Genes (eByg = exonsBy(txdb, by="gene")) gene_ranges_countDF <- featureCounts( )$counts ## B2: Exons (individual) exon_ranges <- exons(txdb) exon_ranges_countDF <- featureCounts( )$counts ## B3: Exons by genes exonByg_ranges_countDF <- gene_ranges_countDF ## B4: Introns intron_ranges <- intronsByTranscript(txdb, use.names = TRUE) intron_ranges_countDF <- featureCounts( )$counts ## B5: 5'-UTRs fiveUTR_ranges <- fiveUTRsByTranscript(txdb, use.names = TRUE) fiveUTR_ranges_countDF <- featureCounts( )$counts ## ── Part C: DEG analysis ────────────────────────────────────── cmp <- readComp(stepsWF(sal)[["hisat2_mapping"]], format = "matrix", delim = "-") edgeDF_unstranded <- run_edgeR(countDF = unstranded, targets = targetsWF(sal)[["hisat2_mapping"]], cmp = cmp[[1]], independent = FALSE, mdsplot = "") edgeDF_pos <- run_edgeR(countDF = stranded_pos, targets = targetsWF(sal)[["hisat2_mapping"]], cmp = cmp[[1]], independent = FALSE, mdsplot = "") ## Venn diagrams → results/DEGcount_unstranded.pdf and results/DEGcount_pos.pdf ## ``` --- ## Due Date This homework is due **Tuesday, May 20th at 6:00 PM**. ## Homework Solutions To be posted after the due date.