---
title: "HW06 — RNA-Seq Analysis"
sidebar: assignments
---

Source code downloads: [[ .qmd ]](https://raw.githubusercontent.com/tgirke/GEN242/refs/heads/main/assignments/homework/hw06/hw06.qmd)

## Overview

This homework covers read counting strategies and differential expression
analysis (DEG) from the
[RNA-Seq Workflow](https://girke.bioinformatics.ucr.edu/GEN242/tutorials/sprnaseq/sprnaseq_index.html)
tutorial. It has three parts:

- **Part A** — Compare unstranded and strand-specific read counting with `featureCounts`
- **Part B** — Count reads over different genomic feature types with `featureCounts`
- **Part C** — DEG analysis with `edgeR` and Venn diagram visualization

All tasks build on the RNA-Seq workflow environment generated with
`systemPipeRdata::genWorkenvir(workflow = "rnaseq")`. Complete the workflow
through the **HISAT2 mapping step** before starting this homework — the sorted
BAM files and `sal` object it produces are required as input throughout.

### Required packages

```{r}
#| eval: false
library(Rsubread)        # featureCounts
library(GenomicFeatures) # loadDb, exonsBy, intronsByTranscript, fiveUTRsByTranscript
library(systemPipeR)     # run_edgeR, filterDEGs, overLapper, vennPlot
library(edgeR)
```

### Required input objects

Run the following once inside your `rnaseq` workflow directory to prepare
shared inputs used across all tasks. The `TxDb` object (`tair10.sqlite`)
is created from the GFF annotation file during the RNA-Seq workflow and
saved to disk — load it with `loadDb()`. All annotation ranges for
`featureCounts` are derived from this `TxDb`.

```{r}
#| eval: false
## Load TxDb created during the RNA-Seq workflow
txdb <- loadDb("./data/tair10.sqlite")

## Exons grouped by gene — used as annotation for featureCounts
## in Parts A and B (gene/exon feature types)
eByg <- exonsBy(txdb, by = "gene")

## Paths to sorted BAM files produced by HISAT2 mapping step
bam_files <- getColumn(sal, step = "hisat2_mapping",
                       "outfiles", column = "samtools_sort_bam")
## Alternative if sal object is not available:
# bam_files <- list.files("results/hisat2_mapping/",
#                         pattern = "sorted.bam$", full.names = TRUE)
```

---

## A. Unstranded and Strand-Specific Read Counting

### Background

`featureCounts` from the `Rsubread` package counts reads overlapping
genomic features. When a `GRangesList` object such as `eByg` (exons grouped
by gene) is passed via `annot.ext`, `featureCounts` treats each list element
as one meta-feature (gene). This is equivalent to the `exonsBy(txdb, by="gene")`
approach used throughout the RNA-Seq workflow and is more flexible than
relying on a GTF file. Set `isGTFAnnotationFile = FALSE` when passing a
range object. The `strandSpecific` argument controls strand awareness:

| Mode | `strandSpecific` | Counts reads on |
|---|---|---|
| Unstranded | `0` | Both strands |
| Sense (positive) strand | `1` | Same strand as feature |
| Antisense (negative) strand | `2` | Opposite strand to feature |

For paired-end RNA-Seq data always set `isPairedEnd = TRUE`. The count
matrix is accessed from the returned list via `$counts`. The `nthreads`
argument controls parallelization — adjust to the number of available cores.

### Task A1 — Generate three count tables

Using `bam_files` as input and `eByg` as the annotation, generate three
count tables — one per strand mode. Since `eByg` is already a `GRangesList`
of exons grouped by gene, set `useMetaFeatures = TRUE` and
`isGTFAnnotationFile = FALSE`.

**1. Unstranded** (`strandSpecific = 0`)

```{r}
#| eval: false
fc_unstranded <- featureCounts(
    files                  = bam_files,
    annot.ext              = eByg,
    isGTFAnnotationFile    = FALSE,
    useMetaFeatures        = TRUE,
    strandSpecific         = 0,
    isPairedEnd            = TRUE,
    countMultiMappingReads = FALSE,
    nthreads               = 4
)
unstranded <- fc_unstranded$counts
unstranded[1:4, ]
```

**2. Sense (positive) strand** (`strandSpecific = 1`)

```{r}
#| eval: false
fc_stranded_pos <- featureCounts(
    files                  = bam_files,
    annot.ext              = eByg,
    isGTFAnnotationFile    = FALSE,
    useMetaFeatures        = TRUE,
    strandSpecific         = 1,
    isPairedEnd            = TRUE,
    countMultiMappingReads = FALSE,
    nthreads               = 4
)
stranded_pos <- fc_stranded_pos$counts
stranded_pos[1:4, ]
```

**3. Antisense (negative) strand** (`strandSpecific = 2`)

```{r}
#| eval: false
fc_stranded_neg <- featureCounts(
    files                  = bam_files,
    annot.ext              = eByg,
    isGTFAnnotationFile    = FALSE,
    useMetaFeatures        = TRUE,
    strandSpecific         = 2,
    isPairedEnd            = TRUE,
    countMultiMappingReads = FALSE,
    nthreads               = 4
)
stranded_neg <- fc_stranded_neg$counts
stranded_neg[1:4, ]
```

Store the count matrices in variables named exactly **`unstranded`**,
**`stranded_pos`**, and **`stranded_neg`**.

### Task A2 — Compare strand-specific counts to unstranded

Test whether the two strand-specific count tables sum to approximately the
same values as the unstranded count table. Calculate for each
strand-specific result the mean mapping frequency as a percentage of the
unstranded counts:

```{r}
#| eval: false
perc_pos <- mean((stranded_pos / unstranded) * 100, na.rm = TRUE)
perc_neg <- mean((stranded_neg / unstranded) * 100, na.rm = TRUE)
perc_pos
perc_neg
```

Store results in variables named exactly **`perc_pos`** and **`perc_neg`**.

**Interpretation:** if both values are close to 50%, the RNA-Seq library is
very likely **unstranded**. If one value dominates (e.g. ~100% sense,
~0% antisense), the library is strand-specific. For paired-end data expect
very similar but not necessarily identical results between the sum of stranded
counts and the unstranded total.

### Task A3 — Biological interpretation of strand-specific counting

Include comment text in your script addressing the following four points:

1. When should strand-specific counting be used instead of unstranded counting?
2. How does strand-specific counting help resolve reads from overlapping genes
   encoded on opposite strands?
3. What biological information can antisense strand counting reveal?
4. How can comparing stranded and unstranded results determine whether an
   RNA-Seq experiment used a strand-specific library preparation protocol?

---

## B. Read Counting for Different Feature Types

### Background

By using different range extraction functions on `txdb`, reads can be
counted over different genomic features (genes, exons, introns, UTRs).
All annotation ranges are derived from the same `TxDb` object loaded from
`tair10.sqlite`, keeping everything consistent with the RNA-Seq workflow.
`featureCounts` accepts `GRanges` and `GRangesList` objects directly via
`annot.ext` — set `isGTFAnnotationFile = FALSE` for all range-based inputs.

### Task B — Count reads over five feature types

Compute **sense strand** (`strandSpecific = 1`) count tables for the
following five feature types. Use `isPairedEnd = TRUE` for gene and exon
features. For introns and UTRs use `isPairedEnd = FALSE` (single-end mode
since these are sub-gene features).

**1. Genes** — exons grouped by gene (`eByg` from `exonsBy`)

```{r}
#| eval: false
## eByg already available from the input setup above
fc_genes <- featureCounts(
    files                  = bam_files,
    annot.ext              = eByg,
    isGTFAnnotationFile    = FALSE,
    useMetaFeatures        = TRUE,
    strandSpecific         = 1,
    isPairedEnd            = TRUE,
    countMultiMappingReads = FALSE,
    nthreads               = 4
)
gene_ranges_countDF <- fc_genes$counts
gene_ranges_countDF[1:4, ]
```

**2. Exons** — individual exon level using `exons(txdb)`

```{r}
#| eval: false
exon_ranges <- exons(txdb)
fc_exons <- featureCounts(
    files                  = bam_files,
    annot.ext              = exon_ranges,
    isGTFAnnotationFile    = FALSE,
    useMetaFeatures        = FALSE,
    strandSpecific         = 1,
    isPairedEnd            = TRUE,
    countMultiMappingReads = FALSE,
    nthreads               = 4
)
exon_ranges_countDF <- fc_exons$counts
exon_ranges_countDF[1:4, ]
```

**3. Exons by genes** — `eByg` is the direct equivalent of
`exonsBy(txdb, by="gene")`; reuse it here for clarity

```{r}
#| eval: false
## eByg groups exons by gene — identical to the exonsBy approach
exonByg_ranges_countDF <- gene_ranges_countDF   # reuses result from Task B1
exonByg_ranges_countDF[1:4, ]
```

**4. Introns by transcripts** — `intronsByTranscript(txdb)` returns a
`GRangesList` that is passed directly to `featureCounts`

```{r}
#| eval: false
intron_ranges <- intronsByTranscript(txdb, use.names = TRUE)
fc_introns <- featureCounts(
    files                  = bam_files,
    annot.ext              = intron_ranges,
    isGTFAnnotationFile    = FALSE,
    strandSpecific         = 1,
    isPairedEnd            = FALSE,
    countMultiMappingReads = FALSE,
    nthreads               = 4
)
intron_ranges_countDF <- fc_introns$counts
intron_ranges_countDF[1:4, ]
```

**5. 5'-UTRs by transcripts** — `fiveUTRsByTranscript(txdb)` passed
directly

```{r}
#| eval: false
fiveUTR_ranges <- fiveUTRsByTranscript(txdb, use.names = TRUE)
fc_fiveUTR <- featureCounts(
    files                  = bam_files,
    annot.ext              = fiveUTR_ranges,
    isGTFAnnotationFile    = FALSE,
    strandSpecific         = 1,
    isPairedEnd            = FALSE,
    countMultiMappingReads = FALSE,
    nthreads               = 4
)
fiveUTR_ranges_countDF <- fc_fiveUTR$counts
fiveUTR_ranges_countDF[1:4, ]
```

Use the **exact variable names** shown above:
`gene_ranges_countDF`, `exon_ranges_countDF`, `exonByg_ranges_countDF`,
`intron_ranges_countDF`, `fiveUTR_ranges_countDF`.

---

## C. DEG Analysis with edgeR

### Background

`run_edgeR()` from `systemPipeR` wraps the standard edgeR DEG workflow.
Here you will run it on two count tables (unstranded vs sense strand) and
compare the resulting DEG sets visually using 4-way Venn diagrams.

### Task C — DEG analysis and Venn diagram comparison

**1. Run edgeR on both count tables**

```{r}
#| eval: false
cmp <- readComp(stepsWF(sal)[["hisat2_mapping"]],
                format = "matrix", delim = "-")
## Alternative if sal is not available:
# cmp <- readComp(file = "targetsPE.txt", format = "matrix", delim = "-")

## DEG analysis with unstranded count table
edgeDF_unstranded <- run_edgeR(
    countDF     = unstranded,
    targets     = targetsWF(sal)[["hisat2_mapping"]],
    cmp         = cmp[[1]],
    independent = FALSE,
    mdsplot     = ""
)

## DEG analysis with sense strand count table
edgeDF_pos <- run_edgeR(
    countDF     = stranded_pos,
    targets     = targetsWF(sal)[["hisat2_mapping"]],
    cmp         = cmp[[1]],
    independent = FALSE,
    mdsplot     = ""
)
```

**2. Generate and save 4-way Venn diagrams**

```{r}
#| eval: false
## (1) Venn diagram — unstranded count table
DEG_list_unstranded <- filterDEGs(
    degDF  = edgeDF_unstranded,
    filter = c(Fold = 2, FDR = 40),
    plot   = FALSE
)
vennsetup   <- overLapper(DEG_list_unstranded$Up[6:9],   type = "vennsets")
vennsetdown <- overLapper(DEG_list_unstranded$Down[6:9], type = "vennsets")
pdf("results/DEGcount_unstranded.pdf")
vennPlot(list(vennsetup, vennsetdown),
         mymain = "", mysub = "", colmode = 2, ccol = c("blue", "red"))
dev.off()

## (2) Venn diagram — sense strand count table
DEG_list_pos <- filterDEGs(
    degDF  = edgeDF_pos,
    filter = c(Fold = 2, FDR = 40),
    plot   = FALSE
)
vennsetup   <- overLapper(DEG_list_pos$Up[6:9],   type = "vennsets")
vennsetdown <- overLapper(DEG_list_pos$Down[6:9], type = "vennsets")
pdf("results/DEGcount_pos.pdf")
vennPlot(list(vennsetup, vennsetdown),
         mymain = "", mysub = "", colmode = 2, ccol = c("blue", "red"))
dev.off()
```

Both PDF files must be committed to the repository at:

- `results/DEGcount_unstranded.pdf`
- `results/DEGcount_pos.pdf`

---

## D. Bonus Challenge — edgeR in Native Mode with Time Course Design

::: {.callout-note}
## Bonus: +1 extra point

This section is optional. Completing it earns **1 bonus point** on top of
the 5-point maximum, for a total possible score of 6/5. It is intentionally
more challenging and is meant for students who want a deeper understanding
of linear model design in RNA-Seq analysis.
:::

### Background

The `run_edgeR()` wrapper used in Part C is convenient but hides the
underlying statistical machinery. Internally, it builds a simple pairwise
design matrix and fits a generalized linear model (GLM). For the Howard et al.
(2013) data set used throughout this workflow, a **pairwise design is not the
optimal approach** — the experiment is a **two-factor time course** that
deserves a more informative model.

The experimental design of Howard et al. (2013) consists of 18 samples:

| Factor | Meaning | Time points | Replicates |
|---|---|---|---|
| `M` | Mock-inoculated (*susceptible*) | 1h, 2h, 12h | A, B |
| `A` | *Pseudomonas*-infected (*resistant*) | 1h, 2h, 12h | A, B |

The sample names in `targetsPE.txt` encode this as: `M1A`, `M1B`, `M2A`,
`M2B`, `M12A`, `M12B` (mock) and `A1A`, `A1B`, `A2A`, `A2B`, `A12A`, `A12B`
(infected). The two factors are **treatment** (mock vs infected) and **time**
(1h, 2h, 12h).

An appropriate model for this design captures:

- The main effect of **time**
- The main effect of **treatment** (mock vs infected)
- The **interaction** between time and treatment — i.e. whether the
  treatment effect changes over time, which is the biologically interesting question

### Task D — Run edgeR natively with an interaction model

Using the `unstranded` count table from Part A as input, implement the
full native edgeR workflow with a `~ treatment + time + treatment:time`
interaction design.

**Step 1 — Prepare sample metadata**

```{r}
#| eval: false
library(edgeR)

## Load targets to get sample information
targets <- read.delim("targetsPE.txt", comment.char = "#")

## Extract treatment and time factors from the Factor column
## Factor levels: M1, M2, M12 (mock) and A1, A2, A12 (infected)
targets$treatment <- ifelse(grepl("^M", targets$Factor), "mock", "infected")
targets$time      <- gsub("^[MA]", "", targets$Factor)   # extracts "1", "2", "12"
targets$time      <- factor(targets$time, levels = c("1", "2", "12"))
targets$treatment <- factor(targets$treatment, levels = c("mock", "infected"))

## Confirm the design
targets[, c("SampleName", "Factor", "treatment", "time")]
```

**Step 2 — Create DGEList and filter low-count genes**

```{r}
#| eval: false
## Match count table columns to sample order in targets
countDF <- unstranded[, targets$SampleName]

## Create DGEList object
dge <- DGEList(counts = countDF, group = targets$Factor)

## Filter genes with very low counts across all samples
## Keep genes with at least 1 count per million (CPM) in at least 2 samples
keep    <- rowSums(cpm(dge) >= 1) >= 2
dge     <- dge[keep, , keep.lib.sizes = FALSE]
message(sum(keep), " genes retained after low-count filtering")

## Normalize library sizes with TMM
dge <- calcNormFactors(dge, method = "TMM")
```

**Step 3 — Build the interaction design matrix**

```{r}
#| eval: false
## Interaction model: ~ treatment + time + treatment:time
## This tests whether the treatment effect differs across time points
design_tc <- model.matrix(~ treatment + time + treatment:time,
                          data = targets)
colnames(design_tc)
## Should show: Intercept, treatmentinfected, time2, time12,
##              treatmentinfected:time2, treatmentinfected:time12
```

**Step 4 — Estimate dispersion and fit the GLM**

```{r}
#| eval: false
## Estimate common, trended and tagwise dispersions
dge <- estimateDisp(dge, design_tc, robust = TRUE)
plotBCV(dge)   # optional: inspect biological coefficient of variation

## Fit the negative binomial GLM
fit_tc <- glmQLFit(dge, design_tc, robust = TRUE)
```

**Step 5 — Test for genes responding differently across time in infected vs mock**

Test the interaction terms — these identify genes where the treatment effect
changes significantly over time (the core time course question):

```{r}
#| eval: false
## Test interaction coefficients:
## treatmentinfected:time2  and  treatmentinfected:time12
interaction_cols <- grep("treatmentinfected:time", colnames(design_tc))
qlf_interaction  <- glmQLFTest(fit_tc, coef = interaction_cols)

## Extract top DEGs
topTags(qlf_interaction, n = 20)

## Summary of significant genes at FDR < 0.05
summary(decideTests(qlf_interaction, p.value = 0.05))

## Save full results table
tc_results <- as.data.frame(topTags(qlf_interaction, n = Inf))
write.csv(tc_results, "results/edgeR_timecourse_interaction.csv")
```

**Step 6 — Compare with pairwise result from Part C**

Briefly compare (in comment text) the number of DEGs identified by the
interaction model versus the pairwise approach from Part C. Consider:

- Which approach identified more DEGs?
- Which design better captures the biology of a time course experiment and why?
- What biological question does each design answer?

Store your interaction results in a variable named exactly **`tc_results`**
and save the CSV to `results/edgeR_timecourse_interaction.csv`.

---

## Homework Submission

Submit **one R script** named **`HW6.R`** and the **two Venn diagram PDFs**
to your private GitHub homework repository at the following paths:

```
Homework/HW6/HW6.R
results/DEGcount_unstranded.pdf
results/DEGcount_pos.pdf
results/edgeR_timecourse_interaction.csv   # bonus only
```

### Requirements for full credit

1. Use **`featureCounts()`** from `Rsubread` for all read counting (Parts A and B)
2. Define all variables with the **exact names** specified throughout
3. Include **comment text** for all four points of Task A3
4. Commit both **Venn diagram PDF files** alongside `HW6.R`
5. Script must be well structured following the recommended layout below

### Bonus requirement (+1 point)

To earn the bonus point, additionally include in `HW6.R`:

- The full native edgeR workflow for the interaction model (`~ treatment + time + treatment:time`)
- Variable `tc_results` containing the full DEG table from `topTags(..., n = Inf)`
- File `results/edgeR_timecourse_interaction.csv` committed to the repo
- Comment text comparing the interaction model result with the pairwise result from Part C

### Recommended script structure

```{r}
#| eval: false
## ─────────────────────────────────────────────────────────────
## HW6: RNA-Seq Analysis
## GEN242, Spring 2026
## Author: <your name>
## ─────────────────────────────────────────────────────────────

library(Rsubread); library(GenomicFeatures)
library(systemPipeR); library(edgeR)

## Input
txdb      <- loadDb("./data/tair10.sqlite")
eByg      <- exonsBy(txdb, by = "gene")
bam_files <- getColumn(sal, step = "hisat2_mapping",
                       "outfiles", column = "samtools_sort_bam")

## ── Part A: Strand-specific read counting ─────────────────────

## Task A1 — unstranded (strandSpecific = 0)
fc_unstranded <- featureCounts(files = bam_files, annot.ext = eByg,
                               isGTFAnnotationFile = FALSE,
                               useMetaFeatures = TRUE, strandSpecific = 0,
                               isPairedEnd = TRUE, nthreads = 4)
unstranded <- fc_unstranded$counts

## Task A1 — sense strand (strandSpecific = 1)
fc_stranded_pos <- featureCounts( <fill in arguments, strandSpecific = 1> )
stranded_pos <- fc_stranded_pos$counts

## Task A1 — antisense strand (strandSpecific = 2)
fc_stranded_neg <- featureCounts( <fill in arguments, strandSpecific = 2> )
stranded_neg <- fc_stranded_neg$counts

## Task A2 — percentage comparison
perc_pos <- mean((stranded_pos / unstranded) * 100, na.rm = TRUE)
perc_neg <- mean((stranded_neg / unstranded) * 100, na.rm = TRUE)

## Task A3 — biological interpretation (replace ... with your answers)
## 1. Strand-specific counting should be used when the library is stranded because ...
## 2. For overlapping genes on opposite strands, strand-specific counting allows ...
## 3. Antisense strand counting can reveal biological events such as ...
## 4. Comparing stranded and unstranded results determines strandedness because ...

## ── Part B: Feature type counting ────────────────────────────

## B1: Genes (eByg = exonsBy(txdb, by="gene"))
gene_ranges_countDF    <- featureCounts( <fill in, annot.ext = eByg> )$counts

## B2: Exons (individual)
exon_ranges            <- exons(txdb)
exon_ranges_countDF    <- featureCounts( <fill in, annot.ext = exon_ranges> )$counts

## B3: Exons by genes
exonByg_ranges_countDF <- gene_ranges_countDF

## B4: Introns
intron_ranges          <- intronsByTranscript(txdb, use.names = TRUE)
intron_ranges_countDF  <- featureCounts( <fill in, annot.ext = intron_ranges> )$counts

## B5: 5'-UTRs
fiveUTR_ranges         <- fiveUTRsByTranscript(txdb, use.names = TRUE)
fiveUTR_ranges_countDF <- featureCounts( <fill in, annot.ext = fiveUTR_ranges> )$counts

## ── Part C: DEG analysis ──────────────────────────────────────

cmp <- readComp(stepsWF(sal)[["hisat2_mapping"]], format = "matrix", delim = "-")

edgeDF_unstranded <- run_edgeR(countDF = unstranded,
                               targets = targetsWF(sal)[["hisat2_mapping"]],
                               cmp = cmp[[1]], independent = FALSE, mdsplot = "")
edgeDF_pos        <- run_edgeR(countDF = stranded_pos,
                               targets = targetsWF(sal)[["hisat2_mapping"]],
                               cmp = cmp[[1]], independent = FALSE, mdsplot = "")

## Venn diagrams → results/DEGcount_unstranded.pdf and results/DEGcount_pos.pdf
## <add filterDEGs, overLapper, vennPlot calls here>
```

---

## Due Date

This homework is due **Tuesday, May 20th at 6:00 PM**.

## Homework Solutions

To be posted after the due date.