--- title: "NGS Analysis Basics" subtitle: "GEN242: Data Analysis in Genome Biology" author: "Thomas Girke" date: today format: revealjs: theme: [default, ../../assets/revealjs_custom.scss] slide-number: true progress: true scrollable: true smaller: true highlight-style: github code-block-height: 380px transition: slide embed-resources: true footer: "GEN242 · UC Riverside · [Tutorial](https://girke.bioinformatics.ucr.edu/GEN242/tutorials/rsequences/rsequences_index.html) · [HW05](https://girke.bioinformatics.ucr.edu/GEN242/assignments/homework/hw05/hw05/)" logo: "https://girke.bioinformatics.ucr.edu/GEN242/assets/logo_gen242.png" execute: echo: true eval: false --- ## Overview Topics covered in this tutorial: - String handling in R base - Key Bioconductor packages for sequence analysis - Biostrings: sequence containers, import/export, manipulation - Pattern matching with and without mismatches - Sequence logos and PWM searching - NGS data: FASTQ format and quality scores - FASTQ quality reports and QC - Filtering and trimming FASTQ files - Memory-efficient FASTQ processing - Range operations with `IRanges` and `GRanges` - Transcript ranges and `TxDb` databases - Efficient sequence parsing from genome files ::: {.callout-note} **HW06** tasks are linked at the end of this tutorial. Full assignment: [HW06](https://girke.bioinformatics.ucr.edu/GEN242/assignments/homework/hw06/hw06/) ::: --- ## Key Bioconductor Packages for Sequence Analysis ### R Base Basic string handling (`grep`, `gsub`, `substring`) plus the full range of numeric data analysis tools. ### Bioconductor — sequence-specific infrastructure | Package | Purpose | |---|---| | [Biostrings](http://bioconductor.org/packages/release/bioc/html/Biostrings.html) | General sequence analysis — the core package | | [ShortRead](http://bioconductor.org/packages/release/bioc/html/ShortRead.html) | Pipeline for short read / FASTQ data | | [IRanges](http://bioconductor.org/packages/release/bioc/html/IRanges.html) | Low-level range data infrastructure | | [GenomicRanges](http://bioconductor.org/packages/release/bioc/html/GenomicRanges.html) | High-level genomic range data | | [GenomicFeatures](http://bioconductor.org/packages/release/bioc/html/GenomicFeatures.html) | Transcript-centric annotation management | | [GenomicAlignments](http://bioconductor.org/packages/release/bioc/html/GenomicAlignments.html) | Short genomic alignments | | [Rsamtools](http://bioconductor.org/packages/release/bioc/html/Rsamtools.html) | Interface to `samtools`, `bcftools`, `tabix` | | [BSgenome](http://bioconductor.org/packages/release/bioc/html/BSgenome.html) | Genome annotation data | | [biomaRt](http://bioconductor.org/packages/release/bioc/html/biomaRt.html) | Interface to BioMart annotations | | [rtracklayer](http://bioconductor.org/packages/release/bioc/html/rtracklayer.html) | Annotation imports, interface to genome browsers | ### Install required packages ```{r} if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install(c("Biostrings", "GenomicRanges", "rtracklayer", "systemPipeR", "seqLogo", "ShortRead")) ``` --- ## String Handling in R Base {.scrollable} ### Sample sequence data ```{r} myseq <- c("ATGCAGACATAGTG", "ATGAACATAGATCC", "GTACAGATCAC") ``` ### String matching ```{r} myseq[grep("ATG", myseq)] # sequences containing "ATG" pos1 <- regexpr("AT", myseq) # position of FIRST match per sequence as.numeric(pos1) attributes(pos1)$match.length pos2 <- gregexpr("AT", myseq) # positions of ALL matches per sequence as.numeric(pos2[[1]]) # match positions in first sequence attributes(pos2[[1]])$match.length # match lengths in first sequence ``` ### String substitution ```{r} gsub("^ATG", "atg", myseq) # replace "ATG" at start of string ``` ### Positional parsing ```{r} nchar(myseq) # length of each string substring(myseq[1], c(1,3), c(2,5)) # extract two fragments from one string substring(myseq, c(1,4,7), c(2,6,10)) # one fragment from each string ``` ### Generate random DNA sequences ```{r} rand <- sapply(1:100, function(x) paste(sample(c("A","T","G","C"), sample(10:20, 1), replace=TRUE), collapse="")) rand[1:3] table(c(rand[1:4], rand[1])) # count identical sequences ``` --- ## Biostrings — Data Objects {.scrollable} The `Biostrings` package provides dedicated containers for biological sequences. ### `XString` — single sequence | Class | Use | |---|---| | `DNAString` | DNA sequence | | `RNAString` | RNA sequence | | `AAString` | Amino acid sequence | | `BString` | Any character string | ```{r} library(Biostrings) d <- DNAString("GCATAT-TAC") d d[1:4] # subsetting by position r <- RNAString(d) # convert DNA to RNA p <- AAString("HCWYHH") # amino acid sequence b <- BString("Any characters can go here.") ``` ### `XStringSet` — many sequences in one object | Class | Use | |---|---| | `DNAStringSet` | set of DNA sequences | | `RNAStringSet` | set of RNA sequences | | `AAStringSet` | set of amino acid sequences | ```{r} dset <- DNAStringSet(c("GCATATTAC", "AATCGATCC", "GCATATTAC")) names(dset) <- c("seq1", "seq2", "seq3") dset[1:2] width(dset) # length of each sequence d <- dset[[1]] # [[ returns single entry as XString dset2 <- c(dset, dset) # concatenate two XStringSet objects dsetchar <- as.character(dset) # convert to named character vector dsetone <- unlist(dset) # collapse all sequences into one DNAString DNAStringSet(dset, start=c(1,2,3), end=c(4,8,5)) # subset by positions ``` ### `QualityScaleXStringSet` — sequences with quality scores ```{r} # QualityScaledDNAStringSet, QualityScaledRNAStringSet, QualityScaledAAStringSet ``` --- ## Biostrings — Import, Export and Manipulation {.scrollable} ### Import FASTA file ```{r} # Download sample sequences dir.create("data", showWarnings = FALSE) download.file( "https://ftp.ncbi.nlm.nih.gov/genomes/archive/old_genbank/Bacteria/Halobacterium_sp_uid217/AE004437.ffn", "data/AE004437.ffn" ) myseq <- readDNAStringSet("data/AE004437.ffn") # import FASTA myseq[1:3] ``` ### Subset and export ```{r} sub <- myseq[grep("99.*", names(myseq))] # subset by name pattern length(sub) writeXStringSet(sub, file="./data/AE004437sub.ffn", width=80) # export to FASTA ``` ### Basic sequence manipulations ```{r} randset <- DNAStringSet(rand) # convert random seqs to DNAStringSet complement(randset[1:2]) # complement reverse(randset[1:2]) # reverse reverseComplement(randset[1:2]) # reverse complement (standard for NGS) translate(randset[1:2]) # translate DNA to protein ``` ### Multiple sequence alignment ```{r} origMAlign <- readDNAMultipleAlignment( filepath = system.file("extdata", "msx2_mRNA.aln", package="Biostrings"), format = "clustal" ) origMAlign ``` --- ## Pattern Matching with Biostrings {.scrollable} Biostrings supports pattern matching with **mismatches** — not possible with base R regex. ### Match a single pattern against one sequence ```{r} myseq1 <- readDNAStringSet("./data/AE004437.ffn") # Find all occurrences allowing 1 mismatch mypos <- matchPattern("ATGGTG", myseq1[[1]], max.mismatch=1) # Count matches only countPattern("ATGGCT", myseq1[[1]], max.mismatch=1) ``` ### Match against all sequences in a set ```{r} vcountPattern("ATGGCT", myseq1, max.mismatch=1)[1:20] # count per sequence myvpos <- vmatchPattern("ATGGCT", myseq1, max.mismatch=1) # match positions (MIndex) Views(myseq1[[1]], start(myvpos[[1]]), end(myvpos[[1]])) # retrieve matches for seq 1 # Retrieve match sequences across all entries sapply(names(myseq1), function(x) as.character(Views(myseq1[[x]], start(myvpos[[x]]), end(myvpos[[x]]))))[1:4] ``` ### Consensus matrix for query and hits ```{r} tmp <- c(DNAStringSet("ATGGTG"), DNAStringSet(mypos)) consensusMatrix(tmp)[1:4,] # shows nucleotide frequency at each position ``` ### Pattern matching with regular expression support ```{r} myseq <- DNAStringSet(c("ATGCAGACATAGTG", "ATGAACATAGATCC", "GTACAGATCAC")) myseq[grep("^ATG", myseq, perl=TRUE)] # sequences starting with ATG DNAStringSet(gsub("^ATG", "NNN", myseq)) # substitute ATG at start with NNN ``` --- ## Sequence Logos and PWM Searching ### Build a position weight matrix (PWM) ```{r} library(seqLogo) pwm <- PWM(DNAStringSet(c("GCT", "GGT", "GCA"))) pwm ``` ### Plot sequence logo with `seqLogo` ```{r} seqLogo(t(t(pwm) * 1/colSums(pwm))) ``` ### Plot with `ggseqlogo` (more customization options) ```{r} library(ggplot2); library(ggseqlogo) pwm <- PWM(DNAStringSet(c("GCT", "GGT", "GCA"))) ggseqlogo(pwm) ``` `ggseqlogo` supports various alphabets including amino acid sequences. See the [manual](https://omarwagih.github.io/ggseqlogo/) for details. ### Search a sequence for PWM matches ```{r} chr <- DNAString("AAAGCTAAAGGTAAAGCAAAA") matchPWM(pwm, chr, min.score=0.9) # returns matches scoring > 90% of maximum ``` --- ## FASTQ Format — Sequence and Quality Data {.scrollable} FASTQ is the standard format for raw NGS reads. Each read occupies **4 lines**: ```default @SRR038845.3 HWI-EAS038:6:1:0:1938 length=36 CAACGAGTTCACACCTTGGCCGACAGGCCCGGGTAA +SRR038845.3 HWI-EAS038:6:1:0:1938 length=36 BA@7>B=>:>>7@7@>>9=BAA?;>52;>:9=8.=A ``` | Line | Content | |---|---| | 1 | `@` + read ID and description | | 2 | DNA sequence | | 3 | `+` (optionally repeated ID) | | 4 | Quality scores as ASCII characters (Phred encoding) | ### Phred quality scores Phred scores are integers 0–50 stored as ASCII characters (score + 33): ```{r} phred <- 1:9 phreda <- paste(sapply(as.raw((phred)+33), rawToChar), collapse="") phreda # ASCII representation as.integer(charToRaw(phreda))-33 # convert back to integers ``` - **Phred 10** = 90% base call accuracy (1 error in 10) - **Phred 20** = 99% accuracy (1 error in 100) — common QC cutoff - **Phred 30** = 99.9% accuracy (1 error in 1,000) --- ## Working with FASTQ Files — ShortRead {.scrollable} ### Download sample data and import ```{r} library(ShortRead) download.file( "https://cluster.hpcc.ucr.edu/~tgirke/HTML_Presentations/Manuals/testdata/samplefastq/data.zip", "data.zip" ) unzip("data.zip") fastq <- list.files("data", "*.fastq$") fastq <- paste("data/", fastq, sep="") names(fastq) <- paste("flowcell6_lane", 1:length(fastq), sep="_") ``` ### Read and inspect a FASTQ file ```{r} fq <- readFastq(fastq[1]) # import first FASTQ file as ShortReadQ object countLines(dirPath="./data", pattern=".fastq$") / 4 # count reads in each file id(fq)[1] # read ID sread(fq)[1] # sequence quality(fq)[1] # Phred quality scores (ASCII) as(quality(fq), "matrix")[1:4, 1:12] # convert Phred to numeric matrix ShortReadQ(sread=sread(fq), quality=quality(fq), id=id(fq)) # construct from components ``` ### Construct a `QualityScaledDNAStringSet` from scratch ```{r} dset <- DNAStringSet(sapply(1:100, function(x) paste(sample(c("A","T","G","C"), 20, replace=TRUE), collapse=""))) myqlist <- lapply(1:100, function(x) sample(1:40, 20, replace=TRUE)) myqual <- sapply(myqlist, function(x) toString(PhredQuality(x))) myqual <- PhredQuality(myqual) dsetq1 <- QualityScaledDNAStringSet(dset, myqual) dsetq1[1:2] ``` --- ## FASTQ Quality Reports {.scrollable} ### Using `systemPipeR` — comprehensive QC plots ```{r} library(systemPipeR) fqlist <- seeFastq(fastq=fastq, batchsize=800, klength=8) # For real data: set batchsize to at least 10^5 seeFastqPlot(fqlist) ``` Generates a multi-panel QC report covering: - Read length distribution - Per-position base quality - Per-position nucleotide frequency - K-mer frequency - Adapter contamination Output can be written to a single PDF covering all samples. ### Using `ShortRead` — alternative QC ```{r} sp <- SolexaPath(system.file('extdata', package='ShortRead')) fl <- file.path(analysisPath(sp), "s_1_sequence.txt") fls <- c(fl, fl) coll <- QACollate( QAFastqSource(fls), QAReadQuality(), QAAdapterContamination(), QANucleotideUse(), QAQualityUse(), QASequenceUse(), QAFrequentSequence(n=10), QANucleotideByCycle(), QAQualityByCycle() ) x <- qa2(coll, verbose=TRUE) res <- report(x) if (interactive()) browseURL(res) # open HTML report in browser ``` --- ## Filtering and Trimming FASTQ Files {.scrollable} All filtering operates on `ShortReadQ` objects and returns a filtered subset. ### Adapter trimming ```{r} fqtrim <- trimLRPatterns(Rpattern="GCCCGGGTAA", subject=fq) sread(fq)[1:2] # before trimming sread(fqtrim)[1:2] # after trimming — adapter sequence removed ``` ### Count reads and remove duplicates ```{r} tables(fq)$distribution # frequency table of read occurrences sum(srduplicated(fq)) # number of duplicated reads fq[!srduplicated(fq)] # keep only unique reads ``` ### Trim low-quality tails ```{r} cutoff <- 30 cutoff <- rawToChar(as.raw(cutoff + 33)) # convert Phred score to ASCII sread(trimTails(fq, k=2, a=cutoff, successive=FALSE))[1:2] # k=2: trim if 2 consecutive bases fall below cutoff ``` ### Remove reads with low average quality ```{r} cutoff <- 30 qcount <- rowSums(as(quality(fq), "matrix") <= 20) fq[qcount == 0] # keep only reads where ALL Phred scores are >= 20 ``` ### Remove reads with Ns or low complexity ```{r} filter1 <- nFilter(threshold=1) # exclude reads with any N filter2 <- polynFilter(threshold=20, nuc=c("A","T","G","C")) # exclude low complexity filter <- compose(filter1, filter2) # combine filters fq[filter(fq)] # apply combined filter ``` --- ## Memory-Efficient FASTQ Processing {.scrollable} For large FASTQ files, streaming avoids loading everything into memory at once. ### Stream or randomly sample reads ```{r} fq <- yield(FastqStreamer(fastq[1], 50)) # import first 50 reads fq <- yield(FastqSampler(fastq[1], 50)) # randomly sample 50 reads ``` ### Stream through a file with filtering and write output ```{r} f <- FastqStreamer(fastq[1], 50) # open streamer with chunk size 50 while (length(fq <- yield(f))) { fqsub <- fq[grepl("^TT", sread(fq))] # keep reads starting with "TT" writeFastq(fqsub, paste(fastq[1], "sub", sep="_"), mode="a", # append mode — builds file across chunks compress=FALSE) } close(f) # always close the streamer ``` ::: {.callout-tip} **Streaming workflow** is the standard approach for production FASTQ processing: open a `FastqStreamer`, yield chunks, apply filters/trimming, write output, repeat until the file is exhausted. This handles files of any size with constant memory usage. ::: --- ## Range Operations — IRanges and GRanges {.scrollable} Genomic ranges (chromosome, start, end, strand) are the foundation of most genome-scale analyses in Bioconductor. ### Three key range containers | Class | Package | Stores | |---|---|---| | `IRanges` | IRanges | integer ranges only | | `GRanges` | GenomicRanges | ranges + chromosome + strand + metadata | | `GRangesList` | GenomicRanges | list of GRanges objects | ### Construct a `GRanges` object ```{r} library(GenomicRanges); library(rtracklayer) gr <- GRanges( seqnames = Rle(c("chr1","chr2","chr1","chr3"), c(1,3,2,4)), ranges = IRanges(1:10, end=7:16, names=head(letters,10)), strand = Rle(strand(c("-","+","*","+","-")), c(1,2,2,3,2)), score = 1:10, GC = seq(1, 0, length=10) ) ``` ### Import a GFF/GTF file directly into a `GRanges` object ```{r} gff <- import.gff("https://cluster.hpcc.ucr.edu/~tgirke/Documents/R_BioCond/Samples/gff3.gff") seqlengths(gff) <- end(ranges(gff[which(values(gff)[,"type"]=="chromosome"),])) names(gff) <- 1:length(gff) gff[1:4,] seqinfo(gff) as.data.frame(gff)[1:4, 1:7] # coerce to data.frame for inspection ``` --- ## GRanges — Accessor Functions and Subsetting {.scrollable} ### Accessor functions ```{r} seqnames(gff) # chromosome names ranges(gff) # IRanges object (start, end, width) strand(gff) # strand information seqlengths(gff) # chromosome lengths start(gff[1:4]) # start positions end(gff[1:4]) # end positions width(gff[1:4]) # feature widths ``` ### Accessing metadata columns ```{r} values(gff) # all metadata columns values(gff)[, "type"][1:20] # one metadata column gff[values(gff)[, "type"] == "gene"] # filter by metadata value ``` ### Subsetting and concatenation ```{r} gff[1:4] # subset by index gff[1:4, c("type", "ID")] # subset rows and metadata columns gff[2] <- gff[3] # replace range c(gff[1:2], gff[401:402]) # concatenate GRanges objects ``` --- ## GRanges — Range Utilities {.scrollable} ### Modify ranges ```{r} gff <- gff[values(gff)$type != "chromosome"] # remove chromosome-level entries strand(gff) <- "*" # erase strand information ``` ### Set operations on ranges ```{r} reduce(gff) # collapse overlapping ranges to continuous gaps(gff) # return uncovered (gap) regions setdiff(as(seqinfo(gff), "GRanges"), gff) # gaps — more intuitive approach disjoin(gff) # return disjoint (non-overlapping) ranges coverage(gff) # per-base coverage as Rle object ``` ### Overlap operations ```{r} findOverlaps(gff, gff[1:4]) # index pairs of overlapping ranges countOverlaps(gff, gff[1:4])[1:40] # count overlaps per range subsetByOverlaps(gff, gff[1:4]) # return only overlapping ranges ``` ::: {.callout-tip} `findOverlaps` and `subsetByOverlaps` are among the most frequently used functions in genomic analyses — they power intersection of features with peaks, reads, SNPs, and any other range-based data. ::: ### GRangesList — list of GRanges objects ```{r} sp <- split(gff, seq(along=gff)) # one range per list component split(gff, seqnames(gff)) # ranges grouped by chromosome unlist(sp) # collapse back to GRanges sp[1:4, "type"] # subset GRangesList lapply(sp[1:4], length) # loop over GRangesList like a list ``` --- ## Transcript Ranges — TxDb Databases {.scrollable} `TxDb` (TranscriptDb) objects store transcript, exon, and CDS ranges in a SQLite database — making annotation retrieval robust and convenient. ### Build a `TxDb` from a GFF file ```{r} library(txdbmaker) download.file( "https://cluster.hpcc.ucr.edu/~tgirke/Documents/R_BioCond/Samples/gff3.gff", "data/gff3.gff" ) txdb <- makeTxDbFromGFF( file = "data/gff3.gff", format = "gff", dataSource = "TAIR", organism = "Arabidopsis thaliana" ) saveDb(txdb, file="./data/TAIR10.sqlite") # save for reuse txdb <- loadDb("./data/TAIR10.sqlite") # reload later transcripts(txdb) # all transcripts as GRanges transcriptsBy(txdb, by="gene") # transcripts grouped by gene exonsBy(txdb, by="gene") # exons grouped by gene ``` ### Build a `TxDb` from BioMart ```{r} library(GenomicFeatures); library(txdbmaker); library(biomaRt) txdb <- makeTxDbFromBiomart( biomart = "plants_mart", dataset = "athaliana_eg_gene", host = "https://plants.ensembl.org" ) # Explore BioMart databases listMarts(host="plants.ensembl.org") mymart <- useMart("plants_mart", dataset="athaliana_eg_gene", host="plants.ensembl.org") listAttributes(mymart) getBM(attributes=c("ensembl_gene_id", "description"), mart=mymart)[1:4,] ``` --- ## Efficient Sequence Parsing from Genome Files {.scrollable} ### `getSeq` — parse sequences for any GRanges annotations Parse all ranges defined in a `GRanges` object from a genome FASTA file: ```{r} library(Biostrings); library(Rsamtools) # Create a random test genome and write to file gff <- gff[values(gff)$type != "chromosome"] rand <- DNAStringSet(sapply( unique(as.character(seqnames(gff))), function(x) paste(sample(c("A","T","G","C"), 200000, replace=TRUE), collapse="") )) writeXStringSet(DNAStringSet(rand), "./data/test") # Parse sequences for all annotated ranges getSeq(FaFile("./data/test"), gff) ``` ### `extractTranscriptSeqs` — splice-aware transcript parsing For multi-exon transcripts, exon sequences must be extracted and concatenated. `extractTranscriptSeqs` handles this automatically: ```{r} library(GenomicFeatures); library(Biostrings); library(Rsamtools) txdb <- loadDb("./data/TAIR10.sqlite") indexFa("data/test") # index the genome FASTA first txseq <- extractTranscriptSeqs( FaFile("data/test"), txdb, use.names=TRUE ) txseq ``` ::: {.callout-tip} Use `getSeq` for simple range extraction (genes, peaks, features). Use `extractTranscriptSeqs` when you need spliced transcript sequences that span multiple exons. ::: --- ## HW05 ::: {.callout-important} **Assignment:** [HW05 — NGS Analysis Basics](https://girke.bioinformatics.ucr.edu/GEN242/assignments/homework/hw06/hw06/) The homework is based on the sequence analysis, FASTQ processing, and range operations covered in this tutorial. ::: --- ## Summary — Key Functions by Topic ### Biostrings — sequences | Function | Purpose | |---|---| | `readDNAStringSet()` | import FASTA file | | `writeXStringSet()` | export to FASTA | | `reverseComplement()` | reverse complement | | `translate()` | DNA → protein | | `matchPattern()` / `vmatchPattern()` | pattern matching (with mismatches) | | `countPattern()` / `vcountPattern()` | count matches | | `PWM()` / `matchPWM()` | position weight matrix search | ### ShortRead — FASTQ | Function | Purpose | |---|---| | `readFastq()` | import FASTQ file | | `sread()`, `quality()`, `id()` | access sequence, quality, ID | | `seeFastqPlot()` | QC report (systemPipeR) | | `trimLRPatterns()` | adapter trimming | | `trimTails()` | quality tail trimming | | `nFilter()`, `polynFilter()` | N and complexity filters | | `FastqStreamer()` / `yield()` | memory-efficient streaming | ### GenomicRanges — ranges | Function | Purpose | |---|---| | `import.gff()` | GFF/GTF → GRanges | | `findOverlaps()` | find overlapping ranges | | `subsetByOverlaps()` | keep overlapping ranges | | `reduce()` / `coverage()` | merge / per-base coverage | | `getSeq()` | extract sequences for ranges | | `extractTranscriptSeqs()` | spliced transcript sequences | **Next:** T7 — [Workflow Framework](https://girke.bioinformatics.ucr.edu/GEN242/tutorials/systempiper/systemPipeR_index.html)