---
title: NGS Analysis Basics
author: "Author: Thomas Girke"
date: "Last update: `r format(Sys.time(), '%d %B, %Y')`"
output:
html_document:
toc: true
toc_float:
collapsed: true
smooth_scroll: true
toc_depth: 3
fig_caption: yes
code_folding: show
number_sections: true
fontsize: 14pt
bibliography: bibtex.bib
weight: 6
type: docs
---
```{r style, echo = FALSE, results = 'asis'}
BiocStyle::markdown()
options(width=100, max.print=1000)
knitr::opts_chunk$set(
eval=as.logical(Sys.getenv("KNITR_EVAL", "TRUE")),
cache=as.logical(Sys.getenv("KNITR_CACHE", "TRUE")))
```
Source code downloads:
[ [.Rmd](https://raw.githubusercontent.com/tgirke/GEN242//main/content/en/tutorials/rsequences/Rsequences.Rmd) ]
[ [.R](https://raw.githubusercontent.com/tgirke/GEN242//main/content/en/tutorials/rsequences/Rsequences.R) ]
## Overview
### Sequence Analysis in R and Bioconductor
__R Base__
* Some basic string handling utilities. Wide spectrum of numeric data analysis tools.
__Bioconductor__
Bioconductor packages provide much more sophisticated string handling utilities for sequence analysis [@Lawrence2013-kt; @Huber2015-ag].
* [Biostrings](http://bioconductor.org/packages/release/bioc/html/Biostrings.html): general sequence analysis environment
* [ShortRead](http://bioconductor.org/packages/release/bioc/html/ShortRead.html): pipeline for short read data
* [IRanges](http://bioconductor.org/packages/release/bioc/html/IRanges.html): low-level infrastructure for range data
* [GenomicRanges](http://bioconductor.org/packages/release/bioc/html/GenomicRanges.html): high-level infrastructure for range data
* [GenomicFeatures](http://bioconductor.org/packages/release/bioc/html/GenomicFeatures.html): managing transcript centric annotations
* [GenomicAlignments](http://bioconductor.org/packages/release/bioc/html/GenomicAlignments.html): handling short genomic alignments
* [Rsamtools](http://bioconductor.org/packages/release/bioc/html/Rsamtools.html): interface to `samtools`, `bcftools` and `tabix`
* [BSgenome](http://bioconductor.org/packages/release/bioc/html/BSgenome.html): genome annotation data
* [biomaRt](http://bioconductor.org/packages/release/bioc/html/biomaRt.html): interface to BioMart annotations
* [rtracklayer](http://bioconductor.org/packages/release/bioc/html/rtracklayer.html): Annotation imports, interface to online genome browsers
* [HelloRanges](http://bioconductor.org/packages/release/bioc/html/HelloRanges.html): Bedtools semantics in Bioc's Ranges infrastructure
## Package Requirements
Several Bioconductor packages are required for this tutorial. To install them, execute
the following lines in the R console. Please also make sure that you have a recent R version
installed on your system. R versions `4.0.x` or higher are recommended.
_Please do not run this install on the HPCC unless you want to reinstall some of these packages in your own
user account._
```{r package_requrirements, eval=FALSE}
source("https://bioconductor.org/biocLite.R")
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install(c("Biostrings", "GenomicRanges", "rtracklayer", "systemPipeR", "seqLogo", "ShortRead"))
```
## Strings in R Base
### Basic String Matching and Parsing
#### String matching
Generate sample sequence data set
```{r string_matching_base1, eval=TRUE}
myseq <- c("ATGCAGACATAGTG", "ATGAACATAGATCC", "GTACAGATCAC")
```
String searching with regular expression support
```{r string_matching_base2, eval=TRUE}
myseq[grep("ATG", myseq)]
```
Searches `myseq` for first match of pattern "AT"
```{r string_matching_base3, eval=TRUE}
pos1 <- regexpr("AT", myseq)
as.numeric(pos1); attributes(pos1)$match.length # Returns position information of matches
```
Searches `myseq` for all matches of pattern "AT"
```{r string_matching_base4, eval=TRUE}
pos2 <- gregexpr("AT", myseq)
as.numeric(pos2[[1]]); attributes(pos2[[1]])$match.length # Returns positions of matches in first sequence
```
String substitution with regular expression support
```{r string_matching_base5, eval=TRUE}
gsub("^ATG", "atg", myseq)
```
#### Positional parsing
```{r positional_parsing_base, eval=TRUE}
nchar(myseq) # Computes length of strings
substring(myseq[1], c(1,3), c(2,5)) # Positional parsing of several fragments from one string
substring(myseq, c(1,4,7), c(2,6,10)) # Positional parsing of many strings
```
### Random Sequence Generation
#### Random DNA sequences of any length
```{r random_sequences_base, eval=TRUE}
rand <- sapply(1:100, function(x) paste(sample(c("A","T","G","C"), sample(10:20, 1), replace=TRUE), collapse=""))
rand[1:3]
```
#### Count identical sequences
```{r count_sequences_base, eval=TRUE}
table(c(rand[1:4], rand[1]))
```
#### Extract reads from reference
Note: this requires the `Biostrings` package.
```{r parse_from_ref, eval=TRUE, message=FALSE, warnings=FALSE}
library(Biostrings)
ref <- DNAString(paste(sample(c("A","T","G","C"), 100000, replace=T), collapse=""))
randstart <- sample(1:(length(ref)-15), 1000)
randreads <- Views(ref, randstart, width=15)
rand_set <- DNAStringSet(randreads)
unlist(rand_set)
```
## Sequences in Bioconductor
### Important Data Objects of Biostrings
#### `XString` for single sequence
* `DNAString`: for DNA
* `RNAString`: for RNA
* `AAString`: for amino acid
* `BString`: for any string
#### `XStringSet` for many sequences
* `DNAStringSet`: for DNA
* `RNAStringSet`: for RNA
* `AAStringSet`: for amino acid
* `BStringSet`: for any string
#### `QualityScaleXStringSet` for sequences with quality data
* `QualityScaledDNAStringSet`: for DNA
* `QualityScaledRNAStringSet`: for RNA
* `QualityScaledAAStringSet`: for amino acid
* `QualityScaledBStringSet`: for any string
### Sequence Import and Export
Download the following sequences to your current working directory and then import them into R:
[https://ftp.ncbi.nlm.nih.gov/genomes/archive/old_genbank/Bacteria/Halobacterium_sp_uid217/AE004437.ffn](https://ftp.ncbi.nlm.nih.gov/genomes/archive/old_genbank/Bacteria/Halobacterium_sp_uid217/AE004437.ffn)
```{r download_sequences, eval=TRUE, message=FALSE, warnings=FALSE}
dir.create("data", showWarnings = FALSE)
# system("wget https://ftp.ncbi.nlm.nih.gov/genomes/archive/old_genbank/Bacteria/Halobacterium_sp_uid217/AE004437.ffn")
download.file("https://ftp.ncbi.nlm.nih.gov/genomes/archive/old_genbank/Bacteria/Halobacterium_sp_uid217/AE004437.ffn", "data/AE004437.ffn")
```
Import FASTA file with `readDNAStringSet`
```{r import_sequences1, eval=TRUE}
myseq <- readDNAStringSet("data/AE004437.ffn")
myseq[1:3]
```
Subset sequences with regular expression on sequence name field
```{r import_sequences2, eval=TRUE}
sub <- myseq[grep("99.*", names(myseq))]
length(sub)
```
Export subsetted sequences to FASTA file
```{r export_sequences, eval=TRUE}
writeXStringSet(sub, file="./data/AE004437sub.ffn", width=80)
```
Now inspect exported sequence file `AE004437sub.ffn` in a text editor
### Working with `XString` Containers
The `XString` stores the different types of biosequences in dedicated containers
```{r xstring_container, eval=TRUE, message=FALSE, warnings=FALSE}
library(Biostrings)
d <- DNAString("GCATAT-TAC")
d
d[1:4]
```
RNA sequences
```{r rnastring_container, eval=TRUE, message=FALSE, warnings=FALSE}
r <- RNAString("GCAUAU-UAC")
r <- RNAString(d) # Converts d to RNAString object
r
```
Protein sequences
```{r aastring_container, eval=TRUE, message=FALSE, warnings=FALSE}
p <- AAString("HCWYHH")
p
```
Any type of character strings
```{r bstring_container, eval=TRUE, message=FALSE, warnings=FALSE}
b <- BString("I store any set of characters. Other XString objects store only the IUPAC characters.")
b
```
### Working with `XStringSet` Containers
`XStringSet` containers allow to store many biosequences in one object
```{r xtringset_container, eval=TRUE}
dset <- DNAStringSet(c("GCATATTAC", "AATCGATCC", "GCATATTAC"))
names(dset) <- c("seq1", "seq2", "seq3") # Assigns names
dset[1:2]
```
Important utilities for `XStringSet` containers
```{r xtringset_container_utilities, eval=TRUE}
width(dset) # Returns the length of each sequences
d <- dset[[1]] # The [[ subsetting operator returns a single entry as XString object
dset2 <- c(dset, dset) # Appends/concatenates two XStringSet objects
dsetchar <- as.character(dset) # Converts XStringSet to named vector
dsetone <- unlist(dset) # Collapses many sequences to a single one stored in a DNAString container
```
Sequence subsetting by positions:
```{r xtringset_subsetting, eval=TRUE}
DNAStringSet(dset, start=c(1,2,3), end=c(4,8,5))
```
### Multiple Alignment Class
The `XMultipleAlignment` class stores the different types of multiple sequence alignments:
```{r xmultiple_alignment, eval=TRUE}
origMAlign <- readDNAMultipleAlignment(filepath = system.file("extdata",
"msx2_mRNA.aln", package = "Biostrings"), format = "clustal")
origMAlign
```
### Basic Sequence Manipulations
#### Reverse and Complement
```{r rev_comp, eval=TRUE}
randset <- DNAStringSet(rand)
complement(randset[1:2])
reverse(randset[1:2])
reverseComplement(randset[1:2])
```
### Translate DNA into Protein
```{r translate, eval=TRUE, message=FALSE, warnings=FALSE}
translate(randset[1:2])
```
### Pattern Matching
#### Pattern matching with mismatches
Find pattern matches in reference
```{r pattern_matching1, eval=TRUE}
myseq1 <- readDNAStringSet("./data/AE004437.ffn")
mypos <- matchPattern("ATGGTG", myseq1[[1]], max.mismatch=1)
```
Count only the corresponding matches
```{r pattern_matching2, eval=TRUE}
countPattern("ATGGCT", myseq1[[1]], max.mismatch=1)
```
Count matches in many sequences
```{r pattern_matching3, eval=TRUE}
vcountPattern("ATGGCT", myseq1, max.mismatch=1)[1:20]
```
Results shown in DNAStringSet object
```{r pattern_matching4, eval=TRUE}
tmp <- c(DNAStringSet("ATGGTG"), DNAStringSet(mypos))
```
Return a consensus matrix for query and hits
```{r pattern_matching5, eval=TRUE}
consensusMatrix(tmp)[1:4,]
```
Find all pattern matches in reference
```{r pattern_matching6, eval=TRUE}
myvpos <- vmatchPattern("ATGGCT", myseq1, max.mismatch=1)
myvpos # The results are stored as MIndex object.
Views(myseq1[[1]], start(myvpos[[1]]), end(myvpos[[1]])) # Retrieves the result for single entry
```
Return all matches
```{r all_matches, eval=FALSE}
sapply(names(myseq1), function(x)
as.character(Views(myseq1[[x]], start(myvpos[[x]]), end(myvpos[[x]]))))[1:4]
```
#### Pattern matching with regular expression support
```{r regex_pattern_matching, eval=TRUE}
myseq <- DNAStringSet(c("ATGCAGACATAGTG", "ATGAACATAGATCC", "GTACAGATCAC"))
myseq[grep("^ATG", myseq, perl=TRUE)] # String searching with regular expression support
pos1 <- regexpr("AT", myseq) # Searches 'myseq' for first match of pattern "AT"
as.numeric(pos1); attributes(pos1)$match.length # Returns position information of matches
pos2 <- gregexpr("AT", myseq) # Searches 'myseq' for all matches of pattern "AT"
as.numeric(pos2[[1]]); attributes(pos2[[1]])$match.length # Match positions in first sequence
DNAStringSet(gsub("^ATG", "NNN", myseq)) # String substitution with regular expression support
```
### PWM Viewing and Searching
#### Plot with `seqLogo`
```{r pwm_logo, eval=TRUE}
library(seqLogo)
pwm <- PWM(DNAStringSet(c("GCT", "GGT", "GCA")))
pwm
seqLogo(t(t(pwm) * 1/colSums(pwm)))
```
#### Plot with `ggseqlogo`
The `ggseqlogo` package ([manual](https://omarwagih.github.io/ggseqlogo/))
provides many customization options for plotting sequence logos. It also supports
various alphabets including sequence logos for amino acid sequences.
```{r pwm_logo2, eval=TRUE}
library(ggplot2); library(ggseqlogo)
pwm <- PWM(DNAStringSet(c("GCT", "GGT", "GCA")))
ggseqlogo(pwm)
```
Search sequence for PWM matches with score better than `min.score`
```{r pwm_search, eval=TRUE}
chr <- DNAString("AAAGCTAAAGGTAAAGCAAAA")
matchPWM(pwm, chr, min.score=0.9)
```
## NGS Sequences
### Sequence and Quality Data: FASTQ Format
Four lines per sequence:
1. ID
2. Sequence
3. ID
4. Base call qualities (Phred scores) as ASCII characters
The following gives an example of 3 Illumina reads in a FASTQ file. The numbers
at the beginning of each line are not part of the FASTQ format. They have been added
solely for illustration purposes.
```
1. @SRR038845.3 HWI-EAS038:6:1:0:1938 length=36
2. CAACGAGTTCACACCTTGGCCGACAGGCCCGGGTAA
3. +SRR038845.3 HWI-EAS038:6:1:0:1938 length=36
4. BA@7>B=>:>>7@7@>>9=BAA?;>52;>:9=8.=A
1. @SRR038845.41 HWI-EAS038:6:1:0:1474 length=36
2. CCAATGATTTTTTTCCGTGTTTCAGAATACGGTTAA
3. +SRR038845.41 HWI-EAS038:6:1:0:1474 length=36
4. BCCBA@BB@BBBBAB@B9B@=BABA@A:@693:@B=
1. @SRR038845.53 HWI-EAS038:6:1:1:360 length=36
2. GTTCAAAAAGAACTAAATTGTGTCAATAGAAAACTC
3. +SRR038845.53 HWI-EAS038:6:1:1:360 length=36
4. BBCBBBBBB@@BAB?BBBBCBC>BBBAA8>BBBAA@
```
### Sequence and Quality Data: `QualityScaleXStringSet`
Phred quality scores are integers from 0-50 that are
stored as ASCII characters after adding 33. The basic R functions `rawToChar` and
`charToRaw` can be used to interconvert among their representations.
Phred score interconversion
```{r raw_to_char, eval=TRUE}
phred <- 1:9
phreda <- paste(sapply(as.raw((phred)+33), rawToChar), collapse="")
phreda
as.integer(charToRaw(phreda))-33
```
Construct `QualityScaledDNAStringSet` from scratch
```{r raw_to_char2, eval=TRUE}
dset <- DNAStringSet(sapply(1:100, function(x) paste(sample(c("A","T","G","C"), 20, replace=T), collapse=""))) # Creates random sample sequence.
myqlist <- lapply(1:100, function(x) sample(1:40, 20, replace=T)) # Creates random Phred score list.
myqual <- sapply(myqlist, function(x) toString(PhredQuality(x))) # Converts integer scores into ASCII characters.
myqual <- PhredQuality(myqual) # Converts to a PhredQuality object.
dsetq1 <- QualityScaledDNAStringSet(dset, myqual) # Combines DNAStringSet and quality data in QualityScaledDNAStringSet object.
dsetq1[1:2]
```
### Processing FASTQ Files with ShortRead
The following explains the basic usage of `ShortReadQ` objects. To make the sample code work,
download and unzip this [file](http://cluster.hpcc.ucr.edu/~tgirke/HTML_Presentations/Manuals/Workshop_Dec_6_10_2012/Rsequences/data.zip) to your current working directory.
The following code performs the download for you.
```{r read_fastq1, eval=TRUE, message=FALSE, warnings=FALSE}
library(ShortRead)
download.file("http://cluster.hpcc.ucr.edu/~tgirke/HTML_Presentations/Manuals/testdata/samplefastq/data.zip", "data.zip")
unzip("data.zip")
```
Important utilities for accessing FASTQ files
```{r read_fastq2, eval=TRUE, message=FALSE, warnings=FALSE}
fastq <- list.files("data", "*.fastq$"); fastq <- paste("data/", fastq, sep="")
names(fastq) <- paste("flowcell6_lane", 1:length(fastq), sep="_")
(fq <- readFastq(fastq[1])) # Imports first FASTQ file
countLines(dirPath="./data", pattern=".fastq$")/4 # Counts numbers of reads in FASTQ files
id(fq)[1] # Returns ID field
sread(fq)[1] # Returns sequence
quality(fq)[1] # Returns Phred scores
as(quality(fq), "matrix")[1:4,1:12] # Coerces Phred scores to numeric matrix
ShortReadQ(sread=sread(fq), quality=quality(fq), id=id(fq)) # Constructs a ShortReadQ from components
```
### FASTQ Quality Reports
#### Using `systemPipeR`
The following `seeFastq` and `seeFastqPlot` functions generate and plot a series of useful quality statistics for a set of FASTQ files.
```{r see_fastq, eval=TRUE, message=FALSE, warnings=FALSE, fig.height=12, fig.width=14}
library(systemPipeR)
fqlist <- seeFastq(fastq=fastq, batchsize=800, klength=8) # For real data set batchsize to at least 10^5
seeFastqPlot(fqlist)
```
Handles many samples in one PDF file. For more details see [here](http://bioconductor.org/packages/devel/bioc/vignettes/systemPipeR/inst/doc/systemPipeR.html)
#### Using `ShortRead`
The `ShortRead` package contains several FASTQ quality reporting functions.
```{r shortread_fastq_qc, eval=FALSE}
sp <- SolexaPath(system.file('extdata', package='ShortRead'))
fl <- file.path(analysisPath(sp), "s_1_sequence.txt")
fls <- c(fl, fl)
coll <- QACollate(QAFastqSource(fls), QAReadQuality(), QAAdapterContamination(),
QANucleotideUse(), QAQualityUse(), QASequenceUse(), QAFrequentSequence(n=10),
QANucleotideByCycle(), QAQualityByCycle())
x <- qa2(coll, verbose=TRUE)
res <- report(x)
if(interactive())
browseURL(res)
```
### Filtering and Trimming FASTQ Files with ShortRead
#### Adaptor trimming
```{r adaptor_trimming, eval=TRUE}
fqtrim <- trimLRPatterns(Rpattern="GCCCGGGTAA", subject=fq)
sread(fq)[1:2] # Before trimming
sread(fqtrim)[1:2] # After trimming
```
#### Read counting and duplicate removal
```{r read_counting, eval=TRUE}
tables(fq)$distribution # Counts read occurences
sum(srduplicated(fq)) # Identifies duplicated reads
fq[!srduplicated(fq)]
```
#### Trimming low quality tails
```{r trim_tails, eval=TRUE}
cutoff <- 30
cutoff <- rawToChar(as.raw(cutoff+33))
sread(trimTails(fq, k=2, a=cutoff, successive=FALSE))[1:2]
```
#### Removal of reads with Phred scores below a threshold value
```{r remove_low_quality_reads, eval=TRUE}
cutoff <- 30
qcount <- rowSums(as(quality(fq), "matrix") <= 20)
fq[qcount == 0] # Number of reads where all Phred scores >= 20
```
#### Removal of reads with x Ns and/or low complexity segments
```{r remove_N_reads, eval=TRUE}
filter1 <- nFilter(threshold=1) # Keeps only reads without Ns
filter2 <- polynFilter(threshold=20, nuc=c("A","T","G","C")) # Removes reads with nucleotide bias, >=20 of any base
filter <- compose(filter1, filter2)
fq[filter(fq)]
```
### Memory Efficient FASTQ Processing
Streaming through FASTQ files with `FastqStreamer` and random sampling reads with `FastqSampler`
```{r stream_fastq1, eval=TRUE}
fq <- yield(FastqStreamer(fastq[1], 50)) # Imports first 50 reads
fq <- yield(FastqSampler(fastq[1], 50)) # Random samples 50 reads
```
Streaming through a FASTQ file while applying filtering/trimming functions and writing the results to a new file
here `SRR038845.fastq_sub` in `data` directory.
```{r stream_fastq2, eval=TRUE, message=FALSE, warnings=FALSE}
f <- FastqStreamer(fastq[1], 50)
while(length(fq <- yield(f))) {
fqsub <- fq[grepl("^TT", sread(fq))]
writeFastq(fqsub, paste(fastq[1], "sub", sep="_"), mode="a", compress=FALSE)
}
close(f)
```
## Range Operations
### Important Data Objects for Range Operations
* `IRanges`: stores range data only (IRanges library)
* `GRanges`: stores ranges and annotations (GenomicRanges library)
* `GRangesList`: list version of GRanges container (GenomicRanges library)
### Range Data Are Stored in `IRanges` and `GRanges` Containers
#### Construct `GRanges` Object
The following works with information that relates to GFF/GTF file format. The defintions
of the GFF/GTF file format are [here](https://useast.ensembl.org/info/website/upload/gff.html).
```{r genomicranges1, eval=TRUE, message=FALSE, warnings=FALSE}
library(GenomicRanges); library(rtracklayer)
gr <- GRanges(seqnames = Rle(c("chr1", "chr2", "chr1", "chr3"), c(1, 3, 2, 4)), ranges = IRanges(1:10, end = 7:16, names = head(letters, 10)), strand = Rle(strand(c("-", "+", "*", "+", "-")), c(1, 2, 2, 3, 2)), score = 1:10, GC = seq(1, 0, length = 10)) # Example of creating a GRanges object with its constructor function.
```
#### Import GFF into `GRanges` Object
```{r genomicranges2, eval=TRUE, message=FALSE, warnings=FALSE}
gff <- import.gff("http://cluster.hpcc.ucr.edu/~tgirke/Documents/R_BioCond/Samples/gff3.gff") # Imports a simplified GFF3 genome annotation file.
seqlengths(gff) <- end(ranges(gff[which(values(gff)[,"type"]=="chromosome"),]))
names(gff) <- 1:length(gff) # Assigns names to corresponding slot
gff[1:4,]
seqinfo(gff)
```
#### Coerce `GRanges` object to `data.frame`
```{r genomicranges3, eval=TRUE, message=FALSE, warnings=FALSE}
as.data.frame(gff)[1:4, 1:7]
```
### Utilities for Range Containers
#### Accessor and subsetting methods for GRanges objects
Subsetting and replacement
```{r genomicranges_access2, eval=TRUE}
gff[1:4]
gff[1:4, c("type", "ID")]
gff[2] <- gff[3]
```
GRanges objects can be concatenated with the `c` function
```{r genomicranges_access3, eval=TRUE}
c(gff[1:2], gff[401:402])
```
Acessor functions
```{r genomicranges_access4, eval=TRUE}
seqnames(gff)
ranges(gff)
strand(gff)
seqlengths(gff)
start(gff[1:4])
end(gff[1:4])
width(gff[1:4])
```
Accessing metadata component
```{r genomicranges_access5, eval=TRUE}
values(gff) # or elementMetadata(gff)
values(gff)[, "type"][1:20]
gff[values(gff)[ ,"type"] == "gene"]
```
#### Useful utilities for GRanges objects
Remove chromosome ranges
```{r genomicranges_utilities1, eval=TRUE}
gff <- gff[values(gff)$type != "chromosome"]
```
Erase the strand information
```{r genomicranges_utilities2, eval=TRUE}
strand(gff) <- "*"
```
Collapses overlapping ranges to continuous ranges.
```{r genomicranges_utilities3, eval=TRUE}
reduce(gff)
```
Return uncovered regions
```{r genomicranges_utilities4, eval=TRUE}
gaps(gff)
```
More intuitive way to get uncovered regions
```{r genomicranges_utilities4b, eval=TRUE}
setdiff(as(seqinfo(gff), "GRanges"), gff)
```
Return disjoint ranges
```{r genomicranges_utilities5, eval=TRUE}
disjoin(gff)
```
Returns coverage of ranges
```{r genomicranges_utilities6, eval=TRUE}
coverage(gff)
```
Return the index pairings for overlapping ranges
```{r genomicranges_utilities7, eval=TRUE}
findOverlaps(gff, gff[1:4])
```
Counts overlapping ranges
```{r genomicranges_utilities8, eval=TRUE}
countOverlaps(gff, gff[1:4])[1:40]
```
Return only overlapping ranges
```{r genomicranges_utilities9, eval=TRUE}
subsetByOverlaps(gff, gff[1:4])
```
### GRangesList Objects
```{r genomicrangeslist_objects, eval=TRUE}
sp <- split(gff, seq(along=gff)) # Stores every range in separate component of a GRangesList object
split(gff, seqnames(gff)) # Stores ranges of each chromosome in separate component.
unlist(sp) # Returns data as GRanges object
sp[1:4, "type"] # Subsetting of GRangesList objects is similar to GRanges objects.
lapply(sp[1:4], length) # Looping over GRangesList objects similar to lists
```
## Transcript Ranges
Storing annotation ranges in `TranscriptDb` databases makes many operations more robust and convenient.
```{r txdb_objects1, eval=TRUE, message=FALSE, warnings=FALSE}
library(GenomicFeatures)
download.file("http://cluster.hpcc.ucr.edu/~tgirke/Documents/R_BioCond/Samples/gff3.gff", "data/gff3.gff")
txdb <- makeTxDbFromGFF(file="data/gff3.gff", format="gff", dataSource="TAIR", organism="Arabidopsis thaliana")
saveDb(txdb, file="./data/TAIR10.sqlite")
txdb <- loadDb("./data/TAIR10.sqlite")
transcripts(txdb)
transcriptsBy(txdb, by = "gene")
exonsBy(txdb, by = "gene")
```
### `txdb` from BioMart
Alternative sources for creating `txdb` databases are BioMart, Bioc annotation packages, UCSC, etc. The following shows how to create a `txdb` from BioMart.
```{r txdb_objects1_biomart, eval=FALSE, message=FALSE, warnings=FALSE}
library(GenomicFeatures); library("biomaRt")
txdb <- makeTxDbFromBiomart(biomart = "plants_mart", dataset = "athaliana_eg_gene", host="https://plants.ensembl.org")
```
The following steps are useful to find out what is availble in BioMart.
```{r biomart_basics, eval=FALSE, message=FALSE, warnings=FALSE}
listMarts() # Lists BioMart databases
listMarts(host="plants.ensembl.org")
mymart <- useMart("plants_mart", host="plants.ensembl.org") # Select one, here plants_mart
listDatasets(mymart) # List datasets available in the selected BioMart database
mymart <- useMart("plants_mart", dataset="athaliana_eg_gene", host="plants.ensembl.org")
listAttributes(mymart) # List available features
getBM(attributes=c("ensembl_gene_id", "description"), mart=mymart)[1:4,]
```
### Efficient Sequence Parsing
### `getSeq`
The following parses all annotation ranges provided by a `GRanges` object (e.g. `gff`) from a genome sequence stored in a local file.
```{r getseq_gff, eval=TRUE, message=FALSE, warnings=FALSE}
gff <- gff[values(gff)$type != "chromosome"] # Remove chromosome ranges
rand <- DNAStringSet(sapply(unique(as.character(seqnames(gff))), function(x) paste(sample(c("A","T","G","C"), 200000, replace=T), collapse="")))
writeXStringSet(DNAStringSet(rand), "./data/test")
getSeq(FaFile("./data/test"), gff)
```
#### `extractTranscriptSeqs`
Sequences composed of several ranges, such as transcripts (or CDSs) with several exons, can be parsed with `extractTranscriptSeqs`.
Note: the following expects the genome sequence in a file with path `data/test` and a valid `txdb` defining the ranges for that
genome.
```{r extractTranscritpSeqs, eval=TRUE, message=FALSE, warnings=FALSE}
library(GenomicFeatures); library(Biostrings); library(Rsamtools)
txdb <- loadDb("./data/TAIR10.sqlite")
indexFa("data/test") # Creates index for genome fasta
txseq <- extractTranscriptSeqs(FaFile("data/test"), txdb, use.names=TRUE)
txseq
```
## Homework 6
See [here](https://girke.bioinformatics.ucr.edu/GEN242/assignments/homework/hw06/hw06/).
## Session Info
```{r sessionInfo, eval=TRUE}
sessionInfo()
```
## References