--- title: NGS Analysis Basics author: "Author: Thomas Girke" date: "Last update: `r format(Sys.time(), '%d %B, %Y')`" output: html_document: toc: true toc_float: collapsed: true smooth_scroll: true toc_depth: 3 fig_caption: yes code_folding: show number_sections: true fontsize: 14pt bibliography: bibtex.bib weight: 6 type: docs --- ```{r style, echo = FALSE, results = 'asis'} BiocStyle::markdown() options(width=100, max.print=1000) knitr::opts_chunk$set( eval=as.logical(Sys.getenv("KNITR_EVAL", "TRUE")), cache=as.logical(Sys.getenv("KNITR_CACHE", "TRUE"))) ```
Source code downloads:     [ [.Rmd](https://raw.githubusercontent.com/tgirke/GEN242//main/content/en/tutorials/rsequences/Rsequences.Rmd) ]     [ [.R](https://raw.githubusercontent.com/tgirke/GEN242//main/content/en/tutorials/rsequences/Rsequences.R) ]
## Overview ### Sequence Analysis in R and Bioconductor __R Base__ * Some basic string handling utilities. Wide spectrum of numeric data analysis tools. __Bioconductor__ Bioconductor packages provide much more sophisticated string handling utilities for sequence analysis [@Lawrence2013-kt; @Huber2015-ag]. * [Biostrings](http://bioconductor.org/packages/release/bioc/html/Biostrings.html): general sequence analysis environment * [ShortRead](http://bioconductor.org/packages/release/bioc/html/ShortRead.html): pipeline for short read data * [IRanges](http://bioconductor.org/packages/release/bioc/html/IRanges.html): low-level infrastructure for range data * [GenomicRanges](http://bioconductor.org/packages/release/bioc/html/GenomicRanges.html): high-level infrastructure for range data * [GenomicFeatures](http://bioconductor.org/packages/release/bioc/html/GenomicFeatures.html): managing transcript centric annotations * [GenomicAlignments](http://bioconductor.org/packages/release/bioc/html/GenomicAlignments.html): handling short genomic alignments * [Rsamtools](http://bioconductor.org/packages/release/bioc/html/Rsamtools.html): interface to `samtools`, `bcftools` and `tabix` * [BSgenome](http://bioconductor.org/packages/release/bioc/html/BSgenome.html): genome annotation data * [biomaRt](http://bioconductor.org/packages/release/bioc/html/biomaRt.html): interface to BioMart annotations * [rtracklayer](http://bioconductor.org/packages/release/bioc/html/rtracklayer.html): Annotation imports, interface to online genome browsers * [HelloRanges](http://bioconductor.org/packages/release/bioc/html/HelloRanges.html): Bedtools semantics in Bioc's Ranges infrastructure ## Package Requirements Several Bioconductor packages are required for this tutorial. To install them, execute the following lines in the R console. Please also make sure that you have a recent R version installed on your system. R versions `4.0.x` or higher are recommended. _Please do not run this install on the HPCC unless you want to reinstall some of these packages in your own user account._ ```{r package_requrirements, eval=FALSE} source("https://bioconductor.org/biocLite.R") if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install(c("Biostrings", "GenomicRanges", "rtracklayer", "systemPipeR", "seqLogo", "ShortRead")) ``` ## Strings in R Base ### Basic String Matching and Parsing #### String matching Generate sample sequence data set ```{r string_matching_base1, eval=TRUE} myseq <- c("ATGCAGACATAGTG", "ATGAACATAGATCC", "GTACAGATCAC") ``` String searching with regular expression support ```{r string_matching_base2, eval=TRUE} myseq[grep("ATG", myseq)] ``` Searches `myseq` for first match of pattern "AT" ```{r string_matching_base3, eval=TRUE} pos1 <- regexpr("AT", myseq) as.numeric(pos1); attributes(pos1)$match.length # Returns position information of matches ``` Searches `myseq` for all matches of pattern "AT" ```{r string_matching_base4, eval=TRUE} pos2 <- gregexpr("AT", myseq) as.numeric(pos2[[1]]); attributes(pos2[[1]])$match.length # Returns positions of matches in first sequence ``` String substitution with regular expression support ```{r string_matching_base5, eval=TRUE} gsub("^ATG", "atg", myseq) ``` #### Positional parsing ```{r positional_parsing_base, eval=TRUE} nchar(myseq) # Computes length of strings substring(myseq[1], c(1,3), c(2,5)) # Positional parsing of several fragments from one string substring(myseq, c(1,4,7), c(2,6,10)) # Positional parsing of many strings ``` ### Random Sequence Generation #### Random DNA sequences of any length ```{r random_sequences_base, eval=TRUE} rand <- sapply(1:100, function(x) paste(sample(c("A","T","G","C"), sample(10:20, 1), replace=TRUE), collapse="")) rand[1:3] ``` #### Count identical sequences ```{r count_sequences_base, eval=TRUE} table(c(rand[1:4], rand[1])) ``` #### Extract reads from reference Note: this requires the `Biostrings` package. ```{r parse_from_ref, eval=TRUE, message=FALSE, warnings=FALSE} library(Biostrings) ref <- DNAString(paste(sample(c("A","T","G","C"), 100000, replace=T), collapse="")) randstart <- sample(1:(length(ref)-15), 1000) randreads <- Views(ref, randstart, width=15) rand_set <- DNAStringSet(randreads) unlist(rand_set) ``` ## Sequences in Bioconductor ### Important Data Objects of Biostrings #### `XString` for single sequence * `DNAString`: for DNA * `RNAString`: for RNA * `AAString`: for amino acid * `BString`: for any string #### `XStringSet` for many sequences * `DNAStringSet`: for DNA * `RNAStringSet`: for RNA * `AAStringSet`: for amino acid * `BStringSet`: for any string #### `QualityScaleXStringSet` for sequences with quality data * `QualityScaledDNAStringSet`: for DNA * `QualityScaledRNAStringSet`: for RNA * `QualityScaledAAStringSet`: for amino acid * `QualityScaledBStringSet`: for any string ### Sequence Import and Export Download the following sequences to your current working directory and then import them into R: [https://ftp.ncbi.nlm.nih.gov/genomes/archive/old_genbank/Bacteria/Halobacterium_sp_uid217/AE004437.ffn](https://ftp.ncbi.nlm.nih.gov/genomes/archive/old_genbank/Bacteria/Halobacterium_sp_uid217/AE004437.ffn) ```{r download_sequences, eval=TRUE, message=FALSE, warnings=FALSE} dir.create("data", showWarnings = FALSE) # system("wget https://ftp.ncbi.nlm.nih.gov/genomes/archive/old_genbank/Bacteria/Halobacterium_sp_uid217/AE004437.ffn") download.file("https://ftp.ncbi.nlm.nih.gov/genomes/archive/old_genbank/Bacteria/Halobacterium_sp_uid217/AE004437.ffn", "data/AE004437.ffn") ``` Import FASTA file with `readDNAStringSet` ```{r import_sequences1, eval=TRUE} myseq <- readDNAStringSet("data/AE004437.ffn") myseq[1:3] ``` Subset sequences with regular expression on sequence name field ```{r import_sequences2, eval=TRUE} sub <- myseq[grep("99.*", names(myseq))] length(sub) ``` Export subsetted sequences to FASTA file ```{r export_sequences, eval=TRUE} writeXStringSet(sub, file="./data/AE004437sub.ffn", width=80) ``` Now inspect exported sequence file `AE004437sub.ffn` in a text editor ### Working with `XString` Containers The `XString` stores the different types of biosequences in dedicated containers ```{r xstring_container, eval=TRUE, message=FALSE, warnings=FALSE} library(Biostrings) d <- DNAString("GCATAT-TAC") d d[1:4] ``` RNA sequences ```{r rnastring_container, eval=TRUE, message=FALSE, warnings=FALSE} r <- RNAString("GCAUAU-UAC") r <- RNAString(d) # Converts d to RNAString object r ``` Protein sequences ```{r aastring_container, eval=TRUE, message=FALSE, warnings=FALSE} p <- AAString("HCWYHH") p ``` Any type of character strings ```{r bstring_container, eval=TRUE, message=FALSE, warnings=FALSE} b <- BString("I store any set of characters. Other XString objects store only the IUPAC characters.") b ``` ### Working with `XStringSet` Containers `XStringSet` containers allow to store many biosequences in one object ```{r xtringset_container, eval=TRUE} dset <- DNAStringSet(c("GCATATTAC", "AATCGATCC", "GCATATTAC")) names(dset) <- c("seq1", "seq2", "seq3") # Assigns names dset[1:2] ``` Important utilities for `XStringSet` containers ```{r xtringset_container_utilities, eval=TRUE} width(dset) # Returns the length of each sequences d <- dset[[1]] # The [[ subsetting operator returns a single entry as XString object dset2 <- c(dset, dset) # Appends/concatenates two XStringSet objects dsetchar <- as.character(dset) # Converts XStringSet to named vector dsetone <- unlist(dset) # Collapses many sequences to a single one stored in a DNAString container ``` Sequence subsetting by positions: ```{r xtringset_subsetting, eval=TRUE} DNAStringSet(dset, start=c(1,2,3), end=c(4,8,5)) ``` ### Multiple Alignment Class The `XMultipleAlignment` class stores the different types of multiple sequence alignments: ```{r xmultiple_alignment, eval=TRUE} origMAlign <- readDNAMultipleAlignment(filepath = system.file("extdata", "msx2_mRNA.aln", package = "Biostrings"), format = "clustal") origMAlign ``` ### Basic Sequence Manipulations #### Reverse and Complement ```{r rev_comp, eval=TRUE} randset <- DNAStringSet(rand) complement(randset[1:2]) reverse(randset[1:2]) reverseComplement(randset[1:2]) ``` ### Translate DNA into Protein ```{r translate, eval=TRUE, message=FALSE, warnings=FALSE} translate(randset[1:2]) ``` ### Pattern Matching #### Pattern matching with mismatches Find pattern matches in reference ```{r pattern_matching1, eval=TRUE} myseq1 <- readDNAStringSet("./data/AE004437.ffn") mypos <- matchPattern("ATGGTG", myseq1[[1]], max.mismatch=1) ``` Count only the corresponding matches ```{r pattern_matching2, eval=TRUE} countPattern("ATGGCT", myseq1[[1]], max.mismatch=1) ``` Count matches in many sequences ```{r pattern_matching3, eval=TRUE} vcountPattern("ATGGCT", myseq1, max.mismatch=1)[1:20] ``` Results shown in DNAStringSet object ```{r pattern_matching4, eval=TRUE} tmp <- c(DNAStringSet("ATGGTG"), DNAStringSet(mypos)) ``` Return a consensus matrix for query and hits ```{r pattern_matching5, eval=TRUE} consensusMatrix(tmp)[1:4,] ``` Find all pattern matches in reference ```{r pattern_matching6, eval=TRUE} myvpos <- vmatchPattern("ATGGCT", myseq1, max.mismatch=1) myvpos # The results are stored as MIndex object. Views(myseq1[[1]], start(myvpos[[1]]), end(myvpos[[1]])) # Retrieves the result for single entry ``` Return all matches ```{r all_matches, eval=FALSE} sapply(names(myseq1), function(x) as.character(Views(myseq1[[x]], start(myvpos[[x]]), end(myvpos[[x]]))))[1:4] ``` #### Pattern matching with regular expression support ```{r regex_pattern_matching, eval=TRUE} myseq <- DNAStringSet(c("ATGCAGACATAGTG", "ATGAACATAGATCC", "GTACAGATCAC")) myseq[grep("^ATG", myseq, perl=TRUE)] # String searching with regular expression support pos1 <- regexpr("AT", myseq) # Searches 'myseq' for first match of pattern "AT" as.numeric(pos1); attributes(pos1)$match.length # Returns position information of matches pos2 <- gregexpr("AT", myseq) # Searches 'myseq' for all matches of pattern "AT" as.numeric(pos2[[1]]); attributes(pos2[[1]])$match.length # Match positions in first sequence DNAStringSet(gsub("^ATG", "NNN", myseq)) # String substitution with regular expression support ``` ### PWM Viewing and Searching #### Plot with `seqLogo` ```{r pwm_logo, eval=TRUE} library(seqLogo) pwm <- PWM(DNAStringSet(c("GCT", "GGT", "GCA"))) pwm seqLogo(t(t(pwm) * 1/colSums(pwm))) ``` #### Plot with `ggseqlogo` The `ggseqlogo` package ([manual](https://omarwagih.github.io/ggseqlogo/)) provides many customization options for plotting sequence logos. It also supports various alphabets including sequence logos for amino acid sequences. ```{r pwm_logo2, eval=TRUE} library(ggplot2); library(ggseqlogo) pwm <- PWM(DNAStringSet(c("GCT", "GGT", "GCA"))) ggseqlogo(pwm) ``` Search sequence for PWM matches with score better than `min.score` ```{r pwm_search, eval=TRUE} chr <- DNAString("AAAGCTAAAGGTAAAGCAAAA") matchPWM(pwm, chr, min.score=0.9) ``` ## NGS Sequences ### Sequence and Quality Data: FASTQ Format Four lines per sequence: 1. ID 2. Sequence 3. ID 4. Base call qualities (Phred scores) as ASCII characters The following gives an example of 3 Illumina reads in a FASTQ file. The numbers at the beginning of each line are not part of the FASTQ format. They have been added solely for illustration purposes. ``` 1. @SRR038845.3 HWI-EAS038:6:1:0:1938 length=36 2. CAACGAGTTCACACCTTGGCCGACAGGCCCGGGTAA 3. +SRR038845.3 HWI-EAS038:6:1:0:1938 length=36 4. BA@7>B=>:>>7@7@>>9=BAA?;>52;>:9=8.=A 1. @SRR038845.41 HWI-EAS038:6:1:0:1474 length=36 2. CCAATGATTTTTTTCCGTGTTTCAGAATACGGTTAA 3. +SRR038845.41 HWI-EAS038:6:1:0:1474 length=36 4. BCCBA@BB@BBBBAB@B9B@=BABA@A:@693:@B= 1. @SRR038845.53 HWI-EAS038:6:1:1:360 length=36 2. GTTCAAAAAGAACTAAATTGTGTCAATAGAAAACTC 3. +SRR038845.53 HWI-EAS038:6:1:1:360 length=36 4. BBCBBBBBB@@BAB?BBBBCBC>BBBAA8>BBBAA@ ``` ### Sequence and Quality Data: `QualityScaleXStringSet` Phred quality scores are integers from 0-50 that are stored as ASCII characters after adding 33. The basic R functions `rawToChar` and `charToRaw` can be used to interconvert among their representations. Phred score interconversion ```{r raw_to_char, eval=TRUE} phred <- 1:9 phreda <- paste(sapply(as.raw((phred)+33), rawToChar), collapse="") phreda as.integer(charToRaw(phreda))-33 ``` Construct `QualityScaledDNAStringSet` from scratch ```{r raw_to_char2, eval=TRUE} dset <- DNAStringSet(sapply(1:100, function(x) paste(sample(c("A","T","G","C"), 20, replace=T), collapse=""))) # Creates random sample sequence. myqlist <- lapply(1:100, function(x) sample(1:40, 20, replace=T)) # Creates random Phred score list. myqual <- sapply(myqlist, function(x) toString(PhredQuality(x))) # Converts integer scores into ASCII characters. myqual <- PhredQuality(myqual) # Converts to a PhredQuality object. dsetq1 <- QualityScaledDNAStringSet(dset, myqual) # Combines DNAStringSet and quality data in QualityScaledDNAStringSet object. dsetq1[1:2] ``` ### Processing FASTQ Files with ShortRead The following explains the basic usage of `ShortReadQ` objects. To make the sample code work, download and unzip this [file](http://cluster.hpcc.ucr.edu/~tgirke/HTML_Presentations/Manuals/Workshop_Dec_6_10_2012/Rsequences/data.zip) to your current working directory. The following code performs the download for you. ```{r read_fastq1, eval=TRUE, message=FALSE, warnings=FALSE} library(ShortRead) download.file("http://cluster.hpcc.ucr.edu/~tgirke/HTML_Presentations/Manuals/testdata/samplefastq/data.zip", "data.zip") unzip("data.zip") ``` Important utilities for accessing FASTQ files ```{r read_fastq2, eval=TRUE, message=FALSE, warnings=FALSE} fastq <- list.files("data", "*.fastq$"); fastq <- paste("data/", fastq, sep="") names(fastq) <- paste("flowcell6_lane", 1:length(fastq), sep="_") (fq <- readFastq(fastq[1])) # Imports first FASTQ file countLines(dirPath="./data", pattern=".fastq$")/4 # Counts numbers of reads in FASTQ files id(fq)[1] # Returns ID field sread(fq)[1] # Returns sequence quality(fq)[1] # Returns Phred scores as(quality(fq), "matrix")[1:4,1:12] # Coerces Phred scores to numeric matrix ShortReadQ(sread=sread(fq), quality=quality(fq), id=id(fq)) # Constructs a ShortReadQ from components ``` ### FASTQ Quality Reports #### Using `systemPipeR` The following `seeFastq` and `seeFastqPlot` functions generate and plot a series of useful quality statistics for a set of FASTQ files. ```{r see_fastq, eval=TRUE, message=FALSE, warnings=FALSE, fig.height=12, fig.width=14} library(systemPipeR) fqlist <- seeFastq(fastq=fastq, batchsize=800, klength=8) # For real data set batchsize to at least 10^5 seeFastqPlot(fqlist) ``` Handles many samples in one PDF file. For more details see [here](http://bioconductor.org/packages/devel/bioc/vignettes/systemPipeR/inst/doc/systemPipeR.html) #### Using `ShortRead` The `ShortRead` package contains several FASTQ quality reporting functions. ```{r shortread_fastq_qc, eval=FALSE} sp <- SolexaPath(system.file('extdata', package='ShortRead')) fl <- file.path(analysisPath(sp), "s_1_sequence.txt") fls <- c(fl, fl) coll <- QACollate(QAFastqSource(fls), QAReadQuality(), QAAdapterContamination(), QANucleotideUse(), QAQualityUse(), QASequenceUse(), QAFrequentSequence(n=10), QANucleotideByCycle(), QAQualityByCycle()) x <- qa2(coll, verbose=TRUE) res <- report(x) if(interactive()) browseURL(res) ``` ### Filtering and Trimming FASTQ Files with ShortRead #### Adaptor trimming ```{r adaptor_trimming, eval=TRUE} fqtrim <- trimLRPatterns(Rpattern="GCCCGGGTAA", subject=fq) sread(fq)[1:2] # Before trimming sread(fqtrim)[1:2] # After trimming ``` #### Read counting and duplicate removal ```{r read_counting, eval=TRUE} tables(fq)$distribution # Counts read occurences sum(srduplicated(fq)) # Identifies duplicated reads fq[!srduplicated(fq)] ``` #### Trimming low quality tails ```{r trim_tails, eval=TRUE} cutoff <- 30 cutoff <- rawToChar(as.raw(cutoff+33)) sread(trimTails(fq, k=2, a=cutoff, successive=FALSE))[1:2] ``` #### Removal of reads with Phred scores below a threshold value ```{r remove_low_quality_reads, eval=TRUE} cutoff <- 30 qcount <- rowSums(as(quality(fq), "matrix") <= 20) fq[qcount == 0] # Number of reads where all Phred scores >= 20 ``` #### Removal of reads with x Ns and/or low complexity segments ```{r remove_N_reads, eval=TRUE} filter1 <- nFilter(threshold=1) # Keeps only reads without Ns filter2 <- polynFilter(threshold=20, nuc=c("A","T","G","C")) # Removes reads with nucleotide bias, >=20 of any base filter <- compose(filter1, filter2) fq[filter(fq)] ``` ### Memory Efficient FASTQ Processing Streaming through FASTQ files with `FastqStreamer` and random sampling reads with `FastqSampler` ```{r stream_fastq1, eval=TRUE} fq <- yield(FastqStreamer(fastq[1], 50)) # Imports first 50 reads fq <- yield(FastqSampler(fastq[1], 50)) # Random samples 50 reads ``` Streaming through a FASTQ file while applying filtering/trimming functions and writing the results to a new file here `SRR038845.fastq_sub` in `data` directory. ```{r stream_fastq2, eval=TRUE, message=FALSE, warnings=FALSE} f <- FastqStreamer(fastq[1], 50) while(length(fq <- yield(f))) { fqsub <- fq[grepl("^TT", sread(fq))] writeFastq(fqsub, paste(fastq[1], "sub", sep="_"), mode="a", compress=FALSE) } close(f) ``` ## Range Operations ### Important Data Objects for Range Operations * `IRanges`: stores range data only (IRanges library) * `GRanges`: stores ranges and annotations (GenomicRanges library) * `GRangesList`: list version of GRanges container (GenomicRanges library) ### Range Data Are Stored in `IRanges` and `GRanges` Containers #### Construct `GRanges` Object ```{r genomicranges1, eval=TRUE, message=FALSE, warnings=FALSE} library(GenomicRanges); library(rtracklayer) gr <- GRanges(seqnames = Rle(c("chr1", "chr2", "chr1", "chr3"), c(1, 3, 2, 4)), ranges = IRanges(1:10, end = 7:16, names = head(letters, 10)), strand = Rle(strand(c("-", "+", "*", "+", "-")), c(1, 2, 2, 3, 2)), score = 1:10, GC = seq(1, 0, length = 10)) # Example of creating a GRanges object with its constructor function. ``` #### Import GFF into `GRanges` Object ```{r genomicranges2, eval=TRUE, message=FALSE, warnings=FALSE} gff <- import.gff("http://cluster.hpcc.ucr.edu/~tgirke/Documents/R_BioCond/Samples/gff3.gff") # Imports a simplified GFF3 genome annotation file. seqlengths(gff) <- end(ranges(gff[which(values(gff)[,"type"]=="chromosome"),])) names(gff) <- 1:length(gff) # Assigns names to corresponding slot gff[1:4,] seqinfo(gff) ``` #### Coerce `GRanges` object to `data.frame` ```{r genomicranges3, eval=TRUE, message=FALSE, warnings=FALSE} as.data.frame(gff)[1:4, 1:7] ``` ### Utilities for Range Containers #### Accessor and subsetting methods for GRanges objects Subsetting and replacement ```{r genomicranges_access2, eval=TRUE} gff[1:4] gff[1:4, c("type", "ID")] gff[2] <- gff[3] ``` GRanges objects can be concatenated with the `c` function ```{r genomicranges_access3, eval=TRUE} c(gff[1:2], gff[401:402]) ``` Acessor functions ```{r genomicranges_access4, eval=TRUE} seqnames(gff) ranges(gff) strand(gff) seqlengths(gff) start(gff[1:4]) end(gff[1:4]) width(gff[1:4]) ``` Accessing metadata component ```{r genomicranges_access5, eval=TRUE} values(gff) # or elementMetadata(gff) values(gff)[, "type"][1:20] gff[values(gff)[ ,"type"] == "gene"] ``` #### Useful utilities for GRanges objects Remove chromosome ranges ```{r genomicranges_utilities1, eval=TRUE} gff <- gff[values(gff)$type != "chromosome"] ``` Erase the strand information ```{r genomicranges_utilities2, eval=TRUE} strand(gff) <- "*" ``` Collapses overlapping ranges to continuous ranges. ```{r genomicranges_utilities3, eval=TRUE} reduce(gff) ``` Return uncovered regions ```{r genomicranges_utilities4, eval=TRUE} gaps(gff) ``` More intuitive way to get uncovered regions ```{r genomicranges_utilities4b, eval=TRUE} setdiff(as(seqinfo(gff), "GRanges"), gff) ``` Return disjoint ranges ```{r genomicranges_utilities5, eval=TRUE} disjoin(gff) ``` Returns coverage of ranges ```{r genomicranges_utilities6, eval=TRUE} coverage(gff) ``` Return the index pairings for overlapping ranges ```{r genomicranges_utilities7, eval=TRUE} findOverlaps(gff, gff[1:4]) ``` Counts overlapping ranges ```{r genomicranges_utilities8, eval=TRUE} countOverlaps(gff, gff[1:4])[1:40] ``` Return only overlapping ranges ```{r genomicranges_utilities9, eval=TRUE} subsetByOverlaps(gff, gff[1:4]) ``` ### GRangesList Objects ```{r genomicrangeslist_objects, eval=TRUE} sp <- split(gff, seq(along=gff)) # Stores every range in separate component of a GRangesList object split(gff, seqnames(gff)) # Stores ranges of each chromosome in separate component. unlist(sp) # Returns data as GRanges object sp[1:4, "type"] # Subsetting of GRangesList objects is similar to GRanges objects. lapply(sp[1:4], length) # Looping over GRangesList objects similar to lists ``` ## Transcript Ranges Storing annotation ranges in `TranscriptDb` databases makes many operations more robust and convenient. ```{r txdb_objects1, eval=TRUE, message=FALSE, warnings=FALSE} library(GenomicFeatures) download.file("http://cluster.hpcc.ucr.edu/~tgirke/Documents/R_BioCond/Samples/gff3.gff", "data/gff3.gff") txdb <- makeTxDbFromGFF(file="data/gff3.gff", format="gff", dataSource="TAIR", organism="Arabidopsis thaliana") saveDb(txdb, file="./data/TAIR10.sqlite") txdb <- loadDb("./data/TAIR10.sqlite") transcripts(txdb) transcriptsBy(txdb, by = "gene") exonsBy(txdb, by = "gene") ``` ### `txdb` from BioMart Alternative sources for creating `txdb` databases are BioMart, Bioc annotation packages, UCSC, etc. The following shows how to create a `txdb` from BioMart. ```{r txdb_objects1_biomart, eval=FALSE, message=FALSE, warnings=FALSE} library(GenomicFeatures); library("biomaRt") txdb <- makeTxDbFromBiomart(biomart = "plants_mart", dataset = "athaliana_eg_gene", host="plants.ensembl.org") ``` The following steps are useful to find out what is availble in BioMart. ```{r biomart_basics, eval=FALSE, message=FALSE, warnings=FALSE} listMarts() # Lists BioMart databases listMarts(host="plants.ensembl.org") mymart <- useMart("plants_mart", host="plants.ensembl.org") # Select one, here plants_mart listDatasets(mymart) # List datasets available in the selected BioMart database mymart <- useMart("plants_mart", dataset="athaliana_eg_gene", host="plants.ensembl.org") listAttributes(mymart) # List available features getBM(attributes=c("ensembl_gene_id", "description"), mart=mymart)[1:4,] ``` ### Efficient Sequence Parsing ### `getSeq` The following parses all annotation ranges provided by a `GRanges` object (e.g. `gff`) from a genome sequence stored in a local file. ```{r getseq_gff, eval=TRUE, message=FALSE, warnings=FALSE} gff <- gff[values(gff)$type != "chromosome"] # Remove chromosome ranges rand <- DNAStringSet(sapply(unique(as.character(seqnames(gff))), function(x) paste(sample(c("A","T","G","C"), 200000, replace=T), collapse=""))) writeXStringSet(DNAStringSet(rand), "./data/test") getSeq(FaFile("./data/test"), gff) ``` #### `extractTranscriptSeqs` Sequences composed of several ranges, such as transcripts (or CDSs) with several exons, can be parsed with `extractTranscriptSeqs`. Note: the following expects the genome sequence in a file with path `data/test` and a valid `txdb` defining the ranges for that genome. ```{r extractTranscritpSeqs, eval=TRUE, message=FALSE, warnings=FALSE} library(GenomicFeatures); library(Biostrings); library(Rsamtools) txdb <- loadDb("./data/TAIR10.sqlite") indexFa("data/test") # Creates index for genome fasta txseq <- extractTranscriptSeqs(FaFile("data/test"), txdb, use.names=TRUE) txseq ``` ## Homework 6 See [here](https://girke.bioinformatics.ucr.edu/GEN242/assignments/homework/hw06/hw06/). ## Session Info ```{r sessionInfo, eval=TRUE} sessionInfo() ``` ## References