--- title: "05: _Bioconductor_ Annotation Resources" author: - name: Martin Morgan affiliation: Roswell Park Comprehensive Cancer Center date: "`r format(Sys.time(), '%B %d, %Y')`" vignette: > %\VignetteIndexEntry{05: Bioconductor Annotation Resources} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} output: BiocStyle::html_document: number_sections: yes toc: true --- ```{r style, echo = FALSE, results = 'asis'} knitr::opts_chunk$set( eval=as.logical(Sys.getenv("KNITR_EVAL", "TRUE")), cache=as.logical(Sys.getenv("KNITR_CACHE", "TRUE")) ) options(width = 75) ``` # Mapping between symbols ## [org.Hs.eg.db][] The `org` packages contain information to map between different symbols. Here check for available `org` packages. ```{r} BiocManager::available("^org\\.") ``` The regular expression `"^org\\.")` insists that the package names starts with `org` (`"^org"`) followed by a literal period rather than a wild-card representing any letter (`"\\."`). In addition to these packages, many `org` resources are available from [AnnotationHub][], described below ```{r, message = FALSE} library(AnnotationHub) ``` ```{r} query(AnnotationHub(), "^org\\.") ``` The naming convention of `org` objects uses a two-letter code to represent species, e.g., `Hs` is _Homo sapiens_ followed by the _central_ identifier used to map to and from other symbols; for `org.Hs.eg.db`, the central identifier is the Entrez gene identifier, and to map from, say HGNC Symbol to Ensembl identifier, a map must exist between the gene symbol and the Entrez identifier, and then from the Entrez identifier to the Ensembl identifier. Many additional `org` packages are available on [AnnotationHub][], as mentioned briefly below. ```{r, message = FALSE} library(org.Hs.eg.db) ``` We can discover available `keytypes()` for querying the database, and `columns()` to map to, e.g., ```{r} head(keys(org.Hs.eg.db)) ``` Here are a handful of ENTREZID keys ```{r} eid <- sample(keys(org.Hs.eg.db), 10) ``` Two main functions are `select()` and `mapIds()`. `mapIds()` is more focused. It guarantees a one-to-one mapping between keys a single selected column. By defaul, if a key maps to multiple values, then the 'first' value returned by the database is used. The return value is a named vector; the 1:1 mapping between query and return value makes this function particularly useful in pipelines where a single mapping must occur. ```{r} mapIds(org.Hs.eg.db, eid, "SYMBOL", "ENTREZID") ``` `select()` is more general, returning a data.frame of keys, plus one or more columns. If a key maps to multiple values, then multiple rows are returned. ```{r} map <- select(org.Hs.eg.db, eid, c("SYMBOL", "GO"), "ENTREZID") dim(map) head(map) ``` ## [GO.db][] [GO.db][] ```{r, message = FALSE} library(GO.db) ``` # Transcript annotations ## [TxDb.Hsapiens.UCSC.hg38.knownGene][] `TxDb` packages contain information about gene models (exon, gene, transcript coordinates). There are a number of `TxDb` packages available to install ```{r, message=FALSE} library(dplyr) # for `%>%` ``` ```{r} BiocManager::available("^TxDb") %>% tibble::enframe(name = NULL) ``` and to download from [AnnotationHub][] ```{r} query(AnnotationHub(), "^TxDb\\.") ``` Here we load the `TxDb` object containing gene models for _Homo sapiens_ using annotations provided by UCSC for the hg38 genome build, using the `knownGene` annotation track. ```{r, message = FALSE} library(TxDb.Hsapiens.UCSC.hg38.knownGene) ``` ## `exons()`, `transcripts()`, `genes()` The coordinates of annotated exons can be extracted as a `GRanges` object ```{r} exons(TxDb.Hsapiens.UCSC.hg38.knownGene) ``` Additional information is also present in the database, for instance the GENEID (Entrez gene id for these TxDb) ```{r} ex <- exons(TxDb.Hsapiens.UCSC.hg38.knownGene, columns = "GENEID") ex ``` Note that the object reports "595 sequences"; this is because the exons include both standard chromosomes and partially assembled contigs. Use `keepStandardChromosomes()` to update the object to contain only exons found on the 'standard' chromomes; the `pruning.mode=` argument determines whether sequence names that are 'in use' (have exons associated with them) can be dropped. ```{r} std_ex <- keepStandardChromosomes(ex, pruning.mode="coarse") std_ex ``` It is then possible to ask all sorts of question, e.g., the number of exons on each chromosome ```{r} table(seqnames(std_ex)) ``` or the identity the exons with more than 10000 nucleotides. ```{r} std_ex[width(std_ex) > 10000] ``` and of course more scientifically relevant questions. ## `exonsBy()`, `transcriptsBy()`, etc ## [ensembldb][] The [ensembldb][] package provides access to similar, but more rich, information from Ensembl, with most data resources available via [AnnotationHub][]; the AnnotationHub query asks for records that include both `EnsDb` and a particular Ensembl release. ```{r, message = FALSE} library(ensembldb) ``` ```{r} query(AnnotationHub(), c("^EnsDb\\.", "Ensembl 96")) ``` # Accessing online resources ## [biomaRt][] ```{r, message = FALSE} library(biomaRt) ``` Visit the [biomart website][] and figure out how to browse data to retrieve, e.g., genes on chromosomes 21 and 22. You'll need to browse to the ensembl mart, _Homo spaiens_ data set, establish filters for chromosomes 21 and 22, and then specify that you'd like the Ensembl gene id attribute returned. Now do the same process in [biomaRt][]: ```{r biomart, eval=FALSE} library(biomaRt) head(listMarts(), 3) ## list marts head(listDatasets(useMart("ensembl")), 3) ## mart datasets ensembl <- ## fully specified mart useMart("ensembl", dataset = "hsapiens_gene_ensembl") head(listFilters(ensembl), 3) ## filters myFilter <- "chromosome_name" substr(filterOptions(myFilter, ensembl), 1, 50) ## return values myValues <- c("21", "22") head(listAttributes(ensembl), 3) ## attributes myAttributes <- c("ensembl_gene_id","chromosome_name") ## assemble and query the mart res <- getBM(attributes = myAttributes, filters = myFilter, values = myValues, mart = ensembl) ``` ## [KEGGREST][] ```{r, message = FALSE} library(KEGGREST) ``` ## [AnnotationHub][] [AnnotationHub][] provides a resource of annotations that are available without requiring an annotation package. ```{r, message = FALSE} library(AnnotationHub) ah <- AnnotationHub() ``` One example of such annotations are `org`-style data resources for less-model organisms. Discover available resources using the flexible `query()` command. ```{r} query(ah, "^org\\.") ``` Find out more about a particular resource using `[` to select just that resource, or use `mcols()` on a subset of resources. identifier, e.g., ```{r} ah["AH70563"] ``` Retrieve and use a resource by using `[[` with the corresponding ```{r} org <- ah[["AH70563"]] org ``` Determine the central key, and the columns that can be mapped between ```{r} chooseCentralOrgPkgSymbol(org) columns(org) ``` Here are some Entrez identifiers, and their corresponding symbols for _Anopheles gambiae_, either allowing for 1:many maps (`select()`) or enforcing 1:1 maps. We use `AnnotationDbi::select()` to disambiguate between the `select()` generic defined in `AnnotationDbi` and the `select()` generic defined in `dplyr`: theses methods have incompatible signatures and 'contracts', and so must be invoked in a way that resolves our intention explicitly. ```{r, message = FALSE} library(dplyr) # for `%>%` ``` ```{r} eid <- head(keys(org)) AnnotationDbi::select(org, eid, "SYMBOL", "ENTREZID") eid %>% mapIds(x = org, "SYMBOL", "ENTREZID") %>% tibble::enframe("ENTREZID", "SYMBOL") ``` ## [ExperimentHub][] [ExperimentHub][] is analogous to [AnnotationHub][], but contains curated experimental results. Increasingly, [ExperimentHub][] packages are provided to document and ease access to these resources. A great example of an [ExperimentHub][] package is [curatedTCGAData][]. ```{r, message = FALSE} library(ExperimentHub) library(curatedTCGAData) ``` The [curatedTCGAData][] package provides an interface to a collection of resources available through ExperimentHub. The interface is straigth-forward. Use `curatedTCGAData()` to discover available types of data, choosing assay types after identifying cancer types. ```{r} curatedTCGAData() curatedTCGAData("BRCA") curatedTCGAData("BRCA", c("RNASeqGene", "CNVSNP")) ``` Adding `dry.run = FALSE` triggers the actual download (first time only) of the data from ExperimentHub, and presentation to the user as a `MultiAssayExperiment`. ```{r, message = FALSE} mae <- curatedTCGAData("BRCA", c("RNASeqGene", "CNVSNP"), dry.run=FALSE) mae ``` It is then easy to work with these data, via individual assays or in a more integrative analysis. For example, the distribution of library sizes in the RNASeq data can be visualized with. ```{r} mae[["BRCA_RNASeqGene-20160128"]] %>% assay() %>% colSums() %>% density() %>% plot(main = "TCGA BRCA RNASeq Library Size") ``` # Annotating variants ## [VariantAnnotation][] ```{r, message = FALSE} library(VariantAnnotation) ``` ## [ensemblVEP][] ```{r, message = FALSE} library(ensemblVEP) ``` # Provenance ```{r} sessionInfo() ``` [org.Hs.eg.db]: https://bioconductor.org/packages/org.Hs.eg.db [GO.db]: https://bioconductor.org/packages/GO.db [GSEABase]: https://bioconductor.org/packages/GSEABase [TxDb.Hsapiens.UCSC.hg38.knownGene]: https://bioconductor.org/packages/TxDb.Hsapiens.UCSC.hg38.knownGene [biomaRt]: https://bioconductor.org/packages/biomaRt [KEGGREST]: https://bioconductor.org/packages/KEGGREST [VariantAnnotation]: https://bioconductor.org/packages/VariantAnnotation [ensemblVEP]: https://bioconductor.org/packages/ensemblVEP [ensembldb]: https:/bioconductor.org/packages/ensembldb [AnnotationHub]: https:/bioconductor.org/packages/AnnotationHub [ExperimentHub]: https:/bioconductor.org/packages/ExperimentHub [biomart website]: http://biomart.org