--- title: "Gene annotations for the human genome" output: html_notebook --- Update: 2019/09/15 ## 1. Introduction ### 1.1. What is gene annotation? Over the past years, we have learnt that there are a number of chromosomes and genes in our genome. Counting the number of chromosomes is fairly easy but students might find difficult to say how many genes we have in our genome. If you can get an answer for this, could you tell how many genes encode protein and how many do not? To answer this question, we need to access the database for gene annotation. Gene annotation is the process of making nucleotide sequence meaningful - where genes are located? whether it is protein-coding or noncoding. If you would like to get an overview of gene annotation, please find this [link](http://www.biolyse.ca/what-is-gene-annotation-in-bioinformatics/). One of well-known collaborative efforts in gene annotation is [the GENCODE consortium](https://www.gencodegenes.org/pages/gencode.html). It is a part of the Encyclopedia of DNA Elements (The ENCODE project consortium) and aims to identify all gene features in the human genome using a combination of computational analysis, manual annotation, and experimental validation ([Harrow et al. 2012](https://genome.cshlp.org/content/22/9/1760.full.html)). You might find another database for gene annotation, like RefSeq, CCDS, and need to understand differences between them. **Figure 1. Comparison of GENCODE and RefSeq gene annotation and the impact of reference geneset on variant effect prediction** ([Frankish et al. 2015](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3431492/)). A) Mean number of alternatively spliced transcripts per multi-exon protein-coding locus B) Mean number of unique CDS per multi-exon protein-coding locus C) Mean number of unique (non-redundant) exons per multi-exon protein-coding locus D) Percentage genomic coverage of unique (non-redundant) exons at multi-exon protein-coding loci. In this tutorial, we will access to gene annotation from the GENCODE consortium and explore genes and functional elements in our genome. ### 1.2. Aims What we will do with this dataset: - Be familiar with gene annotation modality. - Tidy data and create a table for your analysis. - Apply `tidyverse` functions for data munging. Please note that there is better solution for getting gene annotation in R if you use a biomart. Our tutorial is only designed to have a practice on `tidyverse` exercise. ## 2. Explore your data ### 2.1. Unboxing your dataset This tutorial will use a gene annotation file from the GENCODE. You will need to download the file from the GENCODE. If you are using terminal, please download file using `wget`: ```{bash} # Run from your terminal, not R console # wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_31/gencode.v31.basic.annotation.gtf.gz # Once you downloaded the file, you won't need to download it again. So please comment out the command above by adding # ``` Once you download the file, you can print out the first few lines using the following bash command (we will learn UNIX commands later): ```{bash} # Run from your terminal, not R console gzcat gencode.v31.basic.annotation.gtf.gz | head -7 ``` The file is [the GFT file format](https://www.ensembl.org/info/website/upload/gff.html), which you will find most commonly in gene annotation. Please read the file format thoroughly in the link above. For the tutorial, we need to load two packages. If the package is not installed in your system, please install it. - `tidyverse`, a package you have learnt from the chapter 5. - `readr`, a package provides a fast and friendly way to read. Since the file `gencode.v31.basic.annotation.gtf.gz` is pretty large, you will need some function to load data quickly into your workspace. ```{r message=F} library(tidyverse, quietly = T) library(readr, quietly = T) ``` Let's load the GTF file into your workspace. We will use `read_delim` function from the `readr` package. This is much faster loading than `read.delim` or `read.csv` from R base. However, some parameters and output class for `read_delim` are slightly different from them. ```{r} d = read_delim('gencode.v31.basic.annotation.gtf.gz', delim='\t', skip = 5, progress = F, col_names = F) ``` Can you find out what the parameters mean? Few things to note are: - The GTF file contains the first few lines for comments (`#`). In general, the file contains description, provider, date, format. - The GTF file does not have column names so you will need to assign `FALSE for col_names. Now you can check the class of the object `d`. ```{r} class(d) ``` Please note that this is a `tibble data frame`, not a data frame as you get in `read.csv`. The `readr` package is a part of `tidyverse` so it follows `tibble`, reading a delimited file (including csv & tsv) into a tibble ([link](https://readr.tidyverse.org/reference/read_delim.html)). Let's overview few lines from the tibble data frame, and explore what you get in this object. ```{r} head(d) ``` One thing you can find is that there is no columns in the data frame. Let's match which information is provided in columns. You can find the instruction page in the website ([link](https://www.gencodegenes.org/pages/data_format.html)). Based on this, you can assign a name for 9 columns. One thing to remember is you should not use space for the column name. Spacing in the column name is actually working but not a good habit for your code. So please replace a space with `underscore` in the column name. ```{r} # Assign column names according to the GENCODE instruction. cols = c('chrom', 'source', 'feature_type', 'start', 'end', 'score', 'strand', 'phase', 'info') ``` Now you can set up the column names into the `col_names` parameter, and load the file into a data frame. ```{r} d = read_delim('gencode.v31.basic.annotation.gtf.gz', delim='\t', skip = 5, progress = F, col_names = cols) ``` You can find the column names are now all set. ```{r echo=T} head(d) ``` When you loaded the file, you see the message about the data class. You might want to overview this data. ```{r} summary(d) ``` ### 2.2. How many feature types in the GENCODE dataset? As instructed in the GENCODE website, the GENCODE dataset provides a range of annotations for the feature type. You can check feature types using `____` function. ```{r} d %>% group_by(feature_type) %>% count(feature_type) # table(d$feature_type) ``` How many feature types provided in the GENCODE? And how many items stored for each feature type? Please write down the number of feature types from the dataset. Also, if you are not familiar with these types, it would be good to put one or two sentences that can describe each type). ### 2.3. How many genes we have? Let's count the number of genes in our genome. Since we know that the column `feature_type` contains rows with `gene`, which contains obviously annotations for genes. We might want to subset those rows from the data frame. ```{r} d1 = filter(d, feature_type == 'gene') # d1 = d[d$feature_type == 'gene', ] ``` ### 2.4. Ensembl, Havana and CCDS. Gene annotation for the human genome is provided by multiple organizations with different gene annotation methods and strategy. This means that information can be varying by resources, and users need to understand heterogeniety inherent in annotation databases. The GENCODE project utlizes two sources of gene annotation. 1. Havana: Manual gene annotation ([detailed strategy in here](https://asia.ensembl.org/info/genome/genebuild/manual_havana.html)) 2. Ensembl: Automatic gene annotation ([detailed strategy in here](https://asia.ensembl.org/info/genome/genebuild/automatic_coding.html)) It provides the combination of Ensembl/HAVANA gene set as the default gene annotation for the human genome. In addition, they also guarantee that all transcripts from the Consensus Coding Sequence (CCDS) set are present in the GENCODE gene set. The CCDS project is a collaborative effort to identify a core set of protein coding regions that are consistently annotated and of high quality. Initial results from the Consensus CDS (CCDS) project are now available through the appropriate Ensembl gene pages and from the CCDS project page at NCBI. The CCDS set is built by consensus among Ensembl, the National Center for Biotechnology Information (NCBI), and the HUGO Gene Nomenclature Committee (HGNC) for human ([link](https://asia.ensembl.org/info/genome/genebuild/ccds.html)). **Figure 2. Comparison of CCDS and Gencode** ([Source](https://twitter.com/ensembl/status/441959722376499200)). Right. Then now we count how many genes annotated with HAVANA and ENSEMBL. ```{r} d %>% group_by(source) %>% count(source) ``` ### 2.5. `do.call` Since the last column `info` contains a long string for multiple annotations, we will need to split it to extract each annotation. For example, the first line for transcript annotation looks like this: ```{markdown} chr1 HAVANA transcript 11869 14409 . + . gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "lncRNA"; transcript_name "DDX11L1-202"; level 2; transcript_support_level "1"; hgnc_id "HGNC:37102"; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1"; ``` If you would like to split `transcript_support_level` and create a new column, you can use `strsplit` function. ```{r} a = 'chr1 HAVANA transcript 11869 14409 . + . gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "lncRNA"; transcript_name "DDX11L1-202"; level 2; transcript_support_level "1"; hgnc_id "HGNC:37102"; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";' strsplit(a, 'transcript_support_level\\s+"') ``` After split the string, you can select the second item in the list (`[[1]][2]`). ```{r} strsplit(a, 'transcript_support_level\\s+"')[[1]][2] ``` You can find the `1` in the first position, which you will need to split again. ```{r} b = strsplit(a, 'transcript_support_level\\s+"')[[1]][2] strsplit(b, '\\"') ``` From this, you will get the first item in the list (`[[1]][1]`). Now you would like to apply `strsplit` function across vectors. For this, `do.call` function can be easily implemented to `strsplit` over the vectors from one column. Let's try this. ```{r} head(do.call(rbind.data.frame, strsplit(a, 'transcript_support_level\\s+"'))[[2]]) ``` Now you can write two lines of codes to process two steps we discussed above. ```{r} # First filter transcripts and create a data frame. d2 <- d %>% filter(feature_type == 'transcript') # Now apply the functions. d2$transcript_support_level <- as.character(do.call(rbind.data.frame, strsplit(d2$info, 'transcript_support_level\\s+"'))[[2]]) d2$transcript_support_level <- as.character(do.call(rbind.data.frame, strsplit( d2$transcript_support_level, '\\"'))[[1]]) ``` Now you can check the strsplit works. ```{r} head(d2$transcript_support_level) ``` You can use the same method to extract other annotations, like `gene_id`, `gene_name` etc. ## 3. Exercises Here I list the questions for group activity. You will need to pick up one session for three questions for your group. It will be your quiz on the next class. If you have done your session, you can of course go ahead and take other sessions for your practice. Please note that it is an exercise for `tidyverse` functions, which you will need to use in your code. In addition, you will need to write an one-line code for each question using pipe `%>%`. For questions, you should read some information thoroughly, including: - [Gene biotype](https://www.gencodegenes.org/pages/biotypes.html). - [0 or 1 based annotation in GTF, BED format](https://www.biostars.org/p/84686/) - [Why some features have 1 bp length?](https://www.biostars.org/p/61823/) - [What is the meaning of zero-length exons in GENCODE?](https://www.biostars.org/p/209854/) Also fun to have a review for [microexons](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5863539/) - [Transcript support level (TSL)](https://asia.ensembl.org/info/genome/genebuild/transcript_quality_tags.html#tsl) ### 3.1. Annotation of transcripts in our genome 1. Computes the number of transcripts per gene. What is the mean number of transcripts per gene? What is the quantile (25%, 50%, 75%) for these numbers? Which gene has the greatest number of transcript? 2. Compute the number of transcripts per gene among gene biotypes. For example, compare the number of transcript per gene between protein-coding genes, long noncoding genes, pseudogenes. 3. Final task is to compute the number of transcripts per gene per chromosome. ### 3.2. Gene length in the GENCODE 1. What is the average length of human genes? 2. Is the distribution of gene length differed by autosomal and sex chromosomes? Please calculate the quantiles (0%, 25%, 50%, 75%, 100%) of the gene length for each group. 3. Is the distribution of gene length differed by gene biotype? Please calculate the quantiles (0%, 25%, 50%, 75%, 100%) of the gene length for each group. ### 3.3. Transcript support levels (TSL) The GENCODE TSL provides a consistent method of evaluating the level of support that a GENCODE transcript annotation is actually expressed in humans. 1. With `transcript`, how many transcripts are categorized for each TSL? 2. From the first question, please count the number of transcript for each TSL by gene biotype. 3. From the first question, please count the number of transcript for each TSL by source. ### 3.4. CCDS in the GENCODE 1. With `gene`, please create a data frame with the columns - gene_id, gene_name, hgnc_id, gene_type, chromosome, start, end, and strand. Then, please create new columns for presence of hgnc and ccds. For example, you can put `1` in the column `isHgnc`, if hgnc annotation is avaiable, or `0` if not. Then, you can put `1` in the column `isCCDS`, if ccds annotation is avaiable, or `0` if not. 2. Please count the number of hgnc by gene biotypes. 3. Please count the number of hgnc by `level`. Please note that level in this question is not `TSL`. Please find information in this [link](https://www.gencodegenes.org/pages/data_format.html): 1 (verified loci), 2 (manually annotated loci), 3 (automatically annotated loci). ### 3.5. Transcripts in the GENCODE 1. Which gene has the largest number of transcripts? 2. Please calculate the quantiles (0%, 25%, 50%, 75%, 100%) of the gene length for protein coding genes and long noncoding genes. 3. Please count the number of transcripts by chromosomes. ### 3.6. Autosomal vs. Sex chromosomes. 1. Please calculate the number of genes per chromosome. 2. Please compare the number of genes between autosomal and sex chromosome (Mean, Median). 3. Please divide the genes into groups ‘protein coding’ and ‘long noncoding’, and then compare the number of genes in each chromosomes within groups.