--- title: "genefusions.md" output: html_document --- ```{r, echo=FALSE} opts_chunk$set(fig.path='module4_visualization_'); ``` # Gene fusion visualization ## Installation The required libraries are in bioconductor and can be obtained using the `biocLite` function. You may not require this code block if you have already installed the libraries. ```{r} #source("http://bioconductor.org/biocLite.R") #biocLite("GenomicRanges") #biocLite("ggbio") #biocLite("Gviz") ``` Load the required libraries. ```{r} library(ggbio) library(Gviz) library(GenomicRanges) library(knitr) ``` ## Read the data We will be working with predictions from deFuse. The output of deFuse is a tab delimited text file. This can be read with the `read.table` command. ```{r} results_url = "http://cbwmain.dyndns.info/Module4/packages/cbw_tutorial/gene_fusions/data/HCC1395/defuse/results.filtered.tsv" data = read.table(results_url, sep="\t", header=T, stringsAsFactors=F) ``` Use the `colnames` function to take a look at the data provided by deFuse. ```{r} colnames(data) ``` Many features and annotations are provided. We will primarily be intersted in the chromosome, strand, start and end annotations, and `cluster_id` which is the unique identifier assigned to each prediction. We will concentrate on the EIF3K-CYP39A1 fusion. Filter the data for this fusion. ```{r} gene_names = c("EIF3K", "CYP39A1") gene_names fusion = data[data$gene_name1 %in% gene_names & data$gene_name2 %in% gene_names,][2,] fusion$splitr_sequence <- NULL fusion ``` The `GViz` package works best with UCSC style chromosome names. Convert the `gene_chromosome1` and `gene_chromosome2` fields. ```{r} fusion$gene_chromosome1 = paste("chr", fusion$gene_chromosome1, sep="") fusion$gene_chromosome2 = paste("chr", fusion$gene_chromosome2, sep="") fusion$gene_chromosome1 fusion$gene_chromosome2 ``` Extract the chromosome and strand of the the fusion. Here the strand is referring to the strand of the genome, which is different than the strand on a gene which itself may be on the "-" strand of the genome. ```{r} chromosome1 = fusion$gene_chromosome1 strand1 = fusion$genomic_strand1 ``` Defuse predicts a fusion contig based on spanning and split RNA-Seq reads. The contig is a set of (potentially multiple) exons adjacent to the fusion boundary. The starts and ends of these exons are encoded in the `genomic_starts1/2` and `genomic_ends1/2` fields as comma separated lists. ```{r} starts1 = as.numeric(strsplit(fusion$genomic_starts1, ",")[[1]]) ends1 = as.numeric(strsplit(fusion$genomic_ends1, ",")[[1]]) starts1 ends1 ``` Many plotting funcitons in R take `GRanges` (`GenomicRanges` package) objects containing information about the genomic regions to be plotted. We will encode the fusion exons as `GRanges` objects. ```{r} fusionexons1 <- GRanges( seqnames = chromosome1, ranges = IRanges(start = starts1, end = ends1), strand = strand1, group = rep("fusionexons", length(starts1))) ``` ## Plot a fusion using Gviz We can plot the fusion exons in `Gviz` by creating a gene region track using the `GRanges` to define the exons and providing som additional information. Plot the track using the `plotTracks` function. ```{r, fusion_exons, fig.cap="fusion exons"} fusionTrack <- AnnotationTrack(fusionexons1, genome = "hg19", name = "fusion", groupAnnotation = "group", shape = "box", stacking = "dense") plotTracks(list(fusionTrack)) ``` The plot shows the exons of the fusion, though in isolation the exons are not very interesting. Add an ideogram with `IdeogramTrack`, genomic location using `GenomeAxisTrack`, and ensembl gene models using `BiomartGeneRegionTrack`. ```{r, fusion_exons_with_annot, fig.cap="fusion exons with annotations"} plot.start = min(starts1) - 2000 plot.end = max(ends1) + 5000 itrack <- IdeogramTrack(genome = "hg19", chromosome = chromosome1) gtrack <- GenomeAxisTrack(genome = "hg19", chromosome = chromosome1) biomTrack <- BiomartGeneRegionTrack(genome = "hg19", chromosome = chromosome1, start = plot.start, end = plot.end, name = "ENSEMBL") plotTracks(list(itrack, gtrack, biomTrack, fusionTrack), from = plot.start, to = plot.end) ``` It is also possible to plot read alignments and coverage directly from a bam file using the AlignmentsTrack. The bam file should ideally have ucsc chromosome names to have the optimal compatibility with `GViz`. Set the `sizes` argument so that the alignments track doesnt take over the entire plot. ```{r, exons_anno_align, fig.cap="fusion exons with annotations and alignments"} bam_url = "http://cbwmain.dyndns.info/Module4/HCC1395/rnaseq/genes/HCC1395_EIF3K.bam" bai_url = "http://cbwmain.dyndns.info/Module4/HCC1395/rnaseq/genes/HCC1395_EIF3K.bam.bai" bam_file <- basename(bam_url) bai_file <- basename(bai_url) download.file(bam_url, bam_file) download.file(bai_url, bai_file) alTrack <- AlignmentsTrack(bam_file, isPaired = TRUE, type = "coverage") plotTracks(list(itrack, gtrack, alTrack, biomTrack, fusionTrack), chromosome = chromosome1, from = plot.start, to = plot.end, sizes = c(0.5, 0.5, 1, 3, 0.5)) ``` Try other _types_ of alignment plotting, by setting the `type` argument to "coverage", "sashimi" or "pileup". For sashimi plots, use `type = c("sashimi", "coverage")`. ### Visualizing the fusion break end We will visualize the fusion break end using a highlight track overlayed on the fusion track. Create the highlight track starting and ending at the fusion break end. Apply to the fusion track. ```{r} brkendTrack <- HighlightTrack(trackList = list(fusionTrack), chromosome = chromosome1, start = fusion$genomic_break_pos1, end = fusion$genomic_break_pos1, inBackground = FALSE, col = "darkred", fill = NA) ``` Plot the highlight track and gene model and ideogram tracks. > Note: The highlight should be plotted _instead_ of the track its highlighting. Plotting the highlight track will plot the fusion track for you. ```{r, exons_anno_align_end, fig.cap="fusion exons with annotations and alignments, with fusion end highlighted"} plotTracks(list(itrack, gtrack, alTrack, biomTrack, brkendTrack), chromosome = chromosome1, from = plot.start, to = plot.end, sizes = c(0.5, 0.5, 1, 3, 0.5)) ``` Plotting the fusion break end in the context of the gene models will now give us an idea for which intron of the gene harbours the associated genomic rearrangement breakpoint. ### Plot the other end of the fusion ```{r, exons_anno_align_end_2, fig.cap="fusion exons with annotations and alignments, with fusion end highlighted"} starts2 = as.numeric(strsplit(fusion$genomic_starts2, ",")[[1]]) ends2 = as.numeric(strsplit(fusion$genomic_ends2, ",")[[1]]) fusionexons2 <- GRanges( seqnames = fusion$gene_chromosome2, ranges = IRanges( start = starts2, end = ends2), strand = fusion$genomic_strand2, group = rep("fusionexons", length(starts2))) fusionTrack <- AnnotationTrack(fusionexons2, genome = "hg19", name = "fusion", shape = "box", stacking = "dense", groupAnnotation = "group") brkendTrack <- HighlightTrack(trackList = list(fusionTrack), chromosome = fusion$gene_chromosome2, start = fusion$genomic_break_pos2, end = fusion$genomic_break_pos2, inBackground = FALSE, col = "darkred", fill = NA) bam_url = "http://cbwmain.dyndns.info/Module4/HCC1395/rnaseq/genes/HCC1395_CYP39A1.bam" bai_url = "http://cbwmain.dyndns.info/Module4/HCC1395/rnaseq/genes/HCC1395_CYP39A1.bam.bai" bam_file <- basename(bam_url) bai_file <- basename(bai_url) download.file(bam_url, bam_file) download.file(bai_url, bai_file) alTrack <- AlignmentsTrack(bam_file, isPaired = TRUE, type = "coverage") plot.start = min(fusionexons2@ranges@start) - 2000 plot.end = max(fusionexons2@ranges@start + fusionexons2@ranges@width) + 5000 itrack <- IdeogramTrack(genome = "hg19", chromosome = fusion$gene_chromosome2) gtrack <- GenomeAxisTrack(genome = "hg19", chromosome = fusion$gene_chromosome2) biomTrack <- BiomartGeneRegionTrack(genome = "hg19", chromosome = fusion$gene_chromosome2, start = plot.start, end = plot.end, name = "ENSEMBL") p2 = plotTracks(list(itrack, gtrack, alTrack, biomTrack, brkendTrack), from = plot.start, to = plot.end, sizes = c(0.5, 0.5, 1, 3, 0.5)) ``` ## Plot a _circos_ of all fusions using ggbio We will plot the full fusion dataset as a _circos_ plot using the `ggbio` package. A small amount of setup is required. ### Reference genome preparation First obtain the human genome hg19 chromosome lengths from the `BSgenome.Hsapiens.UCSC.hg19` package. Create a `SeqInfo` object for the chromosome sequences of hg19. ```{r} library(BSgenome.Hsapiens.UCSC.hg19) genome <- Seqinfo( seqnames = seqnames(Hsapiens), seqlengths = seqlengths(Hsapiens), genome = "hg19" ) ``` We will be working with NCBI 1, 2, 3... named chromosomes rather than chr1, chr2, chr3... ```{r} seqlevelsStyle(genome) <- "NCBI" ``` Subset to the chromosomes we are interested in plotting. ```{r} chromosomes = as.character(c(seq(22), "X")) genome = genome[chromosomes] ``` We will also require the chromosomes represented as a `GRanges` object. ```{r} genome.ranges = GRanges( seqnames = seqnames(genome), strand = "*", ranges = IRanges(start = 1, width = seqlengths(genome)), seqinfo = genome ) genome.ranges ``` ### Plotting chromosomes in a circle Using the `GRanges` representation of the chromosome lengths, we can draw a basic plot of the chromosomes in a circle. ```{r, circos_chrom, fig.cap="circos of chromosomes"} p <- ggbio() + circle(genome.ranges, geom = "ideo", fill = "gray70") p ``` Additional tracks are added using the `+` operator. Add a scale and chromosome name. ```{r, circos_chrom_anno, fig.cap="circos of chromosomes with scales and names"} p <- ggbio() + circle(genome.ranges, geom = "ideo", fill = "gray70") + circle(genome.ranges, geom = "scale", size = 2) + circle(genome.ranges, geom = "text", aes(label = seqnames), vjust = -1, size = 4) p ``` We are now ready the plot arcs within and between chromosomes representing fusions between genes. Subset the data by the chromosomes we would like to plot. Also remove low probability events. ```{r} data = data[ data$gene_chromosome1 %in% as.character(chromosomes) & data$gene_chromosome2 %in% as.character(chromosomes) & data$probability > 0.75,] ``` ### Creating a `GRanges` object for your fusions Create a `GRanges` object for the fusions. To do this, we must first create a GRanges object for all the fusion ends together, then split that object into a `GrangesList` with two elements, one for fusion end 1 and another for fusion end 2. Finally, we add fusion end 2 to the meta data of fusion end 1. First step, create a `GRanges` object of all the data, using the `c()` function to concatenate chromosome, strand and position. Add the `mate` metadata field to represent the end of the fusion for each entry in the `GRanges` object. > Important: you must provide a `SeqInfo` object as `ggbio` requires information about the chromosome lengths ```{r} fusion_ends = GRanges( seqnames = c(data$gene_chromosome1, data$gene_chromosome2), strand = c(data$genomic_strand1, data$genomic_strand2), ranges = IRanges(start=c(data$genomic_break_pos1, data$genomic_break_pos2), width=1), seqinfo = genome, mate = c(rep(0, length(data$gene_chromosome1)), rep(1, length(data$gene_chromosome2)))) ``` Second step, split the `GRanges` object on the `mate` field to create a `GRangesList` with two elements, one containing the fusion ends in gene 1 and the other the fusion ends in gene 2. ```{r} fusion_ends_list <- split(fusion_ends, values(fusion_ends)$mate) ``` Third step, create a fusions `GRanges` object that is the gene 1 fusion ends linked to the gene 2 fusion ends. Create the `fusions` object for the first end, then add the second end as the `fusedto` metadata field. ```{r} fusions = fusion_ends_list[[1]] fusions$fusedto = fusion_ends_list[[2]] ``` ### Plotting links Add the rearrangements to the plot using the `link` geom argument. ```{r, circos_chrom_anno_links, fig.cap="circos of chromosomes with scales and names and fusion links"} p <- ggbio() + circle(fusions, geom = "link", linked.to = "fusedto") + circle(genome.ranges, geom = "ideo", fill = "gray70") + circle(genome.ranges, geom = "scale", size = 2) + circle(genome.ranges, geom = "text", aes(label = seqnames), vjust = -1, size = 4) p ``` We can color the links by any of the fusion attributes in the table. ```{r} colnames(data) ``` First add the `orf` attribute to the metadata of our `GRanges` object. ```{r} fusions$orf = data$orf ``` Use `aes` to color the fusions by whether or not they preserve the open reading frame of the fusion. ```{r, circos_chrom_anno_links_orf, fig.cap="circos of chromosomes with scales and names and fusion links colored by open reading frame"} p <- ggbio() + circle(fusions, geom = "link", linked.to = "fusedto", aes(color = orf)) + circle(genome.ranges, geom = "ideo", fill = "gray70") + circle(genome.ranges, geom = "scale", size = 2) + circle(genome.ranges, geom = "text", aes(label = seqnames), vjust = -1, size = 4) p ```