---
title: "genefusions.md"
output: html_document
---

```{r, echo=FALSE}
opts_chunk$set(fig.path='module4_visualization_');
```

# Gene fusion visualization

## Installation

The required libraries are in bioconductor and can be obtained using the `biocLite` function.  You may not require this code block if you have already installed the libraries.

```{r}
#source("http://bioconductor.org/biocLite.R")

#biocLite("GenomicRanges")
#biocLite("ggbio")
#biocLite("Gviz")
```

Load the required libraries.

```{r}
library(ggbio)
library(Gviz)
library(GenomicRanges)
library(knitr)
```

## Read the data

We will be working with predictions from deFuse.  The output of deFuse is a tab delimited text file.  This can be read with the `read.table` command.

```{r}
results_url = "http://cbwmain.dyndns.info/Module4/packages/cbw_tutorial/gene_fusions/data/HCC1395/defuse/results.filtered.tsv"
data = read.table(results_url, sep="\t", header=T, stringsAsFactors=F)
```

Use the `colnames` function to take a look at the data provided by deFuse.

```{r}
colnames(data)
```

Many features and annotations are provided.  We will primarily be intersted in the chromosome, strand, start and end annotations, and `cluster_id` which is the unique identifier assigned to each prediction.

We will concentrate on the EIF3K-CYP39A1 fusion.  Filter the data for this fusion.

```{r}
gene_names = c("EIF3K", "CYP39A1")
gene_names

fusion = data[data$gene_name1 %in% gene_names & data$gene_name2 %in% gene_names,][2,]
fusion$splitr_sequence <- NULL
fusion
```

The `GViz` package works best with UCSC style chromosome names.  Convert the `gene_chromosome1` and `gene_chromosome2` fields.

```{r}
fusion$gene_chromosome1 = paste("chr", fusion$gene_chromosome1, sep="")
fusion$gene_chromosome2 = paste("chr", fusion$gene_chromosome2, sep="")
fusion$gene_chromosome1
fusion$gene_chromosome2
```

Extract the chromosome and strand of the the fusion.  Here the strand is referring to the strand of the genome, which is different than the strand on a gene which itself may be on the "-" strand of the genome.

```{r}
chromosome1 = fusion$gene_chromosome1
strand1 = fusion$genomic_strand1
```

Defuse predicts a fusion contig based on spanning and split RNA-Seq reads.  The contig is a set of (potentially multiple) exons adjacent to the fusion boundary.  The starts and ends of these exons are encoded in the `genomic_starts1/2` and `genomic_ends1/2` fields as comma separated lists.

```{r}
starts1 = as.numeric(strsplit(fusion$genomic_starts1, ",")[[1]])
ends1 = as.numeric(strsplit(fusion$genomic_ends1, ",")[[1]])
starts1
ends1
```

Many plotting funcitons in R take `GRanges` (`GenomicRanges` package) objects containing information about the genomic regions to be plotted.  We will encode the fusion exons as `GRanges` objects.

```{r}
fusionexons1 <- GRanges(
  seqnames = chromosome1,
  ranges = IRanges(start = starts1, end = ends1),
  strand = strand1,
  group = rep("fusionexons", length(starts1)))
```

## Plot a fusion using Gviz

We can plot the fusion exons in `Gviz` by creating a gene region track using the `GRanges` to define the exons and providing som additional information.  Plot the track using the `plotTracks` function.

```{r, fusion_exons, fig.cap="fusion exons"}
fusionTrack <- AnnotationTrack(fusionexons1, genome = "hg19", name = "fusion",
  groupAnnotation = "group", shape = "box", stacking = "dense")

plotTracks(list(fusionTrack))
```

The plot shows the exons of the fusion, though in isolation the exons are not very interesting.  Add an ideogram with `IdeogramTrack`, genomic location using `GenomeAxisTrack`, and ensembl gene models using `BiomartGeneRegionTrack`.

```{r, fusion_exons_with_annot, fig.cap="fusion exons with annotations"}
plot.start = min(starts1) - 2000
plot.end = max(ends1) + 5000

itrack <- IdeogramTrack(genome = "hg19", chromosome = chromosome1)

gtrack <- GenomeAxisTrack(genome = "hg19", chromosome = chromosome1)

biomTrack <- BiomartGeneRegionTrack(genome = "hg19",
  chromosome = chromosome1, start = plot.start, end = plot.end,
  name = "ENSEMBL")

plotTracks(list(itrack, gtrack, biomTrack, fusionTrack), from = plot.start, to = plot.end)
```

It is also possible to plot read alignments and coverage directly from a bam file using the AlignmentsTrack.  The bam file should ideally have ucsc chromosome names to have the optimal compatibility with `GViz`.  Set the `sizes` argument so that the alignments track doesnt take over the entire plot.

```{r, exons_anno_align, fig.cap="fusion exons with annotations and alignments"}
bam_url = "http://cbwmain.dyndns.info/Module4/HCC1395/rnaseq/genes/HCC1395_EIF3K.bam"
bai_url = "http://cbwmain.dyndns.info/Module4/HCC1395/rnaseq/genes/HCC1395_EIF3K.bam.bai"

bam_file <- basename(bam_url)
bai_file <- basename(bai_url)

download.file(bam_url, bam_file)
download.file(bai_url, bai_file)

alTrack <- AlignmentsTrack(bam_file,
  isPaired = TRUE, type = "coverage")

plotTracks(list(itrack, gtrack, alTrack, biomTrack, fusionTrack),
  chromosome = chromosome1, from = plot.start, to = plot.end, 
  sizes = c(0.5, 0.5, 1, 3, 0.5))
```

Try other _types_ of alignment plotting, by setting the `type` argument to "coverage", "sashimi" or "pileup".  For sashimi plots, use `type = c("sashimi", "coverage")`.

### Visualizing the fusion break end

We will visualize the fusion break end using a highlight track overlayed on the fusion track.

Create the highlight track starting and ending at the fusion break end.  Apply to the fusion track.

```{r}
brkendTrack <- HighlightTrack(trackList = list(fusionTrack),
  chromosome = chromosome1, start = fusion$genomic_break_pos1,
  end = fusion$genomic_break_pos1,
  inBackground = FALSE, col = "darkred", fill = NA)
```

Plot the highlight track and gene model and ideogram tracks.

> Note: The highlight should be plotted _instead_ of the track its highlighting.  Plotting the highlight track will plot the fusion track for you.

```{r, exons_anno_align_end, fig.cap="fusion exons with annotations and alignments, with fusion end highlighted"}
plotTracks(list(itrack, gtrack, alTrack, biomTrack, brkendTrack),
  chromosome = chromosome1, from = plot.start, to = plot.end,
  sizes = c(0.5, 0.5, 1, 3, 0.5))
```

Plotting the fusion break end in the context of the gene models will now give us an idea for which intron of the gene harbours the associated genomic rearrangement breakpoint.

### Plot the other end of the fusion

```{r, exons_anno_align_end_2, fig.cap="fusion exons with annotations and alignments, with fusion end highlighted"}
starts2 = as.numeric(strsplit(fusion$genomic_starts2, ",")[[1]])
ends2 = as.numeric(strsplit(fusion$genomic_ends2, ",")[[1]])

fusionexons2 <- GRanges(
  seqnames = fusion$gene_chromosome2,
  ranges = IRanges(
    start = starts2,
    end = ends2),
  strand = fusion$genomic_strand2,
  group = rep("fusionexons", length(starts2)))

fusionTrack <- AnnotationTrack(fusionexons2,
  genome = "hg19", name = "fusion", 
  shape = "box", stacking = "dense",
  groupAnnotation = "group")

brkendTrack <- HighlightTrack(trackList = list(fusionTrack),
  chromosome = fusion$gene_chromosome2,
  start = fusion$genomic_break_pos2,
  end = fusion$genomic_break_pos2,
  inBackground = FALSE, col = "darkred", fill = NA)

bam_url = "http://cbwmain.dyndns.info/Module4/HCC1395/rnaseq/genes/HCC1395_CYP39A1.bam"
bai_url = "http://cbwmain.dyndns.info/Module4/HCC1395/rnaseq/genes/HCC1395_CYP39A1.bam.bai"

bam_file <- basename(bam_url)
bai_file <- basename(bai_url)

download.file(bam_url, bam_file)
download.file(bai_url, bai_file)

alTrack <- AlignmentsTrack(bam_file,
  isPaired = TRUE, type = "coverage")

plot.start = min(fusionexons2@ranges@start) - 2000
plot.end = max(fusionexons2@ranges@start + fusionexons2@ranges@width) + 5000

itrack <- IdeogramTrack(genome = "hg19", chromosome = fusion$gene_chromosome2)

gtrack <- GenomeAxisTrack(genome = "hg19", chromosome = fusion$gene_chromosome2)

biomTrack <- BiomartGeneRegionTrack(genome = "hg19",
  chromosome = fusion$gene_chromosome2, start = plot.start, end = plot.end,
  name = "ENSEMBL")

p2 = plotTracks(list(itrack, gtrack, alTrack, biomTrack, brkendTrack),
  from = plot.start, to = plot.end,
  sizes = c(0.5, 0.5, 1, 3, 0.5))

```

## Plot a _circos_ of all fusions using ggbio

We will plot the full fusion dataset as a _circos_ plot using the `ggbio` package.  A small amount of setup is required.

### Reference genome preparation

First obtain the human genome hg19 chromosome lengths from the `BSgenome.Hsapiens.UCSC.hg19` package.  Create a `SeqInfo` object for the chromosome sequences of hg19.

```{r}
library(BSgenome.Hsapiens.UCSC.hg19)

genome <- Seqinfo(
  seqnames = seqnames(Hsapiens),
  seqlengths = seqlengths(Hsapiens),
  genome = "hg19"
)
```

We will be working with NCBI 1, 2, 3... named chromosomes rather than chr1, chr2, chr3...

```{r}
seqlevelsStyle(genome) <- "NCBI"
```

Subset to the chromosomes we are interested in plotting.

```{r}
chromosomes = as.character(c(seq(22), "X"))
genome = genome[chromosomes]
```

We will also require the chromosomes represented as a `GRanges` object.

```{r}
genome.ranges = GRanges(
  seqnames = seqnames(genome),
  strand = "*",
  ranges = IRanges(start = 1, width = seqlengths(genome)),
  seqinfo = genome
)
genome.ranges
```

### Plotting chromosomes in a circle

Using the `GRanges` representation of the chromosome lengths, we can draw a basic plot of the chromosomes in a circle.

```{r, circos_chrom, fig.cap="circos of chromosomes"}
p <- ggbio() +
  circle(genome.ranges, geom = "ideo", fill = "gray70")
p
```

Additional tracks are added using the `+` operator.  Add a scale and chromosome name.

```{r, circos_chrom_anno, fig.cap="circos of chromosomes with scales and names"}
p <- ggbio() +
  circle(genome.ranges, geom = "ideo", fill = "gray70") +
  circle(genome.ranges, geom = "scale", size = 2) +
  circle(genome.ranges, geom = "text", aes(label = seqnames), vjust = -1, size = 4)
p
```

We are now ready the plot arcs within and between chromosomes representing fusions between genes.

Subset the data by the chromosomes we would like to plot.  Also remove low probability events.

```{r}
data = data[
  data$gene_chromosome1 %in% as.character(chromosomes) & 
  data$gene_chromosome2 %in% as.character(chromosomes) &
  data$probability > 0.75,]
```

### Creating a `GRanges` object for your fusions

Create a `GRanges` object for the fusions.  To do this, we must first create a GRanges object for all the fusion ends together, then split that object into a `GrangesList` with two elements, one for fusion end 1 and another for fusion end 2.  Finally, we add fusion end 2 to the meta data of fusion end 1.

First step, create a `GRanges` object of all the data, using the `c()` function to concatenate chromosome, strand and position.  Add the `mate` metadata field to represent the end of the fusion for each entry in the `GRanges` object.  

> Important: you must provide a `SeqInfo` object as `ggbio` requires information about the chromosome lengths 

```{r}
fusion_ends = GRanges(
  seqnames = c(data$gene_chromosome1, data$gene_chromosome2),
  strand = c(data$genomic_strand1, data$genomic_strand2),
  ranges = IRanges(start=c(data$genomic_break_pos1, data$genomic_break_pos2), width=1),
  seqinfo = genome,
  mate = c(rep(0, length(data$gene_chromosome1)), rep(1, length(data$gene_chromosome2))))
```

Second step, split the `GRanges` object on the `mate` field to create a `GRangesList` with two elements, one containing the fusion ends in gene 1 and the other the fusion ends in gene 2.

```{r}
fusion_ends_list <- split(fusion_ends, values(fusion_ends)$mate)
```

Third step, create a fusions `GRanges` object that is the gene 1 fusion ends linked to the gene 2 fusion ends.  Create the `fusions` object for the first end, then add the second end as the `fusedto` metadata field.

```{r}
fusions = fusion_ends_list[[1]]
fusions$fusedto = fusion_ends_list[[2]]
```

### Plotting links

Add the rearrangements to the plot using the `link` geom argument.

```{r, circos_chrom_anno_links, fig.cap="circos of chromosomes with scales and names and fusion links"}
p <- ggbio() +
  circle(fusions, geom = "link", linked.to = "fusedto") +
  circle(genome.ranges, geom = "ideo", fill = "gray70") +
  circle(genome.ranges, geom = "scale", size = 2) +
  circle(genome.ranges, geom = "text", aes(label = seqnames), vjust = -1, size = 4)
p
```

We can color the links by any of the fusion attributes in the table.

```{r}
colnames(data)
```

First add the `orf` attribute to the metadata of our `GRanges` object.

```{r}
fusions$orf = data$orf
```

Use `aes` to color the fusions by whether or not they preserve the open reading frame of the fusion.

```{r, circos_chrom_anno_links_orf, fig.cap="circos of chromosomes with scales and names and fusion links colored by open reading frame"}
p <- ggbio() +
  circle(fusions, geom = "link", linked.to = "fusedto", aes(color = orf)) +
  circle(genome.ranges, geom = "ideo", fill = "gray70") +
  circle(genome.ranges, geom = "scale", size = 2) +
  circle(genome.ranges, geom = "text", aes(label = seqnames), vjust = -1, size = 4)
p
```