---
title: "Farris et al RNA-seq methods"
output:
  html_document:
    code_folding: hide
    toc: true
    toc_depth: 4
    toc_float:
      collapsed: true
    theme: cerulean
    highlight: pygments
    df_print: kable
  word_document:
    fig_width: 6
    fig_height: 6
    fig_caption: true
always_allow_html: yes
---

```{r setup, include=FALSE}
#    css: liao_style.css
## To render this page:
## In the farrisSeq folder run:
## rmarkdown::render("farrisSeq.Rmd", "html_document", knit_root_dir=getwd(), output_dir=getwd());

knitr::opts_chunk$set(cache=TRUE,
   autodep=TRUE,
   cache.extra=list(
      packageVersion("farrisdata"),
      packageVersion("jamba"),
      packageVersion("splicejam"))
   );

options("kable_styling_position"="left",
  "kable_styling_full_width"=FALSE,
  "kable_styling_bootstrap_options"=c("striped",
    "hover",
    "condensed",
    "responsive")
);
```

This document describes and demonstrates analysis methods used to analyze
mouse hippocampal subregion- and compartment-specific transcriptome
RNA-seq data [(Farris et al, 2019)](https://www.cell.com/cell-reports/fulltext/S2211-1247(19)31155-6).

Raw sequence read data is provided through GEO GSE116343, with RNA-seq
data available at
[GEO accession GSE116342](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE116342).
Data used for analysis in R is provided
through an R package `farrisdata`, installed in R with
`devtools::install_github("jmw86069/farrisdata")`.

## Reproducing this document

This document can be reproduced using the R code displayed in
this document. There are three basic methods to reproduce these
steps.

#### Method One

Copy and paste the commands into an active R session. Note that the
workflow will create temporary files in the current directory.
You can check the active directory with `getwd()` in R.


#### Method Two

Download the Rmarkdown file
[farrisSeq.Rmd](https://raw.githubusercontent.com/jmw86069/jampack/master/docs/farrisSeq.Rmd)
(linked) from the Github repository.

* Move this file to a convenient folder.
Cache files and images will be stored in that folder as a result
of the analysis steps below.
* Open R in the folder where the file was saved, 
then run this command in R:

> rmarkdown::render("farrisSeq.Rmd", "html_document", knit_root_dir=getwd(), output_dir=getwd())

Once the process is complete, there will be a new file
`farrisSeq.html` that contains the full summary including
figures.

This method has the benefit of loading all the R objects into
the active R environment, so the data can be examined in more detail.

#### Method Three

Download the Rmarkdown file
[farrisSeq.Rmd](https://github.com/jmw86069/jampack/blob/master/docs/farrisSeq.html)
(linked) from the Github repository.

* Move this file to a convenient folder.
Cache files and images will be stored in that folder as a result
* Open the `farrisSeq.Rmd` file in Rstudio, and click `"Knit"` to
render the document into an HTML page.

Once this process is complete, the output file `farrisSeq.html` will
be opened automatically in the Rstudio viewer.
This process has the slight disadvantage that the R objects are
not loaded into the R environment.

## R environment setup

The initial steps involve loading required R packages into memory.

```{r, r_initialization, cache=FALSE}
suppressPackageStartupMessages(library(knitr));
suppressPackageStartupMessages(library(kableExtra));
suppressPackageStartupMessages(library(dplyr));
suppressPackageStartupMessages(library(tidyselect));
suppressPackageStartupMessages(library(plotly));
suppressPackageStartupMessages(library(made4));
suppressPackageStartupMessages(library(matrixStats));
suppressPackageStartupMessages(library(ComplexHeatmap));
suppressPackageStartupMessages(library(SummarizedExperiment));

suppressPackageStartupMessages(library(ggplot2));
suppressPackageStartupMessages(library(ggrepel));

suppressPackageStartupMessages(library(jamba));
suppressPackageStartupMessages(library(jamma));
suppressPackageStartupMessages(library(colorjam));
suppressPackageStartupMessages(library(splicejam));
suppressPackageStartupMessages(library(farrisdata));

suppressPackageStartupMessages(library(png));
suppressPackageStartupMessages(library(Cairo));

```


## Description of Input Data

The input data for this analysis workflow
is gene and transcript level read count data 
derived from Salmon quantitation files, which were already imported and
assembled into relevant data matrices and saved as R objects
`farrisGeneSE` and `farrisTXSE`, as described in more depth below.

### Description of sample groups

The cell body (CB) and dendrite (DE) compartments from each
hippocampal subregion cell type (CA1, CA2, CA3, DG) from three
adult male mice (animal IDs 492, 496, 502) were laser capture
microdissected and total RNA was extracted for RNAseq. Libraries
were made using the Nugen Ovation Universal RNA-Seq system,
which does not use isothermal amplification and produces
strand-specific libraries depleted of rRNA.

For consistency, a common set of colors are defined to represent
the different sample groups, obtained from `farrisdata::colorSub`.

### Gene expression data

Salmon quantitation files were imported using the R Bioconductor
package `tximport`, summarized to the gene level, and stored in an
R object `farrisGeneSE`, which is available in the R data package
`"farrisdata"`, installed using
`devtools::install_github("jmw86069/farrisdata")`.

```{r load_gene_data}
## Load the gene data
farrisGeneSE;
```

#### farrisGeneSE SummarizedExperiment format

`farrisGeneSE` is a `SummarizedExperiment` object with this content:

* **assays** list containing the numeric matrix with gene rows,
sample columns, with log2 median-normalized counts. Each row is named
using a unique gene symbol derived from the Gencode GTF file. To
access the normalized Salmon pseudocounts:

```{r, eval=FALSE}
assays(farrisGeneSE)[["counts"]]
```

* **assays** also contains the raw Salmon pseudocounts
accessible, using this command: `assays(farrisGeneSE)[["raw_counts"]]`

* **rowData** data.frame with gene rows, using the gene symbol derived from
the Gencode GTF file. 
```{r, gene_table, dependson="load_gene_data"}
as.data.frame(head(rowData(farrisGeneSE), 10)) %>%
   dplyr::mutate() %>%
   dplyr::select(probeID, geneSymbol,
      tidyselect::ends_with("_type"),
      tidyselect::contains("has")) %>%
   kable(escape=FALSE) %>%
   kable_styling()
```

* **colData** annotated data.frame with sample rows, describing the
mouse hippocampal **Region**, **CellType**, **Sample**, and **groupName**.
```{r, sample_table, echo=FALSE, dependson=c("load_gene_data","color_sub")}
colorSub <- farrisdata::colorSub;
df2 <- kable_coloring(colData(farrisGeneSE),
      colorSub=colorSub,
      row_color_by="groupName",
      returnType="kable") %>%
   collapse_rows(columns=c(5), valign="middle") %>%
   row_spec(0, background="#DDDDDD")
df2;
```


* **metadata** list describing the experimental design and contrasts used
for statistical comparisons.
    * **design** numeric matrix describing the sample design, suitable
    for use directly by methods from the R Bioconductor `limma` package.
    * **contrasts** numeric matrix describing contrasts, suitable for
    use directly by `limma`.
    * **samples** vector of samples used.
    * **genes** vector of genes used.


### Transcript expression data

Salmon  quantitation files were imported using `tximport` as above,
maintaining individual isoform abundances, then stored in an R object
`farrisTxSE`, which is available in the R data package
`farrisdata`, installed using
`devtools::install_github("jmw86069/farrisdata")`.

```{r load_tx_data}
## Load the gene data
farrisTxSE;
```

#### farrisTxSE format

The `farrisTxSE` object is a named list with content similar to that
described above for gene data:

* **assays** list containing the numeric matrix with gene rows,
sample columns, with log2 median-normalized counts. Each row is named
using a unique gene symbol derived from the Gencode GTF file.
* **rowData** data.frame with gene rows, using the gene symbol derived from
the Gencode GTF file.
```{r, tx_table, dependson=c("load_tx_data","color_sub")}
tx_df <- subset(as.data.frame(rowData(farrisTxSE)),
   geneSymbol %in% c("Gria1","Shank2","Ntrk2")) %>%
   dplyr::select(geneSymbol,
      transcript_id,
      tidyselect::ends_with("_type"),
      tidyselect::contains("has"),
      tidyselect::contains("tpm"));
tx_df <- mixedSortDF(tx_df,
   byCols=match(c("geneSymbol","TxDetectedByTPM"),
     colnames(tx_df))*c(1,-1));
colorSubGene <- c(colorSub,
   group2colors(unique(tx_df$geneSymbol),
      cRange=c(20,30),
      lRange=c(80,90)));
tx_df2 <- kable_coloring(tx_df,
      colorSub=colorSubGene,
      row_color_by="geneSymbol",
      verbose=FALSE,
      returnType="kable") %>%
   collapse_rows(columns=2, valign="middle") %>%
   row_spec(0, background="#DDDDDD")

tx_df2;
```

* **colData** annotated data.frame with sample rows, describing the
mouse hippocampal **Region**, **CellType**, **Sample**, and **groupName**.
This data is identical to the sample table shown for `farrisGeneSE` above.
* **metadata** list describing the experimental design and contrasts used
for statistical comparisons.
    * **design** numeric matrix describing the sample design, suitable
    for use directly by methods from the R Bioconductor `limma` package.
    * **contrasts** numeric matrix describing contrasts, suitable for
    use directly by `limma`.
    * **samples** vector of samples used.
    * **genes** vector of genes used.


## Related analysis parameters

The following steps and parameters were used to create
the Salmon median normalized count data used in differential
gene and isoform expression analyses and/or STAR junction
count data used in sashimi plots. The STAR alignment was
also used to create sequence read coverage files.

### Sickle read trimming

We used sickle version 1.33 to trim paired-end reads, with parameters
`-q 20 -l 20` to filter reads with quality threshold 20, and requiring
reads at least 20 base length.

### Cutadapt adapter trimming

We used cutadapt version 1.8.1 with parameter `-m 20` and with standard
Illumina Truseq adapter sequence `AGATCGGAAGAGC`.

### Salmon quantitation

We used salmon version 0.9.1 for transcript quantitation, using
Gencode vM12 comprehensive GTF file, using a index with kmer size 31,
and the cutadapt adapter-trimmed reads, with seqBias, VBO, and gcBias
options enabled.

### STAR sequence alignment

We used STAR version 2.5.1b to align versus Gencode vM12 comprehensive GTF,
with the following parameters, based upon ENCODE recommended RNA-seq
parameters:

```
--outFilterType Normal
--outFilterMultimapNmax 20
--alignSJoverhangMin 8
--alignSJDBoverhangMin 1
--outFilterMismatchNmax 999
--alignIntronMin 20
--alignIntronMax 1000000
--alignMatesGapMax 1000000
--outSAMtype BAM SortedByCoordinate Unsorted
--outWigType bedGraph
--outWigStrand Stranded
--outWigNorm None
--outReadsUnmapped Fastx
--outMultimapperOrder Random
```


### featureCounts read counting

We used featureCounts version 1.5.1 to produce per-gene counts, using
Gencode vM12 transcripts flattened per gene.


## Gene expression analysis summary


### BGA plot, refers to Figure 2

3D between group analysis (BGA) plot of the three
biological replicates per cell type per compartment.

```{r bga_plot, dependson="load_gene_data"}
bgaExprs <- assays(farrisGeneSE)$counts;
bgaExprs[bgaExprs < 7] <- 7;
bgaExprsUse <- bgaExprs[rowMins(bgaExprs) < rowMaxs(bgaExprs),,drop=FALSE];
nrow(bgaExprsUse);

salmonBga1 <- bga(dataset=bgaExprsUse,
   classvec=nameVector(colData(farrisGeneSE)$groupName,
      colnames(farrisGeneSE)),
   type="coa");
```

```{r bga_plotly, dependson=c("bga_plot","color_sub")}
salmonBga1ly <- bgaPlotly3d(salmonBga1,
   axes=c(1,2,3),
   colorSub=colorSub,
   useScaledCoords=FALSE,
   drawVectors="none",
   drawSampleLabels=FALSE,
   superGroups=gsub("_.+", "", salmonBga1$fac),
   ellipseType="none",
   sceneX=0, sceneY=1, sceneZ=1,
   verbose=FALSE);
salmonBga1ly;
```

### Comparison with CA1 published data by Nakayama, Ainsley, Cajigas, refers to Figure S2

First we calculate group medians using gene expression data, to determine
the set of genes detected above an expression threshold of 128 pseudocounts
(log2 = 7).

```{r genes_detected_CA1, fig.height=9, fig.width=6, dependson="load_gene_data"}
farrisGeneGroupMedians <- rowGroupMeans(assays(farrisGeneSE)[["counts"]],
   groups=colData(farrisGeneSE)$groupName,
   useMedian=TRUE);

# Plot histogram of expression in CA1_CB and CA1_DE
plotPolygonDensity(farrisGeneGroupMedians[,c("CA1_CB","CA1_DE")],
   xlim=c(0,20),
   ylim=c(0,700),
   ablineV=7);

# Detected genes in CA1_CB and CA1_DE
CA1detected <- (farrisGeneGroupMedians[,"CA1_CB"] > 7 &
   farrisGeneGroupMedians[,"CA1_DE"] > 7);
CA1detCBandDE <- rownames(farrisGeneGroupMedians)[CA1detected];
length(CA1detCBandDE);
```

Next we load previously gene lists representing detected CA1 dendrite genes
from published studies from Nakayama et al, Ainsley et al, and
Cajigas et al. These genes are available in the `farrisdata` package,
as described above.

Produce a 4-way Venn diagram showing the overlapping genes from these
studies.

```{r venn_4_way, fig.height=8, fig.width=8, dependson="genes_detected_CA1"}
GeneListL <- list(
  Farris=CA1detCBandDE,
  Cajigas=CajigasGenes,
  Ainsley=AinsleyGenes,
  Nakayama=NakayamaGenes);
GeneListIM <- list2im(GeneListL);
vps <- limma::vennDiagram(GeneListIM,
   circle.col=rainbowJam(4));

vps2 <- venn(GeneListL,
   show.plot=FALSE);
## Retrieve specific overlaps like this:
vps2vennL <- attr(vps2, "intersections");
## vps2vennL[["Farris:Cajigas:Ainsley:Nakayama"]]
```

### Correlation Centered by Compartment, refers to Figure 3

#### Cell Body

Per-sample correlation, centered across all CB samples. First, we
use Salmon normalized pseudocounts, restricted to genes where at
least one group mean value is above log2(7), which is >= 128
normalized pseudocounts.

Then data is centered per **Compartment**, so that CellBody samples
are centered by subtracting the mean CellBody expression per gene,
and Dendrite samples are centered by subtracting the mean Dendrite
expression per gene. This step uses `centerGeneData()` with the
argument `centerGroups`.

Next we prepare a data.frame with color coding to show the
CellType and Compartment values.

```{r sample_corr_cb_prep, dependson=c("load_gene_data","genes_detected_CA1","color_sub")}
## farrisGeneGroupMedians
## Pull out only cell body samples
iSamples <- colnames(farrisGeneSE);
iSamplesCB <- vigrep("CB", iSamples);
iSamplesDE <- vigrep("DE", iSamples);
#iSamplesGrp <- colnames(iMatrix7grp);
iSamplesGrp <- colnames(farrisGeneGroupMedians);
iSamplesGrpCB <- vigrep("CB", iSamplesGrp);
iSamplesGrpDE <- vigrep("DE", iSamplesGrp);

corrCutoff <- 7;
genesAboveCutoff <- (rowMaxs(farrisGeneGroupMedians) >= corrCutoff);
genesAboveCutoffCB <- (rowMaxs(farrisGeneGroupMedians[,iSamplesGrpCB]) >= corrCutoff);
genesAboveCutoffDE <- (rowMaxs(farrisGeneGroupMedians[,iSamplesGrpDE]) >= corrCutoff);
genesAboveCutoffBoth <- (genesAboveCutoffCB & genesAboveCutoffDE);

iMatrix7 <- assays(farrisGeneSE[genesAboveCutoff,])[["counts"]];
centerGroups <- gsub("^.+(DE|CB).+$",
   "\\1",
   iSamples);
iMatrix7ctr <- centerGeneData(iMatrix7,
   centerGroups=centerGroups);
iMatrix7ctrCor <- cor(iMatrix7ctr);


## Generate some color bars to annotate the heatmap
colDataColorsRepL <- as.data.frame(colData(farrisGeneSE)[,c("CellType","Compartment")]);
colDataColorsRep <- df2colorSub(colDataColorsRepL, colorSub=colorSub);
```

Sample correlation heatmap

```{r sample_corr_cb_CH, fig.height=8, fig.width=10, dependson=c("sample_corr_cb_prep","color_sub")}
pheat_colors <- list(Compartment=colorSub[c("CB","DE")],
   CellType=colorSub[c("CA1","CA2","CA3","DG")]);
pheat_breaks <- warpAroundZero(seq(from=-1, to=1, length.out=51), lens=1);

cBR <- circlize::colorRamp2(breaks=pheat_breaks,
   col=getColorRamp("RdBu_r", n=51));

colHA <- HeatmapAnnotation(df=colDataColorsRepL[iSamplesCB,2:1],
   show_annotation_name=TRUE,
   border=TRUE,
   annotation_legend_param=list(
      title_gp=gpar(fontsize=12, fontface="bold")),
   col=pheat_colors);
rowA <- rowAnnotation(df=colDataColorsRepL[iSamplesCB,2:1],
   col=pheat_colors,
   show_annotation_name=TRUE,
   border=TRUE,
   annotation_legend_param=list(
      title_gp=gpar(fontsize=12, fontface="bold")),
   show_legend=FALSE);

corHM <- Heatmap(iMatrix7ctrCor[iSamplesCB,iSamplesCB],
   name="Correlation",
   clustering_method_columns="ward.D",
   clustering_method_rows="ward.D",
   column_dend_height=unit(20, "mm"),
   row_dend_width=unit(20, "mm"),
   top_annotation=colHA,
   left_annotation=rowA,
   border=TRUE,
   heatmap_legend_param=list(
      grid_width=unit(8, "mm"),
      border="black",
      title_gp=gpar(fontsize=12, fontface="bold"),
      legend_height=unit(30, "mm")),
   row_dend_side="left",
   col=cBR);
ComplexHeatmap::draw(corHM, merge_legend=TRUE);

```


#### Cell Body and Dendrite

Correlation showing CA2_CB correlates highest with CA2_DE,
etc. for each CellType.

```{r corr_by_compartment_prep, dependson=c("load_gene_data","genes_detected_CA1","color_sub")}
## farrisGeneGroupMedians[genesAboveCutoff,]
iMatrix7grp <- farrisGeneGroupMedians[genesAboveCutoffCB,];
centerGroupsGrp <- gsub("^.*(DE|CB).*$",
   "\\1",
   iSamplesGrp);
## 31oct2018 using mean instead of median
iMatrix7grpCtr <- centerGeneData(iMatrix7grp,
   centerGroups=centerGroupsGrp,
   mean=TRUE);
iMatrix7grpCtrCor <- cor(iMatrix7grpCtr);

## Generate some color bars to annotate the heatmap
colDataColorsGrpL <- data.frame(rbindList(
   strsplit(nameVector(iSamplesGrp), "_")));
colnames(colDataColorsGrpL) <- c("CellType","Compartment");
colDataColorsGrp <- df2colorSub(colDataColorsGrpL,
  colorSub=colorSub);
```

Sample correlation heatmap

```{r corr_by_compartment_CB, fig.height=8, fig.width=10, dependson="corr_by_compartment_prep"}
colHA2 <- HeatmapAnnotation(df=colDataColorsGrpL[iSamplesGrp,2:1],
   show_annotation_name=TRUE,
   border=TRUE,
   annotation_legend_param=list(
      title_gp=gpar(fontsize=12, fontface="bold")),
   col=pheat_colors);
rowA2 <- rowAnnotation(df=colDataColorsGrpL[iSamplesGrp,2:1],
   col=pheat_colors,
   show_annotation_name=TRUE,
   border=TRUE,
   annotation_legend_param=list(
      title_gp=gpar(fontsize=12, fontface="bold")),
   show_legend=FALSE);

corHM2 <- Heatmap(iMatrix7grpCtrCor[iSamplesGrp,iSamplesGrp],
   name="Correlation",
   clustering_method_columns="ward.D",
   clustering_method_rows="ward.D",
   column_dend_height=unit(20, "mm"),
   row_dend_width=unit(20, "mm"),
   top_annotation=colHA2,
   left_annotation=rowA2,
   border=TRUE,
   heatmap_legend_param=list(
      grid_width=unit(8, "mm"),
      border="black",
      title_gp=gpar(fontsize=12, fontface="bold"),
      legend_height=unit(30, "mm")),
   row_dend_side="left",
   col=cBR);
ComplexHeatmap::draw(corHM2, merge_legend=TRUE);
```

#### Splom plot, refers to Figures 3 and S4

Scatterplot showing the +1,+1 selection of genes, which
helps define the gene lists in the next section.

```{r splom_CA2, fig.height=8, fig.width=8, dependson="corr_by_compartment_prep"}
corCols <- c("CA2_DE","CA2_CB");
cutCB <- round(digits=3, log2(1.5));
cutDE <- round(digits=3, log2(1.5));
splomL <- lapply(nameVector(c("CA1","CA2","CA3","DG")), function(k) {
   corCols <- vigrep(k, colnames(iMatrix7grpCtr));
   df1 <- as.data.frame(iMatrix7grpCtr[,corCols]);
   CBhit <- (abs(df1[,1]) > cutCB) * sign(df1[,1]);
   DEhit <- (abs(df1[,2]) > cutCB) * sign(df1[,2]);
   CBhitN <- paste0(k, "_", "CBhit");
   DEhitN <- paste0(k, "_", "DEhit");
   df1[,CBhitN] <- CBhit;
   df1[,DEhitN] <- DEhit;
   df1[,"GeneName"] <- rownames(df1);
   df1;
});
#table(splomL[[1]][,3:4])
splomLsub <- lapply(splomL, function(iDF){
   subset(iDF, !iDF[,3] %in% 0 | !iDF[,4] %in% 0)
});

splomDF2 <- rbindList(lapply(splomLsub, function(iDF){
   iDF[,"CellType"] <- gsub("_.+", "", colnames(iDF)[1]);
   colnames(iDF) <- c("CB","DE","CBhit","DEhit","GeneName","CellType");
   iDF;
}));
#splomDF2[,"row"] <- ifelse(splomDF2$CellType %in% c("CA1","CA2"), "CA1", "CA3");
splomDF2[,"CBDEhit"] <- paste0(splomDF2$CBhit,":",splomDF2$DEhit);

## Color-code the splom points
ccColors <- nameVector(rep("grey", 8), unique(splomDF2$CBDEhit));
ccColors[c("1:1","-1:-1")] <- "orangered3";
ccColors[c("-1:1","1:-1")] <- "grey";

## Gene labels
GeneHighlight <- c("Plch2","Rgs14","Necab2","1700024P16Rik");
splomDF2$label <- ifelse(splomDF2$GeneName %in% GeneHighlight,
  splomDF2$GeneName, "");


splomDF2[,"CBmaxGroupMean"] <- rowMaxs(farrisGeneGroupMedians[splomDF2$GeneName,iSamplesGrpCB]);
splomDF2[,"DEmaxGroupMean"] <- rowMaxs(farrisGeneGroupMedians[splomDF2$GeneName,iSamplesGrpDE]);
splomDF2[,"CBdet7"] <- (splomDF2[,"CBmaxGroupMean"] > 7)+0;
splomDF2[,"DEdet7"] <- (splomDF2[,"DEmaxGroupMean"] > 7)+0;

gg2 <- ggplot(splomDF2,
      aes(x=CB, y=DE, color=CBDEhit, fill=CBDEhit, GeneName=GeneName)) +
   geom_point(shape=21, size=2) +
   geom_vline(xintercept=c(-1,1)*0.585,
      linetype="dashed", color="grey20") +
   geom_hline(yintercept=c(-1,1)*0.585, linetype="dashed", color="grey20") +
   facet_wrap(facets=~CellType) +
   theme_jam() +
   scale_fill_manual(values=ccColors) +
   scale_color_manual(values=makeColorDarker(ccColors)) +
   ggrepel::geom_label_repel(aes(label=label),
      force=8,
      fill="white") +
   theme(legend.position="none");
gg2;
```


```{r neuronal_gene_lists}
## anyGenes1 is the full set of all genes with abundance > 1
anyGenes1 <- rownames(farrisGeneGroupMedians)[rowMaxs(farrisGeneGroupMedians) > 1];

## Filtering rules for CB and DE
## corrCutoff is 7
corFilterRuleCB <- (rowMaxs(farrisGeneGroupMedians[,iSamplesGrpCB]) >= corrCutoff);
corFilterRuleDE <- (rowMaxs(farrisGeneGroupMedians[,iSamplesGrpDE]) >= corrCutoff);
corFilterRule <- (corFilterRuleCB);

DEgenes7 <- rownames(farrisGeneGroupMedians)[corFilterRuleDE];
CBgenes7 <- rownames(farrisGeneGroupMedians)[corFilterRuleCB];
DEgenes7nonCB7 <- setdiff(DEgenes7, CBgenes7); # non-pyramidal
CBgenes7nonDE7 <- setdiff(CBgenes7, DEgenes7); # "non-DE genes"
DEgenes7andCB7 <- intersect(DEgenes7, CBgenes7);   # "DE genes"

## Define neuro gene lists
geneListsAll <- list(anyGenes1=anyGenes1,
   DEandCB=DEgenes7andCB7, # detected
   CBonly=CBgenes7nonDE7,  # detected only CB, not detected DE
   `non-pyramidal`=DEgenes7nonCB7,
   `CB1andDE1`=unique(subset(splomDF2,
      CBDEhit %in% "1:1" &
      CBdet7 %in% c(1) &
      DEdet7 %in% c(1))$GeneName)
   );
#lengths(geneListsAll);
geneListsDF <- data.frame(GeneList=names(geneListsAll),
   GeneCount=format(lengths(geneListsAll), big.mark=",", trim=TRUE));
rownames(geneListsDF) <- NULL;
colorSubGeneLists <- nameVector(rainbowJam(length(geneListsAll)),
   names(geneListsAll));
geneListsDF2 <- kable_coloring(geneListsDF,
      colorSub=colorSubGeneLists,
      row_color_by="GeneList",
      returnType="kable") %>%
   row_spec(0, background="#DDDDDD")
geneListsDF2;

```


### Differential gene expression using limma

In preparation.


## Transcript analysis summary

### Definition of detected transcripts

We refer to the function `defineDetectedTx()` in the `jampack` R package.
The function takes normalized counts, normalized TPM values, sample
group information, and several cutoff values:

* **cutoffTxPctMax=10** requires a transcript isoform to be at least
10% of the highest isoform present in the same sample.
* **cutoffTxExpr=32** requires a transcript isoform to have at least
32 normalized counts.
* **cutoffTxTPMExpr=2** requires a transcript isoform to have at least
a TPM value of 2.
* All of the above criteria must be met for an isoform to be considered
"detected."

```{r, detectedTxTPM}
#refreshFunctions("farrisSalmonWWS");
detectedTxTPML <- defineDetectedTx(
   iMatrixTx=assays(farrisTxSE)[["counts"]],
   iMatrixTxTPM=assays(farrisTxSE)[["tpm"]],
   groups=colData(farrisTxSE)$groupName,
   cutoffTxPctMax=10,
   cutoffTxExpr=32,
   cutoffTxTPMExpr=2,
   zeroAsNA=FALSE,
   tx2geneDF=renameColumn(rowData(farrisTxSE),
     from=c("geneSymbol","probeID"),
     to=c("gene_name","transcript_id")),
   useMedian=FALSE,
   verbose=FALSE);
detectedTx <- detectedTxTPML$detectedTx;
numDetectedTx <- length(detectedTx);
detectedGenes <- mixedSort(unique(rowData(farrisTxSE[detectedTx,])$geneSymbol));
numDetectedGenes <- length(detectedGenes);
```

The code above defined
**`r format(numDetectedTx, big.mark=",")` detected
transcripts** by the given criteria,
covering **`r format(numDetectedGenes, big.mark=",")` unique gene symbols**.


### Transcript type per gene list

e.g. protein_coding, etc.

First, as a point of organizing the transcript-gene associations,
we read the Gencode GTF file and produce a `data.frame` with the
transcript-gene data.

Note: This step downloads the Gencode GTF file for version vM12,
then extracts data from that file. Once the `data.frame` is extracted,
it is stored in a text file for re-use.

```{r, tx2geneDF}
vM12gtf <- "ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M12/gencode.vM12.annotation.gtf.gz";
vM12gtfBase <- basename(vM12gtf);
if (!file.exists(vM12gtfBase)) {
   curl::curl_download(url=vM12gtf,
      destfile=vM12gtfBase);
}
tx2geneFile <- file.path(".", "vM12gtf.tx2geneDF.txt");
if (!file.exists(tx2geneFile)) {
   tx2geneDF <- makeTx2geneFromGtf(GTF=vM12gtfBase,
      verbose=FALSE);
   write.table(file=tx2geneFile,
      x=tx2geneDF,
      sep="\t",
      quote=FALSE,
      na="",
      col.names=TRUE,
      row.names=FALSE);
} else {
   tx2geneDF <- read.table(tx2geneFile,
      sep="\t",
      check.names=FALSE,
      as.is=TRUE,
      fill=TRUE,
      quote="\"",
      allowEscapes=FALSE,
      comment.char="",
      header=TRUE,
      stringsAsFactor=FALSE);
}

```

Note the logic to define various transcript types is applied below,
after the definition of Neuronal Transcript Lists.

#### Preparation of 3'UTR data from Gencode

We imported Gencode vM12 comprehensive GTF into R using
R Bioconductor GenomicFeatures package, with which we derived
3'UTR regions, using `GenomicFeatures::makeTxDbFromGFF()` and
`GenomicFeatures::threeUTRsByTranscript()`, respectively.

This step re-uses the Gencode GTF file downloaded above, then
converts it to a `"Txdb"` object, which is a SQLite relational database
format. This database is saved into a file so it can be recalled
without re-creating the file again.

```{r, gencode_vM12_gtf}
localDb <- file.path(".", "vM12gtf.txdb");
if (!file.exists(localDb)) {
   vM12txdb <- GenomicFeatures::makeTxDbFromGFF(vM12gtfBase);
   AnnotationDbi::saveDb(x=vM12txdb, file=localDb);
} else {
   vM12txdb <- AnnotationDbi::loadDb(file=localDb);
}
```

```{r, txdb_refresh, cache=FALSE}
## This snippet simply confirms the vM12txdb object is valid
## otherwise it loads from the source file. This step is
## necessary to resume a previously cached knitr session.
if (!DBI::dbIsValid(AnnotationDbi::dbconn(vM12txdb))) {
   vM12txdb <- AnnotationDbi::loadDb(file=localDb);
   printDebug("Refreshed vM12txdb.");
   printDebug('print(find("vM12txdb"))');
   print(find("vM12txdb"));
} else {
   printDebug("Did not refresh vM12txdb.");
}
```

```{r, three_utr_length, dependson=c("gencode_vM12_gtf","tx2geneDF")}
## Check for valid vM12txdb since it is not cached properly by knitr
if (!DBI::dbIsValid(dbconn(vM12txdb))) {
   vM12txdb <- AnnotationDbi::loadDb(file=localDb);
}
## Grab three prime UTR into a GRangesList
gencode3utr <- GenomicFeatures::threeUTRsByTranscript(vM12txdb,
   use.names=TRUE);
values(gencode3utr@unlistData)[,"transcript_id"] <- rep(names(gencode3utr), lengths(gencode3utr));
txMatch <- match(values(gencode3utr@unlistData)[,"transcript_id"], tx2geneDF$transcript_id);
values(gencode3utr@unlistData)[,"gene_id"] <- tx2geneDF[txMatch,"gene_id"];
values(gencode3utr@unlistData)[,"gene_name"] <- tx2geneDF[txMatch,"gene_name"];
values(gencode3utr@unlistData)[,"gene_type"] <- tx2geneDF[txMatch,"gene_type"];
values(gencode3utr@unlistData)[,"transcript_type"] <- tx2geneDF[txMatch,"transcript_type"];
names(gencode3utr@unlistData) <- pasteByRow(values(gencode3utr@unlistData)[,c("transcript_id","exon_rank")]);

## Add flags to tx2geneDF, whether a gene or tx has 3'UTR
tx2geneDF[,"GeneHas3UTR"] <- tx2geneDF$gene_id %in% 
   values(gencode3utr@unlistData)[,"gene_id"];
tx2geneDF[,"TxHas3UTR"] <- tx2geneDF$transcript_id %in%
   values(gencode3utr@unlistData)[,"transcript_id"];

## Summarize widths of three-prime-utr exons by transcript
threePrimeTx <- sum(width(gencode3utr));
GencodeVM12mm10threeUtrLength <- sum(width(gencode3utr));

## non-mito detectedTx
detectedTxMito <- subset(tx2geneDF[match(detectedTx, tx2geneDF$transcript_id),],
   grepl("^mt-", gene_name))$transcript_id;
detectedTxNonMito <- setdiff(detectedTx, detectedTxMito);

```


### Neuronal Transcript Lists

The code below creates transcript lists equivalent to the
Neuronal Gene Lists above. The main distinction is that
the filtering rules are applied to transcript-level data,
as opposed to gene-level summary data above.
While a gene may be present in one category, perhaps only a
subset of its transcript isoforms may be present in that category.
Similarly, individual isoforms from the same gene may be present in
multiple distinct neuronal categories.

The transcript-level subsets are used to produce the 3-prime UTR
and CAI plots in subsequent figures.

```{r, neuronal_tx_lists}
## First calculate per-transcript group expression values
farrisTxGroupMedians <- rowGroupMeans(assays(farrisTxSE)[["counts"]],
   groups=colData(farrisTxSE)$groupName,
   useMedian=TRUE);
farrisTxTPMGroupMedians <- rowGroupMeans(assays(farrisTxSE)[["tpm"]],
   groups=colData(farrisTxSE)$groupName,
   useMedian=TRUE);

## Define TPM data matrix
## Note: we add +1 to normalized TPM values, which keeps the minimum
## values above zero. Consequently, we reset values which were already
## zero to zero, so they are not adjusted.
iMatrixTxTPM <- assays(farrisTxSE)[["tpm"]] + 1;
iMatrixTxTPM[assays(farrisTxSE)[["raw_tpm"]] == 0] <- 0

## Define cutoffs using CB and DE samples
## "Any" uses permissive cutoffs:
## pseudocounts >= 5, TPM >= 1
## no requirement for isoforms to be a certain percent the max per gene
detectedTxTPManyL <- defineDetectedTx(
#   iMatrixTx=assays(farrisTxSE[,iSamplesCB])[["counts"]],
#   iMatrixTxTPM=assays(farrisTxSE[,iSamplesCB])[["tpm"]],
#   groups=colData(farrisTxSE[,iSamplesCB])$groupName,
   iMatrixTx=assays(farrisTxSE)[["counts"]],
   iMatrixTxTPM=assays(farrisTxSE)[["tpm"]],
   groups=colData(farrisTxSE)$groupName,
   cutoffTxPctMax=0,
   cutoffTxExpr=5,
   cutoffTxTPMExpr=1,
   tx2geneDF=renameColumn(rowData(farrisTxSE),
     from=c("geneSymbol","probeID"),
     to=c("gene_name","transcript_id")),
   useMedian=FALSE,
   verbose=FALSE);
## Define cutoffs using cell body (CB) samples
## Requires pseudocounts >= 32, TPM >= 2
## and isoforms must be >= 10 percent the max per gene
detectedTxCBTPML <- defineDetectedTx(
   iMatrixTx=assays(farrisTxSE[,iSamplesCB])[["counts"]],
   iMatrixTxTPM=assays(farrisTxSE[,iSamplesCB])[["tpm"]],
   groups=colData(farrisTxSE[,iSamplesCB])$groupName,
   cutoffTxPctMax=10,
   cutoffTxExpr=32,
   cutoffTxTPMExpr=2,
   tx2geneDF=renameColumn(rowData(farrisTxSE),
     from=c("geneSymbol","probeID"),
     to=c("gene_name","transcript_id")),
   useMedian=FALSE,
   verbose=FALSE);
detectedTxTPMCB <- detectedTxCBTPML$detectedTx;
## Define cutoffs using cell body (DE) samples
## Requires pseudocounts >= 32, TPM >= 2
## and isoforms must be >= 10 percent the max per gene
detectedTxDETPML <- defineDetectedTx(
   iMatrixTx=assays(farrisTxSE[,iSamplesDE])[["counts"]],
   iMatrixTxTPM=assays(farrisTxSE[,iSamplesDE])[["tpm"]],
   groups=colData(farrisTxSE[,iSamplesDE])$groupName,
   cutoffTxPctMax=10,
   cutoffTxExpr=32,
   cutoffTxTPMExpr=2,
   tx2geneDF=renameColumn(rowData(farrisTxSE),
     from=c("geneSymbol","probeID"),
     to=c("gene_name","transcript_id")),
   useMedian=FALSE,
   verbose=FALSE);
detectedTxTPMDE <- detectedTxDETPML$detectedTx;

## Center data within Compartment
farrisTxTPMGroupMediansCtr2 <- centerGeneData(farrisTxTPMGroupMedians,
   centerGroups=gsub("^.+_", "", colnames(farrisTxTPMGroupMedians)),
   mean=TRUE, showGroups=FALSE);


## anyTx1 is the full set of all transcripts with abundance > 1
anyTx1 <- rownames(farrisTxGroupMedians)[rowMaxs(farrisTxGroupMedians) > 1];

## Filtering rules for CB and DE
## corrCutoff is 7
corFilterRuleTxCB <- (rowMaxs(farrisTxGroupMedians[,iSamplesGrpCB]) >= corrCutoff);
corFilterRuleTxDE <- (rowMaxs(farrisTxGroupMedians[,iSamplesGrpDE]) >= corrCutoff);
corFilterRuleTx <- (corFilterRuleTxCB);

## Subsets of detected transcripts
DEtx7 <- rownames(farrisTxGroupMedians)[corFilterRuleTxDE];
CBtx7 <- rownames(farrisTxGroupMedians)[corFilterRuleTxCB];
DEtx7nonCB7 <- setdiff(DEtx7, CBtx7); # non-pyramidal
CBtx7nonDE7 <- setdiff(CBtx7, DEtx7); # "non-DE genes"
DEtx7andCB7 <- intersect(DEtx7, CBtx7);   # "DE genes"

## Plot based upon TPM or counts
splomTxL <- lapply(nameVector(c("CA1","CA2","CA3","DG")), function(k) {
   corCols <- vigrep(k, colnames(farrisTxTPMGroupMediansCtr2));
   #df1 <- as.data.frame(farrisTxTPMGroupMediansCtr2[corFilterRuleTx,corCols]);
   df1 <- as.data.frame(farrisTxTPMGroupMediansCtr2[detectedTxTPMCB,corCols]);
   CBhit <- (abs(df1[,1]) > cutCB) * sign(df1[,1]);
   DEhit <- (abs(df1[,2]) > cutCB) * sign(df1[,2]);
   CBhitN <- paste0(k, "_", "CBhit");
   DEhitN <- paste0(k, "_", "DEhit");
   df1[,CBhitN] <- CBhit;
   df1[,DEhitN] <- DEhit;
   df1[,"GeneName"] <- rownames(df1);
   df1;
});
## Remove entries with (0,0) no change in any condition
splomTxLsub <- lapply(splomTxL, function(iDF){
   subset(iDF, !iDF[,3] %in% 0 | !iDF[,4] %in% 0)
});
## Create a data.frame for ggplot
splomTxDF2 <- rbindList(lapply(splomTxLsub, function(iDF){
   iDF[,"CellType"] <- gsub("_.+", "", colnames(iDF)[1]);
   colnames(iDF) <- c("CB","DE","CBhit","DEhit","transcript_id","CellType");
   iDF;
}));
splomTxDF2[,"Gene"] <- tx2geneDF[match(splomTxDF2$transcript_id, tx2geneDF$transcript_id),"gene_name"];
splomTxDF2[,"CBDEhit"] <- paste0(splomTxDF2$CBhit,
   ":",
   splomTxDF2$DEhit);
## Add rowMaxs for CB and DE to farrisSplomDF2
splomTxDF2[,"CBmaxGroupMedian"] <- rowMaxs(farrisTxGroupMedians[splomTxDF2$transcript_id, iSamplesGrpCB]);
splomTxDF2[,"DEmaxGroupMedian"] <- rowMaxs(farrisTxGroupMedians[splomTxDF2$transcript_id, iSamplesGrpDE]);
splomTxDF2[,"CBdet7"] <- (splomTxDF2[,"CBmaxGroupMedian"] > corrCutoff)+0;
splomTxDF2[,"DEdet7"] <- (splomTxDF2[,"DEmaxGroupMedian"] > corrCutoff)+0;

## Add TPM detected columns
splomTxDF2[,"CBmaxGroupMedianTPM"] <- rowMaxs(farrisTxTPMGroupMedians[splomTxDF2$transcript_id, iSamplesGrpCB]);
splomTxDF2[,"DEmaxGroupMedianTPM"] <- rowMaxs(farrisTxTPMGroupMedians[splomTxDF2$transcript_id, iSamplesGrpDE]);
splomTxDF2[,"CBdetTPM"] <- (splomTxDF2$transcript_id %in% detectedTxTPMCB)+0;
splomTxDF2[,"DEdetTPM"] <- (splomTxDF2$transcript_id %in% detectedTxTPMDE)+0;


## Put it all together into tx lists
txListsAll <- list(anyGenes1=anyTx1,
   DEandCB=DEtx7andCB7, # detected
   CBonly=CBtx7nonDE7,  # detected only CB, not detected DE
   `non-pyramidal`=DEtx7nonCB7,
   `CB1andDE1`=unique(subset(splomTxDF2,
      CBDEhit %in% "1:1" &
      CBdet7 %in% c(1) & #DEdetTPM %in% c(0) &
      DEdet7 %in% c(1))$transcript_id)
   );

#########################################################
## Same logic as above, using TPM instead of pseudocounts
detectedTxTPMany <- detectedTxTPManyL$detectedTx;
detectedTxTPMCB <- detectedTxCBTPML$detectedTx;
detectedTxTPMDE <- detectedTxDETPML$detectedTx;
txListsAllTPM <- list(anyGenes1=detectedTxTPMany,
   DEandCB=intersect(detectedTxTPMCB, detectedTxTPMDE),
   CBonly=setdiff(detectedTxTPMCB, detectedTxTPMDE),
   `non-pyramidal`=setdiff(detectedTxTPMDE, detectedTxTPMCB),
   `CB1andDE1`=unique(subset(splomTxDF2,
      CBDEhit %in% "1:1" &
      CBdetTPM %in% c(1) &
      DEdetTPM %in% c(1))$transcript_id)
   );

```

#### Transcript types by transcript list


```{r, tx_by_type}

ncTx2geneDFdetTPM <- subset(tx2geneDF,
   !gene_type %in% c("protein_coding","TEC","Mt_tRNA","Mt_rRNA") &
   transcript_id %in% detectedTx);
## nc detectedTx has 1548 rows, originally it had 2657 rows
## Convert to list
txListsAllTPMncDet <- lapply(txListsAllTPM, function(i){
   intersect(i, ncTx2geneDFdetTPM$transcript_id);
});
## lengths(txListsAllTPMncDet)
## 1436, 747, 381, 420, 94

## % Tx in txListsAll that are protein_coding and have 3'UTRs
cdsTx2geneDFdetTPM <- subset(tx2geneDF,
   gene_type %in% c("protein_coding") &
   transcript_id %in% detectedTx);
txListsAllTPMcdsDet <- lapply(txListsAllTPM, function(i){
   iDF <- subset(cdsTx2geneDFdetTPM, transcript_id %in% i);
   iDF$transcript_id;
});
## lengths(txListsAllTPMcdsDet)
## 24120, 17104, 3424, 5259, 1541

txListsAllTPMcdsDetHas3utr <- lapply(txListsAllTPM, function(i){
   iDF <- subset(cdsTx2geneDFdetTPM, transcript_id %in% i);
   iV <- table(iDF$TxHas3UTR);
   c(format(iV, scientific=FALSE, big.mark=","),
      pct=round(1000*nrow(subset(iDF, TxHas3UTR)) / nrow(iDF))/10);
});
#txListsAllTPMcdsDetHas3utr;
txListsAllTPMcdsDetHas3utrV <- unlist(lapply(txListsAllTPMcdsDetHas3utr, function(i){
   as.numeric(gsub(",", "", i[["TRUE"]]));
}));

## Subset txListsAllTPM for detected transcripts
txListsAllTPMdet <- lapply(txListsAllTPM, function(i){
   intersect(i, detectedTx);
});


## Make a small table with txListsAll counts
txListsAllDF <- data.frame(geneListsAll=lengths(geneListsAll),
   txListsAllTPM=lengths(txListsAllTPM),
   txListsAllTPMdet=lengths(txListsAllTPMdet),
   non_coding=lengths(txListsAllTPMncDet),
   pct_non_coding=lengths(txListsAllTPMncDet)/lengths(txListsAllTPMdet),
   protein_coding=lengths(txListsAllTPMcdsDet),
   pct_protein_coding=lengths(txListsAllTPMcdsDet)/lengths(txListsAllTPMdet),
   protein_coding_with3utr=txListsAllTPMcdsDetHas3utrV,
   pct_protein_coding_with3utr=txListsAllTPMcdsDetHas3utrV/lengths(txListsAllTPMcdsDet),
   protein_coding_without3utr=lengths(txListsAllTPMcdsDet)-txListsAllTPMcdsDetHas3utrV
);
txListsAllDF$otherNc <- txListsAllDF[,"txListsAllTPMdet"] - (txListsAllDF[,"non_coding"] + txListsAllDF[,"protein_coding"]);

txListsAllDFuse <- t(as.matrix(txListsAllDF[,c(
   "protein_coding_with3utr","protein_coding_without3utr",
   "non_coding","otherNc"
   )]));

## Scale to 100%
txListsAllDFuseScaled <- t(t(txListsAllDFuse) / colSums(txListsAllDFuse))*100;

txListsAllDFuseScaled2 <- melt(txListsAllDFuseScaled);
txListsAllDFuseScaled2$Var1 <- factor(txListsAllDFuseScaled2$Var1,
   levels=unique(provigrep(c("other","noncoding|non.coding","without","with","."),
      as.character(txListsAllDFuseScaled2$Var1))));

txListsAllDFuseScaled2$group <- pasteByRowOrdered(txListsAllDFuseScaled2[,c("Var2","Var1")]);
```


### Properties of hippocampus dendritic RNAs, refers to S7


#### Heatmap of high-confidence dendritic, cell body retained, and non-pyramidal RNA lists, refers to S7A

A heatmap is used to show the pattern of gene expression among the
Neuronal Gene Lists defined above.

```{r, heatmap_genelists345_prep}
## Display only genes from the 
glWhich <- c("CBonly", "non-pyramidal", "CB1andDE1");
gl4 <- unname(unlist(geneListsAll[glWhich]));
## Make a vector of gene list names, named by gene
gl4names <- nameVector(rep(names(geneListsAll[glWhich]),
   lengths(geneListsAll[glWhich])),
   gl4);
iSamplesGrpO <- mixedSort(iSamplesGrp);
## Use all gene group median values
iM4 <- farrisGeneGroupMedians[gl4,iSamplesGrpO];

## Column annotations using each experimental factor (not used)
colHA3 <- HeatmapAnnotation(df=colDataColorsGrpL[iSamplesGrpO,2:1],
   show_annotation_name=TRUE,
   col=pheat_colors);

## Column annotations using group name and color
colHA3grp <- HeatmapAnnotation(Group=rownames(colDataColorsGrpL[iSamplesGrpO,]),
   annotation_height=unit(7, "mm"), height=unit(7, "mm"),
   border=TRUE,
   annotation_legend_param=list(
      title_gp=gpar(fontsize=12, fontface="bold")),
   col=list(Group=colorSub[iSamplesGrpO]));

## Row annotations as colors with no labels
rowHA3 <- HeatmapAnnotation(which="row",
   df=data.frame(List=gl4names[rownames(iM4)]),
   annotation_width=unit(7, "mm"), width=unit(7, "mm"),
   col=list(List=colorSubGeneLists[names(geneListsAll)[3:5]]));
## Row annotations with block labels
rowHA3B <- rowAnnotation(
   foo=anno_block(gp=gpar(fill=colorSubGeneLists[sort(names(geneListsAll)[3:5])]),
      labels=sort(names(geneListsAll)[3:5]),
      labels_gp=gpar(col="white"))
   );

## Define color ramp
pheat_expr_breaks <- warpAroundZero(seq(from=-2, to=18, length.out=51),
   baseline=5,
   lens=4);
range(pheat_expr_breaks);
cBRexpr <- circlize::colorRamp2(breaks=pheat_expr_breaks,
   colors=colorRampPalette(tail(warpRamp(getColorRamp("RdBu_r", n=15), lens=0), 12))(51));
```

```{r, heatmap_genelists345, fig.height=10, fig.width=8}

## Create the heatmap
splitF <- factor(gl4names[rownames(iM4)],
   levels=(c("CBonly","non-pyramidal","CB1andDE1")));
glHM <- Heatmap(iM4[,iSamplesGrpO],
   name="Log2 Counts",
   show_parent_dend_line=FALSE,
   cluster_columns=FALSE,
   show_row_names=FALSE,
   clustering_method_rows="ward.D2",
   clustering_distance_rows="euclidean",
   #clustering_method_rows="complete",
   #clustering_distance_rows="maximum",
   column_dend_height=unit(20, "mm"),
   row_dend_width=unit(20, "mm"),
   top_annotation=colHA3grp,
   left_annotation=rowHA3B,
   split=factor(gl4names[rownames(iM4)], levels=names(geneListsAll)[3:5]),
   gap=unit(3, "mm"),
   use_raster=TRUE,
   border=TRUE,
   heatmap_legend_param=list(
      grid_width=unit(8, "mm"),
      border="black",
      title_gp=gpar(fontsize=12, fontface="bold"),
      legend_height=unit(30, "mm")),
   row_dend_side="left",
   raster_device="CairoPNG",
   col=cBRexpr);
ComplexHeatmap::draw(glHM + rowHA3B, merge_legend=TRUE);
## Print the number of rows and columns in the heatmap
## (broken in non-interactive sessions)
if (1 == 2) {
   legend(x=grconvertX(0.00, "ndc"),
      y=grconvertY(0.05, "ndc"),
      legend=paste(
         dim(iM4[,iSamplesGrpO]),
         c("rows", "columns")),
      cex=0.8,
      xpd=NA,
      bg="transparent",
      box.col="transparent");
}

```


#### Percent breakdown of transcript type by transcript list, refers to S7B

A stacked bar chart is used to show the relative abundances of
transcript types, for each Neuronal Transcript List.

```{r, tx_list_by_type_key, fig.height=6, fig.width=7}
## Define the colors by type
nType <- nrow(txListsAllDFuse);
nList <- ncol(txListsAllDFuse);
txColorSubL <- split(unname(
   color2gradient(n=nType,
      reverseGradient=FALSE,
      gradientWtFactor=1,
      rainbowJam(nList))),
   factor(rep(colnames(txListsAllDFuse), each=nType),
      levels=colnames(txListsAllDFuse)));
txColorSubM <- do.call(cbind, txColorSubL);
rownames(txColorSubM) <- rownames(txListsAllDFuse);
colnames(txColorSubM) <- NULL;
par("mar"=c(4,14,4,2));
imageByColors(txColorSubM,
   main="Transcript Type Color Key");
```

```{r, tx_list_by_type_fig, fig.height=12, fig.width=12}
txColorSubV <- nameVector(unlist(txColorSubL),
   txListsAllDFuseScaled2$group);

ggTxType <- ggplot(txListsAllDFuseScaled2, aes(x=Var2, fill=group, color=group)) +
   geom_bar(aes(weight=value)) +
   scale_fill_manual(values=txColorSubV) + 
   scale_color_manual(values=makeColorDarker(darkFactor=20, txColorSubV)) +
   theme_jam(base_size=20) +
   theme(
      panel.grid.major.x=element_blank(),
      panel.grid.minor.x=element_blank()) +
   ylab("Percent") +
   xlab("Transcript List") +
   ggtitle("txListsAllTPM by gene_type and 3'UTR") +
   theme(legend.position="none");
print(ggTxType);
```

#### Heatmap of hypergeometric enrichment P-values of cell type by RNA list, refers to S7C

In preparation.

#### Violin plot of annotated 3'UTR lengths in each RNA list, refers to S7D

We plotted 3'UTR lengths as violin plots using only detected transcripts
based upon TPM criteria described previously.

```{r, utr_length, fig.width=12, fig.height=10, dependson=c("three_utr_length")}
## Subset the txListsAll by those with three prime utr data
txLists3utr <- lapply(txListsAll, function(i){
   intersect(i, names(GencodeVM12mm10threeUtrLength));
});

## Pull pre-computed data from the farrisdata package
data(GencodeVM12mm10cai);
data(GencodeVM12mm10cdsLength);
data(GencodeVM12mm10caiCtLow);
data(GencodeVM12mm10caiCtBad);

## Create data in format for violin plots using ggplot2
violin3utrDF <- data.frame(check.names=FALSE,
   stringsAsFactors=FALSE,
   Tx=unlist(txLists3utr),
   gene_name=tx2geneDF[match(unlist(txLists3utr), tx2geneDF$transcript_id),"gene_name"],
   Subset=factor(rep(names(txLists3utr), lengths(txLists3utr)),
      levels=names(txLists3utr)),
   detectedTx=ifelse(unlist(txLists3utr) %in% detectedTxNonMito,
      "Detected Tx", "Undetected Tx"),
   threeUtrLength=GencodeVM12mm10threeUtrLength[unlist(txLists3utr)],
   txCai=GencodeVM12mm10cai[unlist(txLists3utr)],
   cdsLength=rmNA(naValue=0, GencodeVM12mm10cdsLength[unlist(txLists3utr)])
   );
## Now make a version that includes Tx counts in the labels
violin3utrDF$SubsetCount <- splicejam::factor2label(
   pasteByRowOrdered(violin3utrDF[,c("Subset","detectedTx")]),
   itemName="txs");
levels(violin3utrDF$SubsetCount) <- gsub("_([^(]+) Tx (.+) (txs)", " \\2 \\1 \\3",
   levels(violin3utrDF$SubsetCount))
colorSubGeneLists2 <- unlist(lapply(names(colorSubGeneLists), function(i){
   j <- vigrep(i, levels(violin3utrDF$SubsetCount));
   nameVector(rep(colorSubGeneLists[i], length(j)), j);
}));
## Note some tricky math, we take the median log length, then exponentiate
violin3utrDF$SubsetLabel <- splicejam::factor2label(
   pasteByRowOrdered(violin3utrDF[,c("Subset","detectedTx")]),
   types="count",
   aggFun=function(x,...){2^median(log2(x[x>10]),na.rm=TRUE)},
   valuesL=violin3utrDF[,"threeUtrLength",drop=FALSE],
   itemName="txs");
violin3utrDF$log3utr <- log10(1+violin3utrDF$threeUtrLength);

## Subset for specific neuronal groups
violin3utrDFsub <- subset(violin3utrDF, Subset %in% names(txListsAll)[c(3,4,5)]);

## 3 prime UTR length violin plots using ggplot2
gg3utr <- ggplot(data=subset(violin3utrDF, detectedTx %in% "Detected Tx"),
   aes(x=Subset, y=threeUtrLength)) +
   scale_y_continuous(trans="log10",
      breaks=c(10,20,50,100,200,300,500,750,1000,1500,2000,5000,10000,20000,50000),
      limits=10^c(1,4.35)) +
   #annotation_logticks(sides="lr") +
   facet_wrap(~detectedTx) +
   geom_violin(aes(fill=SubsetCount),
      scale="area",
      size=1,
      alpha=0.8,
      trim=TRUE,
      draw_quantiles=c(0.5)) +
   theme_jam(grid.minor.size=0) +
   scale_color_manual(values=colorSubGeneLists2) +
   scale_fill_manual(values=colorSubGeneLists2);
gg3utr;

```

##### Statistical tests comparing three prime UTR lengths

```{r, three_utr_stats}
## Wilcoxon t-test comparing the percent CAI
threeUtrCBonly <- subset(violin3utrDF, detectedTx %in% "Detected Tx" &
   Subset %in% "CBonly")$threeUtrLength;
threeUtrNonPyr <- subset(violin3utrDF, detectedTx %in% "Detected Tx" &
   Subset %in% "non-pyramidal")$threeUtrLength;
threeUtrCB1andDE1 <- subset(violin3utrDF, detectedTx %in% "Detected Tx" &
   Subset %in% "CB1andDE1")$threeUtrLength;

## parametric t-tests, using log10-transformed UTR lengths
t35utr <- t.test(log10(threeUtrCBonly), log10(threeUtrCB1andDE1));
t34utr <- t.test(log10(threeUtrCBonly), log10(threeUtrNonPyr));
t45utr <- t.test(log10(threeUtrNonPyr), log10(threeUtrCB1andDE1));

## Wilcoxon non-parametric t-tests
wt35utr <- wilcox.test(threeUtrCBonly, threeUtrCB1andDE1);
wt34utr <- wilcox.test(threeUtrCBonly, threeUtrNonPyr);
wt45utr <- wilcox.test(threeUtrNonPyr, threeUtrCB1andDE1);

TtestStatsUtr <- list(t35utr=t35utr,
   t34utr=t34utr,
   t45utr=t45utr,
   wt35utr=wt35utr,
   wt34utr=wt34utr,
   wt45utr=wt45utr);
TtestStatsUtrDF <- rbindList(lapply(names(TtestStatsUtr), function(t1){
   t <- TtestStatsUtr[[t1]];
   data.frame(data.name=t$data.name,
      p.value=t$p.value,
      method=t$method);
}));

## Print stats table
colorSubMethod <- group2colors(levels(TtestStatsUtrDF$method));
TtestStatsUtrDF$p.value <- format(TtestStatsUtrDF$p.value,
   digits=3,
   trim=TRUE);
TtestStatsUtrDF2 <- kable_coloring(TtestStatsUtrDF,
      colorSub=colorSubMethod,
      row_color_by="method",
      returnType="kable") %>%
   row_spec(0, background="#DDDDDD")
TtestStatsUtrDF2;

```


#### Proximal versus distal 3'UTR detected in each RNA list per compartment, refers to S7E

ALE elements are defined by unique 5-prime ends of each 3-prime UTR,
which allows for overlapping regions downstream, but protects against
a prevalent annotation artifact seen with automated gene transcript
models. For example, certain genes have a premature alternative
STOP codon annotated early in the CDS, causing some isoforms to annotate
nearly the entire transcript as 3-prime UTR. We found that taking the
5-prime end of 3-prime UTR regions was able to condense shared
3-prime UTR regions for transcript isoforms per gene, sufficient for
this analysis.

The resulting data matrix contains rows of ALE elements per gene.
The subsequent violin plot compares the ALE2 expression to ALE1
expression, where ALE1 has been subtracted to yield a log2 fold
change.

```{r, tx2ale_prep}
## Check for valid vM12txdb since it is not cached properly by knitr
if (!DBI::dbIsValid(dbconn(vM12txdb))) {
   vM12txdb <- AnnotationDbi::loadDb(file=localDb);
}
## Define transcripts with ALE regions
formatInt <- function(x,...){format(x, scientific=FALSE, trim=TRUE, big.mark=",", ...);}

aleL <- tx2ale(
   threeUtrGRL=gencode3utr,
   txdb=vM12txdb,
   detectedTx=detectedTx,
   iMatrix=assays(farrisTxSE)[["tpm"]],
   method="flank",
   verbose=TRUE,
   tx2geneDF=tx2geneDF);
## use tx2ale to calculate ALE using count data
iMatrixAleCt <- log2(1+
   shrinkMatrix(
      2^(assays(farrisTxSE[names(aleL$tx2ale),])[["counts"]])-1,
      groupBy=aleL$tx2ale,
      shrinkFunc=function(x){sum(x, na.rm=TRUE)},
      returnClass="matrix"));

```

We also applied logical filtering to the resulting genes,
so the Neuronal Gene Lists defined above would be represented
by genes in each region according to these rules:

* `anyGenes1` is allowed to contain expression data from all genes
in the `anyGenes` list from any region.
* `DEandCB` only includes genes where the ALE data is meets the
thresholds for CB and DE samples.
* `CBonly` only includes genes present in CB samples, and excludes
DE samples.
* `non-pyramidal` only includes genes present in DE samples, and
excludes CB samples.
* `CB1andDE1` only includes genes where the ALE data is meets the
thresholds for CB and DE samples.

Data were filtered so that the aggregate ALE TPM value for either
ALE1 or ALE2 for each gene was at least 2.

```{r, ale_analysis, dependson="tx2ale_prep", fig.height=10, fig.width=12}
groups <- nameVector(colData(farrisTxSE[,iSamples])$groupName, iSamples);
facet_groups <- nameVector(gsub("^.+_", "", groups), names(groups))

## Filter genes for the appropriate geneLists entry
subsetCBDE <- function(iDiffAleTall2ctr) {
   iDFviolinGeneL <- split(iDiffAleTall2ctr$gene_name,
      pasteByRowOrdered(iDiffAleTall2ctr[,c("geneList","Region")]));
   DEandCBsubset1 <- intersect(iDFviolinGeneL[["DEandCB_CB"]],
      iDFviolinGeneL[["DEandCB_DE"]]);
   CB1andDE1subset1 <- intersect(iDFviolinGeneL[["CB1andDE1_CB"]],
      iDFviolinGeneL[["CB1andDE1_DE"]]);
   iDiffAleTall2ctrUse <- subset(iDiffAleTall2ctr,
      (geneList %in% "DEandCB" & gene_name %in% DEandCBsubset1) |
      (geneList %in% "CB1andDE1" & gene_name %in% CB1andDE1subset1) |
      (geneList %in% "CBonly" & Region %in% "CB") |
      (geneList %in% "non-pyramidal" & Region %in% "DE") |
      (geneList %in% c("anyGenes1"))
      );
   #print(table(iDiffAleTall2ctrUse[,c("Region","geneList")]));
   iDiffAleTall2ctrUse;
}

## Create violin plot data
iMatrixAleGrp <- rowGroupMeans(aleL$iMatrixAle[,iSamples],
   useMedian=FALSE,
   groups=groups);
aleViolinL <- ale2violin(aleL$iMatrixAle[,iSamples],
   geneLists=geneListsAll,
   maxGroupMeanALE=2,
   groups=groups,
   lineAlpha=0.1,
   subsetFunc=subsetCBDE,
   facet_name="Region",
   verbose=FALSE,
   facet_groups=facet_groups);
printDebug("Number of genes represented in each panel (TPM calculations):");
table(aleViolinL$iDiffAleTall2ctrFacet[,c("Region","geneList")])/2;
print(aleViolinL$ggAle2);


```


#### Violin plot of the average percentage of codons with CAI <= 0.5 per RNA list, refers to S7F

Codon CAI values are loaded via the `farrisdata` R package, and can be
calculated using methods described in the `jampack` R package suite.


```{r, codon_usage, fig.width=12, fig.height=10, dependson=c("three_utr_length")}
## codon adaptability index (cai) values are pre-computed and
## available from the farrisdata package
#data(GencodeVM12mm10cai);

## Subset transcript lists for entries having CAI data,
## which mostly filters for CDS-encoding Txs
txListsCai <- lapply(txListsAll, function(i){
   intersect(i, names(GencodeVM12mm10cai))
});
txListsCaiTPM <- lapply(txListsAllTPM, function(i){
   intersect(i, names(GencodeVM12mm10cai))
});
#txListsCai <- txListsCaiTPM;


## Create data in format for violin plots using ggplot2
violinCaiDF <- data.frame(check.names=FALSE,
   stringsAsFactors=FALSE,
   Tx=unlist(txListsCai),
   gene_name=tx2geneDF[match(unlist(txListsCai), tx2geneDF$transcript_id),"gene_name"],
   Subset=factor(rep(names(txListsCai), lengths(txListsCai)),
      levels=names(txListsCai)),
   detectedTx=ifelse(unlist(txListsCai) %in% detectedTxNonMito,
      "Detected Tx", "Undetected Tx"),
   txCai=GencodeVM12mm10cai[unlist(txListsCai)],
   cdsLength=GencodeVM12mm10cdsLength[unlist(txListsCai)],
   ctLowCai=GencodeVM12mm10caiCtLow[unlist(txListsCai)],
   pctLowCai=(GencodeVM12mm10caiCtLow[unlist(txListsCai)] /
         GencodeVM12mm10cdsLength[unlist(txListsCai)] *100),
   ctBadCai=GencodeVM12mm10caiCtBad[unlist(txListsCai)],
   pctBadCai=(GencodeVM12mm10caiCtBad[unlist(txListsCai)] /
         GencodeVM12mm10cdsLength[unlist(txListsCai)] *100),
   txCaiQ1=GencodeVM12mm10caiQ1mean[unlist(txListsCai)]
   );

## Now make a version that includes Tx counts in the labels
violinCaiDF$SubsetCount <- splicejam::factor2label(violinCaiDF$Subset,
   itemName="txs");
violinCaiDF$SubsetLabel <- splicejam::factor2label(violinCaiDF$Subset,
   types="count",
   aggFun=median,
   valuesL=violinCaiDF[,"pctLowCai",drop=FALSE],
   itemName="txs");
colorSubGeneLists3 <- unlist(lapply(names(colorSubGeneLists), function(i){
   j <- vigrep(i, c(
      levels(violinCaiDF$SubsetLabel),
      levels(violinCaiDF$SubsetCount)));
   nameVector(rep(colorSubGeneLists[i], length(j)), j);
}));

## Subset for specific neuronal groups
violinCaiDFsub <- subset(violinCaiDF, Subset %in% names(txListsAll)[c(3,4,5)]);

## CAI violin plots using ggplot2
ggPctLowCai <- ggplot(data=subset(violinCaiDFsub, detectedTx %in% "Detected Tx"),
   aes(x=Subset, y=pctLowCai)) +
   #aes(x=Subset, y=txCai)) +
   facet_wrap(~detectedTx) +
   geom_violin(aes(fill=SubsetLabel), scale="area",
   #geom_violin(aes(fill=SubsetCount), scale="area",
      alpha=0.8, trim=TRUE, draw_quantiles=c(0.5)) +
   ylim(c(0, 20)) +
   theme_jam(grid.minor.size=0) +
   scale_color_manual(values=colorSubGeneLists3) +
   scale_fill_manual(values=colorSubGeneLists3);
ggPctLowCai;

```

##### Statistical tests comparing gene lists

```{r, cai_stats}
## Wilcoxon t-test comparing the percent CAI
caiCBonly <- subset(violinCaiDFsub, detectedTx %in% "Detected Tx" &
   Subset %in% "CBonly")$pctLowCai;
caiNonPyr <- subset(violinCaiDFsub, detectedTx %in% "Detected Tx" &
   Subset %in% "non-pyramidal")$pctLowCai;
caiCB1andDE1 <- subset(violinCaiDFsub, detectedTx %in% "Detected Tx" &
   Subset %in% "CB1andDE1")$pctLowCai;

## parametric t-tests
t35cai <- t.test(caiCBonly, caiCB1andDE1);
t34cai <- t.test(caiCBonly, caiNonPyr);
t45cai <- t.test(caiNonPyr, caiCB1andDE1);

## Wilcoxon non-parametric t-tests
wt35cai <- wilcox.test(caiCBonly, caiCB1andDE1);
wt34cai <- wilcox.test(caiCBonly, caiNonPyr);
wt45cai <- wilcox.test(caiNonPyr, caiCB1andDE1);

TtestStatsCai <- list(t35cai=t35cai,
   t34cai=t34cai,
   t45cai=t45cai,
   wt35cai=wt35cai,
   wt34cai=wt34cai,
   wt45cai=wt45cai);
TtestStatsCaiDF <- rbindList(lapply(names(TtestStatsCai), function(t1){
   t <- TtestStatsCai[[t1]];
   data.frame(data.name=t$data.name,
      p.value=t$p.value,
      method=t$method);
}));

## Print stats table
colorSubMethod <- group2colors(levels(TtestStatsCaiDF$method));
#TtestStatsCaiDF$p.value <- sfsmisc::pretty10exp(TtestStatsCaiDF$p.value);
TtestStatsCaiDF$p.value <- format(TtestStatsCaiDF$p.value,
   digits=3,
   trim=TRUE);
TtestStatsCaiDF2 <- kable_coloring(TtestStatsCaiDF,
      colorSub=colorSubMethod,
      row_color_by="method",
      returnType="kable") %>%
   row_spec(0, background="#DDDDDD")
TtestStatsCaiDF2;

```


### Differential transcript isoform analysis limma

Differential isoform tests were performed using the R limma
package, and the `diffSplice()` function, using normalized
TPM abundances.

#### Definition of experiment design and contrast matrices

The experimental design and contrast matrices are defined based
upon the Limma Users Guide, using zero intercept. All pairwise and
two-way contrasts are included, provided each contrast changes only
one factor at a time.

```{r, design_matrices}
## Design matrix is defined by the sample groups
iGroups <- nameVector(
   colData(farrisTxSE[,iSamples])$groupName,
      iSamples);
## Re-order factor levels so CA2 is first
iGroups <- factor(iGroups,
   levels=provigrep(c("CA2", "."), levels(iGroups)));
iDC <- groups2contrasts(iGroups, returnDesign=TRUE);
iDesign <- iDC$iDesign;
## Alternative "manual" method
#iDesign <- stats::model.matrix(~0+iGroups);
#colnames(iDesign) <- levels(iGroups);
#rownames(iDesign) <- names(iGroups);

## Contrasts are CB-DE
iContrasts <- iDC$iContrasts;
## Alternative format, useful for custom contrasts
#iContrasts <- limma::makeContrasts(
#   contrasts=c(
#      "CA2_DE-CA2_CB",
#      "CA1_DE-CA1_CB"),
#   levels=iDesign);

```

#### Differential isoform statistical tests

Differential isoforms are determined using the `limma::diffSplice()`
function, made conveniently available in the `splicejam::runDiffSplice()`
function.

```{r, run_diffSplice, dependson=c("design_matrices","detectedTxTPM")}
iMatrixTx <- assays(farrisTxSE[,iSamples])[["tpm"]];
iMatrixTx[iMatrixTx == 0] <- NA;
diffSpliceL <- runDiffSplice(
   iMatrixTx=iMatrixTx,
   detectedTx=detectedTx,
   tx2geneDF=tx2geneDF,
   iDesign=iDesign,
   iContrasts=iContrasts,
   cutoffFDR=0.05,
   cutoffFold=1.5,
   collapseByGene=TRUE,
   spliceTest="t",
   verbose=FALSE,
   useVoom=FALSE);
diffSpliceHitsL <- lapply(diffSpliceL$statsDFs, function(iDF){
   hitColname <- head(vigrep("^hit ", colnames(iDF)), 1);
   as.character(subset(iDF, iDF[[hitColname]] != 0)$gene_name);
});

## diffSplice transcript level
diffSpliceTxL <- runDiffSplice(
   iMatrixTx=iMatrixTx,
   detectedTx=detectedTx,
   tx2geneDF=tx2geneDF,
   iDesign=iDesign,
   iContrasts=iContrasts,
   cutoffFDR=0.05,
   cutoffFold=1.5,
   collapseByGene=FALSE,
   spliceTest="t",
   verbose=FALSE,
   useVoom=FALSE);
diffSpliceHitsTxL <- lapply(diffSpliceTxL$statsDFs, function(iDF){
   hitColname <- head(vigrep("^hit ", colnames(iDF)), 1);
   as.character(subset(iDF, iDF[[hitColname]] != 0)$transcript_id);
});
#sdim(diffSpliceHitsTxL);
```

```{r, tx_hit_summary, dependson="run_diffSplice"}
## Subsets of named contrasts
DEcon <- unvigrep("_CB", names(diffSpliceHitsTxL));
CBcon <- unvigrep("_DE", names(diffSpliceHitsTxL));
DEconHits <- length(unique(unlist(
   diffSpliceHitsL[DEcon])));
CBconHits <- length(unique(unlist(
   diffSpliceHitsL[CBcon])));
DEconHitsTx <- length(unique(unlist(
   diffSpliceHitsTxL[DEcon])));
CBconHitsTx <- length(unique(unlist(
   diffSpliceHitsTxL[CBcon])));

CA1CA2CBcon <- vigrep("CA1.*CA2",
   unvigrep("DE", names(diffSpliceHitsTxL)));
CA1CA2DEcon <- vigrep("CA1.*CA2",
   unvigrep("CB", names(diffSpliceHitsTxL)));
CA1CA2CBconHits <- length(diffSpliceHitsL[[CA1CA2CBcon]]);
CA1CA2DEconHits <- length(diffSpliceHitsL[[CA1CA2DEcon]]);
CA1CA2CBconHitsTx <- length(diffSpliceHitsTxL[[CA1CA2CBcon]]);
CA1CA2DEconHitsTx <- length(diffSpliceHitsTxL[[CA1CA2DEcon]]);

CA1CA2Twocon <- vigrep("CA1.*CA2_CB",
   vigrep("DE", names(diffSpliceHitsTxL)));
CA1CA2TwoconHitsTx <- length(diffSpliceHitsTxL[[CA1CA2Twocon]]);
CA1CA2TwoconHits <- length(diffSpliceHitsL[[CA1CA2Twocon]]);

```

Statistical hits were defined as having a linear fold change >= 1.5,
and a Benjamini Hochberg adjusted P-value <= 0.05.

* **`r formatInt(nrow(diffSpliceL$statsDFs[[1]]))`** genes with multiple detected isoforms
were tested, which include
* **`r formatInt(nrow(diffSpliceTxL$statsDFs[[1]]))`** detected isoforms in
these multi-isoform genes.
* **`r formatInt(length(unique(unlist(diffSpliceHitsTxL))))`** total unique differential
transcript isoforms from
* **`r formatInt(length(unique(unlist(diffSpliceHitsL))))`** unique genes across all
comparisons.

Interestingly, only a minor percentage of isoform hits came from cell body
comparisons (e.g. CA1 cell body vs CA2 cell body,
**`r formatInt(length(diffSpliceHitsTxL[["CA1_CB-CA2_CB"]]))`**
differentially spliced transcript isoforms from 
**`r formatInt(length(diffSpliceHitsL[["CA1_CB-CA2_CB"]]))`** unique genes),
suggesting that mature hippocampal neurons overwhelmingly express
the same isoforms in similar proportions for co-expressed genes,
albeit with nearly
**`r formatInt(ceiling(length(diffSpliceHitsL[["CA1_CB-CA2_CB"]]) / 50)*50)`**
exceptions.

In contrast, the vast majority of differentially expressed isoform
hits came from dendrite comparisons
(**`r formatInt(DEconHitsTx)`**
differentially spliced transcript isoforms from
**`r formatInt(DEconHits)`** unique genes). A subset of these hits
overlapped with the two-way comparison (e.g. CA2 cell body to
dendrite vs CA1 cell body to dendrite,
**`r CA1CA2TwoconHitsTx`** differentially spliced transcript
isoforms from **`r CA1CA2TwoconHitsTx`** unique genes),
suggesting either cell-type and/or compartment specific splicing.


#### Export stats results to Excel

The stats results will be exported to Excel, using the `openxlsx` R
package, storing each stats table as its own worksheet.

```{r, export_stats_xlsx, dependson="run_diffSplice"}
GroupColnames <- c("Compartment","Region");
contrastsDF <- iDC$iContrastNames;
colnames(contrastsDF)[1:2] <- GroupColnames;
contrastsDFsplit <- gsub("^_", "",
   jamba::pasteByRow(
      do.call(cbind,
         lapply(nameVector(GroupColnames), function(i){
            ifelse(lengths(strsplit(as.character(contrastsDF[[i]]), ",")) > 1,
               i,
               NA)
         })
      )
   )
);
contrastsL <- split(as.character(contrastsDF$contrastName),
   contrastsDFsplit);
con2label <- function(i) {
   gsub("_$", "", cPaste(sep="_", doSort=FALSE,
      lapply(i, function(j){
         sortSamples(preControlTerms=c("CB","DE"),
            unique(unlist(
               strsplit(j, "[-_()]+"))));
      })
   ))
}
## Loop through each contrast list and save an Excel file
for (i in nameVectorN(contrastsL)) {
   iXlsx <- paste0("DiffIsoforms_by_", i, "_", jamba::getDate(), ".xlsx");
   #printDebug(i, "", ": ", iXlsx,
   #   ", ", length(contrastsL[[i]]), " worksheets");
   iConLabels <- con2label(contrastsL[[i]]);
   iMatch <- match(unique(iConLabels), iConLabels);
   for (iCon in contrastsL[[i]][iMatch]) {
      iLab <- con2label(iCon);
      #printDebug("   ", iLab);
      #iDFtx <- diffSpliceTxL$statsDFs[[iCon]];
      ## Gene-centric stats table
      iDFgene <- diffSpliceL$statsDFs[[iCon]];
      ## Remove t-statistic column, reorder hit column
      iDFgene <- dplyr::select(iDFgene,
            -tidyselect::matches("^t ")) %>%
         dplyr::select(tidyselect::matches("gene|transcript"),
            tidyselect::matches("^hit "),
            tidyselect::everything());
      jamba::writeOpenxlsx(file=iXlsx,
         x=iDFgene,
         sheetName=iLab,
         append=(!iCon == contrastsL[[i]][[1]]),
         highlightColumns=igrep("gene|transcript", colnames(iDFgene)),
         pvalueColumns=igrep("p.val|fdr", colnames(iDFgene)),
         lfcColumns=igrep("logfc", colnames(iDFgene)),
         numColumns=igrep("groupmean", colnames(iDFgene)),
         intColumns=igrep("^numTx", colnames(iDFgene)),
         intRule=c(0,7,15),
         hitColumns=igrep("^hit ", colnames(iDFgene)),
         freezePaneColumn=max(igrep("gene|transcript", colnames(iDFgene))),
         freezePaneRow=2
      );
   }
}

```

### Splicing, Sashimi plots

Two genes with interesting alternative transcript isoforms are
**Ntrk2** and **Gria1**. Sashimi plots are shown below, which
represent the RNA-seq sequence read coverage across the exons,
and the splice junction reads that span from exon to exon.

The plots also include the transcript-exon model, and the flattened
gene-exon model.

First, some Sashimi plot data is prepared:

* Exons per transcript, and CDS exons per transcript,
which are derived from the Gencode GTF loaded previously.
* `filesDF` which is a `data.frame` describing each RNA-seq file
used to load sequence read coverage, and splice junction reads.
* `sample_id` which is a subset of sample_id values described
in `filesDF`.

```{r, sashimi_prep}
## Check for valid TxDb
if (!DBI::dbIsValid(AnnotationDbi::dbconn(vM12txdb))) {
   vM12txdb <- AnnotationDbi::loadDb(file=localDb);
}

## Obtain the exons per transcript
exonsByTx <- exonsBy(vM12txdb, by="tx", use.names=TRUE);
## Obtain CDS exons per transcript, used to show UTR as non-CDS
cdsByTx <- cdsBy(vM12txdb, by="tx", use.names=TRUE);

## flatten exons by gene for these two genes
flatExonsByGene2 <- flattenExonsBy(exonsByTx=exonsByTx,
   tx2geneDF=tx2geneDF,
   detectedTx=detectedTx,
   cdsByTx=cdsByTx,
   by="gene",
   genes=c("Ntrk2", "Gria1"));

## flatten CDS exons by gene for these two genes
flatExonsByTx2 <- flattenExonsBy(exonsByTx=exonsByTx,
   tx2geneDF=tx2geneDF,
   detectedTx=detectedTx,
   cdsByTx=cdsByTx,
   by="tx",
   genes=c("Ntrk2", "Gria1"));

## Define the files used for the sashimi plots
filesDF <- farrisdata::farris_sashimi_files_df;
## Define a subset of sample_id to display
sample_id <- c("CA1_CB", "CA1_DE", "CA2_CB", "CA2_DE");
```

#### Sashimi plot for Ntrk2

```{r, sashimi_ntrk2_prep, fig.height=14, fig.width=16}
## prepare sashimi plot data for Ntrk2
sashimi_ntrk2 <- prepareSashimi(gene="Ntrk2",
   tx2geneDF=tx2geneDF,
   flatExonsByTx=flatExonsByTx2,
   flatExonsByGene=flatExonsByGene2,
   colorSub=farrisdata::colorSub,
   filesDF=filesDF,
   sample_id=sample_id);
```

```{r, sashimi_ntrk2, fig.height=14, fig.width=16}
## prepare gene-exon model for Ntrk2
gg_ntrk2 <- gene2gg(gene="Ntrk2",
   gene_order="last",
   labelExons=FALSE,
   exonLabelSize=6,
   direction="y",
   flatExonsByGene=flatExonsByGene2,
   flatExonsByTx=flatExonsByTx2)

## use cowplot to assemble sashimi plot and gene-exon model
cp_ntrk2 <- cowplot::plot_grid(
   gg_ntrk2 +
      theme(axis.text.x=element_blank()) +
      #ggtitle(NULL) +
      xlab(NULL),
   plotSashimi(sashimi_ntrk2,
         color_sub=farrisdata::colorSub),
   ncol=1,
   align="v",
   axis="lr",
   rel_heights=c(1,2));
cp_ntrk2;
```

#### Sashimi plot for Gria1

```{r, sashimi_gria1_prep, fig.height=14, fig.width=16}
## prepare sashimi plot data for Ntrk2
sashimi_gria1 <- prepareSashimi(gene="Gria1",
   tx2geneDF=tx2geneDF,
   flatExonsByTx=flatExonsByTx2,
   flatExonsByGene=flatExonsByGene2,
   filesDF=filesDF,
   minJunctionScore=200,
   sample_id=sample_id);
```

```{r, sashimi_gria1, fig.height=14, fig.width=16}
## prepare gene-exon model for Ntrk2
gg_gria1 <- gene2gg(gene="Gria1",
   gene_order="last",
   labelExons=FALSE,
   exonLabelSize=6,
   direction="y",
   flatExonsByGene=flatExonsByGene2,
   flatExonsByTx=flatExonsByTx2)

## use cowplot to assemble sashimi plot and gene-exon model
cp_gria1 <- cowplot::plot_grid(
   gg_gria1 +
      theme(axis.text.x=element_blank()) +
      #ggtitle(NULL) +
      xlab(NULL),
   plotSashimi(sashimi_gria1,
         color_sub=farrisdata::colorSub),
   ncol=1,
   align="v",
   axis="lr",
   rel_heights=c(1,2));
cp_gria1;
```

Zoom into the differential slice event, that switched from
the "flop" to the "flip" form of Gria1:

```{r, sashimi_gria1_zoom, fig.height=10, fig.width=10}
## use cowplot to assemble sashimi plot and gene-exon model
gria1_r <- ranges(range(subset(flatExonsByGene2[["Gria1"]],
   gene_nameExon %in% c("Gria1_exon13", "Gria1_exon17a"))))
gria1_zoom <- c(start(gria1_r), end(gria1_r));

cp_gria1_zoom <- cowplot::plot_grid(
   gg_gria1 +
      theme(axis.text.x=element_blank()) +
      coord_cartesian(xlim=gria1_zoom, expand=FALSE) +
      xlab(NULL),
   plotSashimi(sashimi_gria1,
      label_coords=gria1_zoom,
      color_sub=farrisdata::colorSub) +
      coord_cartesian(xlim=gria1_zoom, expand=FALSE),
   ncol=1,
   align="v",
   axis="lr",
   rel_heights=c(1,3));
cp_gria1_zoom;
```

## Gene Set Enrichment

### Selection of gene hits

Used DESeq-normalized expression values, tested versus MSigDB v6.0
using hypergeometric enrichment via the R function
`stats::phyper()`. A `data.frame` summary of statistical results
was prepared, which included:

* pathway name
* pathway "Source" and "Category" as defined by MSigDB
* number of genes per pathway
* number of gene hits found per pathway
* hypergeometric enrichment P-value
* FDR-adjusted P-value
* comma-delimited gene hits per pathway

The gene symbols obtained from MSigDB were converted to most recent
mouse Entrez gene symbols using the Bioconductor package `"org.Mm.eg.db"`.
This same process was used with the gene symbols from the RNA-seq
analysis steps above, in order to ensure the gene symbols in MSigDB
and RNA-seq analysis were using the same comparable version of
Entrez gene symbols.

One `data.frame` result was prepared for each gene hit list,
and was further analyzed using methods in the `"multienrichjam"`
package.


### MultiEnrichMap

MultiEnrichMap is an analysis workflow that extends previous
innovative analysis workflows:

* "[EnrichmentMap](https://www.baderlab.org/Software/EnrichmentMap)" by the
Bader lab, ref: [Merico,et al, PLoS One, 2010](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0013984))
* "[clusterProfiler](https://bioconductor.org/packages/release/bioc/html/clusterProfiler.html)" and
"[enrichplot](https://bioconductor.org/packages/release/bioc/html/enrichplot.html)" by
Dr. Guangchuang Yu.

The previous EnrichmentMap and EnrichMap workflows are intended to help
understand and visualize gene set enrichment results in the context of
one gene hit list.

MultiEnrichMap is an extension of those workflow that compares multiple
enrichment results. This workflow is implemented in an R package
"multienrichjam", which can be installed with

> `devtools::install_github("jmw86069/multienrichjam")`

Subset pathways for top 15 per source-category

Heatmap of enrichment P-values.

Cnet plot