---
title: "DU-Bii Study cases - Mouse fibrotic kidney"
author: "Olivier Sand and Jacques van Helden"
date: '`r Sys.Date()`'
output:
  html_document:
    self_contained: no
    fig_caption: yes
    highlight: zenburn
    theme: cerulean
    toc: yes
    toc_depth: 3
    toc_float: yes
    code_folding: "hide"
  ioslides_presentation:
    slide_level: 2
    self_contained: no
    colortheme: dolphin
    fig_caption: yes
    fig_height: 5
    fig_width: 7
    fonttheme: structurebold
    highlight: tango
    smaller: yes
    toc: yes
    widescreen: yes
  slidy_presentation:
    smart: no
    slide_level: 2
    self_contained: yes
    fig_caption: yes
    fig_height: 5
    fig_width: 7
    highlight: tango
    incremental: no
    keep_md: yes
    smaller: yes
    theme: cerulean
    toc: yes
    widescreen: yes
  beamer_presentation:
    colortheme: dolphin
    fig_caption: yes
    fig_height: 5
    fig_width: 7
    fonttheme: structurebold
    highlight: tango
    incremental: no
    keep_tex: no
    slide_level: 2
    theme: Montpellier
    toc: yes
  revealjs::revealjs_presentation:
    theme: night
    transition: none
    self_contained: true
    css: ../slides.css
  pdf_document:
    fig_caption: yes
    highlight: zenburn
    toc: yes
    toc_depth: 3
  powerpoint_presentation:
    slide_level: 2
    fig_caption: yes
    fig_height: 5
    fig_width: 7
    toc: yes
font-import: http://fonts.googleapis.com/css?family=Risque
subtitle: DUBii 2020
font-family: Garamond
transition: linear
editor_options: 
  chunk_output_type: console
---


```{r settings, include=FALSE, echo=FALSE, eval=TRUE}

options(width = 300)
# options(encoding = 'UTF-8')
knitr::opts_chunk$set(
  fig.width = 7, fig.height = 5, 
  fig.path = 'figures/mouse-kidney_',
  fig.align = "center", 
  size = "tiny", 
  echo = TRUE, 
  eval = TRUE, 
  warning = FALSE, 
  message = FALSE, 
  results = TRUE, 
  comment = "")

options(scipen = 12) ## Max number of digits for non-scientific notation
# knitr::asis_output("\\footnotesize")

```


```{r libraries, echo=FALSE, eval=TRUE}
requiredLib <- c("knitr", "stringr", "vioplot")
for (lib in requiredLib) {
  if (!require(lib, character.only = TRUE)) {
    install.packages(lib, )
  }
  require(lib, character.only = TRUE)
}

```

## Reference


Pavkovic, M., Pantano, L., Gerlach, C.V. et al. Multi omics analysis of fibrotic kidneys in two mouse models. Sci Data 6, 92 (2019). 

- <https://doi.org/10.1038/s41597-019-0095-5> 
- <https://www.nature.com/articles/s41597-019-0095-5#citeas>
- Mouse fibrotic kidney browser: <http://hbcreports.med.harvard.edu/fmm/>
- Data on Zenodo: <https://zenodo.org/record/2592516>


## Samples

>Samples from two mouse models were collected. The first one is a reversible chemical-induced injury model (folic acid (FA) induced nephropathy). The second one is an irreversible surgically-induced fibrosis model (unilateral ureteral obstruction (UUO)). mRNA and small RNA sequencing, as well as 10-plex tandem mass tag (TMT) proteomics were performed with kidney samples from different time points over the course of fibrosis development. 
> In the FA model, mice were sacrificed before the treatment (day 0) and 1, 2, 7, and 14 days after a single injection of folic acid. 
> For the UUO model, mice were sacrificed before obstruction (day 0) and 3, 7, and 14 days after the ureter of the left kidney was obstructed via ligation. For both studies, kidneys were removed at each time point for total RNA isolation and protein sample preparation.

We will first explore the UUO transcriptome data.


## Data sources

| Doc | URL | 
|:-----------------------------|:--------------------------|
| Total RNA for the experiment on Unilateral ureter obstruction (UUO) model | <https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE118339> |

## Parameters

```{r parameters}
#### Define parameters for the analysis ####

## Keep a trace of the original parameters
par.ori <- par(no.readonly = TRUE)

## Analysis parameters
parameters <- list(
  dataset = "uuo", ## Supported: uuo, fa
  datatype = "transcriptome",
  epsilon = 0.1,
  minCount = 10,
  forceDownload = FALSE)

kable(as.data.frame(parameters))

```

## Output directories

```{r outdirs}
#### Output directories ####
outdirs <- list()
# outdirs$main <- getwd()
outdirs$main <- "."

## Data directory, where the data will be downloaded and uncompressed
outdirs$data <- file.path(outdirs$main, "data")
dir.create(outdirs$data, recursive = TRUE, showWarnings = FALSE)

## Main result directory
outdirs$results <- file.path(outdirs$main, "results")

# Transcriptome results
outdirs$transcriptome <- file.path(outdirs$results, "transcriptome")
dir.create(outdirs$transcriptome, recursive = TRUE, showWarnings = FALSE)


```

## Download transcriptome data

```{r download_transcriptome}
#### Download transcriptome data ####
archiveFile <- "MouseKidneyFibrOmics-v1.0.zip"
archiveURL <- file.path("https://zenodo.org/record/2592516/files/hbc", archiveFile)
localDataArchive <- file.path(outdirs$data, archiveFile)

if (file.exists(localDataArchive) & !parameters$forceDownload) {
  message("Data archive already downloaded:\n\t", localDataArchive)
} else {
  message("Downloading data archive from zenodo: ", archiveURL)
  download.file(url = archiveURL, destfile =  localDataArchive)
  
  ## Uncompess the archive
  message("Uncompressing data archive")
  unzip(zipfile = localDataArchive, exdir = outdirs$data)
}

## Define destination directory
# outdirs$csv <- file.path(outdirs$data, "CSV")
# dir.create(outdirs$csv, showWarnings = FALSE, recursive = TRUE)

```

## Load raw counts

```{r load_raw_counts}
#### Load raw counts data table ####


## Note: the "raw" counts are decimal numbers, I suspect that they have been somewhat normalised. To check. 
rawCountFile <- file.path(
  outdirs$data,
   paste0("hbc-MouseKidneyFibrOmics-a39e55a/tables/", 
          parameters$dataset, 
          "/results/counts/raw_counts.csv.gz"))
 # "hbc-MouseKidneyFibrOmics-a39e55a/tables/fa/results/counts/raw_counts.csv.gz")
rawValues <- read.csv(file = rawCountFile, header = 1, row.names = 1)

```

The RNA-seq transcriptome data was loaded as raw counts. This table contains `r nrow(rawValues)` rows (genes) and `r ncol(rawValues)` columns (samples).  


## Build metadata table

```{r metadata}
#### Build metadata table ####
metadata <- data.frame(
  dataType = "transcriptome",
  sampleName = colnames(rawValues))
metadata[ , c("condition", "sampleNumber")] <- 
  str_split_fixed(string = metadata$sampleName, pattern = "_", n = 2)

## Colors per condition
colPerCondition <- c(normal = "#BBFFBB",
                     day3 = "#FFFFDD", 
                     day7 = "#FFDD88",
                     day14 = "#FF4400")
metadata$color <- colPerCondition[metadata$condition]
```


## Compute sample-wise statistics

```{r sample_wise_stat}

sampleStat <- metadata
sampleStat$mean <- apply(X = rawValues, 2, mean)
sampleStat$median <- apply(X = rawValues, 2, median)
sampleStat$sd <- apply(X = rawValues, 2, sd)
sampleStat$min <- apply(X = rawValues, 2, min)
sampleStat$perc05 <- apply(X = rawValues, 2, quantile, prob = 0.05)
sampleStat$Q1 <- apply(X = rawValues, 2, quantile, prob = 0.25)
sampleStat$median <- apply(X = rawValues, 2, quantile, prob = 0.5)
sampleStat$Q3 <- apply(X = rawValues, 2, quantile, prob = 0.75)
sampleStat$perc95 <- apply(X = rawValues, 2, quantile, prob = 0.95)
sampleStat$max <- apply(X = rawValues, 2, max)
sampleStat$iqr <- apply(X = rawValues, 2, IQR)

## Print statistics per sample
kable(format(x = format(digits = 5, sampleStat)))

```


## Distribution of raw counts

```{r RNA-seq_count_distrib, fig.width=12, fig.height=5, out.width="70%", fig.cap="Distribution of raw counts"}
hist(unlist(rawValues), breaks = 1000, 
     main = "Raw count distribution", 
     xlab = "Raw counts", 
     ylab = "Number of genes (all samples)")
```

The distribution of raw counts is not very informative, because the range is defined by some outlier, i.e. a gene having a huge number of reads. Even with  strong zoom on the abcsissa range from 0 to 500, the histogram shows a steep drop in the first bins. 

## Distribution of raw counts - truncated abscissa

```{r RNA-seq_count_distrib_truncated, fig.width=12, fig.height=5, out.width="70%", fig.cap="Distribution of raw counts"}
#### Count distrib - truncated abscissa ####
hist(unlist(rawValues), breaks = 500000, 
     main = "Raw count distribution", 
     xlab = "Raw counts (truncated abscissa", 
     ylab = "Number of genes (all samples)",
     xlim = c(0, 500))

```

## Log2-transformed counts

A typical approach is to normalise the counts by applying a log2 transformation .
This however creates a problem when the counts of a given gene in a given sample is 0. To circumvent this, we can add an epsilon ($\epsilon = `r parameters$epsilon`$) before the log2 transformation. 

```{r log2_counts, fig.width=12, fig.height=5, out.width="70%"}

#### Log2 transformatiojn of the counts ####
log2Values <- log2(rawValues + parameters$epsilon)
hist(unlist(log2Values), breaks = seq(from = -5, to = 22, by = 0.1),
     main = "log2-counts distribution",
     xlab = "log2(Counts + epsilon)", 
     ylab = "Number of genes", col = "#BBFFBB")

```

## Two-columns test

:::::::::::::: {.columns}
::: {.column}
contents 1 ...
:::

::: {.column}
pas l'air de marcher ...
:::

::::::::::::::


## Box plots

We can now inspect the distribution of counts per sample with the `boxplot()` function.

```{r box plots, fig.width=5, fig.height=7, out.width="25%"}
#### Box plots ####
par(mar = c(4, 6, 5, 1))
boxplot(log2Values, 
        col = metadata$color,
        horizontal = TRUE, 
        las = 1, 
        main = "log2-transformed", 
        xlab = "log2(counts)")
par(par.ori)
```

## Box plot comment

We notice an obvious problem: the vast majority of counts is very small. This can result from different causes, which will not be investigated in this context. 

## Gene filtering

```{r gene_filtering, fig.width=5, fig.height=7, out.width="50%"}

## Filter out the genes with very low counts in all conditions
undetectedGenes <- apply(rawValues, MARGIN = 1, FUN = sum) < parameters$minCount
# table(undetectedGenes)
log2Filtered <- log2Values[!undetectedGenes, ]

```

We filtered out all the genes whose maximal count value across all samples was lower than `r parameters$minCount`. Among the `r nrow(rawValues)` genes from the raw count table, `r sum(undetectedGenes)` were considered undetected according to this criterion. We use the remaining `r nrow(log2Filtered)` genes for the subsequent analyses. 

## Boxplot after gene filtering

```{r boxplot_filtered, fig.width=5, fig.height=7, out.width="32%"}
#### Box plots after filtering ####
par(mar = c(4, 6, 5, 1))
boxplot(log2Filtered, 
        col = metadata$color,
        horizontal = TRUE, 
        las = 1,
        main = "Filtered genes", 
        xlab = "log2(counts)")
par(par.ori)

```

## Normalisation (more precisely: scaling)

Before going any further, it is important to ensure some normalisation of the counts, in order to correct for biases due to inter-sample differences in sequencing depth. 

For the sake of simplicity, we will use here a very simple criterion: median-based normalisation. The principle is to multiply the counts of each sample by a scaling factor in order to bring each sample to the same median count. 

## Normalisation code

```{r median_normalisation, fig.width=5, fig.height=7, out.width="25%"}

#### Median-based normalisation ####
sampleMedians <- apply(log2Filtered, 2, median)
seriesMedian <- median(sampleMedians)
scalingFactors <- seriesMedian / sampleMedians

log2Standardised <- data.frame(matrix(
  nrow = nrow(log2Filtered),
  ncol = ncol(log2Filtered)))
colnames(log2Standardised) <- colnames(log2Filtered)
rownames(log2Standardised) <- rownames(log2Filtered)
for (j in 1:ncol(log2Filtered)) {
  log2Standardised[, j] <- log2Filtered[, j] * scalingFactors[j]
}

## Check the remaining medians
apply(log2Standardised, 2, median)

```

## Normalized boxplot

```{r median_normalisation_boxplot, fig.width=5, fig.height=7, out.width="32%"}
#### Box plots after scaling ####
par(mar = c(4, 6, 5, 1))
boxplot(log2Standardised, 
        col = metadata$color,
        horizontal = TRUE, 
        las = 1,
        main = "Median-based scaled", 
        xlab = "log2(counts")
par(par.ori)


```

## Violin plots

We can also inspect the distribution of counts per sample with the `vioplot()` function.

```{r violin plots, fig.width=5, fig.height=7, out.width="25%"}
#### Violin plots ####
par(mar = c(4, 6, 5, 1))
vioplot::vioplot(log2Values, 
        col = metadata$color,
        horizontal = TRUE, 
        las = 1, 
        main = "log2-transformed", 
        xlab = "log2(counts)")
par(par.ori)
```

## Scatter plots

We can also inspect the distribution of counts per sample with the `plot()` function.

```{r scatter plots, fig.width=5, fig.height=7, out.width="25%"}
#### Scatter plots ####
#par(mar = c(4, 6, 5, 1))
#plot(log2Values[,metadata$sampleName], 
#        col = metadata$color)
#par(par.ori)
```

## Combinations


## Exporting the result

We export the pre-processed data in separate tables for further reuse. 

```{r save_tables}
#### Save tables  ####

outfiles <- vector()

## Raw counts, all the variables
outfiles["raw"] <- file.path(outdirs[parameters$datatype], 
                                    paste0(parameters$dataset, 
                                           "_", parameters$datatype, 
                                           "_raw.tsv.gz"))
write.table(x = format(digits = 3, big.mark   = "", decimal.mark = ".", rawValues), 
            dec = ".", 
            file = gzfile(outfiles["raw"], "w"), 
            quote = FALSE, sep = "\t")

## Log2-transformed counts, all the genes
outfiles["log2"] <- file.path(outdirs[parameters$datatype], 
                                    paste0(parameters$dataset, 
                                           "_", parameters$datatype, 
                                           "_log2.tsv.gz"))
write.table(x = format(digits = 3, big.mark   = "", decimal.mark = ".", log2Values), 
            dec = ".", 
            file = gzfile(outfiles["log2"], "w"), 
            quote = FALSE, sep = "\t")

## Filtered genes only, log2-transformed counts
outfiles["filtered"] <- file.path(outdirs[parameters$datatype], 
                                    paste0(parameters$dataset, 
                                           "_", parameters$datatype, 
                                           "_log2_filtered.tsv.gz"))
write.table(x = format(digits = 3, big.mark   = "", decimal.mark = ".", log2Filtered), 
            dec = ".", 
            file = gzfile(outfiles["filtered"], "w"), 
            quote = FALSE, sep = "\t")

## Filtered genes only, log2-transformed and standardized counts
outfiles["standardised"] <- file.path(outdirs[parameters$datatype], 
                                    paste0(parameters$dataset, 
                                           "_", parameters$datatype, 
                                           "_log2_norm.tsv.gz"))
write.table(x = format(digits = 3, big.mark   = "", decimal.mark = ".", log2Standardised), 
            dec = ".", 
            file = gzfile(outfiles["standardised"], "w"), 
            quote = FALSE, sep = "\t")

## Metadata
outfiles["metadata"] <- file.path(outdirs[parameters$datatype], 
                                    paste0(parameters$dataset, 
                                           "_", parameters$datatype, 
                                           "_metadata.tsv"))
write.table(x = metadata, 
            file = outfiles["metadata"] ,
            quote = FALSE, sep = "\t")


## Build a table to display the links in the report
fileTable <- data.frame(outfiles)
fileTable$basename <- basename(fileTable$outfiles)
fileTable$dirname <- dirname(fileTable$outfiles)
fileTable$link <- paste0("[", fileTable$basename, "]", "(", fileTable$outfiles, ")")

## Print the directories
kable(t(as.data.frame.list(outdirs)), col.names = "Directories", caption = "Directories")

##  Print the link table
kable(cbind(rownames(fileTable), fileTable$link), row.names = FALSE, col.names = c("Contents", "Link"),
      caption = "Output files")
```


## Session info

```{r session_info}

sessionInfo()
```