---
title: "Practical -- k-mer distributions"
author: "Jacques van Helden"
date: "`r Sys.Date()`"
output:
  html_document:
    code_folding: hide
    fig_caption: yes
    highlight: zenburn
    self_contained: no
    theme: cerulean
    toc: yes
    toc_depth: 3
    toc_float: yes
  beamer_presentation:
    colortheme: dolphin
    fig_caption: yes
    fig_height: 6
    fig_width: 7
    fonttheme: structurebold
    highlight: tango
    incremental: no
    keep_tex: no
    slide_level: 2
    theme: Montpellier
    toc: yes
  ioslides_presentation:
    colortheme: dolphin
    fig_caption: yes
    fig_height: 6
    fig_width: 7
    fonttheme: structurebold
    highlight: tango
    smaller: yes
    toc: yes
    widescreen: yes
  word_document:
    toc: yes
    toc_depth: '3'
  slidy_presentation:
    fig_caption: yes
    fig_height: 6
    fig_width: 7
    highlight: tango
    incremental: no
    keep_md: yes
    smaller: yes
    theme: cerulean
    toc: yes
    widescreen: yes
  pdf_document:
    fig_caption: yes
    highlight: zenburn
    toc: yes
    toc_depth: 3
font-import: http://fonts.googleapis.com/css?family=Risque
subtitle: Statistics for Biology (STAT2)
font-family: Garamond
transition: linear
---

```{r include=FALSE, echo=FALSE, eval=TRUE}
if (!require(knitr)) {
  install.packages("knitr")
}
library(knitr)

options(width = 300)
knitr::opts_chunk$set(
  fig.width = 7, fig.height = 5, 
  fig.path = 'figures/kmer_distrib_',
  fig.align = "center", 
  size = "tiny", 
  echo = TRUE, eval = TRUE, 
  warning = FALSE, message = FALSE, 
  results = TRUE, comment = "")
# knitr::asis_output("\\footnotesize")

```


## Introduction

In this practical, we will count k-mer occurrences in DNA sequences of different organisms (one organism per sutdent), fit different theoretical distributions of probabilities onto the empirical distributions of counts, and test the goodness of fit for these alternative distributions. 


## Organisms

Each student will choose one organism of interest among the following ones. 
  
| Taxon | Organism | 
|--------|-----------------------------------|
| Fungi | Saccharomyces_cerevisiae |
| Bacteria | Escherichia coli GCF 000005845.2 ASM584v2 |
| Mammalian | Homo sapiens GRCh38 |
| Mammalian | Mus musculus GRCm38 |
| Bird | Gallus gallus EnsEMBL |
| Fish | Danio_rerio_EnsEMBL |
| Insect | Drosophila melanogaster | 
| Worm | |
| Plant | Arabidopsis thaliana.TAIR10.29 |
| Plant | Zea mays.AGPv3.29 |
| Apicomplexa |  |


## Solutions


### Working directory

On your computer, create a directory for this practical. I suggest to use a consistent naming for the different practicals of this course. 

We will further create a sub-folder with the name of our organism of interest. 


```{r working_dir}
## Define your organism of interest
org <- "Homo_sapiens"

## Define a working directory
work.dir <- file.path("~", "CMB-STAT2_practicals", "kmer_distrib", org)
dir.create(work.dir, recursive = TRUE, showWarnings = FALSE)


## Print a message with the result directory
message("Result directory\t", work.dir)

```


## Counting k-mer occurrences in each promoter of a model organism

1. Open a connection to the Regulatory Sequence Analysis Tools (RSAT) teaching server : <http://teaching.rsat.eu/>

2. In the tool search box, type "retrieve sequence" and click on the corresponding tool. 

3. In the [*retrieve-sequence*](http://pedagogix-tagc.univ-mrs.fr/rsat/retrieve-seq_form.cgi) form, 

    - click *Mandatory inputs*, enter the name your organism of interest, and check the option *all genes of this organism*;
    - in *Mandatory options*, select *upstream*, and set the sequence limits from -1 to -500
    - in *Advanced options*, *make sure that this option is **unchecked**: *Prevent overlap with neighbour genes (noorf)* ^[Note: normally it is recommended to check this option, but we intently inactivate it in order to get sequences of the same sizes. ]
    - click *Run analysis* and *GO*.
    
After a few seconds (or minutes) the result is displayed. Right-click on the sequence file (extension fasta) and open it in a separate tab to check its content. 
    
4. Come back to the result page of retrieve-sequence. In the *Next Step* box below the result, click on the link to *oligo-analysis*. This will transfer your sequences to the oligo-analysis form.

    - In the *Sequence* section, inactivate the option *purge sequence*.
    - In the *Oligmer counting mode*, **uncheck** the option *prevent overlapping matches*.
    - Select *Count on single strand*.
    - For *oligomer lengths*, select 2 and **uncheck the other lengths**.
    - In *Results*, check the option *Occurrence table*.
    - Type your email address and select the mail output. 
    - Click *GO*. 

After a few seconds (minutes), the RSAT server should display the result page, with links to the k-mer count table. Copy the URL of the result file. 

### Download the count table from RSAT


Let us define the name we will give to the local copy of the k-mer count table generated on the RSAT server in the previous steps.

```{r kmer_table_name}
## Define the path and the name of the local file
kmer.file <- file.path(work.dir, "2nt-ovlp-1str_Homo_sapiens.tab")

```


One solution is to download manually the k-mer count table generated on the RSAT server, move it to the work directory, and rename it `2nt-ovlp-1str_Homo_sapiens.tab` (to be adapted depending on your organism of interest). 

Another possiblity is to use R command `download.file()` download it from the URL of the result file on the RSAT server. 


```{r eval=FALSE}

## Note: this will only work for a few days, because the temporary files are removed from the server
temp.url <- "http://pedagogix-tagc.univ-mrs.fr/rsat/tmp/www-data/2020/02/17/oligo-analysis_2020-02-17.155035_zZ4IA4"

## Provide the arguments in the order of the function definition
download.file(temp.url, kmer.file)

## Equivalent : name the arguments
# download.file(url = temp.url, destfile = kmer.file)

## Note: named arguments can be provided in a different order without problem
# download.file(destfile = kmer.file, url = temp.url)

```


Whichever method was chosen, check that the file is at the right place on your computed. 

```{r}

## List the files in the working directory
list.files(work.dir)

## Send a message with the k-mer file location
message("K-mer count table file:\t", kmer.file)

```


### Load the k-mer count table in R

Use the finction `read.table()` to load the k-mer count table in a variable named `kmer.table`.

```{r}

## Call the help for read.table()
# ?read.table

## Load the k-mer count table in a variable
kmer.table <- read.table(
  file = kmer.file, 
  comment.char = ";", ## comment lines start with ";" in RSAT
  header = TRUE, # the first row (after the comments) contain the column headers
  row.names = 1, ## the first column contains row names but there might be homonyms
  sep = "\t" ## column separator is the tabulation
  )

## We set uppercases for the dinucleotide sequences, which were in lowercases in the original count table
names(kmer.table) <- toupper(names(kmer.table))

```

Check the dimensions of this table.

```{r}

## Check the dimensions of the k-mer count table
dim(kmer.table)

## Number of k-mers
m <- ncol(kmer.table)

n <- nrow(kmer.table)

## Print the result
print(paste0("Number of rows: ", n))
print(paste0("Number of columns: ", m))
```


Check the column names

```{r}
## Check the column names
names(kmer.table)
colnames(kmer.table) # equivalent
```

Display the first and last 10 lines of the k-mer count table. 

```{r}
## Show the first 10 lines of the k-mer count table
head(kmer.table)
tail(kmer.table)

```


## Compute marginal statistics

### Statistics per column (k-mer)

Tips: use the R function `apply()`. 

We will compute marginal statistics on the rows and columns of the count table. One possibility is to add columns and rows on this table, but this would not be very convenient for the subsequent computations to be performed on the counts. We will thus create two separate tables: one with the row-wise statistics, and another one with the column-wise statistics. 

Rather than looping over each row or column of the count table, we can use the function `apply()`in order to compute a chosen statistics on each row (margin = 1) or column (margin = 2) of the table.

```{r}

## Recall the dimensions of the kmer count table
dim(kmer.table)
# View(kmer.table)

## Compute means per column (on the "second" margin)
col.means <- apply(kmer.table, 2, mean)

print(col.means)

```

We can run the `apply()`function in the same way to compute the other descriptive statistics (median, variance, ...) and store each result in a separate vector. However, it would be more convenient to dispose of a single structure containing all the statistics for each column. For this, we use `data.frame()` to create a data frame with one column per column per computed statistics, and one row per k-mer. 

```{r}
## Compute a table summarising the statistics per k-mer 
## (marginal stats on the columns)
stats.per.kmer <- data.frame(
  mean = apply(kmer.table, 2, mean),
  sd = apply(kmer.table, 2, sd),
  var = apply(kmer.table, 2, var),
  min = apply(kmer.table, 2, min),
  p05 = apply(kmer.table, 2, quantile, 0.05),
  Q1 = apply(kmer.table, 2, quantile, 0.25),
  median = apply(kmer.table, 2, median),
  Q3 = apply(kmer.table, 2, quantile, 0.75),
  p95 = apply(kmer.table, 2, quantile, 0.95),
  max = apply(kmer.table, 2, max),
  sum = apply(kmer.table, 2, sum)
)


kable(stats.per.kmer, caption = "Column-wise means of the k-mer count table. ")

```


### Warning about R sd() and var() functions

**Beware:** the R functions `sd` and `var` do not compute the standard deviation and variance of the numbers provded! Instead, they consider that the data provided are a sample drawed from a population, and the return an estimate of the standard deviation or variance of this population. 

So, instead of the sample variance

$$s^2 = \sum_{i=1}^n \frac{1}{n}(x - \bar{x})^2$$

They estimate the variance of the population corrected for the systematic bias of sample variance. 

$$\hat{s^2} = \sum_{i=1}^n \frac{1}{n-1}(x - \bar{x})^2$$

For this exercise, this is actually what we want, since we can consider that the `r n` genes of our organism of interest are a sampling of all the genes that might be found in organisms of the same taxon. In any case, the impact of the correction is negligible with such a large number of genes ($n = `r n`$).


$$\frac{n}{n-1} =  `r round(digits = 7, n/(n-1))`$$


### Analyse the relationships between the mean and the variance

Poisson: variance = mean
Binomial: variance < mean
XXX: variance > mean


```{r mean_vs_var, fig.width=7, fig.height=4, out.width="60%", fig.cap="Mean-variance plot for dinucleotide counts in all human promoters. The line indicates the variance/mean relationship in Poisson. distributions. "}
plot(stats.per.kmer$mean, stats.per.kmer$var, 
     xlab = "Mean of the counts per promoters", 
     ylab = "Variance of the counts per promoters", 
     las = 1, panel.first = grid(),
     main = "Dinucleotide counts in human promoters\nVariance versus mean plot",
     xlim = c(10,50), ylim = c(0,500), cex = 0)
text(x = stats.per.kmer$mean, y = stats.per.kmer$var, labels = rownames(stats.per.kmer))
abline(a = 0, b = 1)

```


### Statistics per gene (row)

```{r stats_per_gene}
## Compute a table summarising the gene-wise (row-wise) statistics of the k-mer count table
stats.per.gene <- data.frame(
  mean = apply(kmer.table, 1, mean),
  sd = apply(kmer.table, 1, sd),
  var = apply(kmer.table, 1, var),
  min = apply(kmer.table, 1, min),
  Q1 = apply(kmer.table, 1, quantile, 0.25),
  median = apply(kmer.table, 1, median),
  Q3 = apply(kmer.table, 1, quantile, 0.75),
  max = apply(kmer.table, 1, max),
  sum = apply(kmer.table, 1, sum)
)

```

We can inspect the sum of counts per gene. 

```{r sum_per_gene}
kable(rev(table(stats.per.gene$sum)))
```

For almost all genes, the sum of k-mer counts is $n = L -1$, where $L$ is the sequence length. This is logical since there is no dinucleotide starting at the last position of a sequence. More generally, for a k-mer of size $k$, the number of possible positions is $n = L - k + 1$. 

The very few cases of sequences having a smaller sums of counts result from the presence of ambiguous nucleotides (represented by the symbol `N`) in the genome sequence. For this exercise, we will suppress these genes because our models rely on the hypothesis that the sum of counts per gene is constant. 

```{r gene_filtering}
## Filter out the genes having an incorrect sum of counts
n <- max(stats.per.gene$sum)

## Count the number of genes with the correct sum of counts
sum(stats.per.gene$sum == n)

## Count the number of genes with an incorrect sum of counts
sum(stats.per.gene$sum != n)

## Filter out the genes with incorrect sums of counts
kmer.table <- kmer.table[stats.per.gene$sum == n, ]
dim(kmer.table)

## Count the remaining number of genes
nb.genes <- nrow(kmer.table)

```


## Drawing empirical distributions of counts


### Draw an histogram with the distribution of counts for a given k-mer. 

The simplest way to generate a histogram is to use the `hist()` function. Here is the result with the default parameters.

```{r one_kmer_hist, fig.width=7, fig.height=4, out.width="60%", fig.cap="Counts per sequence in all human promoters. "}
kmer <- "AG" # Select a k-mer
counts <- kmer.table[, kmer] # select the counts for this k-mer
range(counts)

## Draw an histogram with default parameters (not very beautiful)
hist(x = counts)
```

We can fine-tune the parameters in order to get a more informative representation, by specifying custom breaks. 

```{r one_kmer_hist_pretty, fig.width=7, fig.height=4, out.width="60%", fig.cap="Counts per sequence in all human promoters. "}
## Draw a nice and informative histogram
hist(x = counts, 
     breaks = 0:max(counts + 1),
     col = "#44DDFF", border = "#44DDFF",
     las = 1,
     ylab = "Number of promoters",   
     xlab = "K-mer counts",
     main = paste0(kmer, " counts in promoters"))

```


### Generate a figure with 4 x 4 panels to depict the histograms of the 16 k-mers


```{r all_kmer_hist_pretty, fig.width=12, fig.height=8, out.width="100%", fig.cap="Dinucleotide counts per sequence in all human promoters. "}

par(mfrow = c(4,4))
par(mar = c(4.1, 5.1, 4.1, 1.1))
for (i in 1:16) {
  kmer <- names(kmer.table)[i]
  counts <- kmer.table[, i] # select the ith column of the k-mer table
  ## Draw a nice and informative histogram
  hist(x = counts, 
       breaks = 0:max(counts + 1),
       col = "#44DDFF", border = "#44DDFF",
       las = 1,
       ylab = "Number of promoters",   
       xlab = "K-mer counts",
       main = paste0(toupper(kmer), " counts in promoters"))
}

par(mfrow = c(1,1))
par(mar = c(4.1, 5.1, 4.1, 2.1))

```


### Use other graphical representations to get an insight of the k-mer count distributions (boxplots, violin plots)

#### Boxplot

The funciton `boxplot()` provides an informative summary of the data. 

```{r kmer_count_boxplot, fig.width=7, fig.height=5, out.width="60%", fig.cap="Boxplot of dinucleotide counts in all promoters. "}
boxplot(x = kmer.table)
```

We can fine-tue the parameters to enhance the readability

```{r kmer_count_boxplot_pretty, fig.height=10, fig.width=7, out.width="60%", fig.cap="Boxplot of dinucleotide counts in all promoters. "}
boxplot(x = kmer.table[,order(stats.per.kmer$median)], 
        horizontal = TRUE, 
        las = 1, 
        col = "#44DDFF", 
        main = "Dinucleotide distributions", 
        xlab = "Counts")
```

Alternative: sort by variance. 

```{r kmer_count_boxplot_pretty_var, fig.height=10, fig.width=7, out.width="60%", fig.cap="Boxplot of dinucleotide counts in all promoters. "}
boxplot(x = kmer.table[,order(stats.per.kmer$var)], 
        horizontal = TRUE, 
        las = 1, 
        col = "#44DDFF", 
        main = "Dinucleotide distributions", 
        xlab = "Counts")
```


#### Violin plot

We can use the R function `vioplot()` to generate a violin plot. This requires to install the R package `vioplot`, if it is not already done. 

```{r violin_plot, fig.width=7, fig.height=5, out.width="60%", fig.cap="Violin plot of dinucleotide counts in promoter sequences"}
## Check if the vioplot package is already installed
if (!require(vioplot)) {
  ## Install the vioplot package
  install.packages("vioplot")
}
library(vioplot) ## Load the vioplot package
# help(package = vioplot) # Get help on the package
# help(vioplot) # Get help on the vioplot function

vioplot(kmer.table)

```

Let us fine-tune the parameters to get a pretty violin plot. 

```{r violin_plot_pretty, fig.height=10, fig.width=7, out.width="60%", fig.cap="Violin plot of dinucleotide counts in promoter sequences"}
vioplot(kmer.table[, order(stats.per.kmer$mean)], horizontal = TRUE, col = "#44DDFF", las = 1, main = "Violin plot of dinucleotide counts in promoters", xlab = "Counts", )

```


```{r violin_plot_pretty_var, fig.height=10, fig.width=7, out.width="60%", fig.cap="Violin plot of dinucleotide counts in promoter sequences"}
vioplot(kmer.table[, order(stats.per.kmer$var)], horizontal = TRUE, col = "#44DDFF", las = 1, main = "Violin plot of dinucleotide counts in promoters", xlab = "Counts", )
```


## Exploring k-mer count distributions in promoter sequences
    
5. Compute a vector with the relative frequency of each k-mer in all the sequences. 

6. Compute a table with the relative frequencies of k-mers per sequence, and compute similar summary statistics per column on this relative frequency table. 

7. Write a brief interpretation of the results. 
## Fitting


### Encapsulate the pretty histogram in an R function

For the rest of this tutorial, we will have to draw several histograms of k-mer count distributions. Since we are satisfied of the fine-tuned histogram above, one possibility is to include it in a function, that we will be able to recall later rather than having to re-type the full code for each usage. 

We will properly document this function with the roxygen2 standard. This will enable us to incorporate this function in a custom R package if we wish. 

We can also include parameters (defined in the signature of the function) in order to make its use more flexible. 

Moreover, this function will return an object (of class "histogram") with the  numerical values of the histogram (breaks, frequencies, counts, ...). 

```{r count_histogram, collapse=FALSE}
#' @title K-mer count histogram
#' @author Jacques van Helden
#' @description Draw an histogram with the distribution of a vector of counts for a given k-mer.
#' @param counts a vector of counts
#' @param kmer the sequence of the oligonucleotide (k-mer) to be displayed in the title
#' @param class.interval=1 class interval (breaks are automatically computed)
#' @param col="#44DDFF" fill color for the histogram rectangles
#' @param border="#22BBFF" border color for the histogram rectangles
#' @param main=paste0(kmer, " counts in promoters")
#' @param ... all the other arguments are passed to the hist function
#' @return an object of type histogram returned by the hist() function
#' @export 
kmer.hist <- function(counts, 
                      kmer, 
                      class.interval = 1,
                      col = "#44DDFF", 
                      border = "#44DDFF",
                      main = paste0(kmer, " counts in promoters"),
                      ...) {
  h <- hist(x = counts, 
            breaks = seq(from = 0, 
                         to = max(counts + class.interval), 
                         by = class.interval),
            col = col, 
            border = border,
            las = 1,
            ylab = "Number of promoters",   
            xlab = "K-mer counts",
            main = main,
            ...)
  
  ## Return the histogram values
  return(h)
}
```

We can then call this function with a given k-mer.

```{r another_kmer, fig.width=7, fig.height=4, out.width="60%", fig.cap="Distribution of k-mer counts in promoter generated with our custom funciton kmer.hist(). "}

hist.values <- kmer.hist(counts = kmer.table[, "TT"], kmer = "TT")

```

We can then re-use it with wider breaks.

```{r another_kmer_ci5, fig.width=7, fig.height=4, out.width="60%", fig.cap="Distribution of k-mer counts in promoter generated with our custom funciton kmer.hist() with class intervals of size 5. "}
kmer <- "TT"
counts <- kmer.table[, kmer]
hist.values.ci5 <- kmer.hist(counts = counts, 
                             kmer = kmer, 
                             border = "#0088FF",
                             class.interval = 5)
```

We can now extract the values of the histogram, for example the class midpoints and the counts per class. 

```{r hist_values}
class(hist.values)
names(hist.values)
# View(hist.values)
```

We can print the midpoints of the class intervals.

```{r hist_mids}
print(hist.values$mids)
```

And the corresponding counts. 

```{r hist_counts}
print(hist.values$counts)

```

We can also use this information to draw a custom plot of the distribution (for example a frequency polygon). 

```{r frequency_polygon, fig.width=7, fig.height=4, out.width="60%", fig.cap="Distribution of counts for the TT dinucleotide in Human promoter sequences. Representation. as frequency polygon. "}
plot(hist.values$mids, hist.values$counts, 
     las = 1, xlab = "Number of sequences", 
     ylab = "Counts per sequence",
     type = "l", col = "#008800", 
     lwd = 3,
     main = paste0(kmer, " counts\nFrequency polygon"))
grid() # Add a grid to the plot
```


### Fit a Poisson distribution on each empirical distribution of k-mer counts. 

    - How do you choose the parameters?
    - Draw the fitted Poisson distribution over the histogram of empirical distribution (observed k-mer occurences)
    
        **Tips:** 
        
        - in order to add some plot over an existing plot, you can use the `lines()` function
        - you can also use specific options to draw histogram-like lines: `lines(x, y, type = "h"). 
        
        
The Poisson distribution is defined by a single parameter: the expected mean $\lambda$ (the Greek letter "lambda"). Noteworth, the variance of a Poisson distribution equals its mean. 

```{r fit_poisson}
## Select a kmer and its counts
kmer <- "GA" ## Select an arbitrary k-mer
counts <- kmer.table[, kmer]
ci <- 2

## Estimate lambda (mean of the Poisson) as the mean of the counts
lambda <- mean(counts)

## Compute the Poisson Probability Mass Function
X <- 0:max(counts) ## Compute the values covered by the k-mer counts
# length(X)
poisson.pmf <- dpois(x = X, lambda = lambda)
# sum(poisson.pmf) # Validation : this should approximte 1

## Compute expected number of genes for each count value
poisson.exp <- poisson.pmf * nb.genes
# sum(poisson.exp) # Validation: this should approximate n

```


```{r fitting_plot_fp, fig.width=7, fig.height=5, out.width="60%", fig.cap="Fitting of a Poisson distribution on dinucleotide counts"}

## plot the histogram
h <- kmer.hist(counts, kmer, 
               ylim = c(0, max(poisson.exp)), 
               class.interval = 1, 
               col = "#BBBBBB", border = "#BBBBBB",
               main = paste0("Poisson fit on ", kmer, " counts"))
lines(X, poisson.exp, col = "red", lwd = 2)

obs <- h$counts


```

An alternative is to draw the lines as a histogram rather than a polygon frequency. 

```{r fitting_plot_h, fig.width=7, fig.height=5, out.width="60%", fig.cap="Fitting of a Poisson distribution on dinucleotide counts"}

## plot the histogram
h <- kmer.hist(counts, kmer, 
               ylim = c(0, max(poisson.exp)), 
               class.interval = 1, col = "#BBBBBB", border = "#BBBBBB",
               main = paste0("Poisson fit on ", kmer, " counts"))
lines(X, poisson.exp, col = "red", lwd = 1, type = "h")

```

We can now generate the same drawing for each k-mer. 

```{r poisson_fit_all_kmers, fig.width=12, fig.height=8, out.width="100%", fig.cap="Poisson fit on dinucleotide counts from Human promoter sequences. "}

par(mfrow = c(4, 4))
for (i in 1:m) {
  kmer <- names(kmer.table)[i]
  counts <- kmer.table[, kmer]
  ## plot the histogram
  h <- kmer.hist(counts, kmer, 
                 ylim = c(0, max(poisson.exp)), 
                 class.interval = 1, col = "#BBBBBB", border = "#BBBBBB",
                 main = paste0("Poisson fit on ", kmer, " counts"))
  
  lambda <- mean(counts)
  poisson.exp <- dpois(x = X, lambda = lambda) * nb.genes
  lines(X, poisson.exp, col = "red", lwd = 0.5, type = "h")

}
par(mfrow = c(1, 1))

```

### Using the sample median as estimator of the population mean


```{r poisson_fit_all_kmers_median, fig.width=12, fig.height=8, out.width="100%", fig.cap="Poisson fit on dinucleotide counts from Human promoter sequences. "}

par(mfrow = c(4, 4))
for (i in 1:m) {
  kmer <- names(kmer.table)[i]
  counts <- kmer.table[, kmer]
  ## plot the histogram
  h <- kmer.hist(counts, kmer, 
                 ylim = c(0, max(poisson.exp)), 
                 class.interval = 1, col = "#BBBBBB", border = "#BBBBBB",
                 main = paste0("Poisson fit on ", kmer, " counts"))

  lambda <- median(counts)
  poisson.exp <- dpois(x = X, lambda = lambda) * nb.genes
  lines(X, poisson.exp, col = "red", lwd = 0.5, type = "h")
  
  lines(X, poisson.exp, col = "red", lwd = 0.5, type = "h")
  
}
par(mfrow = c(1, 1))

```


### Q-Q plots

```{r}
## Selevct a k-mer
kmer <- "AC"
counts <- kmer.table[, kmer]
lambda <- mean(counts)
X <- 0:(max(counts))

## Generate random counts that follow a Poisson loaw
rcounts <-  rpois(n = length(counts), lambda = lambda)

## Draw a Q-Q plot with the random counts versus actual counts
qqplot(x = counts, y = rcounts, 
       main = "QQplot - counts vs random Poisson",
       col = "#88AABB",
       xlab = "Counts",
       ylab = "Random counts (rpois)")
grid()
abline(a = 0, b = 1)

```

## Goodness of fit

### Using R built-in chisq.test() function

1. Assess the goodness of the fit using the R `chisq.test()` function.

    - How significant is the result?
    - How good is the fit?
    - Did the test issue some warning message? If so, how do you interpret it?
    - Check if the assumptions are met for the validity of the chi2 test. 
    
    
```{r chisq_test}
## Choose a k-mer and select its counts
kmer <- "TC"
counts <- kmer.table[, kmer]
max.count <- max(counts)
X <- 0:max.count # all possible values of X

## Get the number of genes per k-mer counts with hist() 
## but do not plot the histogram
h <- hist(counts, 
          breaks = seq(from = 0, 
                       to = max.count + 1, 
                       by = 1), 
          plot = FALSE)
obs <- h$counts ## Observed counts

## Compute the probabilities from a  Poisson distribution
poisson.p <- dpois(X, lambda = mean(counts))

## Run the chi2 test
chi2.result <- chisq.test(x = obs, p = poisson.p)
print(chi2.result)

pval <- chi2.result$p.value

```
    

### With a custom chisquare.test() function
    
2. Implement a properly documented function that 

    - Takes as input a vector of observed values (counts per class) + a vector of expected values (counts per class), 
     - Checks that the expected values are compliant with the applicability conditions.
     - If not, groups the tails of the vectors in order to ensur this condition.
     - Runs the chi2 test
     - Returns the following values

          - chi2 statistics
          - number of initial classes
          - number of classes after tail grouping
          - degrees of freedom
          - p-value


**Tips**

- **Beware**, the input  vector should not be the counts of vector per sequences, but a vector of counts per class (e.g. how many genes had between 0 and 4 k-mer occurrences, betwen 5 and 9, betwen 10 and 14, ...) returned by the histogram

```{r chi2_test_function}
#' @title chi-squared test
#' @author Jacques van Helden
#' @description Run a chi2 test with convenient options enabling to automatically group the tails of the input vector if the expected values are too small
#' @param obs a vector of observed values. These must be absolut e counts and not relative frequncies. 
#' @param exp a vector of expected values. Should have the same length and same sum as the observed values
#' @param X X values for the histogram
#' @param main main title for the histogram
#'@return a vector of statistics related to the chi2 test: chi2 statistics, number of classes of the input vectors, numnber of classes after tail grouping, degrees of freedom, p-value
chisq.test.grouped.tails <- function(obs, 
                                     exp,
                                     X,
                                     main = "chi2 test") {
  ## Group the left tail
  first.ok <- head(which(exp >= 5), n = 1)
  last.ok <- tail(which(exp >= 5), n = 1)
  if (first.ok > 1) {
    exp.grouped <-   c(sum(exp[1:(first.ok - 1)]), 
                       exp[first.ok:last.ok])
    obs.grouped <-   c(sum(obs[1:(first.ok - 1)]), 
                       obs[first.ok:last.ok])
    X.grouped <-   c(X[first.ok - 1], 
                     X[first.ok:last.ok])
  } else {
    exp.grouped <- exp[first.ok:last.ok]
    obs.grouped <- obs[first.ok:last.ok]
    X.grouped <- X[first.ok:last.ok]
  }
  
  if (last.ok < length(exp)) {
    exp.grouped <- c(exp.grouped, sum(exp[(last.ok + 1):length(exp)]))
    obs.grouped <- c(obs.grouped, sum(obs[(last.ok + 1):length(obs)]))
    X.grouped <- c(X.grouped, X[last.ok + 1])
  }
  test <- chisq.test(x = obs.grouped, p = exp.grouped, rescale.p = TRUE)
  
  ## Collect the result
  result <- data.frame(
    chi2.stat = test$statistic, 
    nclasses.input = length(obs),  
    nclasses.grouped = length(obs.grouped),
    df = test$parameter,
    p.value = test$p.value)
  
  
  ## Plot the distributions before and after grouping
  
  if (class(X) == "numeric") {
    par(mfrow = c(2,1))
    
    ## Plot distrib before grouping
    ymax = ceiling(max(obs.grouped, exp.grouped)) ## Compute max value for Y axis
    plot(x = X, y = obs, type = "h", col = "grey", lwd = 2, 
         ylim = c(0, ymax),
         xlim = range(X),
         ylab = "Number of sequences", 
         xlab = "K-mer occurrences",
         las = 1, 
         main = main)
    lines(x = X, y = exp, type = "l", col = "red")
    legend("topright", legend = paste0(sum(exp < 5), " counts with exp < 5"))

    ## Plot distrib after grouping
    plot(x = X.grouped, y = obs.grouped, 
         type = "h", col = "grey", lwd = 2, 
         ylim = c(0, ymax),
         xlim = range(X),
         ylab = "Number of sequences", 
         xlab = "K-mer occurrences",
         las = 1, main = paste0(main, "\ngrouped tails"))
    lines(x = X.grouped, y = exp.grouped, type = "l", col = "darkgreen")
    legend("topright", cex = 0.7,
           legend = paste0(names(result), 
                           " = ", 
                           signif(result, digits = 3))
    )

    par(mfrow = c(1,1))

  }  

  return(result)
}


```

```{r TC_fitting, fig.width=7, fig.height=7, out.width="80%"}

#' @title Poisson fit
#' @description Compute the distribution of counts and fit a Poisson distribution
#' @param counts 
#' @param kmer="k-mer"
poisson.fit <- function(counts, main = "Poisson fit") {

  ## Compute the number of occurrences for each count value
  h <- hist(counts, 
            breaks = seq(
              from = min(counts) - 0.5,
              to = max(counts) + 0.5, 
              by = 1), plot = FALSE)
  obs <- h$counts # note: the right tail contains many zeros
  X <- h$mids
  mean.counts <- mean(counts)
  
  ## Compute expected counts
  exp <- dpois(x = X, lambda = mean.counts) * sum(obs)
  chi2.result <- chisq.test.grouped.tails(obs = obs, exp = exp, X = X, main = main)
  return(chi2.result)
}

## Run the chi2 test with tail grouping
kmer <- "CT"
counts <- kmer.table[, kmer]
poisson.fit(counts, main = "CT poisson fit")

```


As a matter of control we will fit a Poisson distribution on a vector of random numbers generated according to a Poisson distribution.

```{r rpois_fitting, fig.width=7, fig.height=7, out.width="80%"}
## Run the chi2 test with random numbers following a Poisson distribution
poisson.fit(rpois(n = length(counts), lambda = mean(counts)), main = "Random Poisson numbers")
```


## Fitting a binomial distribution

```{r eval=FALSE}
kmer <- "TC"
counts <- kmer.table[, kmer]

h <- hist(counts,
          breaks = seq(
            from = min(counts - 0.5),
            to = max(counts + 0.5),
            by = 1), plot = FALSE)
obs <- h$counts
X <- h$mids

## Estimate prior k-mer probabilities by computing their
## frequency in the whole data set
total.per.kmer <- apply(kmer.table, 2, sum)
freq.per.kmer <- total.per.kmer / sum(total.per.kmer)
n <- max(stats.per.gene$sum) # number of k-mer positions per sequence

exp <- sum(obs) * dbinom(x = X, 
                         size = n, 
                         prob = freq.per.kmer[kmer])
length(exp)
length(obs)
length(X)
chi2.result <- chisq.test.grouped.tails(
  obs = obs, exp = exp, 
  X = X, main = paste0(kmer, "; binomial fitting"))
 
```


## Fitting a normal distribution

## Fitting a negative binomial