---
title: "Practical -- k-mer distributions"
author: "Jacques van Helden"
date: "`r Sys.Date()`"
output:
  html_document:
    code_folding: hide
    fig_caption: yes
    highlight: zenburn
    self_contained: no
    theme: cerulean
    toc: yes
    toc_depth: 3
    toc_float: yes
  beamer_presentation:
    colortheme: dolphin
    fig_caption: yes
    fig_height: 6
    fig_width: 7
    fonttheme: structurebold
    highlight: tango
    incremental: no
    keep_tex: no
    slide_level: 2
    theme: Montpellier
    toc: yes
  slidy_presentation:
    fig_caption: yes
    fig_height: 6
    fig_width: 7
    highlight: tango
    incremental: no
    keep_md: yes
    smaller: yes
    theme: cerulean
    toc: yes
    widescreen: yes
  word_document:
    toc: yes
    toc_depth: '3'
  ioslides_presentation:
    colortheme: dolphin
    fig_caption: yes
    fig_height: 6
    fig_width: 7
    fonttheme: structurebold
    highlight: tango
    smaller: yes
    toc: yes
    widescreen: yes
  pdf_document:
    fig_caption: yes
    highlight: zenburn
    toc: yes
    toc_depth: 3
font-import: http://fonts.googleapis.com/css?family=Risque
subtitle: Statistics for Biology (STAT2)
font-family: Garamond
transition: linear
---

```{r include=FALSE, echo=FALSE, eval=TRUE}
library(knitr)
options(width = 300)
knitr::opts_chunk$set(
  fig.width = 7, fig.height = 5, 
  fig.path = 'figures/02_combinatorix_',
  fig.align = "center", 
  size = "tiny", 
  echo = TRUE, eval = TRUE, 
  warning = FALSE, message = FALSE, 
  results = TRUE, comment = "")
# knitr::asis_output("\\footnotesize")

```


## Introduction

In this practical, we will count k-mer occurrences in DNA sequences of different organisms (one organism per sutdent), fit different theoretical distributions of probabilities onto the empirical distributions of counts, and test the goodness of fit for these alternative distributions. 

We will run the beginning of this practical in tutorial mode, by providing the solutions for

- computing k-mer distributions in the upstream sequences of all the genes of an organism of interest on the [RSAT server](http://teaching.rsat.eu);
- downloading the resulting k-mer count table;
- loading this table in and R data frame.

We will then pursue with exercises to 
- explore the data (descriptive statistics)
- fit different theoretical distributions on the observed k-mer counts
- assess the goodness of the fit.


## Organisms

Each student will choose one organism of interest among the following ones. 
  
| Taxon | Organism | 
|--------|-----------------------------------|
| Fungi | Saccharomyces_cerevisiae |
| Bacteria | Escherichia coli GCF 000005845.2 ASM584v2 |
| Mammalian | Homo sapiens GRCh38 |
| Mammalian | Mus musculus GRCm38 |
| Bird | Gallus gallus EnsEMBL |
| Fish | Danio_rerio_EnsEMBL |
| Insect | Drosophila melanogaster | 
| Worm | |
| Plant | Arabidopsis thaliana.TAIR10.29 |
| Plant | Zea mays.AGPv3.29 |
| Apicomplexa |  |


## Tutorial for the first steps


### Working directory

On your computer, create a directory for this practical. I suggest to use a consistent naming for the different practicals of this course. 

We will further create a sub-folder with the name of our organism of interest. 


```{r working_dir}
## Define your organism of interest
org <- "Homo_sapiens"

## Define a working directory
work.dir <- file.path("~", "CMB-STAT2_practicals", "kmer_distrib", org)
dir.create(work.dir, recursive = TRUE, showWarnings = FALSE)


## Print a message with the result directory
message("Result directory\t", work.dir)

```


## Counting k-mer occurrences in each promoter of a model organism

1. Open a connection to the Regulatory Sequence Analysis Tools (RSAT) teaching server : <http://teaching.rsat.eu/>

2. In the tool search box, type "retrieve sequence" and click on the corresponding tool. 

3. In the [*retrieve-sequence*](http://pedagogix-tagc.univ-mrs.fr/rsat/retrieve-seq_form.cgi) form, 

    - click *Mandatory inputs*, enter the name your organism of interest, and check the option *all genes of this organism*;
    - in *Mandatory options*, select *upstream*, and set the sequence limits from -1 to -500
    - in *Advanced options*, *make sure that this option is **unchecked**: *Prevent overlap with neighbour genes (noorf)* ^[Note: normally it is recommended to check this option, but we intently inactivate it in order to get sequences of the same sizes. ]
    - click *Run analysis* and *GO*.
    
After a few seconds (or minutes) the result is displayed. Right-click on the sequence file (extension fasta) and open it in a separate tab to check its content. 
    
4. Come back to the result page of retrieve-sequence. In the *Next Step* box below the result, click on the link to *oligo-analysis*. This will transfer your sequences to the oligo-analysis form.

    - In the *Sequence* section, inactivate the option *purge sequence*.
    - In the *Oligmer counting mode*, **uncheck** the option *prevent overlapping matches*.
    - Select *Count on single strand*.
    - For *oligomer lengths*, select 2 and **uncheck the other lengths**.
    - In *Results*, check the option *Occurrence table*.
    - Type your email address and select the mail output. 
    - Click *GO*. 

After a few seconds (minutes), the RSAT server should display the result page, with links to the k-mer count table. Copy the URL of the result file. 

### Download the count table from RSAT


Let us define the name we will give to the local copy of the k-mer count table generated on the RSAT server in the previous steps.

```{r kmer_table_name}
## Define the path and the name of the local file
kmer.file <- file.path(work.dir, "2nt-ovlp-1str_Homo_sapiens.tab")

```


One solution is to download manually the k-mer count table generated on the RSAT server, move it to the work directory, and rename it `2nt-ovlp-1str_Homo_sapiens.tab` (to be adapted depending on your organism of interest). 

Another possiblity is to use R command `download.file()` download it from the URL of the result file on the RSAT server. 


```{r eval=FALSE}

## Note: this will only work for a few days, because the temporary files are removed from the server
temp.url <- "http://pedagogix-tagc.univ-mrs.fr/rsat/tmp/www-data/2020/02/17/oligo-analysis_2020-02-17.155035_zZ4IA4"

## Provide the arguments in the order of the function definition
download.file(temp.url, kmer.file)

## Equivalent : name the arguments
# download.file(url = temp.url, destfile = kmer.file)

## Note: named arguments can be provided in a different order without problem
# download.file(destfile = kmer.file, url = temp.url)

```


Whichever method was chosen, check that the file is at the right place on your computed. 

```{r}

## List the files in the working directory
list.files(work.dir)

## Send a message with the k-mer file location
message("K-mer count table file:\t", kmer.file)

```


### Load the k-mer count table in R

Use the finction `read.table()` to load the k-mer count table in a variable named `kmer.table`.

```{r}

## Call the help for read.table()
# ?read.table

## Load the k-mer count table in a variable
kmer.table <- read.table(
  file = kmer.file, 
  comment.char = ";", ## comment lines start with ";" in RSAT
  header = TRUE, # the first row (after the comments) contain the column headers
  row.names = 1, ## the first column contains row names but there might be homonyms
  sep = "\t" ## column separator is the tabulation
  )

```

Check the dimensions of this table.

```{r}

## Check the dimensions of the k-mer count table
dim(kmer.table)

## Number of k-mers
m <- ncol(kmer.table)

## Number of genes
n <- nrow(kmer.table)

## Print the result
print(paste0("Number of rows: ", n))
print(paste0("Number of columns: ", m))
```


Check the column names

```{r}
## Check the column names
names(kmer.table)
colnames(kmer.table) # equivalent
```

Display the first and last 10 lines of the k-mer count table. 

```{r}
## Show the first 10 lines of the k-mer count table
head(kmer.table)
tail(kmer.table)

```


## Homework

- Read the tutorial [first steps wit R](https://jvanheld.github.io/stat1/practicals/01_intro_R/01_R-first-steps.html)

- Compute marginal statistics
- Draw histograms of a given k-mer of your choice


### Compute marginal statistics


Tips: use the R function `apply()`. 


### Draw the distributions


## Exploring k-mer count distributions in promoter sequences

1. Load the k-mer count table generated in the previous step.

2. Draw an histogram with the distribution of counts for a given k-mer. After having fine-tuned the representation, generate a pdf file with 4 x 4 panels to depict the histograms of the 16 k-mers.

3. Use other graphical representations to get an insight of the k-mer count distributions (boxplots, violin plots)

4. Compute summary statistics for each column of the count table, including the following estimators

    - min and max
    - mean
    - percentiles 05, 25 (=Q1), 50 (=median), 75 (=Q3), 95
    - variance and standard deviation
    - sum
    
5. Compute a vector with the relative frequency of each k-mer in all the sequences. 

> **Tip: ** for this exercise you will need to divide each count by the marginal counts of the row (the sum of k-mer counts of the considered sequence). This could in principle be done by implementing two embedded loops, one that iterates on the rows (sequences) and the other one on the columns (k-mers). This would however be quite inefficient. Instead, you can immediately divide the whole count matrix by the vector containing the sum of k-mr counts per gene. 

6. Compute a table with the relative frequencies of k-mers per sequence, and compute similar summary statistics per column on this relative frequency table. 

7. Write a brief interpretation of the results. 


## Fitting distributions

1. Fit a Poisson distribution on each empirical distribution of k-mer counts. 

    - How do you choose the parameters?
    - Draw the fitted Poisson distribution over the histogram of empirical distribution (observed k-mer occurences)
    
        **Tips:** 
        
        - in order to add some plot over an existing plot, you can use the `lines()` function
        - you can also use specific options to draw histogram-like lines: `lines(x, y, type = "h"). 

2. Do the same with the following distributions : 

    a. binomial
    b. hypergeometric
    c. normal
    d. negative binomial

## Goodness of fit


1. Assess the goodness of the fit using the R `chisq.test()` function.

    - How significant is the result?
    - How good is the fit?
    - Did the test issue some warning message? If so, how do you interpret it?
    - Check if the assumptions are met for the validity of the chi2 test. 
    
2. Implement a function that 

    - Takes as input a vector of observed values + a vector of expected values, 
     - Checks that the expected values are compliant with the applicability conditions.
     - If not, groups the tails of the vectors in order to ensur this condition.
     - Runs the chi2 test
     - Returns the following values

          - chi2 statistics
          - number of initial classes
          - number of classes after tail grouping
          - degrees of freedom
          - p-value

3. Apply this function to each k-mer for each distribution, and evaluate which one gives the best fit to the data.