---
title: "Analysis of a genome annotation table"
author: "Jacques van Helden"
date: '`r Sys.Date()`'
output:
  pdf_document:
    fig_caption: yes
    highlight: zenburn
    toc: yes
    toc_depth: 3
  beamer_presentation:
    colortheme: dolphin
    fig_caption: yes
    fig_height: 6
    fig_width: 7
    fonttheme: structurebold
    highlight: tango
    incremental: no
    keep_tex: no
    slide_level: 2
    theme: Montpellier
    toc: yes
  html_document:
    self_contained: no
    code_folding: show
    fig_caption: yes
    highlight: zenburn
    theme: cerulean
    toc: yes
    toc_depth: 3
    toc_float: yes
  slidy_presentation:
    self_contained: no
    fig_caption: yes
    fig_height: 6
    fig_width: 7
    highlight: tango
    incremental: no
    keep_md: no
    smaller: yes
    theme: cerulean
    toc: yes
    toc_float: yes
    widescreen: yes
  ioslides_presentation:
    colortheme: dolphin
    fig_caption: yes
    fig_height: 6
    fig_width: 7
    fonttheme: structurebold
    highlight: tango
    incremental: no
    keep_md: no
    smaller: yes
    theme: cerulean
    toc: yes
    widescreen: yes
  word_document:
    toc: yes
    toc_depth: 3
font-import: http://fonts.googleapis.com/css?family=Risque
font-family: Garamond
subtitle: Probabilities and statistics for biology (STAT1)
address: TAGC lab, Aix-Marseille Université, France
transition: linear
---

```{r setup, include=FALSE, size="huge"}
library(knitr)
## Default parameters for displaying the slides
knitr::opts_chunk$set(
  echo = TRUE, 
  eval = TRUE, 
  fig.width = 7, 
  fig.height = 5, 
  fig.align = "center", 
  fig.path = "figures/",
  size = "tiny", 
  warning = FALSE, 
  results = TRUE, 
  message = FALSE, 
  comment = "")

dir.main <- "~/stat1"
url.main <- "http://jvanheld.github.io/stat1"
dir.current <- file.path(dir.main, "practicals", "02_yeast_annotations")
setwd(dir.current)

```


## Goal of this practical

During this practical session, you will run the following tasks:

1. Handle a table containing annotated features of the yeast genome.
2. Select a subset of the data by filtering rows based on a given criterion (annotation type, chromosome, ...)
3. Generate graphics to represent different aspects of the data. 
4. Compute estimators of central tendency and dispersion. 
5. Compute a confidence interval around the mean. 


## Expected report

At the end of the practical you will be asked to submit two documents 


1. Your **R code**.Each question must be explicitly formulated before presenting the results that answer it and giving an interpretation of these results. 
2. UA  **synthetic report**, which will include a presentation of the main results (figures, descriptive stats, tables) as well as your interpretation of the result. 

### Expectation for the code


1. The code must be **readable and undestandable**: choose variable names that explicitly indicate what they represent.

2. The code must be properly documented (the `#` symbol starts a comment, either at the beginning or in the middle of a line of code).

    - Before each chunk of code, explain what this code is supposed to do, what it serves to.
    
    - Don't hesitate to occasionally add some comment words to justify the chosen approach.
    
    - Each time you define a variable, add a comment on the same line to indicate what this variable represents.

3. The code must be  **portable**: other people should be able to download it and run it on their computer. For this practical, I will systematically test whether your code can run on my computer. hard-coded absolute paths of a file on your machine should thus always be avoided (we will indicate hereafter how to define relative paths relative to the root of your user account). 

### Expected content for the interpretation report

Your report must be synthetic (1 text page max + as many figures and table as you wish)

Each question must be explicitly formulated before presenting the results that answer it and then interpreting those results.  


Each figure or table must be documented with a legend that allows a naive reader to understand what it represents. The interpretation of the results displayed on a figure or table will be found in the main text (with a reference to the figure or table number).
 

## Historical example: yeast genome

```{r load_yeast_cds_lengths, echo=FALSE}
## Read a GTF file
## Format specification: http://www.ensembl.org/info/website/upload/gff.html

## Download tab-delimited file with chromosome sizes (unless already there)
gtf.url <- "http://jvanheld.github.io/stat1/data/Saccharomyces_cerevisiae/Saccharomyces_cerevisiae.R64-1-1.37.gtf"
gtf.file <- file.path(dir.main, "data", "Saccharomyces_cerevisiae", "Saccharomyces_cerevisiae.R64-1-1.37.gtf")

## Only download the file if not already stored on local computer
if (file.exists(gtf.file)) {
  message("GTF annotation file: ", gtf.file)
} else {
  download.file(gtf.url, destfile = gtf.file)
}

feature.table <- read.delim(gtf.file, 
                            comment.char = "#", 
                            sep = "\t", header = FALSE, row.names = NULL)
names(feature.table) <- c("seqname", "source", "feature", "start", "end", "score", "strand", "frame", "attribute")
# nrow(feature.table) ## Count feature number
# View(feature.table)

## Select subset of features having "CDS" as "feature" attribute
cds <- subset(feature.table, feature == "CDS")
## nrow(cds) ## Count the number of "coding sequences", i.e. open reading frames annotated as cds

```

- 1992: publication of the first complete eukaryotic chromosome, the 3rd yeast chromosome. 
- 1996: publication of the complete genome.

On the base of the genes of the 3rd chromosome (sample) we can estimate the average size of a yeast gene. 

**Questions: ** 

(a) Would the sample mean (chromosome III) be sufficient to predict the population mean (complete genome)? 

    To answer this question, we will imagine that we came back in 1992, and will use all the genes of chromosome III (considered here as a sample of the genome) to estimate the average size of genes for the whole genome (the "population" of genes").

(b) Can this sample be described as "simple and independent"?

## Analysis of the length of the baker's yeast genes

```{r cds_length_histo, out.width="90%", fig.width=10, fig.height=4, fig.cap="Distribution of cds lengths for Saccharomyces cerevisiae. ", echo=FALSE}
par.ori <- par(no.readonly = TRUE)
par(mar = c(4.1,4.1,2.1,1.1))

## Add a column to the table with cds lengths
cds$length <- cds$end - cds$start + 1
max.len <- max(cds$length)

## Select cds on the third chromosome
cds.III <- subset(cds, seqname == "III")
# View(cds.III)
par(mfrow = c(1, 2))

## Plot an histogram with cds lengths
hist(cds.III$length, 
     breaks=seq(from=0, to=max.len+100, by=100), 
     main="Chromosome III",
     xlab=NA, ylab="Number of cds", 
     col="#BBDDFF")

## Plot an histogram with cds lengths
hist(cds$length, 
     breaks=seq(from=0, to=max.len+100, by=100), 
     main="Full genome",
     xlab="cds length (bp)", ylab="Number of cds", 
     col="#BBFFDD")
par(mfrow=c(1,1))

par <- par.ori

```


## Tutorial

Before moving to the exercises, we show you here some basic elements about reading, manipulating and writing data tables with R.


### The path to the home (manual)

We will create a folder for this tutorial, starting from the root of our account.  

First possibility (**quick but not very elegant**):  enter (manually) the path from the root of your account in a variable

`dir.home <- /the/path/to/the/home`

- Advantage: fast and convenient
- Disadvantage: not portable, will only work on your computer

### The path to the home (automatic)

A more general solution: use the **R** command  `Sys.getenv()`.

- Invoked without parameters, this command lists all environment variables (your system configuration). 
- The output can be restricted to a given environment variable, for example `Sys.getenv("HOME")` returns the path to the root of your account.

**Note:** equivalent writing with Linux: the tilde symbol `~` also indicates the path to the root of your account. 

```{r get_home}
## Identify the home directory 
## by getting the environment variable HOME 
dir.home <- Sys.getenv("HOME")
print(dir.home)
```


### Creating a folder for the TP

```{r create_practical_dir}
## Define a variable containing the path of the results for this tutorial
dir.tuto <- file.path(dir.home, "stat1", "TP2")

print(dir.tuto)

## Create the directory for this tutorial
dir.create(path = dir.tuto, 
           showWarnings = FALSE, 
           recursive = TRUE)

## Go to the tutorial directory
setwd(dir.tuto)

## List the files already present in the folder (if any)
list.files()

```


### Downloading the GTF file from EnsemblGenomes

**Tips: ** before downloading the annotation file (GTF) from EnsemblGenomes to our computer, we will check if it is already present (and in this case we do not re-download it). 

```{r download_GTF}
## Define the URL of the annotation file (GTF-formatted)
gtf.URL <- "ftp://ftp.ensemblgenomes.org/pub/release-37/fungi/gtf/saccharomyces_cerevisiae/Saccharomyces_cerevisiae.R64-1-1.37.gtf.gz"

## Define the path where the local copy will be stored
local.GTF <- file.path(dir.tuto, "Saccharomyces_cerevisiae.R64-1-1.37.gtf.gz")

## If the local file file laready exists, skip the download
if (file.exists(local.GTF)) {
  message("GTF file already exists in the tutorial folder: ", local.GTF)
} else {
  ## Download annotation table in GTF format
  download.file(url = gtf.URL, destfile = local.GTF)
  message("GTF file downloaded in the tutorial folder: ", local.GTF)
}


```


### Loading a data table

R has several types of tabular structures (matrix, data.frame, table).

The most commonly used structure is the `data.frame`, which consists of an array of values (numeric or strings) whose rows and columns are associated with names. 

The function `read.table()` allows you to read a text file containing a data table, and store the content in a variable.

Several functions derived from `read.table()` make it easier to read different types of formats:  

- `read.delim()` for files whose columns are delimited by a particular character (usually the tab, represented by "\t").
- `read.csv()` for files "comma-separated values".


1. Download the following file to your computer: 

- [Saccharomyces_cerevisiae.R64-1-1.37.gtf](../../data/Saccharomyces_cerevisiae/Saccharomyces_cerevisiae.R64-1-1.37.gtf)

2. Load it using the read.table function (for this you must replace the path below by that of your computer). 

```{r read.table, echo=TRUE, result=TRUE}
## Read a GTF file with yeast genome annotations

## Load the feature table
feature.table <- read.table(
  local.GTF, 
  comment.char = "#", 
  sep="\t", 
  header=FALSE, 
  row.names=NULL)

## The bed format does not contain any column header, 
## so we set it manually based on the description of the format, 
## found here: 
##     http://www.ensembl.org/info/website/upload/gff.html
names(feature.table) <- c("seqname", "source", "feature", "start", "end", "score", "strand", "frame", "attribute")


```

### Exploring the content of a data table

The first thing to do after loading a data table is to check its dimensions.

```{r echo=TRUE, result=TRUE, warning=FALSE}
dim(feature.table) ## Dimensions of the tbale
nrow(feature.table) ## Number of rows
ncol(feature.table) ## Number of columns
```

The display of the complete annotation table would not be very readable, since it contains tens of thousands of lines.  

We can display the first lines with the function `head()`. 

**Note: ** the last column is particularly heavy (it contains a lot of information). We will see later how to select a subset of the columns to simplify the display.  

```{r echo=TRUE, eval=TRUE, result=TRUE}
## Display the 5 first rows of the feature table
head(feature.table, n = 5) 
```

The function `tail()` displays the last few lines:

```{r echo=TRUE, eval=TRUE, result=TRUE}
## Display the 5 last rows of the feature table
tail(feature.table, n = 5) 
```

If you are using the **RStudio** environment, you can display the table in a dynamic viewer pane with the function  `View()`.

```{r echo=TRUE, eval=FALSE}
## In RStudio, display the table in a separate tab
View(feature.table) 
```


### Selection of subsets from a table

Selection of a line specified by its index.

```{r echo=TRUE, eval=TRUE, result=TRUE}
feature.table[12,]
```

Selection of a column specified by its index (display of the first values only).

```{r echo=TRUE, eval=TRUE, result=TRUE}
head(feature.table[,3])
```

Selection of a cell by combining row and column indices.

```{r echo=TRUE, eval=TRUE, result=TRUE}
feature.table[12, 3]
```

Selection of a column and/or row set.

```{r echo=TRUE, eval=TRUE, result=TRUE}
feature.table[100:105, 1:6]
```

Selection of specific columns (here, the genomic coordinates of each feature): chromosome, beginning, end, strand.

```{r echo=TRUE, eval=TRUE, result=TRUE}
feature.table[100:105, c(1,4,5,7)]
```

Select a column based on its name.

```{r echo=TRUE, eval=TRUE, result=TRUE}
## Select the "start" column and print the 100 first results
head(feature.table$start, n = 100)

## Print the 20 first values of the "feature" field, which indicates the feature type
head(feature.table$feature, n = 20)
```

Selection of several columns based on their names. 


```{r echo=TRUE, eval=TRUE, result=TRUE}
## Select the "start" column and print the 100 first results
feature.table[100:106, c("seqname", "start", "end", "strand")]
```

**Note**: Selection of several columns based on their names. It is also possible to name the rows of a data.frame but the GTF table does not support this. We will see more examples later.


### Selection of a subset of rows based on the content of a column

The function `subset()` allows you to select a subset of the rows of a data.frame based on a condition applied to one or more columns.

We can apply it to select the subset of rows in the annotation table corresponding to coding sequences (CDS).

```{r echo=TRUE, eval=TRUE, result=TRUE}
## Select subset of features having "cds" as "feature" attribute
cds <- subset(feature.table, feature == "CDS")

nrow(feature.table) ## Count the number of features
nrow(cds) ## Count the number of cds
```

### Count by value

The function `table()` allows you to count the occurrences of each value in a vector or array. Some examples of use below.  

```{r echo=TRUE, eval=TRUE, results=TRUE}
## Count the number of featues per chromosome
table(feature.table$seqname)

## Count the number of features per type
table(feature.table$feature)
```

We can use the `knitr::kable()` function to include a nicely formatted table in a report. This requires to load the `knitr` library.

```{r echo=TRUE, eval=TRUE, results=TRUE}
## Count the number of features per type
require(knitr)
features.per.type <- table(feature.table$feature)

kable(features.per.type, col.names = c("feature type","Number"), caption = "Number of features of different types in the GTF annotations of the yest genome. ")
```


Contingency tables can be calculated by counting the number of combinations between 2 vectors (or 2 columns of a table). 

```{r echo=TRUE, eval=TRUE, results=TRUE}
##  Table with two vectors
table(feature.table$feature, feature.table$seqname)

## Same result with a 2-column data frame
table(feature.table[, c("seqname", "feature")])

## The same, nicely formatted
kable(table(feature.table[, c("seqname", "feature")]))


```

## Exercises

### 1. GTF format specifications

Read the GTF format specifications.

- Ensembl (<http://www.ensembl.org/info/website/upload/gff.html>)
- UCSC (<https://genome.ucsc.edu/FAQ/FAQformat.html#format4>)


### 2. Creating a local folder for the TP

Create a local folder (for example: `stat1/TP_yeast` from the root of your account). We suggest you to use the following functions. 

- `Sys.getenv("HOME")` (Linux and Mac OS X), to get the root of your user account;
- `file.path()` to build a path;
- `dir.create()` to create the folder for the TP. Read carefully the options of this function with `help(dir.create)` 

(solution is above)

### 3. Locating the annotation file

Locate the yeast genome annotation file in GTF format in this local folder.

- Site Ensembl Fungi: <http://fungi.ensembl.org/>
- Click "Downloads" to access the [ftp website](http://fungi.ensembl.org/info/website/ftp/)
- In the search box, type *"saccharomyces cerevisiae"* and follow the link "[GTF](ftp://ftp.ensemblgenomes.org/pub/release-37/fungi/gtf/saccharomyces_cerevisiae)"
- Copy the address (URL) of the file [Saccharomyces_cerevisiae.R64-1-1.37.gtf.gz](ftp://ftp.ensemblgenomes.org/pub/release-37/fungi/gtf/saccharomyces_cerevisiae/Saccharomyces_cerevisiae.R64-1-1.37.gtf.gz)

(solution above)

### 4. Downloading a file from an ftp website

Suggested functions:

- `download.file()` (read the help to know the arguments)

(solution above)

### 5. Loading a data table in R

Write a script that loads the data table into a variable named `feature.table`, using the function R `read.delim()`. 

Be sure to ignore the comment lines (which start with a character `#`). 

(solution above)

### 6. Compute the length of coding genes

- Add to the annotation table (`feature.table`) a column entitled "length" which indicates the length of each annotated genomic feature.

```{r }
## Add a colmn with feature lengths
feature.table[, "length"] <- feature.table[, "end"] - feature.table[, "start"] + 1

## Add a colmn with feature lengths: equivalent result with simpler notation
feature.table$length <- feature.table$end - feature.table$start + 1
```


- Count the number of rows in the table corresponding to each type of annotation (3rd column of the GTF, "feature").  

    - fonction `table()`

```{r echo=FALSE}
table(feature.table$feature)
```

- Print the same result in a nicely formatted table with `knitr::kable()`./

```{r echo=FALSE}
require(knitr)
kable(table(feature.table$feature))
```


- Select the lines corresponding to coding regions ("CDS")

    - fonction `subset()`

```{r echo=FALSE}
cds <- subset(feature.table, feature == "CDS")
```


- Count the number of CDS per chromosome.

    - fonction `table()`


```{r echo=FALSE}
table(cds$seqname)
kable(table(cds$seqname), col.names = c("Chromosome", "Number of CDSs"))
```

- Load the chromosome size table [chrom_sizes.tsv](../../data/Saccharomyces_cerevisiae/chrom_sizes.tsv), and compute the  density of genes for each chromosome (number of genes per Mb). 

```{r echo=FALSE}
## Download tab-delimited file with chromosome sizes (unless already there)
annot.url <- "http://jvanheld.github.io/stat1/data/Saccharomyces_cerevisiae/chrom_sizes.tsv"
chrom.size.file <- file.path(dir.tuto, "chrom_sizes.tsv")

if (!file.exists(chrom.size.file)) {
    download.file(annot.url, destfile = chrom.size.file)
}

## Read chromosome sizes
chrom.size <- read.delim(
  file = chrom.size.file,
  header = FALSE, row.names = 1)

## Assign a name to the columns
names(chrom.size) <- c("chromID", "size")
# View(chrom.size)

## print the size of hte third chromosome
chrom.size["III", "size"]

## Get the number of genes per chromosome
genes <- subset(x = feature.table, feature == "gene")
nrow(genes)
nrow(cds)
genes.per.chrom <- table(genes$seqname)

## Merge the chromosome sizes and genes per chrom in a single table
chrom.stats <- chrom.size
chrom.stats$genes <- genes.per.chrom[rownames(chrom.stats)]
chrom.stats$density <- chrom.stats$genes * 1e+6 / chrom.stats$size

barplot(chrom.stats$density, horiz = TRUE, names.arg = rownames(chrom.stats), las = 1)

```


### 6. Histogram of gene length

By using the function `hist()`, draw a histogram representing the length distribution of the CDS.

```{r echo=FALSE}
hist(cds$length)
```

Choose the class intervals in a way that the histogram is informative (neither too large nor too few classes).

```{r echo=FALSE}
## Take more or less 100 bins
h <- hist(cds$length, breaks = 100)

```

Retrieve the result of `hist()` in a variable named `cds.length.hist`.

```{r echo=FALSE}
## Define breaks exactly in the way you wish
cds.length.hist <- hist(
  cds$length, 
  breaks = seq(from = 0, 
               to = max(cds$length) + 100, 
               by = 100))

```


Print the result on the screen (`print()`) and analyze the structure of the variable `cds.length.hist` (this is a list variable). 

Useful functions:

```{r echo=FALSE}
## Display the values used to draw the histogram
print(cds.length.hist)

```

- `class(cds.length.hist)`
- `attributes(cds.length.hist)`


Other types of graphs allow you to explore the distribution of a set of data. In particular, box plots display, for a series of data, the median, the quarterfinal range, a confidence interval and outliers.  

In the `boxplot()` function, we use the formula `length ~ seqname` in order to group lengths by seqname (i.e. chromosome names). 

```{r fig.width=7, fig.height=5, fig.cap="Boîte à moustache indiquant la distribution de longueur des gènes par chromosome. ", echo=FALSE}
boxplot(length ~ seqname, 
        data = cds, 
        col = "palegreen", 
        horizontal = TRUE, 
        las = 1, 
        xlab = "Gene length", 
        ylab = "Chromosome")

```

### 7. Descriptive parameters

Calculate the parameters of central tendency (mean, median, mode) and dispersion (variance, standard deviation, inter-quarterly deviation) 

- for the genes of chromosome III;
- for all yeast genes.


```{r echo=FALSE}
x <- subset(cds, seqname == "III", select = "length")
dim(x) 
class(x) ## Ah ah, this is not a vector but a data.frame

## Convert the data frame into a vector
x <- unlist(x)
class(x)
head(x)

## Compute the mean, either manually or with the ad hoc R function
n <- length(x) 
print(paste("Chromosome III contains", n, "CDS"))
message("Chromosome III contains ", n, " CDS")

m <- mean(x)
print(m)
message("mean(x) = ", round(m, digits = 1))

## Compute the mean manually to compare the result
m.recalc <- sum(x)/n
message("Manually computed sample mean: ", round(digits=1, m.recalc))

## Compute manually standard dev of the sample
sample.var <- sum((x - m)^2)/ n
sample.sd <- sqrt(sample.var)
message("Sample standard dev =", round(digits=1, sample.sd))

## Compute an estimate of the population standard deviation
pop.sd.est <- sqrt(sum((x - m)^2) / (n-1))
message("Sample-based estimate of population standard dev =", round(digits=1, pop.sd.est))

## Compute the standard deviation with R function sd()
R.sd <- sd(x)
message("Result of R sd() function =", round(digits=1, R.sd))


```

**Ah ah! (skeptical tone)** The R function `sd()` does **not** compute the standard deviation of the input numbers ($s$), but the **estimate of the standard deviaiton of the population** ($\hat{\sigma}$)

Display these parameters on the histogram of gene length, using the function `arrows()`

### 8. Confidence interval

From genes of chromosome III (considered as the sample available in 1992), calculate a confidence interval around the mean, and formulate the interpretation of this confidence interval. 
Then evaluate whether or not this confidence interval covered the average population (all genes in the yeast genome, which became available 4 years after chromosome III). 

$$ \bar{x} \pm \frac{\hat{\sigma}}{\sqrt(n)} \cdot t_{1-\alpha/2}^{n-1}$$

```{r echo=FALSE}
## Define alpha, the risk
alpha <- 0.05

## Let us get the critical value for the t distribution
help("TDist")

## Which value corresponds to alpha/2 

## Beware ! by default the qt() function return the lower tail
qt(p = alpha/2, df =  n - 1)

## For confidence intervals we need a positive t value, we thus take the upper tail
t.value <- qt(p = alpha/2, df =  n - 1, lower.tail = FALSE)

IC.min <- m - pop.sd.est * t.value /sqrt(n)
IC.max <- m + pop.sd.est * t.value /sqrt(n)

message("Confidence interval: [", 
        round(digits=1, IC.min), 
        ", ", 
        round(digits=1, IC.max), "]")

```

Draw a polygon of frequencies indicating the number of genes per class (class medium). 

```{r echo=FALSE}

## Draw a frequency polygon
plot(cds.length.hist $mids, cds.length.hist $counts, 
     main="Frequency polygon", 
     xlab="Gene length", ylab="Number of genes", 
     type="l", col="darkgreen", lwd=2, xlim=c(0, 5000))
grid()

arrows(x0 = IC.min, y0 = 100, x1 = IC.max, y1=100, length = 0, angle = 00, col="violet", lwd=3)
abline(v = m, col="violet")

pop.mean <- mean(cds$length)
abline(v = pop.mean, col="blue")

legend("topright", legend = c("genome", "chr3"), col = c("blue", "violet"), lwd=1)
```


### 9. Distribution of gene length

- From the result of `hist()`, retrieve an array (in a variable of type `data.frame`) indicating the absolute frequencies (`count`) according to the median class size (`mids`),

- Add to this table a column indicating the relative frequency of each class of gene length.

- Add columns to this table indicating the **empirical distribution function** gene lengths (number of genes of a size less than or equal to each observed $x$ value, and relative frequency of this number). 

    - basic function: `cumsum()`
    - advanced function:`ecdf()`
    
- by using the functions `plot()` and `lines()`, draw a graph representing the absolute frequency per class (medians of classes in $X$, counts in $Y$), and the empirical distribution function. 

    - suggestion: superposez les  ??utilisez le type de lignes "h" pour les fréquences de classe, et "l" ou "s" pour la fonction de répartition. 


### 10. Expected distribution of gene lengths

Based on the genome size ($12.156.679$ bp) and genomic frequencies of the codons provided in the table below, calculate the gene length distribution that would be expected by chance, and draw it on top of the graph with the observed distribution of gene lengths. 

Note: the genomic frequencies of all polynucleotides can be downloaded here:  [3nt_genomic_Saccharomyces_cerevisiae-ovlp-1str.tab](http://jvanheld.github.io/stat1/data/Saccharomyces_cerevisiae/oligo_freq/3nt_genomic_Saccharomyces_cerevisiae-ovlp-1str.tab)

Alternative: create a variable `freq.3nt` and manually assign the values for the 4? required polynucleotides from the table below.


```{r echo=FALSE}
## Download tab-delimited file with chromosome sizes (unless already there)
freq.3nt.url <- "http://jvanheld.github.io/stat1/data/Saccharomyces_cerevisiae/oligo_freq/3nt_genomic_Saccharomyces_cerevisiae-ovlp-1str.tab"
freq.3nt.file <- file.path(dir.tuto, "3nt_genomic_Saccharomyces_cerevisiae-ovlp-1str.tab")

if (file.exists(freq.3nt.file)) {
  message("Trinucleotide frequency file: ", freq.3nt.file)
} else {
    download.file(freq.3nt.url, destfile = freq.3nt.file)
}


## Download oligonucleotide frequencies
oligo.freq <- read.table(freq.3nt.file, sep="\t", row.names = NULL)
colnames(oligo.freq) <- c("sequence", "frequency", "occurrences")
oligo.freq$frequency <- signif(digits=3, oligo.freq$frequency)

row.names(oligo.freq) <- oligo.freq$sequence

#oligo.freq["...",] <- c("...", "...", "...")

selected.oligos <- c("AAA", "ATG", "TAA", "TAG", "TGA", "TTT")

kable(data.frame(oligo.freq[selected.oligos, ]), row.names = FALSE)
#print(data.frame(oligo.freq[selected.oligos, ]))
```


```{r exp_orf_length_distrib, fig.width = 10, fig.height=12, out.width="100%", fig.cap="Distribution of the number of ORFs expected by chance in a random genomic sequence having the same codon frequencies as the yeast genome. "}
## Compute the probabilities of start, stop ,and not-stop codons
P <- c("start" = oligo.freq["ATG", "frequency"],
       "stop" = sum(oligo.freq[c("TAA", "TGA", "TAG"), "frequency"])
       )
P["not-stop"] <- 1 - P["stop"]


## Rounded number encompassing the number of codons of the longest gene
max.n <- 100 * ceiling(max(cds$length) / 300) 
n <- 1:max.n # A vector with all relevant length in numbers of codons
L <- 3*n # Gene lengths in base pairs

## Probability of observing an ORF of exactly n codons
length.proba.density <- P["start"] * P["not-stop"]^n * P["stop"]
length.pvalue <- rev(cumsum(rev(length.proba.density)))

## Compute the random expectation for the number of genes
## Note: genes can be found on both strands -> we multiply by 2
G <- 12156679 ## Genome length
exp.genes <- length.pvalue * G * 2


## Plot the open reading frame (ORF) probability as a
## function of ORF length (in base pairs)
par(mfrow = c(3,2))
plot(L, length.proba.density, type = "h",
     las = 1, col = "blue",
     main = "ORF length probability density (full range)",
     xlab = "ORF length (base pairs)",
     ylab = "P(X = x)")

plot(L, length.proba.density, type = "h", xlim = c(0,600),
     las = 1, col = "blue",
     main = "ORF length probability density (restricted range)",
     xlab = "ORF length (base pairs)",
     ylab = "P(X = x)")

plot(L, length.pvalue, type = "l", xlim = c(0,600),
     las = 1, col = "darkgreen", lwd = 2, panel.first = grid(),
     main = "ORF length P-value",
     xlab = "ORF length (base pairs)",
     ylab = "P(X >= x)")

plot(L, exp.genes, type = "l", xlim = c(0,600),
     las = 1, col = "darkgreen", lwd = 2, panel.first = grid(),
     main = "Expected number of ORFs (restricted range)",
     xlab = "ORF length (base pairs)",
     ylab = "Expected ORFs")

plot(L, length.pvalue, type = "l", 
     las = 1, col = "darkgreen", lwd = 2, panel.first = grid(),
     log = "y", xlim = c(0, 600), ylim = c(length.pvalue[201], 1),
     main = "ORF length P-value",
     xlab = "ORF length (base pairs)",
     ylab = "P(X >= x) on a log scale")

plot(L, exp.genes, type = "l", 
     las = 1, col = "darkgreen", lwd = 2, panel.first = grid(),
     log = "y", xlim = c(0, 600), ylim = c(exp.genes[201], exp.genes[1]),
     main = "ORF length E-value (restricted range)",
     xlab = "ORF length (base pairs)",
     ylab = "Expected ORFs (log scale)")
par(mfrow = c(1,1))


```


The **top-left panel** shows the **density of probability** of ORF lengths, i.e. the probability to observe by chance an ORF of exactly $x$ nucleotides: $P(X = x)$. The shape of the distribution is not very well depicted because the range extends up to $15,000$ base pairs, the length of the longest yeast gene. This gene is an outlier (exceptionally long, not representative of the other yeast genes). 

The **top right panel** shows the same distribution with a range restricted to 0-600 bp. 

The **middle left panel** shows the distribution of **P-value** for ORF lengths $x$ ranging from 0 to 600: $P(X \ge x)$. This is the probability, for each length $x$, to find by chance a gene at least as long starting at a given genomic position. 

The **middle right panel** shows the **E-value** $E(X \ge x)$, i.e. the number of ORFs expected by chance in the whole genome, for length $x$ ranging from 0 to 600. 

The **bottom** panels show the same distributions of P-value (left) and E-value (right) on a logarithmic scale, to better emphasize the very small probabilities. 

Of note, with a threshold $X \ge 300$, we still expect `r exp.genes[100]` ORFs at random. Since this threshold was used to infer the presence of an ORF in the original annotation of the yeast genome, biologists knew that these annotations would contain an important number of false ORF predictions. Consistently, several hundreds of genes were discarded from the annotations a few years later, based on comparative genomics. Indeed, when the genomes of other fungal species became available, the genes for which no homologs was found in any other fungal genome were considered likely false positives. 


### 11. Before finishing: keep track of your session

Tractability is an essential issue in science. 
The function ***R***  `sessionInfo()` provides a summary of the conditions of a work session: version of R, operator system, libraries of functions used. 


```{r}
sessionInfo()
```