---
title: "First steps with R"
author: "Jacques van Helden"
date: '`r Sys.Date()`'
output:
  slidy_presentation:
    self_contained: no
    fig_caption: yes
    fig_height: 6
    fig_width: 7
    highlight: tango
    incremental: no
    keep_md: no
    smaller: yes
    theme: cerulean
    toc: yes
    toc_float: yes
    widescreen: yes
  ioslides_presentation:
    colortheme: dolphin
    fig_caption: yes
    fig_height: 6
    fig_width: 7
    fonttheme: structurebold
    highlight: tango
    incremental: no
    keep_md: no
    smaller: yes
    theme: cerulean
    toc: yes
    widescreen: yes
  beamer_presentation:
    colortheme: dolphin
    fig_caption: yes
    fig_height: 6
    fig_width: 7
    fonttheme: structurebold
    highlight: tango
    incremental: no
    keep_tex: no
    slide_level: 2
    theme: Montpellier
    toc: yes
  pdf_document:
    fig_caption: yes
    highlight: zenburn
    toc: yes
    toc_depth: 3
  html_document:
    code_folding: show
    self_contained: no
    fig_caption: yes
    highlight: zenburn
    theme: cerulean
    toc: yes
    toc_depth: 3
    toc_float: yes
  word_document:
    toc: yes
    toc_depth: 3
font-import: http://fonts.googleapis.com/css?family=Risque
font-family: Garamond
subtitle: Probabilités et statistique pour la biologie (STAT1)
address: TAGC lab, Aix-Marseille Université, France
transition: linear
editor_options: 
  chunk_output_type: console
---

```{r setup, include=FALSE, size="huge"}
library(knitr)
## Default parameters for displaying the slides
knitr::opts_chunk$set(
  echo = TRUE, 
  eval=TRUE, 
  fig.width = 7, 
  fig.height = 5, 
  fig.align = "center", 
  fig.path = "figures/",
  size = "tiny", 
  warning = FALSE, 
  results = TRUE, 
  message = FALSE, 
  comment = "")

dir.main <- "~/stat1"
dir.current <- file.path(dir.main, "practicals", "01_intro_R")
setwd(dir.current)

```


## Goal of this tutorial

This tutorial aims at discovering the fundamental elements of the ***R*** statistical language. 


We will briefly survey the following concepts.

- Handling variables

    - Assigning a value to a variable
    - Basic operations on numbers
    - Basic data structures
    
        - vectors
        - matrices
        - data.frames
        - lists
    
- Using functions
    
- Graph drawing

- Distributions of probabilities


## Using ***R*** as a calculator

**Example:** an addition with R.

At the R prompt, type the following instruction and press the Enter key. 

```{r}
2 + 5
```

The result ($7$) of the addition is printed out, preceded by an index $[1]$ (we will explain later why this index appears). 

## Assigning a value to a variable

In ***R***, the succession of a the hyphen and "smaller than" characters (**`<-`**) serves to assign a value to a variable. If the variable does not exist yet, it is created.

For example

```{r}
a <- 2
```

creates a variable named $a$, and assigns it the value $2$. 


The result can be displayed with the  `print()` function. 

```{r}
print(a)
```

**Remark: ** R also allows to use the equal symbol (`=`) to assign a value to a variable. However, we prefer to use the original assignation (`<-`), to follow the [R style recommendations](https://google.github.io/styleguide/Rguide.html). 

## Naming and syntactic conventions in R

A priori, are several conventions can be envisaged to ensure a consistent naming of variables, functions, operators, etc. 

For each programming language, the community of programmers defines some standard(s) to ensure a consistency of the published code. 

For this course, we will follow the recommendations of hte **Google R style guide:**

<https://google.github.io/styleguide/Rguide.xml>

However, for  [variable idendifiers](https://google.github.io/styleguide/Rguide.xml#identifiers), the traditional notation `variable.name` raises some issues for programmers who are familiar with object-oriented languages (e.g. java, python), where the point serves to apply a method (that follows the point) to an object (that precedes the point). 

To avoid this confusion, we will use the alternative so-called *camel back* notation (e.g. `variableName`). Attention, according to this convention, variable names always start with a lower case, whereas function / method names start with an uppercase. 

## Computing with variables

- Create a variable named $b$ with value $5$
- Compute  $a + b$ and store the result in a variable named $c$
- Print the result

```{r}
b <- 5
c <- a + b
print(a)
print(b)
print(c)
```

## Assignation $\neq$ equality - Exercise

- Replace the value of $a$ by $3$
- Print out the value of $c$
- Is-it still true that $c = a + b$? Why?

## Assignation $\neq$ equality - Solution

```{r}
a <- 3 ## Change the value of a
print(a)
print(b)
print(c)

## Check whether c equals a + b
c == a + b
```

Interpretation: **`==`** tests whether two variables have the same content. The result is a logical value (TRUE or FALSE). 

## Recomputing a result

When the content of a given variable $a$ is changed, another variable ($c$) previously computed from it has no reason to be recomputed if not explicitly requested. 


**Example:** 

- Replace the value of $a$ by $27$, 
- Recompute the value of $c$
- Test the equality $c = a + b$

```{r}
a <- 27 ## Change the value of a
c <- a + b
print(c) ## Print the value of c

## Check whether c  equals a + b
c == a + b
```

## Vectors of values

In  ***R***, the simplest data structure is a **vector**. 

- In the previous example, the variable $a$ contained a single number, but in practice it was stored in a single-entry vector. 

- The R function `print()` displays the indices at the beginning of each row. This is useful when displaying a vector with alarge number of entries. 


**Example:** create a variable named  ***threeNumbers***, and initialise it with a vector containing the values ***27***, ***12*** and ***3000***.

**Tips:**

- the function `c()` combines several values into a vector.

```{r}
threeNumbers <- c(27,12,3000)
print(threeNumbers)
```


## Series of numbers

The simplest way to create a series of number is to use the column character `:`, which generates all integer values between two boundaries; 

```{r}
x <- 0:30
print(x)
```

**Note: ** if the printout of the values extends beyond the width of the console, R goes to the next row but displays between square brackets the index of the first element at the beginning of the new row. 

Another example


```{r}
print(58:157)
```


## Computing with vectors


**R** enables to handle vectors in a every practical ways: mathematical operations involving a vector automatically apply to all its elements.


```{r}
x <- 1:10 # Define a series from 1 to 10
print(x)
y <- x^2 # Compute the square of each number
print(y)
```


## Sequences of numbers

The function `seq()` enables to generate series of bumbers separated by an arbitrary interval. 

```{r}
seq(from=-1, to=1, by=0.1)
```

## Variables can also contain text

Variables are not restricted to numbers: they can contain text (strings of characters). 


We will now use the function `c()` to combine severak character strings into a vector. 


```{r}
# The # symbol allows to insert comments in R code

# Define  a vector named "whoami", and 
# containing two names
whoami <- c("Denis", "Siméon")
print(whoami) # Comment at the end of a line
```

## Sring concatenation

The function `paste()` enables to concatenate string-containing variables.  

```{r}
# To concatenate the elements of a vector in a single chain, use "collapse"
firstName <- paste(collapse = " ", whoami)
print(firstName)

# TO concatenate two vectors, use "sep"
lastName <- "Poisson"
print(paste(sep = " ", firstName, lastName))

## Concatenate 2 vectors with 3 values each
firstNames <- c("George", "Alfred", "Frédéric")
lastNames <- c("Sand",  "Musset", "Chopin")
fullNames <- paste(sep = " ", firstNames, lastNames)
print(fullNames)

```


Note that the `paste()` functions can also be used to concatenate all the values of a given vector, but this requires to use the `collapse` argument insteead of `sep`. 


```{r}
paste(fullNames, collapse = ", ")
```


## Graphical functions

R includes a large number of functions enabling to draw simple or elaborate graphics. We explore hereafter the simplest methods.

## Scatter plot

```{r scatter_plot, fig.width=7, fig.height=5, fig.align="center"}
x <- seq(from = -10, to = 10, by = 0.1)
y <- x^2
plot(x,y)
```

## Curve (line plot)


```{r line_plot, fig.width=7, fig.height=5, fig.align="center"}
x <- seq(from = -10, to = 10, by = 0.1)
y <- x^2
plot(x,y, type="l")
```


## Improving a graphics

**Exercise: ** on the R console, type `help(plot)`, read the help of the `plot()` function, and explore the parameters in order to improve the previous graphics. Also consult the help, function for the graphical parameters (`help(par`). 


You could for example, attempt to add the following elements to the figure: 

- title
- axis labels
- line color
- line width
- grid
- trace an horizontal line to mark the Y axis at coordinate $X=0$
- trace a vertical line to mark the X axis at coordinate $Y=0$
- any other parameter that will improve the readability / interpretability of the resulting Figure

An example of solution is shown in the next section (don't look before having done the exercise!).

## Solution

Run the following chunk of code to generate the improved figure. 


```{r line_plot_improved, fig.width=7, fig.height=5, fig.align="center", echo = TRUE, eval = FALSE}
x <- seq(from = -10, to = 10, by = 0.1)
y <- x^2
plot(x,y, 
     type="l", # Plot type
     main = "Parabole", # Main title
     xlab = "x", #X label
     ylab = "y = x^2", # Y label
     col = "blue", # Curve color
     lwd = 3, # Line width
     las = 1 # display axis labels horizontally 
     )
grid(lty = "dashed", col="gray") # Grid
abline(h = 0) # Horizontal line
abline(v = 0) # Vertical line

```


## Probability distributions

For each one of the classical distribution of probabilities, **R** provides 4 functions-. 


Before going any further, read carefully the help for the functions associated to the binomial distribution. 


```{r binomial_help}
help("Binomial")

```

**Questions:** 

1. Which R function enables to compuote the probabilty mass function (also called "density" for convenience)?

2. Which R function corresponds to the cumulative distribution function (CDF)?

2. What is the role of the function *rbinom()*?

## Drawing the binomial distribution


**Exercise:** assuming a DNA seqsuence with equiprobable nucleotides, draw the expected distribution for the number of Adenin ($A$) residues in an  oligonucleotide of length 30 (count the adenines on a single strand).  

- probability mass funciton
- cumulative distribution

In the forthcoming slides, we will guide you step by step to start the exercise, and you will then be invited to improve the result at your will. 

## Formula of the solution

The number of adenins can take any value between $0$ and $30$. The problem can be modeled as  Bernoulli Schema with $n=30$ trials, which can each result in either a success (observing and adenin, with $p=0.25$) or a failure (any other nucleotide, with probability $q = 1-p = 0.75$). 
The probability to observe exactly $x$ adenins is thus:

$$P(X=x) = \binom{n}{x} p^x (1-p)^{n-x} = \frac{30!}{x!(30-x)!} \cdot 0.25^x \cdot 0.75^{n-x}$$
where $x$ can take any value between  $0$ and $30$. 

## Computation of the probability mass function

```{r binomial_PMF}
## Define all possible values for X
n <- 30
x <- 0:n
p <- 0.25

## Compute the binomial PMF
pmf <- dbinom(x = x, size = n, prob = p)
```

## Selecting a subset of entries from a vector

We can select the 4 first values of the `pmf`variable defined above (which correspond to number of successes $x$ from $0$ to $4$) ...

```{r}
pmf[1:5]
```

... or the 4 last values ($x$ values from $27$ to $30$ succeses).

```{r}
pmf[(n-3):n]
```

## Restricting the number of decimals

The function `round()` rounds a result with a given number of digits. 

```{r}
round(pmf, digit=3)
```

However, this is not always the best solution. Indeed, probability values can rapidly reach very small values, and we would thus not like to replace all these values by $0$ if they are lower than the number of decimals. We would like to see both the order of magnitude, and to show the value with  a reasonable number of significant digits. For this, we can use the  `signif()` function.  

```{r}
signif(pmf, digit=3)
```


## Drawing the binomial distribution


```{r dbinom_plot, fig.width=7, fig.height=4, fig.align="center"}
n <- 30; x <- 0:n   # Define the X values from 0 to 14
y <- dbinom(x = x, size = n, prob = 0.25) # Poisson density
plot(x,y) # Check the result
```

This first drawing is not very elegant. The empty circles do not make it easy to perceive the general shape of the distribution. We can attempt to improve its interpretability. 

## Exercise 1: improving the drawing of the binomial

Use the following options of the `plot()` function to highlight the shape of the distribution. To give you an idea of the target, the expected result is displayed on the next slide.

- Choose a type of points (option `type`) that reflects the height of the probability (valeur $Y$) at the different values of $X$.

- Add a title (option `main`) and specify relevant axis legends  (options `xlab` and  `ylab`)

- Color the drawing (option `col`)

- Thicken the lines (option `lwd`)

- Add an horizontal grid (function `grid()`)

- make sure that the axis labels are horizontal (option `las`).


## Expected result

```{r dbinom_plot_improved, fig.width=7, fig.height=4, out.width="80%", fig.align="center", echo = FALSE}
n <- 30; x <- 0:n   # Define the X values from 0 to 14
y <- dbinom(x = x, size = n, prob = 0.25) # Poisson density
plot(x, y,
     main = "Binomial distribution",
     xlab = "X (Number of successes)",
     ylab = "P(X)",
     col = 'blue',
     lwd = 6,
     las = 1,
     type = "h")
grid(lty = "solid", col="gray")


```


## Exercise: series of binomial curves

Draw a series of binomial curves with $n=30$ trials, and probabilities ranging from $p=0.1$ to $p=0.9$ by steps of $0.1$. 

```{r binomial_series, fig.width = 12, fig.height = 9, out.width = "60%", echo = FALSE}
par(mfrow = c(3,3))
n <- 30  # Number of trials
x <- 0:n # Number of successes
for (p in seq(from = 0.1, to = 0.9, by = 0.1)) {
  plot(x, dbinom(x, prob = p, size = n),
       main = paste("p =", p),
       ylab = "P(X)",
       type = "h", lwd = 2, col = "blue"
       )
}

par(mfrow = c(1,1))

```


## Exercise: convergence of the binomial to the normal

```{r fig.width = 12, fig.height = 6, out.width="60%", echo = FALSE}
n <- 10
p <- 0.4

#' @title PLot a binomial disrtribution
#' @param size number of trials
#' @param prob probability of success at each trial
BinomPlot <- function(size, prob, ...) {
  x <- 0:size
  
  mean  <- prob * size
  binomVar <- size * prob * (1 - prob)
  binomSD <- sqrt(binomVar)
  plotXLim <- c(mean - 4 * binomSD, 
            mean + 4 * binomSD)

  y <- dbinom(x = x, size = size, prob = prob)
  plot(x, y, type="h", col = "gray", xlim = plotXLim, ...)

  
  abline(v=mean)
  
  abline(v = mean - binomSD, lty = "dashed", col = "blue")
  abline(v = mean + binomSD, lty = "dashed", col = "blue")
  
}


par(mfrow=c(2,3))
for (n in c(10, 100, 1000, 1000, 10000, 10000)) {
  BinomPlot(prob = 0.4, size = n)
  x <- 0:n
  
  ## Draw a normal curve above the binomial
  lines(x, dnorm(x, mean = p * n, sd = sqrt(p * (1 - p ) * n)), type = "l", col = "red")
}
par(mfrow=c(1,1))

```


## Before finishing: keep a trace of your session!

Traceability and reproducibility of the analyses are crucial issues for sciences. The  ***R*** function `sessionInfo()` summarises the software environment used for the current a working session: R version, operating system, loaded libraries/packages and their versions.


```{r}
sessionInfo()
```

*******