--- title: "First steps with R" author: "Jacques van Helden" date: '`r Sys.Date()`' output: slidy_presentation: self_contained: no fig_caption: yes fig_height: 6 fig_width: 7 highlight: tango incremental: no keep_md: no smaller: yes theme: cerulean toc: yes toc_float: yes widescreen: yes ioslides_presentation: colortheme: dolphin fig_caption: yes fig_height: 6 fig_width: 7 fonttheme: structurebold highlight: tango incremental: no keep_md: no smaller: yes theme: cerulean toc: yes widescreen: yes beamer_presentation: colortheme: dolphin fig_caption: yes fig_height: 6 fig_width: 7 fonttheme: structurebold highlight: tango incremental: no keep_tex: no slide_level: 2 theme: Montpellier toc: yes pdf_document: fig_caption: yes highlight: zenburn toc: yes toc_depth: 3 html_document: code_folding: show self_contained: no fig_caption: yes highlight: zenburn theme: cerulean toc: yes toc_depth: 3 toc_float: yes word_document: toc: yes toc_depth: 3 font-import: http://fonts.googleapis.com/css?family=Risque font-family: Garamond subtitle: Probabilités et statistique pour la biologie (STAT1) address: TAGC lab, Aix-Marseille Université, France transition: linear editor_options: chunk_output_type: console --- ```{r setup, include=FALSE, size="huge"} library(knitr) ## Default parameters for displaying the slides knitr::opts_chunk$set( echo = TRUE, eval=TRUE, fig.width = 7, fig.height = 5, fig.align = "center", fig.path = "figures/", size = "tiny", warning = FALSE, results = TRUE, message = FALSE, comment = "") dir.main <- "~/stat1" dir.current <- file.path(dir.main, "practicals", "01_intro_R") setwd(dir.current) ``` ## Goal of this tutorial This tutorial aims at discovering the fundamental elements of the ***R*** statistical language. We will briefly survey the following concepts. - Handling variables - Assigning a value to a variable - Basic operations on numbers - Basic data structures - vectors - matrices - data.frames - lists - Using functions - Graph drawing - Distributions of probabilities ## Using ***R*** as a calculator **Example:** an addition with R. At the R prompt, type the following instruction and press the Enter key. ```{r} 2 + 5 ``` The result ($7$) of the addition is printed out, preceded by an index $[1]$ (we will explain later why this index appears). ## Assigning a value to a variable In ***R***, the succession of a the hyphen and "smaller than" characters (**`<-`**) serves to assign a value to a variable. If the variable does not exist yet, it is created. For example ```{r} a <- 2 ``` creates a variable named $a$, and assigns it the value $2$. The result can be displayed with the `print()` function. ```{r} print(a) ``` **Remark: ** R also allows to use the equal symbol (`=`) to assign a value to a variable. However, we prefer to use the original assignation (`<-`), to follow the [R style recommendations](https://google.github.io/styleguide/Rguide.html). ## Naming and syntactic conventions in R A priori, are several conventions can be envisaged to ensure a consistent naming of variables, functions, operators, etc. For each programming language, the community of programmers defines some standard(s) to ensure a consistency of the published code. For this course, we will follow the recommendations of hte **Google R style guide:** However, for [variable idendifiers](https://google.github.io/styleguide/Rguide.xml#identifiers), the traditional notation `variable.name` raises some issues for programmers who are familiar with object-oriented languages (e.g. java, python), where the point serves to apply a method (that follows the point) to an object (that precedes the point). To avoid this confusion, we will use the alternative so-called *camel back* notation (e.g. `variableName`). Attention, according to this convention, variable names always start with a lower case, whereas function / method names start with an uppercase. ## Computing with variables - Create a variable named $b$ with value $5$ - Compute $a + b$ and store the result in a variable named $c$ - Print the result ```{r} b <- 5 c <- a + b print(a) print(b) print(c) ``` ## Assignation $\neq$ equality - Exercise - Replace the value of $a$ by $3$ - Print out the value of $c$ - Is-it still true that $c = a + b$? Why? ## Assignation $\neq$ equality - Solution ```{r} a <- 3 ## Change the value of a print(a) print(b) print(c) ## Check whether c equals a + b c == a + b ``` Interpretation: **`==`** tests whether two variables have the same content. The result is a logical value (TRUE or FALSE). ## Recomputing a result When the content of a given variable $a$ is changed, another variable ($c$) previously computed from it has no reason to be recomputed if not explicitly requested. **Example:** - Replace the value of $a$ by $27$, - Recompute the value of $c$ - Test the equality $c = a + b$ ```{r} a <- 27 ## Change the value of a c <- a + b print(c) ## Print the value of c ## Check whether c equals a + b c == a + b ``` ## Vectors of values In ***R***, the simplest data structure is a **vector**. - In the previous example, the variable $a$ contained a single number, but in practice it was stored in a single-entry vector. - The R function `print()` displays the indices at the beginning of each row. This is useful when displaying a vector with alarge number of entries. **Example:** create a variable named ***threeNumbers***, and initialise it with a vector containing the values ***27***, ***12*** and ***3000***. **Tips:** - the function `c()` combines several values into a vector. ```{r} threeNumbers <- c(27,12,3000) print(threeNumbers) ``` ## Series of numbers The simplest way to create a series of number is to use the column character `:`, which generates all integer values between two boundaries; ```{r} x <- 0:30 print(x) ``` **Note: ** if the printout of the values extends beyond the width of the console, R goes to the next row but displays between square brackets the index of the first element at the beginning of the new row. Another example ```{r} print(58:157) ``` ## Computing with vectors **R** enables to handle vectors in a every practical ways: mathematical operations involving a vector automatically apply to all its elements. ```{r} x <- 1:10 # Define a series from 1 to 10 print(x) y <- x^2 # Compute the square of each number print(y) ``` ## Sequences of numbers The function `seq()` enables to generate series of bumbers separated by an arbitrary interval. ```{r} seq(from=-1, to=1, by=0.1) ``` ## Variables can also contain text Variables are not restricted to numbers: they can contain text (strings of characters). We will now use the function `c()` to combine severak character strings into a vector. ```{r} # The # symbol allows to insert comments in R code # Define a vector named "whoami", and # containing two names whoami <- c("Denis", "Siméon") print(whoami) # Comment at the end of a line ``` ## Sring concatenation The function `paste()` enables to concatenate string-containing variables. ```{r} # To concatenate the elements of a vector in a single chain, use "collapse" firstName <- paste(collapse = " ", whoami) print(firstName) # TO concatenate two vectors, use "sep" lastName <- "Poisson" print(paste(sep = " ", firstName, lastName)) ## Concatenate 2 vectors with 3 values each firstNames <- c("George", "Alfred", "Frédéric") lastNames <- c("Sand", "Musset", "Chopin") fullNames <- paste(sep = " ", firstNames, lastNames) print(fullNames) ``` Note that the `paste()` functions can also be used to concatenate all the values of a given vector, but this requires to use the `collapse` argument insteead of `sep`. ```{r} paste(fullNames, collapse = ", ") ``` ## Graphical functions R includes a large number of functions enabling to draw simple or elaborate graphics. We explore hereafter the simplest methods. ## Scatter plot ```{r scatter_plot, fig.width=7, fig.height=5, fig.align="center"} x <- seq(from = -10, to = 10, by = 0.1) y <- x^2 plot(x,y) ``` ## Curve (line plot) ```{r line_plot, fig.width=7, fig.height=5, fig.align="center"} x <- seq(from = -10, to = 10, by = 0.1) y <- x^2 plot(x,y, type="l") ``` ## Improving a graphics **Exercise: ** on the R console, type `help(plot)`, read the help of the `plot()` function, and explore the parameters in order to improve the previous graphics. Also consult the help, function for the graphical parameters (`help(par`). You could for example, attempt to add the following elements to the figure: - title - axis labels - line color - line width - grid - trace an horizontal line to mark the Y axis at coordinate $X=0$ - trace a vertical line to mark the X axis at coordinate $Y=0$ - any other parameter that will improve the readability / interpretability of the resulting Figure An example of solution is shown in the next section (don't look before having done the exercise!). ## Solution Run the following chunk of code to generate the improved figure. ```{r line_plot_improved, fig.width=7, fig.height=5, fig.align="center", echo = TRUE, eval = FALSE} x <- seq(from = -10, to = 10, by = 0.1) y <- x^2 plot(x,y, type="l", # Plot type main = "Parabole", # Main title xlab = "x", #X label ylab = "y = x^2", # Y label col = "blue", # Curve color lwd = 3, # Line width las = 1 # display axis labels horizontally ) grid(lty = "dashed", col="gray") # Grid abline(h = 0) # Horizontal line abline(v = 0) # Vertical line ``` ## Probability distributions For each one of the classical distribution of probabilities, **R** provides 4 functions-. Before going any further, read carefully the help for the functions associated to the binomial distribution. ```{r binomial_help} help("Binomial") ``` **Questions:** 1. Which R function enables to compuote the probabilty mass function (also called "density" for convenience)? 2. Which R function corresponds to the cumulative distribution function (CDF)? 2. What is the role of the function *rbinom()*? ## Drawing the binomial distribution **Exercise:** assuming a DNA seqsuence with equiprobable nucleotides, draw the expected distribution for the number of Adenin ($A$) residues in an oligonucleotide of length 30 (count the adenines on a single strand). - probability mass funciton - cumulative distribution In the forthcoming slides, we will guide you step by step to start the exercise, and you will then be invited to improve the result at your will. ## Formula of the solution The number of adenins can take any value between $0$ and $30$. The problem can be modeled as Bernoulli Schema with $n=30$ trials, which can each result in either a success (observing and adenin, with $p=0.25$) or a failure (any other nucleotide, with probability $q = 1-p = 0.75$). The probability to observe exactly $x$ adenins is thus: $$P(X=x) = \binom{n}{x} p^x (1-p)^{n-x} = \frac{30!}{x!(30-x)!} \cdot 0.25^x \cdot 0.75^{n-x}$$ where $x$ can take any value between $0$ and $30$. ## Computation of the probability mass function ```{r binomial_PMF} ## Define all possible values for X n <- 30 x <- 0:n p <- 0.25 ## Compute the binomial PMF pmf <- dbinom(x = x, size = n, prob = p) ``` ## Selecting a subset of entries from a vector We can select the 4 first values of the `pmf`variable defined above (which correspond to number of successes $x$ from $0$ to $4$) ... ```{r} pmf[1:5] ``` ... or the 4 last values ($x$ values from $27$ to $30$ succeses). ```{r} pmf[(n-3):n] ``` ## Restricting the number of decimals The function `round()` rounds a result with a given number of digits. ```{r} round(pmf, digit=3) ``` However, this is not always the best solution. Indeed, probability values can rapidly reach very small values, and we would thus not like to replace all these values by $0$ if they are lower than the number of decimals. We would like to see both the order of magnitude, and to show the value with a reasonable number of significant digits. For this, we can use the `signif()` function. ```{r} signif(pmf, digit=3) ``` ## Drawing the binomial distribution ```{r dbinom_plot, fig.width=7, fig.height=4, fig.align="center"} n <- 30; x <- 0:n # Define the X values from 0 to 14 y <- dbinom(x = x, size = n, prob = 0.25) # Poisson density plot(x,y) # Check the result ``` This first drawing is not very elegant. The empty circles do not make it easy to perceive the general shape of the distribution. We can attempt to improve its interpretability. ## Exercise 1: improving the drawing of the binomial Use the following options of the `plot()` function to highlight the shape of the distribution. To give you an idea of the target, the expected result is displayed on the next slide. - Choose a type of points (option `type`) that reflects the height of the probability (valeur $Y$) at the different values of $X$. - Add a title (option `main`) and specify relevant axis legends (options `xlab` and `ylab`) - Color the drawing (option `col`) - Thicken the lines (option `lwd`) - Add an horizontal grid (function `grid()`) - make sure that the axis labels are horizontal (option `las`). ## Expected result ```{r dbinom_plot_improved, fig.width=7, fig.height=4, out.width="80%", fig.align="center", echo = FALSE} n <- 30; x <- 0:n # Define the X values from 0 to 14 y <- dbinom(x = x, size = n, prob = 0.25) # Poisson density plot(x, y, main = "Binomial distribution", xlab = "X (Number of successes)", ylab = "P(X)", col = 'blue', lwd = 6, las = 1, type = "h") grid(lty = "solid", col="gray") ``` ## Exercise: series of binomial curves Draw a series of binomial curves with $n=30$ trials, and probabilities ranging from $p=0.1$ to $p=0.9$ by steps of $0.1$. ```{r binomial_series, fig.width = 12, fig.height = 9, out.width = "60%", echo = FALSE} par(mfrow = c(3,3)) n <- 30 # Number of trials x <- 0:n # Number of successes for (p in seq(from = 0.1, to = 0.9, by = 0.1)) { plot(x, dbinom(x, prob = p, size = n), main = paste("p =", p), ylab = "P(X)", type = "h", lwd = 2, col = "blue" ) } par(mfrow = c(1,1)) ``` ## Exercise: convergence of the binomial to the normal ```{r fig.width = 12, fig.height = 6, out.width="60%", echo = FALSE} n <- 10 p <- 0.4 #' @title PLot a binomial disrtribution #' @param size number of trials #' @param prob probability of success at each trial BinomPlot <- function(size, prob, ...) { x <- 0:size mean <- prob * size binomVar <- size * prob * (1 - prob) binomSD <- sqrt(binomVar) plotXLim <- c(mean - 4 * binomSD, mean + 4 * binomSD) y <- dbinom(x = x, size = size, prob = prob) plot(x, y, type="h", col = "gray", xlim = plotXLim, ...) abline(v=mean) abline(v = mean - binomSD, lty = "dashed", col = "blue") abline(v = mean + binomSD, lty = "dashed", col = "blue") } par(mfrow=c(2,3)) for (n in c(10, 100, 1000, 1000, 10000, 10000)) { BinomPlot(prob = 0.4, size = n) x <- 0:n ## Draw a normal curve above the binomial lines(x, dnorm(x, mean = p * n, sd = sqrt(p * (1 - p ) * n)), type = "l", col = "red") } par(mfrow=c(1,1)) ``` ## Before finishing: keep a trace of your session! Traceability and reproducibility of the analyses are crucial issues for sciences. The ***R*** function `sessionInfo()` summarises the software environment used for the current a working session: R version, operating system, loaded libraries/packages and their versions. ```{r} sessionInfo() ``` *******