---
title: "Reproducible science using ![Rlogo](../../img/slides/Rlogo-small.png)"
author: Thibaut Jombart
date: "2019-11-19"
output: ioslides_presentation
---

```{r setup, include=FALSE}
## This code defines the 'verbatim' option for chunks
## which will include the chunk with its header and the
## trailing "```".

require(knitr)
hook_source_def = knit_hooks$get('source')
knit_hooks$set(source = function(x, options){
  if (!is.null(options$verbatim) && options$verbatim){
    opts = gsub(",\\s*verbatim\\s*=\\s*TRUE\\s*.*$", "", options$params.src)
    bef = sprintf('\n\n    ```{r %s}\n', opts, "\n")
    stringr::str_c(bef, paste(knitr:::indent_block(x, "    "), collapse = '\n'), "\n    ```\n")
  } else {
    hook_source_def(x, options)
  }
})
```

# On reproducibility

## What is reproducibility in science?

> - ability to reproduce results by a peer
> - requires data, methods, and procedures
> - increasingly, science is supposed to be reproducible

## Why does it not happen, in practice?

Some opinions on whether reproducibility is needed:

> - *Ideally, yes but we don't have time for this.*
> - *If it gets published, yes.*
> - *If it gets published, yes; unless it is in PLoS One...*
> - *No need: I work on my own.*
> - *For others to copy us? You crazy?!*
> - *No way! We rigged the data, the method does not work, and we ran the analyses in Excel.*

## Main obstacles to reproducibility {.columns-2}

> - lack of time: ultimately, reproducibility is faster
> - fear of plagiarism: low risks in practice
> - internal work, no need to share: almost never true
> - one good reason: lack of tools to facilitate reproducibility

## You never work alone

Be nice to your future selves!

## Two aspects of reproducibility using R

> - implementing methods as R packages
> - making transparent and reproducible analyses

# Reproducibility in practice

## Literate programming

> *Let us change our traditional attitude to the construction of programs: instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather on explaining to humans what we want the computer to do.* (Donald E. Knuth, Literate Programming, 1984)

## A data-centred approach to programming

## Literate programming in R

Current workflows use the following equation: **markdown** (`.md`) + R = **Rmarkdown** (`.Rmd`)

Example:
`knitr::knit2html("foo.Rmd")` $\rightarrow$ `foo.html`
`rmarkdown::render("foo.Rmd", "pdf_document")` $\rightarrow$ `foo.pdf`
`rmarkdown::render("foo.Rmd", "word_document")` $\rightarrow$ `foo.docx`
`...`

## **Rmarkdown**: R chunks in markdown {.smaller}

```{r chunk-title, ..., verbatim = TRUE, eval = FALSE}
a <- rnorm(1000)
hist(a, col = terrain.colors(15), border = "white", main = "Normal distribution")
```

results in:

```{r rmarkdown, out.width = "80%", fig.width = 12, echo = c(2,3)}
set.seed(1)
a <- rnorm(1000)
hist(a, col = terrain.colors(15), border = "white", main = "Normal distribution")
```

## Formatting outputs

```{r another-chunk-title, ..., verbatim = TRUE, eval = FALSE}
[some R code here]
```

where `...` are options for processing and formatting, e.g.:

- `eval` (`TRUE`/`FALSE`): evaluate code?
- `echo` (`TRUE`/`FALSE`): show code input?
- `results` (`"markup"/"hide"/"asis"`): show/format code output
- `message/warning/error`: show messages, warnings, errors?
- `cache` (`TRUE`/`FALSE`): cache analyses?
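The options listed above can also be given default values for a whole document rather than chunk by chunk. The snippet below is a minimal sketch of that idea, assuming `knitr` is installed. The particular values are illustrative choices, not recommendations from these slides.

```r
## Minimal sketch: set default chunk options for a whole document.
## Assumption: this would typically sit in a setup chunk at the top of the .Rmd.
library(knitr)

opts_chunk$set(
  echo = TRUE,      # show code input
  message = FALSE,  # hide messages
  warning = FALSE,  # hide warnings
  cache = FALSE     # do not cache analyses
)

## Inspect the defaults currently in use
opts_chunk$get(c("echo", "message", "warning", "cache"))
```

Options given in an individual chunk header still override these defaults for that chunk.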
See [http://yihui.name/knitr/options](http://yihui.name/knitr/options) for details on all options.

## One format, several outputs

**`rmarkdown`** can generate different types of documents:

- standardised reports (`html`, `pdf`)
- journal articles, using the `rticles` package (`.pdf`)
- Tufte handouts (`.pdf`)
- Word documents (`.docx`)
- slides for presentations (`html`, `pdf`)
- ...

See: [http://rmarkdown.rstudio.com/gallery.html](http://rmarkdown.rstudio.com/gallery.html).

## **`rmarkdown`**: toy example 1/2 {.smaller}

Let us consider the file `foo.Rmd`:


    ---
    title: "A toy example of rmarkdown"
    author: "John Snow"
    date: "`r Sys.Date()`"
    output: html_document
    ---

    This is some nice R code:

```{r rnorm-example, verbatim = TRUE, eval = FALSE, echo = 2:4}
set.seed(1)
x <- rnorm(100)
x[1:6]
hist(x, col = "grey", border = "white")
```

## **`rmarkdown`**: toy example 2/2 {.smaller}

```{r toy-rmd, eval = FALSE}
rmarkdown::render("foo.Rmd")
```
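To try the rendering step without creating files in a project, one option is to write a small `.Rmd` to a temporary folder and render it from there. This is only a sketch: the file content, the use of `tempdir()` and the check for pandoc are assumptions made for illustration.

```r
## Minimal sketch: create a small .Rmd in a temporary folder and render it.
## Assumptions: file content and location are illustrative only.
fence <- strrep("`", 3)  # chunk delimiter, built here to avoid nesting fences
rmd_file <- file.path(tempdir(), "foo.Rmd")

writeLines(c(
  "---",
  "title: \"A toy example of rmarkdown\"",
  "output: html_document",
  "---",
  "",
  "This is some nice R code:",
  "",
  paste0(fence, "{r rnorm-example}"),
  "set.seed(1)",
  "x <- rnorm(100)",
  "x[1:6]",
  fence
), rmd_file)

## Render only if rmarkdown and pandoc are available on this machine
if (requireNamespace("rmarkdown", quietly = TRUE) &&
    rmarkdown::pandoc_available()) {
  out_file <- rmarkdown::render(rmd_file,
                                output_format = "html_document",
                                quiet = TRUE)
  out_file  # path to the generated HTML file
}
```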

# Good practices

## **`rmarkdown`** is just the beginning {.columns-2}

Even with **`rmarkdown`**, you can still:

> - alter your original data
> - have a messy project
> - write non-portable code
> - write horrible code
> - lose work permanently

## How to treat your original data

> - **do not touch your original data**
> - save it as read-only
> - make copies: you can play with these
> - track the changes made to the original data

## How to avoid messy projects

> - **1 project = 1 folder**
> - subfolders for: data, analyses, figures, manuscripts, ...
> - document the project using a `README` file
> - use RStudio projects (if you use RStudio)

## How to write portable code?

> - avoid absolute paths, e.g.:
`my_file <- "C:\\project1\\data\\data.csv"`
> - use the package `here` for portable paths, e.g.:
`my_file <- here("data/data.csv")`
> - avoid special characters and spaces in all names, e.g.:
`éèçêäÏ*%~!?&`
> - assume case sensitivity:
`FooBar` $\neq$ `foobar` $\neq$ `FOOBAR`

## How to write better code?

> - name things explicitly
> - settle for one naming convention; `snake_case` is currently recommended for R packages
> - document your code using comments (`##`)
> - write simple code, in short sections
> - use current coding standards -- see the `lintr` package

## Example of `lintr`
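A minimal sketch of checking a script against the default linters, assuming the `lintr` package is installed. The untidy script is a made-up example written to a temporary file.

```r
## Minimal sketch: run lintr on a deliberately untidy script.
## Assumption: the script content below is illustrative only.
script <- file.path(tempdir(), "untidy.R")
writeLines(c(
  "myVariable=c(1,2,3)",
  "if(sum(myVariable)>5) print('big')"
), script)

## Run the default linters, if lintr is installed
if (requireNamespace("lintr", quietly = TRUE)) {
  lints <- lintr::lint(script)
  print(lints)  # one message per style issue found
}
```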

source: [https://github.com/jimhester/lintr](https://github.com/jimhester/lintr)

## Structuring analysis reports: question-driven report

> - organised by questions / analysis topics
> - pros: better narrative
> - cons: harder code to follow / review

## Structuring analysis reports: code-driven report

> - organised by type of code
> - pros: easier to read / review code
> - cons: narrative harder to follow

## Structuring analysis reports: hybrid report

> - differentiates **infrastructure** *vs* **analysis** code
> - makes question-specific code *simple*, and *repetitive*
> - pros: narrative and code easier to read
> - cons: harder to design (need frequent re-factoring)
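One way to picture the split between infrastructure and analysis code is sketched below. The helper `summarise_by()` is hypothetical (it is not part of any package), and the built-in `mtcars` dataset stands in for real data.

```r
## Infrastructure code: written once, kept apart from the questions.
## `summarise_by()` is a hypothetical helper used for illustration.
summarise_by <- function(data, value, group) {
  out <- aggregate(data[[value]], by = list(data[[group]]), FUN = mean)
  names(out) <- c(group, paste0("mean_", value))
  out
}

## Analysis code: one short, similar-looking call per question.

## Question 1: does fuel consumption differ by number of cylinders?
mpg_by_cyl <- summarise_by(mtcars, "mpg", "cyl")
mpg_by_cyl

## Question 2: does horsepower differ by number of gears?
hp_by_gear <- summarise_by(mtcars, "hp", "gear")
hp_by_gear
```

The question-specific sections stay short and repetitive, while changes to how summaries are computed are made in a single place.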

## Do not lose your work!

Because you never know what can happen...

## How to avoid losing work?

> - **never rely on a single computer** to store your work
> - backups are good, syncing with a server is better (e.g. Dropbox)
> - use version numbers to track progress
> - use `reportfactory` for repeated analysis updates
> - use version control systems (e.g. Git) for serious coding projects

## Going further

> - check our golden rules for writing analysis reports
> - use report factory templates as starting points
> - use R4epis templates as starting points