---
title: "Reproducible Research"
subtitle: "Biostat 203B"
author: "Dr. Hua Zhou @ UCLA"
date: today
format:
  html:
    theme: cosmo
    embed-resources: true
    number-sections: true
    toc: true
    toc-depth: 4
    toc-location: left
    code-fold: false
---
## Reproducible research in statistics/data science
> An article about computational science in a scientific publication is **not** the scholarship itself, it is merely **advertising** of the scholarship. The actual scholarship is the complete software development environment and the complete set of instructions which generated the figures.
>
> [Buckheit and Donoho (1995)](https://link.springer.com/chapter/10.1007/978-1-4612-2544-7_5)
## Non-reproducible research
### [Duke Potti scandal](https://en.wikipedia.org/wiki/Anil_Potti)
- [Potti, Dressman, Bild, and Riedel (2006)](https://www.nature.com/articles/nm1491)
- [Baggerly, K. A., & Coombes, K. R. (2009)](https://projecteuclid.org/euclid.aoas/1267453942)
- [Simply Statistics Blog: The Duke Saga Starter Set](https://simplystatistics.org/posts/2012-02-27-the-duke-saga-starter-set/)
### Microarray studies
Nature Genetics (2015 Impact Factor: 31.616). 20 articles about microarray profiling published in Nature Genetics between Jan 2005 and Dec 2006.
### Bible code
- [Witztum, D., Rips, E., & Rosenberg, Y. (1994)](https://doi.org/10.1214/ss/1177010393)
- [McKay, B., Bar-Natan, D., Bar-Hillel, M., & Kalai, G. (1999)](https://doi.org/10.1214/ss/1009212243)
## Why reproducible research
- Reproducibility has been the foundation of science. It helps accumulate scientific knowledge.
- Greater research impact.
- Better work habits boost the quality of research.
- Better teamwork. For **you** as graduate students, it means better communication with your advisor.
```{r}
#| eval: false
# The advisor-student loop without reproducibility: it never terminates.
while (TRUE) {
  message('Student: "that idea you told me to try - it doesn\'t work!"')
  message('Professor: "ok. how about trying this instead."')
}
```
Unless you reproduce the computing environment (algorithms, dataset, tuning parameters), others cannot help you.
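
One way to make a run reproducible for someone else is to save the settings, the random seed, and the session information together with the result. The following is a minimal sketch, not a prescribed workflow: the parameter names (`seed`, `n`, `trim`), the simulated data, and the file name `run-record.rds` are all hypothetical.

```r
# Hypothetical analysis settings; names and values are illustrative only.
params <- list(seed = 203, n = 100, trim = 0.1)

set.seed(params$seed)                  # fix the random number stream
x <- rnorm(params$n)                   # simulated stand-in for a real dataset
est <- mean(x, trim = params$trim)     # trimmed mean; trim acts as a tuning parameter

# Bundle what another person would need to rerun and check the analysis.
record <- list(
  params  = params,
  result  = est,
  session = sessionInfo()              # R version, platform, attached packages
)

record_file <- file.path(tempdir(), "run-record.rds")
saveRDS(record, record_file)
```

Anyone who reads `run-record.rds` back with `readRDS()` can see which seed and tuning parameter produced the reported estimate and under which R version.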
## How to be reproducible in data science?
> When we publish articles containing figures which were generated by computer, we also publish the complete software environment which generates the figures.
>
> [Buckheit and Donoho (1995)](https://link.springer.com/chapter/10.1007/978-1-4612-2544-7_5)
- I highly recommend the book _Reproducible Research with R and RStudio_ by Christopher Gandrud. In HW1, you are going to reproduce the whole book.
    - [Amazon](https://www.amazon.com/Reproducible-Research-Studio-Chapman-Hall/dp/1466572841)
    - [GitHub repo](https://github.com/christophergandrud/Rep-Res-Book)
## Tools for reproducible research
- Version control: Git+GitHub.
- Distribute method implementations, e.g., R/Python/Julia packages, on GitHub or Bitbucket.
- Dynamic document: RMarkdown for R, [Jupyter](http://jupyter.org) for Julia/Python/R, Quarto.
- Docker container for reproducing a computing environment.
- Cloud computing tools.
We are going to practice reproducible research **now**: you will make your homework reproducible using Git, GitHub, and Quarto/RMarkdown.
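
As a rough illustration of what a reproducible homework file might look like, the sketch below writes a small Quarto document from R. The directory, the file name `hw1.qmd`, and the chunk contents are assumptions made up for this example, not the actual homework template.

```r
# Write a minimal Quarto homework file; names and contents are hypothetical.
hw_dir <- file.path(tempdir(), "hw1")
dir.create(hw_dir, showWarnings = FALSE)
hw_file <- file.path(hw_dir, "hw1.qmd")

fence <- strrep("`", 3)                # three backticks, built here to avoid nesting fences

writeLines(c(
  "---",
  "title: \"Homework 1\"",
  "format: html",
  "---",
  "",
  paste0(fence, "{r}"),
  "set.seed(203)      # fixed seed, so results do not change between renders",
  "mean(rnorm(100))",
  "sessionInfo()      # record the R version and package versions",
  fence
), hw_file)

# Possible next steps, run in a terminal inside a Git repository:
#   quarto render hw1.qmd
#   git add hw1.qmd
#   git commit -m "Add HW1"
#   git push
```

Keeping the seed and `sessionInfo()` inside the document means the rendered HTML carries its own record of how the numbers were produced, and the Git history records when each version was written.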