---
title: "Reproducible Research"
subtitle: "Biostat 203B"
author: "Dr. Hua Zhou @ UCLA"
date: "2023-01-17"
format:
  html:
    theme: cosmo
    number-sections: true
    toc: true
    toc-depth: 4
    toc-location: left
    code-fold: false
bibliography: "../bib-HZ.bib"
csl: "../apa.csl"
---
## Reproducible research in statistics/data science
> An article about computational science in a scientific publication is **not** the scholarship itself, it is merely **advertising** of the scholarship. The actual scholarship is the complete software development environment and the complete set of instructions which generated the figures.
>
> @BuckheitDonoho95ReproRes
## Non-reproducible research
### [Duke Potti scandal](https://en.wikipedia.org/wiki/Anil_Potti)
- @Potti06GenomeSignature:
- @BaggerlyCoombes09:
- [Simply Statistics Blog: The Duke Saga Starter Set](https://simplystatistics.org/posts/2012-02-27-the-duke-saga-starter-set/)
### Microarray studies
Nature Genetics (2015 Impact Factor: 31.616). 20 articles about microarray profiling published in Nature Genetics between Jan 2005 and Dec 2006.
### Bible code
- @WitztumRipsRosenberg94BibleCode
- @McKayBarNatanBarHillelKalai99BibleCode
## Why reproducible research
- Reproducibility has been the foundation of science. It helps accumulate scientific knowledge.
- Greater research impact.
- Better work habits boost the quality of research.
- Better teamwork. For **you** as graduate students, it means better communication with your advisor.
```{r}
#| eval: false
# A conversation stuck in an endless loop
while (TRUE) {
  message('Student: "that idea you told me to try - it doesn\'t work!"')
  message('Professor: "ok. how about trying this instead."')
}
```
Unless you reproduce the computing environment (algorithms, dataset, tuning parameters), others cannot help you.
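
As a minimal sketch of what capturing the computing environment might look like in R, the chunk below stores the seed, the tuning parameters, and the session information next to a result. The function name `run_analysis`, the simulated data, and the parameter `n_boot` are hypothetical choices made for illustration; they are not part of the course material.

```{r}
#| eval: false
# Hypothetical analysis: bootstrap standard error of a sample mean.
# `run_analysis`, `n_boot`, and the simulated data are illustrative assumptions.
run_analysis <- function(seed = 203, n = 100, n_boot = 500) {
  set.seed(seed)                      # fix the random number stream
  x <- rnorm(n)                       # stand-in for a real dataset
  boot_means <- replicate(n_boot, mean(sample(x, replace = TRUE)))
  list(
    estimate = mean(x),
    boot_se  = sd(boot_means),
    params   = list(seed = seed, n = n, n_boot = n_boot),  # tuning parameters
    session  = sessionInfo()          # R version, platform, attached packages
  )
}

res1 <- run_analysis()
res2 <- run_analysis()

# With the same seed and parameters, both runs should agree on the same setup.
identical(res1$estimate, res2$estimate)
identical(res1$boot_se, res2$boot_se)
```

Sharing `res1$params` and `res1$session` together with the code gives an advisor or collaborator what they need to rerun the analysis rather than guess at it.
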
## How to be reproducible in data science?
> When we publish articles containing figures which were generated by computer, we also publish the complete software environment which generates the figures.
>
> @BuckheitDonoho95ReproRes
- A good example:
- I **highly** recommend the book _Reproducible Research with R and RStudio_ by Christopher Gandrud.
    - [Amazon](https://www.amazon.com/Reproducible-Research-Studio-Chapman-Hall/dp/1466572841)
    - [GitHub repo](https://github.com/christophergandrud/Rep-Res-Book)
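
In the spirit of the Buckheit and Donoho quote, one way to keep a figure reproducible is to generate it entirely from a script. The following is a minimal sketch using base R graphics; the simulated data, the seed, and the output file name are assumptions made for illustration.

```{r}
#| eval: false
# Regenerate a figure from scratch; the data and file name are illustrative.
set.seed(2023)                                  # make the simulated data repeatable
x <- seq(0, 1, length.out = 50)
y <- 2 * x + rnorm(length(x), sd = 0.2)         # stand-in for real measurements
fit <- lm(y ~ x)                                # least squares fit

fig_path <- file.path(tempdir(), "fit.pdf")     # hypothetical output location
pdf(fig_path, width = 6, height = 4)
plot(x, y, main = "Simulated data with least squares fit")
abline(fit, col = "red")
invisible(dev.off())                            # close the device to write the file
```

Because the data, the model fit, and the plotting commands all live in one script, anyone with the script can rebuild the figure without manual steps.
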
## Tools for reproducible research
- Version control: Git + GitHub.
- Distribute method implementations, e.g., R/Python/Julia packages, on GitHub or Bitbucket.
- Dynamic documents: R Markdown for R, [Jupyter](http://jupyter.org) for Julia/Python/R, Quarto.
- Docker containers for reproducing a computing environment.
- Cloud computing tools.

We are going to practice reproducible research **now**. That is, you will make your homework reproducible using Git, GitHub, and Quarto/R Markdown.
## References