---
title: "Reproducible Research"
subtitle: "Biostat 203B"
author: "Dr. Hua Zhou @ UCLA"
date: today
format:
  html:
    theme: cosmo
    embed-resources: true
    number-sections: true
    toc: true
    toc-depth: 4
    toc-location: left
    code-fold: false
---

## Reproducible research in statistics/data science

> An article about computational science in a scientific publication is **not** the scholarship itself, it is merely **advertising** of the scholarship. The actual scholarship is the complete software development environment and the complete set of instructions which generated the figures.
>
> [Buckheit and Donoho (1995)](https://link.springer.com/chapter/10.1007/978-1-4612-2544-7_5)

## Non-reproducible research

### [Duke Potti scandal](https://en.wikipedia.org/wiki/Anil_Potti)

- [Potti, Dressman, Bild, and Riedel (2006)](https://www.nature.com/articles/nm1491)
- [Baggerly, K. A., & Coombes, K. R. (2009)](https://projecteuclid.org/euclid.aoas/1267453942)
- [Simply Statistics Blog: The Duke Saga Starter Set](https://simplystatistics.org/posts/2012-02-27-the-duke-saga-starter-set/)

### Microarray studies

Nature Genetics (2015 Impact Factor: 31.616). 20 articles about microarray profiling published in Nature Genetics between Jan 2005 and Dec 2006.

### Bible code

::: {layout="[[1,1], [1]]" layout-valign="bottom"}
![](biblecode.jpg)

![](biblecode_statsci.png)

![](trump_bible_code.jpg)
:::

- [Witztum, D., Rips, E., & Rosenberg, Y. (1994)](https://doi.org/10.1214/ss/1177010393)
- [McKay, B., Bar-Natan, D., Bar-Hillel, M., & Kalai, G. (1999)](https://doi.org/10.1214/ss/1009212243)

## Why reproducible research

- Reproducibility has been the foundation of science. It helps accumulate scientific knowledge.
- Greater research impact.
- Better work habits boost the quality of research.
- Better teamwork. For **you** as graduate students, it means better communication with your advisor.

```{r}
#| eval: false
while true
  Student: "that idea you told me to try - it doesn't work!"
  Professor: "ok. how about trying this instead."
end
```

Unless you reproduce the computing environment (algorithms, dataset, tuning parameters), others cannot help you.

## How to be reproducible in data science?

> When we publish articles containing figures which were generated by computer, we also publish the complete software environment which generates the figures.
>
> [Buckheit and Donoho (1995)](https://link.springer.com/chapter/10.1007/978-1-4612-2544-7_5)

- I highly recommend the book _Reproducible Research with R and RStudio_ by Christopher Gandrud. In HW1, you are going to reproduce the whole book.

    - [Amazon](https://www.amazon.com/Reproducible-Research-Studio-Chapman-Hall/dp/1466572841)
    - [GitHub repo](https://github.com/christophergandrud/Rep-Res-Book)

## Tools for reproducible research

- Version control: Git+GitHub.
- Distribute method implementations, e.g., R/Python/Julia packages, on GitHub or Bitbucket.
- Dynamic documents: RMarkdown for R, [Jupyter](http://jupyter.org) for Julia/Python/R, Quarto.
- Docker containers for reproducing a computing environment.
- Cloud computing tools.

We are going to practice reproducible research **now**: you will make your homework reproducible using Git, GitHub, and Quarto/RMarkdown.
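
As a small illustration of what "reproducing the computing environment" can involve, the sketch below records the pieces named earlier (dataset, tuning parameters) alongside the software version in a single manifest that can be committed to Git next to the analysis. It is only a minimal sketch under stated assumptions: the function names `file_sha256` and `build_manifest`, the file name `demo.csv`, and the parameter names `lambda` and `max_iter` are all hypothetical and not part of the course materials.

```python
import hashlib
import json
import os
import platform
import tempfile


def file_sha256(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(dataset_path, tuning_params, seed):
    """Collect details another person would need to rerun the analysis."""
    return {
        "python_version": platform.python_version(),
        "platform": platform.platform(),
        # A checksum identifies the exact dataset without storing it here.
        "dataset_sha256": file_sha256(dataset_path),
        "tuning_params": tuning_params,
        "seed": seed,
    }


# Demo on a throwaway dataset so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as workdir:
    dataset_path = os.path.join(workdir, "demo.csv")  # hypothetical file name
    with open(dataset_path, "wb") as f:
        f.write(b"x,y\n1,2\n3,4\n")
    # Hypothetical tuning parameters and seed, for illustration only.
    manifest = build_manifest(dataset_path, {"lambda": 0.1, "max_iter": 100}, seed=203)
    manifest_json = json.dumps(manifest, indent=2, sort_keys=True)

print(manifest_json)
```

A manifest like this does not replace a Docker container or version control; it is just a lightweight record that lets an advisor or collaborator check that they are running the same data and settings before comparing results.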