{ "cells": [ { "cell_type": "markdown", "metadata": { "toc": "true" }, "source": [ "# Table of Contents\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Reproducible Research" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> An article about computational science in a scientific publication is **not** the scholarship itself, it is merely **advertising** of the scholarship. The actual scholarship is the complete software development environment and the complete set of instructions which generated the figures.\n", "> \n", "> -- Buckheit and Donoho (1995)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Non-reproducible research\n", "\n", "1. Duke Potti Scandal\n", "\n", " \n", "\n", " * Potti et al (2006) Genomic signatures to guide the use of chemotherapeutics, [Nature Medicine](http://www.nature.com/nm/journal/v12/n11/full/nm1491.html), 12(11):1294--1300. \n", "\n", " * Baggerly and Coombes (2009) Deriving chemosensitivity from cell lines: Forensic bioinformatics and reproducible research in high-throughput biology, [Ann. Appl. Stat.](https://projecteuclid.org/euclid.aoas/1267453942), 3(4):1309--1334. \n", "\n", " * More information:\n", " * [Wiki page](http://en.wikipedia.org/wiki/Anil_Potti)\n", " * [Simply Statistics Blog: The Duke Saga Starter Set](http://simplystatistics.org/2012/02/27/the-duke-saga-starter-set/)\n", "\n", "2. Nature Genetics (2015 Impact Factor: 31.616). 20 articles about microarray profiling published in _Nature Genetics_ between Jan 2005 and Dec 2006.\n", "\n", " \n", " \n", "\n", "3. Bible code.\n", "\n", " \n", " \n", " \n", "\n", " * Witztum, Rips, and Rosenberg (1994) Equidistant letter sequences in the book of genesis. [Statist. Sci.](http://projecteuclid.org/euclid.ss/1177010393), 9(3):429-438. \n", "\n", " * McKay, Bar-Natan, Bar-Hillel, and Kalai (1999) Solving the Bible code puzzle, [Statist. Sci.](https://www.math.washington.edu/~greenber/BibleCode.html), 14(2):150-173." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Why reproducible research\n", "\n", "0. 
Replicability is a foundation of science. It helps accumulate scientific knowledge.\n", "\n", "0. Greater research impact.\n", "\n", "0. Better work habits boost the quality of research.\n", "\n", "0. Better teamwork. For **you** (graduate students), it means better communication with your advisor. \n", "```julia\n", "while true \n", " Stud: \"that idea you told me to try - it doesn't work!\" \n", " Prof: \"ok. how about trying this instead.\"\n", "end\n", "```\n", "Unless you reproduce the computing environment (algorithms, dataset, tuning parameters), there is no way your professor can help you." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## How to be reproducible in statistics?\n", "\n", "> When we publish articles containing figures which were generated by computer, we also publish the complete software environment which generates the figures.\n", "> \n", "> -- Buckheit and Donoho (1995)\n", "\n", "A good example: [http://stanford.edu/~boyd/papers/admm_distr_stats.html](http://stanford.edu/~boyd/papers/admm_distr_stats.html)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Tools for reproducible research\n", "\n", "* Version control: Git. \n", "* Distributing research, e.g., Julia or R packages: GitHub, Bitbucket. \n", "* Dynamic documents: IJulia for Julia or RMarkdown for R. \n", "* Docker containers for reproducing a computing environment. \n", "* Cloud computing tools.\n", "\n", "We are going to practice reproducible research **now**. That is, you will make your homework reproducible using Git/GitHub and IJulia." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Version control using Git\n", "\n", "> If it's not in source control, it doesn't exist.\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Collaborative research. \n", "\n", "Statisticians, as opposed to _closet mathematicians_, rarely do things in a vacuum. \n", "* We talk to scientists/clients about their data and questions. 
\n", "* We write code (a lot!) together with team members or coauthors. \n", "* We run code/programs on different platforms. \n", "* We write manuscripts/reports with coauthors. \n", "* We distribute software so potential users have access to our methods. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Why version control?\n", "\n", "* A centralized repository helps coordinate multi-person projects. \n", "* Time machine. Keep track of all the changes and revert easily (reproducible). \n", "* Storage efficiency. \n", "* Synchronize files across multiple computers and platforms. \n", "* [github.com](https://github.com) is becoming a _de facto_ central repository for open source development. \n", "E.g., all packages in Julia are distributed through github.com. \n", "* Advertise yourself through github.com.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Available version control tools\n", "\n", "* Open source: **Git**, Subversion (aka svn), CVS, Mercurial, ...\n", "* Proprietary: Visual SourceSafe (VSS), ...\n", "* Dropbox? Mostly for file backup and sharing, limited version control (1 month?), ...\n", "\n", "We use Git in this course." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Why Git?\n", "\n", "\n", "* As of 2016, Git is the most popular version control system. \n", "[https://rhodecode.com/insights/version-control-systems-2016](https://rhodecode.com/insights/version-control-systems-2016)\n", "\n", " \n", "\n", "* History: Initially designed and developed by [Linus Torvalds](http://en.wikipedia.org/wiki/Linus_Torvalds#The_Linus.2FLinux_connection) in 2005 for Linux kernel development. \n", "_git_ is British English slang for an _unpleasant person_. \n", "\n", "> I'm an egotistical bastard, and I name all my projects after myself. First 'Linux', now 'git'.\n", ">\n", "> -- Linus Torvalds\n", "\n", "* svn: **centralized** version control system. 
\n", "\n", "* Git: **distributed** version control system.\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### What do I need to use Git?\n", "\n", "* A **Git server** enabling multi-person collaboration through a centralized repository.\n", " - [github.com](https://github.com): unlimited public repositories, private repositories cost $, academic users can get unlimited private repositories from the [Student Developer Pack](https://education.github.com/pack) \n", " - [bitbucket.org](https://bitbucket.org): unlimited public repositories, unlimited private repositories for academic accounts (register for free using your UCLA email) \n", " - We use [github.com](https://github.com) in this course for developing and submitting homework \n", "\n", "* **Git client** on your own machine.\n", " - Linux: shipped with many Linux distributions, e.g., Ubuntu. If not, install using a package manager, e.g., `yum install git` on CentOS \n", " - Mac: install by `port install git` or other package managers \n", " - Windows: GitHub for Windows (GUI), TortoiseGit (is this good?) \n", " \n", "Don't rely entirely on a GUI or IDE. Learn to use Git on the command line, which is needed for cluster and cloud computing." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Basic workflow of Git\n", "\n", "\n", "\n", "* Synchronize the local Git directory with the remote repository (`git pull` = `git fetch` + `git merge`). \n", "* Modify files in the local working directory. \n", "* Add snapshots of them to the staging area (`git add`). \n", "* Commit: store snapshots permanently to the (local) Git repository (`git commit`). \n", "* Push commits to the remote repository (`git push`). " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Basic Git usage\n", "\n", "0. Register for an account on a Git server, e.g., [github.com](https://github.com). \n", "\n", "0. Upload your SSH public key to the server.\n", "\n", "0. 
Identify yourself on your local machine, e.g., \n", "```bash\n", "git config --global user.name \"Hua Zhou\"\n", "git config --global user.email \"huazhou@ucla.edu\"\n", "```\n", "Name and email appear in each commit you make.\n", "\n", "0. Initialize a project: \n", " - Create a repository `biostat-m280-2018-spring` on the server. \n", " - Clone the repository to your local machine, e.g.,\n", " ```bash\n", " git clone git@github.com:Hua-Zhou/biostat-m280-2018-spring.git\n", " ```\n", " Now you have a local repo of the project.\n", " \n", "0. Working with your local copy.\n", " - `git pull`: update local Git repository with remote repository (fetch + merge) \n", " - `git status`: display the current status of the working directory \n", " - `git log filename`: display the commit history of a file \n", " - `git diff`: show differences (by default, unstaged changes in the working directory relative to the staging area) \n", " - `git add file1 file2 ...`: add file(s) to the staging area \n", " - `git commit`: commit changes in staging area to Git directory \n", " - `git push`: publish commits in local Git repository to remote repository \n", " - `git reset --soft HEAD~1`: undo the last commit, keeping its changes staged \n", " - `git checkout filename`: restore the file to its last staged or committed version, **discarding** unstaged changes made to it \n", " - `git rm`: remove files from the working directory and from Git control " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Branching in Git\n", "\n", "* Branching in Git. \n", "\n", "\n", "* For this course, you need to have two branches: \n", " - `develop` for your own development\n", " - `master` for releases (homework submission). 
Note `master` is the default branch when you initialize the project; create and switch to the `develop` branch immediately after project initialization.\n", "\n", "\n", "* Commonly used commands: \n", " - `git branch branchname`: create a branch \n", " - `git branch`: list local branches \n", " - `git checkout branchname`: switch to a branch \n", " - `git tag`: show tags (major landmarks)\n", " - `git tag tagname`: create a tag" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Sample sessions\n", "\n", "* Clone the project and create a `develop` branch, where you write your solution for HW1. \n", "```bash\n", "# clone the project\n", "git clone git@github.com:UCLA-BIOSTAT-M280-2017-Spring/biostat-m280-2017-HuaZhou.git\n", "# enter project folder\n", "cd biostat-m280-2017-HuaZhou\n", "# what branches are there?\n", "git branch\n", "# create develop branch\n", "git branch develop\n", "# switch to the develop branch\n", "git checkout develop\n", "# create folder for HW1\n", "mkdir hw1\n", "cd hw1\n", "# let's write some code\n", "echo \"x = 1\" > code.jl\n", "echo \"some bug\" >> code.jl\n", "# commit the code\n", "git add code.jl\n", "git commit -m \"famous x = 1 function\"\n", "# push to remote repo; the first push of a new branch must set its upstream\n", "git push --set-upstream origin develop\n", "```\n", "\n", "* Merge the HW1 solution into the `master` branch, then tag and submit it. \n", "```bash\n", "# which branch are we in\n", "git branch\n", "# change to the master branch\n", "git checkout master\n", "# merge the remote develop branch into the master branch\n", "git pull origin develop \n", "# push to the remote master branch\n", "git push\n", "# tag version hw1\n", "git tag hw1\n", "git push --tags\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Etiquette of using Git and version control systems in general\n", "\n", "* Be judicious about what to put in the repository. \n", " - Not too little: Make sure collaborators or yourself can reproduce everything on other machines \n", " - Not too much: No need to put all intermediate files in the repository. 
Make good use of the `.gitignore` file\n", " \n", "* Strictly speaking, a version control system is for source files only. E.g., only `xxx.tex`, `xxx.bib`, and figure files are necessary to produce a PDF file. The PDF file doesn't need to be version controlled or, if version controlled, doesn't need to be frequently committed.\n", "\n", "* \n", "> Commit early, commit often and don't spare the horses.\n", "\n", "* Adding an informative message when you commit is **not** optional. Spending one minute on a commit message saves hours later for your collaborators and yourself. Read the following sentence to yourself 3 times:\n", "> Write every commit message like the next person who reads it is an axe-wielding maniac who knows where you live." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Dynamic document using IJulia notebook\n", "\n", "* IPython notebook is a powerful tool for authoring dynamic documents, which combine code, formatted text, math, and multimedia in a single document. \n", "\n", "* [Jupyter](http://jupyter.org) is the current development that encompasses multiple languages including **Ju**lia, **Pyt**hon, and **R**. \n", "\n", "* Julia uses Jupyter notebook through the [IJulia.jl](https://github.com/JuliaLang/IJulia.jl) package.\n", "\n", "* In this course, you are required to write your homework reports using IJulia.\n", "\n", "* For each homework, you need to submit your IJulia notebook (e.g., `hw1.ipynb`), html (e.g., `hw1.html`), along with all code and data that are necessary to reproduce the results.\n", "\n", "* You can start with the Jupyter notebook for the lectures. 
" ] } ], "metadata": { "kernelspec": { "display_name": "Julia 0.6.2", "language": "julia", "name": "julia-0.6" }, "language_info": { "file_extension": ".jl", "mimetype": "application/julia", "name": "julia", "version": "0.6.2" }, "toc": { "colors": { "hover_highlight": "#DAA520", "running_highlight": "#FF0000", "selected_highlight": "#FFD700" }, "moveMenuLeft": true, "nav_menu": { "height": "311px", "width": "252px" }, "navigate_menu": true, "number_sections": true, "sideBar": true, "threshold": 4, "toc_cell": true, "toc_section_display": "block", "toc_window_display": true, "widenNotebook": false } }, "nbformat": 4, "nbformat_minor": 2 }