{ "cells": [ { "cell_type": "markdown", "metadata": { "toc": "true" }, "source": [ "# Table of Contents\n", "

1  Reproducible Research
1.1  Non-reproducible research
1.2  Why reproducible research
1.3  How to be reproducible in statistics?
1.4  Tools for reproducible research
1.5  Version control using Git
1.5.1  Collaborative research.
1.5.2  Why version control?
1.5.3  Available version control tools
1.5.4  Why Git?
1.5.5  What do I need to use Git?
1.5.6  Basic workflow of Git
1.5.7  Basic Git usage
1.6  Branching in Git
1.6.1  Sample sessions
1.6.2  Etiquettes of using Git and version control systems in general
1.7  Dynamic document using IJulia notebook
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Reproducible Research" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> An article about computational science in a scientific publication is **not** the scholarship itself, it is merely **advertising** of the scholarship. The actual scholarship is the complete software development environment and the complete set of instructions which generated the figures.\n", "> \n", "> -- Buckheit and Donoho (1995)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Non-reproducible research\n", "\n", "1. Duke Potti Scandal\n", "\n", " \n", "\n", " * Potti et al (2006) Genomic signatures to guide the use of chemotherapeutics, [Nature Medicine](http://www.nature.com/nm/journal/v12/n11/full/nm1491.html), 12(11):1294--1300. \n", "\n", " * Baggerly and Coombes (2009) Deriving chemosensitivity from cell lines: Forensic bioinformatics and reproducible research in high-throughput biology, [Ann. Appl. Stat.](https://projecteuclid.org/euclid.aoas/1267453942), 3(4):1309--1334. \n", "\n", " * More information:\n", " * [Wiki page](http://en.wikipedia.org/wiki/Anil_Potti)\n", " * [Simply Statistics Blog: The Duke Saga Starter Set](http://simplystatistics.org/2012/02/27/the-duke-saga-starter-set/)\n", "\n", "2. Nature Genetics (2015 Impact Factor: 31.616). 20 articles about microarray profiling published in _Nature Genetics_ between Jan 2005 and Dec 2006.\n", "\n", " \n", " \n", "\n", "3. Bible code.\n", "\n", " \n", " \n", " \n", "\n", " * Witztum, Rips, and Rosenberg (1994) Equidistant letter sequences in the book of genesis. [Statist. Sci.](http://projecteuclid.org/euclid.ss/1177010393), 9(3):429-438. \n", "\n", " * McKay, Bar-Natan, Bar-Hillel, and Kalai (1999) Solving the Bible code puzzle, [Statist. Sci.](https://www.math.washington.edu/~greenber/BibleCode.html), 14(2):150-173." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Why reproducible research\n", "\n", "0. 
Replicability is a foundation of science. It helps accumulate scientific knowledge.\n", "\n", "0. Greater research impact.\n", "\n", "0. Better work habits boost the quality of research.\n", "\n", "0. Better teamwork. For **you** (graduate students), it means better communication with your advisor. \n", "```julia\n", "while true \n", " Stud: \"that idea you told me to try - it doesn't work!\" \n", " Prof: \"ok. how about trying this instead.\"\n", "end\n", "```\n", "Unless you reproduce the computing environment (algorithms, dataset, tuning parameters), there's no way the professor can help you." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## How to be reproducible in statistics?\n", "\n", "> When we publish articles containing figures which were generated by computer, we also publish the complete software environment which generates the figures.\n", "> \n", "> -- Buckheit and Donoho (1995)\n", "\n", "A good example: [http://stanford.edu/~boyd/papers/admm_distr_stats.html](http://stanford.edu/~boyd/papers/admm_distr_stats.html)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Tools for reproducible research\n", "\n", "* Version control: Git. \n", "* Distributing research, e.g., Julia or R packages: GitHub, Bitbucket. \n", "* Dynamic documents: IJulia for Julia or RMarkdown for R. \n", "* Docker containers for reproducing a computing environment. \n", "* Cloud computing tools.\n", "\n", "We are going to practice reproducible research **now**. That is, make your homework reproducible using Git/GitHub and IJulia." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Version control using Git\n", "\n", "> If it's not in source control, it doesn't exist.\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Collaborative research. \n", "\n", "Statisticians, as opposed to _closet mathematicians_, rarely do things in a vacuum. \n", "* We talk to scientists/clients about their data and questions. 
\n", "* We write code (a lot!) together with team members or coauthors. \n", "* We run code/programs on different platforms. \n", "* We write manuscripts/reports with coauthors. \n", "* We distribute software so potential users have access to our methods. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Why version control?\n", "\n", "* A centralized repository helps coordinate multi-person projects. \n", "* Time machine. Keep track of all the changes and revert easily (reproducible). \n", "* Storage efficiency. \n", "* Synchronize files across multiple computers and platforms. \n", "* [github.com](https://github.com) is becoming a _de facto_ central repository for open source development. \n", "E.g., all packages in Julia are distributed through github.com. \n", "* Advertise yourself through github.com.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Available version control tools\n", "\n", "* Open source: **Git**, Subversion (aka svn), CVS, Mercurial, ...\n", "* Proprietary: Visual SourceSafe (VSS), ...\n", "* Dropbox? Mostly for file backup and sharing, limited version control (1 month?), ...\n", "\n", "We use Git in this course." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Why Git?\n", "\n", "\n", "* As of 2016, Git is the most popular version control system. \n", "[https://rhodecode.com/insights/version-control-systems-2016](https://rhodecode.com/insights/version-control-systems-2016)\n", "\n", " \n", "\n", "* History: Initially designed and developed by [Linus Torvalds](http://en.wikipedia.org/wiki/Linus_Torvalds#The_Linus.2FLinux_connection) in 2005 for Linux kernel development. \n", "_git_ is British English slang for an _unpleasant person_. \n", "\n", "> I'm an egotistical bastard, and I name all my projects after myself. First 'Linux', now 'git'.\n", ">\n", "> -- Linus Torvalds\n", "\n", "* svn: **centralized** version control system. 
\n", "\n", "Git: **distributed** version control system.\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### What do I need to use Git?\n", "\n", "* A **Git server** enabling multi-person collaboration through a centralized repository.\n", " - [github.com](https://github.com): unlimited public repositories, private repositories cost $, academic users can get unlimited private repositories from the [Student Developer Pack](https://education.github.com/pack) \n", " - [bitbucket.org](https://bitbucket.org): unlimited public repositories, unlimited private repositories for academic accounts (register for free using your UCLA email) \n", " - We use [github.com](https://github.com) in this course for developing and submitting homework \n", "\n", "* **Git client** on your own machine.\n", " - Linux: shipped with many Linux distributions, e.g., Ubuntu. If not, install using a package manager, e.g., `yum install git` on CentOS \n", " - Mac: install by `port install git` or other package managers \n", " - Windows: GitHub for Windows (GUI), TortoiseGit (is this good?) \n", " \n", "Don't rely entirely on a GUI or IDE. Learn to use Git on the command line, which is needed for cluster and cloud computing." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Basic workflow of Git\n", "\n", "\n", "\n", "* Synchronize the local Git directory with the remote repository (`git pull` = `git fetch` + `git merge`). \n", "* Modify files in the local working directory. \n", "* Add snapshots of them to the staging area (`git add`). \n", "* Commit: store snapshots permanently to the (local) Git repository (`git commit`). \n", "* Push commits to the remote repository (`git push`). " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Basic Git usage\n", "\n", "0. Register for an account on a Git server, e.g., [github.com](https://github.com). \n", "\n", "0. Upload your SSH public key to the server.\n", "\n", "0. 
Identify yourself on your local machine, e.g., \n", "```\n", "git config --global user.name \"Hua Zhou\"\n", "git config --global user.email \"huazhou@ucla.edu\"\n", "```\n", "Name and email appear in each commit you make.\n", "\n", "0. Initialize a project: \n", " - Create a repository `biostat-m280-2018-spring` on the server. \n", " - Clone the repository to your local machine, e.g.,\n", " ```bash\n", " git clone git@github.com:Hua-Zhou/biostat-m280-2018-spring.git\n", " ```\n", " Now you have a local repo of the project.\n", " \n", "0. Working with your local copy.\n", " - `git pull`: update local Git repository with remote repository (fetch + merge) \n", " - `git status`: display the current status of the working directory \n", " - `git log filename`: display the commit history of a file \n", " - `git diff`: show unstaged changes, i.e., differences between the working directory and the staging area (use `git diff HEAD` to compare with the most recent commit) \n", " - `git add file1 file2 ...`: add file(s) to the staging area \n", " - `git commit`: commit changes in the staging area to the Git directory \n", " - `git push`: publish commits in the local Git repository to the remote repository \n", " - `git reset --soft HEAD~1`: undo the last commit (its changes stay in the staging area) \n", " - `git checkout filename`: restore the file to its last staged or committed version, **discarding** all unstaged changes made \n", " - `git rm`: remove files from the working directory and from Git control " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Branching in Git\n", "\n", "* Branching in Git. \n", "\n", "\n", "* For this course, you need to have two branches: \n", " - `develop` for your own development\n", " - `master` for releases (homework submission). 
Note that `master` is the default branch when you initialize the project; create and switch to the `develop` branch immediately after project initialization.\n", "\n", "\n", "* Commonly used commands: \n", " - `git branch branchname`: create a branch \n", " - `git branch`: list local branches (the current one is marked with `*`) \n", " - `git checkout branchname`: switch to a branch \n", " - `git tag`: show tags (major landmarks)\n", " - `git tag tagname`: create a tag" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Sample sessions\n", "\n", "* Clone the project and create a `develop` branch, where you write your solution for HW1. \n", "```bash\n", "# clone the project\n", "git clone git@github.com:UCLA-BIOSTAT-M280-2017-Spring/biostat-m280-2017-HuaZhou.git\n", "# enter project folder\n", "cd biostat-m280-2017-HuaZhou\n", "# what branches are there?\n", "git branch\n", "# create develop branch\n", "git branch develop\n", "# switch to the develop branch\n", "git checkout develop\n", "# create folder for HW1\n", "mkdir hw1\n", "cd hw1\n", "# let's write some code\n", "echo \"x = 1\" > code.jl\n", "echo \"some bug\" >> code.jl\n", "# commit the code\n", "git add code.jl\n", "git commit -m \"famous x = 1 function\"\n", "# push to remote repo; the first push of a new branch must set its upstream\n", "git push --set-upstream origin develop\n", "```\n", "\n", "* Submit the HW1 solution to the `master` branch and tag it. \n", "```bash\n", "# which branch are we on?\n", "git branch\n", "# change to the master branch\n", "git checkout master\n", "# merge the remote develop branch into the master branch\n", "git pull origin develop \n", "# push to the remote master branch\n", "git push\n", "# tag version hw1\n", "git tag hw1\n", "git push --tags\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Etiquettes of using Git and version control systems in general\n", "\n", "* Be judicious about what to put in the repository. \n", " - Not too little: Make sure you and your collaborators can reproduce everything on other machines \n", " - Not too much: No need to put all intermediate files in the repository. 
Make good use of the `.gitignore` file\n", " \n", "* Strictly speaking, a version control system is for source files only. E.g., only `xxx.tex`, `xxx.bib`, and figure files are necessary to produce a PDF file. The PDF file doesn't need to be version controlled or, if version controlled, doesn't need to be frequently committed.\n", "\n", "* \n", "> Commit early, commit often and don't spare the horses.\n", "\n", "* Adding an informative message when you commit is **not** optional. Spending one minute on a commit message saves hours later for you and your collaborators. Read the following sentence to yourself 3 times:\n", "> Write every commit message like the next person who reads it is an axe-wielding maniac who knows where you live." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Dynamic document using IJulia notebook\n", "\n", "* IPython notebook is a powerful tool for authoring dynamic documents, which combine code, formatted text, math, and multimedia in a single document. \n", "\n", "* [Jupyter](http://jupyter.org) is the current development that encompasses multiple languages including **Ju**lia, **Pyt**hon, and **R**. \n", "\n", "* Julia uses the Jupyter notebook through the [IJulia.jl](https://github.com/JuliaLang/IJulia.jl) package.\n", "\n", "* In this course, you are required to write your homework reports using IJulia.\n", "\n", "* For each homework, you need to submit your IJulia notebook (e.g., `hw1.ipynb`) and its HTML export (e.g., `hw1.html`), along with all code and data that are necessary to reproduce the results.\n", "\n", "* You can start with the Jupyter notebook for the lectures. 
" ] } ], "metadata": { "kernelspec": { "display_name": "Julia 0.6.2", "language": "julia", "name": "julia-0.6" }, "language_info": { "file_extension": ".jl", "mimetype": "application/julia", "name": "julia", "version": "0.6.2" }, "toc": { "colors": { "hover_highlight": "#DAA520", "running_highlight": "#FF0000", "selected_highlight": "#FFD700" }, "moveMenuLeft": true, "nav_menu": { "height": "311px", "width": "252px" }, "navigate_menu": true, "number_sections": true, "sideBar": true, "threshold": 4, "toc_cell": true, "toc_section_display": "block", "toc_window_display": true, "widenNotebook": false } }, "nbformat": 4, "nbformat_minor": 2 }