--- title: "Intro to Computational Biology - UNC BIOS/BCB 784" author: "[Michael Love](http://mikelove.github.io) (he/him)" output: html_document: self_contained: false --- ### Course information BIOS/BCB 784 Instructor details: - Assoc. Prof. Michael Love, love [at] unc [dot] edu Course time and place: - Fall 2026
8/17/2026-12/11/2025
Tue/Thur 12:30-1:45pm
Michael Hooker Research Center room 0003
Syllabus: - [2024 syllabus](BIOS784-Fall2024-Gillings-v.docx) Pre-reqs and expectations: - This course makes extensive use of R and assumes basic familiarity with base R (not packages) as a prerequisite. A self-quiz is available [here](https://github.com/biodatascience/compbio/blob/gh-pages/selfquiz/selfquiz.R), with answers provided [here](https://github.com/biodatascience/compbio/raw/gh-pages/selfquiz/selfquiz_answers.rda). You can also find a [list of base R functions](https://github.com/biodatascience/compbio_src/blob/master/eda/functions.md) that one should be familiar with. - This course also assumes basic familiarity with statistical concepts such as parameter inference, hypothesis testing, basic probability (e.g. conditional probability and conditional expectation). For BCB students, BCB 720, which covers statistical inference, is suitable as a pre-requisite for BIOS/BCB 784. For other students not sure about pre-requisites, email me with any questions with BIOS 784 in the subject line. ### Schedule and course notes For `Rmd` or `qmd` files, go to the [course repo](https://github.com/biodatascience/compbio_src) and navigate the directories, or clone the repo and navigate within RStudio. ```{r echo=FALSE} hw <- function(x) paste0("https://github.com/biodatascience/compbio_src/blob/master/",x,".Rmd") rscr <- function(x) paste0("https://github.com/biodatascience/compbio_src/blob/master/",x,".R") qrto <- function(x) paste0("https://github.com/biodatascience/compbio_src/blob/master/",x,".qmd") ``` | Week | Topic | Dir. | HW | HTML | Title | |:-----|:------|:-----|:---|:-----|:------| | | **Data analysis** | | | | | | Aug 19 | Biological intro / GitHub | `-` | [HW0](`r hw("github/github_HW")`) | [github](github.html) | RStudio, git, and GitHub | | Aug 21 | Exploring bio data | `eda` | | [EDA](eda/EDA.html) | Exploratory data analysis | | | | | | [brain RNA](eda/brain_RNA.html) | Exploring brain RNA | | Aug 26 & 28 | R/Bioconductor I | `bioc` | [HW1](`r hw("bioc/bioc_objects_ranges_HW")`) | [objects](bioc/objects.html) | Bioc data objects | | | | | | [ranges](bioc/ranges.html) | Genomic ranges | | | | | | [GRL](bioc/GRL.html) | GRangesList: lists of ranges | | Sep 2 | Labor day | | | | | | Sep 4 | "tidyomics" | `TO@GH` | | [tidy intro](https://tidyomics.github.io/tidy-intro-talk/) | Tidiness-in-Bioconductor intro | | | | | | [tidy ranges](https://tidyomics.github.io/tidy-ranges-tutorial/) | Tidy ranges tutorial | | Sep 9 & 11 | R/Bioconductor II | `bioc` | [HW2](`r hw("bioc/bioc_anno_strings_HW")`) | [anno](bioc/anno.html) | Accessing annotations | | | | | | [strings](bioc/strings.html) | Manipulating DNA strings | | Sep 16 & 18 | Distances & norm. I | `dist` | [HW3](`r hw("dist/dist_biases_scaling_HW")`) | [distances](dist/distances.html) | Distances in high dimensions | | | | | | [transform_clust](dist/transform_clust.html) | Transformations and clustering | | | | | | [vst_math](`r qrto("dist/vst_math")`) | VST math | | Sep 23 | Wellness day | | | | | | Sep 25 | Spatial data analysis (guest lecture) | | | | [Peyton Kuhlers](https://www.linkedin.com/in/peyton-kuhlers/) (Hoadley & Raab labs, UNC) | | Sep 30 | BCB retreat | | | | | | Oct 2 | Sources of technical bias | | | | | | Oct 7 & 9 | Distances & norm. II | `dist` | | [batch](dist/batch.html) | Batch effects and sources | | | | | | [sva](dist/sva.html) | Surrogate variable analysis | | | | | | | Batch effect solutions | | | | | | [ruv script](`r rscr("dist/ruv")`)| RUV and friends | | Oct 7 | Midterm assigned, due Oct 21 | | | | | | | **Data modeling** | | | | | | Oct 14 & 16 | Hierarchical models | `hier` | | [hierarchical](hier/hierarchical.html) | Hierarchical models | | | | | | [jamesstein](hier/jamesstein.html) | James-Stein estimator | | Oct 21 & 23 | Models and EM | `model` | | [EM](model/EM.html) | Expectation maximization | | | | | | [motif](model/motif.html) | EM for finding DNA motifs | | Oct 28 & 30 | Markov models | `markov` | [HW4](`r hw("model/model_HW")`) | [hmm](markov/hmm.html) | Hidden Markov Models | | Nov 4 | HMM: Baum-Welch | | | | | | Nov 6 | ASHG | | | | | | Nov 11 | Biostatistics 75th event | | | | | | Nov 13 | Class time for projects | | | | | | Nov 18 & 20 | Class time for projects | | | | | | Nov 25 | Gene regulatory networks | `net` | | [network](net/network.html) | Network analysis | | Nov 27 | Thanksgiving | | | | | | Dec 2 & 4 | Final project presentations | | | | | | Extra lectures | | | | | | | | multiple testing | `multiple` | | [multtest](multiple/multtest.html) | FDR and Benjamini-Hochberg | | | | | | [localfdr](multiple/localfdr.html) | Local false discovery rate | ### Reading list * **What is the role of the computational biologist / statistician?** - [All biology is computational biology](http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.2002050) Florian Markowetz - [Questions, Answers and Statistics](https://www.stat.berkeley.edu/~sandrine/Docs/TerrySelectedWorksSpringer/Version1/Nolan/Nolan.pdf) Deborah Nolan - [50 Years of Data Science](https://dl.dropboxusercontent.com/u/23421017/50YearsDataScience.pdf) David Donoho - [The Future of Data Analysis](https://projecteuclid.org/euclid.aoms/1177704711) John Tukey (this article, discussed by Donoho, is from 1962) - [Ten Simple Rules for Effective Statistical Practice](http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1004961) Kass, Caffo, Davidian, Meng, Yu, and Reid - [Statistical Modeling: The Two Cultures](https://projecteuclid.org/euclid.ss/1009213726) Leo Breiman * **Exploratory data analysis** - [dplyr - A Grammar of Data Manipulation](http://dplyr.tidyverse.org/) - [ggplot2](http://ggplot2.org/) - [The Genotype-Tissue Expression (GTEx) pilot analysis: Multitissue gene regulation in humans](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4547484/) * **Bioconductor** - [Orchestrating high-throughput genomic analysis with Bioconductor](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4509590/) Huber et al * **Distances and normalization** - [Differential expression analysis for sequence count data](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3218662/) Simon Anders and Wolfgang Huber - [Tackling the widespread and critical impact of batch effects in high-throughput data](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3880143/) Leek et al - [Capturing Heterogeneity in Gene Expression Studies by Surrogate Variable Analysis](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1994707/) Jeffrey Leek and John Storey - [Normalization of RNA-seq data using factor analysis of control genes or samples](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4404308/) Risso et al - [Using probabilistic estimation of expression residuals (PEER) to obtain increased power and interpretability of gene expression analyses](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3398141/) Stegle et al - **More on factor analysis methods:** - RUV - [Using control genes to correct for unwanted variation in microarray data](https://doi.org/10.1093/biostatistics/kxr034) Gagnon-Bartsch and Speed et al., 2012 - RUV - [Removing Unwanted Variation from High Dimensional Data with Negative Controls](https://doi.org/10.1038/s41587-022-01440-w) Gagnon-Bartsch et al., 2013 - RUVSeq - [Normalization of RNA-seq data using factor analysis of control genes or samples](https://doi.org/10.1038/nbt.2931) Risso et al., 2014 - ZINB-WaVE - [A general and flexible method for signal extraction from single-cell RNA-seq data](https://doi.org/10.1038/s41467-017-02554-5) Risso and Perraudeau et al., 2018 - NewWave - [A scalable R/Bioconductor package for the dimensionality reduction and batch effect removal of single-cell RNA-seq data](https://doi.org/10.1093/bioinformatics/btac149) Agostinis et al., 2022 - GLM-PCA - [Feature selection and dimension reduction for single-cell RNA-Seq based on a multinomial model](https://doi.org/10.1186/s13059-019-1861-6) Townes et al., 2019 * **Multiple testing** - [Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing](http://www.jstor.org/stable/2346101) Yoav Benjamini and Yosef Hochberg - [A direct approach to false discovery rates](http://genomics.princeton.edu/storeylab//papers/directfdr.pdf) John Storey - [Statistical significance for genomewide studies](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC170937/) John Storey and Robert Tibshirani - [Large-scale simultaneous hypothesis testing](http://www.tandfonline.com/doi/abs/10.1198/016214504000000089) Bradley Efron - [Empirical Bayes Analysis of a Microarray Experiment](http://dx.doi.org/10.1198/016214501753382129) Efron et al - [Measuring reproducibility of high-throughput experiments](https://projecteuclid.org/euclid.aoas/1318514284) Li et al * **Expectation maximization** - [What is the expectation maximization algorithm?](https://www.nature.com/nbt/journal/v26/n8/full/nbt1406.html) Chuong B Do and Serafim Batzoglou - [Gaussian mixture models and the EM algorithm](https://people.csail.mit.edu/rameshvs/content/gmm-em.pdf) Ramesh Sridharan - [EM algorithm notes](http://cs229.stanford.edu/notes/cs229-notes8.pdf) Andrew Ng - [MEME: discovering and analyzing DNA and protein sequence motifs](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1538909/) Bailey et al * **Hierarchical models** - [Linear models and empirical Bayes methods for assessing differential expression in microarray experiments](http://www.statsci.org/smyth/pubs/ebayes.pdf) Gordon Smyth - [Analyzing ’omics data using hierarchical models](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2904972/) Hongkai Ji and X Shirley Liu - [Stein's Paradox in Statistics](http://statweb.stanford.edu/~ckirby/brad/other/Article1977.pdf) Bradley Efron and Carl Morris - [Stein's estimation rule and its competitors - an empirical Bayes approach](https://www.jstor.org/stable/2284155) Bradley Efron and Carl Morris * **Signal processing** - [An Introduction to Hidden Markov Models](http://ieeexplore.ieee.org/abstract/document/1165342/?reload=true) Lawrence Rabiner and Biing-Hwang Juang - [Hidden Markov models approach to the analysis of array CGH data](http://www.sciencedirect.com/science/article/pii/S0047259X04000260) Fridlyand et al * **Network analysis** - [Static And Dynamic DNA Loops Form AP-1 Bound Activation Hubs During Macrophage Development](http://dx.doi.org/10.1016/j.molcel.2017.08.006) Phanstiel et al ### Resources * [Online R Classes and Resources](http://genomicsclass.github.io/book/pages/resources.html) * Rafael Irizarry and Michael Love, "Data Analysis for the Life Sciences" [Free PDF](https://leanpub.com/dataanalysisforthelifesciences), [HTML](http://genomicsclass.github.io/book/) * [Kasper Hansen, "Bioconductor for Genomic Data Science"](https://kasperdanielhansen.github.io/genbioconductor/) * [Aaron Quinlan, "Applied Computational Genomics" (Slides)](https://github.com/quinlan-lab/applied-computational-genomics) * [Jennifer Bryan et al, Stat 545](http://stat545.com/) * [Florian Markowetz, "You Are Not Working for Me; I Am Working with You"](https://doi.org/10.1371/journal.pcbi.1004387) * [Tips to succeed in Computational Biology research](tips.html) Some R resources * [Quick-R](http://www.statmethods.net/) * [R for Matlab users](http://mathesaurus.sourceforge.net/octave-r.html) * [R Cookbook](http://www.cookbook-r.com/) * [R Reference Card](https://cran.r-project.org/doc/contrib/Short-refcard.pdf) * [Advanced R](http://adv-r.had.co.nz/) * [Bioconductor Reference Card](https://github.com/mikelove/bioc-refcard) --- This page was last updated on `r format(Sys.time(), "%m/%d/%Y")`. * [GitHub repo](https://github.com/biodatascience/compbio_src) * [License](https://github.com/biodatascience/compbio_src/blob/master/LICENSE)