---
title: Course Introduction
subtitle: Biostat/Biomath M257
author: Dr. Hua Zhou @ UCLA
date: today
format:
  html:
    theme: cosmo
    embed-resources: true
    number-sections: true
    toc: true
    toc-depth: 4
    toc-location: left
    code-fold: false
jupyter:
  jupytext:
    formats: 'ipynb,qmd'
    text_representation:
      extension: .qmd
      format_name: quarto
      format_version: '1.0'
      jupytext_version: 1.14.5
  kernelspec:
    display_name: Julia 1.8.5
    language: julia
    name: julia-1.8
---

System information (for reproducibility):

```{julia}
versioninfo()
```

Load packages:

```{julia}
using Pkg

Pkg.activate(pwd())
Pkg.instantiate()
Pkg.status()
```

## Basic information

* Tue/Thu 1pm-2:50pm @ CHS 41-268A

* Instructor: Dr. Hua Zhou

## What is statistics?

* Statistics, the science of *data analysis*, is the applied mathematics of the 21st century.

* People (scientists, government, health professionals, companies) collect data in order to answer certain questions. Statisticians' job is to help them extract knowledge and insights from data.

* If existing software tools readily solve the problem, all the better.

* Often statisticians need to implement their own methods, test new algorithms, or tailor classical methods to new types of data (big, streaming).

* This entails at least two essential skills: **programming** and fundamental knowledge of **algorithms**.

## What is this course about?

* **Not** a course on statistical packages. It does not answer questions such as _How to fit a linear mixed model in R, Julia, SAS, SPSS, or Stata?_

* **Not** a pure programming course, although programming is important and we do homework in Julia.

* **Not** a course on data science. [BIOSTAT 203B (Introduction to Data Science)](https://ucla-biostat-203b.github.io/2023winter/schedule/schedule.html) in the winter quarter focuses on some software tools for data scientists.

* This course focuses on **algorithms**, mostly those in **numerical linear algebra** and **numerical optimization**.

## Learning objectives

1. Be highly appreciative of this quote by [James Gentle](https://www.google.com/books/edition/Computational_Statistics/mQ5KAAAAQBAJ?hl=en&gbpv=1&dq=inauthor:%22James+E.+Gentle%22):

    > The form of a mathematical expression and the way the expression should be evaluated in actual practice may be quite different.

    Examples: $\mathbf{X}^T \mathbf{W} \mathbf{X}$, $\operatorname{tr} (\mathbf{A} \mathbf{B})$, $\operatorname{diag} (\mathbf{A} \mathbf{B})$, multivariate normal density, ... (see the sketch after this list).

2. Become **memory-conscious**. You care about looping order. You benchmark hot functions fanatically to make sure they are not allocating.

3. **No inversion mentality**. Whenever you see a matrix inverse in a mathematical expression, your brain reacts with *matrix decomposition*, *iterative solvers*, etc. For R users, that means you almost never use the `solve(M)` function to obtain the inverse of a matrix $\boldsymbol{M}$.

    Examples: $(\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}$, $\mathbf{y}^T \boldsymbol{\Sigma}^{-1} \mathbf{y}$, Newton-Raphson algorithm, ...

4. Master some basic strategies to solve **big data** problems. Examples: how Google solves the PageRank problem with $10^{9}$ webpages, linear regression with $10^7$ observations, etc.

5. Do not be afraid of **optimization** and treat it as a technology. Be able to recognize some major optimization classes and choose the best solver(s) accordingly.

6. Be immune to the language fight.
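For a first taste of the Gentle quote, consider $\operatorname{tr}(\mathbf{A} \mathbf{B})$. The expression suggests forming the product $\mathbf{A} \mathbf{B}$, yet only its diagonal is needed; the identity $\operatorname{tr}(\mathbf{A} \mathbf{B}) = \sum_{i,j} A_{ij} B_{ji}$ avoids the product entirely. Here is a minimal sketch (the matrix size is an arbitrary choice for illustration):

```julia
using BenchmarkTools, LinearAlgebra

n = 1000
A, B = randn(n, n), randn(n, n)

# Literal translation: materializes the n×n product AB, costing O(n³) flops
# and an n×n temporary, just to read off the n diagonal entries
@btime tr($A * $B)

# Same value via tr(AB) = Σᵢⱼ AᵢⱼBⱼᵢ: an O(n²) reduction, no matrix product
@btime dot($A, transpose($B))
```

The in-class discussion at the end of this notebook plays the same game with matrix inversion.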
## Course logistics

* Course webpage: <https://ucla-biostat-257.github.io/2023spring>.

* [Syllabus](https://ucla-biostat-257.github.io/2023spring/syllabus/syllabus.html).

* Check the [Schedule](https://ucla-biostat-257.github.io/2023spring/schedule/schedule.html) page frequently.

* Jupyter notebooks will be posted/updated before each lecture.

## How to get started

All course materials are in the GitHub repo <https://github.com/ucla-biostat-257/2023spring>. Lecture notes are Jupyter Notebooks (`.ipynb` files) and Quarto Markdown (`.qmd` files) under the `slides` folder. It is a good idea to learn by running through the code examples. You can do this in several ways.

### Run Jupyter Notebook in Binder

A quick and easy way to run the Jupyter Notebooks is Binder, a free service that allows us to run Jupyter Notebooks in the cloud. Simply follow the Binder link on the [schedule](https://ucla-biostat-257.github.io/2023spring/schedule/schedule.html) page.

If you want the JupyterLab interface, replace `tree` with `lab` in the URL.

### Run Jupyter Notebook locally on your own computer

1. Download and install Julia v1.8.x from <https://julialang.org/downloads/>. On Mac, use the Bash command

    ```bash
    sudo ln -s /Applications/Julia-1.8.app/Contents/Resources/julia/bin/julia /usr/local/bin/julia
    ```

    to create a symbolic link so the `julia` command is available anywhere in the terminal.

2. Install the `IJulia` package. Open the Julia REPL, type `]` to enter the package mode, then type

    ```julia
    add IJulia
    build IJulia
    ```

3. Git clone the course material.

    ```bash
    git clone https://github.com/ucla-biostat-257/2023spring.git biostat-257-2023spring
    ```

    You can change `biostat-257-2023spring` to any other directory name you prefer.

4. In the terminal, enter the folder of the `.ipynb` file you want to run, e.g. `biostat-257-2023spring/slides/01-intro/`.

5. Open the Julia REPL and type

    ```julia
    using IJulia
    jupyterlab(dir = pwd())
    ```

    to open JupyterLab in the browser, or

    ```julia
    using IJulia
    notebook(dir = pwd())
    ```

    to open a Jupyter notebook.

6. Course material is updated frequently. Remember to `git pull` to obtain the most recent material.

### Run Quarto Markdown locally on your own computer

1. Download and install Julia v1.8.x from <https://julialang.org/downloads/>. On Mac, use the Bash command

    ```bash
    sudo ln -s /Applications/Julia-1.8.app/Contents/Resources/julia/bin/julia /usr/local/bin/julia
    ```

    to create a symbolic link so the `julia` command is available anywhere in the terminal.

2. Follow the [instructions](https://quarto.org/docs/get-started/) to install Quarto.

3. Git clone the course material.

    ```bash
    git clone https://github.com/ucla-biostat-257/2023spring.git biostat-257-2023spring
    ```

    You can change `biostat-257-2023spring` to any other directory name you prefer.

4. Double click the file `2023spring.Rproj` to open the project in RStudio.

5. Navigate to the `slides` folder and run/render the `qmd` files as you wish.

## In-class discussion

Logistic regression is typically fit by the Fisher scoring algorithm, also known as iteratively reweighted least squares (IWLS), which iterates according to
$$
\boldsymbol{\beta}^{(t)} = (\mathbf{X}^T \mathbf{W}^{(t)} \mathbf{X})^{-1} \mathbf{X}^T \mathbf{W}^{(t)} \mathbf{z}^{(t)},
$$
where $\mathbf{z}^{(t)}$ are pseudo-responses and $\mathbf{W}^{(t)} = \text{diag}(\mathbf{w}^{(t)})$ is a diagonal matrix with nonnegative weights on the diagonal. Superscript $t$ is the iteration number.
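To see where the weights and pseudo-responses come from, here is a minimal, self-contained sketch of IWLS for logistic regression with a 0/1 response $\mathbf{y}$ (the function name `iwls_logistic` and the convergence rule are illustrative choices, not part of the lecture code):

```julia
using LinearAlgebra

function iwls_logistic(X, y; maxiter = 25, tol = 1e-8)
    β = zeros(size(X, 2))
    for t in 1:maxiter
        η = X * β                  # linear predictor
        μ = @. 1 / (1 + exp(-η))   # fitted probabilities
        w = @. μ * (1 - μ)         # IWLS weights
        z = @. η + (y - μ) / w     # pseudo-responses
        # weighted least squares step via Cholesky; no explicit matrix inverse
        XtW = X' * Diagonal(w)
        βnew = cholesky!(Symmetric(XtW * X)) \ (XtW * z)
        norm(βnew - β) ≤ tol * (1 + norm(β)) && return βnew
        β = βnew
    end
    return β
end
```

Every iteration is exactly the weighted least squares solve displayed above, so the cost of one IWLS step is the cost of that solve.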
Question: How much speedup can we achieve, by careful consideration of flops and memory usage, over the following naive implementation?

```julia
inv(X' * diagm(w) * X) * X' * diagm(w) * z
```

### Experiment

First generate some data.

```{julia}
using LinearAlgebra, Random

# Random seed for reproducibility
Random.seed!(257)
# samples, features
n, p = 5000, 100
# design matrix
X = [ones(n) randn(n, p - 1)]
# pseudo-responses
z = randn(n)
# weights
w = 0.25 * rand(n);
```

### Method 1

The following code literally translates the mathematical expression into code.

```{julia}
# method 1
res1 = inv(X' * diagm(w) * X) * X' * diagm(w) * z
```

```{julia}
using BenchmarkTools

bm1 = @benchmark ((inv($X' * diagm($w) * $X) * $X') * diagm($w)) * $z
bm1
```

Several unwise choices of algorithms waste lots of flops. The memory allocations caused by intermediate results also slow down the program because of the need for garbage collection. This is a common mistake beginners in a high-level language make.

For example, the following R code (the same algorithm on the same data) shows similar allocation. The R code is much slower than Julia, possibly because of the outdated BLAS/LAPACK library being used.

```{julia}
using RCall

R"""
library(bench)
library(tidyverse)

# Interpolate Julia variables into R workspace
X <- $X
w <- $w
z <- $z

rbm <- bench::mark(
  solve(t(X) %*% diag(w) %*% X) %*% t(X) %*% diag(w) %*% z,
  iterations = 10
  ) %>%
  print(width = Inf)
""";
```

### Method 2

In the following code, we make smarter choices of algorithms (rearranging the order of evaluation; utilizing data structures such as diagonal, triangular, and positive definite matrices) and get rid of memory allocation by pre-allocating intermediate arrays.

```{julia}
# preallocation
XtWt = Matrix{Float64}(undef, p, n)
XtX = Matrix{Float64}(undef, p, p)
Xtz = Vector{Float64}(undef, p)

function myfun(X, z, w, XtWt, XtX, Xtz)
    # XtWt = X' * W
    mul!(XtWt, transpose(X), Diagonal(w))
    # XtX = X' * W * X
    mul!(XtX, XtWt, X)
    # Xtz = X' * W * z
    mul!(Xtz, XtWt, z)
    # Cholesky on XtX
    LAPACK.potrf!('U', XtX)
    # Two triangular solves to solve (XtX) \ (Xtz)
    BLAS.trsv!('U', 'T', 'N', XtX, Xtz)
    BLAS.trsv!('U', 'N', 'N', XtX, Xtz)
end

# First check correctness vs Method 1
res2 = myfun(X, z, w, XtWt, XtX, Xtz)
res1 ≈ res2
```

```{julia}
bm2 = @benchmark myfun($X, $z, $w, $XtWt, $XtX, $Xtz)
bm2
```

In R, a better implementation is

```r
solve(t(X * w) %*% X, t(X) %*% (z * w))
```

It's much faster than the naive implementation. To achieve zero memory allocation, some low-level coding using C++ and RcppEigen is necessary.

```{julia}
R"""
bench::mark(
  solve(t(X * w) %*% X, t(X) %*% (z * w)),
  ) %>%
  print(width = Inf)
""";
```

### Conclusion

By careful consideration of the computational algorithms, we achieve a whopping speedup (in Julia) of

```{julia}
# speed-up
median(bm1.times) / median(bm2.times)
```

For PhD students, that means, instead of waiting two months (65 days) for your simulations to finish, you only need one day!
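Where does the speedup come from? A back-of-the-envelope flop count (ignoring lower-order terms) tells most of the story. Method 1 materializes the dense $n \times n$ matrix `diagm(w)` and multiplies a $p \times n$ matrix against it twice (once on each side of the inverse), each product costing about $2n^2p$ flops; with $n = 5000$ and $p = 100$ that is roughly $5 \times 10^9$ flops per product, on top of an explicit inversion and several large temporaries. In Method 2, `Diagonal(w)` makes the weighting an $O(np)$ column scaling, forming $\mathbf{X}^T \mathbf{W} \mathbf{X}$ costs about $2np^2 \approx 10^8$ flops, the Cholesky factorization about $p^3/3$, and the two triangular solves about $2p^2$. The dominant cost thus drops from $O(n^2p)$ to $O(np^2)$, a factor of order $n/p$, and eliminating the large temporaries (hence garbage collection) accounts for much of the remaining gap.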