---
title: Course Introduction
subtitle: Biostat/Biomath M257
author: Dr. Hua Zhou @ UCLA
date: today
format:
  html:
    theme: cosmo
    embed-resources: true
    number-sections: true
    toc: true
    toc-depth: 4
    toc-location: left
    code-fold: false
jupyter:
  jupytext:
    formats: 'ipynb,qmd'
    text_representation:
      extension: .qmd
      format_name: quarto
      format_version: '1.0'
      jupytext_version: 1.14.5
  kernelspec:
    display_name: Julia 1.8.5
    language: julia
    name: julia-1.8
---

System information (for reproducibility):

```{julia}
versioninfo()
```

Load packages:

```{julia}
using Pkg

Pkg.activate(pwd())
Pkg.instantiate()
Pkg.status()
```

## Basic information

* Tue/Thu 1pm-2:50pm @ CHS 41-268A

* Instructor: Dr. Hua Zhou

## What is statistics?

* Statistics, the science of *data analysis*, is the applied mathematics of the 21st century.

* People (scientists, government, health professionals, companies) collect data in order to answer certain questions. Statisticians' job is to help them extract knowledge and insights from data.

* If existing software tools readily solve the problem, all the better.

* Often statisticians need to implement their own methods, test new algorithms, or tailor classical methods to new types of data (big, streaming).

* This entails at least two essential skills: **programming** and fundamental knowledge of **algorithms**.

## What is this course about?

* **Not** a course on statistical packages. It does not answer questions such as _How to fit a linear mixed model in R, Julia, SAS, SPSS, or Stata?_

* **Not** a pure programming course, although programming is important and we do homework in Julia.

* **Not** a course on data science. [BIOSTAT 203B (Introduction to Data Science)](https://ucla-biostat-203b.github.io/2023winter/schedule/schedule.html) in the winter quarter focuses on some software tools for data scientists.

* This course focuses on **algorithms**, mostly those in **numerical linear algebra** and **numerical optimization**.

## Learning objectives

1. Be highly appreciative of this quote by [James Gentle](https://www.google.com/books/edition/Computational_Statistics/mQ5KAAAAQBAJ?hl=en&gbpv=1&dq=inauthor:%22James+E.+Gentle%22):

    > The form of a mathematical expression and the way the expression should be evaluated in actual practice may be quite different.

    Examples: $\mathbf{X}^T \mathbf{W} \mathbf{X}$, $\operatorname{tr} (\mathbf{A} \mathbf{B})$, $\operatorname{diag} (\mathbf{A} \mathbf{B})$, multivariate normal density, ... (see the sketch after this list).

2. Become **memory-conscious**. You care about looping order. You benchmark hot functions fanatically to make sure they are not allocating.

3. **No inversion mentality**. Whenever you see a matrix inverse in a mathematical expression, your brain reacts with *matrix decomposition*, *iterative solvers*, etc. For R users, that means you almost never use the `solve(M)` function to obtain the inverse of a matrix $\boldsymbol{M}$.

    Examples: $(\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}$, $\mathbf{y}^T \boldsymbol{\Sigma}^{-1} \mathbf{y}$, Newton-Raphson algorithm, ...

4. Master some basic strategies to solve **big data** problems. Examples: how Google solves the PageRank problem with $10^{9}$ webpages, linear regression with $10^7$ observations, etc.

5. Do not be afraid of **optimization** and treat it as a technology. Be able to recognize some major optimization classes and choose the best solver(s) accordingly.

6. Be immune to the language fight.
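For a first taste of the Gentle quote, consider $\operatorname{tr}(\mathbf{A} \mathbf{B})$. The expression suggests forming the product $\mathbf{A} \mathbf{B}$, yet only its diagonal is needed; the identity $\operatorname{tr}(\mathbf{A} \mathbf{B}) = \sum_{i,j} A_{ij} B_{ji}$ avoids the product entirely. Here is a minimal sketch (the matrix size is an arbitrary choice for illustration):

```julia
using BenchmarkTools, LinearAlgebra

n = 1000
A, B = randn(n, n), randn(n, n)

# Literal translation: materializes the n×n product AB, costing O(n³) flops
# and an n×n temporary, just to read off the n diagonal entries
@btime tr($A * $B)

# Same value via tr(AB) = Σᵢⱼ AᵢⱼBⱼᵢ: an O(n²) reduction, no matrix product
@btime dot($A, transpose($B))
```

The in-class discussion at the end of this notebook plays the same game with matrix inversion.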
## Course logistics

* Course webpage: <https://ucla-biostat-257.github.io/2023spring>.

* [Syllabus](https://ucla-biostat-257.github.io/2023spring/syllabus/syllabus.html).

* Check the [Schedule](https://ucla-biostat-257.github.io/2023spring/schedule/schedule.html) page frequently.

* Jupyter notebooks will be posted/updated before each lecture.

## How to get started

All course materials are in the GitHub repo <https://github.com/ucla-biostat-257/2023spring>. Lecture notes are Jupyter Notebooks (`.ipynb` files) and Quarto Markdown (`.qmd` files) under the `slides` folder. It is a good idea to learn by running through the code examples. You can do this in several ways.

### Run Jupyter Notebook in Binder

A quick and easy way to run the Jupyter Notebooks is Binder, a free service that allows us to run Jupyter Notebooks in the cloud. Simply follow the Binder link on the [schedule](https://ucla-biostat-257.github.io/2023spring/schedule/schedule.html) page.

If you want the JupyterLab interface, replace `tree` with `lab` in the URL.

### Run Jupyter Notebook locally on your own computer

1. Download and install Julia v1.8.x from <https://julialang.org/downloads/>. On Mac, use the Bash command

    ```bash
    sudo ln -s /Applications/Julia-1.8.app/Contents/Resources/julia/bin/julia /usr/local/bin/julia
    ```

    to create a symbolic link so the `julia` command is available anywhere in the terminal.

2. Install the `IJulia` package. Open the Julia REPL, type `]` to enter the package mode, then type

    ```julia
    add IJulia
    build IJulia
    ```

3. Git clone the course material.

    ```bash
    git clone https://github.com/ucla-biostat-257/2023spring.git biostat-257-2023spring
    ```

    You can change `biostat-257-2023spring` to any other directory name you prefer.

4. In the terminal, enter the folder of the `.ipynb` file you want to run, e.g. `biostat-257-2023spring/slides/01-intro/`.

5. Open the Julia REPL and type

    ```julia
    using IJulia
    jupyterlab(dir = pwd())
    ```

    to open JupyterLab in the browser, or

    ```julia
    using IJulia
    notebook(dir = pwd())
    ```

    to open a Jupyter notebook.

6. Course material is updated frequently. Remember to `git pull` to obtain the most recent material.

### Run Quarto Markdown locally on your own computer

1. Download and install Julia v1.8.x from <https://julialang.org/downloads/>. On Mac, use the Bash command

    ```bash
    sudo ln -s /Applications/Julia-1.8.app/Contents/Resources/julia/bin/julia /usr/local/bin/julia
    ```

    to create a symbolic link so the `julia` command is available anywhere in the terminal.

2. Follow the [instructions](https://quarto.org/docs/get-started/) to install Quarto.

3. Git clone the course material.

    ```bash
    git clone https://github.com/ucla-biostat-257/2023spring.git biostat-257-2023spring
    ```

    You can change `biostat-257-2023spring` to any other directory name you prefer.

4. Double click the file `2023spring.Rproj` to open the project in RStudio.

5. Navigate to the `slides` folder and run/render the `qmd` files as you wish.

## In-class discussion

Logistic regression is typically fit by the Fisher scoring algorithm, also known as iteratively reweighted least squares (IWLS), which iterates according to
$$
\boldsymbol{\beta}^{(t)} = (\mathbf{X}^T \mathbf{W}^{(t)} \mathbf{X})^{-1} \mathbf{X}^T \mathbf{W}^{(t)} \mathbf{z}^{(t)},
$$
where $\mathbf{z}^{(t)}$ are pseudo-responses and $\mathbf{W}^{(t)} = \text{diag}(\mathbf{w}^{(t)})$ is a diagonal matrix with nonnegative weights on the diagonal. Superscript $t$ is the iteration number.
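To see where the weights and pseudo-responses come from, here is a minimal, self-contained sketch of IWLS for logistic regression with a 0/1 response $\mathbf{y}$ (the function name `iwls_logistic` and the convergence rule are illustrative choices, not part of the lecture code):

```julia
using LinearAlgebra

function iwls_logistic(X, y; maxiter = 25, tol = 1e-8)
    β = zeros(size(X, 2))
    for t in 1:maxiter
        η = X * β                  # linear predictor
        μ = @. 1 / (1 + exp(-η))   # fitted probabilities
        w = @. μ * (1 - μ)         # IWLS weights
        z = @. η + (y - μ) / w     # pseudo-responses
        # weighted least squares step via Cholesky; no explicit matrix inverse
        XtW = X' * Diagonal(w)
        βnew = cholesky!(Symmetric(XtW * X)) \ (XtW * z)
        norm(βnew - β) ≤ tol * (1 + norm(β)) && return βnew
        β = βnew
    end
    return β
end
```

Every iteration is exactly the weighted least squares solve displayed above, so the cost of one IWLS step is the cost of that solve.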
Question: How much speedup can we achieve, by careful consideration of flops and memory usage, over the following naive implementation?

```julia
inv(X' * diagm(w) * X) * X' * diagm(w) * z
```

### Experiment

First generate some data.

```{julia}
using LinearAlgebra, Random

# Random seed for reproducibility
Random.seed!(257)
# samples, features
n, p = 5000, 100
# design matrix
X = [ones(n) randn(n, p - 1)]
# pseudo-responses
z = randn(n)
# weights
w = 0.25 * rand(n);
```

### Method 1

The following code literally translates the mathematical expression into code.

```{julia}
# method 1
res1 = inv(X' * diagm(w) * X) * X' * diagm(w) * z
```

```{julia}
using BenchmarkTools

bm1 = @benchmark ((inv($X' * diagm($w) * $X) * $X') * diagm($w)) * $z
bm1
```

Several unwise choices of algorithms waste lots of flops. The memory allocations caused by intermediate results also slow down the program because of the need for garbage collection. This is a common mistake beginners in a high-level language make.

For example, the following R code (the same algorithm on the same data) shows similar allocation. The R code is much slower than Julia, possibly because of the outdated BLAS/LAPACK library being used.

```{julia}
using RCall

R"""
library(bench)
library(tidyverse)

# Interpolate Julia variables into R workspace
X <- $X
w <- $w
z <- $z

rbm <- bench::mark(
  solve(t(X) %*% diag(w) %*% X) %*% t(X) %*% diag(w) %*% z,
  iterations = 10
  ) %>%
  print(width = Inf)
""";
```

### Method 2

In the following code, we make smarter choices of algorithms (rearranging the order of evaluation; utilizing data structures such as diagonal, triangular, and positive definite matrices) and get rid of memory allocation by pre-allocating intermediate arrays.

```{julia}
# preallocation
XtWt = Matrix{Float64}(undef, p, n)
XtX = Matrix{Float64}(undef, p, p)
Xtz = Vector{Float64}(undef, p)

function myfun(X, z, w, XtWt, XtX, Xtz)
    # XtWt = X' * W
    mul!(XtWt, transpose(X), Diagonal(w))
    # XtX = X' * W * X
    mul!(XtX, XtWt, X)
    # Xtz = X' * W * z
    mul!(Xtz, XtWt, z)
    # Cholesky on XtX
    LAPACK.potrf!('U', XtX)
    # Two triangular solves to solve (XtX) \ (Xtz)
    BLAS.trsv!('U', 'T', 'N', XtX, Xtz)
    BLAS.trsv!('U', 'N', 'N', XtX, Xtz)
end

# First check correctness vs Method 1
res2 = myfun(X, z, w, XtWt, XtX, Xtz)
res1 ≈ res2
```

```{julia}
bm2 = @benchmark myfun($X, $z, $w, $XtWt, $XtX, $Xtz)
bm2
```

In R, a better implementation is

```r
solve(t(X * w) %*% X, t(X) %*% (z * w))
```

It's much faster than the naive implementation. To achieve zero memory allocation, some low-level coding using C++ and RcppEigen is necessary.

```{julia}
R"""
bench::mark(
  solve(t(X * w) %*% X, t(X) %*% (z * w)),
  ) %>%
  print(width = Inf)
""";
```

### Conclusion

By careful consideration of the computational algorithms, we achieve a whopping speedup (in Julia) of

```{julia}
# speed-up
median(bm1.times) / median(bm2.times)
```

For PhD students, that means, instead of waiting two months (65 days) for your simulations to finish, you only need one day!
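Where does the speedup come from? A back-of-the-envelope flop count (ignoring lower-order terms) tells most of the story. Method 1 materializes the dense $n \times n$ matrix `diagm(w)` and multiplies a $p \times n$ matrix against it twice (once on each side of the inverse), each product costing about $2n^2p$ flops; with $n = 5000$ and $p = 100$ that is roughly $5 \times 10^9$ flops per product, on top of an explicit inversion and several large temporaries. In Method 2, `Diagonal(w)` makes the weighting an $O(np)$ column scaling, forming $\mathbf{X}^T \mathbf{W} \mathbf{X}$ costs about $2np^2 \approx 10^8$ flops, the Cholesky factorization about $p^3/3$, and the two triangular solves about $2p^2$. The dominant cost thus drops from $O(n^2p)$ to $O(np^2)$, a factor of order $n/p$, and eliminating the large temporaries (hence garbage collection) accounts for much of the remaining gap.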