---
title: Cholesky Decomposition
subtitle: Biostat/Biomath M257
author: Dr. Hua Zhou @ UCLA
date: today
format:
  html:
    theme: cosmo
    embed-resources: true
    number-sections: true
    toc: true
    toc-depth: 4
    toc-location: left
    code-fold: false
jupyter:
  jupytext:
    formats: 'ipynb,qmd'
    text_representation:
      extension: .qmd
      format_name: quarto
      format_version: '1.0'
      jupytext_version: 1.14.5
  kernelspec:
    display_name: Julia (8 threads) 1.8.5
    language: julia
    name: julia-_8-threads_-1.8
---

System information (for reproducibility):

```{julia}
versioninfo()
```

Load packages:

```{julia}
using Pkg
Pkg.activate(pwd())
Pkg.instantiate()
Pkg.status()
```

## Introduction

* A basic tenet in numerical analysis:

> **The structure should be exploited whenever solving a problem.**

  Common structures include: symmetry, positive (semi)definiteness, sparsity, Kronecker product, low rank, ...

* LU decomposition (Gaussian elimination) is **not** used in statistics very often, because most of the time statisticians deal with positive (semi)definite matrices.

* That is why we often criticize the use of the `solve()` function in base R code, which solves a linear system (or computes a matrix inverse) via LU decomposition. First, most likely we don't need a matrix inverse. Second, most likely we are dealing with a pd/psd matrix and should use Cholesky.

* For example, in the normal equation
$$
    \mathbf{X}^T \mathbf{X} \beta = \mathbf{X}^T \mathbf{y}
$$
for linear regression, the coefficient matrix $\mathbf{X}^T \mathbf{X}$ is symmetric and positive semidefinite. How can we exploit this structure?

## Cholesky decomposition

* **Theorem**: Let $\mathbf{A} \in \mathbb{R}^{n \times n}$ be symmetric and positive definite. Then $\mathbf{A} = \mathbf{L} \mathbf{L}^T$, where $\mathbf{L}$ is lower triangular with positive diagonal entries, and $\mathbf{L}$ is unique.
**Proof** (by induction): If $n=1$, then $\ell_{11} = \sqrt{a_{11}}$. For $n>1$, the block equation
$$
\begin{eqnarray*}
\begin{pmatrix}
a_{11} & \mathbf{a}^T \\
\mathbf{a} & \mathbf{A}_{22}
\end{pmatrix} =
\begin{pmatrix}
\ell_{11} & \mathbf{0}_{n-1}^T \\
\mathbf{l} & \mathbf{L}_{22}
\end{pmatrix}
\begin{pmatrix}
\ell_{11} & \mathbf{l}^T \\
\mathbf{0}_{n-1} & \mathbf{L}_{22}^T
\end{pmatrix}
\end{eqnarray*}
$$
has solution
$$
\begin{eqnarray*}
\ell_{11} &=& \sqrt{a_{11}} \\
\mathbf{l} &=& \ell_{11}^{-1} \mathbf{a} \\
\mathbf{L}_{22} \mathbf{L}_{22}^T &=& \mathbf{A}_{22} - \mathbf{l} \mathbf{l}^T = \mathbf{A}_{22} - a_{11}^{-1} \mathbf{a} \mathbf{a}^T.
\end{eqnarray*}
$$
Now $a_{11}>0$ (why?), so $\ell_{11}$ and $\mathbf{l}$ are uniquely determined. $\mathbf{A}_{22} - a_{11}^{-1} \mathbf{a} \mathbf{a}^T$ is positive definite because $\mathbf{A}$ is positive definite (why?). By the induction hypothesis, $\mathbf{L}_{22}$ exists and is unique.

* The constructive proof completely specifies the algorithm; a minimal Julia sketch is given at the end of this section.

* Computational cost:
$$
\frac{1}{2} [2(n-1)^2 + 2(n-2)^2 + \cdots + 2 \cdot 1^2] \approx \frac{1}{3} n^3 \quad \text{flops}
$$
plus $n$ square roots. This is half the cost of LU decomposition, obtained by exploiting symmetry.

* In general the Cholesky decomposition is very stable. Failure of the decomposition simply means $\mathbf{A}$ is not positive definite, so it is also an efficient way to test positive definiteness.
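To make the constructive proof concrete, here is a minimal, unoptimized sketch of the unblocked algorithm; it is for illustration only (the function name `chol_sketch` and the test matrix `Atest` are ad hoc), since in practice one calls `cholesky`, which dispatches to LAPACK.

```{julia}
using LinearAlgebra

# Unblocked Cholesky, following the constructive proof; for illustration only.
# Assumes the input is symmetric; errors if a nonpositive pivot is encountered.
function chol_sketch(A::Matrix{Float64})
    L = copy(A)                    # the lower triangle will be overwritten by L
    n = size(L, 1)
    for k in 1:n
        L[k, k] > 0 || error("matrix is not positive definite")
        L[k, k] = sqrt(L[k, k])    # ℓ_kk = √a_kk
        for i in k+1:n
            L[i, k] /= L[k, k]     # l = a / ℓ_kk
        end
        for j in k+1:n, i in j:n   # Schur complement A22 - l*l' (lower triangle only)
            L[i, j] -= L[i, k] * L[j, k]
        end
    end
    LowerTriangular(L)
end

Atest = [4.0 12 -16; 12 37 -43; -16 -43 98]
chol_sketch(Atest) ≈ cholesky(Symmetric(Atest)).L   # should agree with LAPACK
```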
## Pivoting

* When $\mathbf{A}$ does not have full rank, e.g., $\mathbf{X}^T \mathbf{X}$ with a non-full column rank $\mathbf{X}$, we encounter $a_{kk} = 0$ during the procedure.

* **Symmetric pivoting**. At each stage $k$, we permute both rows and columns such that $\max_{k \le i \le n} a_{ii}$ becomes the pivot. If we encounter $\max_{k \le i \le n} a_{ii} = 0$, then $\mathbf{A}[k:n,k:n] = \mathbf{0}$ (why?) and the algorithm terminates.

* With symmetric pivoting:
$$
\mathbf{P} \mathbf{A} \mathbf{P}^T = \mathbf{L} \mathbf{L}^T,
$$
where $\mathbf{P}$ is a permutation matrix and $\mathbf{L} \in \mathbb{R}^{n \times r}$, $r = \text{rank}(\mathbf{A})$.

## LAPACK and Julia implementation

* LAPACK functions: [?potrf](http://www.netlib.org/lapack/explore-html/d1/d7a/group__double_p_ocomputational_ga2f55f604a6003d03b5cd4a0adcfb74d6.html#ga2f55f604a6003d03b5cd4a0adcfb74d6) (without pivoting), [?pstrf](http://www.netlib.org/lapack/explore-html/da/dba/group__double_o_t_h_e_rcomputational_ga31cdc13a7f4ad687f4aefebff870e1cc.html#ga31cdc13a7f4ad687f4aefebff870e1cc) (with pivoting).

* Julia functions: [cholesky](https://docs.julialang.org/en/v1/stdlib/LinearAlgebra/#LinearAlgebra.cholesky), [cholesky!](https://docs.julialang.org/en/v1/stdlib/LinearAlgebra/#LinearAlgebra.cholesky!), or call the LAPACK wrapper functions [potrf!](https://docs.julialang.org/en/v1/stdlib/LinearAlgebra/#LinearAlgebra.LAPACK.potrf!) and [pstrf!](https://docs.julialang.org/en/v1/stdlib/LinearAlgebra/#LinearAlgebra.LAPACK.pstrf!).

### Example: positive definite matrix

```{julia}
using LinearAlgebra

A = [4.0 12 -16; 12 37 -43; -16 -43 98]
```

```{julia}
# Cholesky without pivoting
Achol = cholesky(Symmetric(A))
```

```{julia}
typeof(Achol)
```

```{julia}
dump(Achol)
```

```{julia}
# retrieve the lower triangular Cholesky factor
Achol.L
```

```{julia}
# retrieve the upper triangular Cholesky factor
Achol.U
```

```{julia}
b = [1.0; 2.0; 3.0]
A \ b # this does LU; wasteful! (2/3) n^3 + 2n^2 flops
```

```{julia}
Achol \ b # two triangular solves; only 2n^2 flops
```

```{julia}
det(A) # this does LU; wasteful! (2/3) n^3 flops
```

```{julia}
det(Achol) # cheap
```

```{julia}
inv(A) # this does LU! (2/3) n^3 + (4/3) n^3 flops
```

```{julia}
inv(Achol) # (4/3) n^3 flops
```

### Example: positive semidefinite matrix

```{julia}
using Random

Random.seed!(123) # seed
A = randn(5, 3)
A = A * transpose(A) # A has rank 3
```

```{julia}
# 2nd argument requests pivoting; this errors because A is rank deficient
Achol = cholesky(A, Val(true))
```

```{julia}
# turn off the positive definiteness check
Achol = cholesky(A, Val(true), check=false)
```

```{julia}
rank(Achol) # determine rank from the Cholesky factor
```

```{julia}
rank(A) # determine rank from SVD (default), which is more numerically stable
```

```{julia}
Achol.L
```

```{julia}
Achol.U
```

```{julia}
Achol.p
```

```{julia}
# P A P' = L L'
A[Achol.p, Achol.p] ≈ Achol.L * Achol.U
```
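As a quick numerical check of the claim that $\mathbf{L}$ can be taken in $\mathbb{R}^{n \times r}$, we can keep only the first $r$ columns of the factor just computed and see that they already reproduce $\mathbf{P} \mathbf{A} \mathbf{P}^T$ up to round-off. This continues the example above; the variable names `r` and `Lr` are ad hoc.

```{julia}
# keep only the first r = rank(A) columns of the pivoted Cholesky factor
r  = rank(Achol)
Lr = Matrix(Achol.L)[:, 1:r]   # n × r factor
# reconstruction error of P A P' from the rank-r factor; near round-off here
norm(A[Achol.p, Achol.p] - Lr * Lr')
```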
## Applications

* **No inversion** mentality: whenever we see a matrix inverse, we should think in terms of solving linear equations. If the matrix is positive (semi)definite, use Cholesky decomposition, which is twice as cheap as LU decomposition.

### Multivariate normal density

The log-density of a multivariate normal $\text{MVN}(0, \Sigma)$, where $\Sigma$ is p.d., is
$$
- \frac{n}{2} \log (2\pi) - \frac{1}{2} \log \det \Sigma - \frac{1}{2} \mathbf{y}^T \Sigma^{-1} \mathbf{y}.
$$

* **Method 1**: (a) compute the explicit inverse $\Sigma^{-1}$ ($2n^3$ flops), (b) compute the quadratic form ($2n^2 + 2n$ flops), (c) compute the determinant ($2n^3/3$ flops).

* **Method 2**: (a) Cholesky decomposition $\Sigma = \mathbf{L} \mathbf{L}^T$ ($n^3/3$ flops), (b) solve $\mathbf{L} \mathbf{x} = \mathbf{y}$ by forward substitution ($n^2$ flops), (c) compute the quadratic form $\mathbf{x}^T \mathbf{x}$ ($2n$ flops), and (d) compute the determinant from the Cholesky factor ($n$ flops).

**Which method is better?**

```{julia}
# this is a person w/o numerical analysis training
function logpdf_mvn_1(y::Vector, Σ::Matrix)
    n = length(y)
    - (n//2) * log(2π) - (1//2) * logdet(Symmetric(Σ)) - (1//2) * transpose(y) * inv(Σ) * y
end

# this is a computing-savvy person
function logpdf_mvn_2(y::Vector, Σ::Matrix)
    n = length(y)
    Σchol = cholesky(Symmetric(Σ))
    - (n//2) * log(2π) - (1//2) * logdet(Σchol) - (1//2) * abs2(norm(Σchol.L \ y))
end

# better memory efficiency - input Σ is overwritten
function logpdf_mvn_3(y::Vector, Σ::Matrix)
    n = length(y)
    Σchol = cholesky!(Symmetric(Σ)) # Σ is overwritten
    - (n//2) * log(2π) - (1//2) * logdet(Σchol) - (1//2) * dot(y, Σchol \ y)
end
```

```{julia}
using BenchmarkTools, Distributions, Random

Random.seed!(257) # seed

n = 1000
# a pd matrix
Σ = convert(Matrix{Float64}, Symmetric([i * (n - j + 1) for i in 1:n, j in 1:n]))
y = rand(MvNormal(Σ)) # one random sample from N(0, Σ)

# at least they should give the same answer
@show logpdf_mvn_1(y, Σ)
@show logpdf_mvn_2(y, Σ)
Σc = copy(Σ)
@show logpdf_mvn_3(y, Σc);
```

```{julia}
@benchmark logpdf_mvn_1($y, $Σ)
```

```{julia}
@benchmark logpdf_mvn_2($y, $Σ)
```

```{julia}
@benchmark logpdf_mvn_3($y, $Σc) setup=(copy!(Σc, Σ)) evals=1
```

* To evaluate the same multivariate normal density at many observations $y_1, y_2, \ldots$, we pre-compute the Cholesky decomposition ($n^3/3$ flops); then each evaluation costs only $n^2$ flops (one triangular solve).

### Linear regression

- Cholesky decomposition is **one** approach for solving linear regression
$$
\widehat{\boldsymbol{\beta}} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}.
$$

- Assume $\mathbf{X} \in \mathbb{R}^{n \times p}$ and $\mathbf{y} \in \mathbb{R}^n$.
    - Compute $\mathbf{X}^T \mathbf{X}$: $np^2$ flops
    - Compute $\mathbf{X}^T \mathbf{y}$: $2np$ flops
    - Cholesky decomposition of $\mathbf{X}^T \mathbf{X}$: $\frac{1}{3} p^3$ flops
    - Solve the normal equation $\mathbf{X}^T \mathbf{X} \beta = \mathbf{X}^T \mathbf{y}$: $2p^2$ flops
    - If standard errors are needed, another $(4/3) p^3$ flops

- The total computational cost is $np^2 + (1/3) p^3$ (without s.e.) or $np^2 + (5/3) p^3$ (with s.e.) flops. A minimal numerical sketch of this recipe follows the reading list below.

## Further reading

* Section 7.7 of [Numerical Analysis for Statisticians](http://ucla.worldcat.org/title/numerical-analysis-for-statisticians/oclc/793808354&referer=brief_results) by Kenneth Lange (2010).

* Section II.5.3 of [Computational Statistics](http://ucla.worldcat.org/title/computational-statistics/oclc/437345409&referer=brief_results) by James Gentle (2010).

* Section 4.2 of [Matrix Computations](http://catalog.library.ucla.edu/vwebv/holdingsInfo?bibId=7122088) by Gene Golub and Charles Van Loan (2013).
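As promised in the linear regression section, here is a minimal sketch of the Cholesky route to the normal equations. The simulated data (`X`, `y`, and the dimensions) and the comparison against the QR-based solve `X \ y` are purely illustrative.

```{julia}
using LinearAlgebra, Random

Random.seed!(257)
n, p = 100, 5
X = randn(n, p)   # illustrative design matrix
y = randn(n)      # illustrative response

# Cholesky of the Gram matrix, then two triangular solves
XtXchol = cholesky(Symmetric(X'X))   # (1/3) p^3 flops
βhat = XtXchol \ (X'y)               # 2 p^2 flops

# sanity check against Julia's QR-based least squares solve
βhat ≈ X \ y
```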