---
title: Conjugate Gradient and Krylov Subspace Methods
subtitle: Biostat/Biomath M257
author: Dr. Hua Zhou @ UCLA
date: today
format:
  html:
    theme: cosmo
    embed-resources: true
    number-sections: true
    toc: true
    toc-depth: 4
    toc-location: left
    code-fold: false
jupyter:
  jupytext:
    formats: 'ipynb,qmd'
    text_representation:
      extension: .qmd
      format_name: quarto
      format_version: '1.0'
      jupytext_version: 1.14.5
  kernelspec:
    display_name: Julia (8 threads) 1.8.5
    language: julia
    name: julia-_8-threads_-1.8
---

System information (for reproducibility):

```{julia}
versioninfo()
```

Load packages:

```{julia}
using Pkg
Pkg.activate(pwd())
Pkg.instantiate()
Pkg.status()
```

## Introduction

* Conjugate gradient is the top-notch iterative method for solving large, **structured** linear systems $\mathbf{A} \mathbf{x} = \mathbf{b}$, where $\mathbf{A}$ is positive definite (pd). Earlier we talked about Jacobi, Gauss-Seidel, and successive over-relaxation (SOR) as the classical iterative solvers. They are rarely used in practice due to slow convergence.

    [Kershaw's results](http://www.sciencedirect.com/science/article/pii/0021999178900980?via%3Dihub) for a fusion problem:

| Method                                     | Number of Iterations |
|--------------------------------------------|----------------------|
| Gauss-Seidel                               | 208,000              |
| Block SOR methods                          | 765                  |
| Incomplete Cholesky **conjugate gradient** | 25                   |

* History: Hestenes (**UCLA** professor!) and Stiefel proposed the conjugate gradient method in the 1950s.

    Hestenes and Stiefel (1952), [Methods of conjugate gradients for solving linear systems](http://nvlpubs.nist.gov/nistpubs/jres/049/jresv49n6p409_A1b.pdf), _Journal of Research of the National Bureau of Standards_.

* Solving the linear equation $\mathbf{A} \mathbf{x} = \mathbf{b}$, where $\mathbf{A} \in \mathbb{R}^{n \times n}$ is **pd**, is equivalent to
$$
\begin{eqnarray*}
    \text{minimize} \,\, f(\mathbf{x}) = \frac 12 \mathbf{x}^T \mathbf{A} \mathbf{x} - \mathbf{b}^T \mathbf{x}.
\end{eqnarray*}
$$
Denote $\nabla f(\mathbf{x}) = \mathbf{A} \mathbf{x} - \mathbf{b} =: \mathbf{r}(\mathbf{x})$.

## Conjugate gradient (CG) method

* Consider a simple idea: coordinate descent, that is, update the components $x_j$ alternately. This is the same as the Gauss-Seidel iteration. Usually it takes too many iterations.

* A set of vectors $\{\mathbf{p}^{(0)},\ldots,\mathbf{p}^{(l)}\}$ is said to be **conjugate with respect to $\mathbf{A}$** if
$$
\begin{eqnarray*}
    \mathbf{p}^{(i)T} \mathbf{A} \mathbf{p}^{(j)} = 0, \quad \text{for all } i \ne j.
\end{eqnarray*}
$$
For example, eigenvectors of $\mathbf{A}$ are conjugate to each other. Why?

* **Conjugate direction** method: Given a set of conjugate vectors $\{\mathbf{p}^{(0)},\ldots,\mathbf{p}^{(l)}\}$, at iteration $t$ we search along the conjugate direction $\mathbf{p}^{(t)}$
$$
\begin{eqnarray*}
    \mathbf{x}^{(t+1)} = \mathbf{x}^{(t)} + \alpha^{(t)} \mathbf{p}^{(t)},
\end{eqnarray*}
$$
where
$$
\begin{eqnarray*}
    \alpha^{(t)} = - \frac{\mathbf{r}^{(t)T} \mathbf{p}^{(t)}}{\mathbf{p}^{(t)T} \mathbf{A} \mathbf{p}^{(t)}}
\end{eqnarray*}
$$
is the optimal step length.

* Theorem: In the conjugate direction method, $\mathbf{x}^{(t)}$ converges to the solution in **at most** $n$ steps.

    Intuition: Look at the graph.
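To make the conjugate direction method concrete, here is a small Julia illustration (not part of the original notes; the function name `conjdir` and the random test matrix are made up for this sketch). It uses the eigenvectors of $\mathbf{A}$ as the conjugate directions, checks the conjugacy claim above, and verifies that the exact solution is reached after $n$ steps.

```{julia}
using LinearAlgebra

# conjugate direction method: search along the given A-conjugate directions
function conjdir(A, b, P; x = zeros(size(A, 1)))
    for t in 1:size(P, 2)
        p = P[:, t]
        r = A * x - b                    # gradient ∇f(x) = Ax - b
        α = -dot(r, p) / dot(p, A * p)   # optimal step length
        x += α * p
    end
    return x
end

n = 5
B = randn(n, n)
A = B'B + n * I                # a positive definite test matrix
b = randn(n)

P = eigvecs(Symmetric(A))      # eigenvectors of A are conjugate w.r.t. A
@show norm(P' * A * P - Diagonal(P' * A * P))  # ≈ 0: off-diagonal pᵢᵀApⱼ vanish
x = conjdir(A, b, P)
@show norm(A * x - b);         # ≈ 0: solved after n steps
```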
* **Conjugate gradient** method. Idea: generate $\mathbf{p}^{(t)}$ using only $\mathbf{p}^{(t-1)}$
$$
\begin{eqnarray*}
    \mathbf{p}^{(t)} = - \mathbf{r}^{(t)} + \beta^{(t)} \mathbf{p}^{(t-1)},
\end{eqnarray*}
$$
where $\beta^{(t)}$ is determined by the conjugacy condition $\mathbf{p}^{(t-1)T} \mathbf{A} \mathbf{p}^{(t)} = 0$:
$$
\begin{eqnarray*}
    \beta^{(t)} = \frac{\mathbf{r}^{(t)T} \mathbf{A} \mathbf{p}^{(t-1)}}{\mathbf{p}^{(t-1)T} \mathbf{A} \mathbf{p}^{(t-1)}}.
\end{eqnarray*}
$$

* **CG algorithm (preliminary version)**:

    0. Given $\mathbf{x}^{(0)}$
    0. Initialize: $\mathbf{r}^{(0)} \gets \mathbf{A} \mathbf{x}^{(0)} - \mathbf{b}$, $\mathbf{p}^{(0)} \gets - \mathbf{r}^{(0)}$, $t=0$
    0. While $\mathbf{r}^{(t)} \ne \mathbf{0}$
        1. $\alpha^{(t)} \gets - \frac{\mathbf{r}^{(t)T} \mathbf{p}^{(t)}}{\mathbf{p}^{(t)T} \mathbf{A} \mathbf{p}^{(t)}}$
        2. $\mathbf{x}^{(t+1)} \gets \mathbf{x}^{(t)} + \alpha^{(t)} \mathbf{p}^{(t)}$
        3. $\mathbf{r}^{(t+1)} \gets \mathbf{A} \mathbf{x}^{(t+1)} - \mathbf{b}$
        4. $\beta^{(t+1)} \gets \frac{\mathbf{r}^{(t+1)T} \mathbf{A} \mathbf{p}^{(t)}}{\mathbf{p}^{(t)T} \mathbf{A} \mathbf{p}^{(t)}}$
        5. $\mathbf{p}^{(t+1)} \gets - \mathbf{r}^{(t+1)} + \beta^{(t+1)} \mathbf{p}^{(t)}$
        6. $t \gets t+1$

    Remark: The initial conjugate direction $\mathbf{p}^{(0)} \gets - \mathbf{r}^{(0)}$ is crucial.

* Theorem: With the CG algorithm,

    0. the residuals $\mathbf{r}^{(t)}$ are mutually orthogonal;
    0. $\{\mathbf{r}^{(0)},\ldots,\mathbf{r}^{(t)}\}$ is contained in the **Krylov subspace** of degree $t$ for $\mathbf{r}^{(0)}$, denoted by
        $$
        \begin{eqnarray*}
        {\cal K}(\mathbf{r}^{(0)}; t) = \text{span} \{\mathbf{r}^{(0)},\mathbf{A} \mathbf{r}^{(0)}, \mathbf{A}^2 \mathbf{r}^{(0)}, \ldots, \mathbf{A}^{t} \mathbf{r}^{(0)}\};
        \end{eqnarray*}
        $$
    0. $\{\mathbf{p}^{(0)},\ldots,\mathbf{p}^{(t)}\}$ is contained in ${\cal K}(\mathbf{r}^{(0)}; t)$;
    0. $\mathbf{p}^{(0)}, \ldots, \mathbf{p}^{(t)}$ are conjugate with respect to $\mathbf{A}$.

    Consequently, the iterates $\mathbf{x}^{(t)}$ converge to the solution in at most $n$ steps.

* **CG algorithm (economical version)**: saves one matrix-vector multiplication per iteration.

    0. Given $\mathbf{x}^{(0)}$
    0. Initialize: $\mathbf{r}^{(0)} \gets \mathbf{A} \mathbf{x}^{(0)} - \mathbf{b}$, $\mathbf{p}^{(0)} \gets - \mathbf{r}^{(0)}$, $t=0$
    0. While $\mathbf{r}^{(t)} \ne \mathbf{0}$
        1. $\alpha^{(t)} \gets \frac{\mathbf{r}^{(t)T} \mathbf{r}^{(t)}}{\mathbf{p}^{(t)T} \mathbf{A} \mathbf{p}^{(t)}}$
        2. $\mathbf{x}^{(t+1)} \gets \mathbf{x}^{(t)} + \alpha^{(t)} \mathbf{p}^{(t)}$
        3. $\mathbf{r}^{(t+1)} \gets \mathbf{r}^{(t)} + \alpha^{(t)} \mathbf{A} \mathbf{p}^{(t)}$
        4. $\beta^{(t+1)} \gets \frac{\mathbf{r}^{(t+1)T} \mathbf{r}^{(t+1)}}{\mathbf{r}^{(t)T} \mathbf{r}^{(t)}}$
        5. $\mathbf{p}^{(t+1)} \gets - \mathbf{r}^{(t+1)} + \beta^{(t+1)} \mathbf{p}^{(t)}$
        6. $t \gets t+1$

* The computational cost per iteration is **one** matrix-vector multiplication: $\mathbf{A} \mathbf{p}^{(t)}$. Consider the PageRank problem, where $\mathbf{A}$ has dimension $n \approx 10^{10}$ but is highly structured (sparse + low rank); each matrix-vector multiplication costs only $O(n)$.

* Theorem: If $\mathbf{A}$ has $r$ distinct eigenvalues, $\mathbf{x}^{(t)}$ converges to the solution $\mathbf{x}^*$ in at most $r$ steps.
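The economical CG recursion above is short enough to code directly. Below is a minimal sketch (for illustration only; the function name `cg_sketch` and the dense random test matrix are made up here — in practice one would use a structured $\mathbf{A}$ and a library routine such as `cg` in IterativeSolvers.jl).

```{julia}
using LinearAlgebra

# economical CG: one matrix-vector product A*p per iteration
function cg_sketch(A, b; x = zeros(size(A, 1)), tol = 1e-8, maxiter = size(A, 1))
    r = A * x - b
    p = -r
    rsq = dot(r, r)
    for _ in 1:maxiter
        sqrt(rsq) ≤ tol && break
        Ap = A * p                     # the only matrix-vector multiplication
        α  = rsq / dot(p, Ap)
        x += α * p
        r += α * Ap
        rsqnew = dot(r, r)
        p   = -r + (rsqnew / rsq) * p  # β = rsqnew / rsq
        rsq = rsqnew
    end
    return x
end

n = 200
B = randn(n, n)
A = B'B + n * I    # positive definite test matrix
b = randn(n)
x = cg_sketch(A, b)
norm(A * x - b)    # small residual
```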
## Pre-conditioned conjugate gradient (PCG)

* Summary of the conjugate gradient method for solving $\mathbf{A} \mathbf{x} = \mathbf{b}$, or equivalently minimizing $\frac 12 \mathbf{x}^T \mathbf{A} \mathbf{x} - \mathbf{b}^T \mathbf{x}$:
    * Each iteration needs one matrix-vector multiplication: $\mathbf{A} \mathbf{p}^{(t)}$. For structured $\mathbf{A}$, often $O(n)$ cost per iteration.
    * Guaranteed to converge in $n$ steps.

* Two important bounds for the conjugate gradient algorithm: Let $\lambda_1 \le \cdots \le \lambda_n$ be the ordered eigenvalues of a pd $\mathbf{A}$. Then
$$
\begin{eqnarray*}
    \|\mathbf{x}^{(t+1)} - \mathbf{x}^*\|_{\mathbf{A}}^2 &\le& \left( \frac{\lambda_{n-t} - \lambda_1}{\lambda_{n-t} + \lambda_1} \right)^2 \|\mathbf{x}^{(0)} - \mathbf{x}^*\|_{\mathbf{A}}^2 \\
    \|\mathbf{x}^{(t)} - \mathbf{x}^*\|_{\mathbf{A}} &\le& 2 \left( \frac{\sqrt{\kappa(\mathbf{A})}-1}{\sqrt{\kappa(\mathbf{A})}+1} \right)^{t} \|\mathbf{x}^{(0)} - \mathbf{x}^*\|_{\mathbf{A}},
\end{eqnarray*}
$$
where $\kappa(\mathbf{A}) = \lambda_n/\lambda_1$ is the condition number of $\mathbf{A}$.

* Messages:
    * Roughly speaking, if the eigenvalues of $\mathbf{A}$ occur in $r$ distinct clusters, the CG iterates will _approximately_ solve the problem after $O(r)$ steps.
    * CG on an $\mathbf{A}$ with a small condition number ($\lambda_1 \approx \lambda_n$) converges fast.

* **Pre-conditioning**: Change of variables $\widehat{\mathbf{x}} = \mathbf{C} \mathbf{x}$ via a nonsingular $\mathbf{C}$ and solve
$$
(\mathbf{C}^{-T} \mathbf{A} \mathbf{C}^{-1}) \widehat{\mathbf{x}} = \mathbf{C}^{-T} \mathbf{b}.
$$
Choose $\mathbf{C}$ such that
    * $\mathbf{C}^{-T} \mathbf{A} \mathbf{C}^{-1}$ has a small condition number, or
    * $\mathbf{C}^{-T} \mathbf{A} \mathbf{C}^{-1}$ has clustered eigenvalues, and
    * the system $\mathbf{C}^T \mathbf{C} \mathbf{y} = \mathbf{r}$ is inexpensive to solve.

* Preconditioned CG does not make use of $\mathbf{C}$ explicitly, but rather the matrix $\mathbf{M} = \mathbf{C}^T \mathbf{C}$.

* **Preconditioned CG (PCG)** algorithm:

    0. Given $\mathbf{x}^{(0)}$ and a pre-conditioner $\mathbf{M}$
    0. $\mathbf{r}^{(0)} \gets \mathbf{A} \mathbf{x}^{(0)} - \mathbf{b}$
    0. Solve $\mathbf{M} \mathbf{y}^{(0)} = \mathbf{r}^{(0)}$ for $\mathbf{y}^{(0)}$
    0. $\mathbf{p}^{(0)} \gets - \mathbf{y}^{(0)}$, $t=0$
    0. While $\mathbf{r}^{(t)} \ne \mathbf{0}$
        1. $\alpha^{(t)} \gets \frac{\mathbf{r}^{(t)T} \mathbf{y}^{(t)}}{\mathbf{p}^{(t)T} \mathbf{A} \mathbf{p}^{(t)}}$
        2. $\mathbf{x}^{(t+1)} \gets \mathbf{x}^{(t)} + \alpha^{(t)} \mathbf{p}^{(t)}$
        3. $\mathbf{r}^{(t+1)} \gets \mathbf{r}^{(t)} + \alpha^{(t)} \mathbf{A} \mathbf{p}^{(t)}$
        4. Solve $\mathbf{M} \mathbf{y}^{(t+1)} = \mathbf{r}^{(t+1)}$ for $\mathbf{y}^{(t+1)}$
        5. $\beta^{(t+1)} \gets \frac{\mathbf{r}^{(t+1)T} \mathbf{y}^{(t+1)}}{\mathbf{r}^{(t)T} \mathbf{y}^{(t)}}$
        6. $\mathbf{p}^{(t+1)} \gets - \mathbf{y}^{(t+1)} + \beta^{(t+1)} \mathbf{p}^{(t)}$
        7. $t \gets t+1$

    Remark: The only extra cost in the pre-conditioned CG algorithm is the need to solve the linear system $\mathbf{M} \mathbf{y} = \mathbf{r}$ in each iteration.

* Pre-conditioning is more of an art than a science. Some choices include
    * Incomplete Cholesky: $\mathbf{A} \approx \tilde{\mathbf{L}} \tilde{\mathbf{L}}^T$, where $\tilde{\mathbf{L}}$ is a sparse approximate Cholesky factor. Then $\tilde{\mathbf{L}}^{-1} \mathbf{A} \tilde{\mathbf{L}}^{-T} \approx \mathbf{I}$ (perfectly conditioned) and $\mathbf{M} \mathbf{y} = \tilde{\mathbf{L}} \tilde{\mathbf{L}}^T \mathbf{y} = \mathbf{r}$ is easy to solve.
    * Banded pre-conditioners.
    * Choosing $\mathbf{M}$ as a coarsened version of $\mathbf{A}$.
    * Subject knowledge. Knowledge about the structure and origin of a problem is often the key to devising an efficient pre-conditioner. For example, see the recent work of [Stein, Chen, and Anitescu (2012)](http://epubs.siam.org/doi/abs/10.1137/110834469) on pre-conditioning large covariance matrices.
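To see how the pre-conditioner enters, here is a minimal hand-rolled PCG sketch following the algorithm above (for illustration only; the function name `pcg_sketch`, the Jacobi/diagonal choice $\mathbf{M} = \mathrm{diag}(\mathbf{A})$, and the random test matrix are made up here — the example in the next section uses library preconditioners instead).

```{julia}
using LinearAlgebra

# PCG sketch: the pre-conditioner M enters only through solves M \ r
function pcg_sketch(A, b, M; x = zeros(size(A, 1)), tol = 1e-8, maxiter = size(A, 1))
    r = A * x - b
    y = M \ r
    p = -y
    for _ in 1:maxiter
        norm(r) ≤ tol && break
        Ap = A * p                     # one matrix-vector multiplication
        α  = dot(r, y) / dot(p, Ap)
        x += α * p
        rnew = r + α * Ap
        ynew = M \ rnew                # the only extra cost: one M-solve per iteration
        β  = dot(rnew, ynew) / dot(r, y)
        p  = -ynew + β * p
        r, y = rnew, ynew
    end
    return x
end

# quick check with a Jacobi (diagonal) pre-conditioner M = diag(A)
n = 200
B = randn(n, n)
A = B'B + n * I      # positive definite test matrix
b = randn(n)
x = pcg_sketch(A, b, Diagonal(A))
norm(A * x - b)      # small residual
```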
### Example of PCG

[Preconditioners.jl](https://github.com/mohamed82008/Preconditioners.jl) wraps a bunch of preconditioners. We use the Wathen matrix (sparse and positive definite) as a test matrix.

```{julia}
using BenchmarkTools, MatrixDepot, IterativeSolvers, LinearAlgebra, SparseArrays

# Wathen matrix of dimension 30401 x 30401
A = matrixdepot("wathen", 100)
```

```{julia}
using UnicodePlots
spy(A)
```

```{julia}
# sparsity level
count(!iszero, A) / length(A)
```

```{julia}
# rhs
b = ones(size(A, 1))
# solve Ax=b by CG
xcg = cg(A, b);
@benchmark cg($A, $b)
```

#### Diagonal preconditioner

Compute the diagonal preconditioner:

```{julia}
using Preconditioners

# Diagonal preconditioner
@time p = DiagonalPreconditioner(A)
dump(p)
```

```{julia}
# solve Ax=b by PCG
xpcg = cg(A, b, Pl = p)
# same answer?
norm(xcg - xpcg)
```

```{julia}
# PCG with diagonal preconditioner is >5 fold faster than CG
@benchmark cg($A, $b, Pl = $p)
```

#### Incomplete Cholesky preconditioner

Compute the incomplete Cholesky preconditioner:

```{julia}
@time p = CholeskyPreconditioner(A, 2)
dump(p)
```

Pre-conditioned conjugate gradient:

```{julia}
# solve Ax=b by PCG
xpcg = cg(A, b, Pl = p)
# same answer?
norm(xcg - xpcg)
```

```{julia}
# PCG with incomplete Cholesky is >5 fold faster than CG
@benchmark cg($A, $b, Pl = $p)
```

#### AMG preconditioner

Let's try the AMG preconditioner.

```{julia}
using AlgebraicMultigrid

# construct a Ruge-Stuben solver
@time ml = AMGPreconditioner{RugeStuben}(A)
```

```{julia}
# use the AMG preconditioner in CG
xamg = cg(A, b, Pl = ml)
# same answer?
norm(xcg - xamg)
```

```{julia}
@benchmark cg($A, $b, Pl = $ml)
```

## Other Krylov subspace methods

* We learned about CG/PCG, which is for solving $\mathbf{A} \mathbf{x} = \mathbf{b}$ with $\mathbf{A}$ pd.

* **MINRES (minimum residual method)**: symmetric indefinite $\mathbf{A}$.

* **Bi-CG (bi-conjugate gradient)**: unsymmetric $\mathbf{A}$.

* **Bi-CGSTAB (Bi-CG stabilized)**: improved version of Bi-CG.

* **GMRES (generalized minimum residual method)**: current _de facto_ method for unsymmetric $\mathbf{A}$. E.g., the PageRank problem.

* **Lanczos method**: top eigen-pairs of a large symmetric matrix.

* **Arnoldi method**: top eigen-pairs of a large unsymmetric matrix.

* **Lanczos bidiagonalization** algorithm: top singular triplets of a large matrix.

* **LSQR**: (damped) least squares problem $\min \|\mathbf{y} - \mathbf{X} \beta\|_2^2 + \lambda^2 \|\beta\|_2^2$. Algebraically equivalent to applying CG to the normal equation $(\mathbf{X}^T \mathbf{X} + \lambda^2 I) \beta = \mathbf{X}^T \mathbf{y}$.

* **LSMR**: (damped) least squares problem $\min \|\mathbf{y} - \mathbf{X} \beta\|_2^2 + \lambda^2 \|\beta\|_2^2$. Algebraically equivalent to applying MINRES to the normal equation $(\mathbf{X}^T \mathbf{X} + \lambda^2 I) \beta = \mathbf{X}^T \mathbf{y}$.
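As a quick illustration of two of these solvers, the sketch below (not part of the original examples; the synthetic test matrices are made up here) calls `gmres` and `minres` from IterativeSolvers.jl on an unsymmetric system and on a symmetric indefinite saddle-point-type system.

```{julia}
using IterativeSolvers, LinearAlgebra, Random, SparseArrays

Random.seed!(257)
n = 1000

# GMRES for an unsymmetric (but well-conditioned) sparse system
A = I + sprandn(n, n, 5 / n) / 10
b = randn(n)
x = gmres(A, b)
@show norm(A * x - b)   # small residual

# MINRES for a symmetric indefinite saddle-point-type system
m, k = 600, 400
B = sprandn(m, k, 0.01)
# eigenvalues of S are ±1 and ±√(1 + σᵢ²): indefinite yet well conditioned
S = [sparse(1.0I, m, m) B; B' sparse(-1.0I, k, k)]
c = randn(m + k)
z = minres(S, c)
@show norm(S * z - c);  # small residual
```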
## Software

### Matlab

* Iterative methods for solving linear equations: `pcg`, `bicg`, `bicgstab`, `gmres`, ...

* Iterative methods for top eigen-pairs and singular pairs: `eigs`, `svds`, ...

* Pre-conditioners: `cholinc`, `luinc`, ...

* Get familiar with the **reverse communication interface (RCI)** for utilizing iterative solvers:

```matlab
x = gmres(A, b)
x = gmres(@Afun, b)
eigs(A)
eigs(@Afun)
```

### Julia

* `eigs` and `svds` in the [Arpack.jl](https://github.com/JuliaLinearAlgebra/Arpack.jl) package. [Numerical examples](http://hua-zhou.github.io/teaching/biostatm280-2019spring/slides/17-eigsvd/eigsvd.html#Lanczos/Arnoldi-iterative-method-for-top-eigen-pairs) later.

* The [`IterativeSolvers.jl`](https://github.com/JuliaMath/IterativeSolvers.jl) package. [CG numerical examples](http://hua-zhou.github.io/teaching/biostatm280-2019spring/slides/15-iterative/iterative.html#Numerical-examples).

* See the [list](https://jutho.github.io/KrylovKit.jl/stable/#Package-features-and-alternatives-1) of Julia packages for iterative methods.

#### Least squares example

```{julia}
using BenchmarkTools, IterativeSolvers, LinearAlgebra, Random, SparseArrays

Random.seed!(257) # seed
n, p = 10000, 5000
X = sprandn(n, p, 0.001) # iid standard normals with sparsity 0.001
β = ones(p)
y = X * β + randn(n)

# least squares by QR
β̂_qr = X \ y
@benchmark $X \ $y
```

```{julia}
# least squares by lsqr
β̂_lsqr = lsqr(X, y)
@show norm(β̂_qr - β̂_lsqr)
@benchmark lsqr($X, $y)
```

```{julia}
# least squares by lsmr
β̂_lsmr = lsmr(X, y)
@show norm(β̂_qr - β̂_lsmr)
@benchmark lsmr($X, $y)
```

#### Use LinearMaps in iterative solvers

In many applications, it is advantageous to define linear maps instead of forming the actual (sparse) matrix. For a linear map, we only need to specify how it acts on a vector (right-multiplication) and how its transpose acts (left-multiplication). The [`LinearMaps.jl`](https://github.com/Jutho/LinearMaps.jl) package is exactly for this purpose and interfaces nicely with `IterativeSolvers.jl`, `Arpack.jl`, and other iterative solver packages.

Applications:

1. The matrix is not sparse but admits special structure, e.g., easy + low rank (PageRank), Kronecker products, etc.
2. Less memory usage.
3. Linear algebra on a standardized (centered and scaled) sparse matrix.

Consider the differencing operator that takes differences between neighboring pixels, with periodic boundary conditions,
$$
\mathbf{D} = \begin{pmatrix}
1 & & & & -1 \\
-1 & 1 & & & \\
& -1 & 1 & & \\
& & \ddots & \ddots & \\
& & & -1 & 1
\end{pmatrix},
$$
so that $(\mathbf{D} \mathbf{x})_i = x_i - x_{i-1}$.

```{julia}
using LinearMaps, IterativeSolvers

# Overwrite y with A * x
# left difference assuming periodic boundary conditions
function leftdiff!(y::AbstractVector, x::AbstractVector)
    N = length(x)
    length(y) == N || throw(DimensionMismatch())
    @inbounds for i in 1:N
        y[i] = x[i] - x[mod1(i - 1, N)]
    end
    return y
end

# Overwrite y with A' * x
# minus right difference
function mrightdiff!(y::AbstractVector, x::AbstractVector)
    N = length(x)
    length(y) == N || throw(DimensionMismatch())
    @inbounds for i in 1:N
        y[i] = x[i] - x[mod1(i + 1, N)]
    end
    return y
end

# define the linear map
D = LinearMap{Float64}(leftdiff!, mrightdiff!, 100; ismutating = true)
```

Linear maps can be used like a regular matrix.

```{julia}
@show size(D)
v = ones(size(D, 2)) # vector of all 1s
@show D * v
@show D' * v;
```

If we form the corresponding dense matrix, it will look like

```{julia}
Matrix(D)
```

If we form the corresponding sparse matrix, it will look like

```{julia}
using SparseArrays
sparse(D)
```

```{julia}
using UnicodePlots
spy(sparse(D))
```

Compute the top singular values using an iterative method (Lanczos/Arnoldi):

```{julia}
using Arpack
Arpack.svds(D, nsv = 3)
```

```{julia}
using LinearAlgebra
# check solution against the direct method for SVD
svdvals(Matrix(D))
```

Compute the top eigenvalues of the Gram matrix `D'D` using an iterative method (Lanczos/Arnoldi):

```{julia}
Arpack.eigs(D'D, nev = 3, which = :LM)
```

## Further reading

* Chapter 5 of [Numerical Optimization](https://ucla.worldcat.org/title/numerical-optimization/oclc/209918411&referer=brief_results) by Jorge Nocedal and Stephen Wright (1999).

* Sections 11.3-11.5 of [Matrix Computations](https://ucla.worldcat.org/title/matrix-computations/oclc/824733531&referer=brief_results) by Gene Golub and Charles Van Loan (2013).