---
title: Eigen-decomposition and SVD
subtitle: Biostat/Biomath M257
author: Dr. Hua Zhou @ UCLA
date: today
format:
  html:
    theme: cosmo
    embed-resources: true
    number-sections: true
    toc: true
    toc-depth: 4
    toc-location: left
    code-fold: false
jupyter:
  jupytext:
    formats: 'ipynb,qmd'
    text_representation:
      extension: .qmd
      format_name: quarto
      format_version: '1.0'
      jupytext_version: 1.14.5
  kernelspec:
    display_name: Julia 1.9.0
    language: julia
    name: julia-1.9
---

System information (for reproducibility):

```{julia}
versioninfo()
```

Load packages:

```{julia}
using Pkg

Pkg.activate(pwd())
Pkg.instantiate()
Pkg.status()
```

# Introduction

Our last topic on numerical linear algebra is eigen-decomposition and singular value decomposition (SVD). We already saw the wide applications of QR decomposition in least squares problem and solving square and under-determined linear equations. Eigen-decomposition and SVD can be deemed as more thorough orthogonalization of a matrix. We start with a brief review of the related linear algebra.

## Linear algebra review: eigen-decomposition

For a quick review of eigen-decomposition, see [Biostat 216 slides](https://ucla-biostat-216.github.io/2022fall/slides/10-eig/10-eig.html).

* **Eigenvalues** are defined as roots of the characteristic equation $\det(\lambda \mathbf{I}_n - \mathbf{A})=0$.

* If $\lambda$ is an eigenvalue of $\mathbf{A}$, then there exist non-zero $\mathbf{x}, \mathbf{y} \in \mathbb{R}^n$ such that $\mathbf{A} \mathbf{x} = \lambda \mathbf{x}$ and $\mathbf{y}^T \mathbf{A} = \lambda \mathbf{y}^T$. $\mathbf{x}$ and $\mathbf{y}$ are called the (column) **eigenvector** and **row eigenvector** of $\mathbf{A}$ associated with the eigenvalue $\lambda$.

* $\mathbf{A}$ is singular if and only if it has at least one 0 eigenvalue.

* Eigenvectors associated with distinct eigenvalues are linearly independent.

* Eigenvalues of an upper or lower triangular matrix are its diagonal entries: $\lambda_i = a_{ii}$.

* Eigenvalues of an idempotent matrix are either 0 or 1.

* Eigenvalues of an orthogonal matrix have complex modulus 1.

* In most statistical applications, we deal with eigenvalues/eigenvectors of symmetric matrices. 
The eigenvalues and eigenvectors of a real **symmetric** matrix are real.

* Eigenvectors associated with distinct eigenvalues of a symmetry matrix are orthogonal.

* **Eigen-decompostion of a symmetric matrix**: $\mathbf{A} = \mathbf{U} \Lambda \mathbf{U}^T$, where
    * $\Lambda = \text{diag}(\lambda_1,\ldots,\lambda_n)$
    * columns of $\mathbf{U}$ are the eigenvectors, which are (or can be chosen to be) mutually orthonormal. Thus $\mathbf{U}$ is an orthogonal matrix.

* A real symmetric matrix is positive semidefinite (positive definite) if and only if all eigenvalues are nonnegative (positive).

* **Spectral radius** $\rho(\mathbf{A}) = \max_i |\lambda_i|$.

* $\mathbf{A} \in \mathbb{R}^{n \times n}$ a square matrix (not required to be symmetric), then $\text{tr}(\mathbf{A}) = \sum_i \lambda_i$ and $\det(\mathbf{A}) = \prod_i \lambda_i$.

## Linear algebra review: singular value decomposition (SVD)

For a quick review of SVD, see [Biostat 216 slides](https://ucla-biostat-216.github.io/2022fall/slides/12-svd/12-svd.html).

* **Singular value decomposition (SVD)**: For a rectangular matrix $\mathbf{A} \in \mathbb{R}^{m \times n}$, let $p = \min\{m,n\}$, then we have the SVD
$$
\mathbf{A} = \mathbf{U} \Sigma \mathbf{V}^T,
$$
where
    * $\mathbf{U} = (\mathbf{u}_1,\ldots,\mathbf{u}_m) \in \mathbb{R}^{m \times m}$ is orthogonal
    * $\mathbf{V} = (\mathbf{v}_1,\ldots,\mathbf{v}_n) \in \mathbb{R}^{n \times n}$ is orthogonal
    * $\Sigma = \text{diag}(\sigma_1, \ldots, \sigma_p)  \in \mathbb{R}^{m \times n}$, $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_p \ge 0$.  
$\sigma_i$ are called the **singular values**, $\mathbf{u}_i$ are the **left singular vectors**, and $\mathbf{v}_i$ are the **right singular vectors**.

* **Thin/Skinny SVD**. Assume $m \ge n$. $\mathbf{A}$ can be factored as 
$$
\mathbf{A} = \mathbf{U}_n \Sigma_n \mathbf{V}^T = \sum_{i=1}^n \sigma_i \mathbf{u}_i \mathbf{v}_i^T,
$$ 
where 
    * $\mathbf{U}_n \in \mathbb{R}^{m \times n}$, $\mathbf{U}_n^T \mathbf{U}_n = \mathbf{I}_n$
    * $\mathbf{V} \in \mathbb{R}^{n \times n}$, $\mathbf{V}^T \mathbf{V} = \mathbf{I}_n$
    * $\Sigma_n = \text{diag}(\sigma_1,\ldots,\sigma_n) \in \mathbb{R}^{n \times n}$, $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_n \ge 0$
    
* Denote $\sigma(\mathbf{A})=(\sigma_1,\ldots,\sigma_p)^T$. Then 
    * $r = \text{rank}(\mathbf{A}) = \# \text{ nonzero singular values} = \|\sigma(\mathbf{A})\|_0$  
    * $\mathbf{A} = \mathbf{U}_r \Sigma_r \mathbf{V}_r^T = \sum_{i=1}^r \sigma_i \mathbf{u}_i \mathbf{v}_i^T$
    * $\|\mathbf{A}\|_{\text{F}} = (\sum_{i=1}^p \sigma_i^2)^{1/2} = \|\sigma(\mathbf{A})\|_2$
    * $\|\mathbf{A}\|_2 = \sigma_1 = \|\sigma(\mathbf{A})\|_\infty$

* Assume $\text{rank}(\mathbf{A}) = r$ and partition 
$$
\begin{eqnarray*}
\mathbf{U} &=& (\mathbf{U}_r, \tilde{\mathbf{U}}_r) \in \mathbb{R}^{m \times m} \\
\mathbf{V} &=& (\mathbf{V}_r, \tilde{\mathbf{V}}_r) \in \mathbb{R}^{n \times n}.
\end{eqnarray*}
$$
Then
    * ${\cal C}(\mathbf{A}) = {\cal C}(\mathbf{U}_r)$, ${\cal N}(\mathbf{A}^T) = {\cal C}(\tilde{\mathbf{U}}_r)$
    * ${\cal N}(\mathbf{A}) = {\cal C}(\tilde{\mathbf{V}}_r)$, ${\cal C}(\mathbf{A}^T) = {\cal C}(\mathbf{V}_r)$
    * $\mathbf{U}_r \mathbf{U}_r^T$ is the orthogonal projection onto ${\cal C}(\mathbf{A})$
    * $\tilde{\mathbf{U}}_r \tilde{\mathbf{U}}_r^T$ is the orthogonal projection onto ${\cal N}(\mathbf{A}^T)$
    * $\mathbf{V}_r \mathbf{V}_r^T$ is the orthogonal projection onto ${\cal C}(\mathbf{A}^T)$
    * $\tilde{\mathbf{V}}_r \tilde{\mathbf{V}}_r^T$ is the orthogonal projection onto ${\cal N}(\mathbf{A})$

* Relation to eigen-decomposition. Using thin SVD,
$$
\begin{eqnarray*}
	\mathbf{A}^T \mathbf{A} &=& \mathbf{V} \Sigma \mathbf{U}^T \mathbf{U} \Sigma \mathbf{V}^T = \mathbf{V} \Sigma^2 \mathbf{V}^T \\
	\mathbf{A} \mathbf{A}^T &=& \mathbf{U} \Sigma \mathbf{V}^T \mathbf{V} \Sigma \mathbf{U}^T = \mathbf{U} \Sigma^2 \mathbf{U}^T.
\end{eqnarray*}
$$
In principle we can obtain singular triplets of $\mathbf{A}$ by doing two eigen-decompositions.

* Another relation to eigen-decomposition. Using thin SVD,
$$
\begin{eqnarray*}
	\begin{pmatrix} \mathbf{0}_{n \times n} & \mathbf{A}^T \\ \mathbf{A} & \mathbf{0}_{m \times m} \end{pmatrix} = \frac{1}{\sqrt 2} \begin{pmatrix} \mathbf{V} & \mathbf{V} \\ \mathbf{U} & -\mathbf{U} \end{pmatrix} \begin{pmatrix} \Sigma & \mathbf{0}_{n \times n} \\ \mathbf{0}_{n \times n} & - \Sigma \end{pmatrix} \frac{1}{\sqrt 2} \begin{pmatrix} \mathbf{V}^T & \mathbf{U}^T \\ \mathbf{V}^T & - \mathbf{U}^T \end{pmatrix}.
\end{eqnarray*}
$$
Hence any symmetric eigen-solver can produce the SVD of a matrix $\mathbf{A}$ without forming $\mathbf{A} \mathbf{A}^T$ or $\mathbf{A}^T \mathbf{A}$.

* Yet another relation to eigen-decomposition: If the eigen-decomposition of a real symmetric matrix is $\mathbf{A} = \mathbf{W} \Lambda \mathbf{W}^T = \mathbf{W} \text{diag}(\lambda_1, \ldots, \lambda_n) \mathbf{W}^T$, then
$$
\begin{eqnarray*}
	\mathbf{A} = \mathbf{W} \Lambda \mathbf{W}^T = \mathbf{W} \begin{pmatrix} 
	|\lambda_1| & & \\
	& \ddots & \\
	& & |\lambda_n|
	\end{pmatrix} \begin{pmatrix} 
	\text{sgn}(\lambda_1) & & \\
	& \ddots & \\
	& & \text{sgn}(\lambda_n)
	\end{pmatrix} \mathbf{W}^T
\end{eqnarray*}
$$
is the SVD of $\mathbf{A}$.

## Applications of eigen-decomposition and SVD


### Principal components analysis (PCA). 

$\mathbf{X} \in \mathbb{R}^{n \times p}$ is a centered data matrix. Perform SVD $\mathbf{X} = \mathbf{U} \Sigma \mathbf{V}^T$ or equivalently eigendecomposition $\mathbf{X}^T \mathbf{X} = \mathbf{V} \Sigma^2 \mathbf{V}^T$. The linear combinations $\tilde{\mathbf{x}}_i = \mathbf{X} \mathbf{v}_i$ are the **principal components** (PC) and have variance $\sigma_i^2$.

* Dimension reduction: reduce dimensionality $p$ to $q \ll p$. Use top PCs $\tilde{\mathbf{x}}_1, \ldots, \tilde{\mathbf{x}}_q$ in visualization and downstream analysis.

<img src="https://media.springernature.com/full/springer-static/image/art%3A10.1038%2Fnature07331/MediaObjects/41586_2008_Article_BFnature07331_Fig1_HTML.jpg" width="400" align="center"/>

Above picture is from the article [Genes mirror geography within Europe](http://www.nature.com/nature/journal/v456/n7218/full/nature07331.html) by Novembre et al (2008) published in _Nature_.  

* Use PCs to adjust for confounding - a serious issue in association studies in large data sets.
    * Use of PCA to adjust for confounding in modern genetic studies is proposed in the paper [Principal components analysis corrects for stratification in genome-wide association studies](http://www.nature.com/ng/journal/v38/n8/full/ng1847.html) by Price et al (2006) published in _Nature Genetics_. It has been cited 6,937 times as of May 3, 2019.
    
### Low rank approximation

For example, image/data compression. Find a low rank approximation of data matrix $\mathbf{x}$.  
**Eckart-Young theorem**: 
$$
\min_{\text{rank}(\mathbf{Y})=r} \|\mathbf{X} - \mathbf{Y} \|_{\text{F}}^2
$$
is achieved by $\mathbf{Y} = \sum_{i=1}^r \sigma_i \mathbf{u}_i \mathbf{v}_i^T$ with optimal value $\sum_{i=r}^{p} \sigma_i^2$, where $(\sigma_i, \mathbf{u}_i, \mathbf{v}_i)$ are singular values and vectors of $\mathbf{X}$.

* Gene Golub's $897 \times 598$ picture requires $3 \times 897 \times 598 \times 8 = 12,873,744$ bytes (3 RGB channels).  
* Rank 50 approximation requires $3 \times 50 \times (897 + 598) \times 8 = 1,794,000$ bytes. 
* Rank 12 approximation requires $12 \times (2691+598) \times 8 = 430,560$ bytes.

<img src="https://www.mathworks.com/content/mathworks/www/en/company/newsletters/articles/professor-svd/_jcr_content/mainParsys/columns/3/image_7.img.jpg/1490211663024.jpg" width="150" align="center"/>

<img src="https://www.mathworks.com/content/mathworks/www/en/company/newsletters/articles/professor-svd/_jcr_content/mainParsys/columns/2/image_6.img.jpg/1490211662986.jpg" width="150" align="center"/>

<img src="https://www.mathworks.com/content/mathworks/www/en/company/newsletters/articles/professor-svd/_jcr_content/mainParsys/columns/1/image_5.img.jpg/1490211662955.jpg" width="150" align="center"/>


### Moore-Penrose (MP) inverse

Using thin SVD, 
$$
\mathbf{A}^+ = \mathbf{V} \Sigma^+ \mathbf{U}^T,
$$
where $\Sigma^+ = \text{diag}(\sigma_1^{-1}, \ldots, \sigma_r^{-1}, 0, \ldots, 0)$, $r= \text{rank}(\mathbf{A})$. This is how the [`pinv`](https://docs.julialang.org/en/v1/stdlib/LinearAlgebra/#LinearAlgebra.pinv) function is implemented in Julia.

```{julia}
using Random, LinearAlgebra

Random.seed!(123)
X = randn(5, 3)
pinv(X)
```

```{julia}
# calculation of Moore-Penrose inverse by SVD
@which pinv(X)
```

### Least squares

Given thin SVD $\mathbf{X} = \mathbf{U} \Sigma \mathbf{V}^T$, 
$$
\begin{eqnarray*}
	\widehat \beta &=& (\mathbf{X}^T \mathbf{X})^- \mathbf{X}^T \mathbf{y} \\
	&=& (\mathbf{V} \Sigma^2 \mathbf{V}^T)^+ \mathbf{V} \Sigma \mathbf{U}^T \mathbf{y} \\
	&=& \mathbf{V} (\Sigma^{2})^+ \mathbf{V}^T \mathbf{V} \Sigma \mathbf{U}^T \mathbf{y} \\
	&=& \mathbf{V}_r \Sigma_r^{-1} \mathbf{U}_r^T \mathbf{y} \\
	&=& \sum_{i=1}^r \left( \frac{\mathbf{u}_i^T \mathbf{y}}{\sigma_i} \right) \mathbf{v}_i
\end{eqnarray*}
$$
and
$$
\begin{eqnarray*}
	\widehat{\mathbf{y}} &=& \mathbf{X} \widehat \beta = \mathbf{U}_r \mathbf{U}_r^T \mathbf{y}.
\end{eqnarray*}
$$
In general, SVD is more expensive than other approaches (Cholesky, Sweep, QR) we learnt. In some applications, SVD is computed for other purposes then we get least squares solution for free.

### Ridge regression

* In ridge regression, we minimize
$$
	\|\mathbf{y} - \mathbf{X} \beta\|_2^2 + \lambda \|\beta\|_2^2,
$$
where $\lambda$ is a tuning parameter.

* Ridge regression by augmented linear regression. Ridge regression problem is equivalent to
$$
	\left\| \begin{pmatrix} \mathbf{y} \\ \mathbf{0}_p \end{pmatrix} - \begin{pmatrix}
	\mathbf{X} \\ \sqrt \lambda \mathbf{I}_p
	\end{pmatrix} \beta \right\|_2^2.
$$
Therefore any methods for linear regression can be applied.

* Ridge regression by method of normal equation. The normal equation for the ridge problem is
$$
	(\mathbf{X}^T \mathbf{X} + \lambda \mathbf{I}_p) \beta = \mathbf{X}^T \mathbf{y}.
$$
Therefore Cholesky or sweep operator can be used.

* Ridge regression by SVD. If we obtain the (thin) SVD of $\mathbf{X}$
$$
	\mathbf{X} = \mathbf{U} \Sigma_{p \times p} \mathbf{V}^T.
$$
Then the normal equation reads
$$
	(\Sigma^2 + \lambda \mathbf{I}_p) \mathbf{V}^T \beta = \Sigma \mathbf{U}^T \mathbf{y}
$$
and we get
$$
	\widehat \beta (\lambda) = \sum_{i=1}^p \frac{\sigma_i \mathbf{u}_i^T \mathbf{y}}{\sigma_i^2 + \lambda} \mathbf{v}_i = \sum_{i=1}^r \frac{\sigma_i \mathbf{u}_i^T \mathbf{y}}{\sigma_i^2 + \lambda} \mathbf{v}_i, \quad r = \text{rank}(\mathbf{X}).
$$
It is clear that 
$$
\begin{eqnarray*}
	\lim_{\lambda \to 0} \widehat \beta (\lambda) = \widehat \beta_{\text{OLS}}
\end{eqnarray*}
$$
and $\|\widehat \beta (\lambda)\|_2$ is monotone decreasing as $\lambda$ increases.

* Only one SVD is needed for all $\lambda$ (!), in contrast to the method of augmented linear regression, Cholesky, or sweep.

### Other applications

See Chapters 8-9 of [Numerical Analysis for Statisticians](http://ucla.worldcat.org/title/numerical-analysis-for-statisticians/oclc/793808354&referer=brief_results) by Kenneth Lange (2010) for more applications of eigen-decomposition and SVD.

## Algorithms for eigen-decomposition

### One eigen-pair: power method

* **Power method** iterates according to
$$
\begin{eqnarray*}
	\mathbf{x}^{(t)} &\gets& \frac{1}{\|\mathbf{A} \mathbf{x}^{(t-1)}\|_2} \mathbf{A} \mathbf{x}^{(t-1)}
\end{eqnarray*}
$$
from an initial guess $\mathbf{x}^{(0)}$ of unit norm.

* Suppose we arrange $|\lambda_1| > |\lambda_2| \ge \cdots \ge |\lambda_n|$ (the first inequality strict) with corresponding eigenvectors $\mathbf{u}_i$, and expand $\mathbf{x}^{(0)} = c_1 \mathbf{u}_1 + \cdots + c_n \mathbf{u}_n$, then
$$
\begin{eqnarray*}
	\mathbf{x}^{(t)} &=& \frac{\left( \sum_i \lambda_i^t \mathbf{u}_i \mathbf{u}_i^T \right) \left( \sum_i c_i \mathbf{u}_i \right)}{\|\left( \sum_i \lambda_i^t \mathbf{u}_i \mathbf{u}_i^T \right) \left( \sum_i c_i \mathbf{u}_i \right)\|_2} \\
	&=& \frac{\sum_i c_i \lambda_i^t \mathbf{u}_i}{\|\sum_i c_i \lambda_i^t \mathbf{u}_i\|_2}	\\
	&=& \frac{c_1 \mathbf{u}_1 + c_2 (\lambda_2/\lambda_1)^t \mathbf{u}_2 + \cdots + c_n (\lambda_n/\lambda_1)^t \mathbf{u}_n}{\|c_1 \mathbf{u}_1 + c_2 (\lambda_2/\lambda_1)^t \mathbf{u}_2 + \cdots + c_n (\lambda_n/\lambda_1)^t \mathbf{u}_n\|_2} \left( \frac{\lambda_1}{|\lambda_1|} \right)^t.
\end{eqnarray*}
$$
Thus $\mathbf{x}^{(t)} - \frac{c_1 \mathbf{u}_1}{\|c_1 \mathbf{u}_1\|_2} \left( \frac{\lambda_1}{|\lambda_1|} \right)^t \to 0$ as $t \to \infty$. The convergence rate is $|\lambda_2|/|\lambda_1|$.

* $\lambda_1^{(t)} = \mathbf{x}^{(t)T} \mathbf{A} \mathbf{x}^{(t)}$ converges to $\lambda_1$.

* **Inverse power method** for finding the eigenvalue of smallest absolute value: Substitute $\mathbf{A}$ by $\mathbf{A}^{-1}$ in the power method. (E.g., pre-compute LU or Cholesky of $\mathbf{A}$).

* **Shifted inverse power**: Substitute $(\mathbf{A} - \mu \mathbf{I})^{-1}$ in the power method. It converges to an eigenvalue close to the given $\mu$.

* **Rayleigh quotient iteration**: Substitute $(\mathbf{A} - \mu^{(t-1)} \mathbf{I})^{-1}$, where $\mu^{(t-1)} = \mathbf{x}^{(t-1)T} \mathbf{A} \mathbf{x}^{(t-1)}$ in the shifted inverse method. Faster convergence.

* Example: PageRank problem seeks top left eigenvector of transition matrix $\mathbf{P}$ and costs $O(n)$ per iteration.

### Top $r$ eigen-pairs: orthogonal iteration

Generalization of power method to higher dimensional invariant subspace.

* **Orthogonal iteration**: Initialize $\mathbf{Q}^{(0)} \in \mathbb{R}^{n \times r}$ with orthonormal columns. For $t=1,2,\ldots$, 
$$
\begin{eqnarray*}
	\mathbf{Z}^{(t)} &\gets& \mathbf{A} \mathbf{Q}^{(t-1)} \quad \text{($2n^2r$ flops)} \\
	\mathbf{Q}^{(t)} \mathbf{R}^{(t)} &\gets& \mathbf{Z}^{(t)} \quad \text{(QR factorization)}%, $nr^2 - r^3/3$ flops)}
\end{eqnarray*}
$$

* $\mathbf{Z}^{(t)}$ converges to the eigenspace of the largest $r$ eigenvalues if they are real and separated from remaining spectrum. The convergence rate is $|\lambda_{r+1}|/|\lambda_r|$.

### (Impractical) full eigen-decomposition: QR iteration

Assume $\mathbf{A} \in \mathbb{R}^{n \times n}$ symmetric.

*  Take $r=n$ in the orthogonal iteration. Then $\mathbf{Q}^{(t)}$ converges to the eigenspace $\mathbf{U}$ of $\mathbf{A}$. This implies that
$$
	\mathbf{T}^{(t)} := \mathbf{Q}^{(t)T} \mathbf{A} \mathbf{Q}^{(t)}
$$
converges to a diagonal form $\Lambda = \text{diag}(\lambda_1, \ldots, \lambda_n)$.

* Note how to compute $\mathbf{T}^{(t)}$ from $\mathbf{T}^{(t-1)}$
$$
\begin{eqnarray*}
	\mathbf{T}^{(t-1)} &=& \mathbf{Q}^{(t-1)T} \mathbf{A} \mathbf{Q}^{(t-1)} = \mathbf{Q}^{(t-1)T} (\mathbf{A} \mathbf{Q}^{(t-1)}) = (\mathbf{Q}^{(t-1)T} \mathbf{Q}^{(t)}) \mathbf{R}^{(t)}	\\\mathbf{A}
	\mathbf{T}^{(t)} &=& \mathbf{Q}^{(t)T} \mathbf{A} \mathbf{Q}^{(t)} = \mathbf{Q}^{(t)T} \mathbf{A} \mathbf{Q}^{(t-1)} \mathbf{Q}^{(t-1)T} \mathbf{Q}^{(t)} = \mathbf{R}^{(t)} ( \mathbf{Q}^{(t-1)T} \mathbf{Q}^{(t)}).
\end{eqnarray*}
$$

* **QR iteration**: Initialize $\mathbf{U}^{(0)} \in \mathbb{R}^{n \times n}$ orthogonal and set $\mathbf{T}^{(0)} = \mathbf{U}^{(0)T} \mathbf{A} \mathbf{U}^{(0)}$. \\
For $t=1,2,\ldots$
$$
\begin{eqnarray*}
	\mathbf{U}^{(t)} \mathbf{R}^{(t)} &\gets& \mathbf{T}^{(t-1)} \quad \text{(QR factorization)} \\
	\mathbf{T}^{(t)} &\gets& \mathbf{R}^{(t)} \mathbf{U}^{(t)}
\end{eqnarray*}
$$

* QR iteration is expensive in general: $O(n^3)$ per iteration and linear convergence rate.

### QR algorithm for symmetric eigen-decomposition

Assume $\mathbf{A} \in \mathbb{R}^{n \times n}$ symmetric.

* Reading: [The QR algorithm](http://hua-zhou.github.io/teaching/biostatm280-2018spring/readings/qr.pdf) by Beresford N. Parlett.

* This is the algorithm implemented in LAPACK: used by Julia, Matlab, R.

<img src="./householder_tridiagonalization.png" width="300" align="center"/>

* Idea: Tri-diagonalization (by Householder) + QR iteration on the tri-diagonal system with implicit shift.
   
   0. Step 1: Householder tri-diagonalization: $4n^3/3$ for eigenvalues only, $8n^3/3$ for both eigenvalues and eigenvectors. (Why can't we apply Householder to make it diagonal directly?)
    
   0. Step 2: QR iteration on the tridiagonal matrix. Implicit shift accelerates convergence rate. On average 1.3-1.6 QR iteration per eigenvalue, $\sim 20n$ flops per QR iteration. So total operation count is about $30n^2$. Eigenvectors need an extra of about $6n^3$ flops.

| Stage                  | Eigenvalue   | Eigenvector |
|------------------------|--------------|-------------|
| Householder reduction  | $4n^3/3$     | $4n^3/3$    |
| QR with implicit shift | $\sim 30n^2$ | $\sim 6n^3$ |

* Message: **Don't request eigenvectors unless necessary**. Use [`eigvals`](https://docs.julialang.org/en/v1/stdlib/LinearAlgebra/#LinearAlgebra.eigvals) in Julia to request only eigenvalues.

* The **unsymmetric QR algorithm** obtains the real Schur decomposition of an asymmetric matrix $\mathbf{A}$.

### Example

Julia functions: [`eigen`](https://docs.julialang.org/en/v1/stdlib/LinearAlgebra/#LinearAlgebra.eigen), [`eigen!`](https://docs.julialang.org/en/v1/stdlib/LinearAlgebra/#LinearAlgebra.eigen!), [`eigvals`](https://docs.julialang.org/en/v1/stdlib/LinearAlgebra/#LinearAlgebra.eigvals!), [`eigvecs`](https://docs.julialang.org/en/v1/stdlib/LinearAlgebra/#LinearAlgebra.eigvecs), [`eigmax`](https://docs.julialang.org/en/v1/stdlib/LinearAlgebra/#LinearAlgebra.eigmax), [`eigmin`](https://docs.julialang.org/en/v1/stdlib/LinearAlgebra/#LinearAlgebra.eigmin).

```{julia}
Random.seed!(123)
A = Symmetric(randn(5, 5), :U)
Aeig = eigen(A)
```

```{julia}
# eigen-values
Aeig.values
```

```{julia}
# eigen-vectors
Aeig.vectors
```

```{julia}
# inversion by eigen-decomposition
inv(Aeig)
```

```{julia}
@which inv(Aeig)
```

```{julia}
# determinant by eigen-decomposition
det(Aeig)
```

```{julia}
@which det(Aeig)
```

```{julia}
@which eigvals(A)
```

```{julia}
@which eigmax(A)
```

```{julia}
@which eigmin(A)
```

Don't request eigenvectors unless needed.

```{julia}
using BenchmarkTools, Random, LinearAlgebra

Random.seed!(123)
n = 1000
A = Symmetric(randn(n, n), :U)
```

```{julia}
# requesting eigenvalues only is cheaper
@benchmark eigvals($A)
```

```{julia}
# requesting eigenvectors requires extra work
@benchmark eigen($A)
```

```{julia}
@benchmark eigvecs($A)
```

## Algorithm for singular value decomposition (SVD)

<img src="https://www.mathworks.com/content/mathworks/www/en/company/newsletters/articles/professor-svd/_jcr_content/mainParsys/columns/3/image_7.img.jpg/1490211663024.jpg" width="150" align="center"/>

<img src="https://www.mathworks.com/content/mathworks/www/en/company/newsletters/articles/professor-svd/jcr:content/mainParsys/image_0.adapt.full.high.gif/1490211662576.gif" width="200" align="center"/>

Assume $\mathbf{A} \in \mathbb{R}^{m \times n}$ and we seek the SVD $\mathbf{A} = \mathbf{U} \mathbf{D} \mathbf{V}^T$. 

<img src="svd_bidiagonalization.png" width="400" align="center"/>

<img src="SVD_bidiagonalization_2.png" width="400" align="center"/>

* **Golub-Kahan-Reinsch algorithm**: 
    * Stage 1: Transform $\mathbf{A}$ to an upper bidiagonal form $\mathbf{B}$ (by Householder).  
    * Stage 2: Apply implicit-shift QR iteration to the tridiagonal matrix $\mathbf{B}^T \mathbf{B}$ _implicitly_.

* See Section 8.6 of [Matrix Computation](http://catalog.library.ucla.edu/vwebv/holdingsInfo?bibId=7122088) by Gene Golub and Charles Van Loan (2013) for more details.

* $4m^2 n + 8mn^2 + 9n^3$ flops for a tall $(m \ge n)$ matrix.

### Example

Julia functions: [`svd`](https://docs.julialang.org/en/v1/stdlib/LinearAlgebra/#LinearAlgebra.svd), [`svd!`](https://docs.julialang.org/en/v1/stdlib/LinearAlgebra/#LinearAlgebra.svd), [`svdvals`](https://docs.julialang.org/en/v1/stdlib/LinearAlgebra/#LinearAlgebra.svdvals), [`svdvals!`](https://docs.julialang.org/en/v1/stdlib/LinearAlgebra/#LinearAlgebra.svdvals!).

```{julia}
Random.seed!(123)

A = randn(5, 3)
Asvd = svd(A)
```

```{julia}
Asvd.U
```

```{julia}
# Vt is cheaper to extract than V
Asvd.Vt
```

```{julia}
Asvd.V
```

```{julia}
Asvd.S
```

**Don't request singular vectors unless needed.**

```{julia}
Random.seed!(123)

n, p = 1000, 500
A = randn(n, p)
@benchmark svdvals(A)
```

```{julia}
@benchmark svd(A)
```

## Lanczos/Arnoldi iterative method for top eigen-pairs

* Consider the Google PageRank problem. We want to find the top left eigenvector of the transition matrix $\mathbf{P}$. Direct methods such as (unsymmetric) QR or SVD takes forever. Iterative methods such as power method is feasible. However power method may take a large number of iterations.

* **Krylov subspace methods** are the state-of-art iterative methods for obtaining the top eigen-values/vectors or singular values/vectors of large **sparse** or **structured** matrices.

* **Lanczos method**: top eigen-pairs of a large _symmetric_ matrix.  

* **Arnoldi method**: top eigen-pairs of a large _asymmetric_ matrix.

* Both methods are also adapted to obtain top singular values/vectors of large sparse or structured matrices.

* `eigs` and `svds` in Julia [Arpack.jl](https://github.com/JuliaLinearAlgebra/Arpack.jl) package and Matlab are wrappers of the ARPACK package, which implements Lanczos and Arnoldi methods.

```{julia}
using MatrixDepot, SparseArrays

# Download the Boeing/bcsstk38 matrix (sparse, pd, 8032-by-8032) from SuiteSparse collection
# https://www.cise.ufl.edu/research/sparse/matrices/Boeing/bcsstk38.html
A = matrixdepot("Boeing/bcsstk38")
# Change type of A from Symmetric{Float64,SparseMatrixCSC{Float64,Int64}} to SparseMatrixCSC
A = sparse(A)
@show typeof(A)
Afull = Matrix(A)
@show typeof(Afull)
# actual sparsity level
count(!iszero, A) / length(A)
```

```{julia}
using UnicodePlots
spy(A)
```

```{julia}
# top 5 eigenvalues by LAPACK (QR algorithm)
n = size(A, 1)
@time eigvals(Symmetric(Afull), (n-4):n)
```

```{julia}
using Arpack
# top 5 eigenvalues by iterative methods
@time eigs(A; nev=5, ritzvec=false, which=:LM)
```

```{julia}
@benchmark eigs($A; nev=5, ritzvec=false, which=:LM)
```

We see >1000 fold speedup in this case.

## Jacobi method for symmetric eigen-decomposition

Assume $\mathbf{A} \in \mathbf{R}^{n \times n}$ is symmetric and we seek the eigen-decomposition $\mathbf{A} = \mathbf{U} \Lambda \mathbf{U}^T$.

* Idea: Systematically reduce off-diagonal entries 
$$
\text{off}(\mathbf{A}) = \sum_i \sum_{j \ne i} a_{ij}^2
$$
by Jacobi rotations.

* Jacobi/Givens rotations:
$$
\begin{eqnarray*}
	\mathbf{J}(p,q,\theta) = \begin{pmatrix} 
	1 & & 0 & & 0 & & 0 \\
	\vdots & \ddots & \vdots & & \vdots & & \vdots \\
	0 & & \cos(\theta) & & \sin(\theta) & & 0 \\ 
	\vdots & & \vdots & \ddots & \vdots & & \vdots \\
	0 & & - \sin(\theta) & & \cos(\theta) & & 0 \\
	\vdots & & \vdots & & \vdots & \ddots & \vdots \\
	0 & & 0 & & 0 & & 1 \end{pmatrix},
\end{eqnarray*}
$$
$\mathbf{J}(p,q,\theta)$ is orthogonal.

* Consider $\mathbf{B} = \mathbf{J}^T \mathbf{A} \mathbf{J}$. $\mathbf{B}$ preserves the symmetry and eigenvalues of $\mathbf{A}$. Taking 
$$
\begin{eqnarray*}
\begin{cases}
\tan (2\theta) = 2a_{pq}/({a_{qq}-a_{pp}}) & \text{if } a_{pp} \ne a_{qq} \\
\theta = \pi/4 & \text{if } a_{pp}=a_{qq}
\end{cases}
\end{eqnarray*}
$$
forces $b_{pq}=0$.

* Since orthogonal transform preserves Frobenius norm, we have
$$
b_{pp}^2 + b_{qq}^2 = a_{pp}^2 + a_{qq}^2 + 2a_{pq}^2.
$$ 
(Just check the 2-by-2 block)

* Since $\|\mathbf{A}\|_{\text{F}} = \|\mathbf{B}\|_{\text{F}}$, this implies that the off-diagonal part 
$$
	\text{off}(\mathbf{B}) = \text{off}(\mathbf{A}) - 2a_{pq}^2
$$
is decreased whenever $a_{pq} \ne 0$.

* One Jacobi rotation costs $O(n)$ flops.

* **Classical Jacobi**: search for the largest $|a_{ij}|$ at each iteration. $O(n^2)$ efforts!

* $\text{off}(\mathbf{A}) \le n(n-1) a_{ij}^2$ and $\text{off}(\mathbf{B}) = \text{off}(\mathbf{A}) - 2 a_{ij}^2$ together implies 
$$
\text{off}(\mathbf{B}) \le \left( 1 - \frac{2}{n(n-1)} \right) \text{off}(\mathbf{A}).
$$
Thus Jacobi method converges in $O(n^2)$ iterations.

* In practice, cyclic-by-row implementation, to avoid the costly $O(n^2)$ search in the classical Jacobi.

* Jacobi method attracts a lot recent attention because of its rich inherent parallelism.

* **Parallel Jacobi**: "merry-go-round" to generate parallel ordering.

<img src="./merry_go_round.pdf" width="500" align="center"/>