--- title: Matrices (BV Chapters 6-10) subtitle: Biostat 216 author: Dr. Hua Zhou @ UCLA date: today format: html: theme: cosmo embed-resources: true number-sections: true toc: true toc-depth: 4 toc-location: left code-fold: false jupyter: jupytext: formats: 'ipynb,qmd' text_representation: extension: .qmd format_name: quarto format_version: '1.0' jupytext_version: 1.15.2 kernelspec: display_name: Julia 1.9.3 language: julia name: julia-1.9 --- ```{julia} using Pkg Pkg.activate(pwd()) Pkg.instantiate() Pkg.status() ``` ```{julia} using BenchmarkTools, Distributions, DSP, ForwardDiff, GraphPlot using ImageCore, ImageFiltering, ImageIO, ImageShow, Graphs using LinearAlgebra, Plots, Polynomials, Random, SparseArrays, SpecialMatrices using Symbolics, TestImages Random.seed!(216) ``` ## Notations and terminology - A matrix $\mathbf{A} \in \mathbb{R}^{m \times n}$ $$ \mathbf{A} = \begin{pmatrix} a_{11} & \cdots & a_{1n} \\ \vdots & \ddots & \vdots \\ a_{m1} & \cdots & a_{mn} \end{pmatrix}. $$ - $a_{ij}$ are the matrix **elements** or **entries** or **coefficients** or **components**. - **Size** or **dimension** of the matrix is $m \times n$. - Many authors use square brackets: $$ \mathbf{A} = \begin{bmatrix} a_{11} & \cdots & a_{1n} \\ \vdots & \ddots & \vdots \\ a_{m1} & \cdots & a_{mn} \end{bmatrix}. $$ ```{julia} # a 3x4 matrix A = [0 1 -2.3 0.1; 1.3 4 -0.1 0; 4.1 -1 0 1.7] ``` - We say a matrix $\mathbf{A} \in \mathbb{R}^{m \times n}$ is - **tall** if $m > n$; - **wide** if $m < n$; - **square** if $m = n$ ```{julia} # a tall matrix A = randn(5, 3) ``` ```{julia} # a wide matrix A = randn(2, 4) ``` ```{julia} # a square matrix A = randn(3, 3) ``` - **Block matrix** is a rectangular array of matrices $$ \mathbf{A} = \begin{pmatrix} \mathbf{B} & \mathbf{C} \\ \mathbf{D} & \mathbf{E} \end{pmatrix}. $$ Elements in the block matrices are the **blocks** or **submatrices**. Dimensions of the blocks must be compatible. ```{julia} # blocks B = [2; 1] # 2x1 C = [0 2 3; 5 4 7] # 2x3 D = [1] # 1x1 E = [-1 6 0] # 1x3 # block matrix A = [B C; D E] ``` - Matrix $\mathbf{A} \in \mathbb{R}^{m \times n}$ can be viewed as a $1 \times n$ block matrix with each column as a block $$ \mathbf{A} = \begin{pmatrix} \mathbf{a}_1 & \cdots & \mathbf{a}_n \end{pmatrix}, $$ where $$ \mathbf{a}_j = \begin{pmatrix} a_{1j} \\ \vdots \\ a_{mj} \end{pmatrix} $$ is the $j$-th column of $\mathbf{A}$. - $\mathbf{A} \in \mathbb{R}^{m \times n}$ viewed as an $m \times 1$ block matrix with each row as a block $$ \mathbf{A} = \begin{pmatrix} \mathbf{b}_1' \\ \vdots \\ \mathbf{b}_m' \end{pmatrix}, $$ where $$ \mathbf{b}_i' = \begin{pmatrix} a_{i1} \, \cdots \, a_{in} \end{pmatrix} $$ is the $i$-th row of $\mathbf{A}$. - Submatrix: $\mathbf{A}_{i_1:i_2,j_1:j_2}$ or $\mathbf{A}[i_1:i_2, j_1:j_2]$. ```{julia} A = [0 1 -2.3 0.1; 1.3 4 -0.1 0; 4.1 -1 0 1.7] ``` ```{julia} # sub-array A[2:3, 2:3] ``` ```{julia} # a row A[3, :] ``` ```{julia} # a column A[:, 3] ``` ## Some special matrices - **Zero matrix** $\mathbf{0}_{m \times n} \in \mathbb{R}^{m \times n}$ is the matrix with all entries $a_{ij}=0$. The subscript is sometimes omitted when the dimension is clear from context. - **One matrix** $\mathbf{1}_{m \times n} \in \mathbb{R}^{m \times n}$ is the matrix with all entries $a_{ij}=1$. It is often written as $\mathbf{1}_m \mathbf{1}_n'$. 
- **Identity matrix** $\mathbf{I}_n \in \mathbb{R}^{n \times n}$ has entries $a_{ij} = \delta_{ij}$ (Kronecker delta notation):
$$
\mathbf{I}_n = \begin{pmatrix}
1 & \cdots & 0 \\
\vdots & \ddots & \vdots \\
0 & \cdots & 1
\end{pmatrix}.
$$
The subscript is often omitted when the dimension is clear from context. The columns of $\mathbf{I}_n$ are the unit vectors
$$
\mathbf{I}_n = \begin{pmatrix} \mathbf{e}_1 \, \mathbf{e}_2 \, \cdots \, \mathbf{e}_n \end{pmatrix}.
$$
The identity matrix is called **uniform scaling** in some computer languages.

```{julia}
# a 3x5 zero matrix
zeros(3, 5)
```

```{julia}
# a 3x5 one matrix
ones(3, 5)
```

```{julia}
@show ones(3, 5) == ones(3) * ones(5)'
```

```{julia}
# uniform scaling
I
```

```{julia}
# a 3x3 identity matrix
I(3)
```

```{julia}
# convert to dense matrix
Matrix(I(3))
```

- A **symmetric matrix** is a square matrix $\mathbf{A}$ with $a_{ij} = a_{ji}$.

```{julia}
# a symmetric matrix
A = [4 3 -2; 3 -1 5; -2 5 0]
```

```{julia}
issymmetric(A)
```

```{julia}
# turn a general square matrix into a symmetric matrix
A = randn(3, 3)
```

```{julia}
# use upper triangular part as data
Symmetric(A)
```

```{julia}
# use lower triangular part as data
Symmetric(A, :L)
```

- A **diagonal matrix** is a square matrix $\mathbf{A}$ with $a_{ij} = 0$ for all $i \ne j$.

```{julia}
# a diagonal matrix
A = [-1 0 0; 0 2 0; 0 0 -5]
```

```{julia}
# turn a general square matrix into a diagonal matrix
Diagonal(A)
```

- A **lower triangular matrix** is a square matrix $\mathbf{A}$ with $a_{ij} = 0$ for all $i < j$.

- An **upper triangular matrix** is a square matrix $\mathbf{A}$ with $a_{ij} = 0$ for all $i > j$.

```{julia}
# a lower triangular matrix
A = [4 0 0; 3 -1 0; -1 5 -2]
```

```{julia}
# turn a general square matrix into a lower triangular matrix
A = randn(3, 3)
LowerTriangular(A)
```

```{julia}
# turn a general square matrix into an upper triangular matrix
UpperTriangular(A)
```

- A **unit lower triangular matrix** is a square matrix $\mathbf{A}$ with $a_{ij} = 0$ for all $i < j$ and $a_{ii}=1$.

- A **unit upper triangular matrix** is a square matrix $\mathbf{A}$ with $a_{ij} = 0$ for all $i > j$ and $a_{ii}=1$.

```{julia}
# turn a general square matrix into a unit lower triangular matrix
UnitLowerTriangular(A)
```

```{julia}
# turn a general square matrix into a unit upper triangular matrix
UnitUpperTriangular(A)
```

- We say a matrix is **sparse** if most (almost all) of its elements are zero.

- On a computer, a sparse matrix stores only the non-zero entries and their positions.

- Special algorithms exploit the sparsity. Efficiency depends on the number of nonzeros and their positions.

- A sparse matrix can be visualized by a **spy plot**.

```{julia}
# generate a random 50x120 sparse matrix, with sparsity level 0.05
A = sprandn(50, 120, 0.05)
```

```{julia}
# show contents of the SparseMatrixCSC{Float64, Int64} data structure
dump(A)
```

```{julia}
Plots.spy(A)
```

## Matrix operations

- **Scalar-matrix multiplication** or **scalar-matrix product**: For $\beta \in \mathbb{R}$ and $\mathbf{A} \in \mathbb{R}^{m \times n}$,
$$
\beta \mathbf{A} = \begin{pmatrix}
\beta a_{11} & \cdots & \beta a_{1n} \\
\vdots & \ddots & \vdots \\
\beta a_{m1} & \cdots & \beta a_{mn}
\end{pmatrix}.
$$

```{julia}
β = 0.5
A = [0 1 -2.3 0.1; 1.3 4 -0.1 0; 4.1 -1 0 1.7]
```

```{julia}
β * A
```

- **Matrix addition and subtraction**: For $\mathbf{A}, \mathbf{B} \in \mathbb{R}^{m \times n}$,
$$
\mathbf{A} + \mathbf{B} = \begin{pmatrix}
a_{11} + b_{11} & \cdots & a_{1n} + b_{1n} \\
\vdots & \ddots & \vdots \\
a_{m1} + b_{m1} & \cdots & a_{mn} + b_{mn}
\end{pmatrix}, \quad
\mathbf{A} - \mathbf{B} = \begin{pmatrix}
a_{11} - b_{11} & \cdots & a_{1n} - b_{1n} \\
\vdots & \ddots & \vdots \\
a_{m1} - b_{m1} & \cdots & a_{mn} - b_{mn}
\end{pmatrix}.
$$

```{julia}
A = [0 1 -2.3 0.1; 1.3 4 -0.1 0; 4.1 -1 0 1.7]
```

```{julia}
B = ones(size(A))
```

```{julia}
A + B
```

```{julia}
A - B
```

- Composing scalar-matrix products with matrix addition/subtraction yields a linear combination of matrices of the same size
$$
\alpha_1 \mathbf{A}_1 + \cdots + \alpha_k \mathbf{A}_k.
$$
One example of a linear combination of matrices is color-to-grayscale image conversion.

```{julia}
rgb_image = testimage("lighthouse")
```

```{julia}
# R(ed) channel
channelview(rgb_image)[1, :, :]
```

```{julia}
# G(reen) channel
channelview(rgb_image)[2, :, :]
```

```{julia}
# B(lue) channel
channelview(rgb_image)[3, :, :]
```

```{julia}
gray_image = Gray.(rgb_image)
```

```{julia}
channelview(gray_image)
```

The gray image is computed by taking a linear combination of the R(ed), G(reen), and B(lue) channels
$$
0.299 \mathbf{R} + 0.587 \mathbf{G} + 0.114 \mathbf{B}
$$
according to [Rec. ITU-R BT.601-7](https://www.itu.int/dms_pubrec/itu-r/rec/bt/R-REC-BT.601-7-201103-I!!PDF-E.pdf).

```{julia}
@show 0.299 * channelview(rgb_image)[1, 1, 1] + 0.587 * channelview(rgb_image)[2, 1, 1] + 0.114 * channelview(rgb_image)[3, 1, 1]
@show channelview(gray_image)[1, 1];
```

```{julia}
# first pixel
rgb_image[1]
```

```{julia}
# first pixel converted to gray scale
Gray{N0f8}(0.299 * rgb_image[1].r + 0.587 * rgb_image[1].g + 0.114 * rgb_image[1].b)
```

- The **transpose** of a matrix $\mathbf{A} \in \mathbb{R}^{m \times n}$ is the $n \times m$ matrix
$$
\mathbf{A}' = \begin{pmatrix}
a_{11} & \cdots & a_{m1} \\
\vdots & \ddots & \vdots \\
a_{1n} & \cdots & a_{mn}
\end{pmatrix}.
$$
In words, we swap $a_{ij} \leftrightarrow a_{ji}$ for all $i, j$.

- Alternative notation: $\mathbf{A}^T$, $\mathbf{A}^t$.

- Properties of matrix transpose:
    - A symmetric matrix satisfies $\mathbf{A} = \mathbf{A}'$.
    - $(\beta \mathbf{A})' = \beta \mathbf{A}'$.
    - $(\mathbf{A} + \mathbf{B})' = \mathbf{A}' + \mathbf{B}'$.

```{julia}
A
```

```{julia}
A'
```

```{julia}
# note it's not rotating the picture!
rgb_image'
```

- Transpose of a block matrix:
$$
\begin{pmatrix}
\mathbf{A} & \mathbf{B} \\
\mathbf{C} & \mathbf{D}
\end{pmatrix}' = \begin{pmatrix}
\mathbf{A}' & \mathbf{C}' \\
\mathbf{B}' & \mathbf{D}'
\end{pmatrix}.
$$

## Matrix norm

- The **(Frobenius) norm** or **norm** of a matrix $\mathbf{A} \in \mathbb{R}^{m \times n}$ is defined as
$$
\|\mathbf{A}\| = \sqrt{\sum_{i,j} a_{ij}^2}.
$$
Often we write $\|\mathbf{A}\|_{\text{F}}$ to distinguish it from other matrix norms.

- Define the $\operatorname{vec}$ operator that stacks the columns of a matrix into a long vector
$$
\operatorname{vec} \mathbf{A} = \begin{pmatrix} a_{11} \\ \vdots \\ a_{m1} \\ \vdots \\ a_{1n} \\ \vdots \\ a_{mn} \end{pmatrix}.
$$
It is easy to see that $\|\mathbf{A}\| = \|\operatorname{vec} \mathbf{A}\|$.

- Similar to the vector norm, the matrix norm satisfies the following properties:

1. Positive definiteness: $\|\mathbf{A}\| \ge 0$.
$\|\mathbf{A}\| = 0$ if and only if $\mathbf{A}=\mathbf{0}_{m \times n}$.
2. Homogeneity: $\|\alpha \mathbf{A}\| = |\alpha| \|\mathbf{A}\|$ for any scalar $\alpha$ and matrix $\mathbf{A}$.
3. Triangle inequality: $\|\mathbf{A} + \mathbf{B}\| \le \|\mathbf{A}\| + \|\mathbf{B}\|$ for any $\mathbf{A}, \mathbf{B} \in \mathbb{R}^{m \times n}$. Proof: TODO in class.
4. A new rule for (any) matrix norm: $\|\mathbf{A} \mathbf{B}\| \le \|\mathbf{A}\| \|\mathbf{B}\|$. Proof: HW2.

```{julia}
A = [0 1 -2.3 0.1; 1.3 4 -0.1 0; 4.1 -1 0 1.7]
```

```{julia}
# Frobenius norm
norm(A)
```

```{julia}
# manually calculate Frobenius norm
sqrt(sum(abs2, A))
```

## Matrix-matrix and matrix-vector multiplications

### Matrix-matrix multiplication

- **Matrix-matrix multiplication** or **matrix-matrix product**: For $\mathbf{A} \in \mathbb{R}^{m \times n}$ and $\mathbf{B} \in \mathbb{R}^{n \times p}$,
$$
\mathbf{C} = \mathbf{A} \mathbf{B}
$$
is the $m \times p$ matrix with entries
$$
c_{ij} = a_{i1} b_{1j} + a_{i2} b_{2j} + \cdots + a_{in} b_{nj} = \sum_{k=1}^n a_{ik} b_{kj}.
$$
Note the number of columns in $\mathbf{A}$ must be equal to the number of rows in $\mathbf{B}$.

- Vector inner product $\mathbf{a}' \mathbf{b}$ is a special case of matrix-matrix multiplication with $m=p=1$.
$$
\mathbf{a}' \mathbf{b} = \begin{pmatrix} a_1 \, \cdots \, a_n \end{pmatrix} \begin{pmatrix} b_1 \\ \vdots \\ b_n \end{pmatrix} = \sum_{i=1}^n a_i b_i.
$$

- View of matrix-matrix multiplication as inner products. $c_{ij}$ is the inner product of the $i$-th row of $\mathbf{A}$ and the $j$-th column of $\mathbf{B}$:
$$
\begin{pmatrix} & & \\ & c_{ij} & \\ & & \end{pmatrix} = \begin{pmatrix} & & \\ - & \mathbf{a}_i' & - \\ & & \end{pmatrix} \begin{pmatrix} & | & \\ & \mathbf{b}_j & \\ & | & \end{pmatrix} = \begin{pmatrix} & & \\ a_{i1} & \cdots & a_{in} \\ & & \end{pmatrix} \begin{pmatrix} & b_{1j} & \\ & \vdots & \\ & b_{nj} & \end{pmatrix}.
$$

```{julia}
# an example with m, n, p = 2, 3, 2
A = [-1.5 3 2; 1 -1 0]
```

```{julia}
B = [-1 -1; 0 -2; 1 0]
```

```{julia}
C = A * B
```

```{julia}
# check C[2,2] = A[2,:]'B[:,2]
C[2, 2] ≈ A[2, :]'B[:, 2]
```

### Matrix-vector multiplication

- **Matrix-vector multiplication** or **matrix-vector product** is a special case of matrix-matrix multiplication with $p=1$ (for $\mathbf{A} \mathbf{b}$) or $m=1$ (for $\mathbf{a}' \mathbf{B}$).

- (Inner product view) For $\mathbf{A} \in \mathbb{R}^{m \times n}$ and $\mathbf{b} \in \mathbb{R}^{n}$,
$$
\mathbf{A} \mathbf{b} = \begin{pmatrix} - & \mathbf{c}_1' & - \\ & \vdots & \\ - & \mathbf{c}_m' & - \end{pmatrix} \begin{pmatrix} | \\ \mathbf{b} \\ | \end{pmatrix} = \begin{pmatrix} \mathbf{c}_1' \mathbf{b} \\ \vdots \\ \mathbf{c}_m' \mathbf{b} \end{pmatrix} \in \mathbb{R}^m,
$$
where $\mathbf{c}_i'$ denotes the $i$-th row of $\mathbf{A}$.
(Linear combination view) Alternatively, $\mathbf{A} \mathbf{b}$ can be viewed as a linear combination of the columns of $\mathbf{A}$
$$
\mathbf{A} \mathbf{b} = \begin{pmatrix} | & & | \\ \mathbf{a}_1 & \cdots & \mathbf{a}_n \\ | & & | \end{pmatrix} \begin{pmatrix} b_1 \\ \vdots \\ b_n \end{pmatrix} = b_1 \mathbf{a}_1 + \cdots + b_n \mathbf{a}_n.
$$

- (Inner product view) For $\mathbf{a} \in \mathbb{R}^n$ and $\mathbf{B} \in \mathbb{R}^{n \times p}$,
$$
\mathbf{a}' \mathbf{B} = \begin{pmatrix} - & \mathbf{a}' & - \end{pmatrix} \begin{pmatrix} | & & | \\ \mathbf{b}_1 & \cdots & \mathbf{b}_p \\ | & & | \end{pmatrix} = (\mathbf{a}' \mathbf{b}_1 \, \ldots \, \mathbf{a}'\mathbf{b}_p).
$$
(Linear combination view) Alternatively, $\mathbf{a}' \mathbf{B}$ can be viewed as a linear combination of the rows of $\mathbf{B}$
$$
\mathbf{a}' \mathbf{B} = (a_1 \, \ldots \, a_n) \begin{pmatrix} - & \mathbf{b}_1' & - \\ & \vdots & \\ - & \mathbf{b}_n' & - \end{pmatrix} = a_1 \mathbf{b}_1' + \cdots + a_n \mathbf{b}_n'.
$$

- View of matrix multiplication $\mathbf{C} = \mathbf{A} \mathbf{B}$ as matrix-vector products.
    - The $j$-th column of $\mathbf{C}$ is the product of $\mathbf{A}$ and the $j$-th column of $\mathbf{B}$
$$
\begin{pmatrix} & | & \\ & \mathbf{c}_j & \\ & | & \end{pmatrix} = \mathbf{A} \begin{pmatrix} & | & \\ & \mathbf{b}_j & \\ & | & \end{pmatrix}.
$$
    - The $i$-th row of $\mathbf{C}$ is the product of the $i$-th row of $\mathbf{A}$ and $\mathbf{B}$
$$
\begin{pmatrix} & & \\ - & \mathbf{c}_i' & - \\ & & \end{pmatrix} = \begin{pmatrix} & & \\ - & \mathbf{a}_i' & - \\ & & \end{pmatrix} \mathbf{B}.
$$

```{julia}
# check C[:, 2] = A * B[:, 2]
C[:, 2] ≈ A * B[:, 2]
```

```{julia}
# check C[2, :]' = A[2, :]' * B
# note C[2, :] returns a column vector in Julia!
C[2, :]' ≈ A[2, :]'B
```

### Exercise - multiplication of adjacency matrix

Here is a directed graph with 4 nodes and 5 edges.

```{julia}
# a simple directed graph on GS p16
g = SimpleDiGraph(4)
add_edge!(g, 1, 2)
add_edge!(g, 1, 3)
add_edge!(g, 2, 3)
add_edge!(g, 2, 4)
add_edge!(g, 4, 3)
gplot(g, nodelabel=["x1", "x2", "x3", "x4"], edgelabel=["b1", "b2", "b3", "b4", "b5"])
```

The adjacency matrix $\mathbf{A}$ has entries
\begin{eqnarray*}
a_{ij} = \begin{cases}
1 & \text{if node $i$ links to node $j$} \\
0 & \text{otherwise}
\end{cases}.
\end{eqnarray*}
Note this definition differs from the BV book (p112).

```{julia}
# adjacency matrix A
A = convert(Matrix{Int64}, adjacency_matrix(g))
```

Give a graph interpretation of $\mathbf{A}^2 = \mathbf{A} \mathbf{A}$, $\mathbf{A}^3 = \mathbf{A} \mathbf{A} \mathbf{A}$, ...

```{julia}
A * A
```

```{julia}
A * A * A
```

**Answer**: $(\mathbf{A}^k)_{ij}$ counts the number of directed paths from node $i$ to node $j$ that use exactly $k$ edges.

### Properties of matrix multiplications

- Associative:
$$
(\mathbf{A} \mathbf{B}) \mathbf{C} = \mathbf{A} (\mathbf{B} \mathbf{C}) = \mathbf{A} \mathbf{B} \mathbf{C}.
$$

- Associative with scalar-matrix multiplication:
$$
(\alpha \mathbf{B}) \mathbf{C} = \alpha \mathbf{B} \mathbf{C} = \mathbf{B} (\alpha \mathbf{C}).
$$

- Distributive with sum:
$$
(\mathbf{A} + \mathbf{B}) \mathbf{C} = \mathbf{A} \mathbf{C} + \mathbf{B} \mathbf{C}, \quad \mathbf{A} (\mathbf{B} + \mathbf{C}) = \mathbf{A} \mathbf{B} + \mathbf{A} \mathbf{C}.
$$

- Transpose of product:
$$
(\mathbf{A} \mathbf{B})' = \mathbf{B}' \mathbf{A}'.
$$

- **Not** commutative: In general,
$$
\mathbf{A} \mathbf{B} \ne \mathbf{B} \mathbf{A}.
$$
There are exceptions, e.g., $\mathbf{A} \mathbf{I} = \mathbf{I} \mathbf{A}$ for square $\mathbf{A}$.

```{julia}
A = randn(3, 2)
B = randn(2, 4)
A * B
```

```{julia}
# dimensions are even incompatible!
B * A
```

- For two vectors $\mathbf{a} \in \mathbb{R}^m$ and $\mathbf{b} \in \mathbb{R}^n$, the matrix
$$
\mathbf{a} \mathbf{b}' = \begin{pmatrix} a_1 \\ \vdots \\ a_m \end{pmatrix} \begin{pmatrix} b_1 \, \cdots \, b_n \end{pmatrix} = \begin{pmatrix} a_1 b_1 & \cdots & a_1 b_n \\ \vdots & & \vdots \\ a_m b_1 & \cdots & a_m b_n \end{pmatrix}
$$
is called the **outer product** of $\mathbf{a}$ and $\mathbf{b}$. Clearly $\mathbf{a} \mathbf{b}' \ne \mathbf{b}' \mathbf{a}$ unless both are scalars.
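A small numerical illustration of the outer product (the two vectors below are arbitrary examples chosen for this sketch):

```{julia}
# outer product of a ∈ ℝ³ and b ∈ ℝ²: a 3x2 matrix with entries a[i] * b[j]
a = [1, 2, 3]
b = [4, 5]
a * b'
```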
- View of matrix-matrix product $\mathbf{C} = \mathbf{A} \mathbf{B}$ as sum of outer products: $$ \mathbf{C} = \begin{pmatrix} | & & | \\ \mathbf{a}_1 & \cdots & \mathbf{a}_n \\ | & & | \end{pmatrix} \begin{pmatrix} - & \mathbf{b}_1' & - \\ & \vdots & \\ - & \mathbf{b}_n' & - \end{pmatrix} = \mathbf{a}_1 \mathbf{b}_1' + \cdots + \mathbf{a}_n \mathbf{b}_n'. $$ ```{julia} m, n, p = 3, 2, 4 A = randn(m, n) B = randn(n, p) C = A * B @show C ≈ A[:, 1] * B[1, :]' + A[:, 2] * B[2, :]' ``` ### Gram matrix - Let $\mathbf{A} \in \mathbb{R}^{m \times n}$ with columns $\mathbf{a}_1, \ldots, \mathbf{a}_n$. The matrix product $$ \mathbf{G} = \mathbf{A}' \mathbf{A} = \begin{pmatrix} - & \mathbf{a}_1' & - \\ & \vdots & \\ - & \mathbf{a}_n' & - \end{pmatrix} \begin{pmatrix} | & & | \\ \mathbf{a}_1 & \cdots & \mathbf{a}_n \\ | & & | \end{pmatrix} = \begin{pmatrix} \mathbf{a}_1' \mathbf{a}_1 & \cdots & \mathbf{a}_1' \mathbf{a}_n \\ \vdots & \ddots & \vdots \\ \mathbf{a}_n' \mathbf{a}_1 & \cdots & \mathbf{a}_n' \mathbf{a}_n \end{pmatrix} \in \mathbb{R}^{n \times n} $$ is called the **Gram matrix**. - Gram matrix is symmetric: $\mathbf{G}' = (\mathbf{A}' \mathbf{A})' = \mathbf{A}' (\mathbf{A}')' = \mathbf{A}' \mathbf{A} = \mathbf{G}$. - We also call $\mathbf{A} \mathbf{A}'$ a Gram matrix. ```{julia} A = randn(5, 3) ``` ```{julia} # 3x3 A' * A ``` ```{julia} # 5x5 A * A' ``` ### Linear independence of columns - A set of vectors $\mathbf{a}_1, \ldots, \mathbf{a}_k \in \mathbb{R}^{n}$ are linearly independent if, for the matrix $\mathbf{A} = (\mathbf{a}_1 \, \ldots \, \mathbf{a}_k) \in \mathbb{R}^{n \times k}$, $$ \mathbf{A} \mathbf{b} = b_1 \mathbf{a}_1 + \cdots + b_k \mathbf{a}_k = \mathbf{0}_n $$ implies that $\mathbf{b} = \mathbf{0}_k$. In this case, we also say $\mathbf{A}$ has linearly independent columns. ### Block matrix-matrix multiplication - Matrix-matrix multiplication in block form: $$ \begin{pmatrix} \mathbf{A} & \mathbf{B} \\ \mathbf{C} & \mathbf{D} \end{pmatrix} \begin{pmatrix} \mathbf{W} & \mathbf{Y} \\ \mathbf{X} & \mathbf{Z} \end{pmatrix} = \begin{pmatrix} \mathbf{A} \mathbf{W} + \mathbf{B} \mathbf{X} & \mathbf{A} \mathbf{Y} + \mathbf{B} \mathbf{Z} \\ \mathbf{C} \mathbf{W} + \mathbf{D} \mathbf{X} & \mathbf{C} \mathbf{Y} + \mathbf{D} \mathbf{Z} \end{pmatrix}. $$ Submatrices need to have compatible dimensions. ## Linear functions and operators - The matrix-vector product $\mathbf{y} = \mathbf{A} \mathbf{x}$ can be viewed as a function acting on input $\mathbf{x} \in \mathbb{R}^n$ and outputing $\mathbf{y} \in \mathbb{R}^m$. In this sense, we also say any matrix $\mathbf{A}$ is a linear operator. - A function $f: \mathbb{R}^n \mapsto \mathbb{R}^m$ is **linear** if $$ f(\alpha \mathbf{x} + \beta \mathbf{y}) = \alpha f(\mathbf{x}) + \beta f(\mathbf{y}). $$ for all scalars $\alpha, \beta$ and vectors $\mathbf{x}, \mathbf{y} \in \mathbb{R}^n$. - Definition of linear function implies that the superposition property holds for any linear combination $$ f(\alpha_1 \mathbf{x}_1 + \cdots + \alpha_p \mathbf{x}_p) = \alpha_1 f(\mathbf{x}_1) + \cdots + \alpha_p f(\mathbf{x}_p). $$ - Any linear function is a matrix-vector product and vice versa. 1. 
For a fixed matrix $\mathbf{A} \in \mathbb{R}^{m \times n}$, the function $f: \mathbb{R}^n \mapsto \mathbb{R}^m$ defined by $$ f(\mathbf{x}) = \mathbf{A} \mathbf{x} $$ is linear, because $f(\alpha \mathbf{x} + \beta \mathbf{y}) = \mathbf{A} (\alpha \mathbf{x} + \beta \mathbf{y}) = \alpha \mathbf{A} \mathbf{x} + \beta \mathbf{A} \mathbf{y} = \alpha f(\mathbf{x}) + \beta f(\mathbf{y})$. 2. Any linear function is a matrix-vector product because \begin{eqnarray*} f(\mathbf{x}) &=& f(x_1 \mathbf{e}_1 + \cdots + x_n \mathbf{e}_n) \\ &=& x_1 f(\mathbf{e}_1) + \cdots + x_n f(\mathbf{e}_n) \\ &=& \begin{pmatrix} f(\mathbf{e}_1) \, \cdots \, f(\mathbf{e}_n) \end{pmatrix} \mathbf{x}. \end{eqnarray*} Hence $f(\mathbf{x}) = \mathbf{A} \mathbf{x}$ with $\mathbf{A} = \begin{pmatrix} f(\mathbf{e}_1) \, \cdots \, f(\mathbf{e}_n) \end{pmatrix}$. ### Permutation - **Reverser matrix**: $$ A = \begin{pmatrix} 0 & \cdots & 1 \\ \vdots & .^{.^{.}} & \vdots \\ 1 & \cdots & 0 \end{pmatrix}. $$ For any vector $\mathbf{x} \in \mathbb{R}^n$, $$ \mathbf{A} \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_{n-1} \\ x_n \end{pmatrix} = \begin{pmatrix} x_n \\ x_{n-1} \\ \vdots \\ x_2 \\ x_1 \end{pmatrix}. $$ ```{julia} # a reverser matrix n = 5 A = [i + j == n + 1 ? 1 : 0 for i in 1:n, j in 1:n] ``` ```{julia} x = Vector(1:n) ``` ```{julia} A * x ``` - **Circular shift matrix** $$ \mathbf{A} = \begin{pmatrix} 0 & 0 & \cdots & 0 & 1 \\ 1 & 0 & \cdots & 0 & 0 \\ 0 & 1 & \cdots & 0 & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & \cdots & 1 & 0 \end{pmatrix}. $$ For any vector $\mathbf{x} \in \mathbb{R}^n$, $$ \quad \mathbf{A} \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_{n-1} \\ x_n \end{pmatrix} = \begin{pmatrix} x_n \\ x_1 \\ x_2 \\ \vdots \\ x_{n-1} \end{pmatrix}. $$ ```{julia} n = 5 A = [mod(i - j, n) == 1 ? 1 : 0 for i in 1:n, j in 1:n] ``` ```{julia} x = Vector(1:n) ``` ```{julia} A * x ``` ```{julia} A * (A * x) ``` - The reverser and circular shift matrices are examples of permutation matrix. A **permutation matrix** is a square 0-1 matrix with one element 1 in each row and one element 1 in each column. Equivalently, a permutation matrix is the identity matrix with columns reordered. Equivalently, a permutation matrix is the identity matrix with rows reordered. $\mathbf{A} \mathbf{x}$ is a permutation of elements in $\mathbf{x}$. ```{julia} σ = randperm(n) ``` ```{julia} # permute the rows of identity matrix P = I(n)[σ, :] ``` ```{julia} x = Vector(1:n) ``` ```{julia} # operator P * x ``` ```{julia} x[σ] ``` ### Rotation - A **rotation matrix** in the plane $\mathbb{R}^2$: $$ \mathbf{A} = \begin{pmatrix} \cos \theta & - \sin \theta \\ \sin \theta & \cos \theta \end{pmatrix}. $$ $\mathbf{A} \mathbf{x}$ is $\mathbf{x}$ rotated counterclockwise over an angle $\theta$. 
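As a quick check of this formula (a sketch; the helper `rot` below is introduced only for this illustration), composing two rotations should give a rotation by the sum of the angles:

```{julia}
# rotation by angle θ (counterclockwise)
rot(θ) = [cos(θ) -sin(θ); sin(θ) cos(θ)]
# rotating by θ1 and then by θ2 equals rotating by θ1 + θ2
θ1, θ2 = π / 6, π / 3
rot(θ2) * rot(θ1) ≈ rot(θ1 + θ2)
```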
```{julia}
θ = π/4
A = [cos(θ) -sin(θ); sin(θ) cos(θ)]
```

```{julia}
# rotate x1 counterclockwise by 45 degrees
x1 = [2, 1]
x2 = A * x1
```

```{julia}
# rotate by another 45 degrees (90 degrees total from x1)
x3 = A * x2
```

```{julia}
#| code-fold: true
x_vals = [0 0 0; x1[1] x2[1] x3[1]]
y_vals = [0 0 0; x1[2] x2[2] x3[2]]

plot(x_vals, y_vals, arrow = true, color = :blue,
    legend = :none, xlims = (-3, 3), ylims = (-3, 3),
    annotations = [((x1 .+ 0.2)..., "x1"),
                ((x2 .+ 0.2)..., "x2 = A * x1"),
                ((x3 .+ [-1, 0.2])..., "x3 = A * A * x1")],
    xticks = -3:1:3, yticks = -3:1:3,
    framestyle = :origin,
    aspect_ratio = :equal)
```

- **Exercise**: Given two vectors $\mathbf{x}, \mathbf{y} \in \mathbb{R}^2$ of the same length, how do we find a rotation matrix $\mathbf{A}$ such that $\mathbf{A} \mathbf{x} = \mathbf{y}$?

```{julia}
x, y = randn(2), randn(2)
x = x / norm(x)
y = y / norm(y)
cosθ = x'y / (norm(x) * norm(y))
```

```{julia}
sinθ = sqrt(1 - cosθ^2)
A = [cosθ -sinθ; sinθ cosθ]
```

```{julia}
# since sinθ ≥ 0 by construction, A rotates by +θ; we have either Ax = y or Ay = x
# if A sends y to x, flip the rotation direction
if A * y ≈ x
    A = [cosθ sinθ; -sinθ cosθ]
end
```

```{julia}
A * x ≈ y
```

### Projection and reflection

- **Projection** of $\mathbf{x} \in \mathbb{R}^n$ on the line through $\mathbf{0}_n$ and $\mathbf{a}$:
$$
\mathbf{y} = \frac{\mathbf{a}'\mathbf{x}}{\|\mathbf{a}\|^2} \mathbf{a} = \frac{\mathbf{a}\mathbf{a}'}{\|\mathbf{a}\|^2} \mathbf{x} = \mathbf{A} \mathbf{x},
$$
where
$$
\mathbf{A} = \frac{1}{\|\mathbf{a}\|^2} \mathbf{a} \mathbf{a}'.
$$

- **Reflection** of $\mathbf{x} \in \mathbb{R}^n$ with respect to the line through $\mathbf{0}_n$ and $\mathbf{a}$:
$$
\mathbf{z} = \mathbf{x} + 2(\mathbf{y} - \mathbf{x}) = 2 \mathbf{y} - \mathbf{x} = \left( \frac{2}{\|\mathbf{a}\|^2} \mathbf{a} \mathbf{a}' - \mathbf{I} \right) \mathbf{x} = \mathbf{B} \mathbf{x},
$$
with
$$
\mathbf{B} = 2 \mathbf{A} - \mathbf{I} = \frac{2}{\|\mathbf{a}\|^2} \mathbf{a} \mathbf{a}' - \mathbf{I}.
$$

```{julia}
a = [3, 2]
A = a * a' / norm(a)^2
x = [1, 2]
y = A * x
B = 2A - I
z = B * x;
```

```{julia}
x_vals = [0 0 0; x[1] y[1] z[1]]
y_vals = [0 0 0; x[2] y[2] z[2]]

plt = plot(x_vals, y_vals, arrow = true, color = :blue,
    legend = :none, xlims = (-4, 4), ylims = (-2, 4),
    annotations = [((x .+ 0.2)..., "x"),
                ((y .+ 0.2)..., "y"),
                ((z .+ 0.2)..., "z")],
    xticks = -3:1:3, yticks = -1:1:3,
    framestyle = :origin,
    aspect_ratio = :equal)
plot!(plt, [0, a[1]], [0, a[2]], arrow = true, annotations = [((a .+ 0.2)..., "a")])
```

### Incidence matrix

- For a directed graph with $m$ nodes and $n$ arcs (directed edges), the **node-arc incidence matrix** $\mathbf{B} \in \{-1,0,1\}^{m \times n}$ has entries
\begin{eqnarray*}
b_{ij} = \begin{cases}
-1 & \text{if edge $j$ starts at vertex $i$} \\
1 & \text{if edge $j$ ends at vertex $i$} \\
0 & \text{otherwise}
\end{cases}.
\end{eqnarray*}

```{julia}
# a simple directed graph on GS p16
g = SimpleDiGraph(4)
add_edge!(g, 1, 2)
add_edge!(g, 1, 3)
add_edge!(g, 2, 3)
add_edge!(g, 2, 4)
add_edge!(g, 4, 3)
gplot(g, nodelabel=["x1", "x2", "x3", "x4"], edgelabel=["b1", "b2", "b3", "b4", "b5"])
```

```{julia}
# incidence matrix B
B = convert(Matrix{Int64}, incidence_matrix(g))
```

- **Kirchhoff's current law**: Let $\mathbf{B} \in \mathbb{R}^{m \times n}$ be an incidence matrix and $\mathbf{x}=(x_1 \, \ldots \, x_n)'$ with $x_j$ the current through arc $j$,
$$
(\mathbf{B} \mathbf{x})_i = \sum_{\text{arc $j$ enters node $i$}} x_j - \sum_{\text{arc $j$ leaves node $i$}} x_j = \text{net current arriving at node $i$}.
$$

```{julia}
# symbolic calculation
@variables x1 x2 x3 x4 x5
B * [x1, x2, x3, x4, x5]
```

- **Kirchhoff's voltage law**: Let $\mathbf{B} \in \mathbb{R}^{m \times n}$ be an incidence matrix and $\mathbf{y} = (y_1, \ldots, y_m)'$ with $y_i$ the potential at node $i$,
$$
(\mathbf{B}' \mathbf{y})_j = y_k - y_l \quad \text{if arc $j$ goes from node $l$ to node $k$},
$$
which is the negative of the voltage drop across arc $j$.

```{julia}
@variables y1 y2 y3 y4
B' * [y1, y2, y3, y4]
```

### Convolution and filtering

- Convolution plays an important role in signal processing, image processing, and (convolutional) neural networks.

- The **convolution** of $\mathbf{a} \in \mathbb{R}^n$ and $\mathbf{b} \in \mathbb{R}^m$ is a vector $\mathbf{c} \in \mathbb{R}^{n+m-1}$ with entries
$$
c_k = \sum_{i+j=k+1} a_i b_j.
$$
Notation: $\mathbf{c} = \mathbf{a} * \mathbf{b}$.

- Example: $n=4$, $m=3$,
$$
\mathbf{a} = \begin{pmatrix} a_1 \\ a_2 \\ a_3 \\ a_4 \end{pmatrix}, \quad \mathbf{b} = \begin{pmatrix} b_1 \\ b_2 \\ b_3 \end{pmatrix}.
$$
Then $\mathbf{c} = \mathbf{a} * \mathbf{b} \in \mathbb{R}^6$ has entries
\begin{eqnarray*}
c_1 &=& a_1 b_1 \\
c_2 &=& a_1 b_2 + a_2 b_1 \\
c_3 &=& a_1 b_3 + a_2 b_2 + a_3 b_1 \\
c_4 &=& a_2 b_3 + a_3 b_2 + a_4 b_1 \\
c_5 &=& a_3 b_3 + a_4 b_2 \\
c_6 &=& a_4 b_3
\end{eqnarray*}

- Polynomial interpretation: Let $\mathbf{a}$ and $\mathbf{b}$ be the coefficients in polynomials
\begin{eqnarray*}
p(x) &=& a_1 + a_2 x + \cdots + a_n x^{n-1} \\
q(x) &=& b_1 + b_2 x + \cdots + b_m x^{m-1},
\end{eqnarray*}
then $\mathbf{c} = \mathbf{a} * \mathbf{b}$ gives the coefficients of the product polynomial
$$
p(x) q(x) = c_1 + c_2 x + \cdots + c_{n+m-1} x^{m+n-2}.
$$

```{julia}
n, m = 4, 3
@variables a[1:n] b[1:m]
```

```{julia}
p = Polynomial(a)
```

```{julia}
q = Polynomial(b)
```

```{julia}
p * q
```

```{julia}
coeffs(p * q)
```

```{julia}
a = [1, 0, -1, 2]
b = [2, 1, -1]
c = DSP.conv(a, b)
```

- Probabilistic interpretation. In probability and statistics, convolution often appears when computing the distribution of the sum of two independent random variables. Let $X$ be a discrete random variable taking value $i \in \{0,1,\ldots,n-1\}$ with probability $a_{i+1}$ and $Y$ be another discrete random variable taking value $j \in \{0,1,\ldots,m-1\}$ with probability $b_{j+1}$. Assuming $X$ is independent of $Y$, the distribution of $Z = X+Y$ is
$$
c_k = \mathbb{P}(Z = k - 1) = \sum_{i+j=k-1} \mathbb{P}(X = i) \mathbb{P}(Y = j) = \sum_{i+j=k-1} a_{i+1} b_{j+1} = \sum_{i'+j'=k+1} a_{i'} b_{j'}
$$
for $k = 1,\ldots,n+m-1$. In the probability setting, the polynomial $p$ is called the probability generating function of the discrete random variable $X$.

```{julia}
n, m, p = 10, 3, 0.5
# pmf of X ∼ Bin(n-1, p)
a = [Distributions.pdf(Binomial(n-1, p), j) for j in 0:n-1]
```

```{julia}
# pmf of Y ∼ Unif(0, m-1)
b = [Distributions.pdf(DiscreteUniform(0, m - 1), j) for j in 0:m-1]
```

```{julia}
# compute pmf of Z = X + Y by convolution: c = a * b
c = DSP.conv(a, b)
```

```{julia}
plta = bar(0:(n - 1), a, label="X∼Bin(9, 0.5)")
pltb = bar(0:(m - 1), b, label="Y∼Unif(0, 2)")
# Z takes values 0, 1, ..., n + m - 2
pltc = bar(0:(n + m - 2), c, label="Z=X+Y")
plot(plta, pltb, pltc, layout=(1,3), size=(1000, 300), ylim=[0, 0.4])
```

- Properties of convolution
    - symmetric: $\mathbf{a} * \mathbf{b} = \mathbf{b} * \mathbf{a}$.
    - associative: $(\mathbf{a} * \mathbf{b}) * \mathbf{c} = \mathbf{a} * \mathbf{b} * \mathbf{c} = \mathbf{a} * (\mathbf{b} * \mathbf{c})$.
    - If $\mathbf{a} * \mathbf{b} = \mathbf{0}$, then $\mathbf{a} = \mathbf{0}$ or $\mathbf{b} = \mathbf{0}$.
These properties follow from either the polynomial interpretation or the probabilistic interpretation.

- $\mathbf{c} = \mathbf{a} * \mathbf{b}$ is a linear function of $\mathbf{b}$ if we fix $\mathbf{a}$.

- $\mathbf{c} = \mathbf{a} * \mathbf{b}$ is a linear function of $\mathbf{a}$ if we fix $\mathbf{b}$.

- For $n=4$ and $m=3$,

```{julia}
n, m = 4, 3
@variables a[1:n] b[1:m]
```

```{julia}
# Toeplitz matrix corresponding to the vector a
Ta = diagm(6, 3,
    0 => [a[1], a[1], a[1]],
    -1 => [a[2], a[2], a[2]],
    -2 => [a[3], a[3], a[3]],
    -3 => [a[4], a[4], a[4]]
)
```

```{julia}
c = Ta * b
```

```{julia}
# c = Ta * b
Symbolics.scalarize(c)
```

```{julia}
# Toeplitz matrix corresponding to the vector b
Tb = diagm(6, 4,
    0 => [b[1], b[1], b[1], b[1]],
    -1 => [b[2], b[2], b[2], b[2]],
    -2 => [b[3], b[3], b[3], b[3]]
)
```

```{julia}
# c = Tb * a
Symbolics.scalarize(Tb * a)
```

- The convolution matrices $\mathbf{T}_a$ and $\mathbf{T}_b$ are examples of **Toeplitz matrices**.

- When one vector is much longer than the other, say $m \ll n$, we can view the longer vector $\mathbf{a} \in \mathbb{R}^n$ as the **signal** and take the inner product of the **filter** (also called **kernel**) $(b_m, \cdots, b_1) \in \mathbb{R}^m$ with a sliding window of $\mathbf{a}$.

- If we apply the filter (or kernel) in its original order $(b_1, \cdots, b_m)$ to the sliding window of the signal, the operation is called **correlation** instead of **convolution**. For a symmetric filter (or kernel), correlation is the same as convolution.

- (2D) Filtering is extensively used in image processing.

```{julia}
img = testimage("mandrill")
```

```{julia}
# correlation with Gaussian kernel
imgg = imfilter(img, Kernel.gaussian(3))
```

```{julia}
?Kernel.gaussian()
```

```{julia}
# Gaussian kernel with σ = 3
Kernel.gaussian(3)
```

```{julia}
# convolution with Gaussian kernel is same as correlation, since Gaussian kernel is symmetric
imgg = imfilter(img, reflect(Kernel.gaussian(3)))
```

```{julia}
# correlation with Laplacian kernel
imgl = imfilter(img, Kernel.Laplacian())
```

```{julia}
# Laplacian kernel
Kernel.Laplacian()
```

```{julia}
?Kernel.Laplacian()
```

```{julia}
convert(AbstractArray, Kernel.Laplacian())
```

## Affine functions

- A function $f: \mathbb{R}^n \mapsto \mathbb{R}^m$ is **affine** if
$$
f(\alpha \mathbf{x} + \beta \mathbf{y}) = \alpha f(\mathbf{x}) + \beta f(\mathbf{y})
$$
for all vectors $\mathbf{x}, \mathbf{y} \in \mathbb{R}^n$ and scalars $\alpha, \beta$ with $\alpha + \beta = 1$.

- Any linear function is affine. Why?

- The definition of an affine function implies that
$$
f(\alpha_1 \mathbf{x}_1 + \cdots + \alpha_p \mathbf{x}_p) = \alpha_1 f(\mathbf{x}_1) + \cdots + \alpha_p f(\mathbf{x}_p)
$$
for all vectors $\mathbf{x}_1, \ldots, \mathbf{x}_p$ and scalars $\alpha_1, \ldots, \alpha_p$ such that $\alpha_1 + \cdots + \alpha_p = 1$.

- Any affine function $f: \mathbb{R}^n \mapsto \mathbb{R}^m$ is of the form $\mathbf{A} \mathbf{x} + \mathbf{b}$ for some matrix $\mathbf{A} \in \mathbb{R}^{m \times n}$ and vector $\mathbf{b} \in \mathbb{R}^m$ and vice versa.
    1. For fixed $\mathbf{A} \in \mathbb{R}^{m \times n}$ and $\mathbf{b} \in \mathbb{R}^m$, define the function $f: \mathbb{R}^n \mapsto \mathbb{R}^m$ by a matrix-vector product plus a constant:
$$
f(\mathbf{x}) = \mathbf{A} \mathbf{x} + \mathbf{b}.
$$ Then $f$ is an affine function: if $\alpha + \beta = 1$, then $$ f(\alpha \mathbf{x} + \beta \mathbf{y}) = \mathbf{A}(\alpha \mathbf{x} + \beta \mathbf{y}) + \mathbf{b} = \alpha \mathbf{A} \mathbf{x} + \beta \mathbf{A} \mathbf{y} + \alpha \mathbf{b} + \beta \mathbf{b} = \alpha(\mathbf{A} \mathbf{x} + \mathbf{b}) + \beta (\mathbf{A} \mathbf{y} + \mathbf{b}) = \alpha f(\mathbf{x}) + \beta f(\mathbf{y}). $$ 2. Any affine function can be written as $f(\mathbf{x}) = \mathbf{A} \mathbf{x} + \mathbf{b}$ with $$ \mathbf{A} = \begin{pmatrix} f(\mathbf{e}_1) - f(\mathbf{0}) \quad f(\mathbf{e}_2) - f(\mathbf{0}) \quad \ldots \quad f(\mathbf{e}_n) - f(\mathbf{0}) \end{pmatrix} $$ and $\mathbf{b} = f(\mathbf{0})$. Hint: Write $\mathbf{x} = x_1 \mathbf{e}_1 + \cdots + x_n \mathbf{e}_n + (1 - x_1 - \cdots - x_n) \mathbf{0}$. ## Affine approximation of a function (important) - The **first-order Taylor approximation of a differentiable function $f: \mathbb{R} \mapsto \mathbb{R}$ at a point $z \in \mathbb{R}$** is $$ \widehat f(x) = f(z) + f'(z) (x - z), $$ where $f'(z)$ is the **derivative** of $f$ at $z$. - Example: $f(x) = e^{2x} - x$. The derivative is $f'(x) = 2e^{2x} - 1$. Then the first-order Taylor approximation of $f$ at point $z$ is $$ f(x) \approx e^{2z} - z + (2e^{2z} - 1) (x - z). $$ ```{julia} x = Vector(-1:0.1:1) f1(x) = exp(2x) - x # draw the function plt = plot(x, f1.(x), label = "f(x)") # draw the affine approximation at z=0 z = 0.0 @show ForwardDiff.derivative(f1, z) # let computer calculate derivative! f̂1 = f1(z) .+ ForwardDiff.derivative(f1, z) .* (x .- z) plot!(plt, x, f̂1, label = "f(0)+f'(0)*(x-0)", legend = :topleft) ``` - The **first-order Taylor approximation of a differentiable function $f: \mathbb{R}^n \mapsto \mathbb{R}$ at a point $\mathbf{z} \in \mathbb{R}^n$** is $$ \widehat f(\mathbf{x}) = f(\mathbf{z}) + [\nabla f(\mathbf{z})]' (\mathbf{x} - \mathbf{z}), $$ where $$ \nabla f(\mathbf{z}) = \begin{pmatrix} \frac{\partial f}{\partial x_1}(\mathbf{z}) \\ \vdots \\ \frac{\partial f}{\partial x_n}(\mathbf{z}) \end{pmatrix} $$ is the **gradient** of $f$ evaluated at point $\mathbf{z}$. - An example with $n=2$: The gradient of $$ f(\mathbf{x}) = f(x_1, x_2) = e^{2x_1 + x_2} - x_1 $$ is $$ \nabla f(\mathbf{x}) = \begin{pmatrix} 2 e^{2x_1 + x_2} - 1\\ e^{2x_1 + x_2} \end{pmatrix}. $$ The first-order Taylor approximation of $f$ at a point $\mathbf{z}$ is $$ f(\mathbf{x}) \approx e^{2z_1 + z_2} - z_1 + \begin{pmatrix} 2 e^{2z_1 + z_2} - 1 \,\, e^{2z_1 + z_2} \end{pmatrix} \begin{pmatrix} x_1 - z_1 \\ x_2 - z_2 \end{pmatrix}. $$ ```{julia} # a non-linear function f f2(x) = exp(2x[1] + x[2]) - x[1] # define grid n = 20 grid = range(-1, 1, length = n) # draw the first-order approximation at (0,0) z = [0.0, 0.0] @show ForwardDiff.gradient(f2, z) # let computer calculate gradient! 
f2_approx(x) = f2(z) + ForwardDiff.gradient(f2, z)' * (x - z)
f̂2x = [f2_approx([grid[row], grid[col]]) for row in 1:n, col in 1:n]
plt = wireframe(grid, grid, f̂2x, label="fhat")
# draw the function
f2x = [f2([grid[row], grid[col]]) for row in 1:n, col in 1:n]
wireframe!(plt, grid, grid, f2x, label="f")
```

- Affine approximation: The **first-order Taylor approximation of a differentiable function $f: \mathbb{R}^n \mapsto \mathbb{R}^m$ at a point $\mathbf{z} \in \mathbb{R}^n$** is
$$
\widehat f(\mathbf{x}) = f(\mathbf{z}) + [\operatorname{D} f(\mathbf{z})] (\mathbf{x} - \mathbf{z}),
$$
where $\operatorname{D} f(\mathbf{z}) \in \mathbb{R}^{m \times n}$ is the **derivative matrix** or **Jacobian matrix** or **differential** evaluated at $\mathbf{z}$
$$
\operatorname{D} f(\mathbf{z}) = \begin{pmatrix}
\frac{\partial f_1}{\partial x_1} (\mathbf{z}) & \frac{\partial f_1}{\partial x_2} (\mathbf{z}) & \cdots & \frac{\partial f_1}{\partial x_n} (\mathbf{z}) \\
\frac{\partial f_2}{\partial x_1} (\mathbf{z}) & \frac{\partial f_2}{\partial x_2} (\mathbf{z}) & \cdots & \frac{\partial f_2}{\partial x_n} (\mathbf{z}) \\
\vdots & \vdots & & \vdots \\
\frac{\partial f_m}{\partial x_1} (\mathbf{z}) & \frac{\partial f_m}{\partial x_2} (\mathbf{z}) & \cdots & \frac{\partial f_m}{\partial x_n} (\mathbf{z})
\end{pmatrix} = \begin{pmatrix}
\nabla f_1(\mathbf{z})' \\
\nabla f_2(\mathbf{z})' \\
\vdots \\
\nabla f_m(\mathbf{z})'
\end{pmatrix}.
$$

- An example with $n=m=2$:
$$
f(\mathbf{x}) = \begin{pmatrix} f_1(\mathbf{x}) \\ f_2(\mathbf{x}) \end{pmatrix} = \begin{pmatrix} e^{2x_1 + x_2} - x_1 \\ x_1^2 - x_2 \end{pmatrix}.
$$
The derivative matrix is
$$
\operatorname{D} f(\mathbf{x}) = \begin{pmatrix}
2e^{2x_1 + x_2} - 1 & e^{2x_1 + x_2} \\
2x_1 & -1
\end{pmatrix}.
$$
The first-order approximation of $f$ around $\mathbf{z} = \mathbf{0}$ is
$$
f(\mathbf{x}) \approx \begin{pmatrix} 1 \\ 0 \end{pmatrix} + \begin{pmatrix} 1 & 1 \\ 0 & -1 \end{pmatrix} \begin{pmatrix} x_1 - 0 \\ x_2 - 0 \end{pmatrix}.
$$

```{julia}
# a nonlinear function
f3(x) = [exp(2x[1] + x[2]) - x[1], x[1]^2 - x[2]]
z = [0, 0]
ForwardDiff.jacobian(f3, [0, 0])
```

## Computational complexity (important)

- Matrix-vector product: $\mathbf{y} = \mathbf{A} \mathbf{x}$ where $\mathbf{A} \in \mathbb{R}^{m \times n}$ and $\mathbf{x} \in \mathbb{R}^n$. $2mn$ flops.

- Special cases:
    - $\mathbf{A} \in \mathbb{R}^{n \times n}$ is diagonal: $\mathbf{A} \mathbf{x}$ takes $n$ flops.
    - $\mathbf{A} \in \mathbb{R}^{n \times n}$ is lower triangular: $\mathbf{A} \mathbf{x}$ takes $n^2$ flops.
    - $\mathbf{A} \in \mathbb{R}^{m \times n}$ is sparse: between $\text{nnz}(\mathbf{A})$ and $2\text{nnz}(\mathbf{A})$ flops, much less than $2mn$.

- Matrix-matrix product: $\mathbf{C} = \mathbf{A} \mathbf{B}$ where $\mathbf{A} \in \mathbb{R}^{m \times n}$ and $\mathbf{B} \in \mathbb{R}^{n \times p}$. The naive algorithm takes $2mnp$ flops. [Strassen's algorithm](https://en.wikipedia.org/wiki/Strassen_algorithm) reduces multiplication of two square matrices to $O(n^{\log_2 7})$ flops. See also recent breakthroughs by [AlphaTensor](https://www.nature.com/nature/volumes/610/issues/7930).
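To see where the $2mnp$ count comes from, here is a minimal triple-loop sketch of matrix-matrix multiplication (for illustration only; the function name `naive_matmul` is ours, and in practice one should always call the optimized `*`): each of the $mp$ entries of $\mathbf{C}$ requires $n$ multiplications and $n$ additions.

```{julia}
# naive triple-loop matrix-matrix multiplication (illustration of the 2mnp flop count)
function naive_matmul(A, B)
    m, n = size(A)
    n2, p = size(B)
    @assert n == n2 "inner dimensions must match"
    C = zeros(m, p)
    for i in 1:m, j in 1:p
        for k in 1:n
            C[i, j] += A[i, k] * B[k, j]  # one multiplication + one addition
        end
    end
    return C
end

# sanity check against the built-in multiplication
Atest, Btest = randn(3, 4), randn(4, 5)
naive_matmul(Atest, Btest) ≈ Atest * Btest
```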
```{julia}
Random.seed!(216)
n = 2000
A = Diagonal(randn(n))
```

```{julia}
Ads = convert(Matrix{Float64}, A)
```

```{julia}
x = randn(n)
```

```{julia}
# full matrix times vector
@benchmark $Ads * $x
```

```{julia}
# diagonal matrix times vector
@benchmark $A * $x
```

```{julia}
# benchmark
Random.seed!(216)
n = 2000
A = LowerTriangular(randn(n, n))
```

```{julia}
Ads = convert(Matrix{Float64}, A)
```

```{julia}
x = randn(n)
# full matrix times vector
@benchmark $Ads * $x
```

```{julia}
# lower triangular matrix times vector
@benchmark $A * $x
```

- Matrix-matrix product: $\mathbf{C} = \mathbf{A} \mathbf{B}$, where $\mathbf{A} \in \mathbb{R}^{m \times n}$ and $\mathbf{B} \in \mathbb{R}^{n \times p}$. $2mnp$ flops.

- Exercise: Evaluate $\mathbf{y} = \mathbf{A} \mathbf{B} \mathbf{x}$, where $\mathbf{A}, \mathbf{B} \in \mathbb{R}^{n \times n}$, in two ways:
    - $\mathbf{y} = (\mathbf{A} \mathbf{B}) \mathbf{x}$. $2n^3$ flops.
    - $\mathbf{y} = \mathbf{A} (\mathbf{B} \mathbf{x})$. $4n^2$ flops.

    Which method is faster?

```{julia}
Random.seed!(216)
n = 2000
A = randn(n, n)
B = randn(n, n)
x = randn(n);
```

```{julia}
@benchmark ($A * $B) * $x
```

```{julia}
@benchmark $A * ($B * $x)
```

- Exercise: Evaluate $\mathbf{y} = (\mathbf{I} + \mathbf{u} \mathbf{v}') \mathbf{x}$, where $\mathbf{u}, \mathbf{v}, \mathbf{x} \in \mathbb{R}^n$, in two ways:
    - Evaluate $\mathbf{A} = \mathbf{I} + \mathbf{u} \mathbf{v}'$, then $\mathbf{y} = \mathbf{A} \mathbf{x}$. $3n^2$ flops.
    - Evaluate $\mathbf{y} = \mathbf{x} + (\mathbf{v}'\mathbf{x}) \mathbf{u}$. $4n$ flops.

    Which method is faster?

```{julia}
Random.seed!(216)
n = 2000
u = randn(n)
v = randn(n)
x = randn(n);
```

```{julia}
# method 1
@benchmark (I + $u * $v') * $x
```

```{julia}
# method 2
@benchmark $x + dot($v, $x) * $u
```

## Matrix with orthonormal columns, orthogonal matrix, QR factorization

- A set of vectors $\mathbf{a}_1, \ldots, \mathbf{a}_k \in \mathbb{R}^{n}$ are orthonormal if the matrix $\mathbf{A} = (\mathbf{a}_1 \, \ldots \, \mathbf{a}_k) \in \mathbb{R}^{n \times k}$ satisfies
$$
\mathbf{A}' \mathbf{A} = \mathbf{I}_k.
$$
In this case, we say $\mathbf{A}$ has orthonormal columns, or $\mathbf{A}$ lives in the Stiefel manifold.

- When $k = n$, a square matrix $\mathbf{A}$ with $\mathbf{A}'\mathbf{A} = \mathbf{I}_n$ is called an **orthogonal matrix**. Examples: $\mathbf{I}_n$, permutation matrices, rotation matrices, ...

```{julia}
# three orthonormal vectors in ℝ³
a1 = [0, 0, 1]
a2 = (1 / sqrt(2)) * [1, 1, 0]
a3 = (1 / sqrt(2)) * [1, -1, 0]
# a 3x3 orthogonal matrix
A = [a1 a2 a3]
```

```{julia}
A'A
```

### Properties of matrices with orthonormal columns

Assume $\mathbf{A} \in \mathbb{R}^{m \times n}$ has orthonormal columns. The corresponding linear mapping is $f:\mathbb{R}^n \mapsto \mathbb{R}^m$ defined by $f(\mathbf{x}) = \mathbf{A} \mathbf{x}$.

1. $f$ is *norm preserving*: $\|\mathbf{A} \mathbf{x}\| = \|\mathbf{x}\|$. Proof: $\|\mathbf{A} \mathbf{x}\|^2 = \mathbf{x}' \mathbf{A}' \mathbf{A} \mathbf{x} = \mathbf{x}' \mathbf{x} = \|\mathbf{x}\|^2$.

2. $f$ preserves the inner product between vectors: $(\mathbf{A} \mathbf{x})'(\mathbf{A} \mathbf{y}) = \mathbf{x}'\mathbf{y}$. Proof: $(\mathbf{A} \mathbf{x})'(\mathbf{A} \mathbf{y}) = \mathbf{x}' \mathbf{A}' \mathbf{A} \mathbf{y} = \mathbf{x}'\mathbf{y}$.

3. $f$ preserves the angle between vectors: $\angle (\mathbf{A} \mathbf{x}, \mathbf{A} \mathbf{y}) = \angle (\mathbf{x}, \mathbf{y})$.
Proof: $$ \angle (\mathbf{A} \mathbf{x}, \mathbf{A} \mathbf{y}) = \arccos \frac{(\mathbf{A} \mathbf{x})'(\mathbf{A} \mathbf{y})}{\|\mathbf{A} \mathbf{x}\|\|\mathbf{A} \mathbf{y}\|} = \arccos \frac{\mathbf{x}'\mathbf{y}}{\|\mathbf{x}\|\|\mathbf{y}\|}. $$ ```{julia} x = [1, 2] y = [3, 2] θ = π / 4 A = [cos(θ) -sin(θ); sin(θ) cos(θ)] Ax = A * x Ay = A * y x_vals = [0 0 0 0; x[1] y[1] Ax[1] Ay[1]] y_vals = [0 0 0 0; x[2] y[2] Ax[2] Ay[2]] plt = plot(x_vals, y_vals, arrow = true, color = :blue, legend = :none, xlims = (-4, 4), ylims = (-2, 4), annotations = [((x .+ 0.2)..., "x"), ((y .+ 0.2)..., "y"), ((Ax .+ 0.2)..., "Ax"), ((Ay .+ 0.2)..., "Ay")], xticks = -3:1:3, yticks = -1:1:3, framestyle = :origin, aspect_ratio = :equal) ``` ### QR factorization - Recall in the [Gram-Schmidt algorithm](https://ucla-biostat-216.github.io/2023fall/slides/02-vector/02-vector.html#how-to-orthonormalize-a-set-of-vectors-gram-schmidt-algorithm), given input vectors $\mathbf{a}_1, \ldots, \mathbf{a}_k \in \mathbb{R}^n$, the Gram-Schmidt algorithm outputs orthonormal vectors $\mathbf{q}_1, \ldots, \mathbf{q}_k$. This can be compactly expressed as $$ \mathbf{A} = \mathbf{Q} \mathbf{R}, $$ where $\mathbf{A} = (\mathbf{a}_1 \, \cdots \, \mathbf{a}_k) \in \mathbb{R}^{n \times k}$, $\mathbf{Q} = (\mathbf{q}_1 \, \cdots \, \mathbf{q}_k) \in \mathbb{R}^{n \times k}$ satisfying $\mathbf{Q}'\mathbf{Q} = \mathbf{I}_k$ ($\mathbf{Q}$ has orthonormal columns), and $\mathbf{R} \in \mathbb{R}^{k \times k}$ is an upper triangular matrix with positive diagonal elements. - What are the entries in $\mathbf{R}$? In the $i$-th iteration of the Gram-Schmidt algorithm \begin{eqnarray*} \text{Orthogonalization step: } & & \tilde{\mathbf{q}}_i = \mathbf{a}_i - (\mathbf{q}_1' \mathbf{a}_i) \mathbf{q}_1 - \cdots - (\mathbf{q}_{i-1}' \mathbf{a}_i) \mathbf{q}_{i-1} \\ \text{Normalization step: } & & \mathbf{q}_i = \tilde{\mathbf{q}}_i / \|\tilde{\mathbf{q}}_i\|. \end{eqnarray*} Therefore, $$ \mathbf{a}_i = (\mathbf{q}_1' \mathbf{a}_i) \mathbf{q}_1 + \cdots + (\mathbf{q}_{i-1}' \mathbf{a}_i) \mathbf{q}_{i-1} + \|\tilde{\mathbf{q}}_i\| \mathbf{q}_i. $$ This tells us $R_{ij} = \mathbf{q}_i' \mathbf{a}_j$ for $i < j$ and $R_{ii} = \|\tilde{\mathbf{q}}_i\|$. - Thus overall Gram-Schmidt algorithm performs $$ \mathbf{A} = \begin{pmatrix} | & | & & | & | \\ \mathbf{a}_1 & \mathbf{a}_2 & \cdots & \mathbf{a}_{k-1} & \mathbf{a}_k \\ | & | & & | & | \end{pmatrix} = \begin{pmatrix} | & | & & | & | \\ \mathbf{q}_1 & \mathbf{q}_2 & \cdots & \mathbf{q}_{k-1} & \mathbf{q}_k \\ | & | & & | & | \end{pmatrix} \begin{pmatrix} \|\tilde{\mathbf{q}}_1\| & \mathbf{q}_1'\mathbf{a}_2 & \cdots & \mathbf{q}_1' \mathbf{a}_{k-1} & \mathbf{q}_1' \mathbf{a}_k \\ & \|\tilde{\mathbf{q}}_2\| & \cdots & \mathbf{q}_2' \mathbf{a}_{k-1} & \mathbf{q}_2' \mathbf{a}_k \\ & & \ddots & \vdots & \vdots \\ & & & \|\tilde{\mathbf{q}}_{k-1}\| & \mathbf{q}_{k-1}' \mathbf{a}_k \\ & & & & \|\tilde{\mathbf{q}}_k\| \end{pmatrix} = \mathbf{Q} \mathbf{R}. $$ ```{julia} # HW2: BV 5.6 n = 5 A = [i ≤ j ? 1 : 0 for i in 1:n, j in 1:n] ``` ```{julia} # in computer, QR decomposition is performed by some algorithm other than Gram-Schmidt qr(A) ```
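A quick sanity check on the computed factorization (a minimal sketch): the factors returned by `qr` should satisfy $\mathbf{Q}'\mathbf{Q} = \mathbf{I}$ and $\mathbf{Q} \mathbf{R} = \mathbf{A}$. Note that the Householder-based routine does not enforce the positive-diagonal convention used by Gram-Schmidt, so the computed $\mathbf{R}$ may have negative diagonal entries.

```{julia}
# extract the factors and verify Q'Q = I and QR = A
F = qr(A)
Q = Matrix(F.Q)   # dense Q factor
R = F.R
@show Q'Q ≈ I
@show Q * R ≈ A;
```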