---
title: "More Linear Algebra"
output: pdf_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
library(knitr)
library(Matrix)
```

### Matrix Rank

A matrix can be expressed as a combination of the column vectors. 

\vfill

```{r}
X <- tibble(x_cat = sample(c('1','2'), 20, replace = T),
            x_contin = runif(20))

X_mat <- model.matrix(~x_cat + x_contin, data = X)
rankMatrix(X_mat)
```


\vfill
#### Vector space

As previously mentioned, a matrix can be constructed as a linear combination of vectors. Formally, a vector space is the combination of vectors that can be constructed with a set of vectors.  A vector space. $V_n$ has the two following properties. First assume $\underline{x}$ and $\underline{y}$ are in $V_n$, then

\vfill


\vfill


\vfill


\vfill
 If the spanning set are linearly independent, then the spanning set is a
 
\newpage

For purposes of this class, and most statistical applications, we will think about a vector space as the combination of column vectors from a matrix. This is referred to as a column space. Here the dimension of the column space corresponds to the rank of the matrix, $X$.

\vfill

#### Idempotent matrix


\vfill

```{r}
I <- diag(3); I
I %*% I

P <- matrix(c(3,1,-6,-2),2,2); P
P %*% P
```

\vfill

A square matrix $P$ is also a projection operator if $P = P \times P$. 


\vfill
\newpage
Let $\underline{y} \sim N(X \underline{\beta}, \sigma^2 I).$ We will derive this later, but consider the hat matrix $H = X(X^TX)^{-1} X^T$.

\vfill

Test $H \times H$

\vfill
\vfill

Let $\underline{y} \sim N(X \underline{\beta}, \sigma^2 I).$ Then following a simple example from Boik, $X = \begin{bmatrix} 1 &1\\ 1 & 2\\ 1 & 4 \end{bmatrix}$ and $y = \begin{bmatrix} 2 \\ 1 \\ 3 \end{bmatrix}$.

\vfill

The OLS estimate of $\underline{\beta}$, $\hat{\underline{\beta}}_{OLS} = \begin{bmatrix} 1 \\ \frac{3}{7} \end{bmatrix}.$ 
\vfill


Formally, $\underline{\beta}_{OLS} = (X^TX)^{-1} X^T$, so 
$$X \underline{\beta}_{OLS}= X(X^TX)^{-1} X^T \underline{y} = H \underline{y}$$

\vfill

```{r}
X <- matrix(c(1, 1, 1, 1, 2, 4),3, 2)
beta_hat <- c(1, 3/7)

X %*% beta_hat

H <- X %*% solve(t(X) %*% X) %*% t(X)
y <- c(2, 1, 3)
H %*% y
```

\newpage

![Projection, obtained from Boik 505 notes](Proj.pdf){width=65%}

\newpage


### Properties of random vectors and matrices


#### Expectation , Variance, Covariance

Expectation of a vector is fairly simple, formally 
$$E[\underline{y}] = \begin{pmatrix}E[y_1]\\ E[y_2]\\ \vdots \\ E[y_n]\end{pmatrix}= $$
\vfill

The same idea holds for a random matrix.

\vfill

If $\underline{y}$ is an $n \times 1$ vector and $\underline{x}$ is an $r \times 1$ vector, then.

\vfill


\vfill


\vfill

If $\underline{y} \sim N(\underline{\mu}, \Sigma)$ then the pdf of $\underline{y}$ is

$$p(\underline{y}) = |\Sigma|^{-1/2} (2 \pi)^{n/2}$$

\newpage

###### Partitioned Matrices

Now consider splitting the sampling units into two partitions such that $\underline{y} = \begin{bmatrix} \underline{y}_1 \\ \underline{y}_2 \end{bmatrix}$. Then,

\vfill

$$\begin{bmatrix}\underline{y}_1 \\ \underline{y}_2 \end{bmatrix} \sim N \left( \begin{bmatrix}X_1 \\ X_2 \end{bmatrix} \underline{\beta}  ,
\begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{12} & \Sigma_{22} \end{bmatrix}\right)$$

\vfill

Fundamentally, there is no change to the model, we have just created "groups" by partitioning the model. Do note that $\Sigma_{11}$ is an $n_1 \times n_1$ covariance matrix.

$$\Sigma_{11} = \begin{bmatrix}
\sigma^2_1 & \sigma_{12} & \cdots &\sigma_{1n_1} \\
\sigma_{22} & \sigma^2_2 &   \cdots &\sigma_{2n_1} \\
\sigma_{31} & \sigma_{32} &   \ddots &\sigma_{3n_1} \\
\vdots & \vdots &   \ddots &\vdots \\
\sigma_{n_1 1} & \sigma_{n_1 2} &   \ddots &\sigma^2_{n_1} 
\end{bmatrix}$$

\vfill


However, while $\Sigma_{12} = \Sigma_{21}^T$, neither of these are necessarily symmetric matrices. They also do not have any variance components, but rather just covariance terms. $\Sigma_{12}$ will be an $n_1 \times n_2$ matrix.

$$\Sigma_{11} = \begin{bmatrix}
\sigma_{1,n_1 +1} & \sigma_{1,n_1 + 2} & \cdots &\sigma_{1,n_1 + n_2} \\
\sigma_{2, n_1 + 1} & \sigma_{2, n1 + 2} &   \cdots &\sigma_{2,n_1 + n_2} \\
\vdots & \vdots &   \ddots &\vdots \\
\sigma_{n_1, n_1 + 1} & \sigma_{n_1, n_1 + 2} &   \ddots &\sigma_{n_1, n_1 + n_2} 
\end{bmatrix}$$

\newpage

### Conditional Multivariate Normal

Let $\underline{y_1}|\underline{Y_2}=\underline{y_2}$ be a conditional distribution for $\underline{y_1}$ given that $\underline{y_2}$ is known. Then

\vfill

$$\underline{y_1}|\underline{y_2} \sim N \left( X_1\beta + \Sigma_{12} \Sigma_{22}^{-1}\left(\underline{y_2} - X_2\beta \right), \Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21} \right)$$
\vfill

Now let's consider a few special cases (in the context of the DC housing dataset.)

\vfill

1. Let $\Sigma = \sigma^2 I$, then the batch of houses in group 1 are conditionally dependent from the houses in group 2 and 

$$\underline{y_1}|\underline{y_2}  \sim N \left( X_1\beta, \Sigma_{11} \right)$$
\vfill

2. Otherwise, let $\Sigma = \sigma^2 H$ and we'll assume $\Sigma_{12}$ has some non-zero elements. Then we have a more precise estimate of $\underline{y_1}$ as $\Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21}$ will be "less than" $\Sigma_{11}$ (that positive definite thing). Furthermore, the mean will shift such that highly correlated observations such as houses in close proximity (local model structure) will tend to differ from the global mean in the same fashion.

\vfill

\newpage

3. Let $n_1 = 1$ and $n_2 = 1$, then

$$\begin{bmatrix}y_1 \\ y_2 \end{bmatrix} \sim N \left( \begin{bmatrix}\mu_1 \\ \mu_2 \end{bmatrix}   ,
\begin{bmatrix} \sigma^2_{1} & \sigma_{12} \\ \sigma_{12} & \sigma_{2}^2 \end{bmatrix}\right)$$

and

$$y_1|y_2 \sim N \left( \mu_1 + \sigma_{12} (\sigma_{2}^{2})^{-1}\left(y_2 - \mu_2 \right), \sigma_{1}^2 - \sigma_{12} (\sigma_{2}^{2})^{-1} \sigma_{21} \right)$$
\vfill

Now consider an illustration for a couple simple scenarios. Let $\mu_1 = \mu_2 = 0$ and $\sigma^2_1 = \sigma^2_2 = 1$. Now assume $y_2 = -2$ and we compare the conditional distribution for a few values of $\sigma_{12}$.
\vfill

```{r, echo = F}
mu1 <- 0
mu2 <- 0
sigmasq1 <- sigmasq2 <- 1

dat_seq <- seq(-4,4, by = .01)
n_seq <- length(dat_seq)
tibble(group = rep(c('y2 = -2; sigma12 = 0',
                     'y2 = -2; sigma12 = .2',
                     'y2 = -2; sigma12 = .8'), each = n_seq), 
dens = c(dnorm(dat_seq, mu1 + 0*(1/sigmasq2)*(-2 - mu2),
               sqrt(sigmasq1 - 0 * (1/sigmasq2)*0 )), 
         dnorm(dat_seq, mu1 + .2*(1/sigmasq2)*(-2 - mu2),
               sqrt(sigmasq1 - .2 * (1/sigmasq2)*2 )), 
         dnorm(dat_seq, mu1 + .8*(1/sigmasq2)*(-2 - mu2),
               sqrt(sigmasq1 - .8 * (1/sigmasq2)*.8 ))), 
y = rep(dat_seq, 3)) %>% 
  ggplot(aes(x=y, y = dens, group = group, color = group)) +
  geom_line() + theme_bw()
```


\vfill

One last note, the marginal distributions for any partition $\underline{y_1}$ are quite simple.

$$\underline{y_1} \sim N \left( X_1\beta, \Sigma_{11} \right)$$
or just

$$y_1 \sim N \left( X_1\beta, \sigma^2_{1} \right)$$
if $y_1$ is scalar.

\newpage

#### OLS / WLS

Let $\underline{y} \sim N(X \underline{\beta}, \Sigma),$ then the least squares estimate of $\underline{\beta}$ is the minimizer of

\vfill

\vfill

\vfill

\vfill

\newpage

#### Linear Functions and Contrasts

Suppose we are interest in a linear combination of the model coefficients, such as $\psi = \beta_0 + \beta_1$ or $\psi = \beta_2 - \beta_3$. These can be constructed with a matrix C, s.t.


\vfill


\vfill

Contrasts are common in ANOVA models, where $y_{ij} = \mu + \tau_i + \epsilon_{ij}$, where $\tau_i$ is the effect for treatment $i$

\vfill

A linear function is __estimable__ if a linear unbiased estimator of the function exists. Mathematically, $\psi = \underline{c}^T \underline{\beta}$ is estimable if there exists a statistic, $\hat{\psi} = A^T \underline{y} + \underline{k}$ such that $E[\hat{\psi}] = \psi$.

\vfill


#### Best Linear Unbiased Estimator (BLUE)

Suppose that $C^T \underline{\beta}$ is an estimable function. Let $\hat{\psi}_c$ be any unbiased estimator of $C^T \underline{\beta}$. Meaning

1. $\hat{\psi}_c = A^t \underline{y} + \underline{l}$ and

2. $E[\hat{\psi}_c]= \underline{c}^t \underline{\beta} \;\; \forall \beta$

\vfill

\vfill


\vfill


\vfill


#### Gauss - Markov Theorem

The Gauss-Markov theorem typically states that if $\underline{y} \sim N(X \underline{\beta}, \sigma^2I)$, then the OLS estimator is the BLUE.  It can also be extended to the GLS setting, see Boik.

\vfill

It is important to mention a lesser known theorem (particularly in the classical linear models setting), __Stein's Paradox__ The associated James–Stein estimator dominates the "ordinary" least squares approach, meaning the James-Stein estimator has a lower or equal mean squared error than OLS. So biased, but lower overall MSE. __more later in the context of shrinkage and hierarchical models__.