---
title: "Introduction (ISL 1)"
subtitle: "Biostat 212A, Statistical Learning A"
author: "Dr. Jin Zhou @ UCLA"
date: today
format:
  html:
    theme: cosmo
    embed-resources: true
    number-sections: true
    toc: true
    toc-depth: 4
    toc-location: left
    code-fold: false
engine: knitr
knitr:
  opts_chunk:
    fig.align: 'center'
    # fig.width: 6
    # fig.height: 4
    message: FALSE
    cache: false
---

Credit: This note heavily uses material from the books [_An Introduction to Statistical Learning: with Applications in R_](https://www.statlearning.com/) (ISL2) and [_Elements of Statistical Learning: Data Mining, Inference, and Prediction_](https://hastie.su.domains/ElemStatLearn/) (ESL2).

Display system information for reproducibility.

::: {.panel-tabset}

#### R

```{r}
sessionInfo()
```

#### Python

```{python}
import IPython
print(IPython.sys_info())
```

:::

# Brief intro

- PhD in Biomathematics, UCLA (the department has since been renamed [UCLA Computational Medicine](https://compmed.ucla.edu/))
- Postdoc & Statistician, Harvard University
    + Department of Biostatistics
    + Channing Laboratory, Brigham and Women's Hospital
- Assistant & Associate Professor, University of Arizona (UofA)
    + Department of Epidemiology and Biostatistics
    + Statistics and Genetics Graduate Interdisciplinary Program (GIDP)
- Research principal investigator, Phoenix VA Health Care System
- Research principal investigator, Greater Los Angeles VA Health Care System
- Adjunct Associate Professor, UCLA
    + Department of Medicine Statistics Core (DOMStat)
- Associate Professor-in-Residence of Biostatistics, UCLA
    + Department of Biostatistics

# Overview of statistical/machine learning

In this class, we use the phrases **statistical learning**, **machine learning**, or simply **learning** interchangeably.

## Supervised vs unsupervised learning

- **Supervised learning**: input(s) -> output.
    - Prediction: the output is continuous (income, weight, BMI, ...).
    - Classification: the output is categorical (disease or not, pattern recognition, ...).
- **Unsupervised learning**: no output. We learn relationships and structure in the data.
    - Clustering.
    - Dimension reduction.

## Supervised learning

- **Predictors**
$$
X = \begin{pmatrix} X_1 \\ \vdots \\ X_p \end{pmatrix}.
$$
Also called **inputs**, **covariates**, **regressors**, **features**, or **independent variables**.

- **Outcome** $Y$ (also called the **output**, **response variable**, **dependent variable**, or **target**).
    - In the **regression problem**, $Y$ is quantitative (price, weight, BMI).
    - In the **classification problem**, $Y$ is categorical. That is, $Y$ takes values in a finite, unordered set (survived/died, whether a customer buys a product, digits 0-9, object in an image, cancer class of a tissue sample).

- We have training data $(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_n, y_n)$. These are observations (also called samples, instances, or cases); see the small simulated example after this list. Training data is often represented by a predictor matrix
$$
\mathbf{X} = \begin{pmatrix}
x_{11} & \cdots & x_{1p} \\
\vdots & \ddots & \vdots \\
x_{n1} & \cdots & x_{np}
\end{pmatrix} = \begin{pmatrix}
\mathbf{x}_1^T \\
\vdots \\
\mathbf{x}_n^T
\end{pmatrix}
$$ {#eq-predictor-matrix}
and a response vector
$$
\mathbf{y} = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}.
$$

- Based on the training data, our goal is to
    - accurately predict the unseen outcomes of test cases based on their predictors;
    - understand which predictors affect the outcome, and how;
    - assess the quality of our predictions and inferences.
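To make the notation concrete, below is a minimal simulated sketch with $n = 5$ observations and $p = 2$ predictors; the variable names and coefficients are made up for illustration.

```{r}
set.seed(212)
n <- 5
p <- 2
# Predictor matrix X: n rows (observations) by p columns (predictors)
X <- matrix(rnorm(n * p), nrow = n, ncol = p,
            dimnames = list(NULL, c("X1", "X2")))
# A quantitative response vector y makes this a regression problem;
# a categorical y would make it a classification problem
y <- 2 * X[, "X1"] - X[, "X2"] + rnorm(n)
X
y
```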
### Example: salary

- The `Wage` data set collects wage and other data for a group of 3,000 male workers in the Mid-Atlantic region in 2003-2009.
- Our goal is to establish the relationship between salary and demographic variables in population survey data.
- Since wage is a quantitative variable, it is a regression problem.

::: {.panel-tabset}

#### R

```{r}
#| message: false
library(gtsummary)
library(ISLR2)
library(tidyverse)

# Convert to tibble
Wage <- as_tibble(Wage) %>% print(width = Inf)

# Summary statistics
Wage %>% tbl_summary()

# Plot wage ~ age
Wage %>%
  ggplot(mapping = aes(x = age, y = wage)) +
  geom_point() +
  geom_smooth() +
  labs(title = "Wage changes nonlinearly with age",
       x = "Age",
       y = "Wage (k$)")

# Plot wage ~ year
Wage %>%
  ggplot(mapping = aes(x = year, y = wage)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(title = "Average wage increases by $10k in 2003-2009",
       x = "Year",
       y = "Wage (k$)")

# Plot wage ~ education
Wage %>%
  ggplot(mapping = aes(x = education, y = wage)) +
  geom_point() +
  geom_boxplot() +
  labs(title = "Wage increases with education level",
       x = "Education",
       y = "Wage (k$)")
```

#### Python

Summary statistics:

```{python}
# Load the pandas library
import pandas as pd
# Load numpy for array manipulation
import numpy as np
# Load seaborn plotting library
import seaborn as sns
import matplotlib.pyplot as plt

# Set font size in plots
sns.set(font_scale = 2)
# Display all columns
pd.set_option('display.max_columns', None)

# Import Wage data
Wage = pd.read_csv(
  "../data/Wage.csv",
  dtype = {
    'maritl':'category',
    'race':'category',
    'education':'category',
    'region':'category',
    'jobclass':'category',
    'health':'category',
    'health_ins':'category'
    }
  )
Wage
Wage.info()
```

```{python}
#| eval: false
# Summary statistics
Wage.describe(include = "all")
```

```{python}
#| label: fig-wage-age
#| fig-cap: "Wage changes nonlinearly with age."
# Plot wage ~ age
sns.lmplot(
  data = Wage,
  x = "age",
  y = "wage",
  lowess = True,
  scatter_kws = {'alpha' : 0.1},
  height = 8
  ).set(
  xlabel = 'Age',
  ylabel = 'Wage (k$)'
  )
```

```{python}
#| label: fig-wage-year
#| fig-cap: "Average wage increases by $10k in 2003-2009."
# Plot wage ~ year
sns.lmplot(
  data = Wage,
  x = "year",
  y = "wage",
  scatter_kws = {'alpha' : 0.1},
  height = 8
  ).set(
  xlabel = 'Year',
  ylabel = 'Wage (k$)'
  )
```

```{python}
#| label: fig-wage-edu
#| fig-cap: "Wage increases with education level."
# Plot wage ~ education
ax = sns.boxplot(
  data = Wage,
  x = "education",
  y = "wage"
  )
ax.set(
  xlabel = 'Education',
  ylabel = 'Wage (k$)'
  )
ax.set_xticklabels(ax.get_xticklabels(), rotation = 15)
```

```{python}
#| label: fig-wage-race
#| fig-cap: "Any income inequality?"
# Plot wage ~ race
ax = sns.boxplot(
  data = Wage,
  x = "race",
  y = "wage"
  )
ax.set(
  xlabel = 'Race',
  ylabel = 'Wage (k$)'
  )
ax.set_xticklabels(ax.get_xticklabels(), rotation = 15)
```

:::

### Example: stock market

```{r}
#| code-fold: true
library(quantmod)

SP500 <- getSymbols(
  "^GSPC",
  src = "yahoo",
  auto.assign = FALSE,
  from = "2022-01-01",
  to = "2022-12-31")

chartSeries(SP500, theme = chartTheme("white"),
            type = "line", log.scale = FALSE, TA = NULL)
```

- The `Smarket` data set contains daily percentage returns for the S&P 500 stock index between 2001 and 2005.
- Our goal is to predict whether the index will _increase_ or _decrease_ on a given day, using the past 5 days' percentage changes in the index.
- Since the outcome is binary (_increase_ or _decrease_), it is a classification problem.
- From the boxplots in @fig-sp500-lag, it seems that the previous 5 days' percentage returns do not discriminate whether today's return is positive or negative.

::: {.panel-tabset}

#### R

```{r}
#| label: fig-sp500-lag
#| fig-cap: "LagX is the percentage return X days before today."
# Data information
help(Smarket)

# Convert to tibble
Smarket <- as_tibble(Smarket) %>% print(width = Inf)

# Summary statistics
summary(Smarket)

# Plot Direction ~ Lag1, Direction ~ Lag2, ...
Smarket %>%
  pivot_longer(cols = Lag1:Lag5, names_to = "Lag", values_to = "Perc") %>%
  ggplot() +
  geom_boxplot(mapping = aes(x = Direction, y = Perc)) +
  labs(
    x = "Today's Direction",
    y = "Percentage change in S&P",
    title = "Up/down of S&P doesn't seem to depend on previous days' changes"
    ) +
  facet_wrap(~ Lag)
```

#### Python

```{python}
# Import S&P500 data
Smarket = pd.read_csv("../data/Smarket.csv")
Smarket
Smarket.info()
```

```{python}
#| eval: false
# Summary statistics
Smarket.describe(include = "all")
```

```{python}
# Pivot to long format for facet plotting
Smarket_long = pd.melt(
  Smarket,
  id_vars = ['Year', 'Volume', 'Today', 'Direction'],
  value_vars = ['Lag1', 'Lag2', 'Lag3', 'Lag4', 'Lag5'],
  var_name = 'Lag',
  value_name = 'Perc'
  )
Smarket_long

g = sns.FacetGrid(Smarket_long, col = "Lag", col_wrap = 3, height = 10)
g.map_dataframe(sns.boxplot, x = "Direction", y = "Perc")
plt.show()
```

:::

### Real Example (1)

**Genomic classifier for acute cellular rejection (ACR) among lung transplant recipients**

- **Lung transplant** recipients are at risk for acute cellular rejection (ACR), which is a major cause of morbidity and mortality.
- A genomic classifier:

![](./genomic_classifier.png){width=600px height=250px}

- RNA-seq data (the transcriptome) are high-dimensional.

![](./RNASequencing.png){width=500px height=400px} ![](./RNASeqCounts.png){width=500px height=200px}

- 183 lung transplant recipients, with expression levels measured on 58,735 gene transcripts.

![](./RNASeqRF.png){width=500px height=300px}

### Real Example (2)

**Dynamically Predicting Renal Complications After Development of Diabetes for Millions Across Biobanks: Transportability and Transferability**

- Diabetes is a major cause of kidney disease, which can lead to kidney failure and death.
- After developing diabetes, the risk of kidney disease varies across individuals.
- Goal: develop a prediction model for kidney disease after the development of diabetes.
- EHRs and biobanks provide a rich source of data for developing prediction models.
- An example of AI in healthcare:

![](./DP-KF.png){width=700px height=200px}

![](./schema.png){width=500px height=600px}

### Example: handwritten digit recognition

::: {#fig-handwritten-digits}

![](./ISL_fig_10_3a.pdf){width=500px height=300px}

![](./ISL_fig_10_3b.pdf){width=500px height=300px}

Examples of handwritten digits from the MNIST corpus (ISL Figure 10.3).

:::

- Input: 784 pixel values from $28 \times 28$ grayscale images. Output: digit 0, 1, ..., 9, i.e., a 10-class classification problem.
- On the [MNIST](https://en.wikipedia.org/wiki/MNIST_database) data set (60,000 training images, 10,000 test images), the error rates of the following methods have been reported:

| Method | Error rate |
|--------|------------|
| tangent distance with 1-nearest neighbor classifier | 1.1% |
| degree-9 polynomial SVM | 0.8% |
| LeNet-5 | 0.8% |
| boosted LeNet-4 | 0.7% |

### Example: more computer vision tasks

Some popular data sets from computer vision:

- [Fashion MNIST](https://github.com/zalandoresearch/fashion-mnist#fashion-mnist): 10-category classification.

![](./fashion-mnist-sprite.png){width=500px}

- [CIFAR-10 and CIFAR-100](https://www.cs.toronto.edu/~kriz/cifar.html)

![](./cifar10.jpg){width=500px}

- [ImageNet](https://www.image-net.org/)

![](./colah-KSH-results.png){width=500px}

- [Microsoft COCO](https://cocodataset.org/#home) (object detection, segmentation, and captioning)

![](./coco-examples.jpeg){width=500px}

- [ADE20K](http://groups.csail.mit.edu/vision/datasets/ADE20K/) (scene parsing)

![](./ade20k_examples.png){width=500px}

### Example: classify the pixels in a satellite image, by usage

::: {#fig-landsat}

![](./ESL_fig_13_6.pdf){width=750px height=750px}

LANDSAT images (ESL Figure 13.6).

:::

- LANDSAT: $82 \times 100$ pixels. Four heat-map images, two in the visible spectrum and two in the infrared, for an area of agricultural land in Australia.
- Each pixel has a class label from the 7-element set \{red soil, cotton, vegetation stubble, mixture, gray soil, damp gray soil, very damp gray soil\}, determined manually by research assistants surveying the area. The objective is to classify the land usage at a pixel, based on the information in the four spectral bands.

## Unsupervised learning

- No outcome variable, just predictors.
- The objective is more fuzzy: find groups that behave similarly, find features that behave similarly, find linear combinations of features with the most variation, build generative models (transformers).
- Difficult to know how well you are doing.
- Can be useful in exploratory data analysis (EDA) or as a pre-processing step for supervised learning.

### Example: gene expression

- The `NCI60` data set consists of 6,830 gene expression measurements for each of 64 cancer cell lines.

::: {.panel-tabset}

#### R

```{r}
# NCI60 data and cancer labels
str(NCI60)
# Cancer type of each cell line
table(NCI60$labs)
# Apply PCA using prcomp function
# Need to scale/normalize since PCA depends on the distance measure
prcomp(NCI60$data, scale = TRUE, center = TRUE, retx = TRUE)$x %>%
  as_tibble() %>%
  add_column(cancer_type = NCI60$labs) %>%
  # Plot PC2 vs PC1
  ggplot() +
  geom_point(mapping = aes(x = PC1, y = PC2, color = cancer_type)) +
  labs(title = "Gene expression profiles cluster according to cancer types")
```

#### Python

```{python}
# Import NCI60 data
nci60_data = pd.read_csv('../data/NCI60_data.csv')
nci60_labs = pd.read_csv('../data/NCI60_labs.csv')
nci60_data.info()
```

```{python}
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale

# Obtain the first 2 principal components
nci60_tr = scale(nci60_data, with_mean = True, with_std = True)
nci60_pc = pd.DataFrame(
  PCA(n_components = 2).fit(nci60_tr).transform(nci60_tr),
  columns = ['PC1', 'PC2']
  )
nci60_pc['PC2'] *= -1  # flip sign for easier comparison with R
nci60_pc['cancer_type'] = nci60_labs
nci60_pc
```

```{python}
# Plot PC2 vs PC1
sns.relplot(
  kind = 'scatter',
  data = nci60_pc,
  x = 'PC1',
  y = 'PC2',
  hue = 'cancer_type',
  height = 10
  )
```

:::

### Example: mapping people from their genomes

- The genetic makeup of $n$ individuals can be represented by a matrix @eq-predictor-matrix, where $x_{ij} \in \{0, 1, 2\}$ is the $j$-th genetic marker of the $i$-th individual. Is it possible to visualize the geographic relationship of these individuals?
- The following picture is from the article [_Genes mirror geography within Europe_](http://www.nature.com/nature/journal/v456/n7218/full/nature07331.html) by Novembre et al. (2008), published in _Nature_.

![](https://media.springernature.com/full/springer-static/image/art%3A10.1038%2Fnature07331/MediaObjects/41586_2008_Article_BFnature07331_Fig1_HTML.jpg?as=webp)

### Ancestry estimation

::: {#fig-open-admixture}

![](./OpenAdmixture.jpg){width=750px height=750px}

Unsupervised discovery of ancestry-informative markers and genetic admixture proportions. [Paper](https://doi.org/10.1016/j.ajhg.2022.12.008).

:::

## No easy answer

In modern applications, the line between supervised and unsupervised learning is blurred.

### Example: the Netflix prize

::: {#fig-netflix layout-ncol=1}

![](./netflix_matrix.png){fig-align="center"}

![](./netflix_prize.png){fig-align="center"}

The Netflix challenge.

:::

- The competition started in Oct 2006. The training data consists of ratings by 480,189 Netflix customers of 17,770 movies, each rating between 1 and 5.
- The training data is very sparse: about 98\% of the customer-movie ratings are missing.
- The objective is to predict the ratings for a set of 1 million customer-movie pairs that are missing in the training data.
- Netflix's in-house algorithm achieved a root mean squared error (RMSE) of 0.953. The first team to achieve a 10\% improvement wins one million dollars.
- Is this a supervised or unsupervised problem?
    - We can treat `rating` as the outcome and user-movie combinations as predictors. Then it is a supervised learning problem.
    - Or we can treat it as a matrix factorization or low-rank approximation problem. Then it is more of an unsupervised learning problem, similar to PCA.

### Example: large language models (LLMs)

Modern large language models, such as [ChatGPT o1](https://chat.openai.com), combine both supervised learning and reinforcement learning. (See [Google Trends](https://trends.google.com/trends/explore?date=2022-11-01%202024-01-05&geo=US&q=chatgpt&hl=en) for the explosion of public interest.)

## Statistical learning vs machine learning

- Machine learning arose as a subfield of Artificial Intelligence.
- Statistical learning arose as a subfield of Statistics.
- There is much overlap. Both fields focus on supervised and unsupervised problems.
- Machine learning has a greater emphasis on large scale applications and prediction accuracy.
- Statistical learning emphasizes models and their interpretability, as well as precision and uncertainty.
- But the distinction has become more and more blurred, and there is a great deal of "cross-fertilization".
- Machine learning has the upper hand in Marketing!

## A Brief History of Statistical Learning

![](https://people.idsia.ch/~juergen/deep-learning-history1508x932.png)

Image source: [Jürgen Schmidhuber](https://people.idsia.ch/~juergen/).

- 1676, chain rule by Leibniz.
- 1805, least squares / linear regression / shallow learning by Legendre and Gauss.
- 1936, classification by linear discriminant analysis by Fisher.
- 1940s, logistic regression.
- Early 1970s, generalized linear models (GLMs).
- Mid 1980s, classification and regression trees.
- 1980s, generalized additive models (GAMs).
- 1980s, neural networks gained popularity.
- 1990s, support vector machines.
- 2010s, deep learning.

# Course logistics

## Learning objectives

1. Understand what machine learning is (and isn't).
2. Learn some foundational methods/tools.
3. For specific data problems, be able to choose methods that make sense.

::: {.callout-tip}
Q: Wait, Dr. Zhou! Why don't we just learn the best method (aka deep learning) first?

A: No single method dominates. One method may prove useful in answering some questions on a given data set. On a related (not identical) data set or question, another might prevail. See this [article](https://arxiv.org/abs/2106.03253).
:::

## Syllabus

- Read the [syllabus](https://ucla-biostat-212a.github.io/2025winter/syllabus/syllabus.html) and [schedule](https://ucla-biostat-212a.github.io/2025winter/schedule/schedule.html) for a tentative list of topics and course logistics.
- Homework assignments will be a mix of theoretical/conceptual and applied/computational questions. Although not required, you are highly encouraged to practice literate programming (using Jupyter, Quarto, RMarkdown, or Google Colab), coordinated through Git/GitHub. This will enhance your GitHub profile and make you more appealing on the job market.
- However, I do not require homework submission through Git/GitHub; homework is submitted through BruinLearn.
- We will mainly use R in this course.

## What I expect from you

- You are curious and are excited about "figuring stuff out".
- You are proficient in coding and debugging (or are ready to work to get there).
- You have a solid foundation in introductory statistics (or are ready to work to get there).
- You are willing to ask questions.

## What you can expect from me

- I value your learning experience and process.
- I'm flexible with respect to the topics we cover.
- I'm happy to share my professional connections.
- I'll try my best to be responsive in class, in office hours, and in other professional encounters.

# Notation and Simple Matrix Algebra

## Notation

- We will use $n$ to represent the number of distinct data points, or observations, in our sample.
- We will let $p$ denote the number of variables that are available for use in making predictions.
    + For example, the `Wage` data set consists of 11 variables for 3,000 people, so we have $n = 3,000$ observations and $p = 11$ variables (such as year, age, race, and more).
    + $p$ can be quite large, on the order of thousands or even millions, as in modern biological data such as gene expression or DNA sequence data along the genome.
- We will let $x_{ij}$ represent the value of the $j$th variable for the $i$th observation, where $i = 1, 2, \ldots, n$ and $j = 1, 2, \ldots, p$.
- We will let $i$ be the index of the samples or observations (from $1$ to $n$), and $j$ will be used to index the variables (from $1$ to $p$).
- We let $\mathbf{X}$ denote an $n \times p$ matrix whose $(i,j)$th element is $x_{ij}$:
$$
\mathbf{X} = \begin{pmatrix}
x_{11} & \cdots & x_{1p} \\
\vdots & \ddots & \vdots \\
x_{n1} & \cdots & x_{np}
\end{pmatrix} = \begin{pmatrix}
x_1^T \\
\vdots \\
x_n^T
\end{pmatrix}
$$
- The rows of $\mathbf{X}$ we write as $x_1, x_2, \ldots, x_n$. Here $x_i$ is a vector of length $p$, containing the $p$ variable measurements for the $i$th observation. That is,
$$
x_i = \begin{pmatrix} x_{i1} \\ \vdots \\ x_{ip} \end{pmatrix}.
$$
**Note: vectors are by default represented as columns.**
- At other times we will instead be interested in the columns of $\mathbf{X}$, which we write as $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_p$. Each is a vector of length $n$. That is,
$$
\mathbf{x}_j = \begin{pmatrix} x_{1j} \\ \vdots \\ x_{nj} \end{pmatrix}.
$$
- Using this notation, the matrix $\mathbf{X}$ can be written as
$$
\mathbf{X} = (\mathbf{x}_1 \quad \mathbf{x}_2 \quad \cdots \quad \mathbf{x}_p),
$$
or
$$
\mathbf{X} = \begin{pmatrix}
x_{1}^T \\
\vdots \\
x_{n}^T
\end{pmatrix}.
$$
- We use $y_i$ to denote the $i$th observation of the variable on which we wish to make predictions (i.e., the "outcome"), such as wage. Hence, we write the set of all $n$ observations in vector form as
$$
\mathbf{y} = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}.
$$

Then our observed data consist of $\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$, where each $x_i$ is a vector of length $p$. (If $p = 1$, then $x_i$ is simply a scalar.)
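To connect this notation to data we have already seen, here is a minimal R sketch that pulls a small predictor matrix $\mathbf{X}$ and the response vector $\mathbf{y}$ out of the `Wage` data; restricting to the two numeric predictors `year` and `age` is purely for illustration.

```{r}
library(ISLR2)

# Predictor matrix X built from two numeric predictors (illustration only)
X <- as.matrix(Wage[, c("year", "age")])
dim(X)           # n = 3000 rows (observations), p = 2 columns (variables)
X[1, ]           # x_1: the p measurements for the first observation
head(X[, "age"]) # first entries of the length-n column vector for age
# Response vector y
y <- Wage$wage
head(y)
```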
## Matrix Algebra

- Matrices will be denoted using bold capitals, such as $\mathbf{A}$.
- To indicate that an object is a scalar, we will use the notation $a \in \mathbb{R}$.
- To indicate that it is a vector of length $k$, we will use $\mathbf{a} \in \mathbb{R}^k$ (or $\mathbf{a} \in \mathbb{R}^n$ if it is of length $n$).
- We will indicate that an object is an $r \times s$ matrix using $\mathbf{A} \in \mathbb{R}^{r \times s}$.

### Special cases of matrices

- A column vector is a matrix with only one column, e.g.,
$$
\mathbf{A} = \left(\begin{array}{c}
1 \\ 4 \\ 0 \\ -2
\end{array}\right)
$$
- A row vector is a matrix with only one row, e.g.,
$$
\mathbf{A} = \left(\begin{array}{cccc}
1 & 4 & 0 & -2
\end{array}\right)
$$
- A matrix with $r = s$, that is, with the same number of rows and columns, is called a **square matrix**. If a matrix $\mathbf{A}$ is square, the elements $a_{ii}$ are said to lie on the **diagonal** of $\mathbf{A}$.
$$
\mathbf{A} = \left(\begin{array}{cc}
1 & 4 \\
0 & -2
\end{array}\right)
$$
- A square matrix $\mathbf{A}$ is called **symmetric** if $a_{ij} = a_{ji}$ for all values of $i$ and $j$.
$$
\mathbf{A} = \left(\begin{array}{ccc}
3 & 5 & 7 \\
5 & 1 & 4 \\
7 & 4 & 8
\end{array}\right)
$$
Symmetric matrices turn out to be quite important in formulating statistical models for all types of data!
- An important special case of a square, symmetric matrix is the identity matrix, i.e., a square matrix with $1$s on the diagonal and $0$s elsewhere, e.g.,
$$
\mathbf{A} = \left(\begin{array}{ccc}
1 & 0 & 0 \\
0 & 1 & 0 \\
0 & 0 & 1
\end{array}\right)
$$
The identity matrix functions the same way as "$1$" does in the real number system.

### Matrix operations

#### Transpose

- The $^T$ notation denotes the transpose of a matrix or vector:
$$
\mathbf{X}^T = \begin{pmatrix}
x_{11} & \cdots & x_{n1} \\
\vdots & \ddots & \vdots \\
x_{1p} & \cdots & x_{np}
\end{pmatrix} = \begin{pmatrix}
x_1 & x_2 & \cdots & x_n
\end{pmatrix}
$$
So the transpose of an $n \times p$ matrix is a $p \times n$ matrix. That is, the transpose of $\mathbf{A}$ is the matrix found by "flipping" the matrix about its diagonal.
- For example,
$$
\mathbf{A} = \left(\begin{array}{ccc}
1 & 2 & 3 \\
4 & 5 & 6
\end{array}\right)
\quad
\mathbf{A}^T = \left(\begin{array}{cc}
1 & 4 \\
2 & 5 \\
3 & 6
\end{array}\right)
$$
A **fundamental property of a symmetric matrix** is that the matrix and its transpose are the same; i.e., if $\mathbf{A}$ is symmetric, then $\mathbf{A} = \mathbf{A}^T$. (Try it on the symmetric matrix above.)

#### Matrix Addition and Subtraction

Adding or subtracting two matrices is defined **element-by-element**. That is, to add two matrices, add their corresponding elements, e.g., for
$$
\mathbf{A} = \left(\begin{array}{cc}
1 & 2 \\
4 & 5
\end{array}\right)
\quad
\mathbf{B} = \left(\begin{array}{cc}
6 & 4 \\
2 & -1
\end{array}\right)
$$
we have
$$
\mathbf{A} + \mathbf{B} = \left(\begin{array}{cc}
7 & 6 \\
6 & 4
\end{array}\right)
\quad
\mathbf{A} - \mathbf{B} = \left(\begin{array}{cc}
-5 & -2 \\
2 & 6
\end{array}\right)
$$

- Note that these operations only make sense if the two matrices have the **same dimension**; the operations are not defined otherwise.
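These operations are easy to verify numerically. Below is a quick R check using the matrices $\mathbf{A}$, $\mathbf{B}$, and the symmetric matrix from above (recall that R's `matrix()` fills column by column).

```{r}
# A and B from the addition/subtraction example above
A <- matrix(c(1, 4, 2, 5), nrow = 2)   # rows: (1, 2) and (4, 5)
B <- matrix(c(6, 2, 4, -1), nrow = 2)  # rows: (6, 4) and (2, -1)
t(A)    # transpose
A + B   # element-by-element addition
A - B   # element-by-element subtraction
# The symmetric matrix from above equals its own transpose
S <- matrix(c(3, 5, 7, 5, 1, 4, 7, 4, 8), nrow = 3)
identical(S, t(S))
```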
#### Matrix Multiplication

- The effect of multiplying a matrix $\mathbf{A}$ of any dimension by a real number (scalar) $b$, say, is to multiply each element in $\mathbf{A}$ by $b$:
$$
3 \left(\begin{array}{cc}
1 & 2 \\
4 & 5
\end{array}\right) = \left(\begin{array}{cc}
3 & 6 \\
12 & 15
\end{array}\right)
$$
- General rules:
    + $\mathbf{A} + \mathbf{B} = \mathbf{B} + \mathbf{A}$, $b(\mathbf{A} + \mathbf{B}) = b\mathbf{A} + b\mathbf{B}$
    + $(\mathbf{A} + \mathbf{B})^T = \mathbf{A}^T + \mathbf{B}^T$, $(b\mathbf{A})^T = b\mathbf{A}^T$
- Order matters:
    + The number of columns of the first matrix must equal the number of rows of the second matrix, e.g.,
$$
\mathbf{A} = \left(\begin{array}{ccc}
1 & 2 & 5 \\
4 & 5 & 1
\end{array}\right)
\quad
\mathbf{B} = \left(\begin{array}{cc}
3 & 6 \\
2 & 5 \\
1 & 2
\end{array}\right)
\quad
\mathbf{C} = (c_{ij}) = \mathbf{A}\mathbf{B} = \left(\begin{array}{cc}
12 & 26 \\
23 & 51
\end{array}\right)
$$
- Formally, if $\mathbf{A}$ is $r \times s$ and $\mathbf{B}$ is $s \times q$, then $\mathbf{AB}$ is an $r \times q$ matrix with $(i,j)$th element
$$
c_{ij} = \sum_{k=1}^s a_{ik} b_{kj}.
$$
- General rules:
    + $\mathbf{A}(\mathbf{B} + \mathbf{C}) = \mathbf{A}\mathbf{B} + \mathbf{A}\mathbf{C}$, $(\mathbf{A} + \mathbf{B})\mathbf{C} = \mathbf{A}\mathbf{C} + \mathbf{B}\mathbf{C}$
    + For any matrix $\mathbf{A}$, $\mathbf{A}^T\mathbf{A}$ is a square matrix.
    + The transpose of a matrix product: $(\mathbf{AB})^T = \mathbf{B}^T\mathbf{A}^T$.

### Example

- Consider a prediction model, e.g., the `wage` data example: suppose that we have $n$ pairs $(x_1, Y_1), \ldots, (x_n, Y_n)$, and we believe that, except for a random deviation, the relationship between the **covariate** $x$ (e.g., `age`) and the response $Y$ follows a straight line. That is, for $i = 1, \ldots, n$, we have
$$
Y_i = \beta_0 + \beta_1 x_i + \epsilon_i,
$$
where $\epsilon_i$ is a random deviation representing the amount by which the actual observed response $Y_i$ deviates from the exact straight-line relationship. Defining
$$
\mathbf{X} = \left(\begin{array}{cc}
1 & x_1 \\
1 & x_2 \\
\vdots & \vdots \\
1 & x_n
\end{array}\right), \quad
\mathbf{Y} = \left(\begin{array}{c}
Y_1 \\ Y_2 \\ \vdots \\ Y_n
\end{array}\right), \quad
\epsilon = \left(\begin{array}{c}
\epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n
\end{array}\right), \quad
\beta = \left(\begin{array}{c}
\beta_0 \\ \beta_1
\end{array}\right),
$$
we may express the model succinctly as
$$
\mathbf{Y} = \mathbf{X}\beta + \epsilon.
$$
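Estimation of $\beta$ is covered later in the course, but a minimal simulated sketch (the numbers here are made up) shows how this matrix formulation is used: the least squares estimate $\hat{\beta} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{Y}$ can be computed directly with the matrix operations above, and it agrees with R's built-in `lm()`.

```{r}
set.seed(212)
n <- 100
age <- runif(n, 18, 65)
# Simulate responses from a straight-line model plus random deviations
Y <- 50 + 0.8 * age + rnorm(n, sd = 10)
# Design matrix: a column of 1s for the intercept, then the covariate
X <- cbind(1, age)
# Least squares: solve the linear system (X^T X) beta = X^T Y
beta_hat <- solve(t(X) %*% X, t(X) %*% Y)
beta_hat
coef(lm(Y ~ age))  # matches the matrix-algebra computation
```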