![](ISL_fig_2_7.pdf){width=500px height=300px}

Trade-off of model flexibility vs interpretability.
:::

## Assessing model accuracy

- Given training data $\{(x_1, y_1), \ldots, (x_n, y_n)\}$, we fit a model $\hat f$. We can evaluate the model accuracy on the training data by the **mean squared error**
$$
\operatorname{MSE}_{\text{train}} = \frac 1n \sum_{i=1}^n [y_i - \hat f(x_i)]^2.
$$
The smaller $\operatorname{MSE}_{\text{train}}$ is, the better the model fits the training data.

- However, in most situations, we are not interested in the training MSE. Rather, we are interested in the accuracy of the predictions on previously unseen test data.

- If we have a separate test set with both predictors and outcomes, then the task is easy: we choose the learning method that yields the smallest test MSE
$$
\operatorname{MSE}_{\text{test}} = \frac{1}{n_{\text{test}}} \sum_{i=1}^{n_{\text{test}}} [y_i - \hat f(x_i)]^2.
$$

- In many applications, we don't have a separate test set. Is it a good idea to choose the learning method with the smallest training MSE?

::: {#fig-tradeoff-truth}
![](ISL_fig_2_9.pdf){width=500px height=300px}

Black curve is truth. Red curve on right is the test MSE, grey curve is the training MSE. Orange, blue and green curves/squares correspond to fits of different flexibility.
:::

::: {#fig-tradeoff-smooth-truth}
![](ISL_fig_2_10.pdf){width=500px height=300px}

Here the truth is smoother, so the smoother fit and the linear model do really well.
:::

::: {#fig-tradeoff-wiggly-truth}
![](ISL_fig_2_11.pdf){width=500px height=300px}

Here the truth is wiggly and the noise is low, so the more flexible fits do the best.
:::

- As the previous three examples illustrate, the flexibility level corresponding to the model with the minimal test MSE can vary considerably among data sets.

- Later we will discuss the **cross-validation** strategy for estimating the test MSE using only the training data.

## Bias-variance trade-off

- The U-shape observed in the test MSE curves (@fig-tradeoff-truth to @fig-tradeoff-wiggly-truth) reflects the **bias-variance trade-off**.

- Let $(x_0, y_0)$ be a test observation. Under the model @eq-statistical-model, the **expected prediction error (EPE)** at $x_0$, also called the **test error** or **generalization error**, can be decomposed as (HW1)
$$
\operatorname{E}[y_0 - \hat f(x_0)]^2 = \underbrace{\operatorname{Var}(\hat f(x_0)) + [\operatorname{Bias}(\hat f(x_0))]^2}_{\text{MSE of } \hat f(x_0) \text{ for estimating } f(x_0)} + \underbrace{\operatorname{Var}(\epsilon)}_{\text{irreducible}},
$$
where
    - $\operatorname{Bias}(\hat f(x_0)) = \operatorname{E}[\hat f(x_0)] - f(x_0)$;
    - the expectation averages over the variability in $y_0$ and $\hat f$ (a function of the training data).

- Typically, as the flexibility of $\hat f$ increases, its variance increases and its bias decreases.

::: {#fig-tradeoff-bias-variance-tradeoff}
![](ISL_fig_2_12.pdf){width=500px height=300px}

Bias-variance trade-off.
:::

## Classical regime vs modern regime

- The U-shaped test MSE curves above arise in the so-called **classical regime**, where the number of features (or degrees of freedom) is less than the number of training samples. In the **modern regime**, where the number of features (or degrees of freedom) can be orders of magnitude larger than the number of training samples (recall that the GPT-3 model has 175 billion parameters!), the **double descent** phenomenon is observed and is being actively studied. See the recent [paper](https://epubs.siam.org/doi/pdf/10.1137/20M1336072) and references therein.

::: {#fig-double-descent}
![](https://openai.com/content/images/2019/12/modeldd.svg)

Double descent phenomenon ([OpenAI Blog](https://openai.com/blog/deep-double-descent/)).
:::

## Classification problems

- When the outcome $Y$ is discrete, for example, an email is one of $\mathcal{C}=$ \{`spam`, `ham`\} (`ham` = good email), or a handwritten digit is one of $\mathcal{C} = \{0,1,\ldots,9\}$.

- Our goals are to
    - build a classifier $f(X)$ that assigns a class label from $\mathcal{C}$ to a future unlabeled observation $X$;
    - assess the uncertainty in each classification;
    - understand the roles of the different predictors among $X=(X_1,\ldots,X_p)$.

- To evaluate the performance of classification algorithms, the **training error rate** is
$$
\frac 1n \sum_{i=1}^n I(y_i \ne \hat y_i),
$$
where $\hat y_i = \hat f(x_i)$ is the predicted class label for the $i$th observation using $\hat f$.

- As in the regression setting, we are most interested in the **test error rate** associated with a set of test observations
$$
\frac{1}{n_{\text{test}}} \sum_{i=1}^{n_{\text{test}}} I(y_i \ne \hat y_i).
$$

- Suppose $\mathcal{C}=\{1,2,\ldots,K\}$. The **Bayes classifier** assigns a test observation with predictor vector $x_0$ to the class $j \in \mathcal{C}$ for which
$$
\operatorname{Pr}(Y=j \mid X = x_0)
$$
is largest.
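The Bayes rule is easy to state numerically. A minimal sketch with made-up conditional class probabilities at a single test point $x_0$ (in practice these probabilities are unknown and must be estimated):

```python
import numpy as np

# Hypothetical conditional probabilities Pr(Y = j | X = x0) for
# classes j = 1, 2, 3 at one test point x0 (made-up numbers).
p = np.array([0.2, 0.5, 0.3])

bayes_class = np.argmax(p) + 1     # Bayes classifier: argmax_j Pr(Y = j | X = x0)
bayes_error_at_x0 = 1 - np.max(p)  # Bayes error rate at x0

print(bayes_class)        # 2
print(bayes_error_at_x0)  # 0.5
```

Here class 2 is chosen because it has the largest conditional probability, and the error rate $1 - \max_j \operatorname{Pr}(Y=j \mid X=x_0)$ is the probability that the true label is any other class.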
- In a two-class problem ($K=2$), the Bayes classifier assigns a test case to class 1 if $\operatorname{Pr}(Y=1 \mid X = x_0) > 0.5$, and to class 2 otherwise.

- The Bayes classifier produces the **lowest** possible test error rate, called the **Bayes error rate**,
$$
1 - \max_j \operatorname{Pr}(Y=j \mid X = x_0)
$$
at $X=x_0$. The **overall Bayes error rate** is given by
$$
1 - \operatorname{E} [\max_j \operatorname{Pr}(Y=j \mid X)],
$$
where the expectation averages over all possible values of $X$.

- Unfortunately, for real data, we don't know the conditional distribution of $Y$ given $X$, so computing the Bayes classifier is impossible.

- Various learning algorithms attempt to estimate the conditional distribution of $Y$ given $X$, and then classify a given observation to the class with the highest estimated probability.

- One simple classifier is the **$K$-nearest neighbor (KNN)** classifier. Given a positive integer $K$ and a test observation $x_0$, the KNN classifier first identifies the $K$ points in the training data that are closest to $x_0$, denoted $\mathcal{N}_0$. It then estimates the conditional probability by
$$
\widehat{\operatorname{Pr}}(Y=j \mid X = x_0) = \frac{1}{K} \sum_{i \in \mathcal{N}_0} I(y_i = j)
$$
and classifies the test observation $x_0$ to the class with the largest estimated probability.

::: {#fig-KNN-K-10}
![](ISL_fig_2_15.pdf){width=500px height=300px}

Black curve is the KNN decision boundary using $K=10$. The purple dashed line is the Bayes decision boundary.
:::

- Smaller $K$ yields a more flexible classification rule.

::: {#fig-KNN-K-1-K-100}
![](ISL_fig_2_16.pdf){width=500px height=300px}

Left panel: KNN with $K=1$. Right panel: KNN with $K=100$.
:::

- Bias-variance trade-off of KNN.

::: {#fig-KNN-tradeoff}
![](ISL_fig_2_17.pdf){width=500px height=300px}

KNN with $K \approx 10$ achieves the Bayes error rate (black dashed line).
:::
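The KNN classifier and the effect of $K$ can be sketched from scratch. The code below uses simulated two-class Gaussian data (an assumed setup, not the data behind the figures above): with $K=1$ the training error is exactly zero (each training point is its own nearest neighbor) even though the test error is not, illustrating how small $K$ overfits.

```python
import numpy as np

def knn_predict(X_train, y_train, X_test, K):
    """Classify each row of X_test by majority vote among its K nearest
    training points, using Euclidean distance."""
    preds = []
    for x0 in X_test:
        dist = np.linalg.norm(X_train - x0, axis=1)   # distances to all training points
        nbrs = np.argsort(dist)[:K]                   # indices of the K nearest points
        labels, counts = np.unique(y_train[nbrs], return_counts=True)
        preds.append(labels[np.argmax(counts)])       # majority vote
    return np.array(preds)

# Simulated data: class 0 centered at (0, 0), class 1 centered at (2, 2).
rng = np.random.default_rng(0)
n = 100
X_train = np.vstack([rng.normal(0, 1, (n, 2)), rng.normal(2, 1, (n, 2))])
y_train = np.repeat([0, 1], n)
X_test = np.vstack([rng.normal(0, 1, (n, 2)), rng.normal(2, 1, (n, 2))])
y_test = np.repeat([0, 1], n)

for K in [1, 10, 100]:
    train_err = np.mean(knn_predict(X_train, y_train, X_train, K) != y_train)
    test_err = np.mean(knn_predict(X_train, y_train, X_test, K) != y_test)
    print(f"K = {K:3d}  train error {train_err:.3f}  test error {test_err:.3f}")
```

As $K$ grows, the fitted decision boundary becomes smoother (lower variance, higher bias), matching the left-to-right progression in @fig-KNN-K-1-K-100.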