--- title: "Nonlinear Models (ISL 7)" subtitle: "Biostat 212A" author: "Dr. Jin Zhou @ UCLA" date: today format: html: theme: cosmo embed-resources: true number-sections: true toc: true toc-depth: 4 toc-location: left code-fold: false engine: knitr knitr: opts_chunk: fig.align: 'center' # fig.width: 6 # fig.height: 4 message: FALSE cache: false --- Credit: This note heavily uses material from the books [_An Introduction to Statistical Learning: with Applications in R_](https://www.statlearning.com/) (ISL2) and [_Elements of Statistical Learning: Data Mining, Inference, and Prediction_](https://hastie.su.domains/ElemStatLearn/) (ESL2). Display system information for reproducibility. ::: {.panel-tabset} ## R ```{r} sessionInfo() ``` ## Python ```{python} import IPython print(IPython.sys_info()) ``` ::: ## Overview - The truth is never linear! Or almost never! But often the linearity assumption is good enough. - When it's not ... - polynomials - step functions - spline - local regression, and - generalized additive models offer a lot of flexibility, without losing the ease and interpretability of linear models. - Main idea - To augment/replace the vector of inputs $X$ with additional variables, which are transformations of $X$, and then use linear models in this new space of derived input features. - For example, in regression problems, $f(X) = \mbox{E}(Y\mid X)$ is modeled as a linear function of $X$, but now to model is by transformation of $X$, i.e., $h_m(X)$. $$ f(X) =\sum_{m=1}^M \beta_m h_m(X) $$ - The beauty of this approach is that once the basis functions $h_m$ have been determined, the models are linear in these new variables, and the fitting proceeds as for linear models. ### Popular choices for basis functions $h_m(X)$ - $h_m(X) = X_m, m = 1,\ldots,p$ recovers the original linear model - $h_m(X) = X_j^2$ or $h_m(X) = X_jX_k$ allows us to augment the inputs with polynomial terms. Note, however, that the number of variables grows exponentially in the degree of the polynomial. A full quadratic model in $p$ variables requires $O(p^2)$ square and cross-product terms, or more generally $O(p^d)$ for a degree-d polynomial. - $h_m(X) = \mathbf{I}(L_m \leq X_k < U_m)$, an indicator for a region of $X_k$. By breaking the range of $X_k$ up into $M_k$ such nonoverlapping regions results in a model with a piecewise constant contribution for $X_k$. 
- Example: `wage` vs `age`:

::: {.panel-tabset}

#### R

```{r}
#| message: false
library(gtsummary)
library(ISLR2)
library(tidyverse)

# Convert to tibble
Wage <- as_tibble(Wage) %>% print(width = Inf)

# Summary statistics
Wage %>% tbl_summary()

# Plot wage ~ age; geom_smooth displays a GAM fit when n > 1000
Wage %>%
  ggplot(mapping = aes(x = age, y = wage)) + 
  geom_point() + 
  geom_smooth() +
  labs(
    title = "Wage changes nonlinearly with age",
    x = "Age",
    y = "Wage (k$)"
  )
```

#### Python

```{python}
# Load the pandas library
import pandas as pd
# Load numpy for array manipulation
import numpy as np
# Load seaborn plotting library
import seaborn as sns
import matplotlib.pyplot as plt

# Set font size in plots
sns.set(font_scale = 1.2)
# Display all columns
pd.set_option('display.max_columns', None)

# Import Wage data
Wage = pd.read_csv("../data/Wage.csv")
Wage.info()
```

```{python}
# Visualize wage ~ age, display lowess curve
plt.figure()
sns.lmplot(
  data = Wage, 
  x = "age", 
  y = "wage", 
  lowess = True,
  scatter_kws = {'alpha' : 0.1},
  height = 8
  ).set(
  title = "Wage changes nonlinearly with age",
  xlabel = 'Age', 
  ylabel = 'Wage (k$)'
  );
plt.show()
```

:::

## Polynomial regression

For most of this lecture, we consider $p = 1$ in the examples.
$$
y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \cdots + \beta_d x_i^d + \epsilon_i.
$$

::: {.panel-tabset}

#### R

```{r}
# Plot wage ~ age, display order-4 polynomial fit
Wage %>%
  ggplot(mapping = aes(x = age, y = wage)) + 
  geom_point() + 
  geom_smooth(
    method = "lm",
    formula = y ~ poly(x, degree = 4)
  ) +
  labs(
    title = "Degree-4 Polynomial",
    x = "Age",
    y = "Wage (k$)"
  )
```

#### Python

```{python}
# Visualize wage ~ age, display order-4 polynomial fit
plt.figure()
sns.lmplot(
  data = Wage, 
  x = "age", 
  y = "wage", 
  # Order-4 polynomial regression
  order = 4,
  scatter_kws = {'alpha' : 0.1},
  height = 8
  ).set(
  xlabel = 'Age', 
  ylabel = 'Wage (k$)',
  title = 'Degree-4 Polynomial'
  );
plt.show()
```

:::

- Create new variables $X_1 = X$, $X_2 = X^2$, ..., and then treat as multiple linear regression.

- We are not really interested in the coefficients; we are more interested in the fitted function values at any value $x_0$:
$$
\hat f(x_0) = \hat{\beta}_0 + \hat{\beta}_1 x_0 + \hat{\beta}_2 x_0^2 + \hat{\beta}_3 x_0^3 + \hat{\beta}_4 x_0^4.
$$

::: {.panel-tabset}

#### R

```{r}
# poly(age, 4) constructs orthogonal polynomials of degree 1 to 4, all orthogonal to the constant
lmod <- lm(wage ~ poly(age, degree = 4), data = Wage)
summary(lmod)

# poly(age, 4, raw = TRUE) produces raw (non-orthogonal) polynomials, which match the Python results
lmod <- lm(wage ~ poly(age, degree = 4, raw = TRUE), data = Wage)
summary(lmod)
```

:::

- Since $\hat f(x_0)$ is a linear function of the $\hat{\beta}_j$, we can get a simple expression for the **pointwise variance** $\operatorname{Var}[\hat f(x_0)]$ at any value $x_0$ (see the sketch after this list).

- We either fix the degree $d$ at some reasonably low value, or use cross-validation to choose $d$.

- We can do this separately on several variables: just stack the variables into one matrix, and separate out the pieces afterwards (see GAMs later).

- Polynomial modeling can be done for generalized linear models (logistic regression, Poisson regression, etc.) as well.

- **Caveat**: polynomials have notorious tail behavior. Very bad for extrapolation.
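Because $\hat f(x_0)$ is linear in $\hat{\beta}$, pointwise standard errors on any grid of ages come straight out of `predict()`. A minimal sketch, assuming the `Wage` data loaded above:

```{r}
# Pointwise standard errors of the degree-4 polynomial fit on a grid of ages
poly_fit <- lm(wage ~ poly(age, degree = 4), data = Wage)
age_grid <- seq(from = min(Wage$age), to = max(Wage$age), length.out = 100)
pred <- predict(poly_fit, newdata = data.frame(age = age_grid), se.fit = TRUE)

# Approximate 95% pointwise confidence band for f(x0)
band <- tibble(
  age   = age_grid,
  fit   = pred$fit,
  lower = pred$fit - 2 * pred$se.fit,
  upper = pred$fit + 2 * pred$se.fit
)
head(band)
```

The band widens sharply near the boundaries of `age`, which is the tail-behavior caveat in action.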
::: {.panel-tabset}

#### R

```{r}
#| code-fold: true
library(splines)

# Plot wage ~ age
Wage %>%
  ggplot(mapping = aes(x = age, y = wage)) + 
  geom_point(alpha = 0.25) + 
  # Polynomial regression with degree 4
  geom_smooth(
    method = "lm",
    formula = y ~ poly(x, degree = 4),
    color = "blue"
  ) +
  # Natural cubic spline
  geom_smooth(
    method = "lm",
    formula = y ~ ns(x, df = 4),
    color = "red"
  ) +
  labs(
    title = "Natural cubic spline (red) vs polynomial regression (blue)",
    subtitle = "Both fits use 4 degrees of freedom",
    x = "Age",
    y = "Wage (k$)"
  )
```

#### Python (sklearn)

```{python}
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Create polynomial features of age predictor
poly_tf = make_column_transformer(
  (PolynomialFeatures(degree = 4, include_bias = False), ['age']),
  remainder = 'drop'
)

# Define pipeline and fit to Wage data
pipe = Pipeline(steps = [
  ("poly_tf", poly_tf),
  ("model", LinearRegression())
])

# Fit pipeline
X = Wage.drop('wage', axis = 1)
y = Wage.wage
pipe.fit(X, y)

# R^2
pipe.score(X, y)
```

```{python}
# Plot
plt.figure()
ax = sns.scatterplot(
  data = Wage,
  x = 'age',
  y = 'wage',
  alpha = 0.1
);
sns.lineplot(
  x = Wage['age'],
  y = pipe.predict(X),
  ax = ax
).set(
  title = "Polynomial regression (order = 4)",
  xlabel = 'Age',
  ylabel = 'Wage (k$)'
);
plt.show()
```

#### Python (statsmodels)

```{python}
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Fit linear regression
lmod = smf.ols(formula = 'wage ~ np.vander(age, 5, increasing = True) - 1', data = Wage).fit()
lmod.summary()
```

#### Python (numpy.polyfit)

```{python}
np.polyfit(Wage.age, Wage.wage, deg = 4)
```

:::

## Piecewise polynomials (regression splines)

- Instead of a single polynomial in $X$ over its whole domain, we can use different polynomials in regions defined by **knots**. E.g., a piecewise cubic polynomial with a single knot at $c$ takes the form
$$
y_i = \begin{cases}
\beta_{01} + \beta_{11} x_i + \beta_{21} x_i^2 + \beta_{31} x_i^3 + \epsilon_i & \text{if } x_i < c \\
\beta_{02} + \beta_{12} x_i + \beta_{22} x_i^2 + \beta_{32} x_i^3 + \epsilon_i & \text{if } x_i \ge c
\end{cases}.
$$

- It is better to add constraints to the polynomials, e.g., continuity (see the sketch below).

- Splines have the "maximum" amount of continuity.
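To make the effect of the constraints concrete, here is a minimal sketch with a single knot placed, purely for illustration, at age 50: the unconstrained piecewise cubic uses $2 \times 4 = 8$ coefficients, while the cubic spline, which forces continuity of the fit and its first two derivatives at the knot, uses only $K + 4 = 5$.

```{r}
# Sketch: piecewise cubic with one knot at age 50, with and without constraints
# (assumes the Wage data from ISLR2 is loaded as above)
library(splines)

# Unconstrained: two separate cubics, 2 x 4 = 8 coefficients in total
left  <- lm(wage ~ poly(age, 3, raw = TRUE), data = Wage, subset = age <  50)
right <- lm(wage ~ poly(age, 3, raw = TRUE), data = Wage, subset = age >= 50)
length(coef(left)) + length(coef(right))

# Cubic spline with one knot: continuity of f, f', f'' leaves K + 4 = 5 coefficients
spline_fit <- lm(wage ~ bs(age, knots = 50), data = Wage)
length(coef(spline_fit))
```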
### Linear spline

- A **linear spline** with knots at $\xi_k$, $k = 1,\ldots,K$, is a piecewise linear polynomial continuous at each knot.

- We can represent this model as
$$
y_i = \beta_0 + \beta_1 b_1(x_i) + \beta_2 b_2(x_i) + \cdots + \beta_{K+1} b_{K+1}(x_i) + \epsilon_i,
$$
where the $b_k$ are **basis functions**:
\begin{eqnarray*}
b_1(x_i) &=& x_i \\
b_{k+1}(x_i) &=& (x_i - \xi_k)_+, \quad k=1,\ldots,K.
\end{eqnarray*}
Here $(\cdot)_+$ means positive part:
$$
(x_i - \xi_k)_+ = \begin{cases}
x_i - \xi_k & \text{if } x_i > \xi_k \\
0 & \text{otherwise}
\end{cases}.
$$

### Cubic splines

- A **cubic spline** with knots at $\xi_k$, $k = 1,\ldots,K$, is a piecewise cubic polynomial with continuous derivatives up to order 2 at each knot.

- Again we can represent this model with **truncated power basis functions**
$$
y_i = \beta_0 + \beta_1 b_1(x_i) + \beta_2 b_2(x_i) + \cdots + \beta_{K+3} b_{K+3}(x_i) + \epsilon_i,
$$
with
\begin{eqnarray*}
b_1(x_i) &=& x_i \\
b_2(x_i) &=& x_i^2 \\
b_3(x_i) &=& x_i^3 \\
b_{k+3}(x_i) &=& (x_i - \xi_k)_+^3, \quad k = 1,\ldots,K,
\end{eqnarray*}
where
$$
(x_i - \xi_k)_+^3 = \begin{cases}
(x_i - \xi_k)^3 & \text{if } x_i > \xi_k \\
0 & \text{otherwise}
\end{cases}.
$$

- A cubic spline with $K$ knots costs $K+4$ parameters or degrees of freedom. That is $4(K+1)$ polynomial coefficients minus $3K$ constraints.

- While the truncated power basis is conceptually simple, it is not too attractive numerically: powers of large numbers can lead to severe rounding problems. In practice, **B-spline basis functions** are preferred for their computational efficiency (a hand-built truncated power basis appears in the sketch after the natural cubic spline discussion below). See ESL Chapter 5 Appendix.

```{python}
#| code-fold: true
from sklearn.preprocessing import SplineTransformer

# Cubic spline basis for age
X_age = np.array(X['age']).reshape(-1, 1)
x_plot = np.linspace(start = 15, stop = 85, num = 70)
X_plot = x_plot[:, np.newaxis]
bs_plot = SplineTransformer(
    degree = 3,
    # knots = np.array([25, 40, 60]).reshape(3, 1),
    n_knots = 5,
    extrapolation = 'continue',
    # include_bias = False
    ).fit(X_age).transform(X_plot)

ns_plot = SplineTransformer(
    degree = 3,
    # knots = np.array([25, 40, 60]).reshape(3, 1),
    n_knots = 5,
    extrapolation = 'linear',
    # include_bias = False
    ).fit(X_age).transform(X_plot)

# Plot
fig, axes = plt.subplots(ncols = 2, figsize = (20, 6))
axes[0].plot(x_plot, bs_plot)
# axes[0].legend(axes[0].lines, [f"spline {n}" for n in range(4)])
axes[0].set_title("B-splines")
axes[1].plot(x_plot, ns_plot)
# axes[1].legend(axes[0].lines, [f"spline {n}" for n in range(8)])
axes[1].set_title("B-splines with linearity at boundary")
plt.show()
```

### Natural cubic splines

- Splines can have high variance at the outer range of the predictors.

- A **natural cubic spline** extrapolates linearly beyond the boundary knots. This adds $4 = 2 \times 2$ extra constraints, and allows us to put more internal knots for the same degrees of freedom as a regular cubic spline.

- A natural spline with $K$ knots has $K$ degrees of freedom.
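As a check on the truncated power basis representation, the following sketch builds the basis by hand for knots at ages 25, 40, and 60 and verifies that it yields the same fitted values as `bs()` with the same knots: the two bases span the same $(K + 4)$-dimensional space of cubic splines, even though the individual coefficients differ.

```{r}
# Sketch: hand-built truncated power basis vs bs() for knots at 25, 40, 60
# (assumes the Wage data from ISLR2 is loaded as above)
library(splines)

tpb <- with(Wage, cbind(
  age, age^2, age^3,
  pmax(age - 25, 0)^3,
  pmax(age - 40, 0)^3,
  pmax(age - 60, 0)^3
))
fit_tpb <- lm(wage ~ tpb, data = Wage)
fit_bs  <- lm(wage ~ bs(age, knots = c(25, 40, 60)), data = Wage)

# Fitted values agree up to numerical (rounding) error
max(abs(fitted(fit_tpb) - fitted(fit_bs)))
```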
::: {.panel-tabset}

#### R

```{r}
#| code-fold: true
library(splines)

# Plot wage ~ age
Wage %>%
  ggplot(mapping = aes(x = age, y = wage)) + 
  geom_point(alpha = 0.25) + 
  # Cubic spline
  geom_smooth(
    method = "lm",
    formula = y ~ bs(x, knots = c(25, 40, 60)),
    color = "blue"
  ) +
  # Natural cubic spline
  geom_smooth(
    method = "lm",
    formula = y ~ ns(x, knots = c(25, 40, 60)),
    color = "red"
  ) +
  labs(
    title = "Natural cubic spline fit (red) vs cubic spline fit (blue)",
    x = "Age",
    y = "Wage (k$)"
  )
```

:::

### Knot placement

- One strategy is to decide $K$, the number of knots, and then place them at appropriate quantiles of the observed $X$.

- In practice users often specify the degrees of freedom and let software choose the number of knots and their locations.

## Smoothing splines

- Consider this criterion for fitting a smooth function $g(x)$ to some data:
$$
\text{minimize} \quad \sum_{i=1}^n (y_i - g(x_i))^2 + \lambda \int g''(t)^2 \, dt.
$$

- The first term is RSS, and tries to make $g(x)$ match the data at each $x_i$.

- The second term is a **roughness penalty** and controls how wiggly $g(x)$ is. It is modulated by the tuning parameter $\lambda \ge 0$.
    * The smaller $\lambda$, the more wiggly the function, eventually interpolating $y_i$ when $\lambda = 0$.
    * As $\lambda \to \infty$, the function $g(x)$ becomes linear.

- It can be shown that this problem has an explicit, finite-dimensional, unique minimizer, which is a natural cubic spline with knots at the unique values of the $x_i$.

- The roughness penalty still controls the roughness of this natural cubic spline via $\lambda$.

- Smoothing splines avoid the knot-selection issue, leaving a single $\lambda$ to be chosen.

- The vector of $n$ fitted values can be written as $\hat{g}_\lambda = S_\lambda y$, where $S_{\lambda}$ is an $n \times n$ matrix (determined by the $x_i$ and $\lambda$).

- The **effective degrees of freedom** are given by
$$
\text{df}_{\lambda} = \sum_{i=1}^n S_{\lambda,ii}.
$$
Thus we can specify `df` rather than $\lambda$.

- The leave-one-out (LOO) cross-validated error is given by
$$
\text{RSS}_{\text{CV}}(\lambda) = \sum_{i=1}^n \left[ \frac{y_i - \hat{g}_\lambda(x_i)}{1 - S_{\lambda,ii}} \right]^2.
$$

::: {.panel-tabset}

#### R

The `ggformula` package supplies the `geom_spline` function for displaying smoothing spline fits.

```{r}
#| code-fold: true
library(ggformula)
library(splines)

# Plot wage ~ age
Wage %>%
  ggplot(mapping = aes(x = age, y = wage)) + 
  geom_point(alpha = 0.25) + 
  # Smoothing spline with df = 16
  geom_spline(
    df = 16,
    color = "red"
  ) +
  # Smoothing spline with LOOCV tuned df
  geom_spline(
    # df = 6.8,
    cv = TRUE,
    color = "blue"
  ) +
  labs(
    title = "Smoothing spline with df=16 (red) vs LOOCV tuned df=6.8 (blue)",
    x = "Age",
    y = "Wage (k$)"
  )
```

:::

- Application to logistic regression:
$$
\log\frac{\mathbf{P}(Y=1|X)}{\mathbf{P}(Y=0|X)} = f(X),
$$
therefore
$$
\mathbf{P}(Y=1|X) = \frac{\exp(f(X))}{1 + \exp(f(X))} = p(X).
$$

- The penalized log-likelihood criterion is
$$
\ell(f,\lambda) = \sum_{i=1}^n \left[ y_i \log p(x_i) + (1-y_i) \log (1-p(x_i)) \right] - \lambda \int f''(t)^2 \, dt.
$$

- The solution is a natural cubic spline with knots at the unique values of the $x_i$.

## Local regression
- With a sliding weight function, we fit separate linear fits over the range of $X$ by weighted least squares.

- At $X=x_0$,
$$
\text{minimize} \quad \sum_{i=1}^n K(x_i, x_0) (y_i - \beta_0 - \beta_1 x_i)^2,
$$
where $K$ is a weighting function that assigns heavier weight to $x_i$ close to $x_0$ and zero weight to points furthest from $x_0$.

- **Locally weighted linear regression**: the `loess` function in R and `lowess` in Python (`statsmodels`).

- Anecdotally, loess gives better appearance, but is $O(N^2)$ in memory, so does not work for larger data sets.

- While choices such as the weighting function and the degree of the local polynomial make some difference, the most important choice is the span $s$, which is the proportion of points used to compute the local regression at $x_0$.

- The span plays a role like that of the tuning parameter $\lambda$ in smoothing splines: it controls the flexibility of the non-linear fit.
    + The smaller the value of $s$, the more local and wiggly will be our fit; alternatively, a very large value of $s$ will lead to a global fit to the data using all of the training observations.
    + Use cross-validation to choose $s$, or just specify it directly.

## Generalized additive model (GAM)

- Generalized additive models (GAMs) allow for flexible nonlinearities in several variables, but retain the additive structure of linear models.
$$
y_i = \beta_0 + f_1(x_{i1}) + f_2(x_{i2}) + \cdots + f_p (x_{ip}) + \epsilon_i.
$$

- We can fit a GAM simply using, e.g., natural splines.

- Coefficients are not that interesting; fitted functions are.

- Can mix terms: some linear, some nonlinear, and use ANOVA to compare models.

- Can use smoothing splines or local regression as well. In R: `gam(wage ~ s(year, df = 5) + lo(age, span = 0.5) + education)`.

- GAMs are additive, although low-order interactions can be included in a natural way using, e.g., bivariate smoothers or interactions of the form (in R) `ns(age, df = 5):ns(year, df = 5)`.

::: {.panel-tabset}

#### R

Natural splines for `year` and `age`.

```{r}
gam_mod <- lm(
  wage ~ ns(year, df = 4) + ns(age, df = 5) + education, 
  data = Wage
)
summary(gam_mod)
```

Smoothing splines for `year` and `age`.
```{r}
library(gam)
gam_mod <- gam(
  wage ~ s(year, 4) + s(age, 5) + education, 
  data = Wage
)
summary(gam_mod)
plot(gam_mod, se = TRUE, col = "red")
```

#### Python (sklearn)

```{python}
from sklearn.preprocessing import OneHotEncoder, SplineTransformer

# Natural cubic spline features of year and age predictors
ns_tf = make_column_transformer(
  (SplineTransformer(
    n_knots = 4,
    # knots = 'quantile',
    degree = 3,
    extrapolation = 'linear', # natural cubic spline
    # include_bias = False
  ), ['year']),
  (SplineTransformer(
    n_knots = 5,
    # knots = 'quantile',
    degree = 3,
    extrapolation = 'linear', # natural cubic spline
    # include_bias = False
  ), ['age']),
  (OneHotEncoder(drop = 'first'), ['education']),
  remainder = 'drop'
)

# Define pipeline and fit to Wage data
pipe = Pipeline(steps = [
  ("ns_tf", ns_tf),
  ("model", LinearRegression())
])

# Fit pipeline
X = Wage.drop('wage', axis = 1)
y = Wage.wage
pipe.fit(X, y)

# R^2
pipe.score(X, y)
```

#### Python (statsmodels)

```{python}
from statsmodels.gam.api import GLMGam, BSplines

# Create spline basis for year and age
x_spline = Wage[['year', 'age']]
bs = BSplines(x_spline, df = [4, 5], degree = [3, 3])

# Fit GAM
gam_mod = GLMGam.from_formula('wage ~ education', data = Wage, smoother = bs).fit()
gam_mod.summary()
```

```{python}
# Plot smooth components
for i in [0, 1]:
  plt.figure()
  gam_mod.plot_partial(i, cpr = True)
  plt.show()
```

:::

## Lab

### Polynomial Regression and Step Functions

- To predict wage using a fourth-degree polynomial in age: `poly(age, 4)` in `lm`.

- The `poly()` function creates orthogonal polynomials, which are uncorrelated and have mean zero. Essentially, each column is a linear combination of the variables `age`, `age^2`, `age^3`, and `age^4`.

- Alternatively, we can use `poly()` to obtain `age`, `age^2`, `age^3`, and `age^4` directly, if we prefer, by supplying the `raw = TRUE` argument to `poly()`.

```{r}
attach(Wage)
fit <- lm(wage ~ poly(age, 4), data = Wage)
coef(summary(fit))
fit2 <- lm(wage ~ poly(age, 4, raw = T), data = Wage)
coef(summary(fit2))
# Equivalent to
fit3 <- lm(wage ~ age + I(age^2) + I(age^3) + I(age^4), data = Wage)
```

- In performing a polynomial regression we must decide on the degree of the polynomial to use. One way to do this is by using hypothesis tests.

```{r}
fit.1 <- lm(wage ~ age, data = Wage)
fit.2 <- lm(wage ~ poly(age, 2), data = Wage)
fit.3 <- lm(wage ~ poly(age, 3), data = Wage)
fit.4 <- lm(wage ~ poly(age, 4), data = Wage)
fit.5 <- lm(wage ~ poly(age, 5), data = Wage)
anova(fit.1, fit.2, fit.3, fit.4, fit.5)
```

- As an alternative to using hypothesis tests and ANOVA, we could choose the polynomial degree using cross-validation, as discussed in Chapter 5.

- Next we consider the task of predicting whether an individual earns more than \$250,000 per year. We fit a polynomial logistic regression, creating the binary response on the fly with the wrapper `I(wage > 250)`.

```{r}
fit <- glm(I(wage > 250) ~ poly(age, 4), data = Wage, family = binomial)
```

- Once again, we make predictions using the `predict()` function.

```{r}
agelims <- range(age)
age.grid <- seq(from = agelims[1], to = agelims[2])
# Predictions (and standard errors) on the logit scale
preds <- predict(fit, newdata = list(age = age.grid), se = T)
# Predictions directly on the probability scale
preds <- predict(fit, newdata = list(age = age.grid), type = "response", se = T)
```

- In order to fit a step function, as discussed in Section 7.2, we use the `cut()` function. `cut()` creates a qualitative variable representing the age range, and `lm()` then fits a piecewise-constant function of `age`.

```{r}
fit <- lm(wage ~ cut(age, 4), data = Wage)
summary(fit)
```

### Splines

- In order to fit regression splines in R, we use the `splines` library.
- The `bs()` function generates the entire matrix of basis functions for splines with the specified set of knots.

```{r}
library(splines)
fit <- lm(wage ~ bs(age, knots = c(25, 40, 60)), data = Wage)
```

- The `df` option to `bs()` can be used to produce a spline with a specified number of degrees of freedom; the knots are then placed at quantiles of the data.

```{r}
dim(bs(age, knots = c(25, 40, 60)))
dim(bs(age, df = 6))
attr(bs(age, df = 6), "knots")
```

- In order to instead fit a natural spline, we use the `ns()` function. Here we use `ns()` to fit a natural spline with four degrees of freedom.

```{r}
fit <- lm(wage ~ ns(age, df = 4), data = Wage)
summary(fit)
```

- In order to fit a smoothing spline, we use the `smooth.spline()` function.

```{r}
fit <- smooth.spline(age, wage, df = 16)
# df selected by LOO cross-validation
fit2 <- smooth.spline(age, wage, cv = TRUE)
```

- To perform local regression, we use the `loess()` function.

```{r}
attach(Wage)
plot(age, wage, xlim = agelims, cex = .5, col = "darkgrey")
title("Local Regression")
fit <- loess(wage ~ age, span = .2, data = Wage)
fit2 <- loess(wage ~ age, span = .5, data = Wage)
lines(age.grid, predict(fit, data.frame(age = age.grid)), col = "red", lwd = 2)
lines(age.grid, predict(fit2, data.frame(age = age.grid)), col = "blue", lwd = 2)
legend("topright", legend = c("Span = 0.2", "Span = 0.5"), col = c("red", "blue"), lty = 1, lwd = 2, cex = .8)
```

### GAMs

- In order to fit a GAM in R, we use the `gam()` function, which is part of the `gam` library.

- The `s()` function, which is part of the `gam` library, is used to indicate that we would like to use a smoothing spline.

```{r}
library(gam)
gam.m3 <- gam(wage ~ s(year, 4) + s(age, 5) + education, data = Wage)
```

- ANOVA can compare the nested models: no `year`, a linear term in `year`, and a smoothing spline in `year`.

```{r}
gam.m1 <- gam(wage ~ s(age, 5) + education, data = Wage)
gam.m2 <- gam(wage ~ year + s(age, 5) + education, data = Wage)
anova(gam.m1, gam.m2, gam.m3, test = "F")
```

- We can also use local regression fits as building blocks in a GAM, using the `lo()` function.

```{r}
gam.lo <- gam(wage ~ lo(year, age, span = 0.5) + education, data = Wage)
plot(gam.lo, se = TRUE, col = "green")
```
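- GAMs also apply to classification. A sketch, following the lab in ISL, fits a logistic GAM for the binary event `wage > 250` by adding `family = binomial`.

```{r}
# Logistic GAM: P(wage > 250 | year, age, education)
gam.lr <- gam(
  I(wage > 250) ~ year + s(age, df = 5) + education,
  family = binomial, data = Wage
)
summary(gam.lr)
plot(gam.lr, se = TRUE, col = "green")
```

As noted in ISL, there are no high earners in the `< HS Grad` education category, so the book refits this model excluding that category.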