--- title: "Nonlinear Models (ISL 7)" subtitle: "Econ 425T" author: "Dr. Hua Zhou @ UCLA" date: "`r format(Sys.time(), '%d %B, %Y')`" format: html: theme: cosmo number-sections: true toc: true toc-depth: 4 toc-location: left code-fold: false engine: knitr knitr: opts_chunk: fig.align: 'center' # fig.width: 6 # fig.height: 4 message: FALSE cache: false --- Credit: This note heavily uses material from the books [_An Introduction to Statistical Learning: with Applications in R_](https://www.statlearning.com/) (ISL2) and [_Elements of Statistical Learning: Data Mining, Inference, and Prediction_](https://hastie.su.domains/ElemStatLearn/) (ESL2). Display system information for reproducibility. ::: {.panel-tabset} ## Python ```{python} import IPython print(IPython.sys_info()) ``` ## R ```{r} sessionInfo() ``` ::: ## Overview - The truth is never linear! Or almost never! But often the linearity assumption is good enough. - When it's not ... - polynomials - step functions - spline - local regression, and - generalized additive models offer a lot of flexibility, without losing the ease and interpretability of linear models. - `wage` vs `age`: ::: {.panel-tabset} #### Python ```{python} # Load the pandas library import pandas as pd # Load numpy for array manipulation import numpy as np # Load seaborn plotting library import seaborn as sns import matplotlib.pyplot as plt # Set font size in plots sns.set(font_scale = 1.2) # Display all columns pd.set_option('display.max_columns', None) # Import Wage data Wage = pd.read_csv("../data/Wage.csv") Wage.info() ``` ```{python} # Visualize wage ~ age, display lowess curve plt.figure() sns.lmplot( data = Wage, x = "age", y = "wage", lowess = True, scatter_kws = {'alpha' : 0.1}, height = 8 ).set( title = "Wage changes nonlinearly with age", xlabel = 'Age', ylabel = 'Wage (k$)' ); plt.show() ``` #### R ```{r} #| message: false library(gtsummary) library(ISLR2) library(tidyverse) # Convert to tibble Wage <- as_tibble(Wage) %>% print(width = Inf) # Summary statistics Wage %>% tbl_summary() # Plot wage ~ age, GAM fit is display when n >1000 Wage %>% ggplot(mapping = aes(x = age, y = wage)) + geom_point() + geom_smooth() + labs(title = "Wage changes nonlinearly with age", x = "Age", y = "Wage (k$)") ``` ::: ## Polynomial regression $$ y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \cdots + \beta_d x_i^d + \epsilon_i. $$ ::: {.panel-tabset} #### Python ```{python} # Visualize wage ~ age, display order-4 polynomial fit plt.figure() sns.lmplot( data = Wage, x = "age", y = "wage", # Order-4 polynomial regression order = 4, scatter_kws = {'alpha' : 0.1}, height = 8 ).set( xlabel = 'Age', ylabel = 'Wage (k$)', title = 'Degree-4 Polynomial' ); plt.show() ``` #### R ```{r} # Plot wage ~ age, display order-4 polynomial fit Wage %>% ggplot(mapping = aes(x = age, y = wage)) + geom_point() + geom_smooth( method = "lm", formula = y ~ poly(x, degree = 4) ) + labs( title = "Degree-4 Polynomial", x = "Age", y = "Wage (k$)" ) ``` ::: - Create new variables $X_1 = X$, $X_2 = X^2$, ..., and then treat as multiple linear regression. - Not really interested in the coefficients; more interested in the fitted function values at any value $x_0$: $$ \hat f(x_0) = \hat{\beta}_0 + \hat{\beta}_1 x_0 + \hat{\beta}_2 x_0^2 + \hat{\beta}_3 x_0^3 + \hat{\beta}_4 x_0^4. 
$$

::: {.panel-tabset}

#### Python (sklearn)

```{python}
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Create polynomial features of the age predictor
poly_tf = make_column_transformer(
  (PolynomialFeatures(degree = 4, include_bias = False), ['age']),
  remainder = 'drop'
)

# Define pipeline and fit to Wage data
pipe = Pipeline(steps = [
  ("poly_tf", poly_tf),
  ("model", LinearRegression())
])

# Fit pipeline
X = Wage.drop('wage', axis = 1)
y = Wage.wage
pipe.fit(X, y)

# R^2
pipe.score(X, y)
```

```{python}
# Plot
plt.figure()
ax = sns.scatterplot(
  data = Wage,
  x = 'age',
  y = 'wage',
  alpha = 0.1
);
sns.lineplot(
  x = Wage['age'],
  y = pipe.predict(X),
  ax = ax
).set(
  title = "Polynomial regression (order = 4)",
  xlabel = 'Age',
  ylabel = 'Wage (k$)'
);
plt.show()
```

#### Python (statsmodels)

```{python}
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Fit linear regression on the raw polynomial features of age
lmod = smf.ols(
  formula = 'wage ~ np.vander(age, 5, increasing = True) - 1', 
  data = Wage
  ).fit()
lmod.summary()
```

#### Python (numpy.polyfit)

```{python}
np.polyfit(Wage.age, Wage.wage, deg = 4)
```

#### R

```{r}
# poly(age, 4) constructs orthogonal polynomials of degree 1 to 4, 
# all orthogonal to the constant
lmod <- lm(wage ~ poly(age, degree = 4), data = Wage)
summary(lmod)

# poly(age, 4, raw = TRUE) produces raw polynomials, which match the Python fits
lmod <- lm(wage ~ poly(age, degree = 4, raw = TRUE), data = Wage)
summary(lmod)
```

:::

- Since $\hat f(x_0)$ is a linear function of the $\hat{\beta}_j$, we can get a simple expression for the **pointwise variance** $\operatorname{Var}[\hat f(x_0)]$ at any value $x_0$.

- We either fix the degree $d$ at some reasonably low value, or use cross-validation to choose $d$.

- We can do this separately on several variables: just stack the variables into one matrix, and separate out the pieces afterwards (see GAMs later).

- Polynomial modeling can be done for generalized linear models (logistic regression, Poisson regression, etc.) as well.

- **Caveat**: polynomials have notorious tail behavior, which is very bad for extrapolation.

::: {.panel-tabset}

#### R

```{r}
#| code-fold: true
library(splines)

# Plot wage ~ age
Wage %>%
  ggplot(mapping = aes(x = age, y = wage)) + 
  geom_point(alpha = 0.25) + 
  # Polynomial regression with degree 14
  geom_smooth(
    method = "lm",
    formula = y ~ poly(x, degree = 14),
    color = "blue"
  ) +
  # Natural cubic spline
  geom_smooth(
    method = "lm",
    formula = y ~ ns(x, df = 14),
    color = "red"
  ) +
  labs(
    title = "Natural cubic spline (red) vs polynomial regression (blue)",
    subtitle = "Both have df=15",
    x = "Age",
    y = "Wage (k$)"
  )
```

:::

## Piecewise polynomials (regression splines)

- Instead of a single polynomial in $X$ over its whole domain, we can use different polynomials in regions defined by **knots**. E.g., a piecewise cubic polynomial with a single knot at $c$ takes the form
$$
y_i = \begin{cases}
\beta_{01} + \beta_{11} x_i + \beta_{21} x_i^2 + \beta_{31} x_i^3 + \epsilon_i & \text{if } x_i < c \\
\beta_{02} + \beta_{12} x_i + \beta_{22} x_i^2 + \beta_{32} x_i^3 + \epsilon_i & \text{if } x_i \ge c
\end{cases}.
$$
A minimal sketch fitting this two-piece cubic directly follows the figure below.

- It is better to add constraints to the polynomials, e.g., continuity.

- Splines have the "maximum" amount of continuity.

![ISL Figure 7.3: piecewise cubic polynomials with increasing orders of continuity.](ISL_fig_7_3.pdf){width=600px height=600px}
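
As referenced above, here is a minimal sketch of the unconstrained two-piece cubic; the knot at age 50 is an arbitrary choice for illustration. Interacting the raw cubic terms with the region indicator fits both sets of coefficients ($2 \times 4 = 8$ parameters) in a single `lm()` call, and the fitted curve is discontinuous at the knot, which motivates the continuity constraints.

```{r}
#| code-fold: true
# Unconstrained piecewise cubic with a single (arbitrary) knot at age 50:
# the interaction with the region indicator yields separate cubics
# on {age < 50} and {age >= 50}
pw_mod <- lm(wage ~ poly(age, 3, raw = TRUE) * I(age >= 50), data = Wage)

# Plot the two pieces; note the jump at the knot
Wage %>%
  mutate(pred = predict(pw_mod)) %>%
  ggplot(mapping = aes(x = age)) +
  geom_point(mapping = aes(y = wage), alpha = 0.1) +
  geom_line(mapping = aes(y = pred, group = age >= 50), color = "blue") +
  geom_vline(xintercept = 50, linetype = "dashed") +
  labs(x = "Age", y = "Wage (k$)")
```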

### Linear spline

- A **linear spline** with knots at $\xi_k$, $k = 1,\ldots,K$, is a piecewise linear polynomial continuous at each knot.

- We can represent this model as
$$
y_i = \beta_0 + \beta_1 b_1(x_i) + \beta_2 b_2(x_i) + \cdots + \beta_{K+1} b_{K+1}(x_i) + \epsilon_i,
$$
where the $b_k$ are **basis functions**:
\begin{eqnarray*}
b_1(x_i) &=& x_i \\
b_{k+1}(x_i) &=& (x_i - \xi_k)_+, \quad k=1,\ldots,K.
\end{eqnarray*}
Here $(\cdot)_+$ denotes the positive part:
$$
(x_i - \xi_k)_+ = \begin{cases}
x_i - \xi_k & \text{if } x_i > \xi_k \\
0 & \text{otherwise}
\end{cases}.
$$

### Cubic splines

- A **cubic spline** with knots at $\xi_k$, $k = 1,\ldots,K$, is a piecewise cubic polynomial with continuous derivatives up to order 2 at each knot.

- Again we can represent this model with **truncated power basis functions**
$$
y_i = \beta_0 + \beta_1 b_1(x_i) + \beta_2 b_2(x_i) + \cdots + \beta_{K+3} b_{K+3}(x_i) + \epsilon_i,
$$
with
\begin{eqnarray*}
b_1(x_i) &=& x_i \\
b_2(x_i) &=& x_i^2 \\
b_3(x_i) &=& x_i^3 \\
b_{k+3}(x_i) &=& (x_i - \xi_k)_+^3, \quad k = 1,\ldots,K,
\end{eqnarray*}
where
$$
(x_i - \xi_k)_+^3 = \begin{cases}
(x_i - \xi_k)^3 & \text{if } x_i > \xi_k \\
0 & \text{otherwise}
\end{cases}.
$$

- A cubic spline with $K$ knots costs $K+4$ parameters or degrees of freedom. That is, $4(K+1)$ polynomial coefficients minus $3K$ constraints.

- While the truncated power basis is conceptually simple, it is not attractive numerically: powers of large numbers can lead to severe rounding problems. In practice, **B-spline basis functions** are preferred for their computational efficiency. See ESL Chapter 5 Appendix. (A check that the two bases give identical fits appears after the natural cubic spline bullets below.)

```{python}
#| code-fold: true
from sklearn.preprocessing import SplineTransformer

# Cubic spline basis for age
X_age = np.array(X['age']).reshape(-1, 1)
x_plot = np.linspace(start = 15, stop = 85, num = 70)
X_plot = x_plot[:, np.newaxis]

# B-spline basis (cubic polynomial extrapolation at the boundary)
bs_plot = SplineTransformer(
    degree = 3,
    # knots = np.array([25, 40, 60]).reshape(3, 1),
    n_knots = 5,
    extrapolation = 'continue',
    # include_bias = False
).fit(X_age).transform(X_plot)

# B-spline basis with linear extrapolation at the boundary
ns_plot = SplineTransformer(
    degree = 3,
    # knots = np.array([25, 40, 60]).reshape(3, 1),
    n_knots = 5,
    extrapolation = 'linear',
    # include_bias = False
).fit(X_age).transform(X_plot)

# Plot
fig, axes = plt.subplots(ncols = 2, figsize = (20, 6))
axes[0].plot(x_plot, bs_plot)
# axes[0].legend(axes[0].lines, [f"spline {n}" for n in range(4)])
axes[0].set_title("B-splines")
axes[1].plot(x_plot, ns_plot)
# axes[1].legend(axes[0].lines, [f"spline {n}" for n in range(8)])
axes[1].set_title("B-splines with linearity at boundary")
plt.show()
```

### Natural cubic splines

- Splines can have high variance at the outer range of the predictors.

- A **natural cubic spline** extrapolates linearly beyond the boundary knots. This adds $4 = 2 \times 2$ extra constraints, and allows us to put more internal knots for the same degrees of freedom as a regular cubic spline.

- A natural spline with $K$ knots has $K$ degrees of freedom.
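
As referenced above, here is a minimal R check (assuming knots at ages 25, 40, and 60, the same knots used in the tabset below) that the truncated power basis and the B-spline basis `bs()` span the same space of cubic splines, so least squares returns identical fitted values in either basis.

```{r}
library(splines)

# Truncated power basis for a cubic spline with knots at 25, 40, 60:
# x, x^2, x^3, plus one truncated cubic term per knot
tpb <- with(Wage, cbind(
  age, age^2, age^3,
  pmax(age - 25, 0)^3, pmax(age - 40, 0)^3, pmax(age - 60, 0)^3
))

# Same model in the truncated power basis and the B-spline basis
fit_tpb <- lm(wage ~ tpb, data = Wage)
fit_bs  <- lm(wage ~ bs(age, knots = c(25, 40, 60)), data = Wage)

# Maximum discrepancy in fitted values (rounding error only)
max(abs(fitted(fit_tpb) - fitted(fit_bs)))
```

Both fits spend $K + 4 = 7$ degrees of freedom (including the intercept); the B-spline version is the numerically stable choice.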
::: {.panel-tabset}

#### R

```{r}
#| code-fold: true
library(splines)

# Plot wage ~ age
Wage %>%
  ggplot(mapping = aes(x = age, y = wage)) + 
  geom_point(alpha = 0.25) + 
  # Cubic spline
  geom_smooth(
    method = "lm",
    formula = y ~ bs(x, knots = c(25, 40, 60)),
    color = "blue"
  ) +
  # Natural cubic spline
  geom_smooth(
    method = "lm",
    formula = y ~ ns(x, knots = c(25, 40, 60)),
    color = "red"
  ) +
  labs(
    title = "Natural cubic spline fit (red) vs cubic spline fit (blue)",
    x = "Age",
    y = "Wage (k$)"
  )
```

:::

### Knot placement

- One strategy is to decide $K$, the number of knots, and then place them at appropriate quantiles of the observed $X$.

- In practice, users often specify the degrees of freedom and let software choose the number of knots and their locations.

## Smoothing splines

- Consider this criterion for fitting a smooth function $g(x)$ to some data:
$$
\text{minimize} \quad \sum_{i=1}^n (y_i - g(x_i))^2 + \lambda \int g''(t)^2 \, dt.
$$

- The first term is RSS, and tries to make $g(x)$ match the data at each $x_i$.

- The second term is a **roughness penalty** and controls how wiggly $g(x)$ is. It is modulated by the tuning parameter $\lambda \ge 0$.
    * The smaller $\lambda$, the more wiggly the function, eventually interpolating the $y_i$ when $\lambda = 0$.
    * As $\lambda \to \infty$, the function $g(x)$ becomes linear.

- The solution is a (shrunken) natural cubic spline, with a knot at every unique value of $x_i$. The roughness penalty still controls the roughness via $\lambda$.

- Smoothing splines avoid the knot-selection issue, leaving a single $\lambda$ to be chosen.

- The vector of $n$ fitted values can be written as $\hat{g}_\lambda = S_\lambda y$, where $S_{\lambda}$ is an $n \times n$ matrix (determined by the $x_i$ and $\lambda$).

- The **effective degrees of freedom** are given by
$$
\text{df}_{\lambda} = \sum_{i=1}^n S_{\lambda,ii}.
$$
Thus we can specify `df` rather than $\lambda$.

- The leave-one-out (LOO) cross-validation error is given by
$$
\text{RSS}_{\text{CV}}(\lambda) = \sum_{i=1}^n \left[ \frac{y_i - \hat{g}_\lambda(x_i)}{1 - S_{\lambda,ii}} \right]^2.
$$
See the `smooth.spline` sketch after the plot below.

::: {.panel-tabset}

#### R

The `ggformula` package supplies the `geom_spline` function for displaying smoothing spline fits.

```{r}
#| code-fold: true
library(ggformula)
library(splines)

# Plot wage ~ age
Wage %>%
  ggplot(mapping = aes(x = age, y = wage)) + 
  geom_point(alpha = 0.25) + 
  # Smoothing spline with df = 16
  geom_spline(
    df = 16,
    color = "red"
  ) +
  # Smoothing spline with LOOCV tuned df
  geom_spline(
    # df = 6.8,
    cv = TRUE,
    color = "blue"
  ) +
  labs(
    title = "Smoothing spline with df=16 (red) vs LOOCV tuned df=6.8 (blue)",
    x = "Age",
    y = "Wage (k$)"
  )
```

:::
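
To extract the LOOCV-tuned smoothness directly, here is a minimal sketch using base R's `smooth.spline` (it warns about non-unique `age` values but still runs); the selected effective df of about 6.8 corresponds to the blue curve above.

```{r}
# Smoothing spline of wage ~ age with lambda chosen by LOOCV (cv = TRUE)
fit_ss <- with(Wage, smooth.spline(age, wage, cv = TRUE))
# Effective degrees of freedom and penalty selected by LOOCV
fit_ss$df
fit_ss$lambda
```

## Local regression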

![ISL Figure 7.9: local regression illustrated on simulated data.](ISL_fig_7_9.pdf){width=600px height=600px}
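
The figure above illustrates the machinery at a single target point. Before the summary bullets, here is a minimal usage sketch of R's `loess` via `geom_smooth`; the two span values (0.2 and 0.7) are arbitrary choices to show how the neighborhood width trades off bias and variance.

```{r}
#| code-fold: true
# Local regression fits of wage ~ age at two spans:
# a smaller span uses a narrower neighborhood, hence a wigglier fit
Wage %>%
  ggplot(mapping = aes(x = age, y = wage)) +
  geom_point(alpha = 0.1) +
  geom_smooth(method = "loess", span = 0.2, se = FALSE, color = "red") +
  geom_smooth(method = "loess", span = 0.7, se = FALSE, color = "blue") +
  labs(
    title = "Local regression: span = 0.2 (red) vs span = 0.7 (blue)",
    x = "Age",
    y = "Wage (k$)"
  )
```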

- With a sliding weight function, we fit separate linear fits over the range of $X$ by weighted least squares.

- At $X = x_0$,
$$
\text{minimize} \quad \sum_{i=1}^n K(x_i, x_0) (y_i - \beta_0 - \beta_1 x_i)^2,
$$
where $K$ is a weighting function that assigns heavier weight to $x_i$ close to $x_0$ and zero weight to points furthest from $x_0$.

- **Locally weighted linear regression**: the `loess` function in R (sketched above) and `lowess` in Python.

- Anecdotally, `loess` gives a better appearance, but is $O(N^2)$ in memory, so it does not work for larger data sets.

## Generalized additive model (GAM)

- Generalized additive models (GAMs) allow for flexible nonlinearities in several variables, while retaining the additive structure of linear models:
$$
y_i = \beta_0 + f_1(x_{i1}) + f_2(x_{i2}) + \cdots + f_p (x_{ip}) + \epsilon_i.
$$

- We can fit a GAM simply using, e.g., natural splines.

- The coefficients are not that interesting; the fitted functions are.

- We can mix terms: some linear, some nonlinear, and use ANOVA to compare models (see the sketch at the end of this section).

- We can use smoothing splines or local regression as well. In R: `gam(wage ~ s(year, df = 5) + lo(age, span = 0.5) + education)`.

- GAMs are additive, although low-order interactions can be included in a natural way using, e.g., bivariate smoothers or interactions of the form (in R) `ns(age, df = 5):ns(year, df = 5)`.

::: {.panel-tabset}

#### Python (sklearn)

```{python}
from sklearn.preprocessing import OneHotEncoder, SplineTransformer

# Natural cubic spline features of the year and age predictors
ns_tf = make_column_transformer(
  (SplineTransformer(
    n_knots = 4,
    # knots = 'quantile',
    degree = 3,
    extrapolation = 'linear', # natural cubic spline
    # include_bias = False
  ), ['year']),
  (SplineTransformer(
    n_knots = 5,
    # knots = 'quantile',
    degree = 3,
    extrapolation = 'linear', # natural cubic spline
    # include_bias = False
  ), ['age']),
  (OneHotEncoder(drop = 'first'), ['education']),
  remainder = 'drop'
)

# Define pipeline and fit to Wage data
pipe = Pipeline(steps = [
  ("ns_tf", ns_tf),
  ("model", LinearRegression())
])

# Fit pipeline
X = Wage.drop('wage', axis = 1)
y = Wage.wage
pipe.fit(X, y)

# R^2
pipe.score(X, y)
```

#### Python (statsmodels)

```{python}
from statsmodels.gam.api import GLMGam, BSplines

# Create spline basis for year and age
x_spline = Wage[['year', 'age']]
bs = BSplines(x_spline, df = [4, 5], degree = [3, 3])

# Fit GAM
gam_mod = GLMGam.from_formula('wage ~ education', data = Wage, smoother = bs).fit()
gam_mod.summary()
```

```{python}
# Plot smooth components
for i in [0, 1]:
  plt.figure()
  gam_mod.plot_partial(i, cpr = True)
  plt.show()
```

#### R

Natural splines for `year` and `age`.

```{r}
gam_mod <- lm(
  wage ~ ns(year, df = 4) + ns(age, df = 5) + education, 
  data = Wage
)
summary(gam_mod)
```

Smoothing splines for `year` and `age`.

```{r}
library(gam)
gam_mod <- gam(
  wage ~ s(year, 4) + s(age, 5) + education, 
  data = Wage
)
summary(gam_mod)
plot(gam_mod, se = TRUE, col = "red")
```

:::
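
To illustrate the ANOVA comparison mentioned above, here is a minimal sketch in the spirit of the ISL lab: three nested GAMs that exclude `year`, include it linearly, or include it as a smoothing spline.

```{r}
# M1: no year; M2: linear in year; M3: smoothing spline in year
gam_m1 <- gam(wage ~ s(age, 5) + education, data = Wage)
gam_m2 <- gam(wage ~ year + s(age, 5) + education, data = Wage)
gam_m3 <- gam(wage ~ s(year, 4) + s(age, 5) + education, data = Wage)

# F tests for the nested comparisons
anova(gam_m1, gam_m2, gam_m3, test = "F")
```

In ISL's analysis of these data, a model with a linear `year` term is preferred over one without `year`, while there is no evidence that a nonlinear `year` term is needed.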