Stanford STATS191 in Python, Lecture 4 : Simple Diagnostics for Linear Models
Stanford Stats 191¶
Introduction¶
This is a re-creation of the Stanford Stats 191 course, using Python ecosystem tools instead of R. This notebook covers the lecture "Diagnostics for simple linear regression" (see https://web.stanford.edu/class/stats191/notebooks/Simple_diagnostics.html).
We look at how we can assess whether a linear model is appropriate (or not, as the case may be).
Initial Notebook Setup¶
watermark documents the Python and package environment; black is my chosen Python formatter.
%load_ext watermark
The watermark extension is already loaded. To reload it, use: %reload_ext watermark
%load_ext lab_black
The lab_black extension is already loaded. To reload it, use: %reload_ext lab_black
%matplotlib inline
All imports go here.
import pandas as pd
import numpy as np
import seaborn as sn
import math
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from scipy import stats
from statsmodels.formula.api import ols
from statsmodels.formula.api import rlm
import statsmodels.api as sm
from statsmodels.sandbox.regression.predstd import (
wls_prediction_std,
)
from statsmodels.stats.stattools import jarque_bera
from statsmodels.stats.diagnostic import het_white
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.diagnostic import het_goldfeldquandt
import os
Create Data¶
To illustrate the basic concepts of linear regression, we create a dataset where y depends linearly upon x, with added Gaussian noise.
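For reference, the data-generating model used in the next cell, written out explicitly (the notation is standard; it does not appear in the original notebook), is

$$ y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad \beta_0 = 1, \quad \beta_1 = 0.5, \quad \varepsilon_i \sim N(0, 1) $$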
x = np.linspace(0, 20, 21)
y = 0.5 * x + 1 + np.random.normal(0, 1, 21)
df_dict = {'x': x, 'y': y}
data = pd.DataFrame(df_dict)
res = ols('y ~ x', data=data).fit()
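Note that no random seed is set, so the exact numbers (and every statistic derived from them) will differ from run to run. A minimal seeded sketch (the seed value 42 is arbitrary, and this variant is not used elsewhere in the notebook) would be:

# Fixing the seed makes the simulated noise, and hence the fitted model, reproducible
np.random.seed(42)
y_seeded = 0.5 * x + 1 + np.random.normal(0, 1, 21)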
Plot the data. We choose to use high-level matplotlib methods, at the expense of some plotting elegance
_ = plt.plot(x, y, 'o')
Plot the data and line of best fit
_ = plt.plot(x, y, 'o')
_ = plt.plot(x, res.predict(), 'r-')
_ = plt.axhline(y.mean(), color='k')
The graphs below illustrate the definitions of the various sums of squares: SST is the Total Sum of Squares (deviations of the observed values from their mean), SSE is the Error Sum of Squares (deviations of the observed values from the fitted values), and SSR is the Regression Sum of Squares (deviations of the fitted values from the mean).
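Written out (these are the standard definitions, added here for reference), with the fitted values written as y-hat and the sample mean as y-bar:

$$ SST = \sum_i (y_i - \bar{y})^2, \qquad SSE = \sum_i (y_i - \hat{y}_i)^2, \qquad SSR = \sum_i (\hat{y}_i - \bar{y})^2, \qquad SST = SSE + SSR $$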
_ = plt.plot(x, y, 'o')
_ = plt.plot(x, res.predict(), 'r-')
_ = plt.axhline(y.mean(), color='k')
for i in range(len(x)):
plt.plot([x[i], x[i]], [y[i], y.mean()], 'y-')
# end for
plt.title('Total Sum of Squares')
Text(0.5, 1.0, 'Total Sum of Squares')
_ = plt.plot(x, y, 'o')
_ = plt.plot(x, res.predict(), 'r-')
_ = plt.plot(x, res.predict(), 'r+')
_ = plt.axhline(y.mean(), color='k')
for i in range(len(x)):
plt.plot([x[i], x[i]], [y[i], res.predict()[i]], 'y-')
# end for
plt.title('Error Sum of Squares')
Text(0.5, 1.0, 'Error Sum of Squares')
_ = plt.plot(x, y, 'o')
_ = plt.plot(x, res.predict(), 'r-')
_ = plt.plot(x, res.predict(), 'r+')
_ = plt.axhline(y.mean(), color='k')
for i in range(len(x)):
plt.plot(
[x[i], x[i]], [y.mean(), res.predict()[i]], 'y-'
)
# end for
plt.title('Regression Sum of Squares')
Text(0.5, 1.0, 'Regression Sum of Squares')
data = pd.read_csv('../data/wage.csv')
Explore Data¶
data.head()
education | logwage | |
---|---|---|
0 | 16.750000 | 2.845000 |
1 | 15.000000 | 2.446667 |
2 | 10.000000 | 1.560000 |
3 | 12.666667 | 2.099167 |
4 | 15.000000 | 2.490000 |
_ = plt.plot(data['education'], data['logwage'], 'o')
In order to see more clearly where the data points lie, we add "jitter" so the points don't overlap as much. We also reduce the marker alpha (opacity).
jitter = 0.1
_ = plt.plot(
data['education']
+ np.random.normal(0, 1, len(data['education']))
* jitter,
data['logwage'],
'o',
alpha=0.2,
)
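As an aside (not part of the original lecture), seaborn, already imported as sn, can produce a similar jittered, semi-transparent scatter plot in one call; a minimal sketch, assuming regplot's x_jitter, scatter_kws and fit_reg arguments:

# Jittered scatter plot of logwage against education;
# fit_reg=False suppresses seaborn's own regression line
_ = sn.regplot(
    x='education',
    y='logwage',
    data=data,
    x_jitter=0.1,
    scatter_kws={'alpha': 0.2},
    fit_reg=False,
)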
Perform OLS Linear Best Fit¶
res = ols('logwage ~ education', data=data).fit()
Show the line of best fit
jitter = 0.1
_ = plt.plot(
data['education']
+ np.random.normal(0, 1, len(data['education']))
* jitter,
data['logwage'],
'o',
alpha=0.2,
)
data['predict'] = res.predict()
data_std = data.sort_values(by='education')
_ = plt.plot(
data_std['education'], data_std['predict'], 'r-'
)
Compute the Sums of Squares defined above. We use np.ones() to create vectors of constants.
n_data = len(data['education'])
SSE = sum(res.resid * res.resid)
sst_array = (
data['logwage']
- np.ones(n_data) * data['logwage'].mean()
)
SST = sum(sst_array * sst_array)
ssr_array = (
    np.ones(n_data) * data['logwage'].mean() - res.predict()
)
SSR = sum(ssr_array * ssr_array)
SST, SSE + SSR
(410.2148277475003, 410.21482774750143)
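As an additional check (not in the original notebook), the coefficient of determination can be recovered from these sums of squares and compared with the value statsmodels reports:

# R-squared is the fraction of the total variation explained by the regression
print(SSR / SST, res.rsquared)  # the two values should agree (about 0.135 here)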
Compute the F statistic
F = (SSR / 1) / (SSE / res.df_resid)
F
340.0296918700766
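This is the usual F test of the overall regression; written out (standard definition, added for reference), with df_model = 1 and df_resid = n - 2 here:

$$ F = \frac{SSR / \mathrm{df}_{model}}{SSE / \mathrm{df}_{resid}} \sim F_{\mathrm{df}_{model},\, \mathrm{df}_{resid}} \quad \text{under } H_0: \text{no linear dependence} $$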
Check this manually calculated value against the F-statistic reported in the OLS summary.
res.summary()
Dep. Variable: | logwage | R-squared: | 0.135 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.135 |
Method: | Least Squares | F-statistic: | 340.0 |
Date: | Thu, 19 Mar 2020 | Prob (F-statistic): | 1.15e-70 |
Time: | 20:15:35 | Log-Likelihood: | -1114.3 |
No. Observations: | 2178 | AIC: | 2233. |
Df Residuals: | 2176 | BIC: | 2244. |
Df Model: | 1 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
Intercept | 1.2392 | 0.055 | 22.541 | 0.000 | 1.131 | 1.347 |
education | 0.0786 | 0.004 | 18.440 | 0.000 | 0.070 | 0.087 |
Omnibus: | 46.662 | Durbin-Watson: | 1.769 |
---|---|---|---|
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 59.725 |
Skew: | -0.269 | Prob(JB): | 1.07e-13 |
Kurtosis: | 3.606 | Cond. No. | 82.4 |
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
We get the interval that contains 90% of the F distribution, given the degrees of freedom we have.
stats.f.interval(0.90, 1, res.df_resid)
(0.003933047182904572, 3.8457358464718414)
We clearly have an F statistic much larger than the upper end of this interval, so we reject the Null Hypothesis that there is no linear dependence of logwage on education.
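Equivalently (an extra check, not in the original lecture), we can compute the p-value of our F statistic directly from the F distribution's survival function and compare it with the value statsmodels reports:

# Probability of an F value at least this large under the null hypothesis
print(stats.f.sf(F, 1, res.df_resid), res.f_pvalue)  # both should be about 1.15e-70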
When a Linear Model is Wrong¶
The Anscombe quartet is a famous set of four datasets, all having (almost) the same descriptive statistics (means, variances, correlation, regression line), but wildly different shapes when plotted. We will use the dataset whose underlying relationship is quadratic.
data = pd.read_csv('../data/anscombe.csv')
data.head()
x1 | y1 | x2 | y2 | x3 | y3 | x4 | y4 | |
---|---|---|---|---|---|---|---|---|
0 | 10 | 8.04 | 10 | 9.14 | 10 | 7.46 | 8 | 6.58 |
1 | 8 | 6.95 | 8 | 8.14 | 8 | 6.77 | 8 | 5.76 |
2 | 13 | 7.58 | 13 | 8.74 | 13 | 12.74 | 8 | 7.71 |
3 | 9 | 8.81 | 9 | 8.77 | 9 | 7.11 | 8 | 8.84 |
4 | 11 | 8.33 | 11 | 9.26 | 11 | 7.81 | 8 | 8.47 |
Add some Gaussian noise
data['y2'] = (
data['y2']
+ np.random.normal(0, 1, len(data['y2'])) * 0.45
)
Display the dataset
_ = plt.plot(data['x2'], data['y2'], 'o')
Fit a Linear Model¶
We compute the OLS line of best fit, and display it.
res = ols('y2 ~ x2', data=data).fit()
_ = plt.plot(data['x2'], data['y2'], 'bo')
_ = plt.plot(data['x2'], res.predict(), 'r-')
When we plot the residuals, we can see that they are apparently not distributed at random (negative at the ends, positive in the middle).
_ = plt.plot(data['x2'], res.resid, 'ro')
_ = plt.axhline(0, color='black')
Fit a Quadratic Model¶
We add a quadratic term (x2 squared, stored as the new column x2sq) to our linear model.
data['x2sq'] = data['x2'] * data['x2']
data_std = data.sort_values(by='x2')
res2 = ols('y2 ~ x2 + x2sq', data=data_std).fit()
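An alternative (a sketch, relying on the patsy formula syntax that statsmodels' ols uses) is to write the quadratic term directly in the formula, avoiding the explicit x2sq column:

# I(...) tells patsy to evaluate the enclosed expression, so x2 ** 2 becomes a regressor
res2_alt = ols('y2 ~ x2 + I(x2 ** 2)', data=data_std).fit()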
The fit is clearly much better, and the residuals look to be more randomly distributed
_ = plt.plot(data['x2'], data['y2'], 'bo')
_ = plt.plot(data_std['x2'], res2.resid, 'ro')
_ = plt.plot(data_std['x2'], res2.predict(), 'g-')
_ = plt.axhline(0, color='black')
Just plotting the residuals, we see they are smaller than those of the pure linear model.
_ = plt.plot(data_std['x2'], res2.resid, 'ro')
_ = plt.axhline(0, color='black')
QQ Plots¶
One technique for detecting a bad model is to look at the QQ plots of the residuals from each OLS fit (linear and quadratic). This is a check for normality of the residuals; in this case, the difference is very subtle, and could not be used to reject one model in favour of the other.
_ = sm.qqplot(res.resid, line='r')
_ = sm.qqplot(res2.resid, line='r')
OLS Summary Comparison¶
A comparison of the RegressionResults summary() output shows that the quadratic model does a much better job of explaining the spread of Y2 values (R Squared of 0.962 versus 0.094). Further, the Jarque-Bera
statistic (a test for normality of the residuals) indicates the the residuals are NOT normally distributed in the linear model, and are probably normally distributed in the quadratic model.
The Omnibus
statistic also indicates that the residuals are not normal in the pure linear case, and probably are normal in the quadratic case.
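The jarque_bera function imported earlier (but not otherwise used) can also be applied to the residuals directly; a small sketch, assuming it returns the test statistic, its p-value, and the sample skewness and kurtosis:

# Small p-values would indicate non-normal residuals; with only 11 observations,
# neither model's residuals are flagged here
for label, model in [('linear', res), ('quadratic', res2)]:
    jb_stat, jb_pvalue, skew, kurtosis = jarque_bera(model.resid)
    print(f'{label}: JB = {jb_stat:.3f}, p-value = {jb_pvalue:.3f}')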
res.summary()
D:\Anaconda3\envs\ac5-py37\lib\site-packages\scipy\stats\stats.py:1394: UserWarning: kurtosistest only valid for n>=20 ... continuing anyway, n=11 "anyway, n=%i" % int(n))
Dep. Variable: | y2 | R-squared: | 0.691 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.657 |
Method: | Least Squares | F-statistic: | 20.15 |
Date: | Thu, 19 Mar 2020 | Prob (F-statistic): | 0.00151 |
Time: | 20:15:36 | Log-Likelihood: | -16.975 |
No. Observations: | 11 | AIC: | 37.95 |
Df Residuals: | 9 | BIC: | 38.75 |
Df Model: | 1 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
Intercept | 2.7066 | 1.139 | 2.377 | 0.041 | 0.131 | 5.282 |
x2 | 0.5358 | 0.119 | 4.489 | 0.002 | 0.266 | 0.806 |
Omnibus: | 1.120 | Durbin-Watson: | 2.389 |
---|---|---|---|
Prob(Omnibus): | 0.571 | Jarque-Bera (JB): | 0.865 |
Skew: | -0.577 | Prob(JB): | 0.649 |
Kurtosis: | 2.255 | Cond. No. | 29.1 |
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
res2.summary()
D:\Anaconda3\envs\ac5-py37\lib\site-packages\scipy\stats\stats.py:1394: UserWarning: kurtosistest only valid for n>=20 ... continuing anyway, n=11 "anyway, n=%i" % int(n))
Dep. Variable: | y2 | R-squared: | 0.966 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.958 |
Method: | Least Squares | F-statistic: | 115.0 |
Date: | Thu, 19 Mar 2020 | Prob (F-statistic): | 1.28e-06 |
Time: | 20:15:36 | Log-Likelihood: | -4.7797 |
No. Observations: | 11 | AIC: | 15.56 |
Df Residuals: | 8 | BIC: | 16.75 |
Df Model: | 2 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
Intercept | -5.8860 | 1.134 | -5.189 | 0.001 | -8.502 | -3.270 |
x2 | 2.7142 | 0.272 | 9.961 | 0.000 | 2.086 | 3.342 |
x2sq | -0.1210 | 0.015 | -8.091 | 0.000 | -0.156 | -0.087 |
Omnibus: | 4.088 | Durbin-Watson: | 3.002 |
---|---|---|---|
Prob(Omnibus): | 0.130 | Jarque-Bera (JB): | 1.813 |
Skew: | -0.990 | Prob(JB): | 0.404 |
Kurtosis: | 3.195 | Cond. No. | 954. |
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Heteroscedasticity Tests¶
Read and Explore a Medical Dataset¶
data = pd.read_csv('../data/hiv.txt', sep=' ')
data.head()
GSS | VL | |
---|---|---|
0 | 4.0 | 40406 |
1 | 13.4 | 2603 |
2 | 7.4 | 55246 |
3 | 2.7 | 22257 |
4 | 13.5 | 400 |
_ = plt.plot(data['GSS'], data['VL'], 'bo')
Data Cleaning¶
There is clearly an outlier value, so we exclude it.
data2 = data[data['VL'] < 200_000].copy()
len(data), len(data2)
(47, 46)
Fit a Linear Model¶
res = ols('VL ~ GSS', data=data2).fit()
Display the line of best fit
_ = plt.plot(data['GSS'], data['VL'], 'bo')
_ = plt.plot(data2['GSS'], res.predict(), 'g-')
Plotting the residuals indicates that the error variance is NOT constant.
_ = plt.plot(data2['GSS'], res.resid, 'ro')
_ = plt.axhline(0, color='black')
Looking at the Jarque-Bera statistic in the summary() output, we see that there is a vanishingly small chance that the residuals are normally distributed.
res.summary()
Dep. Variable: | VL | R-squared: | 0.094 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.074 |
Method: | Least Squares | F-statistic: | 4.588 |
Date: | Thu, 19 Mar 2020 | Prob (F-statistic): | 0.0378 |
Time: | 20:15:37 | Log-Likelihood: | -533.49 |
No. Observations: | 46 | AIC: | 1071. |
Df Residuals: | 44 | BIC: | 1075. |
Df Model: | 1 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
Intercept | 2801.5167 | 9296.119 | 0.301 | 0.765 | -1.59e+04 | 2.15e+04 |
GSS | 2270.9411 | 1060.225 | 2.142 | 0.038 | 134.198 | 4407.684 |
Omnibus: | 20.332 | Durbin-Watson: | 2.288 |
---|---|---|---|
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 27.927 |
Skew: | 1.473 | Prob(JB): | 8.63e-07 |
Kurtosis: | 5.428 | Cond. No. | 20.7 |
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Additional Heteroscedasticity Tests¶
statsmodels has a number of tests for constant variance of the residuals. These tests are not covered in the original lecture (although R almost certainly supports them). We show the results for two of them below; they confirm that the residuals are very unlikely to have constant variance.
x = sm.add_constant(data2['GSS'])
lm, lm_pvalue, fvalue, f_pvalue = het_white(res.resid, x)
print(
    'White Test for Heteroscedasticity\n',
    f'lagrange multiplier statistic = {lm},\n',
    f'p-value of the lagrange multiplier statistic = {lm_pvalue},\n',
    f'f-statistic of the hypothesis that the error variance does not depend on x = {fvalue},\n',
    f'p-value for the f-statistic = {f_pvalue}',
)
White Test for Heteroscedasticity lagrange multiplier statistic = 13.406116178729887, p-value of the lagrange multiplier statistic = 0.0012271534139010932, f-statistic of the hypothesis that the error variance does not depend on x = 8.843116071199791, p-value for the f-statistic = 0.0006069222408836576
lm, lm_pvalue, fvalue, f_pvalue = het_breuschpagan(
res.resid, x
)
print(
    'Breusch-Pagan Test for Heteroscedasticity\n',
    f'lagrange multiplier statistic = {lm},\n',
    f'p-value of the lagrange multiplier statistic = {lm_pvalue},\n',
    f'f-statistic of the hypothesis that the error variance does not depend on x = {fvalue},\n',
    f'p-value for the f-statistic = {f_pvalue}',
)
Breusch-Pagan Test for Heteroscedasticity lagrange multiplier statistic = 9.853439236968049, p-value of the lagrange multiplier statistic = 0.0016951455261255141, f-statistic of the hypothesis that the error variance does not depend on x = 11.994262172516247, p-value for the f-statistic = 0.001200741299367376
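The het_goldfeldquandt test imported at the top (but not otherwise used) can be applied in the same way; a minimal sketch, assuming its (y, x) calling convention and the idx argument for choosing the variable to sort by:

# Sort by GSS (column 1 of x; column 0 is the constant), split the sample,
# and compare the residual variances of the two halves; a small p-value
# again points to heteroscedasticity
gq_f, gq_pvalue, gq_ordering = het_goldfeldquandt(
    data2['VL'], x, idx=1
)
print(gq_f, gq_pvalue, gq_ordering)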
Reproducibility¶
%watermark -h -iv
%watermark
statsmodels 0.9.0
pandas 1.0.0
seaborn 0.9.0
matplotlib 3.0.2
scipy 1.1.0
numpy 1.15.4
host name: DESKTOP-SODFUN6
2020-03-19T20:15:37+10:00
CPython 3.7.1
IPython 7.2.0
compiler : MSC v.1915 64 bit (AMD64)
system : Windows
release : 10
machine : AMD64
processor : Intel64 Family 6 Model 94 Stepping 3, GenuineIntel
CPU cores : 8
interpreter: 64bit