Stanford STATS191 in Python, Lecture 6 : Diagnostics for multiple linear regression
Stanford Stats 191¶
Introduction¶
This is a re-creation of the Stanford Stats 191 course (see https://web.stanford.edu/class/stats191/), using Python ecosystem tools instead of R. This notebook covers the lecture "Diagnostics for multiple regression".
Initial Notebook Setup¶
watermark documents the current Python and package environment; black is my preferred Python formatter.
%load_ext watermark
%load_ext lab_black
%matplotlib inline
All imports go here
import pandas as pd
import numpy as np
import seaborn as sn
import math
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from scipy import stats
from statsmodels.formula.api import ols
from statsmodels.formula.api import rlm
import statsmodels.api as sm
from statsmodels.stats.stattools import jarque_bera
import os
# Scottish hill races dataset: Distance (miles), Climb (feet), record Time (minutes)
data = pd.read_csv('../data/hills.csv', sep='\t')
data.head()
| | Race | Distance | Climb | Time |
|---|---|---|---|---|
| 0 | Greenmantle | 2.5 | 650 | 16.083 |
| 1 | Carnethy | 6.0 | 2500 | 48.350 |
| 2 | CraigDunain | 6.0 | 900 | 33.650 |
| 3 | BenRha | 7.5 | 800 | 45.600 |
| 4 | BenLomond | 8.0 | 3070 | 62.267 |
data.sort_values(by='Distance').tail()
| | Race | Distance | Climb | Time |
|---|---|---|---|---|
| 16 | SevenHills | 14.0 | 2200 | 98.417 |
| 6 | BensofJura | 16.0 | 7500 | 204.617 |
| 32 | TwoBreweries | 18.0 | 5200 | 170.250 |
| 34 | MoffatChase | 20.0 | 5000 | 159.833 |
| 10 | LairigGhru | 28.0 | 2100 | 192.667 |
Visualize Data¶
We use Seaborn to visually explore the relationships between the variables
sn.pairplot(data)
<seaborn.axisgrid.PairGrid at 0x1f5a5dd2c18>
Same graph as before, but we also add a regression line to each scatter plot
sn.pairplot(data, kind='reg')
<seaborn.axisgrid.PairGrid at 0x1f5a691db70>
We choose to highlight a particular race that looks unusual. We do this by linking the color of the dot (hue) to the race name.
data['RaceName'] = [
'LairigGhru' if r == 'LairigGhru' else '-'
for r in data['Race']
]
sn.pairplot(
data, kind='scatter', diag_kind='hist', hue='RaceName'
)
<seaborn.axisgrid.PairGrid at 0x1f5a22a0710>
We can also use pandas to do our scatter plots
fig, ax = plt.subplots(1, 1, figsize=(10, 10))
__ = pd.plotting.scatter_matrix(
data, ax=ax, alpha=0.95, grid=False
)
Perform OLS¶
We build a model where the time to complete the race depends linearly upon the distance and the height to climb (plus, of course, a Gaussian error term)
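In symbols, the model just described is

$$\text{Time}_i = \beta_0 + \beta_1 \, \text{Distance}_i + \beta_2 \, \text{Climb}_i + \epsilon_i, \qquad \epsilon_i \sim N(0, \sigma^2)$$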
res = ols('Time ~ Distance + Climb', data=data).fit()
res.summary()
| Dep. Variable: | Time | R-squared: | 0.919 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.914 |
| Method: | Least Squares | F-statistic: | 181.7 |
| Date: | Mon, 23 Mar 2020 | Prob (F-statistic): | 3.40e-18 |
| Time: | 19:11:56 | Log-Likelihood: | -142.11 |
| No. Observations: | 35 | AIC: | 290.2 |
| Df Residuals: | 32 | BIC: | 294.9 |
| Df Model: | 2 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | -8.9920 | 4.303 | -2.090 | 0.045 | -17.756 | -0.228 |
| Distance | 6.2180 | 0.601 | 10.343 | 0.000 | 4.993 | 7.442 |
| Climb | 0.0110 | 0.002 | 5.387 | 0.000 | 0.007 | 0.015 |

| Omnibus: | 47.910 | Durbin-Watson: | 2.249 |
|---|---|---|---|
| Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 233.976 |
| Skew: | 3.026 | Prob(JB): | 1.56e-51 |
| Kurtosis: | 14.127 | Cond. No. | 4.20e+03 |
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 4.2e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
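As an aside, we can check where the condition number reported above comes from. A minimal sketch, assuming (as I believe) that statsmodels computes it as the ratio of the largest to smallest singular value of the design matrix:

# Sketch: condition number of the design matrix (constant, Distance, Climb)
X = sm.add_constant(data[['Distance', 'Climb']]).values
singular_values = np.linalg.svd(X, compute_uv=False)
print(singular_values[0] / singular_values[-1])  # approximately 4.2e+03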
_ = plt.plot(res.resid, 'o', label='Data')
_ = plt.grid()
_ = plt.xlabel('Observation Number')
_ = plt.ylabel('Residual')
_ = plt.legend(loc='best')
We can also plot the predicted and actual values of time against each of our two explanatory variables
_ = plt.plot(
data['Distance'],
res.predict(),
'o',
label='Predicted Time',
)
_ = plt.plot(
data['Distance'],
data['Time'],
'r+',
label='Actual Time',
)
_ = plt.grid()
_ = plt.xlabel('Distance')
_ = plt.ylabel('Time')
_ = plt.legend(loc='best')
_ = plt.plot(
    data['Climb'],
    res.predict(),
    'o',
    label='Predicted Time',
)
_ = plt.plot(
    data['Climb'], data['Time'], 'r+', label='Actual Time'
)
_ = plt.grid()
_ = plt.xlabel('Climb')
_ = plt.ylabel('Time')
_ = plt.legend(loc='best')
We can graph a more detailed view of the residuals, and of residuals plotted against prediction
fig, ax = plt.subplots(1, 1, figsize=(10, 10))
ax.plot(
range(len(res.resid)),
res.resid,
'o',
label='OLS Residuals',
)
ax.grid()
ax.set_xlabel('Residual No.')
ax.set_ylabel('Residual')
ax.spines['bottom'].set_position('zero')
ax.spines['left'].set_position('zero')
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.legend(loc='best')
<matplotlib.legend.Legend at 0x1f5a892bc88>
_ = plt.plot(res.predict(), res.resid, 'o', label='Residual')
_ = plt.grid()
_ = plt.xlabel('Predicted Time')
_ = plt.ylabel('Residual')
_ = plt.axhline(0, c='k')
_ = plt.legend(loc='best')
Display Normality Test Plots¶
One way to check the normality of the residuals is to plot the actual quantiles against the theoretical quantiles of a Normal distribution. We see that, because of two outliers, we do not get the expected straight line.
There are two ways of getting QQ plots: via statsmodels (qqplot()), and via scipy.stats (probplot())
fig = sm.qqplot(res.resid, line='s')
__ = stats.probplot(res.resid, dist='norm', plot=plt)
A sample from a Normal distribution should give a nearly straight line, as below
_ = stats.probplot(
np.random.normal(0, 1, 50), dist='norm', plot=plt
)
(osm, osr), (slope, intercept, r) = stats.probplot(
res.resid, dist='norm', plot=None
)
plt.plot(osm, osr, 'r+')
plt.plot(osm, osm * slope + intercept, 'g-')
[<matplotlib.lines.Line2D at 0x1f5a8c41d30>]
Plot Studentized Residuals¶
Plot Studentized Residuals against Residuals.
Externally Studentized means using all the other residuals (ie excluding the residual under consideration) to get a 'standard deviation' with which to scale the residual, giving a variable that follows a t distribution.
Internally Studentized means using all the residuals (ie including the residual under consideration) to get a 'standard deviation' with which to scale the residual, giving a variable that is only approximately t-distributed.
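A minimal sketch of both definitions (using the standard leave-one-out variance identity, and assuming the fitted model res from above):

# Sketch: internally / externally studentized residuals from first principles
infl = res.get_influence()
h = infl.hat_matrix_diag  # leverage of each observation
e = res.resid.values
n, p = len(e), int(res.df_model) + 1  # p counts the intercept too
s2 = res.mse_resid  # variance estimate using all residuals
r_internal = e / np.sqrt(s2 * (1 - h))
# leave-one-out variance estimate for each observation
s2_loo = (s2 * (n - p) - e ** 2 / (1 - h)) / (n - p - 1)
t_external = e / np.sqrt(s2_loo * (1 - h))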
The RegressionResults object has a method outlier_test() that returns the externally Studentized residuals, together with unadjusted and Bonferroni-adjusted p-values.
res_df = res.outlier_test()
fig, ax = plt.subplots(1, 1, figsize=(8, 8))
ax.plot(
res.resid,
res_df['student_resid'],
'o',
label='Studentized Resid.',
)
ax.set_xlabel('Residual')
ax.set_ylabel(
'Studentized Residual (= External Studentized Residual = rstudent in R)'
)
ax.grid()
ax.legend(loc='best')
<matplotlib.legend.Legend at 0x1f5a3c56b00>
As we see above, we have a small number of residuals that require closer attention
The outlier_test() method returns a pandas DataFrame, as below. The column student_resid holds the externally Studentized residuals discussed above. The unadj_p and bonf(p) columns are discussed below
res_df.head()
| | student_resid | unadj_p | bonf(p) |
|---|---|---|---|
| 0 | 0.162024 | 0.872339 | 1.0 |
| 1 | -0.524115 | 0.603926 | 1.0 |
| 2 | -0.315720 | 0.754331 | 1.0 |
| 3 | -0.060574 | 0.952087 | 1.0 |
| 4 | -0.866026 | 0.393129 | 1.0 |
We can also get the Studentized residuals (both types) from the OLSInfluence object, obtained via the get_influence() method of the RegressionResults object.
influence = res.get_influence()
type(influence)
statsmodels.stats.outliers_influence.OLSInfluence
standardized_residuals_int = (
influence.resid_studentized_internal
)
standardized_residuals_ext = (
influence.resid_studentized_external
)
Outlier_test results (Bonferroni Adjustment)¶
One approach we could take is to look at the residual for each race (actual time minus predicted time), and reject those observations where the residual is unexpectedly large: that is, larger than a pre-chosen threshold (e.g. a residual of this size would only occur with a frequency of 5%, or 1%, or whatever our chosen cutoff frequency was). The problem with this approach is that the error term in our model is random, so if we have a very large number of valid data points, we still expect 5% of them to exceed a threshold set at the 5% level.
Carlo Emilio Bonferroni (an Italian mathematician) developed a correction that reduces the number of valid data points being rejected.
The DataFrame returned by outlier_test() holds, for each observation, the probability that we would see a residual of this size by random chance (assuming all residuals are t-distributed). I understand the bonf(p) column to be the Bonferroni-adjusted p-value: the unadjusted probability scaled up by the number of observations, capped at 1.
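A minimal sketch of the adjustment as I understand it, checked against the column returned by outlier_test():

# Sketch: Bonferroni adjustment = unadjusted p-value * number of tests, capped at 1
n_tests = len(res_df)
bonf_manual = np.minimum(res_df['unadj_p'] * n_tests, 1.0)
print(np.allclose(bonf_manual, res_df['bonf(p)']))  # expect True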
res_df.head()
| | student_resid | unadj_p | bonf(p) |
|---|---|---|---|
| 0 | 0.162024 | 0.872339 | 1.0 |
| 1 | -0.524115 | 0.603926 | 1.0 |
| 2 | -0.315720 | 0.754331 | 1.0 |
| 3 | -0.060574 | 0.952087 | 1.0 |
| 4 | -0.866026 | 0.393129 | 1.0 |
We can build a boolean mask that extracts those observations that are very surprising (i.e. almost certainly not valid)
outlier_m = [r < 0.05 for r in res_df['bonf(p)']]
print(res_df[outlier_m])
print(data[outlier_m])
    student_resid       unadj_p       bonf(p)
17       7.610845  1.397273e-08  4.890457e-07
         Race  Distance  Climb   Time RaceName
17  KnockHill       3.0    350  78.65        -
We can plot the unadjusted and adjusted probabilities, and a 5% cutoff line. We see the one point below the cutoff.
idf = range(1, len(data) + 1)
plt.plot(idf, res_df['bonf(p)'], 'o', label='Adjusted Prob')
plt.plot(idf, res_df['unadj_p'], 'r+', label='Raw Prob')
plt.axhline(0.05, c='g', label='Threshold')
plt.legend(loc='best')
plt.xlabel('Observation No')
plt.ylabel('Probability')
Text(0, 0.5, 'Probability')
It turns out that the dataset has a known error (http://www.statsci.org/data/general/hills.html):
The data contains a known error - Atkinson (1986) reports that the record for Knock Hill (observation 17) should actually be 18 minutes rather than 78 minutes.
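As a hypothetical follow-up (not part of the original course material), we could apply Atkinson's correction and refit, to see how much this single bad record distorted the coefficients:

# Hypothetical: correct the known error (78.65 -> 18 minutes) and refit
data_fixed = data.copy()
data_fixed.loc[17, 'Time'] = 18.0  # Atkinson (1986) correction
res_fixed = ols('Time ~ Distance + Climb', data=data_fixed).fit()
print(res_fixed.params)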
Extended Analysis of Residuals¶
We see that outlier_test() returns the same array as influence.resid_studentized_external. We will compare the externally and internally Studentized residuals for each point; again, we see a few suspicious points.
res_df.head()
| | student_resid | unadj_p | bonf(p) |
|---|---|---|---|
| 0 | 0.162024 | 0.872339 | 1.0 |
| 1 | -0.524115 | 0.603926 | 1.0 |
| 2 | -0.315720 | 0.754331 | 1.0 |
| 3 | -0.060574 | 0.952087 | 1.0 |
| 4 | -0.866026 | 0.393129 | 1.0 |
print(
    'outlier_test student_resid ',
    res_df['student_resid'].values[0:5],
)
standardized_residuals_int = (
influence.resid_studentized_internal
)
standardized_residuals_ext = (
influence.resid_studentized_external
)
print(
'External Stud Resid ',
standardized_residuals_ext[0:5],
)
print(
'Internal Stud Resid ',
standardized_residuals_int[0:5],
)
outlier_test student_resid  [ 0.16202389 -0.52411478 -0.31572031 -0.06057395 -0.86602567]
External Stud Resid  [ 0.16202389 -0.52411478 -0.31572031 -0.06057395 -0.86602567]
Internal Stud Resid  [ 0.16454678 -0.53015742 -0.32025768 -0.06153956 -0.86942853]
Display the plots of externally Studentized residuals versus the residuals, and versus the internally Studentized residuals. Valid points should lie close to a straight line. Again we see a few points that are 'suspicious'
fig, ax = plt.subplots(1, 2, figsize=(10, 8))
ax[0].plot(
res.resid,
standardized_residuals_ext,
'ro',
label='Ext Stud Resid',
)
ax[0].set_xlabel('Residual')
ax[0].set_ylabel(
'External Studentized Residual (= rstudent in R)'
)
ax[0].grid()
ax[0].legend(loc='best')
ax[1].plot(
standardized_residuals_int,
standardized_residuals_ext,
'ro',
label='Int Stud Resid',
)
ax[1].set_ylabel(
'External Studentized Residual (= rstudent in R)'
)
ax[1].set_xlabel(
'Internal Studentized Residual (= rstandard in R)'
)
ax[1].grid()
ax[1].legend(loc='best')
fig.suptitle(
'Standard Residuals\n(as per\n'
+ 'https://web.stanford.edu/class/stats191/notebooks/Diagnostics_for_multiple_regression.html'
)
Text(0.5, 0.98, 'Standard Residuals\n(as per\nhttps://web.stanford.edu/class/stats191/notebooks/Diagnostics_for_multiple_regression.html')
Constant Variance Test¶
We look at what is in some sense the 'variance' of each externally Studentized residual (via the square root of its absolute value), and again we can identify a few points that seem different from the others. Otherwise, the variance appears to be constant
zz = np.sqrt(np.abs(standardized_residuals_ext))
fig, ax = plt.subplots(figsize=(10, 8))
ax.plot(res.predict(), zz, 'o', label='Data')
ax.set_xlabel('Fitted Value')
ax.set_ylabel('sqrt(abs(Externally Studentized Residual))')
ax.grid()
ax.axhline(1, c='g', label='Outlier Cutoff')
ax.legend(loc='best')
<matplotlib.legend.Legend at 0x1f5a8fb97f0>
We can plot the residuals against the predicted values. Again, the variance appears constant across the range of predicted values
plt.plot(res.predict(), res.resid, 'o')
_ = plt.xlabel('Predicted Time')
_ = plt.ylabel('Residual')
Measure of Influence¶
DFFITS¶
DFFITS (standing for Difference in Fits) is a measure of how influential an observation is, gauged by seeing how the prediction for an observation changes if the observation is excluded from the regression.
The OLSInfluence object can be used to get this value for each observation
# df[0] is an array of the DFFITS value for each observation
# df[1] is the suggested threshold for suspicious DFFITS values
#   (= 2 * sqrt(k / n), where k is the number of model parameters
#   [= 3: constant, Distance, Climb] and n is the dataset size)
df = influence.dffits
df_val = df[0]
df_thresh = df[1]
ldf = len(df_val)
print(f'DFFITS threshold value = {df_thresh: 15.5f}')
DFFITS threshold value =         0.58554
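As a cross-check, DFFITS can be computed from the externally Studentized residuals and the leverage values; a minimal sketch of the standard definition:

# Sketch: DFFITS_i = t_i * sqrt(h_i / (1 - h_i)), with t_i the externally
# studentized residual and h_i the leverage of observation i
t_ext = influence.resid_studentized_external
h = influence.hat_matrix_diag
dffits_manual = t_ext * np.sqrt(h / (1 - h))
print(np.allclose(dffits_manual, df_val))  # expect True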
Plot our DFFITS values, and the threshold
idf = range(ldf)
plt.plot(idf, df_val, 'o')
plt.axhline(df_thresh, c='g')
<matplotlib.lines.Line2D at 0x1f5aa08deb8>
Show the races that the DFFITS values indicate should be investigated (strictly, the criterion is the absolute value of DFFITS exceeding the threshold)
outlier = [abs(r) > df_thresh for r in df_val]
data[outlier]
| | Race | Distance | Climb | Time | RaceName |
|---|---|---|---|---|---|
| 6 | BensofJura | 16.0 | 7500 | 204.617 | - |
| 10 | LairigGhru | 28.0 | 2100 | 192.667 | LairigGhru |
| 17 | KnockHill | 3.0 | 350 | 78.650 | - |
Checking the DFFITS threshold value returned by the OLSInfluence object
n_feat = 2
thresh = 2 * np.sqrt((n_feat + 1) / len(data['Time']))
print('Outlier Threshold ', thresh)
Outlier Threshold  0.5855400437691199
Cooks Distance¶
Cook's Distance is a measure of how much the whole regression changes when we delete an observation. We can get this from the OLSInfluence object. Two arrays are returned: the first holds the Cook's Distance values (matching the R convention), and the second the associated p-values
cooks = influence.cooks_distance
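Cook's Distance can also be computed directly from the internally Studentized residuals and the leverage; a minimal sketch of the standard definition:

# Sketch: D_i = (r_i^2 / p) * h_i / (1 - h_i), with r_i the internally
# studentized residual and p the number of parameters (3 here)
r_int = influence.resid_studentized_internal
h = influence.hat_matrix_diag
p = int(res.df_model) + 1
cooks_manual = (r_int ** 2 / p) * h / (1 - h)
print(np.allclose(cooks_manual, cooks[0]))  # expect True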
Plot the Cook's Distance for each observation
plt.plot(idf, cooks[1], 'o', label='cooks[1]')
plt.plot(
idf, cooks[0], 'r+', label='cooks[0] = R'
) # R definition
plt.xlabel('Observation No')
plt.ylabel('Cooks Distance')
plt.legend(loc='best')
<matplotlib.legend.Legend at 0x1f5aa139d68>
Again we mask in the races with Cook's Distance over a threshold, and we get our usual suspects
outlier = [r > 0.1 for r in cooks[0]]
data[outlier]
| | Race | Distance | Climb | Time | RaceName |
|---|---|---|---|---|---|
| 6 | BensofJura | 16.0 | 7500 | 204.617 | - |
| 10 | LairigGhru | 28.0 | 2100 | 192.667 | LairigGhru |
| 17 | KnockHill | 3.0 | 350 | 78.650 | - |
DFBETA¶
DFBETA measures the difference in the estimated coefficients when a single observation is removed from the OLS fit. We can get this from the OLSInfluence object.
Because we are estimating three parameters (constant, Distance, Climb), we get three values back for each observation (a brute-force check of one observation is sketched below)
dfbetas = influence.dfbetas
type(dfbetas)
numpy.ndarray
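As a sanity check, we can compute the raw (unscaled) coefficient changes for one observation by brute force; note that the dfbetas returned by statsmodels additionally scale each difference by a leave-one-out standard error of the coefficient:

# Sketch: raw DFBETA for observation 17 (KnockHill), by refitting without it
res_loo = ols(
    'Time ~ Distance + Climb', data=data.drop(index=17)
).fit()
print(res.params - res_loo.params)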
Plot the influence observations have over the coefficient of Climb
dfb0 = [r[0] for r in dfbetas]
dfb1 = [r[1] for r in dfbetas]
dfb2 = [r[2] for r in dfbetas]
plt.plot(idf, dfb2, 'o')
# Climb
[<matplotlib.lines.Line2D at 0x1f5aa31a208>]
Plot the influence each observation has on the Distance coefficient
plt.plot(idf, dfb1, 'o')
# Distance
[<matplotlib.lines.Line2D at 0x1f5aa37b160>]
Combine the two plots, and draw thresholds at ±2/sqrt(n)
plt.plot(idf, dfb2, 'o') # Climb
plt.plot(idf, dfb1, 'o') # Distance
plt.axhline(2 / np.sqrt(len(data['Time'])), c='g')
plt.axhline(-2 / np.sqrt(len(data['Time'])), c='g')
<matplotlib.lines.Line2D at 0x1f5a8c31438>
Investigation of outliers¶
Given that we have a set of unusual races that a number of metrics have flagged, we can use seaborn to show these in the scatter plots. We explicitly name each suspect race in a new DataFrame column, and assign a placeholder name to all the rest.
data['RaceName'] = [
r
if r in ['BensofJura', 'LairigGhru', 'KnockHill']
else '-'
for r in data['Race']
]
sn.pairplot(
data, kind='scatter', diag_kind='hist', hue='RaceName'
)
<seaborn.axisgrid.PairGrid at 0x1f5a89343c8>
One observation we can make is that two of our outliers are indeed outliers in input space, being the longest and the highest race. Having outliers like this might make us question our linear model: do runners operate in two physiological regimes, sprints and marathons? Should a model that fits short distances be expected to work at long distances?
Outliers like KnockHill we are more likely to attribute to human error (as indeed was reportedly the case here)
Leverage¶
We can get a view as to how far an observation is from the norm in input space by using the leverage concept: the leverage of observation $i$ is $h_{ii}$, the $i$-th diagonal element of the hat matrix $H = X(X^TX)^{-1}X^T$. We can get this from the OLSInfluence object.
hat = influence.hat_matrix_diag
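A minimal sketch of where these values come from: the hat matrix built directly from the design matrix should have the same diagonal.

# Sketch: leverage = diagonal of H = X (X'X)^{-1} X'
X = sm.add_constant(data[['Distance', 'Climb']]).values
H = X @ np.linalg.inv(X.T @ X) @ X.T
print(np.allclose(np.diag(H), hat))  # expect True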
Plot the leverage values
plt.plot(idf, hat, 'o')
plt.xlabel('Observation No.')
plt.ylabel('Leverage')
Text(0, 0.5, 'Leverage')
Print out the outlier races. We get the races at the extremes of distance and climb
outlier = [r > 0.3 for r in hat]
data[outlier]
| | Race | Distance | Climb | Time | RaceName |
|---|---|---|---|---|---|
| 6 | BensofJura | 16.0 | 7500 | 204.617 | BensofJura |
| 10 | LairigGhru | 28.0 | 2100 | 192.667 | LairigGhru |
We can display those observations with large residuals and large leverage
plt.plot(hat, standardized_residuals_int, 'o')
plt.xlabel('Leverage')
plt.ylabel('Internally Standardised Residuals')
Text(0, 0.5, 'Internally Standardised Residuals')
A slightly better version of the graph above, done with the OO interface to Matplotlib
fig, ax = plt.subplots(figsize=(10, 8))
ax.plot(hat, standardized_residuals_int, 'o', label='Data')
ax.set_xlabel('Leverage')
ax.set_ylabel('Standard Residual (Internal)')
ax.grid()
ax.legend(loc='best')
<matplotlib.legend.Legend at 0x1f5a8d535c0>
statsmodels has a purpose-built graphic for visualizing observation influence. Once again, we get our usual suspects.
fig, ax = plt.subplots(figsize=(10, 8))
_ = sm.graphics.influence_plot(res, ax=ax)
data.iloc[[17, 6, 10]]
| | Race | Distance | Climb | Time | RaceName |
|---|---|---|---|---|---|
| 17 | KnockHill | 3.0 | 350 | 78.650 | KnockHill |
| 6 | BensofJura | 16.0 | 7500 | 204.617 | BensofJura |
| 10 | LairigGhru | 28.0 | 2100 | 192.667 | LairigGhru |
statsmodels has another closely related graphic, to plot leverage statistics vs. normalized residuals squared
fig, ax = plt.subplots(figsize=(10, 8))
_ = sm.graphics.plot_leverage_resid2(res, ax=ax)
Excluding Large Residuals¶
The examples below show the impact of the Bonferroni correction. First, we plot the externally Studentized residuals against observation number, and set thresholds based upon the t distribution (at the 95% level), with no adjustment
fig, ax = plt.subplots(figsize=(10, 8))
n_features = 2
thresh1, thresh2 = stats.t.interval(
0.95, len(data['Time']) - n_features - 2
)
ax.plot(idf, standardized_residuals_ext, 'o', label='Data')
ax.set_xlabel('Observation No')
ax.set_ylabel('Externally Studentized Residual')
ax.grid()
ax.axhline(thresh1, c='g', label='Outlier Cutoff')
ax.axhline(thresh2, c='g', label='Outlier Cutoff')
ax.legend(loc='best')
print(thresh1, thresh2)
-2.0395134463964077 2.0395134463964077
Because we are looking at many residuals, even under the null hypothesis (all observations valid) some will be rejected. The example below, where all points are genuinely drawn from Normal(0, 1), still has some rejected.
sample = np.random.normal(0, 1, 100)
n_features = 1
thresh1, thresh2 = stats.t.interval(
0.95, len(sample) - n_features - 2
)
plt.plot(range(100), sample, 'o')
plt.axhline(thresh1, c='g', label='Outlier Cutoff')
plt.axhline(thresh2, c='g', label='Outlier Cutoff')
<matplotlib.lines.Line2D at 0x1f5a84d1f28>
print(thresh1, thresh2)
-1.984723185927883 1.984723185927883
Apply Bonferroni Correction (widen acceptance region)¶
We widen the acceptance region by the Bonferroni correction
fig, ax = plt.subplots(figsize=(10, 8))
n_points = len(data['Time'])
n_features = 2
alpha = 0.05
thresh1, thresh2 = stats.t.interval(
1 - alpha / n_points, len(data['Time']) - n_features - 2
)
ax.plot(idf, standardized_residuals_ext, 'o', label='Data')
ax.set_xlabel('Observation No')
ax.set_ylabel('Externally Studentized Residual')
ax.grid()
ax.axhline(thresh1, c='g', label='Outlier Cutoff')
ax.axhline(thresh2, c='g', label='Outlier Cutoff')
ax.legend(loc='best')
print(thresh1, thresh2)
-3.501165862689355 3.501165862689384
We see that under our new thresholds, only one observation is excluded, and it is the one point that we know (from other sources) is not valid
outlier = [
np.abs(r) > thresh2 for r in standardized_residuals_ext
]
data[outlier]
| | Race | Distance | Climb | Time | RaceName |
|---|---|---|---|---|---|
| 17 | KnockHill | 3.0 | 350 | 78.65 | KnockHill |
The data contains a known error - Atkinson (1986) reports that the record for Knock Hill (observation 17) should actually be 18 minutes rather than 78 minutes.
Additional Plots¶
There are two additional plots that can be used to assess the linear model. STATS191 calls them "added-variable" and "component plus residual" plots; statsmodels calls them "partial regression plots for a set of regressors" and "component and component-plus-residual (CCPR) plots" respectively.
Both graphics seem to support our linear model
CCPR¶
fig = plt.figure(figsize=(8, 8))
_ = sm.graphics.plot_ccpr_grid(res, fig=fig)
plt.show()
Partial Regression Plot¶
fig = plt.figure(figsize=(8, 8))
_ = sm.graphics.plot_partregress_grid(res, fig=fig)
Reproducibility¶
%watermark -h -iv
%watermark
scipy 1.1.0
matplotlib 3.0.2
numpy 1.15.4
seaborn 0.9.0
statsmodels 0.9.0
pandas 1.0.0
host name: DESKTOP-SODFUN6
2020-03-23T19:12:05+10:00
CPython 3.7.1
IPython 7.2.0
compiler   : MSC v.1915 64 bit (AMD64)
system     : Windows
release    : 10
machine    : AMD64
processor  : Intel64 Family 6 Model 94 Stepping 3, GenuineIntel
CPU cores  : 8
interpreter: 64bit