Stanford STATS191 in Python, Lecture 14 : Logistic Regression
Stanford Stats 191¶
Introduction¶
This is a re-creation of the Stanford Stats 191 course, using Python ecosystem tools instead of R. This notebook covers the lecture on Logistic Regression (see https://web.stanford.edu/class/stats191/notebooks/Logistic.html).
Initial Notebook Setup¶
watermark documents the Python and package environment; black is my chosen Python formatter.
%load_ext watermark
The watermark extension is already loaded. To reload it, use: %reload_ext watermark
%reload_ext lab_black
%matplotlib inline
import pandas as pd
import numpy as np
import seaborn as sn
import math
import warnings
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from scipy import stats
from statsmodels.formula.api import logit
from statsmodels.formula.api import probit
from statsmodels.formula.api import glm
import statsmodels.api as sm
import os
Initial Exploration¶
In order to make sure that I understood the statsmodels API, I applied it to the example given in Wikipedia (https://en.wikipedia.org/wiki/Logistic_regression).
We have a dataset that relates hours studied to the PASS / FAIL result in an exam.
Note that the result (dependent variable) is a binary choice, so linear regression would not be an appropriate method; a quick sketch below, after the data is loaded, illustrates why.
Hours = [
0.50,
0.75,
1.00,
1.25,
1.50,
1.75,
1.75,
2.00,
2.25,
2.50,
2.75,
3.00,
3.25,
3.50,
4.0,
4.25,
4.50,
4.75,
5.00,
5.50,
]
Pass = [
0,
0,
0,
0,
0,
0,
1,
0,
1,
0,
1,
0,
1,
0,
1,
1,
1,
1,
1,
1,
]
Create a pandas dataframe.
exam_dict = {'Hours': Hours, 'Pass': Pass}
exam = pd.DataFrame(exam_dict)
exam.head()
| Hours | Pass |
---|---|---|
0 | 0.50 | 0 |
1 | 0.75 | 0 |
2 | 1.00 | 0 |
3 | 1.25 | 0 |
4 | 1.50 | 0 |
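As a quick aside (my addition, not part of the original lecture), the sketch below fits an ordinary least squares line to the same binary Pass data, to illustrate why a plain linear model is awkward here: over a modest range of study hours the fitted line drops below 0 and rises above 1, so it cannot be read as a probability.
# Sketch only: fit a straight line to the binary Pass variable with OLS,
# to show that the fitted values are not confined to the [0, 1] range
from statsmodels.formula.api import ols

ols_res = ols('Pass ~ Hours', data=exam).fit()
h_grid = np.linspace(0, 6, 50)
ols_pred = ols_res.predict(pd.DataFrame({'Hours': h_grid}))

_ = plt.plot(exam['Hours'], exam['Pass'], 'r+', label='Data')
_ = plt.plot(h_grid, ols_pred, 'b-', label='OLS fit')
_ = plt.axhline(0, color='black', linewidth=0.5)
_ = plt.axhline(1, color='black', linewidth=0.5)
_ = plt.legend(loc='upper left')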
Use the statsmodels
formula API to perform Logistic Regression.
res = logit('Pass ~ Hours', data=exam).fit()
res.summary()
Optimization terminated successfully. Current function value: 0.401494 Iterations 7
Dep. Variable: | Pass | No. Observations: | 20 |
---|---|---|---|
Model: | Logit | Df Residuals: | 18 |
Method: | MLE | Df Model: | 1 |
Date: | Thu, 21 May 2020 | Pseudo R-squ.: | 0.4208 |
Time: | 18:49:15 | Log-Likelihood: | -8.0299 |
converged: | True | LL-Null: | -13.863 |
LLR p-value: | 0.0006365 |
coef | std err | z | P>|z| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
Intercept | -4.0777 | 1.761 | -2.316 | 0.021 | -7.529 | -0.626 |
Hours | 1.5046 | 0.629 | 2.393 | 0.017 | 0.272 | 2.737 |
Do a quick plot to show the raw data and the results
_ = plt.plot(
exam['Hours'], res.predict(), 'o', label='Predicted'
)
_ = plt.plot(
exam['Hours'], exam['Pass'], 'r+', label='Data'
)
_ = plt.legend(loc='center left')
Compare the probability of passing for two students (one studied for 2 hours, the other for 4 hours)
xx = {'Hours': [2, 4]}
xx_df = pd.DataFrame(xx)
res.predict(xx_df)
0    0.255703
1    0.874448
dtype: float64
# Verify the first prediction by hand, using the logistic function:
# p = 1 / (1 + exp(-(intercept + coef * hours)))
hours = 2
prob = 1.0 / (
    1.0 + np.exp(-(res.params[0] + res.params[1] * hours))
)
prob
0.25570318264090985
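The Hours coefficient can also be read on the odds scale: exponentiating it gives the factor (about 4.5 here, from exp(1.5046)) by which the odds of passing multiply for each extra hour of study. The short sketch below is my addition, not part of the original notes; it checks this reading against the two predictions above.
# Sketch: odds-ratio reading of the Hours coefficient.
# exp(coef) is the multiplicative change in the odds of passing per extra hour.
odds_ratio_per_hour = np.exp(res.params['Hours'])
print(f'Odds ratio per hour of study: {odds_ratio_per_hour:.3f}')

# Check: going from 2 to 4 hours should multiply the odds by odds_ratio_per_hour ** 2
p2, p4 = res.predict(pd.DataFrame({'Hours': [2, 4]}))
odds2, odds4 = p2 / (1 - p2), p4 / (1 - p4)
print(f'Odds ratio (4 h vs 2 h): {odds4 / odds2:.3f}')
print(f'odds_ratio_per_hour ** 2: {odds_ratio_per_hour ** 2:.3f}')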
The second variant of the summary function, summary2, reports essentially the same information, with AIC and BIC added.
res.summary2()
Model: | Logit | Pseudo R-squared: | 0.421 |
Dependent Variable: | Pass | AIC: | 20.0598 |
Date: | 2020-05-21 18:49 | BIC: | 22.0512 |
No. Observations: | 20 | Log-Likelihood: | -8.0299 |
Df Model: | 1 | LL-Null: | -13.863 |
Df Residuals: | 18 | LLR p-value: | 0.00063648 |
Converged: | 1.0000 | Scale: | 1.0000 |
No. Iterations: | 7.0000 |
Coef. | Std.Err. | z | P>|z| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
Intercept | -4.0777 | 1.7610 | -2.3156 | 0.0206 | -7.5292 | -0.6262 |
Hours | 1.5046 | 0.6287 | 2.3932 | 0.0167 | 0.2724 | 2.7369 |
Probit¶
In order to analyse binary data, and interpret the predictions as a probability / odds / likelihood of success or failure, we want a function that ranges from 0 to 1, has a steep transition, and has tractable mathematical properties. The usual choice is the logistic function, as shown below.
Some areas of analysis instead use the probit function, the Cumulative Distribution Function (CDF) of the standard normal distribution.
Logistic regression is more popular in health sciences like epidemiology, partly because coefficients can be interpreted in terms of odds ratios. Probit models can be generalized to account for non-constant error variances in more advanced econometric settings (known as heteroskedastic probit models) and hence are used in some contexts by economists and political scientists.
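To make the logit / probit comparison concrete, here is a minimal sketch (my addition, not from the lecture notes) overlaying the logistic CDF and the standard normal CDF. Scaling the normal argument by roughly 1.7, a standard rule of thumb, makes the two curves track each other closely, which is why the two models give such similar fitted probabilities below.
# Sketch: compare the two link functions used by Logit and Probit models.
# The standard normal CDF, with its argument scaled by ~1.7, closely tracks
# the logistic CDF.
z = np.linspace(-6, 6, 200)
logistic_cdf = 1 / (1 + np.exp(-z))
normal_cdf_scaled = stats.norm.cdf(z / 1.7)

_ = plt.plot(z, logistic_cdf, label='Logistic CDF (logit link)')
_ = plt.plot(z, normal_cdf_scaled, '--', label='Normal CDF, z/1.7 (probit link)')
_ = plt.legend(loc='upper left')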
statsmodels supports Probit models.
res2 = probit('Pass ~ Hours', data=exam).fit()
res2.summary()
Optimization terminated successfully. Current function value: 0.394887 Iterations 6
Dep. Variable: | Pass | No. Observations: | 20 |
---|---|---|---|
Model: | Probit | Df Residuals: | 18 |
Method: | MLE | Df Model: | 1 |
Date: | Thu, 21 May 2020 | Pseudo R-squ.: | 0.4303 |
Time: | 18:49:16 | Log-Likelihood: | -7.8977 |
converged: | True | LL-Null: | -13.863 |
LLR p-value: | 0.0005522 |
coef | std err | z | P>|z| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
Intercept | -2.4728 | 0.984 | -2.514 | 0.012 | -4.401 | -0.545 |
Hours | 0.9127 | 0.351 | 2.601 | 0.009 | 0.225 | 1.600 |
Display the predictions versus actual data
_ = plt.plot(exam['Hours'], res2.predict(), 'o')
_ = plt.plot(exam['Hours'], exam['Pass'], 'r+')
The difference between the Logistic and Probit models for the exam dataset is minimal
_ = plt.plot(
exam['Hours'], res.predict() - res2.predict(), 'r+'
)
_ = plt.axhline(0, color='black')
Confidence Intervals¶
At https://stackoverflow.com/questions/47414842/confidence-interval-of-probability-prediction-from-logistic-regression-statsmode, there is a very good discussion of confidence intervals in the Logit model. I was not able to reproduce the results of the R functions used in the lecture notes.
Approach 1¶
This is a very misguided approach: it assumes normality of the parameters of the linear model fed into the logistic function, and simply plugs their confidence limits into that function. The conf_int
method returns the CI limits for these parameters.
res.summary()
Dep. Variable: | Pass | No. Observations: | 20 |
---|---|---|---|
Model: | Logit | Df Residuals: | 18 |
Method: | MLE | Df Model: | 1 |
Date: | Thu, 21 May 2020 | Pseudo R-squ.: | 0.4208 |
Time: | 18:49:16 | Log-Likelihood: | -8.0299 |
converged: | True | LL-Null: | -13.863 |
LLR p-value: | 0.0006365 |
coef | std err | z | P>|z| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
Intercept | -4.0777 | 1.761 | -2.316 | 0.021 | -7.529 | -0.626 |
Hours | 1.5046 | 0.629 | 2.393 | 0.017 | 0.272 | 2.737 |
h = np.linspace(0, 6, 20)
# conf_int(0.05) returns the 95% CI limits for the intercept and slope
ci = res.conf_int(0.05)
print(ci)
lower = ci[0]
upper = ci[1]
# plug the lower limits (prob1) and upper limits (prob2) of both
# parameters straight into the logistic function
prob1 = 1 / (1 + np.exp(-(ci[0][0] + ci[0][1] * h)))
prob2 = 1 / (1 + np.exp(-(ci[1][0] + ci[1][1] * h)))
_ = plt.plot(h, prob1, 'r-')
_ = plt.plot(h, prob2, 'g-')
_ = plt.plot(exam['Hours'], res.predict(), 'o')
                  0         1
Intercept -7.529199 -0.626228
Hours      0.272375  2.736916
As you can see, the results are close to meaningless
Delta Approach¶
The delta method assumes the predicted probabilities are approximately normal, and derives their standard errors by propagating the parameter covariance matrix through the gradient of each predicted probability.
Run the regression again, using the object API
x = np.array(exam['Hours'])
X = sm.add_constant(x)
y = exam['Pass']
model = sm.Logit(y, X).fit()
proba = model.predict(X)
_ = plt.plot(x, proba, 'o', label='Prediction')
_ = plt.plot(x, y, 'r+', label='Actual')
_ = plt.legend(loc='center left')
Optimization terminated successfully. Current function value: 0.401494 Iterations 7
Code taken from the reference above
# estimate confidence interval for predicted probabilities
cov = model.cov_params()
gradient = (
proba * (1 - proba) * X.T
).T # matrix of gradients for each observation
std_errors = np.array(
[np.sqrt(np.dot(np.dot(g, cov), g)) for g in gradient]
)
c = 1.96 # multiplier for confidence interval
upper = np.maximum(0, np.minimum(1, proba + std_errors * c))
lower = np.maximum(0, np.minimum(1, proba - std_errors * c))
_ = plt.plot(x, proba, 'o', label='Prediction')
_ = plt.plot(x, upper, 'g-', label='Upper 95% CI')
_ = plt.plot(x, lower, 'r-', label='Lower 95% CI')
_ = plt.legend(loc='upper left')
This time the results accord more with my intuition
Bootstrap Approach¶
In this approach we sample the original dataset with replacement, and get an empirical assessment of the spread in predictions. For each sample, we fit a model, and predict the response.
Some gotchas:
- Sampling a small logistic regression dataset can lead to PerfectSeparationError errors, where there is no overlap between the Pass / Fail subsets.
- We can get linear algebra warnings (I think as a result of our small dataset).
- statsmodels issues warnings about the models we subsequently throw away; we have to turn these warnings off.
We throw any bad sample datasets away and move on. At the end, we use numpy to get the 95% range of prediction values.
# Turn off statsmodels warnings
with warnings.catch_warnings():
    warnings.filterwarnings("ignore")

    x = np.array(exam['Hours'])
    X = sm.add_constant(x)
    y = exam['Pass']

    # preds holds the predictions for each sub-sample
    preds = []
    # count holds successful fit run count
    count = 0
    for trial in range(1000):
        # get sub-sample
        exam_sample = exam.sample(frac=1, replace=True)
        # fit and predict
        try:
            x_sample = np.array(exam_sample['Hours'])
            X_sample = sm.add_constant(x_sample)
            y_sample = exam_sample['Pass']
            # disp=0 will suppress the spurious convergence messages
            res_sample = sm.Logit(y_sample, X_sample).fit(
                disp=0
            )
            preds.append(res_sample.predict(X))
            count = count + 1
        except Exception:
            pass
        # end try
    # end for

    print(f'{count} samples were processed')

    # get array of predictions into numpy format
    predictions_array = np.array(preds)

    _ = plt.plot(
        exam['Hours'], exam['Pass'], 'r+', label='Actuals'
    )
    _ = plt.plot(
        exam['Hours'],
        res.predict(),
        'o',
        label='Predictions',
    )
    _ = plt.plot(
        exam['Hours'],
        np.percentile(predictions_array, 97.5, axis=0),
        'g-',
        label='Upper 95% CI',
    )
    _ = plt.plot(
        exam['Hours'],
        np.percentile(predictions_array, 2.5, axis=0),
        'r-',
        label='Lower 95% CI',
    )
    _ = plt.legend(loc='lower right')
# end with
976 samples were processed
An Introduction to Statistical Learning¶
The book "An Introduction to Statistical Learning" apparently assumes the log-odds are normal. Again, using code from the article referenced above (https://stackoverflow.com/questions/47414842/confidence-interval-of-probability-prediction-from-logistic-regression-statsmode), we get the mean, and single observation CIs
# linear predictor (log-odds) for each observation
fit_mean = res.model.exog.dot(res.params)
# standard error of the mean log-odds, via the parameter covariance matrix
fit_mean_se = (
    (
        res.model.exog
        * res.model.exog.dot(res.cov_params())
    ).sum(axis=1)
) ** 0.5
# standard error for a single observation, combining the residual spread
# (response minus log-odds, as in the referenced answer) with the mean SE
fit_obs_se = (
    (
        (res.model.endog - fit_mean).std(
            ddof=res.params.shape[0]
        )
    )
    ** 2
    + fit_mean_se ** 2
) ** 0.5
Visualizing the results
_ = plt.plot(x, 1 / (1 + np.exp(-fit_mean)), 'o')
_ = plt.plot(
x,
1 / (1 + np.exp(-fit_mean - 1.96 * fit_mean_se)),
'g-',
label='Upper mean 95% CI',
)
_ = plt.plot(
x,
1 / (1 + np.exp(-fit_mean + 1.96 * fit_mean_se)),
'r-',
label='Lower mean 95% CI',
)
_ = plt.plot(
x,
1 / (1 + np.exp(-fit_mean + 1.96 * fit_obs_se)),
'r:',
label='Lower observation 95% CI',
)
_ = plt.plot(
x,
1 / (1 + np.exp(-fit_mean - 1.96 * fit_obs_se)),
'g:',
label='Upper observation 95% CI',
)
_ = plt.legend(loc='lower right')
# Load the flu shot survey data: Shot (0/1 flu shot indicator), Age, and Aware (health awareness score)
flu = pd.read_fwf('../data/flu.txt', widths=[7, 7, 7])
flu.columns = ['Shot', 'Age', 'Aware']
flu.sort_values(by='Shot', inplace=True)
flu.head()
| Shot | Age | Aware |
---|---|---|---|
0 | 0.0 | 38.0 | 40.0 |
22 | 0.0 | 53.0 | 32.0 |
23 | 0.0 | 42.0 | 42.0 |
48 | 0.0 | 45.0 | 35.0 |
25 | 0.0 | 42.0 | 32.0 |
sn.scatterplot('Age', 'Aware', data=flu, hue='Shot')
<matplotlib.axes._subplots.AxesSubplot at 0x23f85aef2b0>
_ = plt.plot(flu['Age'], flu['Shot'], 'o')
_ = plt.plot(flu['Aware'], flu['Shot'], 'o')
Logistic Regression¶
res3 = logit('Shot ~ Age + Aware', data=flu).fit()
Optimization terminated successfully. Current function value: 0.324163 Iterations 8
res3.summary()
Dep. Variable: | Shot | No. Observations: | 50 |
---|---|---|---|
Model: | Logit | Df Residuals: | 47 |
Method: | MLE | Df Model: | 2 |
Date: | Thu, 21 May 2020 | Pseudo R-squ.: | 0.5235 |
Time: | 18:49:21 | Log-Likelihood: | -16.208 |
converged: | True | LL-Null: | -34.015 |
LLR p-value: | 1.848e-08 |
coef | std err | z | P>|z| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
Intercept | -21.5846 | 6.418 | -3.363 | 0.001 | -34.164 | -9.005 |
Age | 0.2218 | 0.074 | 2.983 | 0.003 | 0.076 | 0.368 |
Aware | 0.2035 | 0.063 | 3.244 | 0.001 | 0.081 | 0.326 |
Prediction for a 35-year-old with health awareness 50, compared to a 45-year-old with the same health awareness
examples = {'Age': [35, 45], 'Aware': [50, 50]}
res3.predict(examples)
0    0.025406
1    0.193210
dtype: float64
We can calculate the probability of a flu shot explicitly from the model parameters and the logistic function.
age = 35
aware = 50
prob = 1.0 / (
1.0
+ np.exp(
-(
res3.params[0]
+ res3.params[1] * age
+ res3.params[2] * aware
)
)
)
prob
0.025405604440954365
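As with the exam example, exponentiating the coefficients gives odds ratios. The small sketch below is my addition (not in the lecture notes); it shows the multiplicative effect on the odds of having had a flu shot, including the effect of the ten-year age gap between the two people above.
# Sketch: odds-ratio interpretation of the flu model coefficients.
# exp(coef) is the multiplicative change in the odds of a flu shot
# per one-unit increase in the corresponding predictor.
print(np.exp(res3.params))

# ten extra years of age multiply the odds by exp(10 * beta_Age)
odds_ratio_10_years = np.exp(10 * res3.params['Age'])
print(f'Odds ratio for +10 years of age: {odds_ratio_10_years:.2f}')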
Generalized Linear Models¶
It is also possible to perform a logistic regression via the statsmodels Generalized Linear Model (GLM) API. Binomial here refers to the fact that we have two possible outcomes.
res4 = glm(
'Shot ~ Age + Aware',
data=flu,
family=sm.families.Binomial(),
).fit()
res4.summary()
Dep. Variable: | Shot | No. Observations: | 50 |
---|---|---|---|
Model: | GLM | Df Residuals: | 47 |
Model Family: | Binomial | Df Model: | 2 |
Link Function: | logit | Scale: | 1.0000 |
Method: | IRLS | Log-Likelihood: | -16.208 |
Date: | Thu, 21 May 2020 | Deviance: | 32.416 |
Time: | 18:49:21 | Pearson chi2: | 34.6 |
No. Iterations: | 6 | Covariance Type: | nonrobust |
coef | std err | z | P>|z| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
Intercept | -21.5846 | 6.418 | -3.363 | 0.001 | -34.164 | -9.005 |
Age | 0.2218 | 0.074 | 2.983 | 0.003 | 0.076 | 0.368 |
Aware | 0.2035 | 0.063 | 3.244 | 0.001 | 0.081 | 0.326 |
Predictions from the GLM fit are the same as from the previous regression.
examples = {'Age': [35, 45], 'Aware': [50, 50]}
res4.predict(examples)
0    0.025406
1    0.193210
dtype: float64
Probit Regression¶
We can also perform Probit Regression in two ways: via the explicit probit function, or via a GLM call.
res5 = probit('Shot ~ Age + Aware', data=flu).fit()
res5.summary()
Optimization terminated successfully. Current function value: 0.320758 Iterations 7
Dep. Variable: | Shot | No. Observations: | 50 |
---|---|---|---|
Model: | Probit | Df Residuals: | 47 |
Method: | MLE | Df Model: | 2 |
Date: | Thu, 21 May 2020 | Pseudo R-squ.: | 0.5285 |
Time: | 18:49:21 | Log-Likelihood: | -16.038 |
converged: | True | LL-Null: | -34.015 |
LLR p-value: | 1.559e-08 |
coef | std err | z | P>|z| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
Intercept | -12.3504 | 3.284 | -3.761 | 0.000 | -18.787 | -5.914 |
Age | 0.1279 | 0.040 | 3.224 | 0.001 | 0.050 | 0.206 |
Aware | 0.1164 | 0.033 | 3.568 | 0.000 | 0.052 | 0.180 |
To access Probit Regression via GLM, we have to specify the probit link function.
To get the list of link functions supported by the Binomial family, see below (we need the second item of the list).
sm.families.Binomial.links
[statsmodels.genmod.families.links.logit, statsmodels.genmod.families.links.probit, statsmodels.genmod.families.links.cauchy, statsmodels.genmod.families.links.log, statsmodels.genmod.families.links.cloglog, statsmodels.genmod.families.links.identity]
res6 = glm(
    'Shot ~ Age + Aware',
    data=flu,
    # links[1] in the list printed above is the probit link
    family=sm.families.Binomial(
        sm.families.Binomial.links[1]
    ),
).fit()
res6.summary()
Dep. Variable: | Shot | No. Observations: | 50 |
---|---|---|---|
Model: | GLM | Df Residuals: | 47 |
Model Family: | Binomial | Df Model: | 2 |
Link Function: | probit | Scale: | 1.0000 |
Method: | IRLS | Log-Likelihood: | -16.038 |
Date: | Thu, 21 May 2020 | Deviance: | 32.076 |
Time: | 18:49:21 | Pearson chi2: | 33.4 |
No. Iterations: | 8 | Covariance Type: | nonrobust |
coef | std err | z | P>|z| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
Intercept | -12.3504 | 3.228 | -3.826 | 0.000 | -18.677 | -6.023 |
Age | 0.1279 | 0.039 | 3.289 | 0.001 | 0.052 | 0.204 |
Aware | 0.1164 | 0.032 | 3.596 | 0.000 | 0.053 | 0.180 |
The two APIs give the same results
res5.predict(examples)
0    0.019966
1    0.218923
dtype: float64
res6.predict(examples)
0    0.019966
1    0.218923
dtype: float64
Reproducibility¶
%watermark -h -iv
%watermark
seaborn     0.9.0
matplotlib  3.0.2
numpy       1.15.4
statsmodels 0.9.0
pandas      1.0.0
scipy       1.1.0
host name: DESKTOP-SODFUN6
2020-05-21T18:49:21+10:00

CPython 3.7.1
IPython 7.2.0

compiler   : MSC v.1915 64 bit (AMD64)
system     : Windows
release    : 10
machine    : AMD64
processor  : Intel64 Family 6 Model 94 Stepping 3, GenuineIntel
CPU cores  : 8
interpreter: 64bit
sm.show_versions()
INSTALLED VERSIONS
------------------
Python: 3.7.1.final.0

Statsmodels
===========
Installed: 0.9.0 (D:\Anaconda3\envs\ac5-py37\lib\site-packages\statsmodels)

Required Dependencies
=====================
cython: 0.29.2 (D:\Anaconda3\envs\ac5-py37\lib\site-packages\Cython)
numpy: 1.15.4 (D:\Anaconda3\envs\ac5-py37\lib\site-packages\numpy)
scipy: 1.1.0 (D:\Anaconda3\envs\ac5-py37\lib\site-packages\scipy)
pandas: 1.0.0 (D:\Anaconda3\envs\ac5-py37\lib\site-packages\pandas)
dateutil: 2.7.5 (D:\Anaconda3\envs\ac5-py37\lib\site-packages\dateutil)
patsy: 0.5.1 (D:\Anaconda3\envs\ac5-py37\lib\site-packages\patsy)

Optional Dependencies
=====================
matplotlib: 3.0.2 (D:\Anaconda3\envs\ac5-py37\lib\site-packages\matplotlib)
backend: module://ipykernel.pylab.backend_inline
cvxopt: Not installed
joblib: 0.13.2 (D:\Anaconda3\envs\ac5-py37\lib\site-packages\joblib)

Developer Tools
================
IPython: 7.2.0 (D:\Anaconda3\envs\ac5-py37\lib\site-packages\IPython)
jinja2: 2.10.1 (D:\Anaconda3\envs\ac5-py37\lib\site-packages\jinja2)
sphinx: 1.8.2 (D:\Anaconda3\envs\ac5-py37\lib\site-packages\sphinx)
pygments: 2.3.1 (D:\Anaconda3\envs\ac5-py37\lib\site-packages\pygments)
nose: 1.3.7 (D:\Anaconda3\envs\ac5-py37\lib\site-packages\nose)
pytest: 5.0.1 (D:\Anaconda3\envs\ac5-py37\lib\site-packages)
virtualenv: Not installed