{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "#A Primer on Empirical Risk Minimization" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##Notations and Definitions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
Let's first set up some notation and ideas: we observe training data $D = \{(x_i, y_i)\}_{i=1}^{n}$, where each $x_i \in \mathcal{X}$ is a vector of features and each $y_i \in \mathcal{Y}$ is a label.

Second, let's define two more things:

- A hypothesis class $\mathbb{F}$ of candidate functions $f: \mathcal{X} \rightarrow \mathcal{Y}$.
- A loss function $L(f(x), y)$ that measures the cost of predicting $f(x)$ when the true label is $y$.

The main goal of Supervised Learning can be stated using the Empirical Risk Minimization framework of Statistical Learning.

We are looking for a function $f \in \mathbb{F}$ that minimizes the expected loss:

$$R(f) = \mathbb{E}_{(X,Y)}\left[L(f(X), Y)\right].$$

Since the true distribution of $(X, Y)$ is unknown, we instead minimize the empirical risk computed on the training data:

$$\hat{f} = \underset{f \in \mathbb{F}}{\arg\min}\; \hat{R}(f) = \underset{f \in \mathbb{F}}{\arg\min}\; \frac{1}{n}\sum_{i=1}^{n} L(f(x_i), y_i).$$

Logistic Regression: a member of the class of generalized linear models (GLMs) that uses the logit as its link function.

"The goal of Logistic Regression is to model the posterior probability of membership in class $c_i$ as a function of $X$. I.e.,\n",
"
\n",
"
\n",
"
\n",
"
How do we fit Logistic Regression into the ERM framework?

We find the parameters $\alpha$ and $\beta$ using the method of Maximum Likelihood Estimation.

If we consider each observation to be an independent Bernoulli draw with $p_i = P(y_i \mid x_i)$, then the likelihood of each draw can be defined as $p_i^{y_i}(1-p_i)^{1-y_i}$, with $p_i$ given by the inverse logit function. In MLE, we wish to maximize the likelihood of observing the data as a function of the parameters of the model (i.e., $\alpha$ and $\beta$). The total likelihood function looks like:

$$L(\alpha, \beta) = \prod_{i=1}^{n} p_i^{y_i}(1-p_i)^{1-y_i},$$

and in practice we maximize its logarithm,

$$\ell(\alpha, \beta) = \sum_{i=1}^{n} \left[\, y_i \ln p_i + (1-y_i)\ln(1-p_i) \,\right].$$

Maximizing $\ell(\alpha, \beta)$ is the same as minimizing the average negative log-likelihood (the log loss), so MLE for Logistic Regression is exactly empirical risk minimization with the log loss.

Logistic Regression has long been used as a tool in statistics and econometrics, so there is a lot more diagnostic information one can get out of a logistic regression model than one might get with standard machine learning tools.

We showed how to use scikit-learn to fit a model, but we also used statsmodels. The reason is that statsmodels returns summary statistics for each fitted coefficient. In machine learning, we often focus only on the generalizability of the prediction, but in many analytical applications we also want to know how statistically significant the estimates within our model are.

| Dep. Variable: | y_buy | No. Observations: | 2087 |
|---|---|---|---|
| Model: | Logit | Df Residuals: | 2074 |
| Method: | MLE | Df Model: | 12 |
| Date: | Tue, 10 Nov 2015 | Pseudo R-squ.: | 0.1523 |
| Time: | 20:05:51 | Log-Likelihood: | -529.64 |
| converged: | True | LL-Null: | -624.83 |
| | | LLR p-value: | 3.151e-34 |

| | coef | std err | z | P>\|z\| | [95.0% Conf. Int.] |
|---|---|---|---|---|---|
| isbuyer | 0.9714 | 0.395 | 2.462 | 0.014 | 0.198  1.745 |
| buy_freq | -0.1588 | 0.203 | -0.783 | 0.433 | -0.556  0.239 |
| visit_freq | 0.0311 | 0.021 | 1.453 | 0.146 | -0.011  0.073 |
| buy_interval | -0.0030 | 0.014 | -0.208 | 0.835 | -0.031  0.025 |
| sv_interval | 0.0085 | 0.007 | 1.210 | 0.226 | -0.005  0.022 |
| expected_time_buy | 0.0101 | 0.011 | 0.935 | 0.350 | -0.011  0.031 |
| expected_time_visit | -0.0328 | 0.007 | -4.434 | 0.000 | -0.047  -0.018 |
| last_buy | 0.0115 | 0.005 | 2.284 | 0.022 | 0.002  0.021 |
| last_visit | -0.0597 | 0.006 | -10.295 | 0.000 | -0.071  -0.048 |
| multiple_buy | 1.4701 | 1.100 | 1.337 | 0.181 | -0.686  3.626 |
| multiple_visit | -0.1800 | 0.220 | -0.819 | 0.413 | -0.611  0.251 |
| uniq_urls | -0.0105 | 0.002 | -6.466 | 0.000 | -0.014  -0.007 |
| num_checkins | -2.867e-05 | 8.81e-05 | -0.325 | 0.745 | -0.000  0.000 |

"A Practical Aside
\n",
"\n",
"What exactly does the estimate of $\\beta$ really mean? How can we interpret it?
\n",
"\n",
Recall that $\ln \frac{p}{1-p} = \alpha + \beta x$. This means that a unit change in the value of $x$ changes the log-odds by the value of $\beta$. This is a mathematical statement that IMHO does not offer much intuitive value. A more interpretable quantity is the odds ratio $e^{\beta}$: a unit increase in $x$ multiplies the odds $\frac{p}{1-p}$ by $e^{\beta}$.

In this example we test the sensitivity of out-of-sample performance to training set sample size. Our goal is to plot the test-set $AUC$ as a function of $N$, the number of samples in the training set. Because we expect a lot of variance in the lower range of $N$, we use the bootstrap to compute standard errors of the $AUC$ measurements.

"To Bootstrap:\n",
"
We can see in the above plot that Logistic Regression does fairly well with small sample sizes. The lower bound of the $95\%$ confidence interval at $\max(N)$ overlaps with the confidence interval at most levels of $N$, suggesting that, in expectation, the smaller samples could perform as well as the larger samples.

"While this is true, always try to use as much data as you can to reduce the variance!\n",
"\n",
"