{ "metadata": { "name": "", "signature": "sha256:f8aa6ef520972d66663d29d47451078ba588fd67061e9ab4fc62096e3e27be2f" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Overfitting, revisited\n", "\n", "What is overfitting? Here are a few ways of explaining it:\n", "\n", "- Building a model that matches the training set too closely.\n", "- Building a model that does well on the training data, but doesn't generalize to out-of-sample data.\n", "- Learning from the noise in the data, rather than just the signal." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Overfitting\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Underfitting vs Overfitting\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**What are some ways to overfit the data?**\n", "\n", "- Train and test on the same data\n", "- Create a model that is overly complex (one that doesn't generalize well)\n", " - Example: KNN in which K is too low\n", " - Example: Decision tree that is grown too deep\n", "\n", "An overly complex model has **low bias** but **high variance**." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Linear Regression, revisited\n", "\n", "**Question:** Are linear regression models high bias/low variance, or low bias/high variance?\n", "\n", "**Answer:** High bias/low variance (generally speaking)\n", "\n", "Great! So as long as we don't train and test on the same data, we don't have to worry about overfitting, right? Not so fast...." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Overfitting with Linear Regression (part 1)\n", "\n", "Linear models can overfit if you include irrelevant features.\n", "\n", "**Question:** Why would that be the case?\n", "\n", "**Answer:** Because it will learn a coefficient for any feature you feed into the model, regardless of whether that feature has the signal or the noise.\n", "\n", "This is especially a problem when **p (number of features) is close to n (number of observations)**, because that model will naturally have high variance." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Overfitting with Linear Regression (part 2)\n", "\n", "Linear models can also overfit when the included features are highly correlated. From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/linear_model.html#ordinary-least-squares):\n", "\n", "> \"...coefficient estimates for Ordinary Least Squares rely on the independence of the model terms. When terms are correlated and the columns of the design matrix X have an approximate linear dependence, the design matrix becomes close to singular and as a result, the least-squares estimate becomes highly sensitive to random errors in the observed response, producing a large variance.\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Overfitting with Linear Regression (part 3)\n", "\n", "Linear models can also overfit if the coefficients are too large.\n", "\n", "**Question:** Why would that be the case?\n", "\n", "**Answer:** Because the larger the absolute value of the coefficient, the more power it has to change the predicted response. Thus it tends toward high variance, which can result in overfitting." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Regularization\n", "\n", "Regularization is a method for \"constraining\" or \"regularizing\" the size of the coefficients, thus \"shrinking\" them towards zero. 
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Regularization\n", "\n", "Regularization is a method for \"constraining\" or \"regularizing\" the size of the coefficients, thus \"shrinking\" them towards zero. It tends to reduce variance more than it increases bias, and thus reduces overfitting.\n", "\n", "Common regularization techniques for linear models:\n", "\n", "- **Ridge regression** (also known as \"L2 regularization\"): shrinks coefficients toward zero (but they never reach exactly zero)\n", "- **Lasso regression** (also known as \"L1 regularization\"): shrinks coefficients toward zero, and can shrink some of them all the way to zero\n", "- **ElasticNet regression**: a balance between the Ridge and Lasso penalties\n", "\n", "Lasso regularization is useful if we believe many features are irrelevant, since a feature with a zero coefficient is essentially removed from the model. Thus, it is a useful technique for feature selection.\n", "\n", "How does regularization work?\n", "\n", "- A tuning parameter alpha (sometimes called lambda) imposes a penalty on the size of the coefficients.\n", "- Instead of minimizing just the \"loss function\" (mean squared error), the model minimizes the \"loss plus penalty\": for Ridge the penalty is alpha times the sum of the squared coefficients, and for Lasso it is alpha times the sum of the absolute values of the coefficients.\n", "- An alpha of zero imposes no penalty on the coefficient size, and is equivalent to a normal linear model.\n", "- Increasing the alpha penalizes the coefficients more heavily and shrinks them further toward zero." ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Bias-variance trade-off\n", "\n", "Our goal is to locate the optimal model complexity; regularization is useful when we believe our model is too complex.\n", "\n", "" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Standardizing features\n", "\n", "It's usually recommended to standardize your features when using regularization.\n", "\n", "**Question:** Why would that be the case?\n", "\n", "**Answer:** If you don't standardize, features would be penalized (or spared) simply because of their scale rather than their importance. Also, standardizing avoids penalizing the intercept (which wouldn't make intuitive sense)." ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Ridge vs Lasso path diagrams\n", "\n", "" ] },
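{ "cell_type": "markdown", "metadata": {}, "source": [ "Below is a minimal sketch of Ridge and Lasso in scikit-learn, reusing the same kind of synthetic data as above and standardizing the features inside a pipeline. The alpha values are arbitrary, purely illustrative choices; in practice you would tune alpha with cross-validation (for example with RidgeCV or LassoCV). Notice how Lasso zeroes out more coefficients as alpha grows, which is what the Lasso path diagram depicts." ] },
{ "cell_type": "code", "collapsed": false, "input": [ "import numpy as np\n", "from sklearn.linear_model import Ridge, Lasso\n", "from sklearn.preprocessing import StandardScaler\n", "from sklearn.pipeline import make_pipeline\n", "from sklearn.model_selection import train_test_split\n", "\n", "# same synthetic setup as before: only the first two of 50 features carry signal\n", "rng = np.random.RandomState(1)\n", "X = rng.randn(60, 50)\n", "y = 3 * X[:, 0] - 2 * X[:, 1] + rng.randn(60)\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)\n", "\n", "# illustrative alpha values; tune alpha with cross-validation in practice\n", "for alpha in [0.1, 1, 10]:\n", "    ridge = make_pipeline(StandardScaler(), Ridge(alpha=alpha)).fit(X_train, y_train)\n", "    lasso = make_pipeline(StandardScaler(), Lasso(alpha=alpha)).fit(X_train, y_train)\n", "    zeroed = int(np.sum(lasso.named_steps['lasso'].coef_ == 0))\n", "    print('alpha =', alpha)\n", "    print('  Ridge testing R^2:', round(ridge.score(X_test, y_test), 3))\n", "    print('  Lasso testing R^2:', round(lasso.score(X_test, y_test), 3), '| zeroed coefficients:', zeroed)" ], "language": "python", "metadata": {}, "outputs": [] }
], "metadata": {} } ] }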