{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Chapter 13\n",
    "---\n",
    "# Linear Regression\n",
    "\n",
    "### 13.0 Introduction\n",
    "Linear regression is one of the simplest supervised learning algorithms in our toolkit. If you have ever taken an introductory statistics course in college, likely the final topic you covered was linear regression. In fact, it is so simple that it is sometimes not considered machine learning at all!\n",
    "\n",
    "Whatever you believe, the fact is that linear regression--and its extensions--continues to be a common and useful method of making predictions when the target vector is a quantitative value (e.g. home price, age)\n",
    "\n",
    "### 13.1 Fitting a Line\n",
    "#### Problem\n",
    "You want to train a model that represents a linear relationship between the feature and target vector.\n",
    "\n",
    "#### Solution\n",
    "Use a linear regression (`LinearRegression` in scikit-learn)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from sklearn.linear_model import LinearRegression\n",
    "from sklearn.datasets import load_boston\n",
    "\n",
    "boston = load_boston()\n",
    "features = boston.data[:, 0:2]\n",
    "target = boston.target\n",
    "\n",
    "regression = LinearRegression()\n",
    "\n",
    "model = regression.fit(features, target)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 13.4 Reducing Variance with Regularization\n",
    "#### Problem\n",
    "You want to reduce the variance of your linear regression model\n",
    "\n",
    "#### Solution\n",
    "Use a learning algorithm that includes a *shrinkage penalty* (also called **regularization**) like ridge regression and lasso regression:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from sklearn.linear_model import Ridge\n",
    "from sklearn.datasets import load_boston\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "\n",
    "boston = load_boston()\n",
    "features = boston.data\n",
    "target = boston.target\n",
    "\n",
    "scaler = StandardScaler()\n",
    "features_standardized = scaler.fit_transform(features)\n",
    "\n",
    "regression = Ridge(alpha=0.5)\n",
    "model = regression.fit(features_standardized, target)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Discussion\n",
    "In standard linear regression the model trains to minimize the sum of squared error between the true($y_i$) and prediction ($\\hat y_i$) target values, or residual sum of squares (RSS):\n",
    "$$\n",
    "RSS = \\sum_{i=1}^n{(y_i - \\hat y_i)^2}\n",
    "$$\n",
    "\n",
    "Regularized regression learners are similar, except they attempt to minimize RSS and some penalty for the total size of the coefficient values, called a shrinkage penalty because it attempts to \"shrink\" the model. There are two common types of regularized learners for linear regression: ridge regression and the lasso. The only formal difference is the type of shrinkage penalty used. In ridge regression, the shrinkage penalty is a tuning hyperparameter multiplied by the squared sum of all coefficients:\n",
    "$$\n",
    "RSS+\\alpha \\sum_{j=1}^p{\\hat \\beta_j^2}\n",
    "$$\n",
    "\n",
    "where $\\hat \\beta_j$ is the coefficient of the jth of p features and $\\alpha$ is a hyperparameter (discussed next). The lasso is similar, except the shrinkage penalty is a tuning hyperparmeter multiplied by the squared sum of all coefficients:\n",
    "$$\n",
    "\\frac{1}{2n} RSS + \\alpha \\sum_{j=1}^p{|\\beta_j|}\n",
    "$$\n",
    "\n",
    "where n is the number of observations. So which one should we use? A a very general rule of thumb, ridge regression often produces slightly better predictions than lasso, but lasso (for reasons we will discuss in Recipe 13.5) produces more interpretable models. If we want a balance between, ridge and lasso's penalty functions we can use elastic net, which is simply a regression model with both penalties included. Regardless of which one we use, bot hridge and lasso regresions can penalize large or complex models by including coefficient values in the loss funciton we are trying to minimize\n",
    "\n",
    "The hyper parameter $\\alpha$ lets us control how much we penalize the coefficients, with higher values of $\\alpha$ creating simpler models. The ideal value of $\\alpha$ should be tuned like any other hyperparameter. In scikit-learn, $\\alpha$ is set using the alpha parameter.\n",
    "\n",
    "scikit-learn includes a RidgeCV method that allows us to select the ideal value for $\\alpha:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([-0.91215884,  1.0658758 ,  0.11942614,  0.68558782, -2.03231631,\n",
       "        2.67922108,  0.01477326, -3.0777265 ,  2.58814315, -2.00973173,\n",
       "       -2.05390717,  0.85614763, -3.73565106])"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn.linear_model import RidgeCV\n",
    "\n",
    "regr_cv = RidgeCV(alphas=[0.1, 1.0, 10.0])\n",
    "\n",
    "model_cv = regr_cv.fit(features_standardized, target)\n",
    "\n",
    "model_cv.coef_"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1.0"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# view alpha\n",
    "model_cv.alpha_"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One final note: because in linear regression the value of the coefficients is partially determined by the scale of the feature, and in regularized models all coefficients are summed together, we must make sure to standardize the feature prior to training\n",
    "\n",
    "### 13.5 Reducing Features with Lasso Regression\n",
    "#### Problem\n",
    "You want to simplify your linear regression model by reducing the number of features.\n",
    "\n",
    "#### Solution\n",
    "Use a lasso regression"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from sklearn.linear_model import Lasso\n",
    "from sklearn.datasets import load_boston\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "\n",
    "boston = load_boston()\n",
    "features = boston.data\n",
    "target = boston.target\n",
    "\n",
    "scaler = StandardScaler()\n",
    "features_standardized = scaler.fit_transform(features)\n",
    "\n",
    "regression = Lasso(alpha=0.5)\n",
    "model = regression.fit(features_standardized, target)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Discussion\n",
    "One interesting characteristic of lasso regression's penalty is that it can shrink the coefficients of a model to zero, effectively reducing the number of features in the model. For example, in our solution we set `alpha` to 0.5 and we can see that many of the coefficients are 0, meaning their corresponding features are not used in the model:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([-0.10697735,  0.        , -0.        ,  0.39739898, -0.        ,\n",
       "        2.97332316, -0.        , -0.16937793, -0.        , -0.        ,\n",
       "       -1.59957374,  0.54571511, -3.66888402])"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "model.coef_"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "However if we increase $\\alpha$ to a much higher value, we see that lierally none of the features are being used:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([-0.,  0., -0.,  0., -0.,  0., -0.,  0., -0., -0., -0.,  0., -0.])"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "regression_a10 = Lasso(alpha=10)\n",
    "model_a10 = regression_a10.fit(features_standardized, target)\n",
    "model_a10.coef_"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The practical benefit of this effect is that it means that we could include 100 features in our feature matrix and then, through adjusting lasso's $\\alpha$ hyperparameter, produce a model that uses only 10 (for instance) of the most important features. This lets us reduce variance whiel improving interpretability of our model (since fewer features is easier to explain)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python [conda env:machine_learning_cookbook]",
   "language": "python",
   "name": "conda-env-machine_learning_cookbook-py"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}