{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Chapter 6 - Linear Model Selection and Regularization" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- [Lab 2: Ridge Regression](#6.6.1-Ridge-Regression)\n", "- [Lab 2: The Lasso](#6.6.2-The-Lasso)\n", "- [Lab 3: Principal Components Regression](#6.7.1-Principal-Components-Regression)\n", "- [Lab 3: Partial Least Squares](#6.7.2-Partial-Least-Squares)" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# %load ../standard_import.txt\n", "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "\n", "import glmnet as gln\n", "\n", "from sklearn.preprocessing import scale \n", "from sklearn import model_selection\n", "from sklearn.linear_model import LinearRegression, Ridge, RidgeCV, Lasso, LassoCV\n", "from sklearn.decomposition import PCA\n", "from sklearn.cross_decomposition import PLSRegression\n", "from sklearn.model_selection import KFold, cross_val_score\n", "from sklearn.metrics import mean_squared_error\n", "\n", "%matplotlib inline\n", "plt.style.use('seaborn-white')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Lab 2" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Index: 263 entries, -Alan Ashby to -Willie Wilson\n", "Data columns (total 20 columns):\n", "AtBat 263 non-null int64\n", "Hits 263 non-null int64\n", "HmRun 263 non-null int64\n", "Runs 263 non-null int64\n", "RBI 263 non-null int64\n", "Walks 263 non-null int64\n", "Years 263 non-null int64\n", "CAtBat 263 non-null int64\n", "CHits 263 non-null int64\n", "CHmRun 263 non-null int64\n", "CRuns 263 non-null int64\n", "CRBI 263 non-null int64\n", "CWalks 263 non-null int64\n", "League 263 non-null object\n", "Division 263 non-null object\n", "PutOuts 263 non-null int64\n", "Assists 263 non-null int64\n", 
"Errors 263 non-null int64\n", "Salary 263 non-null float64\n", "NewLeague 263 non-null object\n", "dtypes: float64(1), int64(16), object(3)\n", "memory usage: 43.1+ KB\n" ] } ], "source": [ "# In R, I exported the dataset from package 'ISLR' to a csv file.\n", "df = pd.read_csv('Data/Hitters.csv', index_col=0).dropna()\n", "df.index.name = 'Player'\n", "df.info()" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits \\\n", "Player \n", "-Alan Ashby 315 81 7 24 38 39 14 3449 835 \n", "-Alvin Davis 479 130 18 66 72 76 3 1624 457 \n", "-Andre Dawson 496 141 20 65 78 37 11 5628 1575 \n", "-Andres Galarraga 321 87 10 39 42 30 2 396 101 \n", "-Alfredo Griffin 594 169 4 74 51 35 11 4408 1133 \n", "\n", " CHmRun CRuns CRBI CWalks League Division PutOuts \\\n", "Player \n", "-Alan Ashby 69 321 414 375 N W 632 \n", "-Alvin Davis 63 224 266 263 A W 880 \n", "-Andre Dawson 225 828 838 354 N E 200 \n", "-Andres Galarraga 12 48 46 33 N E 805 \n", "-Alfredo Griffin 19 501 336 194 A W 282 \n", "\n", " Assists Errors Salary NewLeague \n", "Player \n", "-Alan Ashby 43 10 475.0 N \n", "-Alvin Davis 82 14 480.0 A \n", "-Andre Dawson 11 3 500.0 N \n", "-Andres Galarraga 40 4 91.5 N \n", "-Alfredo Griffin 421 25 750.0 A " ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head()" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Index: 263 entries, -Alan Ashby to -Willie Wilson\n", "Data columns (total 6 columns):\n", "League_A 263 non-null uint8\n", "League_N 263 non-null uint8\n", "Division_E 263 non-null uint8\n", "Division_W 263 non-null uint8\n", "NewLeague_A 263 non-null uint8\n", "NewLeague_N 263 non-null uint8\n", "dtypes: uint8(6)\n", "memory usage: 3.6+ KB\n", " League_A League_N Division_E Division_W NewLeague_A \\\n", "Player \n", "-Alan Ashby 0 1 0 1 0 \n", "-Alvin Davis 1 0 0 1 1 \n", "-Andre Dawson 0 1 1 0 0 \n", "-Andres Galarraga 0 1 1 0 0 \n", "-Alfredo Griffin 1 0 0 1 1 \n", "\n", " NewLeague_N \n", "Player \n", "-Alan Ashby 1 \n", "-Alvin Davis 0 \n", "-Andre Dawson 1 \n", "-Andres Galarraga 1 \n", "-Alfredo Griffin 0 \n" ] } ], "source": [ "dummies = pd.get_dummies(df[['League', 'Division', 'NewLeague']])\n", "dummies.info()\n", "print(dummies.head())" ] }, { "cell_type": 
"code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Index: 263 entries, -Alan Ashby to -Willie Wilson\n", "Data columns (total 19 columns):\n", "AtBat 263 non-null float64\n", "Hits 263 non-null float64\n", "HmRun 263 non-null float64\n", "Runs 263 non-null float64\n", "RBI 263 non-null float64\n", "Walks 263 non-null float64\n", "Years 263 non-null float64\n", "CAtBat 263 non-null float64\n", "CHits 263 non-null float64\n", "CHmRun 263 non-null float64\n", "CRuns 263 non-null float64\n", "CRBI 263 non-null float64\n", "CWalks 263 non-null float64\n", "PutOuts 263 non-null float64\n", "Assists 263 non-null float64\n", "Errors 263 non-null float64\n", "League_N 263 non-null uint8\n", "Division_W 263 non-null uint8\n", "NewLeague_N 263 non-null uint8\n", "dtypes: float64(16), uint8(3)\n", "memory usage: 35.7+ KB\n" ] } ], "source": [ "y = df.Salary\n", "\n", "# Drop the response column (Salary) and the categorical columns that the dummy variables replace\n", "X_ = df.drop(['Salary', 'League', 'Division', 'NewLeague'], axis=1).astype('float64')\n", "# Define the feature set X: numeric columns plus one dummy per categorical variable.\n", "X = pd.concat([X_, dummies[['League_N', 'Division_W', 'NewLeague_N']]], axis=1)\n", "X.info()" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " AtBat Hits HmRun Runs RBI Walks Years CAtBat \\\n", "Player \n", "-Alan Ashby 315.0 81.0 7.0 24.0 38.0 39.0 14.0 3449.0 \n", "-Alvin Davis 479.0 130.0 18.0 66.0 72.0 76.0 3.0 1624.0 \n", "-Andre Dawson 496.0 141.0 20.0 65.0 78.0 37.0 11.0 5628.0 \n", "-Andres Galarraga 321.0 87.0 10.0 39.0 42.0 30.0 2.0 396.0 \n", "-Alfredo Griffin 594.0 169.0 4.0 74.0 51.0 35.0 11.0 4408.0 \n", "\n", " CHits CHmRun CRuns CRBI CWalks PutOuts Assists \\\n", "Player \n", "-Alan Ashby 835.0 69.0 321.0 414.0 375.0 632.0 43.0 \n", "-Alvin Davis 457.0 63.0 224.0 266.0 263.0 880.0 82.0 \n", "-Andre Dawson 1575.0 225.0 828.0 838.0 354.0 200.0 11.0 \n", "-Andres Galarraga 101.0 12.0 48.0 46.0 33.0 805.0 40.0 \n", "-Alfredo Griffin 1133.0 19.0 501.0 336.0 194.0 282.0 421.0 \n", "\n", " Errors League_N Division_W NewLeague_N \n", "Player \n", "-Alan Ashby 10.0 1 1 1 \n", "-Alvin Davis 14.0 0 1 0 \n", "-Andre Dawson 3.0 1 0 1 \n", "-Andres Galarraga 4.0 1 0 1 \n", "-Alfredo Griffin 25.0 0 1 0 " ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X.head(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### I executed the R code and downloaded the exact same training/test sets used in the book." 
] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "X_train = pd.read_csv('Data/Hitters_X_train.csv', index_col=0)\n", "y_train = pd.read_csv('Data/Hitters_y_train.csv', index_col=0)\n", "X_test = pd.read_csv('Data/Hitters_X_test.csv', index_col=0)\n", "y_test = pd.read_csv('Data/Hitters_y_test.csv', index_col=0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 6.6.1 Ridge Regression" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Scikit-learn" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The __glmnet__ algorithms in R minimize the objective function by cyclical coordinate descent, while scikit-learn's Ridge solves the L2-regularized linear least-squares problem with a choice of direct and iterative solvers. The implementations differ, but the underlying principles are the same.\n", "\n", "The __glmnet() function in R__ optimizes:\n", "### $$\\frac{1}{N}|| X\\beta-y||^2_2+\\lambda\\bigg(\\frac{1}{2}(1-\\alpha)||\\beta||^2_2 \\ +\\ \\alpha||\\beta||_1\\bigg)$$\n", "(See the R documentation and https://cran.r-project.org/web/packages/glmnet/vignettes/glmnet_beta.pdf)\n", "
\n", "The function supports L1 and L2 regularization. For just Ridge regression we need to use $\\alpha = 0$. This reduces the above cost function to\n", "### $$\\frac{1}{N}|| X\\beta-y||^2_2+\\frac{1}{2}\\lambda ||\\beta||^2_2$$\n", "The __sklearn Ridge()__ function optimizes:\n", "### $$||X\\beta - y||^2_2 + \\alpha ||\\beta||^2_2$$\n", "which is equivalent to optimizing\n", "### $$\\frac{1}{N}||X\\beta - y||^2_2 + \\frac{\\alpha}{N} ||\\beta||^2_2$$" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "