{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img align=\"center\" src=\"http://sydney.edu.au/images/content/about/logo-mono.jpg\">\n",
    "<h1 align=\"center\" style=\"margin-top:10px\">Machine Learning Using Python (MEAFA Workshop)</h1>\n",
    "<h2 align=\"center\" style=\"margin-top:10px\">Lesson 8: Regression Application</h2>\n",
    "<br>\n",
    "\n",
    "In this lesson we revisit house pricing dataset of [De Cock (2011)](http://www.tandfonline.com/doi/abs/10.1080/10691898.2011.11889627) and the corresponding [Kaggle competition](https://www.kaggle.com/c/house-prices-advanced-regression-techniques). Our goal is to develop a machine learning system that will perform well in the competition. Our final solution is based on model stacking using a linear regression, regularised linear models, and gradient boosting as components. \n",
    "\n",
    "<a href=\"#House-Pricing Data\">House Pricing Data</a> <br>\n",
    "<a href=\"#Linear-Regression\">Linear Regression</a> <br>\n",
    "<a href=\"#Regularised-Linear-Models\">Regularised Linear Models</a> <br>\n",
    "<a href=\"#Regression-Tree\">Regression Tree</a> <br>\n",
    "<a href=\"#Random-Forest\">Random Forest</a> <br>\n",
    "<a href=\"#Gradient-Boosting\">Gradient Boosting</a> <br>\n",
    "<a href=\"#Model-Stacking\">Model Stacking</a> <br>\n",
    "<a href=\"#Model-Evaluation\">Model Evaluation</a> <br>\n",
    "<a href=\"#Making-a-Submission-on-Kaggle\">Making a Submission on Kaggle</a> <br>\n",
    "\n",
    "This notebook relies on the following libraries and settings."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Packages\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import warnings\n",
    "warnings.filterwarnings('ignore') "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.model_selection import GridSearchCV, RandomizedSearchCV\n",
    "from sklearn.metrics import mean_squared_error, r2_score,  mean_absolute_error\n",
    "\n",
    "from sklearn.pipeline import Pipeline\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "\n",
    "from sklearn.linear_model import LinearRegression, LassoCV, RidgeCV, ElasticNetCV\n",
    "from sklearn.tree import DecisionTreeRegressor\n",
    "from sklearn.ensemble import RandomForestRegressor\n",
    "import xgboost as xgb\n",
    "import lightgbm as lgb"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##House Pricing Data \n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>1stFlrSF</th>\n",
       "      <th>2ndFlrSF</th>\n",
       "      <th>3SsnPorch</th>\n",
       "      <th>Age</th>\n",
       "      <th>BsmtFinSF1</th>\n",
       "      <th>BsmtFinSF2</th>\n",
       "      <th>BsmtUnfSF</th>\n",
       "      <th>EnclosedPorch</th>\n",
       "      <th>GarageArea</th>\n",
       "      <th>LotArea</th>\n",
       "      <th>...</th>\n",
       "      <th>RoofMatl_Other</th>\n",
       "      <th>RoofStyle_Hip</th>\n",
       "      <th>RoofStyle_Other</th>\n",
       "      <th>ScreenPorchZero</th>\n",
       "      <th>WoodDeckSFZero</th>\n",
       "      <th>YrSold_2007</th>\n",
       "      <th>YrSold_2008</th>\n",
       "      <th>YrSold_2009</th>\n",
       "      <th>YrSold_2010</th>\n",
       "      <th>SalePrice</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1656</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>50</td>\n",
       "      <td>639.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>441.0</td>\n",
       "      <td>0</td>\n",
       "      <td>528.0</td>\n",
       "      <td>31770</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>215000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>896</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>49</td>\n",
       "      <td>468.0</td>\n",
       "      <td>144.0</td>\n",
       "      <td>270.0</td>\n",
       "      <td>0</td>\n",
       "      <td>730.0</td>\n",
       "      <td>11622</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>105000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1329</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>52</td>\n",
       "      <td>923.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>406.0</td>\n",
       "      <td>0</td>\n",
       "      <td>312.0</td>\n",
       "      <td>14267</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>172000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>2110</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>42</td>\n",
       "      <td>1065.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1045.0</td>\n",
       "      <td>0</td>\n",
       "      <td>522.0</td>\n",
       "      <td>11160</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>244000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>928</td>\n",
       "      <td>701</td>\n",
       "      <td>0</td>\n",
       "      <td>13</td>\n",
       "      <td>791.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>137.0</td>\n",
       "      <td>0</td>\n",
       "      <td>482.0</td>\n",
       "      <td>13830</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>189900</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 196 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "   1stFlrSF  2ndFlrSF  3SsnPorch  Age  BsmtFinSF1  BsmtFinSF2  BsmtUnfSF  \\\n",
       "0      1656         0          0   50       639.0         0.0      441.0   \n",
       "1       896         0          0   49       468.0       144.0      270.0   \n",
       "2      1329         0          0   52       923.0         0.0      406.0   \n",
       "3      2110         0          0   42      1065.0         0.0     1045.0   \n",
       "4       928       701          0   13       791.0         0.0      137.0   \n",
       "\n",
       "   EnclosedPorch  GarageArea  LotArea    ...      RoofMatl_Other  \\\n",
       "0              0       528.0    31770    ...                   0   \n",
       "1              0       730.0    11622    ...                   0   \n",
       "2              0       312.0    14267    ...                   0   \n",
       "3              0       522.0    11160    ...                   0   \n",
       "4              0       482.0    13830    ...                   0   \n",
       "\n",
       "   RoofStyle_Hip  RoofStyle_Other  ScreenPorchZero  WoodDeckSFZero  \\\n",
       "0              1                0                1               0   \n",
       "1              0                0                0               0   \n",
       "2              1                0                1               0   \n",
       "3              1                0                1               1   \n",
       "4              0                0                1               0   \n",
       "\n",
       "   YrSold_2007  YrSold_2008  YrSold_2009  YrSold_2010  SalePrice  \n",
       "0            0            0            0            1     215000  \n",
       "1            0            0            0            1     105000  \n",
       "2            0            0            0            1     172000  \n",
       "3            0            0            0            1     244000  \n",
       "4            0            0            0            1     189900  \n",
       "\n",
       "[5 rows x 196 columns]"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data=pd.read_csv('Datasets/AmesHousing-Processed.csv')\n",
    "data.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We the split the data into training and test sets. We use a small training dataset to better illustrate the advantages of regularisation. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "response='SalePrice'\n",
    "predictors=list(data.columns.values[:-1])\n",
    "\n",
    "# Randomly split indexes\n",
    "index_train, index_test  = train_test_split(np.array(data.index), train_size=0.7, random_state=5)\n",
    "\n",
    "# Write training and test sets \n",
    "train = data.loc[index_train,:].copy()\n",
    "test =  data.loc[index_test,:].copy()\n",
    "\n",
    "# Write training and test response vectors\n",
    "y_train = np.log(train[response])\n",
    "y_test = np.log(test[response])\n",
    "\n",
    "# Write training and test design matrices\n",
    "X_train = train[predictors].copy()\n",
    "X_test = test[predictors].copy()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Linear Regression\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ols = LinearRegression()\n",
    "ols.fit(X_train, y_train)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Regularised Linear Models\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Lasso"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Pipeline(memory=None,\n",
       "     steps=[('scaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('estimator', LassoCV(alphas=None, copy_X=True, cv=5, eps=0.001, fit_intercept=True,\n",
       "    max_iter=1000, n_alphas=100, n_jobs=1, normalize=False, positive=False,\n",
       "    precompute='auto', random_state=None, selection='cyclic', tol=0.0001,\n",
       "    verbose=False))])"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "lasso = Pipeline((\n",
    "    ('scaler', StandardScaler()),\n",
    "    ('estimator', LassoCV(cv=5)),\n",
    "))\n",
    "\n",
    "lasso.fit(X_train, y_train)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Ridge Regression"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Pipeline(memory=None,\n",
       "     steps=[('scaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('estimator', RidgeCV(alphas=[3.0517578125e-05, 3.5055491790680982e-05, 4.0268185753567341e-05, 4.6255998733837822e-05, 5.3134189654304478e-05, 6.103515625e-05, 7.0110983581361965e-05, 8.0536371507134683e-05, 9.251199746767...cv=5, fit_intercept=True, gcv_mode=None, normalize=False, scoring=None,\n",
       "    store_cv_values=False))])"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "alphas = list(np.logspace(-15, 15, 151, base=2))\n",
    "\n",
    "ridge = Pipeline((\n",
    "    ('scaler', StandardScaler()),\n",
    "    ('estimator', RidgeCV(alphas=alphas, cv=5)),\n",
    "))\n",
    "\n",
    "ridge.fit(X_train, y_train)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Elastic Net"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Pipeline(memory=None,\n",
       "     steps=[('scaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('estimator', ElasticNetCV(alphas=None, copy_X=True, cv=5, eps=0.001, fit_intercept=True,\n",
       "       l1_ratio=[0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.99],\n",
       "       max_iter=1000, n_alphas=100, n_jobs=1, normalize=False,\n",
       "       positive=False, precompute='auto', random_state=None,\n",
       "       selection='cyclic', tol=0.0001, verbose=0))])"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "enet = Pipeline((\n",
    "    ('scaler', StandardScaler()),\n",
    "    ('estimator', ElasticNetCV(l1_ratio=[0.01,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9, 0.99], cv=5)),\n",
    "))\n",
    "\n",
    "enet.fit(X_train, y_train)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Regression Tree"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Best parameters: {'min_samples_leaf': 5, 'max_depth': 7}\n",
      "Wall time: 2.17 s\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "\n",
    "model = DecisionTreeRegressor(min_samples_leaf=5)\n",
    "\n",
    "tuning_parameters = {\n",
    "    'min_samples_leaf': [1,5,10,20],\n",
    "    'max_depth': np.arange(1,30),\n",
    "}\n",
    "\n",
    "tree = RandomizedSearchCV(model, tuning_parameters, n_iter=20, cv=5, return_train_score=False)\n",
    "tree.fit(X_train, y_train)\n",
    "\n",
    "print('Best parameters:', tree.best_params_)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Random Forest Regression"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Best parameters found by randomised search: {'min_samples_leaf': 1, 'max_features': 176} \n",
      "\n",
      "Wall time: 22.5 s\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "\n",
    "model = RandomForestRegressor(n_estimators=100)\n",
    "\n",
    "tuning_parameters = {\n",
    "    'min_samples_leaf': [1,5, 10, 20, 50],\n",
    "    'max_features': np.arange(1, X_train.shape[1], 5),\n",
    "}\n",
    "\n",
    "rf_search = RandomizedSearchCV(model, tuning_parameters, cv = 5, n_iter= 16, return_train_score=False, n_jobs=4,\n",
    "                              random_state = 20)\n",
    "rf_search.fit(X_train, y_train)\n",
    "\n",
    "rf = rf_search.best_estimator_\n",
    "\n",
    "print('Best parameters found by randomised search:', rf_search.best_params_, '\\n')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,\n",
       "           max_features=176, max_leaf_nodes=None,\n",
       "           min_impurity_decrease=0.0, min_impurity_split=None,\n",
       "           min_samples_leaf=1, min_samples_split=2,\n",
       "           min_weight_fraction_leaf=0.0, n_estimators=500, n_jobs=1,\n",
       "           oob_score=False, random_state=None, verbose=0, warm_start=False)"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "rf.n_estimators = 500\n",
    "rf.fit(X_train, y_train)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Gradient Boosting\n",
    "\n",
    "\n",
    "### LightGBM"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Best parameters found by randomised search: {'subsample': 1.0, 'n_estimators': 1500, 'max_depth': 2, 'learning_rate': 0.05} \n",
      "\n",
      "Wall time: 5min 58s\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "\n",
    "model = lgb.LGBMRegressor(objective='regression')\n",
    "\n",
    "\n",
    "tuning_parameters = {\n",
    "    'learning_rate': [0.01, 0.05, 0.1],\n",
    "    'n_estimators' : [250, 500, 750, 1000, 1500, 2000, 3000, 4000, 5000],\n",
    "    'max_depth' : [2, 3, 4],\n",
    "    'subsample' : [0.6, 0.8, 1.0],\n",
    "}\n",
    "\n",
    "gb_search = RandomizedSearchCV(model, tuning_parameters, n_iter = 128, cv = 5, return_train_score=False, n_jobs=4, \n",
    "                               random_state = 20)\n",
    "\n",
    "gb_search.fit(X_train, y_train)\n",
    "\n",
    "lbst = gb_search.best_estimator_\n",
    "\n",
    "\n",
    "print('Best parameters found by randomised search:', gb_search.best_params_, '\\n')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### XGBoost"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Best parameters found by randomised search: {'subsample': 0.6, 'n_estimators': 1000, 'max_depth': 2, 'learning_rate': 0.05} \n",
      "\n",
      "Wall time: 4min 46s\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "\n",
    "model = xgb.XGBRegressor()\n",
    "\n",
    "tuning_parameters = {\n",
    "    'learning_rate': [0.01, 0.05, 0.1],\n",
    "    'n_estimators' : [250, 500, 750, 1000, 1500, 2000, 3000, 5000],\n",
    "    'max_depth' : [2, 3, 4],\n",
    "    'subsample' : [0.6, 0.8, 1.0],\n",
    "}\n",
    "\n",
    "gb_search = RandomizedSearchCV(model, tuning_parameters, n_iter = 16, cv = 5, return_train_score=False, n_jobs=4,\n",
    "                              random_state = 20)\n",
    "gb_search.fit(X_train, y_train)\n",
    "\n",
    "xbst = gb_search.best_estimator_\n",
    "\n",
    "\n",
    "print('Best parameters found by randomised search:', gb_search.best_params_, '\\n')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Additive Boosting\n",
    "\n",
    "This is an advanced specification. Since gradient boosting is an additive model fit by forward stagewise additive modelling, nothing stops us from fitting a gradient boosting model to the residuals of a linear regression specification, therefore boosting the linear model with additive trees. \n",
    "\n",
    "The only disadvantage is that there are no immediately available functions to add this model to our stack. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Best parameters found by randomised search: {'subsample': 0.8, 'n_estimators': 1500, 'max_depth': 2, 'learning_rate': 0.01} \n",
      "\n",
      "Wall time: 1min 6s\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "\n",
    "y_fit = lasso.predict(X_train)\n",
    "resid = y_train - y_fit\n",
    "\n",
    "model = lgb.LGBMRegressor(objective='regression')\n",
    "\n",
    "\n",
    "tuning_parameters = {\n",
    "    'learning_rate': [0.01, 0.05, 0.1],\n",
    "    'n_estimators' : [250, 500, 750, 1000, 1500, 2000, 3000, 4000, 5000],\n",
    "    'max_depth' : [2, 3, 4],\n",
    "    'subsample' : [0.6, 0.8, 1.0],\n",
    "}\n",
    "\n",
    "gb_search = RandomizedSearchCV(model, tuning_parameters, n_iter = 16, cv = 5, return_train_score=False, n_jobs=4, \n",
    "                               random_state = 20)\n",
    "\n",
    "gb_search.fit(X_train, resid)\n",
    "\n",
    "abst = gb_search.best_estimator_\n",
    "\n",
    "\n",
    "print('Best parameters found by randomised search:', gb_search.best_params_, '\\n')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Model Stacking"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Wall time: 2min 18s\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "\n",
    "from mlxtend.regressor import StackingCVRegressor\n",
    "\n",
    "models = [ols, lasso, ridge, xbst]\n",
    "\n",
    "stack = StackingCVRegressor(models, meta_regressor = LinearRegression(), cv=10)\n",
    "stack.fit(X_train.values, y_train.ravel())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Model Evaluation\n",
    "\n",
    "\n",
    "### Original prices"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Test RMSE</th>\n",
       "      <th>Test R2</th>\n",
       "      <th>Test MAE</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>OLS</th>\n",
       "      <td>14875.800</td>\n",
       "      <td>0.950</td>\n",
       "      <td>10572.203</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Lasso</th>\n",
       "      <td>14791.127</td>\n",
       "      <td>0.951</td>\n",
       "      <td>10671.239</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Ridge</th>\n",
       "      <td>14704.635</td>\n",
       "      <td>0.951</td>\n",
       "      <td>10519.765</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Elastic Net</th>\n",
       "      <td>14791.233</td>\n",
       "      <td>0.951</td>\n",
       "      <td>10671.241</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Tree</th>\n",
       "      <td>30543.080</td>\n",
       "      <td>0.791</td>\n",
       "      <td>20826.741</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Random Forest</th>\n",
       "      <td>24308.127</td>\n",
       "      <td>0.867</td>\n",
       "      <td>14746.956</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>LightGBM</th>\n",
       "      <td>17561.108</td>\n",
       "      <td>0.931</td>\n",
       "      <td>11675.938</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>XGBoost</th>\n",
       "      <td>16813.901</td>\n",
       "      <td>0.937</td>\n",
       "      <td>11535.039</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Additive Boost</th>\n",
       "      <td>14022.916</td>\n",
       "      <td>0.956</td>\n",
       "      <td>10221.708</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Stack</th>\n",
       "      <td>13482.407</td>\n",
       "      <td>0.959</td>\n",
       "      <td>10032.881</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                Test RMSE  Test R2   Test MAE\n",
       "OLS             14875.800    0.950  10572.203\n",
       "Lasso           14791.127    0.951  10671.239\n",
       "Ridge           14704.635    0.951  10519.765\n",
       "Elastic Net     14791.233    0.951  10671.241\n",
       "Tree            30543.080    0.791  20826.741\n",
       "Random Forest   24308.127    0.867  14746.956\n",
       "LightGBM        17561.108    0.931  11675.938\n",
       "XGBoost         16813.901    0.937  11535.039\n",
       "Additive Boost  14022.916    0.956  10221.708\n",
       "Stack           13482.407    0.959  10032.881"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "columns=['Test RMSE', 'Test R2', 'Test MAE']\n",
    "rows=['OLS', 'Lasso', 'Ridge', 'Elastic Net', 'Tree', 'Random Forest', 'LightGBM', 'XGBoost', 'Additive Boost', 'Stack']\n",
    "results=pd.DataFrame(0.0, columns=columns, index=rows) \n",
    "\n",
    "methods=[ols, lasso, ridge, enet, tree, rf, lbst, xbst, abst, stack]\n",
    "\n",
    "for i, method in enumerate(methods):\n",
    "    \n",
    "    if method != stack:\n",
    "        y_pred=np.exp(method.predict(X_test))   \n",
    "        if method == abst:\n",
    "            y_pred=np.exp(lasso.predict(X_test)+method.predict(X_test)) # combining predictions           \n",
    "    else:\n",
    "        y_pred=np.exp(method.predict(X_test.values))\n",
    "        \n",
    "    results.iloc[i,0] = np.sqrt(mean_squared_error(np.exp(y_test), y_pred))\n",
    "    results.iloc[i,1] = r2_score(np.exp(y_test), y_pred)\n",
    "    results.iloc[i,2] = mean_absolute_error(np.exp(y_test), y_pred)\n",
    "\n",
    "results.round(3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Log prices"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Test RMSE</th>\n",
       "      <th>Test R2</th>\n",
       "      <th>Test MAE</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>OLS</th>\n",
       "      <td>0.083</td>\n",
       "      <td>0.945</td>\n",
       "      <td>0.063</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Lasso</th>\n",
       "      <td>0.085</td>\n",
       "      <td>0.942</td>\n",
       "      <td>0.063</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Ridge</th>\n",
       "      <td>0.084</td>\n",
       "      <td>0.944</td>\n",
       "      <td>0.063</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Elastic Net</th>\n",
       "      <td>0.085</td>\n",
       "      <td>0.942</td>\n",
       "      <td>0.063</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Tree</th>\n",
       "      <td>0.160</td>\n",
       "      <td>0.796</td>\n",
       "      <td>0.120</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Random Forest</th>\n",
       "      <td>0.113</td>\n",
       "      <td>0.898</td>\n",
       "      <td>0.082</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>LightGBM</th>\n",
       "      <td>0.090</td>\n",
       "      <td>0.935</td>\n",
       "      <td>0.067</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>XGBoost</th>\n",
       "      <td>0.088</td>\n",
       "      <td>0.939</td>\n",
       "      <td>0.066</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Additive Boost</th>\n",
       "      <td>0.081</td>\n",
       "      <td>0.947</td>\n",
       "      <td>0.062</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Stack</th>\n",
       "      <td>0.078</td>\n",
       "      <td>0.951</td>\n",
       "      <td>0.059</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                Test RMSE  Test R2  Test MAE\n",
       "OLS                 0.083    0.945     0.063\n",
       "Lasso               0.085    0.942     0.063\n",
       "Ridge               0.084    0.944     0.063\n",
       "Elastic Net         0.085    0.942     0.063\n",
       "Tree                0.160    0.796     0.120\n",
       "Random Forest       0.113    0.898     0.082\n",
       "LightGBM            0.090    0.935     0.067\n",
       "XGBoost             0.088    0.939     0.066\n",
       "Additive Boost      0.081    0.947     0.062\n",
       "Stack               0.078    0.951     0.059"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "columns=['Test RMSE', 'Test R2', 'Test MAE']\n",
    "rows=['OLS', 'Lasso', 'Ridge', 'Elastic Net', 'Tree', 'Random Forest', 'LightGBM', 'XGBoost', 'Additive Boost', 'Stack']\n",
    "results=pd.DataFrame(0.0, columns=columns, index=rows) \n",
    "\n",
    "methods=[ols, lasso, ridge, enet, tree, rf, lbst, xbst, abst, stack]\n",
    "\n",
    "for i, method in enumerate(methods):\n",
    "    \n",
    "    if method != stack:\n",
    "        y_pred= method.predict(X_test)   \n",
    "        if method == abst:\n",
    "            y_pred=ols.predict(X_test)+method.predict(X_test)              \n",
    "    else:\n",
    "        y_pred= method.predict(X_test.values)\n",
    "        \n",
    "    results.iloc[i,0] = np.sqrt(mean_squared_error(y_test, y_pred))\n",
    "    results.iloc[i,1] = r2_score(y_test, y_pred)\n",
    "    results.iloc[i,2] = mean_absolute_error(y_test, y_pred)\n",
    "\n",
    "results.round(3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Making a Submission on Kaggle\n",
    "\n",
    "Using the methods from this lesson would lead competitive score at the [Kaggle competition](https://www.kaggle.com/c/house-prices-advanced-regression-techniques). Note that the Kaggle competition is based on predicting the log prices. \n",
    "\n",
    "If you would like to try it, you would need to download the training and test sets from Kaggle and reprocess the data accordingly. Details on how I processed the data are available on request. \n",
    "\n",
    "The next cell shows you how to generate a submission file (see further instructions on Kaggle regarding the Id column, which does not exist in our version of the dataset). "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [],
   "source": [
    "submission = pd.DataFrame(np.c_[test.index, y_pred], columns=['Id', response])\n",
    "submission.to_csv('kaggle_submission.csv',  index=False)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}