{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Predicting Housing Prices using Linear Model\n",
    "\n",
    "## Goal:\n",
    "   * Build a linear model to predict housing prices using the Boston house prices data set. The data set is availale on https://bit.ly/31MNlnS. \n",
    "\n",
    "Boston house prices data set\n",
    "---------------------------\n",
    "\n",
    "**Data Set Characteristics:**  \n",
    "\n",
    "    :Number of Instances: 506 \n",
    "\n",
    "    :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.\n",
    "\n",
    "    :Attribute Information (in order):\n",
    "        - CRIM     per capita crime rate by town\n",
    "        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.\n",
    "        - INDUS    proportion of non-retail business acres per town\n",
    "        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)\n",
    "        - NOX      nitric oxides concentration (parts per 10 million)\n",
    "        - RM       average number of rooms per dwelling\n",
    "        - AGE      proportion of owner-occupied units built prior to 1940\n",
    "        - DIS      weighted distances to five Boston employment centres\n",
    "        - RAD      index of accessibility to radial highways\n",
    "        - TAX      full-value property-tax rate per \\$10,000\n",
    "        - PTRATIO  pupil-teacher ratio by town\n",
    "        - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town\n",
    "        - LSTAT    % lower status of the population\n",
    "        - MEDV     Median value of owner-occupied homes in $1000's\n",
    "\n",
    "    :Missing Attribute Values: None\n",
    "\n",
    "    :Creator: Harrison, D. and Rubinfeld, D.L.\n",
    "\n",
    "This is a copy of UCI ML housing dataset.\n",
    "https://archive.ics.uci.edu/ml/machine-learning-databases/housing/\n",
    "\n",
    "The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic\n",
    "prices and the demand for clean air', J. Environ. Economics & Management,\n",
    "vol.5, 81-102, 1978.   Used in Belsley, Kuh & Welsch, 'Regression diagnostics\n",
    "...', Wiley, 1980.   N.B. Various transformations are used in the table on\n",
    "pages 244-261 of the latter.\n",
    "\n",
    "The Boston house-price data has been used in many machine learning papers that address regression\n",
    "problems.   \n",
    "     \n",
    "References\n",
    "\n",
    "   - Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.\n",
    "   - Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import sys\n",
    "assert sys.version_info >= (3, 6)\n",
    "\n",
    "import numpy\n",
    "assert numpy.__version__ >=\"1.17.3\" \n",
    "import numpy as np\n",
    "\n",
    "import pandas\n",
    "assert pandas.__version__ >= \"0.25.1\"\n",
    "import pandas as pd\n",
    "\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "%matplotlib inline"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exploring the Bost housing data set"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 1. Importing data set"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# dataset_path = 'https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data'\n",
    "dataset_path = 'data/housing/boston/housing.data'\n",
    "housing = pd.read_csv(dataset_path, sep ='\\s+', header = None)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "housing.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "housing.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2. Data pre-processing"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "housing.columns = [\"CRIM\", \"ZN\", \"INDUS\", \"CHAS\", \"NOX\", \"RM\", \"AGE\", \"DIS\", \"RAD\", \"TAX\", \"PTRATIO\", \"B\",  \"LSTAT\", \"MEDV\"]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "housing.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Couting the number of missing values of each feature"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "housing.isnull().sum()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "housing.describe()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3. Exploratory Data Analysis (EDA)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 3.1 Visualizing the distribution of the variables"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "h  = housing.hist(bins=20, figsize=(20,15))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 3.2 Computing the correlation matrix of the variables"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "correlation_matrix = np.corrcoef(housing[housing.columns].values.T)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Visualizing the correlation matrix"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import seaborn\n",
    "assert seaborn.__version__ >= '0.9.0'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import seaborn\n",
    "assert seaborn.__version__ >= '0.9.0'\n",
    "import seaborn as sns\n",
    "\n",
    "f, ax = plt.subplots(figsize=(10, 8))\n",
    "\n",
    "hm = sns.heatmap(data = correlation_matrix, \n",
    "            annot = True,\n",
    "            square = True,\n",
    "            fmt='.2f',\n",
    "            yticklabels = housing.columns, \n",
    "            xticklabels = housing.columns,\n",
    "            linewidths=.1, ax = ax)\n",
    "\n",
    "# fix for mpl bug that cuts off top/bottom of seaborn viz\n",
    "# See the discussion here (https://github.com/mwaskom/seaborn/issues/1773)\n",
    "b, t = plt.ylim() # discover the values for bottom and top\n",
    "b += 0.5 # Add 0.5 to the bottom\n",
    "t -= 0.5 # Subtract 0.5 from the top\n",
    "plt.ylim(b, t) # update the ylim(bottom, top) values"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* **MEDV** is positively correlated with the **RM**(0.7), whereas it has a strong negative correlation with **LSTAT**(-0.74)\n",
    "* Features **RAD** and **Tax** are strongly correlated to each other (0.91). Thus, we should not select these feature together to train the model. The same are valid for features **NOX** and **DIS**, **AGE** and **DIS**, **INDUS** and **DIS**.\n",
    "\n",
    "Let us investigate how **RM** and **LSTAT** vary with **MEDV**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.figure(figsize=(20,5))\n",
    "features = [\"RM\", \"LSTAT\"]\n",
    "\n",
    "for i, col in enumerate(features):\n",
    "    plt.subplot(1, len(features), i + 1)\n",
    "    plt.scatter(housing[col], housing['MEDV'])\n",
    "    plt.xlabel(col)\n",
    "    plt.title(col)\n",
    "    plt.ylabel('MEDV')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* The price increases as the average number of rooms per dwelling (i.e., **RM**) increases. Therefore, there are some outliers. \n",
    "* The prices tends to decreases when **LSTAT** increases, although it does not seem to follows a linear relationship."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "X = housing['RM'].values.reshape(-1,1)\n",
    "y = housing['MEDV'].values.reshape(-1,1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 4. Predicting the price of the houses"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 4.1 Building a linear model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import sklearn\n",
    "assert sklearn.__version__ >='0.21.3'\n",
    "\n",
    "from sklearn.linear_model import LinearRegression\n",
    "\n",
    "lm = LinearRegression()\n",
    "lm.fit(X, y)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 4.2 Checking the parameters of the model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"Intercept = {}, Slope{}\".format(lm.intercept_, lm.coef_))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 4.3 Assessing the performance of the model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### 4.3.1 Splitting the data into training and testing sets"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "X = pd.DataFrame(np.c_[housing['RM'], housing['LSTAT']], columns = ['RM', 'LSTAT'])\n",
    "y = housing['MEDV']\n",
    "\n",
    "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .2, random_state = 100)\n",
    "\n",
    "print(\"X_train.shape {}, X_test.shape {}\".format(X_train.shape, X_test.shape))\n",
    "print(\"y_train.shape {}, y_test.shape {}\".format(y_train.shape, y_test.shape))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### 4.3.2 Creating a new linear model using only the training set"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "lm = LinearRegression()\n",
    "lm.fit(X_train, y_train)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### 4.3.4. Evaluating the model using the test set"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "y_train_predicted = lm.predict(X_train)\n",
    "y_test_predicted = lm.predict(X_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### 4.3.5. Visualizing the residuals"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.figure(figsize=(6, 5))\n",
    "plt.scatter(y_train_predicted, y_train_predicted - y_train, color = \"blue\", label = \"Training data\")\n",
    "plt.scatter(y_test_predicted, y_test_predicted - y_test, color = 'red', label = \"Test data\")\n",
    "plt.ylabel(\"Residuals\")\n",
    "plt.xlabel(\"Predicted values\")\n",
    "plt.legend(loc=\"upper left\")\n",
    "plt.hlines(y=0, xmin=-10, xmax=50, lw=2, color='green')\n",
    "plt.axis([-10,50, -30,20])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* The model does not seem to be completely wrong, as the residuals are randomly scattered around the centerline.\n",
    "* Therefore, the model is unable to capture some exploratory information, as can be seen in the presence of small patterns."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 4.4. Computing the MSE, RMSE, and $R^2$\n",
    "\n",
    "* The **mean square error (MSE)** is simply the average value of the **residual sum of squares (RSS)** that we minimize to fit the linear regression model. \n",
    "\n",
    "$$MSE = \\frac{1}{N}\\sum_{i=1}^{N}(y_i - \\hat{y_i})^2$$\n",
    "\n",
    "* The **root mean square error (RMSE)** comprises the standard deviation of the residuals. It tells us how concentrated the data are around the line of the best fit.\n",
    "\n",
    "$$RMSE = \\sqrt{MSE}$$\n",
    "\n",
    "* The coefficient of determination ($R^2$) represents the fraction of response variance that is captured by the model. It is computed as follows:\n",
    "\n",
    "$$ R^2 = 1 - \\frac{MSE}{Var(y)}$$\n",
    "\n",
    "For the training dataset, $R^2$ is bounded between 0 and 1. Therefore, it can become negative for the test set. If $R^2 = 1$, the model fits the data perfectly, which corresponde a $MSE = 0$.\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.metrics import mean_squared_error, r2_score\n",
    "\n",
    "mse_train = mean_squared_error(y_train, y_train_predicted)\n",
    "mse_test = mean_squared_error(y_test, y_test_predicted)\n",
    "\n",
    "print('MSE training: %.3f, test: %.3f' %(mse_train, mse_test))\n",
    "print('RMSE training: %.3f, RMSE test: %.3f' % (np.sqrt(mse_train), np.sqrt(mse_test)))\n",
    "\n",
    "print('R2 score training: %.3f, R2 score test: %.3f' %(r2_score(y_train, y_train_predicted), \n",
    "                                                       r2_score(y_test, y_test_predicted)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Learning curves"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def plot_learning_curves(model, X_train, X_test, y_train, y_test):\n",
    "    \n",
    "    \"\"\" Plot the learning curves of the given model using the training and the test sets\"\"\"\n",
    "    \n",
    "    train_errors, test_errors = [], []\n",
    "    for m in range(1, len(X_train)):\n",
    "        model.fit(X_train[:m], y_train[:m])\n",
    "        y_train_predicted = model.predict(X_train[:m])\n",
    "        y_test_predicted = model.predict(X_test)\n",
    "        train_errors.append(mean_squared_error(y_train_predicted, y_train[:m]))\n",
    "        test_errors.append(mean_squared_error(y_test_predicted, y_test))\n",
    "        \n",
    "    plt.plot(np.sqrt(train_errors), 'r-+', linewidth = 1, label = \"Training set\")\n",
    "    plt.plot(np.sqrt(test_errors), 'b-', linewidth = 2, label = \"Testing set\")\n",
    "    plt.legend(loc=\"upper right\")\n",
    "    plt.xlabel('Training set size')\n",
    "    plt.ylabel('RMSE')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.figure(figsize=(6, 4))\n",
    "\n",
    "plt.title('Predicting Boston housing prices: learnig curves')\n",
    "plot_learning_curves(lm, X_train, X_test, y_train, y_test)\n",
    "\n",
    "plt.savefig('figures/learning_curves.pdf')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Regularized models"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Ridge Regression"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.linear_model import Ridge\n",
    "\n",
    "X = housing.drop('MEDV', axis = 1)\n",
    "y = housing['MEDV']\n",
    "\n",
    "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .20, random_state = 100)\n",
    "\n",
    "ridge_reg = Ridge(alpha=2, solver=\"cholesky\", random_state=100)\n",
    "ridge_reg.fit(X_train, y_train)\n",
    "\n",
    "y_pred_train_ridge = ridge_reg.predict(X_train)\n",
    "y_pred_ridge = ridge_reg.predict(X_test)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.scatter(y_test,y_pred_ridge)\n",
    "plt.xlabel(\"Actual value ($Y)$\")\n",
    "plt.ylabel(\"Predicted value ($\\hat{Y}$)\")\n",
    "plt.savefig('figures/ridge_regression_plot.pdf')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can see a diagonal trend, indicating a good _fit_ model  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.figure(figsize=(6, 4))\n",
    "plot_learning_curves(ridge_reg, X_train, X_test, y_train, y_test)\n",
    "plt.savefig('figures/learning_curve_ridge_regression.pdf')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Lasso Regression"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.linear_model import Lasso\n",
    "\n",
    "lasso = Lasso(alpha=0.3)\n",
    "lasso.fit(X_train, y_train)\n",
    "\n",
    "y_pred_train_lasso = lasso.predict(X_train)\n",
    "y_pred_lasso = lasso.predict(X_test)\n",
    "\n",
    "plt.scatter(y_test,y_pred_lasso)\n",
    "plt.title('Lasso Boston housing price prediction')\n",
    "plt.xlabel(\"Actual value ($Y)$\")\n",
    "plt.ylabel(\"Predicted value ($\\hat{Y}$)\")\n",
    "\n",
    "plt.savefig('figures/lasso_regression_plot.pdf')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.figure(figsize=(6, 4))\n",
    "plot_learning_curves(lasso, X_train, X_test, y_train, y_test)\n",
    "plt.savefig('figures/learning_curve_lasso_regression.pdf')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Elastic Net"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.linear_model import ElasticNet\n",
    "\n",
    "elastic_net = ElasticNet(alpha=0.1, l1_ratio=0.5, random_state=100)\n",
    "elastic_net.fit(X_train, y_train)\n",
    "\n",
    "y_pred_train_elastic = elastic_net.predict(X_train)\n",
    "y_pred_elastic = elastic_net.predict(X_test)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.scatter(y_test,y_pred_elastic)\n",
    "plt.title('Elastic Net Boston housing price prediction')\n",
    "plt.xlabel(\"Actual value ($Y)$\")\n",
    "plt.ylabel(\"Predicted value ($\\hat{Y}$)\")\n",
    "\n",
    "plt.savefig('figures/elastic_net_plot.pdf')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.figure(figsize=(6, 4))\n",
    "plot_learning_curves(elastic_net, X_train, X_test, y_train, y_test)\n",
    "plt.savefig('figures/learning_curve_elastic_net_regression.pdf')"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}