{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Predicting Housing Prices using Linear Model\n", "\n", "## Goal:\n", " * Build a linear model to predict housing prices using the Boston house prices data set. The data set is availale on https://bit.ly/31MNlnS. \n", "\n", "Boston house prices data set\n", "---------------------------\n", "\n", "**Data Set Characteristics:** \n", "\n", " :Number of Instances: 506 \n", "\n", " :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.\n", "\n", " :Attribute Information (in order):\n", " - CRIM per capita crime rate by town\n", " - ZN proportion of residential land zoned for lots over 25,000 sq.ft.\n", " - INDUS proportion of non-retail business acres per town\n", " - CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)\n", " - NOX nitric oxides concentration (parts per 10 million)\n", " - RM average number of rooms per dwelling\n", " - AGE proportion of owner-occupied units built prior to 1940\n", " - DIS weighted distances to five Boston employment centres\n", " - RAD index of accessibility to radial highways\n", " - TAX full-value property-tax rate per \\$10,000\n", " - PTRATIO pupil-teacher ratio by town\n", " - B 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town\n", " - LSTAT % lower status of the population\n", " - MEDV Median value of owner-occupied homes in $1000's\n", "\n", " :Missing Attribute Values: None\n", "\n", " :Creator: Harrison, D. and Rubinfeld, D.L.\n", "\n", "This is a copy of UCI ML housing dataset.\n", "https://archive.ics.uci.edu/ml/machine-learning-databases/housing/\n", "\n", "The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic\n", "prices and the demand for clean air', J. Environ. Economics & Management,\n", "vol.5, 81-102, 1978. Used in Belsley, Kuh & Welsch, 'Regression diagnostics\n", "...', Wiley, 1980. N.B. Various transformations are used in the table on\n", "pages 244-261 of the latter.\n", "\n", "The Boston house-price data has been used in many machine learning papers that address regression\n", "problems. \n", " \n", "References\n", "\n", " - Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.\n", " - Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import sys\n", "assert sys.version_info >= (3, 6)\n", "\n", "import numpy\n", "assert numpy.__version__ >=\"1.17.3\" \n", "import numpy as np\n", "\n", "import pandas\n", "assert pandas.__version__ >= \"0.25.1\"\n", "import pandas as pd\n", "\n", "import matplotlib.pyplot as plt\n", "\n", "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exploring the Bost housing data set" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1. Importing data set" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# dataset_path = 'https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data'\n", "dataset_path = 'data/housing/boston/housing.data'\n", "housing = pd.read_csv(dataset_path, sep ='\\s+', header = None)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "housing.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "housing.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2. Data pre-processing" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "housing.columns = [\"CRIM\", \"ZN\", \"INDUS\", \"CHAS\", \"NOX\", \"RM\", \"AGE\", \"DIS\", \"RAD\", \"TAX\", \"PTRATIO\", \"B\", \"LSTAT\", \"MEDV\"]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "housing.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Couting the number of missing values of each feature" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "housing.isnull().sum()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "housing.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3. Exploratory Data Analysis (EDA)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 3.1 Visualizing the distribution of the variables" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "h = housing.hist(bins=20, figsize=(20,15))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 3.2 Computing the correlation matrix of the variables" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "correlation_matrix = np.corrcoef(housing[housing.columns].values.T)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Visualizing the correlation matrix" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import seaborn\n", "assert seaborn.__version__ >= '0.9.0'" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import seaborn\n", "assert seaborn.__version__ >= '0.9.0'\n", "import seaborn as sns\n", "\n", "f, ax = plt.subplots(figsize=(10, 8))\n", "\n", "hm = sns.heatmap(data = correlation_matrix, \n", " annot = True,\n", " square = True,\n", " fmt='.2f',\n", " yticklabels = housing.columns, \n", " xticklabels = housing.columns,\n", " linewidths=.1, ax = ax)\n", "\n", "# fix for mpl bug that cuts off top/bottom of seaborn viz\n", "# See the discussion here (https://github.com/mwaskom/seaborn/issues/1773)\n", "b, t = plt.ylim() # discover the values for bottom and top\n", "b += 0.5 # Add 0.5 to the bottom\n", "t -= 0.5 # Subtract 0.5 from the top\n", "plt.ylim(b, t) # update the ylim(bottom, top) values" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* **MEDV** is positively correlated with the **RM**(0.7), whereas it has a strong negative correlation with **LSTAT**(-0.74)\n", "* Features **RAD** and **Tax** are strongly correlated to each other (0.91). Thus, we should not select these feature together to train the model. The same are valid for features **NOX** and **DIS**, **AGE** and **DIS**, **INDUS** and **DIS**.\n", "\n", "Let us investigate how **RM** and **LSTAT** vary with **MEDV**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.figure(figsize=(20,5))\n", "features = [\"RM\", \"LSTAT\"]\n", "\n", "for i, col in enumerate(features):\n", " plt.subplot(1, len(features), i + 1)\n", " plt.scatter(housing[col], housing['MEDV'])\n", " plt.xlabel(col)\n", " plt.title(col)\n", " plt.ylabel('MEDV')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* The price increases as the average number of rooms per dwelling (i.e., **RM**) increases. Therefore, there are some outliers. \n", "* The prices tends to decreases when **LSTAT** increases, although it does not seem to follows a linear relationship." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X = housing['RM'].values.reshape(-1,1)\n", "y = housing['MEDV'].values.reshape(-1,1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 4. Predicting the price of the houses" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 4.1 Building a linear model" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import sklearn\n", "assert sklearn.__version__ >='0.21.3'\n", "\n", "from sklearn.linear_model import LinearRegression\n", "\n", "lm = LinearRegression()\n", "lm.fit(X, y)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 4.2 Checking the parameters of the model" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(\"Intercept = {}, Slope{}\".format(lm.intercept_, lm.coef_))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 4.3 Assessing the performance of the model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### 4.3.1 Splitting the data into training and testing sets" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import train_test_split\n", "\n", "X = pd.DataFrame(np.c_[housing['RM'], housing['LSTAT']], columns = ['RM', 'LSTAT'])\n", "y = housing['MEDV']\n", "\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .2, random_state = 100)\n", "\n", "print(\"X_train.shape {}, X_test.shape {}\".format(X_train.shape, X_test.shape))\n", "print(\"y_train.shape {}, y_test.shape {}\".format(y_train.shape, y_test.shape))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### 4.3.2 Creating a new linear model using only the training set" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "lm = LinearRegression()\n", "lm.fit(X_train, y_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### 4.3.4. Evaluating the model using the test set" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "y_train_predicted = lm.predict(X_train)\n", "y_test_predicted = lm.predict(X_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### 4.3.5. Visualizing the residuals" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.figure(figsize=(6, 5))\n", "plt.scatter(y_train_predicted, y_train_predicted - y_train, color = \"blue\", label = \"Training data\")\n", "plt.scatter(y_test_predicted, y_test_predicted - y_test, color = 'red', label = \"Test data\")\n", "plt.ylabel(\"Residuals\")\n", "plt.xlabel(\"Predicted values\")\n", "plt.legend(loc=\"upper left\")\n", "plt.hlines(y=0, xmin=-10, xmax=50, lw=2, color='green')\n", "plt.axis([-10,50, -30,20])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* The model does not seem to be completely wrong, as the residuals are randomly scattered around the centerline.\n", "* Therefore, the model is unable to capture some exploratory information, as can be seen in the presence of small patterns." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 4.4. Computing the MSE, RMSE, and $R^2$\n", "\n", "* The **mean square error (MSE)** is simply the average value of the **residual sum of squares (RSS)** that we minimize to fit the linear regression model. \n", "\n", "$$MSE = \\frac{1}{N}\\sum_{i=1}^{N}(y_i - \\hat{y_i})^2$$\n", "\n", "* The **root mean square error (RMSE)** comprises the standard deviation of the residuals. It tells us how concentrated the data are around the line of the best fit.\n", "\n", "$$RMSE = \\sqrt{MSE}$$\n", "\n", "* The coefficient of determination ($R^2$) represents the fraction of response variance that is captured by the model. It is computed as follows:\n", "\n", "$$ R^2 = 1 - \\frac{MSE}{Var(y)}$$\n", "\n", "For the training dataset, $R^2$ is bounded between 0 and 1. Therefore, it can become negative for the test set. If $R^2 = 1$, the model fits the data perfectly, which corresponde a $MSE = 0$.\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.metrics import mean_squared_error, r2_score\n", "\n", "mse_train = mean_squared_error(y_train, y_train_predicted)\n", "mse_test = mean_squared_error(y_test, y_test_predicted)\n", "\n", "print('MSE training: %.3f, test: %.3f' %(mse_train, mse_test))\n", "print('RMSE training: %.3f, RMSE test: %.3f' % (np.sqrt(mse_train), np.sqrt(mse_test)))\n", "\n", "print('R2 score training: %.3f, R2 score test: %.3f' %(r2_score(y_train, y_train_predicted), \n", " r2_score(y_test, y_test_predicted)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Learning curves" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def plot_learning_curves(model, X_train, X_test, y_train, y_test):\n", " \n", " \"\"\" Plot the learning curves of the given model using the training and the test sets\"\"\"\n", " \n", " train_errors, test_errors = [], []\n", " for m in range(1, len(X_train)):\n", " model.fit(X_train[:m], y_train[:m])\n", " y_train_predicted = model.predict(X_train[:m])\n", " y_test_predicted = model.predict(X_test)\n", " train_errors.append(mean_squared_error(y_train_predicted, y_train[:m]))\n", " test_errors.append(mean_squared_error(y_test_predicted, y_test))\n", " \n", " plt.plot(np.sqrt(train_errors), 'r-+', linewidth = 1, label = \"Training set\")\n", " plt.plot(np.sqrt(test_errors), 'b-', linewidth = 2, label = \"Testing set\")\n", " plt.legend(loc=\"upper right\")\n", " plt.xlabel('Training set size')\n", " plt.ylabel('RMSE')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.figure(figsize=(6, 4))\n", "\n", "plt.title('Predicting Boston housing prices: learnig curves')\n", "plot_learning_curves(lm, X_train, X_test, y_train, y_test)\n", "\n", "plt.savefig('figures/learning_curves.pdf')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Regularized models" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Ridge Regression" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.linear_model import Ridge\n", "\n", "X = housing.drop('MEDV', axis = 1)\n", "y = housing['MEDV']\n", "\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .20, random_state = 100)\n", "\n", "ridge_reg = Ridge(alpha=2, solver=\"cholesky\", random_state=100)\n", "ridge_reg.fit(X_train, y_train)\n", "\n", "y_pred_train_ridge = ridge_reg.predict(X_train)\n", "y_pred_ridge = ridge_reg.predict(X_test)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.scatter(y_test,y_pred_ridge)\n", "plt.xlabel(\"Actual value ($Y)$\")\n", "plt.ylabel(\"Predicted value ($\\hat{Y}$)\")\n", "plt.savefig('figures/ridge_regression_plot.pdf')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see a diagonal trend, indicating a good _fit_ model " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.figure(figsize=(6, 4))\n", "plot_learning_curves(ridge_reg, X_train, X_test, y_train, y_test)\n", "plt.savefig('figures/learning_curve_ridge_regression.pdf')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Lasso Regression" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.linear_model import Lasso\n", "\n", "lasso = Lasso(alpha=0.3)\n", "lasso.fit(X_train, y_train)\n", "\n", "y_pred_train_lasso = lasso.predict(X_train)\n", "y_pred_lasso = lasso.predict(X_test)\n", "\n", "plt.scatter(y_test,y_pred_lasso)\n", "plt.title('Lasso Boston housing price prediction')\n", "plt.xlabel(\"Actual value ($Y)$\")\n", "plt.ylabel(\"Predicted value ($\\hat{Y}$)\")\n", "\n", "plt.savefig('figures/lasso_regression_plot.pdf')\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.figure(figsize=(6, 4))\n", "plot_learning_curves(lasso, X_train, X_test, y_train, y_test)\n", "plt.savefig('figures/learning_curve_lasso_regression.pdf')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Elastic Net" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.linear_model import ElasticNet\n", "\n", "elastic_net = ElasticNet(alpha=0.1, l1_ratio=0.5, random_state=100)\n", "elastic_net.fit(X_train, y_train)\n", "\n", "y_pred_train_elastic = elastic_net.predict(X_train)\n", "y_pred_elastic = elastic_net.predict(X_test)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.scatter(y_test,y_pred_elastic)\n", "plt.title('Elastic Net Boston housing price prediction')\n", "plt.xlabel(\"Actual value ($Y)$\")\n", "plt.ylabel(\"Predicted value ($\\hat{Y}$)\")\n", "\n", "plt.savefig('figures/elastic_net_plot.pdf')\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.figure(figsize=(6, 4))\n", "plot_learning_curves(elastic_net, X_train, X_test, y_train, y_test)\n", "plt.savefig('figures/learning_curve_elastic_net_regression.pdf')" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.4" } }, "nbformat": 4, "nbformat_minor": 4 }