{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Exercise 05\n", "\n", "# Neural networks\n", "\n", "## 4.1 Little Red Riding Hood Network\n", "\n", "Train a neural network to solve the Little Red Riding Hood problem in sklern and Keras. Try the neural networ with different inputs and report the results.\n", "\n", "________________\n", "\n", "## 4.2 Boston House Price Prediction\n", "\n", "In the next questions we are going to work using the dataset *Boston*. This dataset measures the influence of socioeconomical factors on the price of several estates of the city of Boston. This dataset has 506 instances, each one characterized by 13 features:\n", "\n", "* CRIM - per capita crime rate by town\n", "* ZN - proportion of residential land zoned for lots over 25,000 sq.ft.\n", "* INDUS - proportion of non-retail business acres per town.\n", "* CHAS - Charles River dummy variable (1 if tract bounds river; 0 otherwise)\n", "* NOX - nitric oxides concentration (parts per 10 million)\n", "* RM - average number of rooms per dwelling\n", "* AGE - proportion of owner-occupied units built prior to 1940\n", "* DIS - weighted distances to five Boston employment centres\n", "* RAD - index of accessibility to radial highways\n", "* TAX - full-value property-tax rate per 10,000 USD\n", "* PTRATIO - pupil-teacher ratio by town\n", "* B - $1000(Bk - 0.63)^2$ where $Bk$ is the proportion of blacks by town\n", "* LSTAT - % lower status of the population\n", "\n", "Output variable:\n", "* MEDV - Median value of owner-occupied homes in 1000's USD\n", "\n", "**Note:** In this exercise we are going to predict the price of each estate, which is represented in the `MEDV` variable. It is important to remember that we are always aiming to predict `MEDV`, no matter which explanatory variables we are using. That means, in some cases we will use a subset of the 13 previously mentioned variables, while in other cases we will use all the 13 variables. But in no case we will change the dependent variable $y$.\n", "\n", "\n", "\n", "1. Load the dataset using `from sklearn.datasets import load_boston`.\n", "2. Create a DataFrame using the attribute `.data` from the loading function of Scikit-learn.\n", "3. Assign the columns of the DataFrame so they match the `.feature_names` attribute from the loading function of Scikit-learn. \n", "4. Assign a new column to the DataFrame which holds the value to predict, that means, the `.target` attribute of the loading function of Scikit-learn. The name of this columns must be `MEDV`.\n", "5. Use the function `.describe()` from Pandas for obtaining statistics about each column.\n", "\n", "## 4.3 Feature analysis:\n", "\n", "Using the DataFrame generated in the previous section:\n", "* Filter the dataset to just these features:\n", " * Explanatory: 'LSTAT', 'INDUS', 'NOX', 'RM', 'AGE'\n", " * Dependent: 'MEDV'.\n", "* Generate a scatter matrix among the features mentioned above using Pandas (`scatter_matrix`) or Seaborn (` pairplot`).\n", " * Do you find any relationship between the features?\n", "* Generate the correlation matrix between these variables using `numpy.corrcoef`. Also include `MEDV`.\n", " * Which characteristics are more correlated?\n", " * BONUS: Visualize this matrix as heat map using Pandas, Matplotlib or Seaborn.\n", "\n", "## 4.4 Modeling linear and non linear relationships\n", "\n", "* Generate two new subsets filtering these characteristics:\n", " * $D_1$: $X = \\textit{'RM'}$, $y = \\textit{'MEDV'}$\n", " * $D_2$: $X = \\textit{'LSTAT'}$, $y = \\textit{'MEDV'}$\n", "* For each subset, generate a training partition and a test partition using a ratio of $ 70 \\% - 30 \\% $\n", "* Train a linear regression model on both subsets of data:\n", " * Report the mean square error on the test set\n", " * Print the values of $ w $ and $ w_0 $ of the regression equation\n", " * Generate a graph where you visualize the line obtained by the regression model in conjunction with the training data and the test data\n", "* How does the model perform on $ D_1 $ and $ D_2 $? Why?\n", "\n", "## 4.5 Training a regression model\n", "\n", "* Generate a 70-30 partitioning of the data **using all the features**. (Do not include the dependent variable `MEDV`)\n", "* Train a linear regression model with the objective of predicting the output variable `MEDV`.\n", " * Report the mean square error on the test set\n", "* Train a regression model using `MLPRegressor` in order to predict the output variable` MEDV`.\n", " * Report the mean square error on the test set\n", "* Scale the data so that they have zero mean variance one per feature (only $ X $). You can use the following piece of code:\n", "\n", "```python\n", "from sklearn.preprocessing import StandardScaler\n", "\n", "sc_x = StandardScaler()\n", "sc_x.fit(X)\n", "X_train_s = sc_x.transform(X_train)\n", "X_test_s = sc_x.transform(X_test)\n", "```\n", "Check more information about `StandardScaler` [here](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html).\n", "\n", "* Train the following models:\n", " 1. Train a linear regression model using the scaled data.\n", " * Report the mean square error on the test set\n", " 2. Train a regression model using a 2-layer MultiLayer Perceptron (128 neurons in the first and 512 in the second) and with the **scaled data**.\n", " * Report the mean square error on the test set\n", " 3. Which model has better performance? Why?" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.4" } }, "nbformat": 4, "nbformat_minor": 2 }