{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Predicting House Price 🏠" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "🎯 In this challenge, you will **predict the sale price** of houses (`SalePrice`) according to the *surface*, the *number of bedrooms* or the *overall quality*." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Load Python Libraries" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Run the cell below to `import` some Python libraries - these will be our tools for working with data 📊\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2021-10-05T15:33:22.750076Z", "start_time": "2021-10-05T15:33:21.543512Z" } }, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import seaborn as sns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "--------------" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Load the Dataset" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "👇 Run the cell below to load the `house_prices.csv` dataset into this notebook as a pandas `DataFrame`, and display its first 5 rows." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*Note: the dataset has been cleaned and consolidated for learning purposes*" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2021-10-20T16:48:04.958142Z", "start_time": "2021-10-20T16:48:04.425006Z" }, "scrolled": false }, "outputs": [], "source": [ "houses = pd.read_csv('https://storage.googleapis.com/introduction-to-data-science/house-prices.csv')\n", "houses.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This dataset contains information about houses sold.\n", "\n", "The *columns* in the given dataset are as follows:\n", "\n", "*Features:*\n", "- `GrLivArea`: Surface in square feet\n", "- `BedroomAbvGr`: Number of bedrooms\n", "- `KitchenAbvGr`: Number of kitchens\n", "- `OverallQual`: Overall quality (1: Very Poor / 10: Very Excellent)\n", "\n", "*Target:*\n", "- `SalePrice`: Sale price in USD" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "--------------" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## We can get a lot of insight without ML! 🤔" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Your turn! 🚀\n", "\n", "Let's start by **understanding the data we have** - how big the dataset is, which columns it contains, and so on:\n", "\n", "**💡 Tip:** remember to check the slides for the right methods ;)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2021-10-05T15:33:22.772877Z", "start_time": "2021-10-05T15:33:22.770199Z" } }, "outputs": [], "source": [ "# your code here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now try to **select only some columns** - say we only want to see `SalePrice`, or `GrLivArea` and `BedroomAbvGr`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2021-10-05T15:33:22.777984Z", "start_time": "2021-10-05T15:33:22.775707Z" } }, "outputs": [], "source": [ "# your code here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "-------" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Your turn - Now let's do some **visualization** 📊\n", "\n", "\n", "Let's follow some basic intuition - **does the surface (`GrLivArea`) affect the price of the house (`SalePrice`)❓**\n", "\n", "Let's use a [Seaborn Scatterplot](https://seaborn.pydata.org/generated/seaborn.scatterplot.html) - a function in the Seaborn library (which we imported above and shortened to `sns`) that plots each data point as a dot at its `x` and `y` values." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2021-10-05T15:33:22.782652Z", "start_time": "2021-10-05T15:33:22.780534Z" } }, "outputs": [], "source": [ "# your code here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Does the **overall quality (`OverallQual`) have an impact on the `SalePrice` ❓**\n", "\n", "**💡 Tip:** You can add a `hue` to the previous graph" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2021-10-05T15:33:22.788152Z", "start_time": "2021-10-05T15:33:22.785206Z" } }, "outputs": [], "source": [ "# your code here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's also look at the distribution of some features:\n", "\n", "- **What is the distribution of the number of bedrooms❓**\n", "- **What is the distribution of the number of kitchens❓**\n", "\n", "Seaborn `countplot` is here to help with that." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2021-10-05T15:33:22.792542Z", "start_time": "2021-10-05T15:33:22.790431Z" } }, "outputs": [], "source": [ "# your code here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "--------------" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Your first model - Linear Regression 📈" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**1.** First, let's create what will be our features and our target." 
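, "\n",
"\n",
"A minimal sketch of the idea, on a made-up table rather than `houses` (the column names `size`, `rooms` and `price` are hypothetical): selecting with a list of column names gives a `DataFrame` of features, while selecting a single name gives a `Series` for the target.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Toy data, invented for illustration only\n",
"toy = pd.DataFrame({\n",
"    'size': [50, 80, 120],\n",
"    'rooms': [1, 2, 3],\n",
"    'price': [100, 160, 240],\n",
"})\n",
"\n",
"toy_features = toy[['size', 'rooms']]  # list of names -> DataFrame (2-D)\n",
"toy_target = toy['price']              # single name -> Series (1-D)\n",
"\n",
"print(toy_features.shape, toy_target.shape)\n",
"```"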
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Create a variable `features` containing all features:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2021-10-05T15:33:22.799358Z", "start_time": "2021-10-05T15:33:22.796821Z" } }, "outputs": [], "source": [ "# your code here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Create a variable `target` containing the target:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2021-10-05T15:33:22.805867Z", "start_time": "2021-10-05T15:33:22.803321Z" } }, "outputs": [], "source": [ "# your code here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Feel free to check what is in your `features` and `target` below:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2021-10-05T15:33:22.810133Z", "start_time": "2021-10-05T15:33:22.807932Z" } }, "outputs": [], "source": [ "# your code here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "--------------" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**2.** Time to **import** the *sklearn* function to split our dataset into a train and a test set\n", "\n", "Try to find the right function [here](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.model_selection)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2021-10-05T15:33:22.815438Z", "start_time": "2021-10-05T15:33:22.812818Z" } }, "outputs": [], "source": [ "# your code here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "--------------" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**3.** Use this function to create **X_train, X_test, y_train, y_test**\n", "\n", "🚨 Set `random_state=42` as an argument of the function." 
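, "\n",
"\n",
"As a rough sketch of what such a split does, here is scikit-learn's `train_test_split` applied to made-up arrays (`toy_X` and `toy_y` are hypothetical, not the house data). `test_size=0.25` is written out explicitly for the example, and `random_state` fixes the shuffle so the same rows land in each set on every run.\n",
"\n",
"```python\n",
"import numpy as np\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"# Toy data, invented for illustration: 20 observations, 2 features\n",
"toy_X = np.arange(40).reshape(20, 2)\n",
"toy_y = np.arange(20)\n",
"\n",
"# The function returns four objects, in this order\n",
"toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(\n",
"    toy_X, toy_y, test_size=0.25, random_state=42\n",
")\n",
"\n",
"print(toy_X_train.shape, toy_X_test.shape)  # 15 rows for training, 5 for testing\n",
"```"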
] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2021-10-05T15:33:22.819733Z", "start_time": "2021-10-05T15:33:22.817695Z" } }, "outputs": [], "source": [ "# your code here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's check what is in your `X_train`, `X_test`, `y_train`, `y_test`:\n", "\n", "- What percentage of the observations were allocated to the train and the test set?\n", "- How many features in `X_train` and `X_test`?" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2021-10-05T15:33:22.823836Z", "start_time": "2021-10-05T15:33:22.821907Z" } }, "outputs": [], "source": [ "# your code here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "--------------" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**4.** Time to **import** the Linear Regression model\n", "\n", "Python libraries like [Scikit-learn](https://scikit-learn.org/0.21/modules/classes.html) make it super easy for people getting into Data Science and ML to experiment.\n", "\n", "The code is already in the library, it's just about **calling the right methods!** 🛠" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2021-10-05T15:33:22.828261Z", "start_time": "2021-10-05T15:33:22.826171Z" } }, "outputs": [], "source": [ "# your code here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now to **initialize** the model. Store it in a variable `model`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2021-10-05T15:33:22.832642Z", "start_time": "2021-10-05T15:33:22.830478Z" } }, "outputs": [], "source": [ "# your code here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "--------------" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**5. Train** the model on the **training set**. 
\n", "\n", "This is the process where the Linear Regression model looks for the line (with several features, a hyperplane) that best fits the points in the training set. This is the part where the computer is hard at work **learning**! 🤖" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2021-10-05T15:33:22.836752Z", "start_time": "2021-10-05T15:33:22.834826Z" } }, "outputs": [], "source": [ "# your code here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "--------------" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**6. Evaluate** the performance of the model on the **test set**.\n", "\n", "Models can have different default scoring metrics. Linear Regression by default uses a metric called `R-squared`: the proportion of the variation in the target (`SalePrice`) that can be explained by the features (`GrLivArea`, `BedroomAbvGr`, `KitchenAbvGr` and `OverallQual`)." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2021-10-05T15:33:22.841038Z", "start_time": "2021-10-05T15:33:22.838987Z" } }, "outputs": [], "source": [ "# your code here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "⚠️ **Careful not to confuse this with accuracy**. The number above shows that **\"the inputs we have can explain this proportion of the variation in the sale price\"**, which is decent considering we did it with just a few lines of code!" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Let's **compare** this score to the one the model gets on the **training set**:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2021-10-05T15:33:22.845445Z", "start_time": "2021-10-05T15:33:22.843133Z" } }, "outputs": [], "source": [ "# your code here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "👉 You should get a slightly higher score on the training set, which is to be expected in general.\n", "\n", "The good news is that the 2 scores are relatively close to each other, which shows that we achieved a **good balance**: our model **generalises well to new observations**, explaining more than 70% of the variation in sale price.\n", "\n", "**Splitting the dataset into a training set and a test set is essential in Machine Learning**. It allows us to **identify**:\n", "- **Overfitting**: we would see a large difference between the 2 scores. The model would be very good on the data it trained on, but would do poorly on the test set.\n", "- **Underfitting**: we would have a bad score on both the training data and the test data. In this case, a reason could be that the model is not complex enough to capture the patterns in the data.\n", "\n", "In our case, we have a **robust model** that does well on new observations 💪. We can now use it to make predictions on new houses with confidence." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "--------------" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**7.** Let's **predict** the price of a new house 🔮\n", "\n", "This new house has the following characteristics:\n", "- **Surface** of 3,000 square feet\n", "- 3 **bedrooms**\n", "- 1 **kitchen**\n", "- **Overall quality** score of 5\n", "\n", "**7.1** Start by creating a variable `new_house` in which you will store those characteristics. 
Make sure to use the right format to be able to make a prediction.\n", "\n", "*Note: here is a reminder of the columns in the table:* `['GrLivArea', 'BedroomAbvGr', 'KitchenAbvGr', 'OverallQual']`" ] }, { "cell_type": "markdown", "metadata": { "ExecuteTime": { "end_time": "2021-10-05T15:20:28.119655Z", "start_time": "2021-10-05T15:20:28.115920Z" } }, "source": [ "\n", "
**💡 Hint:** `new_house` should be a list of lists:\n", "\n", "    [[surface, nb_bedrooms, nb_kitchens, overall_quality]]\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2021-10-05T15:33:22.851177Z", "start_time": "2021-10-05T15:33:22.847800Z" }, "scrolled": true }, "outputs": [], "source": [ "# your code here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**7.2** Now use the right method to make a prediction using the model we just trained:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2021-10-05T15:33:22.856512Z", "start_time": "2021-10-05T15:33:22.854273Z" } }, "outputs": [], "source": [ "# your code here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's say we have another house with the same characteristics, except for the overall quality score being 9. \n", "\n", "**What would be the price of this house❓**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2021-10-05T15:33:22.861202Z", "start_time": "2021-10-05T15:33:22.859069Z" } }, "outputs": [], "source": [ "# your code here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "--------------" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**8.** **Explaining** the model\n", "\n", "Linear Regression is a [linear model](https://scikit-learn.org/stable/modules/linear_model.html), so it's explainability is quite high." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**8.1.** We can check the `coef_` or the **coefficients** of the model. These explain how much the target (`SalePrice`) changes with a change of `1` in each of the features (inputs), while holding other features constant." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2021-10-05T15:33:22.866447Z", "start_time": "2021-10-05T15:33:22.864152Z" } }, "outputs": [], "source": [ "# your code here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "🤔 We'd need to check the column order again to know which number belongs to which input. But **we've got you covered!** Run the cell below:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2021-10-05T15:33:23.008881Z", "start_time": "2021-10-05T15:33:22.874738Z" } }, "outputs": [], "source": [ "# Pair each feature name with its coefficient\n", "pd.DataFrame({'feature': features.columns, 'coefficient': np.ravel(model.coef_)})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**8.2** The other thing we can check is the **intercept** of the model. This is the predicted target (`SalePrice`) when all inputs are 0, i.e. the model's prediction for a hypothetical house with a surface of 0 square feet, no bedrooms, no kitchens and an overall quality of 0:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2021-10-05T15:33:23.016242Z", "start_time": "2021-10-05T15:33:21.643Z" } }, "outputs": [], "source": [ "# your code here" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": true }, "varInspector": { "cols": { "lenName": 16, "lenType": 16, "lenVar": 40 }, "kernels_config": { "python": { "delete_cmd_postfix": "", 
"delete_cmd_prefix": "del ", "library": "var_list.py", "varRefreshCmd": "print(var_dic_list())" }, "r": { "delete_cmd_postfix": ") ", "delete_cmd_prefix": "rm(", "library": "var_list.r", "varRefreshCmd": "cat(var_dic_list()) " } }, "types_to_exclude": [ "module", "function", "builtin_function_or_method", "instance", "_Feature" ], "window_display": false } }, "nbformat": 4, "nbformat_minor": 4 }