{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Predicting House Price 🏠"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "🎯 In this challenge, you will **predict the sale price** of houses (`SalePrice`) according to the *surface*, the *number of bedrooms* or the *overall quality*."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Load Python Libraries"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Run the cell below to `import` some Python libraries - these will be our tools for working with data 📊\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-10-05T15:33:22.750076Z",
     "start_time": "2021-10-05T15:33:21.543512Z"
    }
   },
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "import seaborn as sns"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "--------------"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Load the Dataset"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "👇 Run the cell below to load the `house_prices.csv` dataset into this notebook as a pandas `DataFrame`, and display its first 5 rows."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "*Note: the datasets has been cleaned and federated for learning purposes*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-10-20T16:48:04.958142Z",
     "start_time": "2021-10-20T16:48:04.425006Z"
    },
    "scrolled": false
   },
   "outputs": [],
   "source": [
    "houses = pd.read_csv('https://storage.googleapis.com/introduction-to-data-science/house-prices.csv')\n",
    "houses.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This dataset contains information about houses sold.\n",
    "\n",
    "The *columns* in the given dataset are as follows:\n",
    "\n",
    "*Features:*\n",
    "- `GrLivArea`: Surface in squared feet\n",
    "- `BedroomAbvGr`: Number of bedrooms\n",
    "- `KitchenAbvGr`: Number of kitchens\n",
    "- `OverallQual`: Overall quality (1: Very Poor / 10: Very Excellent)\n",
    "\n",
    "*Target:*\n",
    "- `SalePrice`: Sale price in USD"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "--------------"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## We can get a lot of insight without ML! 🤔"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Your turn! 🚀\n",
    "\n",
    "Let's start by **understanding the data we have** - how big is the dataset, what is the information (columns) we have and so on:\n",
    "\n",
    "**💡 Tip:** remember to check the slides for the right methods ;)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-10-05T15:33:22.772877Z",
     "start_time": "2021-10-05T15:33:22.770199Z"
    }
   },
   "outputs": [],
   "source": [
    "# your code here"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now try to **separate only some columns** - say we only want to see `SalePrice`, or `GrLivArea` and `BedroomAbvGr`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-10-05T15:33:22.777984Z",
     "start_time": "2021-10-05T15:33:22.775707Z"
    }
   },
   "outputs": [],
   "source": [
    "# your code here"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "-------"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Your turn - Now let's do some **visualization** 📊. \n",
    "\n",
    "\n",
    "Let's follow some basic intuition - **does the surface (`GrLivArea`) affects the price of the house(`SalePrice`)❓**\n",
    "\n",
    "Let's use a [Seaborn Scatterplot](https://seaborn.pydata.org/generated/seaborn.scatterplot.html) - a method inside the Seaborn library (which we imported above and shortened to `sns`) that gives us a graph with data points as dots with `x` and `y` values."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-10-05T15:33:22.782652Z",
     "start_time": "2021-10-05T15:33:22.780534Z"
    }
   },
   "outputs": [],
   "source": [
    "# your code here"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Does the **overall quality (`OverallQual`) has an impact on the `SalePrice` ❓**\n",
    "\n",
    "**💡Tip:** You can add a `hue` to the previous graph"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-10-05T15:33:22.788152Z",
     "start_time": "2021-10-05T15:33:22.785206Z"
    }
   },
   "outputs": [],
   "source": [
    "# your code here"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's also understand the repartition we have for some features:\n",
    "\n",
    "- **What is the repartition of the Number of bedrooms❓**\n",
    "- **What is the repartition of the Number of kitchens❓**\n",
    "\n",
    "Seaborn `countplot` is here to help with that."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-10-05T15:33:22.792542Z",
     "start_time": "2021-10-05T15:33:22.790431Z"
    }
   },
   "outputs": [],
   "source": [
    "# your code here"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "--------------"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Your first model - Linear Regression 📈"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**1.** First, let's create what will be our features and our target."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Create a variable `features` containing all features:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-10-05T15:33:22.799358Z",
     "start_time": "2021-10-05T15:33:22.796821Z"
    }
   },
   "outputs": [],
   "source": [
    "# your code here"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Create a variable `target` containing the target:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-10-05T15:33:22.805867Z",
     "start_time": "2021-10-05T15:33:22.803321Z"
    }
   },
   "outputs": [],
   "source": [
    "# your code here"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Feel free to check what is in your `features` and `target` below:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-10-05T15:33:22.810133Z",
     "start_time": "2021-10-05T15:33:22.807932Z"
    }
   },
   "outputs": [],
   "source": [
    "# your code here"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "--------------"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**2.** Time to **import** the *sklearn* function to split our dataset into a train and a test set\n",
    "\n",
    "Try to find the right function [here](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.model_selection)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-10-05T15:33:22.815438Z",
     "start_time": "2021-10-05T15:33:22.812818Z"
    }
   },
   "outputs": [],
   "source": [
    "# your code here"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "--------------"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**3.** Use this function to create **X_train, X_test, y_train, y_test**\n",
    "\n",
    "🚨 Set `random_state=42` as an argument of the function."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-10-05T15:33:22.819733Z",
     "start_time": "2021-10-05T15:33:22.817695Z"
    }
   },
   "outputs": [],
   "source": [
    "# your code here"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's check what is in your `X_train`, `X_test`, `y_train`, `y_test`:\n",
    "\n",
    "- What percentage of the observations were allocated to the train and the test set?\n",
    "- How many features in `X_train` and `X_test`?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-10-05T15:33:22.823836Z",
     "start_time": "2021-10-05T15:33:22.821907Z"
    }
   },
   "outputs": [],
   "source": [
    "# your code here"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "--------------"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**4.** Time to **import** the Linear Regression model\n",
    "\n",
    "Python libraries like [Scikit-learn](https://scikit-learn.org/0.21/modules/classes.html) make it super easy for people getting into Data Science and ML to experiment.\n",
    "\n",
    "The code is already in the library, it's just about **calling the right methods!** 🛠"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-10-05T15:33:22.828261Z",
     "start_time": "2021-10-05T15:33:22.826171Z"
    }
   },
   "outputs": [],
   "source": [
    "# your code here"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now to **initialize** the model. Store it in a variable `model`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-10-05T15:33:22.832642Z",
     "start_time": "2021-10-05T15:33:22.830478Z"
    }
   },
   "outputs": [],
   "source": [
    "# your code here"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "--------------"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**5. Train** the model on the **training set**. \n",
    "\n",
    "This is the process where the Linear Regression model looks for a line that best fits all the points in the dataset. This is the part where the computer is hard at work **learning**! 🤖"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-10-05T15:33:22.836752Z",
     "start_time": "2021-10-05T15:33:22.834826Z"
    }
   },
   "outputs": [],
   "source": [
    "# your code here"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "--------------"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**6. Evaluate** the performance of the model on the **test set**.\n",
    "\n",
    "Models can have different default scoring metrics. Linear Regression by default uses something called `R-squared` - a metric that shows how much of change in the target (`SalePrice`) can be explained by the changes in features (`GrLivArea`, `BedroomAbvGr`, `KitchenAbvGr` and `OverallQual`)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-10-05T15:33:22.841038Z",
     "start_time": "2021-10-05T15:33:22.838987Z"
    }
   },
   "outputs": [],
   "source": [
    "# your code here"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "⚠️ **Careful not to confuse this with accuracy**. The above number is shows that **\"the inputs we have can help us predict this percentage of change in the depreciation\"** Which is decent considering we did with just a few lines of code! "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's **compare** this score to the one the model gets on the **training set**:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-10-05T15:33:22.845445Z",
     "start_time": "2021-10-05T15:33:22.843133Z"
    }
   },
   "outputs": [],
   "source": [
    "# your code here"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "👉 You should get a slightly higher score on the training set, which is to be expected in general.\n",
    "\n",
    "The good news is that the 2 scores are relatively close to each other, which shows that we achieved a **good balance**, our model **generalises well to new observations**, explaining more than 70% of change in depreciation.\n",
    "\n",
    "**Splitting the dataset into a training set and a test set is essential in Machine Learning**. It allows us to **identify**:\n",
    "- **Overfitting**: we would see a large difference between the 2 scores. The model would be very good on the data it trained on, but would be doing poorly on the test set.\n",
    "- **Underfitting**: we would have bad score on both the training data and on the test data. In this case, a reason could be that the model is not complex enough to capture the patterns in the data.\n",
    "\n",
    "In our case, we have a **robust model** that does well on new observations💪. We can now use it to make predictions on new houses with confidence."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "--------------"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**7.** Let's **predict** the price of a new house 🔮\n",
    "\n",
    "This new house has a the following characteristics:\n",
    "- **Surface** of 3,000 squared feet\n",
    "- 3 **bedrooms**\n",
    "- 1 **kitchen**\n",
    "- **Overall quality** score of 5\n",
    "\n",
    "**7.1** Start by creating variable `new_house` in which you will store those characteristics. Make sure to use the right format to be able to make a prediction.\n",
    "\n",
    "*Note: here is a reminder of the columns in the table:* `['GrLivArea', 'BedroomAbvGr', 'KitchenAbvGr', 'OverallQual']`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-10-05T15:20:28.119655Z",
     "start_time": "2021-10-05T15:20:28.115920Z"
    }
   },
   "source": [
    "\n",
    "<details>\n",
    "    <summary>💡Hint</summary>\n",
    "<p> \n",
    "<pre>\n",
    "`new_house` should be a `list of list`:\n",
    "    [[surface, nb_bedrooms, nb_kitchens, overall_quality]]\n",
    "</pre>\n",
    "</details>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-10-05T15:33:22.851177Z",
     "start_time": "2021-10-05T15:33:22.847800Z"
    },
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "# your code here"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**7.2** Now use the right method to make a prediction using the model we just trained:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-10-05T15:33:22.856512Z",
     "start_time": "2021-10-05T15:33:22.854273Z"
    }
   },
   "outputs": [],
   "source": [
    "# your code here"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now let's say we have another house with the same characteristics, except for the overall quality score being 9. \n",
    "\n",
    "**What would be the price of this house❓**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-10-05T15:33:22.861202Z",
     "start_time": "2021-10-05T15:33:22.859069Z"
    }
   },
   "outputs": [],
   "source": [
    "# your code here"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "--------------"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**8.** **Explaining** the model\n",
    "\n",
    "Linear Regression is a [linear model](https://scikit-learn.org/stable/modules/linear_model.html), so it's explainability is quite high."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**8.1.** We can check the `coef_` or the **coefficients** of the model. These explain how much the target (`SalePrice`) changes with a change of `1` in each of the features (inputs), while holding other features constant."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-10-05T15:33:22.866447Z",
     "start_time": "2021-10-05T15:33:22.864152Z"
    }
   },
   "outputs": [],
   "source": [
    "# your code here"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "🤔 We'd need to check the column order again, to know which number is which input. But, **we got you covered!** Run the cell below:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-10-05T15:33:23.008881Z",
     "start_time": "2021-10-05T15:33:22.874738Z"
    }
   },
   "outputs": [],
   "source": [
    "pd.concat([pd.DataFrame(features.columns),pd.DataFrame(np.transpose(model.coef_))], axis = 1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**8.2** The other thing we can check is the **intercept** of the model. This is the target (`SalePrice`) for when all inputs are 0. So this should be close to a new house with a surface of 0 squared feet, no bedrooms, no kitchens and an overall quality of 0:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-10-05T15:33:23.016242Z",
     "start_time": "2021-10-05T15:33:21.643Z"
    }
   },
   "outputs": [],
   "source": [
    "# your code here"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.5"
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": true,
   "sideBar": true,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {},
   "toc_section_display": true,
   "toc_window_display": true
  },
  "varInspector": {
   "cols": {
    "lenName": 16,
    "lenType": 16,
    "lenVar": 40
   },
   "kernels_config": {
    "python": {
     "delete_cmd_postfix": "",
     "delete_cmd_prefix": "del ",
     "library": "var_list.py",
     "varRefreshCmd": "print(var_dic_list())"
    },
    "r": {
     "delete_cmd_postfix": ") ",
     "delete_cmd_prefix": "rm(",
     "library": "var_list.r",
     "varRefreshCmd": "cat(var_dic_list()) "
    }
   },
   "types_to_exclude": [
    "module",
    "function",
    "builtin_function_or_method",
    "instance",
    "_Feature"
   ],
   "window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}