{ "metadata": { "kernelspec": { "name": "python", "display_name": "Pyolite", "language": "python" }, "language_info": { "codemirror_mode": { "name": "python", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8" } }, "nbformat_minor": 4, "nbformat": 4, "cells": [ { "cell_type": "markdown", "source": "


\n", "metadata": {} }, { "cell_type": "markdown", "source": "# **Regression Trees**\n", "metadata": {} }, { "cell_type": "markdown", "source": "Estimated time needed: **20** minutes\n", "metadata": {} }, { "cell_type": "markdown", "source": "In this lab, you will learn how to implement regression trees using scikit-learn. We will show which parameters are important, how to train a regression tree, and finally how to determine our regression tree's accuracy.\n", "metadata": {} }, { "cell_type": "markdown", "source": "## Objectives\n", "metadata": {} }, { "cell_type": "markdown", "source": "After completing this lab, you will be able to:\n", "metadata": {} }, { "cell_type": "markdown", "source": "* Train a Regression Tree\n* Evaluate a Regression Tree's Performance\n", "metadata": {} }, { "cell_type": "markdown", "source": "***\n", "metadata": {} }, { "cell_type": "markdown", "source": "## Setup\n", "metadata": {} }, { "cell_type": "markdown", "source": "For this lab, we are going to be using Python and several Python libraries. Some of these libraries might be installed in your lab environment or in SN Labs. Others may need to be installed by you. 
The cells below will install these libraries when executed.\n", "metadata": {} }, { "cell_type": "code", "source": "from js import fetch\nimport io\n\nURL = \"https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ML0101EN-SkillsNetwork/labs/Module%203/data/real_estate_data.csv\"\nresp = await fetch(URL)\nregression_tree_data = io.BytesIO((await resp.arrayBuffer()).to_py())", "metadata": { "trusted": true }, "execution_count": 1, "outputs": [] }, { "cell_type": "code", "source": "import piplite\nawait piplite.install(['pandas'])\nawait piplite.install(['numpy'])\nawait piplite.install(['scikit-learn'])", "metadata": { "trusted": true }, "execution_count": 2, "outputs": [] }, { "cell_type": "code", "source": "# Pandas will allow us to create a dataframe of the data so it can be used and manipulated\nimport pandas as pd\n# Regression Tree Algorithm\nfrom sklearn.tree import DecisionTreeRegressor\n# Split our data into training and testing sets\nfrom sklearn.model_selection import train_test_split", "metadata": { "trusted": true }, "execution_count": 3, "outputs": [] }, { "cell_type": "markdown", "source": "## About the Dataset\n", "metadata": {} }, { "cell_type": "markdown", "source": "Imagine you are a data scientist working for a real estate company that is planning to invest in Boston real estate. 
You have collected information about various areas of Boston and are tasked with creating a model that can predict the median price of houses in each area so it can be used to make offers.\n\nThe dataset has information on areas/towns, not individual houses. The features are:\n\nCRIM: Crime per capita\n\nZN: Proportion of residential land zoned for lots over 25,000 sq.ft.\n\nINDUS: Proportion of non-retail business acres per town\n\nCHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)\n\nNOX: Nitric oxides concentration (parts per 10 million)\n\nRM: Average number of rooms per dwelling\n\nAGE: Proportion of owner-occupied units built prior to 1940\n\nDIS: Weighted distances to five Boston employment centers\n\nRAD: Index of accessibility to radial highways\n\nTAX: Full-value property-tax rate per $10,000\n\nPTRATIO: Pupil-teacher ratio by town\n\nLSTAT: Percent lower status of the population\n\nMEDV: Median value of owner-occupied homes in $1000s\n", "metadata": {} }, { "cell_type": "markdown", "source": "## Read the Data\n", "metadata": { "tags": [] } }, { "cell_type": "markdown", "source": "Let's read in the data we have downloaded.\n", "metadata": {} }, { "cell_type": "code", "source": "data = pd.read_csv(regression_tree_data)", "metadata": { "trusted": true }, "execution_count": 4, "outputs": [] }, { "cell_type": "code", "source": "data.head()", "metadata": { "trusted": true }, "execution_count": 5, "outputs": [ { "execution_count": 5, "output_type": "execute_result", "data": { "text/plain": " CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO \\\n0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1 296 15.3 \n1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2 242 17.8 \n2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2 242 17.8 \n3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3 222 18.7 \n4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3 222 18.7 \n\n LSTAT MEDV \n0 4.98 24.0 \n1 9.14 21.6 \n2 4.03 34.7 \n3 2.94 33.4 \n4 NaN 36.2 ", "text/html": "
<table>
<thead>
<tr><th></th><th>CRIM</th><th>ZN</th><th>INDUS</th><th>CHAS</th><th>NOX</th><th>RM</th><th>AGE</th><th>DIS</th><th>RAD</th><th>TAX</th><th>PTRATIO</th><th>LSTAT</th><th>MEDV</th></tr>
</thead>
<tbody>
<tr><th>0</th><td>0.00632</td><td>18.0</td><td>2.31</td><td>0.0</td><td>0.538</td><td>6.575</td><td>65.2</td><td>4.0900</td><td>1</td><td>296</td><td>15.3</td><td>4.98</td><td>24.0</td></tr>
<tr><th>1</th><td>0.02731</td><td>0.0</td><td>7.07</td><td>0.0</td><td>0.469</td><td>6.421</td><td>78.9</td><td>4.9671</td><td>2</td><td>242</td><td>17.8</td><td>9.14</td><td>21.6</td></tr>
<tr><th>2</th><td>0.02729</td><td>0.0</td><td>7.07</td><td>0.0</td><td>0.469</td><td>7.185</td><td>61.1</td><td>4.9671</td><td>2</td><td>242</td><td>17.8</td><td>4.03</td><td>34.7</td></tr>
<tr><th>3</th><td>0.03237</td><td>0.0</td><td>2.18</td><td>0.0</td><td>0.458</td><td>6.998</td><td>45.8</td><td>6.0622</td><td>3</td><td>222</td><td>18.7</td><td>2.94</td><td>33.4</td></tr>
<tr><th>4</th><td>0.06905</td><td>0.0</td><td>2.18</td><td>0.0</td><td>0.458</td><td>7.147</td><td>54.2</td><td>6.0622</td><td>3</td><td>222</td><td>18.7</td><td>NaN</td><td>36.2</td></tr>
</tbody>
</table>
" }, "metadata": {} } ] }, { "cell_type": "markdown", "source": "Now let's look at the size of our data: there are 506 rows and 13 columns.\n", "metadata": {} }, { "cell_type": "code", "source": "data.shape", "metadata": { "trusted": true }, "execution_count": 6, "outputs": [ { "execution_count": 6, "output_type": "execute_result", "data": { "text/plain": "(506, 13)" }, "metadata": {} } ] }, { "cell_type": "markdown", "source": "Most of the data is valid, but some rows have missing values, which we will deal with in pre-processing.\n", "metadata": {} }, { "cell_type": "code", "source": "data.isna().sum()", "metadata": { "trusted": true }, "execution_count": 7, "outputs": [ { "execution_count": 7, "output_type": "execute_result", "data": { "text/plain": "CRIM 20\nZN 20\nINDUS 20\nCHAS 20\nNOX 0\nRM 0\nAGE 20\nDIS 0\nRAD 0\nTAX 0\nPTRATIO 0\nLSTAT 20\nMEDV 0\ndtype: int64" }, "metadata": {} } ] }, { "cell_type": "markdown", "source": "## Data Pre-Processing\n", "metadata": {} }, { "cell_type": "markdown", "source": "First, let's drop the rows with missing values, because we have enough data in our dataset.\n", "metadata": {} }, { "cell_type": "code", "source": "data.dropna(inplace=True)", "metadata": { "trusted": true }, "execution_count": 8, "outputs": [] }, { "cell_type": "markdown", "source": "Now we can see that our dataset has no missing values.\n", "metadata": {} }, { "cell_type": "code", "source": "data.isna().sum()", "metadata": { "trusted": true }, "execution_count": 9, "outputs": [ { "execution_count": 9, "output_type": "execute_result", "data": { "text/plain": "CRIM 0\nZN 0\nINDUS 0\nCHAS 0\nNOX 0\nRM 0\nAGE 0\nDIS 0\nRAD 0\nTAX 0\nPTRATIO 0\nLSTAT 0\nMEDV 0\ndtype: int64" }, "metadata": {} } ] }, { "cell_type": "markdown", "source": "Let's split the dataset into our features and what we are predicting (the target).\n", "metadata": {} }, { "cell_type": "code", "source": "X = data.drop(columns=[\"MEDV\"])\nY = data[\"MEDV\"]", "metadata": { "trusted": true }, 
"execution_count": 10, "outputs": [] }, { "cell_type": "code", "source": "X.head()", "metadata": { "trusted": true }, "execution_count": 11, "outputs": [ { "execution_count": 11, "output_type": "execute_result", "data": { "text/plain": " CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO \\\n0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1 296 15.3 \n1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2 242 17.8 \n2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2 242 17.8 \n3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3 222 18.7 \n5 0.02985 0.0 2.18 0.0 0.458 6.430 58.7 6.0622 3 222 18.7 \n\n LSTAT \n0 4.98 \n1 9.14 \n2 4.03 \n3 2.94 \n5 5.21 ", "text/html": "
<table>
<thead>
<tr><th></th><th>CRIM</th><th>ZN</th><th>INDUS</th><th>CHAS</th><th>NOX</th><th>RM</th><th>AGE</th><th>DIS</th><th>RAD</th><th>TAX</th><th>PTRATIO</th><th>LSTAT</th></tr>
</thead>
<tbody>
<tr><th>0</th><td>0.00632</td><td>18.0</td><td>2.31</td><td>0.0</td><td>0.538</td><td>6.575</td><td>65.2</td><td>4.0900</td><td>1</td><td>296</td><td>15.3</td><td>4.98</td></tr>
<tr><th>1</th><td>0.02731</td><td>0.0</td><td>7.07</td><td>0.0</td><td>0.469</td><td>6.421</td><td>78.9</td><td>4.9671</td><td>2</td><td>242</td><td>17.8</td><td>9.14</td></tr>
<tr><th>2</th><td>0.02729</td><td>0.0</td><td>7.07</td><td>0.0</td><td>0.469</td><td>7.185</td><td>61.1</td><td>4.9671</td><td>2</td><td>242</td><td>17.8</td><td>4.03</td></tr>
<tr><th>3</th><td>0.03237</td><td>0.0</td><td>2.18</td><td>0.0</td><td>0.458</td><td>6.998</td><td>45.8</td><td>6.0622</td><td>3</td><td>222</td><td>18.7</td><td>2.94</td></tr>
<tr><th>5</th><td>0.02985</td><td>0.0</td><td>2.18</td><td>0.0</td><td>0.458</td><td>6.430</td><td>58.7</td><td>6.0622</td><td>3</td><td>222</td><td>18.7</td><td>5.21</td></tr>
</tbody>
</table>
" }, "metadata": {} } ] }, { "cell_type": "code", "source": "Y.head()", "metadata": { "trusted": true }, "execution_count": 12, "outputs": [ { "execution_count": 12, "output_type": "execute_result", "data": { "text/plain": "0 24.0\n1 21.6\n2 34.7\n3 33.4\n5 28.7\nName: MEDV, dtype: float64" }, "metadata": {} } ] }, { "cell_type": "markdown", "source": "Finally, let's split our data into training and testing datasets using `train_test_split` from `sklearn.model_selection`.\n", "metadata": {} }, { "cell_type": "code", "source": "X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=.2, random_state=1)", "metadata": { "trusted": true }, "execution_count": 13, "outputs": [] }, { "cell_type": "markdown", "source": "## Create Regression Tree\n", "metadata": {} }, { "cell_type": "markdown", "source": "Regression trees are implemented using `DecisionTreeRegressor` from `sklearn.tree`.\n\nThe important parameters of `DecisionTreeRegressor` are:\n\n`criterion`: {\"mse\", \"friedman_mse\", \"mae\", \"poisson\"} - The function used to measure error. In scikit-learn 1.0 and later, `mse` and `mae` are named `squared_error` and `absolute_error`; the old names are deprecated and are removed in version 1.2.\n\n`max_depth` - The maximum depth the tree can reach\n\n`min_samples_split` - The minimum number of samples required to split a node\n\n`min_samples_leaf` - The minimum number of samples that a leaf can contain\n\n`max_features`: {\"auto\", \"sqrt\", \"log2\"} - The number of features we examine when looking for the best one; limiting it speeds up training\n", "metadata": {} }, { "cell_type": "markdown", "source": "First, let's create a `DecisionTreeRegressor` object, setting the `criterion` parameter to `mse` for Mean Squared Error.\n", "metadata": {} }, { "cell_type": "code", "source": "regression_tree = DecisionTreeRegressor(criterion = \"mse\")", "metadata": { "trusted": true }, "execution_count": 14, "outputs": [] }, { "cell_type": "markdown", "source": "## Training\n", "metadata": {} }, { "cell_type": "markdown", "source": "Now let's train our model using the `fit` method on the `DecisionTreeRegressor` object, providing our 
training data\n", "metadata": {} }, { "cell_type": "code", "source": "regression_tree.fit(X_train, Y_train)", "metadata": { "trusted": true }, "execution_count": 15, "outputs": [ { "name": "stderr", "text": "/lib/python3.10/site-packages/sklearn/tree/_classes.py:359: FutureWarning: Criterion 'mse' was deprecated in v1.0 and will be removed in version 1.2. Use `criterion='squared_error'` which is equivalent.\n warnings.warn(\n", "output_type": "stream" }, { "execution_count": 15, "output_type": "execute_result", "data": { "text/plain": "DecisionTreeRegressor(criterion='mse')" }, "metadata": {} } ] }, { "cell_type": "markdown", "source": "## Evaluation\n", "metadata": {} }, { "cell_type": "markdown", "source": "To evaluate our model, we will use the `score` method of the `DecisionTreeRegressor` object, providing our testing data. The number it returns is $R^2$, the coefficient of determination.\n", "metadata": {} }, { "cell_type": "code", "source": "regression_tree.score(X_test, Y_test)", "metadata": { "trusted": true }, "execution_count": 16, "outputs": [ { "execution_count": 16, "output_type": "execute_result", "data": { "text/plain": "0.852006811553053" }, "metadata": {} } ] }, { "cell_type": "markdown", "source": "We can also find the mean absolute error on our testing set, which is the average error in the predicted median home value, in dollars.\n", "metadata": {} }, { "cell_type": "code", "source": "prediction = regression_tree.predict(X_test)\n\nprint(\"$\",(prediction - Y_test).abs().mean()*1000)", "metadata": { "trusted": true }, "execution_count": 17, "outputs": [ { "name": "stdout", "text": "$ 2715.189873417721\n", "output_type": "stream" } ] }, { "cell_type": "markdown", "source": "## Exercise\n", "metadata": {} }, { "cell_type": "markdown", "source": "Train a regression tree using the `criterion` `mae`, then report its $R^2$ value and average error.\n", "metadata": {} }, { "cell_type": "code", "source": "regression_tree = DecisionTreeRegressor(criterion = 
\"mae\")\n\nregression_tree.fit(X_train, Y_train)\n\nprint(regression_tree.score(X_test, Y_test))\n\nprediction = regression_tree.predict(X_test)\n\nprint(\"$\",(prediction - Y_test).abs().mean()*1000)", "metadata": { "trusted": true }, "execution_count": 18, "outputs": [ { "name": "stderr", "text": "/lib/python3.10/site-packages/sklearn/tree/_classes.py:366: FutureWarning: Criterion 'mae' was deprecated in v1.0 and will be removed in version 1.2. Use `criterion='absolute_error'` which is equivalent.\n warnings.warn(\n", "output_type": "stream" }, { "name": "stdout", "text": "0.8720206502582719\n$ 2537.9746835443034\n", "output_type": "stream" } ] }, { "cell_type": "markdown", "source": "
Click here for the solution\n\n```python\nregression_tree = DecisionTreeRegressor(criterion = \"mae\")\n\nregression_tree.fit(X_train, Y_train)\n\nprint(regression_tree.score(X_test, Y_test))\n\nprediction = regression_tree.predict(X_test)\n\nprint(\"$\",(prediction - Y_test).abs().mean()*1000)\n\n```\n\n
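The two numbers reported in the solution can also be worked out by hand, which may make them easier to interpret. The block below is a minimal sketch in plain Python, not part of the lab's dataset or of scikit-learn: the values in `y_true` and `y_pred` are made up for illustration, and `mean_abs_error` and `r_squared` are hypothetical helper names. It assumes the usual definition of $R^2$ as one minus the ratio of the residual sum of squares to the total sum of squares.

```python
# Hypothetical target values and predictions, in $1000s like MEDV.
# They are invented for illustration and do not come from the lab's data.
y_true = [24.0, 21.6, 34.7, 33.4]
y_pred = [22.0, 23.6, 33.7, 30.4]


def mean_abs_error(actual, predicted):
    # Average size of the prediction errors, ignoring their sign.
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def r_squared(actual, predicted):
    # 1 - (residual sum of squares / total sum of squares around the mean).
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


mae = mean_abs_error(y_true, y_pred)
r2 = r_squared(y_true, y_pred)

print('R^2:', round(r2, 3))
# Multiply by 1000 because the targets are expressed in $1000s.
print('$', round(mae * 1000, 2))
```

An $R^2$ of 1 would mean the predictions match the targets exactly, while a value near 0 would mean the model does no better than always predicting the mean of the targets.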
\n", "metadata": {} }, { "cell_type": "markdown", "source": "## Authors\n", "metadata": {} }, { "cell_type": "markdown", "source": "Azim Hirjani\n", "metadata": {} }, { "cell_type": "markdown", "source": "## Change Log\n", "metadata": {} }, { "cell_type": "markdown", "source": "| Date (YYYY-MM-DD) | Version | Changed By | Change Description |\n| ----------------- | ------- | ---------- | ----------------------- |\n| 2020-07-20 | 0.2 | Azim | Modified Multiple Areas |\n| 2020-07-17 | 0.1 | Azim | Created Lab Template |\n", "metadata": {} }, { "cell_type": "markdown", "source": "Copyright © 2020 IBM Corporation. All rights reserved.\n", "metadata": {} } ] }