{
  "metadata": {
    "kernelspec": {
      "name": "python",
      "display_name": "Pyolite",
      "language": "python"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "python",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.8"
    }
  },
  "nbformat_minor": 4,
  "nbformat": 4,
  "cells": [
    {
      "cell_type": "markdown",
      "source": "<p style=\"text-align:center\">\n    <a href=\"https://skills.network/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkML0101ENSkillsNetwork20718538-2022-01-01\" target=\"_blank\">\n    <img src=\"https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png\" width=\"200\" alt=\"Skills Network Logo\"  />\n    </a>\n</p>\n",
      "metadata": {}
    },
    {
      "cell_type": "markdown",
      "source": "# **Regression Trees**\n",
      "metadata": {}
    },
    {
      "cell_type": "markdown",
      "source": "Estimated time needed: **20** minutes\n",
      "metadata": {}
    },
    {
      "cell_type": "markdown",
      "source": "In this lab you will learn how to implement regression trees using ScikitLearn. We will show what parameters are important, how to train a regression tree, and finally how to determine our regression trees accuracy.\n",
      "metadata": {}
    },
    {
      "cell_type": "markdown",
      "source": "## Objectives\n",
      "metadata": {}
    },
    {
      "cell_type": "markdown",
      "source": "After completing this lab you will be able to:\n",
      "metadata": {}
    },
    {
      "cell_type": "markdown",
      "source": "*   Train a Regression Tree\n*   Evaluate a Regression Trees Performance\n",
      "metadata": {}
    },
    {
      "cell_type": "markdown",
      "source": "***\n",
      "metadata": {}
    },
    {
      "cell_type": "markdown",
      "source": "## Setup\n",
      "metadata": {}
    },
    {
      "cell_type": "markdown",
      "source": "For this lab, we are going to be using Python and several Python libraries. Some of these libraries might be installed in your lab environment or in SN Labs. Others may need to be installed by you. The cells below will install these libraries when executed.\n",
      "metadata": {}
    },
    {
      "cell_type": "code",
      "source": "from js import fetch\nimport io\n\nURL = \"https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ML0101EN-SkillsNetwork/labs/Module%203/data/real_estate_data.csv\"\nresp = await fetch(URL)\nregression_tree_data = io.BytesIO((await resp.arrayBuffer()).to_py())",
      "metadata": {
        "trusted": true
      },
      "execution_count": 1,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": "import piplite\nawait piplite.install(['pandas'])\nawait piplite.install(['numpy'])\nawait piplite.install(['scikit-learn'])",
      "metadata": {
        "trusted": true
      },
      "execution_count": 2,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": "# Pandas will allow us to create a dataframe of the data so it can be used and manipulated\nimport pandas as pd\n# Regression Tree Algorithm\nfrom sklearn.tree import DecisionTreeRegressor\n# Split our data into a training and testing data\nfrom sklearn.model_selection import train_test_split",
      "metadata": {
        "trusted": true
      },
      "execution_count": 3,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": "## About the Dataset\n",
      "metadata": {}
    },
    {
      "cell_type": "markdown",
      "source": "Imagine you are a data scientist working for a real estate company that is planning to invest in Boston real estate. You have collected information about various areas of Boston and are tasked with created a model that can predict the median price of houses for that area so it can be used to make offers.\n\nThe dataset had information on areas/towns not individual houses, the features are\n\nCRIM: Crime per capita\n\nZN: Proportion of residential land zoned for lots over 25,000 sq.ft.\n\nINDUS: Proportion of non-retail business acres per town\n\nCHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)\n\nNOX: Nitric oxides concentration (parts per 10 million)\n\nRM: Average number of rooms per dwelling\n\nAGE: Proportion of owner-occupied units built prior to 1940\n\nDIS: Weighted distances to ﬁve Boston employment centers\n\nRAD: Index of accessibility to radial highways\n\nTAX: Full-value property-tax rate per $10,000\n\nPTRAIO: Pupil-teacher ratio by town\n\nLSTAT: Percent lower status of the population\n\nMEDV: Median value of owner-occupied homes in $1000s\n",
      "metadata": {}
    },
    {
      "cell_type": "markdown",
      "source": "## Read the Data\n",
      "metadata": {
        "tags": []
      }
    },
    {
      "cell_type": "markdown",
      "source": "Lets read in the data we have downloaded\n",
      "metadata": {}
    },
    {
      "cell_type": "code",
      "source": "data = pd.read_csv(regression_tree_data)",
      "metadata": {
        "trusted": true
      },
      "execution_count": 4,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": "data.head()",
      "metadata": {
        "trusted": true
      },
      "execution_count": 5,
      "outputs": [
        {
          "execution_count": 5,
          "output_type": "execute_result",
          "data": {
            "text/plain": "      CRIM    ZN  INDUS  CHAS    NOX     RM   AGE     DIS  RAD  TAX  PTRATIO  \\\n0  0.00632  18.0   2.31   0.0  0.538  6.575  65.2  4.0900    1  296     15.3   \n1  0.02731   0.0   7.07   0.0  0.469  6.421  78.9  4.9671    2  242     17.8   \n2  0.02729   0.0   7.07   0.0  0.469  7.185  61.1  4.9671    2  242     17.8   \n3  0.03237   0.0   2.18   0.0  0.458  6.998  45.8  6.0622    3  222     18.7   \n4  0.06905   0.0   2.18   0.0  0.458  7.147  54.2  6.0622    3  222     18.7   \n\n   LSTAT  MEDV  \n0   4.98  24.0  \n1   9.14  21.6  \n2   4.03  34.7  \n3   2.94  33.4  \n4    NaN  36.2  ",
            "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>CRIM</th>\n      <th>ZN</th>\n      <th>INDUS</th>\n      <th>CHAS</th>\n      <th>NOX</th>\n      <th>RM</th>\n      <th>AGE</th>\n      <th>DIS</th>\n      <th>RAD</th>\n      <th>TAX</th>\n      <th>PTRATIO</th>\n      <th>LSTAT</th>\n      <th>MEDV</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>0</th>\n      <td>0.00632</td>\n      <td>18.0</td>\n      <td>2.31</td>\n      <td>0.0</td>\n      <td>0.538</td>\n      <td>6.575</td>\n      <td>65.2</td>\n      <td>4.0900</td>\n      <td>1</td>\n      <td>296</td>\n      <td>15.3</td>\n      <td>4.98</td>\n      <td>24.0</td>\n    </tr>\n    <tr>\n      <th>1</th>\n      <td>0.02731</td>\n      <td>0.0</td>\n      <td>7.07</td>\n      <td>0.0</td>\n      <td>0.469</td>\n      <td>6.421</td>\n      <td>78.9</td>\n      <td>4.9671</td>\n      <td>2</td>\n      <td>242</td>\n      <td>17.8</td>\n      <td>9.14</td>\n      <td>21.6</td>\n    </tr>\n    <tr>\n      <th>2</th>\n      <td>0.02729</td>\n      <td>0.0</td>\n      <td>7.07</td>\n      <td>0.0</td>\n      <td>0.469</td>\n      <td>7.185</td>\n      <td>61.1</td>\n      <td>4.9671</td>\n      <td>2</td>\n      <td>242</td>\n      <td>17.8</td>\n      <td>4.03</td>\n      <td>34.7</td>\n    </tr>\n    <tr>\n      <th>3</th>\n      <td>0.03237</td>\n      <td>0.0</td>\n      <td>2.18</td>\n      <td>0.0</td>\n      <td>0.458</td>\n      <td>6.998</td>\n      <td>45.8</td>\n      <td>6.0622</td>\n      <td>3</td>\n      <td>222</td>\n      <td>18.7</td>\n      <td>2.94</td>\n      <td>33.4</td>\n    </tr>\n    <tr>\n      <th>4</th>\n      <td>0.06905</td>\n      <td>0.0</td>\n      <td>2.18</td>\n      <td>0.0</td>\n      <td>0.458</td>\n      <td>7.147</td>\n      <td>54.2</td>\n      <td>6.0622</td>\n      <td>3</td>\n      <td>222</td>\n      <td>18.7</td>\n      <td>NaN</td>\n      <td>36.2</td>\n    </tr>\n  </tbody>\n</table>\n</div>"
          },
          "metadata": {}
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": "Now lets learn about the size of our data, there are 506 rows and 13 columns\n",
      "metadata": {}
    },
    {
      "cell_type": "code",
      "source": "data.shape",
      "metadata": {
        "trusted": true
      },
      "execution_count": 6,
      "outputs": [
        {
          "execution_count": 6,
          "output_type": "execute_result",
          "data": {
            "text/plain": "(506, 13)"
          },
          "metadata": {}
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": "Most of the data is valid, there are rows with missing values which we will deal with in pre-processing\n",
      "metadata": {}
    },
    {
      "cell_type": "code",
      "source": "data.isna().sum()",
      "metadata": {
        "trusted": true
      },
      "execution_count": 7,
      "outputs": [
        {
          "execution_count": 7,
          "output_type": "execute_result",
          "data": {
            "text/plain": "CRIM       20\nZN         20\nINDUS      20\nCHAS       20\nNOX         0\nRM          0\nAGE        20\nDIS         0\nRAD         0\nTAX         0\nPTRATIO     0\nLSTAT      20\nMEDV        0\ndtype: int64"
          },
          "metadata": {}
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": "## Data Pre-Processing\n",
      "metadata": {}
    },
    {
      "cell_type": "markdown",
      "source": "First lets drop the rows with missing values because we have enough data in our dataset\n",
      "metadata": {}
    },
    {
      "cell_type": "code",
      "source": "data.dropna(inplace=True)",
      "metadata": {
        "trusted": true
      },
      "execution_count": 8,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": "Now we can see our dataset has no missing values\n",
      "metadata": {}
    },
    {
      "cell_type": "code",
      "source": "data.isna().sum()",
      "metadata": {
        "trusted": true
      },
      "execution_count": 9,
      "outputs": [
        {
          "execution_count": 9,
          "output_type": "execute_result",
          "data": {
            "text/plain": "CRIM       0\nZN         0\nINDUS      0\nCHAS       0\nNOX        0\nRM         0\nAGE        0\nDIS        0\nRAD        0\nTAX        0\nPTRATIO    0\nLSTAT      0\nMEDV       0\ndtype: int64"
          },
          "metadata": {}
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": "Lets split the dataset into our features and what we are predicting (target)\n",
      "metadata": {}
    },
    {
      "cell_type": "code",
      "source": "X = data.drop(columns=[\"MEDV\"])\nY = data[\"MEDV\"]",
      "metadata": {
        "trusted": true
      },
      "execution_count": 10,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": "X.head()",
      "metadata": {
        "trusted": true
      },
      "execution_count": 11,
      "outputs": [
        {
          "execution_count": 11,
          "output_type": "execute_result",
          "data": {
            "text/plain": "      CRIM    ZN  INDUS  CHAS    NOX     RM   AGE     DIS  RAD  TAX  PTRATIO  \\\n0  0.00632  18.0   2.31   0.0  0.538  6.575  65.2  4.0900    1  296     15.3   \n1  0.02731   0.0   7.07   0.0  0.469  6.421  78.9  4.9671    2  242     17.8   \n2  0.02729   0.0   7.07   0.0  0.469  7.185  61.1  4.9671    2  242     17.8   \n3  0.03237   0.0   2.18   0.0  0.458  6.998  45.8  6.0622    3  222     18.7   \n5  0.02985   0.0   2.18   0.0  0.458  6.430  58.7  6.0622    3  222     18.7   \n\n   LSTAT  \n0   4.98  \n1   9.14  \n2   4.03  \n3   2.94  \n5   5.21  ",
            "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>CRIM</th>\n      <th>ZN</th>\n      <th>INDUS</th>\n      <th>CHAS</th>\n      <th>NOX</th>\n      <th>RM</th>\n      <th>AGE</th>\n      <th>DIS</th>\n      <th>RAD</th>\n      <th>TAX</th>\n      <th>PTRATIO</th>\n      <th>LSTAT</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>0</th>\n      <td>0.00632</td>\n      <td>18.0</td>\n      <td>2.31</td>\n      <td>0.0</td>\n      <td>0.538</td>\n      <td>6.575</td>\n      <td>65.2</td>\n      <td>4.0900</td>\n      <td>1</td>\n      <td>296</td>\n      <td>15.3</td>\n      <td>4.98</td>\n    </tr>\n    <tr>\n      <th>1</th>\n      <td>0.02731</td>\n      <td>0.0</td>\n      <td>7.07</td>\n      <td>0.0</td>\n      <td>0.469</td>\n      <td>6.421</td>\n      <td>78.9</td>\n      <td>4.9671</td>\n      <td>2</td>\n      <td>242</td>\n      <td>17.8</td>\n      <td>9.14</td>\n    </tr>\n    <tr>\n      <th>2</th>\n      <td>0.02729</td>\n      <td>0.0</td>\n      <td>7.07</td>\n      <td>0.0</td>\n      <td>0.469</td>\n      <td>7.185</td>\n      <td>61.1</td>\n      <td>4.9671</td>\n      <td>2</td>\n      <td>242</td>\n      <td>17.8</td>\n      <td>4.03</td>\n    </tr>\n    <tr>\n      <th>3</th>\n      <td>0.03237</td>\n      <td>0.0</td>\n      <td>2.18</td>\n      <td>0.0</td>\n      <td>0.458</td>\n      <td>6.998</td>\n      <td>45.8</td>\n      <td>6.0622</td>\n      <td>3</td>\n      <td>222</td>\n      <td>18.7</td>\n      <td>2.94</td>\n    </tr>\n    <tr>\n      <th>5</th>\n      <td>0.02985</td>\n      <td>0.0</td>\n      <td>2.18</td>\n      <td>0.0</td>\n      <td>0.458</td>\n      <td>6.430</td>\n      <td>58.7</td>\n      <td>6.0622</td>\n      <td>3</td>\n      <td>222</td>\n      <td>18.7</td>\n      <td>5.21</td>\n    </tr>\n  </tbody>\n</table>\n</div>"
          },
          "metadata": {}
        }
      ]
    },
    {
      "cell_type": "code",
      "source": "Y.head()",
      "metadata": {
        "trusted": true
      },
      "execution_count": 12,
      "outputs": [
        {
          "execution_count": 12,
          "output_type": "execute_result",
          "data": {
            "text/plain": "0    24.0\n1    21.6\n2    34.7\n3    33.4\n5    28.7\nName: MEDV, dtype: float64"
          },
          "metadata": {}
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": "Finally lets split our data into a training and testing dataset using `train_test_split` from `sklearn.model_selection`\n",
      "metadata": {}
    },
    {
      "cell_type": "code",
      "source": "X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=.2, random_state=1)",
      "metadata": {
        "trusted": true
      },
      "execution_count": 13,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": "## Create Regression Tree\n",
      "metadata": {}
    },
    {
      "cell_type": "markdown",
      "source": "Regression Trees are implemented using `DecisionTreeRegressor` from `sklearn.tree`\n\nThe important parameters of `DecisionTreeRegressor` are\n\n`criterion`: {\"mse\", \"friedman_mse\", \"mae\", \"poisson\"} - The function used to measure error\n\n`max_depth` - The max depth the tree can be\n\n`min_samples_split` - The minimum number of samples required to split a node\n\n`min_samples_leaf` - The minimum number of samples that a leaf can contain\n\n`max_features`: {\"auto\", \"sqrt\", \"log2\"} - The number of feature we examine looking for the best one, used to speed up training\n",
      "metadata": {}
    },
    {
      "cell_type": "markdown",
      "source": "First lets start by creating a `DecisionTreeRegressor` object, setting the `criterion` parameter to `mse` for Mean Squared Error\n",
      "metadata": {}
    },
    {
      "cell_type": "code",
      "source": "regression_tree = DecisionTreeRegressor(criterion = \"mse\")",
      "metadata": {
        "trusted": true
      },
      "execution_count": 14,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": "## Training\n",
      "metadata": {}
    },
    {
      "cell_type": "markdown",
      "source": "Now lets train our model using the `fit` method on the `DecisionTreeRegressor` object providing our training data\n",
      "metadata": {}
    },
    {
      "cell_type": "code",
      "source": "regression_tree.fit(X_train, Y_train)",
      "metadata": {
        "trusted": true
      },
      "execution_count": 15,
      "outputs": [
        {
          "name": "stderr",
          "text": "/lib/python3.10/site-packages/sklearn/tree/_classes.py:359: FutureWarning: Criterion 'mse' was deprecated in v1.0 and will be removed in version 1.2. Use `criterion='squared_error'` which is equivalent.\n  warnings.warn(\n",
          "output_type": "stream"
        },
        {
          "execution_count": 15,
          "output_type": "execute_result",
          "data": {
            "text/plain": "DecisionTreeRegressor(criterion='mse')"
          },
          "metadata": {}
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": "## Evaluation\n",
      "metadata": {}
    },
    {
      "cell_type": "markdown",
      "source": "To evaluate our dataset we will use the `score` method of the `DecisionTreeRegressor` object providing our testing data, this number is the $R^2$ value which indicates the coefficient of determination\n",
      "metadata": {}
    },
    {
      "cell_type": "code",
      "source": "regression_tree.score(X_test, Y_test)",
      "metadata": {
        "trusted": true
      },
      "execution_count": 16,
      "outputs": [
        {
          "execution_count": 16,
          "output_type": "execute_result",
          "data": {
            "text/plain": "0.852006811553053"
          },
          "metadata": {}
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": "We can also find the average error in our testing set which is the average error in median home value prediction\n",
      "metadata": {}
    },
    {
      "cell_type": "code",
      "source": "prediction = regression_tree.predict(X_test)\n\nprint(\"$\",(prediction - Y_test).abs().mean()*1000)",
      "metadata": {
        "trusted": true
      },
      "execution_count": 17,
      "outputs": [
        {
          "name": "stdout",
          "text": "$ 2715.189873417721\n",
          "output_type": "stream"
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": "## Excercise\n",
      "metadata": {}
    },
    {
      "cell_type": "markdown",
      "source": "Train a regression tree using the `criterion` `mae` then report its $R^2$ value and average error\n",
      "metadata": {}
    },
    {
      "cell_type": "code",
      "source": "regression_tree = DecisionTreeRegressor(criterion = \"mae\")\n\nregression_tree.fit(X_train, Y_train)\n\nprint(regression_tree.score(X_test, Y_test))\n\nprediction = regression_tree.predict(X_test)\n\nprint(\"$\",(prediction - Y_test).abs().mean()*1000)",
      "metadata": {
        "trusted": true
      },
      "execution_count": 18,
      "outputs": [
        {
          "name": "stderr",
          "text": "/lib/python3.10/site-packages/sklearn/tree/_classes.py:366: FutureWarning: Criterion 'mae' was deprecated in v1.0 and will be removed in version 1.2. Use `criterion='absolute_error'` which is equivalent.\n  warnings.warn(\n",
          "output_type": "stream"
        },
        {
          "name": "stdout",
          "text": "0.8720206502582719\n$ 2537.9746835443034\n",
          "output_type": "stream"
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": "<details><summary>Click here for the solution</summary>\n\n```python\nregression_tree = DecisionTreeRegressor(criterion = \"mae\")\n\nregression_tree.fit(X_train, Y_train)\n\nprint(regression_tree.score(X_test, Y_test))\n\nprediction = regression_tree.predict(X_test)\n\nprint(\"$\",(prediction - Y_test).abs().mean()*1000)\n\n```\n\n</details>\n",
      "metadata": {}
    },
    {
      "cell_type": "markdown",
      "source": "## Authors\n",
      "metadata": {}
    },
    {
      "cell_type": "markdown",
      "source": "Azim Hirjani\n",
      "metadata": {}
    },
    {
      "cell_type": "markdown",
      "source": "## Change Log\n",
      "metadata": {}
    },
    {
      "cell_type": "markdown",
      "source": "| Date (YYYY-MM-DD) | Version | Changed By | Change Description      |\n| ----------------- | ------- | ---------- | ----------------------- |\n| 2020-07-20        | 0.2     | Azim       | Modified Multiple Areas |\n| 2020-07-17        | 0.1     | Azim       | Created Lab Template    |\n",
      "metadata": {}
    },
    {
      "cell_type": "markdown",
      "source": "Copyright © 2020 IBM Corporation. All rights reserved.\n",
      "metadata": {}
    }
  ]
}