{ "cells": [ { "cell_type": "markdown", "id": "7c5ba4fe", "metadata": {}, "source": [ "# Decision Trees - Intro and Regression" ] }, { "cell_type": "markdown", "id": "07494ffb", "metadata": {}, "source": [ "Decision Trees are supervised machine learning algorithms that are used for both regression and classification tasks. Trees are powerful algorithms that can handle complex datasets. \n", "\n", "Here are 7 interesting facts about decision trees:\n", "\n", "* They do not need the numerical input data to be scaled. Whatever the numerical values are, decision trees don't care. \n", "\n", "* Decision trees can handle categorical features in the raw text format (Scikit-Learn doesn't support this, TensorFlow's trees implementation does).\n", "\n", "* Unlike many other complex learning algorithms, the results of decision trees can be interpreted. It's fair to say that decision trees are not black-box models. \n", "* While most models suffer from missing values, many decision tree implementations can handle them.\n", "* Trees can handle imbalanced datasets. You will only have to adjust the weights of the classes.\n", "* Trees can provide feature importances, that is, how much each feature contributed to the trained model.\n", "* Trees are the basic building blocks of ensemble methods such as random forests and gradient boosting machines.\n", "\n", "A decision tree works like a series of if/else questions. Let's say that you want to decide which car to buy. To pick the right car, you could evaluate the level of safety and the number of seats and doors by asking a series of if/else questions. \n", "\n", "Here is the structure of a decision tree.\n", "\n", "\n", "\n", "\n", "A well-known downside of decision trees is that they tend to overfit the data easily (it is safe to assume that they will overfit at first). 
One way to overcome overfitting is to reduce the maximum depth of the decision tree (the `max_depth` hyperparameter). We will see other techniques to avoid overfitting. \n", "\n", "To motivate the superpower of decision trees, let's use them for a regression task, where instead of predicting a class we predict a continuous value. In the next lab, we will use them for classification. " ] }, { "cell_type": "markdown", "id": "d4e2c5a6", "metadata": {}, "source": [ "## Decision Trees for Regression" ] }, { "cell_type": "markdown", "id": "ae5e8372", "metadata": {}, "source": [ "### Contents\n", "\n", "* [1 - Imports]\n", "* [2 - Loading the data]\n", "* [3 - Exploratory Analysis]\n", "* [4 - Preprocessing the data]\n", "* [5 - Training Decision Trees]\n", "* [6 - Evaluating Decision Trees]\n", "* [7 - Improving Decision Trees]" ] }, { "cell_type": "markdown", "id": "99058d12", "metadata": {}, "source": [ "## 1 - Imports" ] }, { "cell_type": "code", "execution_count": 1, "id": "df278c5f", "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "import seaborn as sns\n", "import sklearn\n", "import matplotlib.pyplot as plt\n", "%matplotlib inline" ] }, { "cell_type": "markdown", "id": "35699c38", "metadata": {}, "source": [ "## 2 - Loading the data\n", "\n", "In this regression task with decision trees, we will use the Machine CPU (Central Processing Unit) data, which is available at [OpenML](https://www.openml.org/t/5492). We will load it with Scikit-Learn's `fetch_openml` function. \n", "\n", "If you are reading this, it's very likely that you know what a CPU is, or that you have thought about it once (or many times) when buying a computer. 
In this notebook, we will predict the relative performance of the CPU given the following data: \n", "\n", "* MYCT: machine cycle time in nanoseconds (integer)\n", "* MMIN: minimum main memory in kilobytes (integer)\n", "* MMAX: maximum main memory in kilobytes (integer)\n", "* CACH: cache memory in kilobytes (integer)\n", "* CHMIN: minimum channels in units (integer)\n", "* CHMAX: maximum channels in units (integer)\n", "* PRP: published relative performance (integer) (target variable)" ] }, { "cell_type": "code", "execution_count": 2, "id": "67905ee6", "metadata": {}, "outputs": [], "source": [ "# Let's hide warnings\n", "\n", "import warnings\n", "warnings.filterwarnings('ignore')" ] }, { "cell_type": "code", "execution_count": 3, "id": "d3083946", "metadata": {}, "outputs": [], "source": [ "from sklearn.datasets import fetch_openml\n", "\n", "machine_cpu = fetch_openml(name='machine_cpu')" ] }, { "cell_type": "code", "execution_count": 4, "id": "46c09852", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "sklearn.utils._bunch.Bunch" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(machine_cpu)" ] }, { "cell_type": "code", "execution_count": 5, "id": "154ed67e", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(209, 6)" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "machine_cpu.data.shape" ] }, { "cell_type": "code", "execution_count": 6, "id": "69e438fb", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "**Author**: \n", "**Source**: Unknown - \n", "**Please cite**: \n", "\n", "The problem concerns Relative CPU Performance Data. 
More information can be obtained in the UCI Machine\n", " Learning repository (http://www.ics.uci.edu/~mlearn/MLSummary.html).\n", " The used attributes are :\n", " MYCT: machine cycle time in nanoseconds (integer)\n", " MMIN: minimum main memory in kilobytes (integer)\n", " MMAX: maximum main memory in kilobytes (integer)\n", " CACH: cache memory in kilobytes (integer)\n", " CHMIN: minimum channels in units (integer)\n", " CHMAX: maximum channels in units (integer)\n", " PRP: published relative performance (integer) (target variable)\n", " \n", " Original source: UCI machine learning repository. \n", " Source: collection of regression datasets by Luis Torgo (ltorgo@ncc.up.pt) at\n", " http://www.ncc.up.pt/~ltorgo/Regression/DataSets.html\n", " Characteristics: 209 cases; 6 continuous variables\n", "\n", "Downloaded from openml.org.\n" ] } ], "source": [ "print(machine_cpu.DESCR)" ] }, { "cell_type": "code", "execution_count": 7, "id": "d7180991", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['MYCT', 'MMIN', 'MMAX', 'CACH', 'CHMIN', 'CHMAX']" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Displaying feature names\n", "\n", "machine_cpu.feature_names" ] }, { "cell_type": "code", "execution_count": 8, "id": "e3ba092e", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
|  | MYCT | MMIN | MMAX | CACH | CHMIN | CHMAX | class |
|---|---|---|---|---|---|---|---|
| 0 | 125.0 | 256.0 | 6000.0 | 256.0 | 16.0 | 128.0 | 198.0 |
| 1 | 29.0 | 8000.0 | 32000.0 | 32.0 | 8.0 | 32.0 | 269.0 |
| 2 | 29.0 | 8000.0 | 32000.0 | 32.0 | 8.0 | 32.0 | 220.0 |
| 3 | 29.0 | 8000.0 | 32000.0 | 32.0 | 8.0 | 32.0 | 172.0 |
| 4 | 29.0 | 8000.0 | 16000.0 | 32.0 | 8.0 | 16.0 | 132.0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 204 | 124.0 | 1000.0 | 8000.0 | 0.0 | 1.0 | 8.0 | 42.0 |
| 205 | 98.0 | 1000.0 | 8000.0 | 32.0 | 2.0 | 8.0 | 46.0 |
| 206 | 125.0 | 2000.0 | 8000.0 | 0.0 | 2.0 | 14.0 | 52.0 |
| 207 | 480.0 | 512.0 | 8000.0 | 32.0 | 0.0 | 0.0 | 67.0 |
| 208 | 480.0 | 1000.0 | 4000.0 | 0.0 | 0.0 | 0.0 | 45.0 |

209 rows × 7 columns
|  | MYCT | MMIN | MMAX | CACH | CHMIN | CHMAX |
|---|---|---|---|---|---|---|
| count | 167.000000 | 167.000000 | 167.000000 | 167.000000 | 167.000000 | 167.000000 |
| mean | 207.958084 | 2900.826347 | 11761.161677 | 26.071856 | 4.760479 | 18.616766 |
| std | 266.772823 | 4165.950964 | 12108.332354 | 42.410014 | 6.487439 | 27.489919 |
| min | 17.000000 | 64.000000 | 64.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 50.000000 | 768.000000 | 4000.000000 | 0.000000 | 1.000000 | 5.000000 |
| 50% | 110.000000 | 2000.000000 | 8000.000000 | 8.000000 | 2.000000 | 8.000000 |
| 75% | 232.500000 | 3100.000000 | 16000.000000 | 32.000000 | 6.000000 | 24.000000 |
| max | 1500.000000 | 32000.000000 | 64000.000000 | 256.000000 | 52.000000 | 176.000000 |
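The `count` of 167 in the summary above is consistent with an 80/20 train/test split of the 209 rows. The following is a minimal sketch of such a split, assuming `train_test_split` with `test_size=0.2` (the split call actually used is not shown here, so the ratio and `random_state` are assumptions). It uses a synthetic stand-in frame named `cpu_like` (a hypothetical name) instead of the real dataset, so it runs without downloading anything.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same shape as the CPU data:
# 209 rows, 6 features plus the target column named "class".
# The values are random placeholders, not the real measurements.
rng = np.random.default_rng(42)
columns = ["MYCT", "MMIN", "MMAX", "CACH", "CHMIN", "CHMAX", "class"]
cpu_like = pd.DataFrame(rng.integers(0, 1000, size=(209, 7)), columns=columns)

# Hold out 20% of the rows for testing (assumed ratio).
train_data, test_data = train_test_split(cpu_like, test_size=0.2, random_state=42)

# Separate the features from the target column.
X_train = train_data.drop(columns="class")
y_train = train_data["class"]

print(train_data.shape, test_data.shape)
```

With 209 rows, a 20% test share rounds up to 42 test rows, which leaves 167 rows for training.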
DecisionTreeRegressor()
DecisionTreeRegressor()
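The introduction notes that an unconstrained tree tends to overfit and that limiting `max_depth` is one remedy. The sketch below illustrates this on synthetic noisy data (an assumption made so that the example is self-contained; it is not the CPU dataset). It compares a default `DecisionTreeRegressor` with one capped at `max_depth=3`, a value chosen only for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem: a sine curve plus Gaussian noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.5, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Default settings: the tree keeps splitting until every leaf is pure.
full_tree = DecisionTreeRegressor(random_state=42).fit(X_train, y_train)

# Same model, but the depth is capped to regularize it.
shallow_tree = DecisionTreeRegressor(max_depth=3, random_state=42).fit(X_train, y_train)

# score() returns the R^2 coefficient of determination.
for name, model in [("unconstrained", full_tree), ("max_depth=3", shallow_tree)]:
    train_r2 = model.score(X_train, y_train)
    test_r2 = model.score(X_test, y_test)
    print(f"{name}: train R^2 = {train_r2:.3f}, test R^2 = {test_r2:.3f}")
```

The unconstrained tree memorizes the training set, so its training score says little about how it behaves on unseen data. The gap between its training and test scores is the symptom to look for.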
GridSearchCV(cv=3, estimator=DecisionTreeRegressor(random_state=42),
             param_grid={'max_depth': [None, 0, 1, 2, 3],
                         'max_leaf_nodes': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
                         'min_samples_split': [0, 1, 2, 3, 4]},
             verbose=1)
DecisionTreeRegressor(max_leaf_nodes=9, min_samples_split=4, random_state=42)
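The grid shown above includes values that `DecisionTreeRegressor` rejects: `max_depth=0`, `max_leaf_nodes` of 0 or 1, and `min_samples_split` of 0 or 1. Those candidates fail to fit and score as NaN, which is easy to miss when warnings are hidden. The sketch below runs the same kind of search with only valid values. The data is a synthetic stand-in (an assumption, sized like the 167-row training set), and the exact ranges are illustrative rather than tuned.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the training set: 167 rows, 6 numeric features.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(167, 6))
y = 3 * X[:, 0] + X[:, 1] ** 2 + rng.normal(scale=1.0, size=167)

# Only values the estimator accepts:
# max_depth >= 1 (or None), max_leaf_nodes >= 2, min_samples_split >= 2.
param_grid = {
    "max_depth": [None, 1, 2, 3],
    "max_leaf_nodes": list(range(2, 10)),
    "min_samples_split": [2, 3, 4],
}

grid_search = GridSearchCV(
    DecisionTreeRegressor(random_state=42),
    param_grid,
    cv=3,
)
grid_search.fit(X, y)

print(grid_search.best_params_)
print(grid_search.best_estimator_)
```

In the result above, the best estimator sits at the upper edge of the grid (`max_leaf_nodes=9`, `min_samples_split=4`). When that happens, it is usually worth widening those ranges and searching again.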