{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Introduction to Data Science\n", "## From correlation to supervised segmentation and tree-structured models" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We are going to need a lot of Python packages, so let's start by importing all of them." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Import the libraries we will be using\n", "import os\n", "import numpy as np\n", "import pandas as pd\n", "import math\n", "import matplotlib.patches as patches\n", "import matplotlib.pylab as plt\n", "from sklearn.tree import DecisionTreeClassifier\n", "from sklearn import tree\n", "from sklearn import metrics\n", "from sklearn import datasets\n", "from IPython.display import Image\n", "\n", "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We are also going to do a lot of repetitive stuff, so let's predefine some useful functions." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# A function that gives a visual representation of the decision tree\n", "def Decision_Tree_Image(decision_tree, feature_names, name=\"temp\"):\n", "    # Export the fitted decision tree to graphviz format\n", "    tree.export_graphviz(decision_tree, out_file='images/' + name + '.dot', feature_names=feature_names)\n", "\n", "    # Call graphviz to make an image file from our decision tree\n", "    os.system(\"dot -T png images/\" + name + \".dot -o images/\" + name + \".png\")\n", "\n", "    # Return the .png image so we can see it\n", "    return Image(filename='images/' + name + '.png')\n", "\n", "# A function to plot the data\n", "def Plot_Data(data, v1, v2, tv):\n", "    # Set the figure size\n", "    plt.rcParams['figure.figsize'] = [12.0, 8.0]\n", "\n", "    # Color each point by its target value\n", "    color = [\"red\" if x == 0 else \"blue\" for x in data[tv]]\n", "\n", "    # Plot and label\n", "    plt.scatter(data[v1], data[v2], c=color, s=50)\n", "    plt.xlabel(v1)\n", "    plt.ylabel(v2)\n", "    plt.xlim([min(data[v1]) - 1, max(data[v1]) + 1])\n", "    plt.ylim([min(data[v2]) - .05, max(data[v2]) + .05])\n", "\n", "# A function to shade the decision surface of a fitted model\n", "def Decision_Surface(x, y, model, cell_size=.01):\n", "    # Get blob sizes for shading\n", "    x = (min(x), max(x))\n", "    y = (min(y), max(y))\n", "    x_step = (x[1] - x[0]) * cell_size\n", "    y_step = (y[1] - y[0]) * cell_size\n", "\n", "    # Create blobs\n", "    # Note: x_values collects the y range and y_values the x range;\n", "    # the scatter call below swaps them back when plotting\n", "    x_values = []\n", "    y_values = []\n", "\n", "    for i in np.arange(x[0], x[1], x_step):\n", "        for j in np.arange(y[0], y[1], y_step):\n", "            y_values.append(float(i))\n", "            x_values.append(float(j))\n", "\n", "    data_blob = pd.DataFrame({\"x\": x_values, \"y\": y_values})\n", "\n", "    # Predict the blob labels with the model that was passed in\n", "    label = model.predict(data_blob)\n", "\n", "    # Color and plot them\n", "    color = [\"red\" if l == 0 else \"blue\" for l in label]\n", "    plt.scatter(data_blob['y'], data_blob['x'], marker='o', edgecolor='black', 
linewidth=0, c=color, alpha=0.3)\n", "\n", "    # Get the raw decision tree rules from the model that was passed in\n", "    decision_tree_raw = []\n", "    for feature, left_c, right_c, threshold, value in zip(model.tree_.feature,\n", "                                                          model.tree_.children_left,\n", "                                                          model.tree_.children_right,\n", "                                                          model.tree_.threshold,\n", "                                                          model.tree_.value):\n", "        decision_tree_raw.append([feature, left_c, right_c, threshold, value])\n", "\n", "    # Plot the data (uses the global `data` DataFrame)\n", "    Plot_Data(data, \"humor\", \"number_pets\", \"success\")\n", "\n", "    # Used for formatting the boundary lines\n", "    currentAxis = plt.gca()\n", "    line_color = \"black\"\n", "    line_width = 3\n", "\n", "    # For each rule\n", "    for row in decision_tree_raw:\n", "        feature, left_c, right_c, threshold, value = row\n", "\n", "        # Leaf nodes have a threshold of -2, so only draw lines for split nodes\n", "        if threshold != -2:\n", "            if feature == 0:\n", "                plt.plot([20, 100], [threshold, threshold], c=line_color, linewidth=line_width)\n", "            else:\n", "                plt.plot([threshold, threshold], [0, 5], c=line_color, linewidth=line_width)\n", "\n", "    plt.xlim([min(x) - 1, max(x) + 1])\n", "    plt.ylim([min(y) - .05, max(y) + .05])\n", "    plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We also need some data, so let's create a dataset consisting of 500 people." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Seed the random number generator so the data is reproducible\n", "np.random.seed(36)\n", "\n", "# Number of users\n", "n_users = 500\n", "\n", "# Variable names\n", "variable_names = [\"age\", \"humor\", \"number_pets\"]\n", "variables_keep = [\"number_pets\", \"humor\"]\n", "target_name = \"success\"\n", "\n", "# Generate data\n", "predictors, target = datasets.make_classification(n_features=3, n_redundant=0,\n", "                                                  n_informative=2, n_clusters_per_class=2,\n", "                                                  n_samples=n_users)\n", "data = pd.DataFrame(predictors, columns=variable_names)\n", "data['age'] = data['age'] * 10 + 50\n", "data['humor'] = data['humor'] * 10 + 50\n", "data['number_pets'] = (data['number_pets'] + 6)/2\n", "data[target_name] = target\n", "\n", "X = data[[variables_keep[0], variables_keep[1]]]\n", "Y = data[target_name]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Useful features\n", "Let's take a look at one of our features -- `\"number_pets\"`. Is this feature useful? Let's plot the observed values of `\"number_pets\"` and color-code them by our target variable, which is, in this case, `\"success\"`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [], "source": [ "# Make the plot long\n", "plt.rcParams['figure.figsize'] = [20.0, 4.0]\n", "color = [\"red\" if x == 0 else \"blue\" for x in data[\"success\"]]\n", "plt.scatter(X['number_pets'], [1] * n_users, c=color, s=50)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Is `\"number_pets\"` actually useful? Let's quantify it.\n", "\n", "**Entropy** ($H$) and **information gain** ($IG$) are crucial in determining which features are the most informative. Given the data, it is fairly straightforward to calculate both of these.\n", "\n", "
\n", "Figure 3-4. Splitting the \"write-off\" sample into two segments by splitting the Balance attribute (account balance) at 50K.\n", "\n", "Figure 3-5. A classification tree split on the three-valued Residence attribute.\n", "