{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Introduction to Data Science\n", "## From correlation to supervised segmentation and tree-structured models" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We are going to need a lot of Python packages, so let's start by importing all of them." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Import the libraries we will be using\n", "import os\n", "import numpy as np\n", "import pandas as pd\n", "import math\n", "import matplotlib.patches as patches\n", "import matplotlib.pylab as plt\n", "from sklearn.tree import DecisionTreeClassifier\n", "from sklearn import tree\n", "from sklearn import metrics\n", "from sklearn import datasets\n", "from IPython.display import Image\n", "\n", "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We are also going to do a lot of repetitive stuff, so let's predefine some useful functions." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# A function that gives a visual representation of the decision tree\n", "def Decision_Tree_Image(decision_tree, feature_names, name=\"temp\"):\n", " # Export our decision tree to graphviz format\n", " dot_file = tree.export_graphviz(decision_tree.tree_, out_file='images/' + name + '.dot', feature_names=feature_names)\n", " \n", " # Call graphviz to make an image file from our decision tree\n", " os.system(\"dot -T png images/\" + name + \".dot -o images/\" + name + \".png\")\n", " \n", " # Return the .png image so we can see it\n", " return Image(filename='images/' + name + '.png')\n", "\n", "# A function to plot the data\n", "def Plot_Data(data, v1, v2, tv):\n", " # Make the plot square\n", " plt.rcParams['figure.figsize'] = [12.0, 8.0]\n", " \n", " # Color\n", " color = [\"red\" if x == 0 else \"blue\" for x in data[tv]]\n", " \n", " # Plot and label\n", " plt.scatter(data[v1], data[v2], c=color, s=50)\n", " plt.xlabel(v1)\n", " plt.ylabel(v2)\n", " plt.xlim([min(data[v1]) - 1, max(data[v1]) + 1])\n", " plt.ylim([min(data[v2]) - .05, max(data[v2]) + .05])\n", " \n", "def Decision_Surface(x, y, model, cell_size=.01):\n", " # Get blob sizes for shading\n", " x = (min(x), max(x))\n", " y = (min(y), max(y))\n", " x_step = (x[1] - x[0]) * cell_size\n", " y_step = (y[1] - y[0]) * cell_size\n", "\n", " # Create blobs\n", " x_values = []\n", " y_values = []\n", " \n", " for i in np.arange(x[0], x[1], x_step):\n", " for j in np.arange(y[0], y[1], y_step):\n", " y_values.append(float(i))\n", " x_values.append(float(j))\n", " \n", " data_blob = pd.DataFrame({\"x\": x_values, \"y\": y_values})\n", "\n", " # Predict the blob labels\n", " label= decision_tree.predict(data_blob)\n", " \n", " # Color and plot them\n", " color = [\"red\" if l == 0 else \"blue\" for l in label]\n", " plt.scatter(data_blob['y'], data_blob['x'], marker='o', edgecolor='black', linewidth='0', c=color, alpha=0.3)\n", " \n", " # Get the raw decision tree rules\n", " decision_tree_raw = []\n", " for feature, left_c, right_c, threshold, value in zip(decision_tree.tree_.feature, \n", " decision_tree.tree_.children_left, \n", " decision_tree.tree_.children_right, \n", " decision_tree.tree_.threshold, \n", " decision_tree.tree_.value):\n", " decision_tree_raw.append([feature, left_c, right_c, threshold, value])\n", "\n", " # Plot the data\n", " Plot_Data(data, \"humor\", \"number_pets\", \"success\")\n", "\n", " # Used for formatting the boundry lines\n", " currentAxis = plt.gca()\n", " line_color = \"black\"\n", " line_width = 3\n", "\n", " # For each rule\n", " for row in decision_tree_raw:\n", " feature, left_c, right_c, threshold, value = row\n", "\n", " if threshold != -2:\n", " if feature == 0:\n", " plt.plot([20, 100], [threshold, threshold], c=line_color, linewidth=line_width)\n", " else:\n", " plt.plot([threshold, threshold], [0, 5], c=line_color, linewidth=line_width)\n", "\n", " plt.xlim([min(x) - 1, max(x) + 1])\n", " plt.ylim([min(y) - .05, max(y) + .05])\n", " plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We also need some data, so let's create a dataset consisting of 500 people." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Set the randomness\n", "np.random.seed(36)\n", "\n", "# Number of users\n", "n_users = 500\n", "\n", "# Relationships\n", "variable_names = [\"age\", \"humor\", \"number_pets\"]\n", "variables_keep = [\"number_pets\", \"humor\"]\n", "target_name = \"success\"\n", "\n", "# Generate data\n", "predictors, target = datasets.make_classification(n_features=3, n_redundant=0, \n", " n_informative=2, n_clusters_per_class=2,\n", " n_samples=n_users)\n", "data = pd.DataFrame(predictors, columns=variable_names)\n", "data['age'] = data['age'] * 10 + 50\n", "data['humor'] = data['humor'] * 10 + 50\n", "data['number_pets'] = (data['number_pets'] + 6)/2\n", "data[target_name] = target\n", "\n", "X = data[[variables_keep[0], variables_keep[1]]]\n", "Y = data[target_name]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Useful features\n", "Let's take a look at one of our features -- `\"number_pets\"`. Is this feature useful? Let's plot the possible values of `\"number_pets\"` and color code our target variable, which is, in this case, `\"success\"`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [], "source": [ "# Make the plot long\n", "plt.rcParams['figure.figsize'] = [20.0, 4.0]\n", "color = [\"red\" if x == 0 else \"blue\" for x in data[\"success\"]]\n", "plt.scatter(X['number_pets'], [1] * n_users, c=color, s=50)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Is `\"number_pets\"` actually useful? Let's quantify it.\n", "\n", "**Entropy** ($H$) and **information gain** ($IG$) are crucial in determining which features are the most informative. Given the data, it is fairly straight forward to calculate both of these.\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
\n", "Figure 3-4. Splitting the \"write-off\" sample into two segments, based on splitting the Balance attribute (account balance) at 50K.\n", "Figure 3-5. A classification tree split on the three-values Residence attribute.
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def entropy(target):\n", " # Get the number of users\n", " n = len(target)\n", " # Count how frequently each unique value occurs\n", " counts = np.bincount(target).astype(float)\n", " # Initialize entropy\n", " entropy = 0\n", " # If the split is perfect, return 0\n", " if len(counts) <= 1 or 0 in counts:\n", " return entropy\n", " # Otherwise, for each possible value, update entropy\n", " for count in counts:\n", " entropy += math.log(count/n, len(counts)) * count/n\n", " # Return entropy\n", " return -1 * entropy\n", "\n", "def information_gain(feature, threshold, target):\n", " # Dealing with numpy arrays makes this slightly easier\n", " target = np.array(target)\n", " feature = np.array(feature)\n", " # Cut the feature vector on the threshold\n", " feature = (feature < threshold)\n", " # Initialize information gain with the parent entropy\n", " ig = entropy(target)\n", " # For both sides of the threshold, update information gain\n", " for level, count in zip([0, 1], np.bincount(feature).astype(float)):\n", " ig -= count/len(feature) * entropy(target[feature == level])\n", " # Return information gain\n", " return ig" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we have a way of calculating $H$ and $IG$, let's pick a threshold, split `\"number_pets\"`, and calculate $IG$." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "threshold = 3.2\n", "print \"IG = %.4f with a thresholding of %.2f.\" % (information_gain(X['number_pets'], threshold, np.array(Y)), threshold)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To be more precise, we can iterate through all values and find the best split." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [], "source": [ "def best_threshold():\n", " maximum_ig = 0\n", " maximum_threshold = 0\n", "\n", " for threshold in X['number_pets']:\n", " ig = information_gain(X['number_pets'], threshold, np.array(Y))\n", " if ig > maximum_ig:\n", " maximum_ig = ig\n", " maximum_threshold = threshold\n", "\n", " return \"The maximum IG = %.3f and it occured by splitting on %.4f.\" % (maximum_ig, maximum_threshold)\n", "\n", "print best_threshold()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's see how we can do this with just sklearn!" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "decision_tree = DecisionTreeClassifier(max_depth=1, criterion=\"entropy\")\n", "decision_tree.fit(X, Y)\n", "Decision_Tree_Image(decision_tree, X.columns)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's look at another one of our features, `\"humor\"`, and see how it relates to `\"number_pets\"`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "Plot_Data(data, \"humor\", \"number_pets\", \"success\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "Decision_Surface(data['humor'], data['number_pets'], decision_tree)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What does our tree look like when we look at it along with the plot we just made?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we use this as our decision tree, how accurate is it?" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "print \"Accuracy = %.3f\" % (metrics.accuracy_score(decision_tree.predict(X), Y))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's add one more level to our decision tree." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "decision_tree = DecisionTreeClassifier(max_depth=2, criterion=\"entropy\")\n", "decision_tree.fit(X, Y)\n", "Decision_Tree_Image(decision_tree, X.columns)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "Decision_Surface(data['humor'], data['number_pets'], decision_tree)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "print \"Accuracy = %.3f\" % (metrics.accuracy_score(decision_tree.predict(X), Y))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Not much of a change? Can you see why? Try experimenting with different values of `max_depth` and find which one is the most accurate. Is there a point where the tree stops growing?" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Your code here!" ] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.6" } }, "nbformat": 4, "nbformat_minor": 0 }