{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from __future__ import print_function\n", "import sklearn\n", "import sklearn.datasets\n", "import sklearn.ensemble\n", "import sklearn.metrics\n", "import sklearn.model_selection\n", "import numpy as np\n", "import lime\n", "import lime.lime_tabular\n", "np.random.seed(1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Continuous features" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Loading data, training a model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For this part, we'll use the Iris dataset, and we'll train a random forest. " ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [], "source": [ "iris = sklearn.datasets.load_iris()" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [], "source": [ "train, test, labels_train, labels_test = sklearn.model_selection.train_test_split(iris.data, iris.target, train_size=0.80)\n" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',\n", " max_depth=None, max_features='auto', max_leaf_nodes=None,\n", " min_impurity_split=1e-07, min_samples_leaf=1,\n", " min_samples_split=2, min_weight_fraction_leaf=0.0,\n", " n_estimators=500, n_jobs=1, oob_score=False, random_state=None,\n", " verbose=0, warm_start=False)" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "rf = sklearn.ensemble.RandomForestClassifier(n_estimators=500)\n", "rf.fit(train, labels_train)\n" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0.96666666666666667" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ 
"sklearn.metrics.accuracy_score(labels_test, rf.predict(test))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create the explainer" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As opposed to lime_text.LimeTextExplainer, tabular explainers need a training set. This is because we compute statistics on each feature (column). If the feature is numerical, we compute the mean and std, and discretize it into quartiles. If the feature is categorical, we compute the frequency of each value. For this tutorial, we'll only look at numerical features." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We use these computed statistics for two things:\n", "1. To scale the data, so that we can meaningfully compute distances when the attributes are not on the same scale\n", "2. To sample perturbed instances - which we do by sampling from a Normal(0,1), multiplying by the std and adding back the mean.\n" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [], "source": [ "explainer = lime.lime_tabular.LimeTabularExplainer(train, feature_names=iris.feature_names, class_names=iris.target_names, discretize_continuous=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Explaining an instance" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since this is a multi-class classification problem, we set the top_labels parameter, so that we only explain the top class." ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [], "source": [ "i = np.random.randint(0, test.shape[0])\n", "exp = explainer.explain_instance(test[i], rf.predict_proba, num_features=2, top_labels=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We now explain a single instance:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", "
\n", " \n", " \n", " " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "exp.show_in_notebook(show_table=True, show_all=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, there is a lot going on here. First, note that the row we are explaining is displayed on the right side, in table format. Since we had the show_all parameter set to False, only the features used in the explanation are displayed.\n", "\n", "The *value* column displays the original value for each feature.\n", "\n", "Note that LIME has discretized the features in the explanation. This is because we set discretize_continuous=True in the constructor (this is the default). Discretized features make for more intuitive explanations.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Checking the local linear approximation" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [], "source": [ "feature_index = lambda x: iris.feature_names.index(x)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Increasing petal width\n", "P(setosa) before: 1.0\n", "P(setosa) after: 0.48\n", "\n", "Increasing petal length\n", "P(setosa) before: 1.0\n", "P(setosa) after: 0.55\n", "\n", "Increasing both\n", "P(setosa) before: 1.0\n", "P(setosa) after: 0.03\n" ] } ], "source": [ "print('Increasing petal width')\n", "temp = test[i].copy()\n", "print('P(setosa) before:', rf.predict_proba(temp.reshape(1,-1))[0,0])\n", "temp[feature_index('petal width (cm)')] = 1.5\n", "print('P(setosa) after:', rf.predict_proba(temp.reshape(1,-1))[0,0])\n", "print()\n", "print('Increasing petal length')\n", "temp = test[i].copy()\n", "print('P(setosa) before:', rf.predict_proba(temp.reshape(1,-1))[0,0])\n", "temp[feature_index('petal length (cm)')] = 3.5\n", "print('P(setosa) after:', 
rf.predict_proba(temp.reshape(1,-1))[0,0])\n", "print()\n", "print('Increasing both')\n", "temp = test[i].copy()\n", "print('P(setosa) before:', rf.predict_proba(temp.reshape(1,-1))[0,0])\n", "temp[feature_index('petal width (cm)')] = 1.5\n", "temp[feature_index('petal length (cm)')] = 3.5\n", "print('P(setosa) after:', rf.predict_proba(temp.reshape(1,-1))[0,0])\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that both features had the impact we thought they would. The scale at which they need to be perturbed of course depends on the scale of the feature in the training set." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We now show all features, just for completeness:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", "
\n", " \n", " \n", " " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "exp.show_in_notebook(show_table=True, show_all=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Categorical features" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For this part, we will use the Mushroom dataset, which will be downloaded from [here](http://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data). The task is to predict whether a mushroom is edible or poisonous, based on categorical features." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Loading data" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false }, "outputs": [], "source": [ "data = np.genfromtxt('http://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data', delimiter=',', dtype='\n", " \n", " \n", "
\n", " \n", " \n", " " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "i = 137\n", "exp = explainer.explain_instance(test[i], predict_fn, num_features=5)\n", "exp.show_in_notebook()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now note that the explanations are based not only on features, but on feature-value pairs. For example, we are saying that odor=foul is indicative of a poisonous mushroom. In the context of a categorical feature, odor could take many other values (see below). Since we perturb each categorical feature by drawing samples according to the original training distribution, the way to interpret this is: if odor were not foul, on average, this prediction would be 0.24 less 'poisonous'. Let's check if this is the case." ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([u'almond', u'anise', u'creosote', u'fishy', u'foul', u'musty',\n", " u'none', u'pungent', u'spicy'], \n", " dtype='\n", " \n", " \n", "
\n", " \n", " \n", " " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "np.random.seed(1)\n", "i = 1653\n", "exp = explainer.explain_instance(test[i], predict_fn, num_features=5)\n", "exp.show_in_notebook(show_all=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that capital gain has a very high weight. This makes sense. Now let's see an example where the person has a capital gain below the mean:" ] }, { "cell_type": "code", "execution_count": 47, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", "
\n", " \n", " \n", " " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "i = 10\n", "exp = explainer.explain_instance(test[i], predict_fn, num_features=5)\n", "exp.show_in_notebook(show_all=False)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.11" } }, "nbformat": 4, "nbformat_minor": 0 }