{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# TextExplainer: debugging black-box text classifiers\n", "\n", "While eli5 supports many classifiers and preprocessing methods, it can't support them all. \n", "\n", "If a library is not supported by eli5 directly, or the text processing pipeline is too complex for eli5, eli5 can still help: it provides an implementation of the [LIME](http://arxiv.org/abs/1602.04938) algorithm (Ribeiro et al., 2016), which can explain predictions of arbitrary classifiers, including text classifiers. `eli5.lime` can also help when it is hard to get an exact mapping between model coefficients and text features, e.g. when dimensionality reduction is involved.\n", "\n", "## Example problem: LSA+SVM for 20 Newsgroups dataset\n", "\n", "Let's load the \"20 Newsgroups\" dataset and create a text processing pipeline that is hard to debug using conventional methods: an SVM with an RBF kernel trained on [LSA](https://en.wikipedia.org/wiki/Latent_semantic_analysis) features." 
] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from sklearn.datasets import fetch_20newsgroups\n", "\n", "categories = ['alt.atheism', 'soc.religion.christian', \n", " 'comp.graphics', 'sci.med']\n", "twenty_train = fetch_20newsgroups(\n", " subset='train',\n", " categories=categories,\n", " shuffle=True,\n", " random_state=42,\n", " remove=('headers', 'footers'),\n", ")\n", "twenty_test = fetch_20newsgroups(\n", " subset='test',\n", " categories=categories,\n", " shuffle=True,\n", " random_state=42,\n", " remove=('headers', 'footers'),\n", ")" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false, "scrolled": false }, "outputs": [ { "data": { "text/plain": [ "0.89014647137150471" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.feature_extraction.text import TfidfVectorizer\n", "from sklearn.svm import SVC\n", "from sklearn.decomposition import TruncatedSVD\n", "from sklearn.pipeline import Pipeline, make_pipeline\n", "\n", "vec = TfidfVectorizer(min_df=3, stop_words='english',\n", " ngram_range=(1, 2))\n", "svd = TruncatedSVD(n_components=100, n_iter=7, random_state=42)\n", "lsa = make_pipeline(vec, svd)\n", "\n", "clf = SVC(C=150, gamma=2e-2, probability=True)\n", "pipe = make_pipeline(lsa, clf)\n", "pipe.fit(twenty_train.data, twenty_train.target)\n", "pipe.score(twenty_test.data, twenty_test.target)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The dimension of the input documents is reduced to 100, and then a kernel SVM is used to classify the documents. 
\n", "\n", "This is what the pipeline returns for a document. It is quite confident that the first message in the test data belongs to sci.med:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.001 alt.atheism\n", "0.001 comp.graphics\n", "0.995 sci.med\n", "0.004 soc.religion.christian\n" ] } ], "source": [ "def print_prediction(doc):\n", "    y_pred = pipe.predict_proba([doc])[0]\n", "    for target, prob in zip(twenty_train.target_names, y_pred):\n", "        print(\"{:.3f} {}\".format(prob, target))\n", "\n", "doc = twenty_test.data[0]\n", "print_prediction(doc)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## TextExplainer\n", "Such pipelines are not supported by eli5 directly, but one can use `eli5.lime.TextExplainer` to debug the prediction, i.e. to check which parts of the document were important for this decision.\n", "\n", "Create a `TextExplainer` instance, then pass the document to explain and a black-box classifier (a function that returns probabilities) to the `TextExplainer.fit` method, then check the explanation:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n", " \n", "\n", "\n", "\n", " \n", "\n", " \n", "\n", " \n", "\n", " \n", "\n", " \n", "\n", " \n", "\n", "\n", " \n", "\n", " \n", "\n", " \n", "\n", " \n", " \n", "\n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n", " \n", " \n", " y=alt.atheism\n", " \n", "\n", "\n", " \n", " (probability 0.000, score -9.583)\n", "\n", "top features\n", "
\n", " \n", "\n", " Contribution?\n", " | \n", " \n", "Feature | \n", " \n", "
---|---|
\n", " -0.360\n", " | \n", "\n", " <BIAS>\n", " | \n", " \n", "
\n", " -9.223\n", " | \n", "\n", " Highlighted in text (sum)\n", " | \n", " \n", "
\n", " as i recall from my bout with kidney stones, there isn't any\n", "medication that can do anything about them except relieve the pain.\n", "\n", "either they pass, or they have to be broken up with sound, or they have\n", "to be extracted surgically.\n", "\n", "when i was in, the x-ray tech happened to mention that she'd had kidney\n", "stones and children, and the childbirth hurt less.\n", "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " y=comp.graphics\n", " \n", "\n", "\n", " \n", " (probability 0.000, score -8.285)\n", "\n", "top features\n", "
\n", " \n", "\n", " Contribution?\n", " | \n", " \n", "Feature | \n", " \n", "
---|---|
\n", " -0.213\n", " | \n", "\n", " <BIAS>\n", " | \n", " \n", "
\n", " -8.073\n", " | \n", "\n", " Highlighted in text (sum)\n", " | \n", " \n", "
\n", " as i recall from my bout with kidney stones, there isn't any\n", "medication that can do anything about them except relieve the pain.\n", "\n", "either they pass, or they have to be broken up with sound, or they have\n", "to be extracted surgically.\n", "\n", "when i was in, the x-ray tech happened to mention that she'd had kidney\n", "stones and children, and the childbirth hurt less.\n", "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " y=sci.med\n", " \n", "\n", "\n", " \n", " (probability 0.996, score 5.846)\n", "\n", "top features\n", "
\n", " \n", "\n", " Contribution?\n", " | \n", " \n", "Feature | \n", " \n", "
---|---|
\n", " +5.959\n", " | \n", "\n", " Highlighted in text (sum)\n", " | \n", " \n", "
\n", " -0.113\n", " | \n", "\n", " <BIAS>\n", " | \n", " \n", "
\n", " as i recall from my bout with kidney stones, there isn't any\n", "medication that can do anything about them except relieve the pain.\n", "\n", "either they pass, or they have to be broken up with sound, or they have\n", "to be extracted surgically.\n", "\n", "when i was in, the x-ray tech happened to mention that she'd had kidney\n", "stones and children, and the childbirth hurt less.\n", "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " y=soc.religion.christian\n", " \n", "\n", "\n", " \n", " (probability 0.004, score -5.484)\n", "\n", "top features\n", "
\n", " \n", "\n", " Contribution?\n", " | \n", " \n", "Feature | \n", " \n", "
---|---|
\n", " -0.346\n", " | \n", "\n", " <BIAS>\n", " | \n", " \n", "
\n", " -5.137\n", " | \n", "\n", " Highlighted in text (sum)\n", " | \n", " \n", "
\n", " as i recall from my bout with kidney stones, there isn't any\n", "medication that can do anything about them except relieve the pain.\n", "\n", "either they pass, or they have to be broken up with sound, or they have\n", "to be extracted surgically.\n", "\n", "when i was in, the x-ray tech happened to mention that she'd had kidney\n", "stones and children, and the childbirth hurt less.\n", "
\n", "\n", " \n", "\n", "\n", " \n", "\n", " \n", "\n", " \n", "\n", "\n", " \n", "\n", " \n", "\n", " \n", "\n", " \n", "\n", " \n", "\n", " \n", "\n", "\n", " \n", "\n", " \n", "\n", " \n", "\n", " \n", "\n", " \n", "\n", " \n", "\n", "\n", "\n" ], "text/plain": [ "\n", " \n", " \n", " y=sci.med\n", " \n", "\n", "\n", " \n", " (probability 0.998, score 6.380)\n", "\n", "top features\n", "
\n", " \n", "\n", " Contribution?\n", " | \n", " \n", "Feature | \n", " \n", "
---|---|
\n", " +6.445\n", " | \n", "\n", " Highlighted in text (sum)\n", " | \n", " \n", "
\n", " -0.065\n", " | \n", "\n", " <BIAS>\n", " | \n", " \n", "
\n", " as i recall from my bout with kidney stones, there isn't any\n", "medication that can do anything about them except relieve the pain.\n", "\n", "either they pass, or they have to be broken up with sound, or they have\n", "to be extracted surgically.\n", "\n", "when i was in, the x-ray tech happened to mention that she'd had kidney\n", "stones and children, and the childbirth hurt less.\n", "
\n", "\n", "\n", " \n", "\n", " \n", "\n", " \n", "\n", " \n", "\n", "\n", " \n", "\n", " \n", "\n", " \n", "\n", " \n", "\n", " \n", "\n", " \n", "\n", "\n", " \n", "\n", " \n", "\n", " \n", "\n", " \n", "\n", " \n", "\n", " \n", "\n", "\n", "\n" ], "text/plain": [ "\n", " \n", " \n", " y=sci.med\n", " \n", "\n", "\n", " \n", " (probability 0.997, score 5.654)\n", "\n", "top features\n", "
\n", " \n", "\n", " Contribution?\n", " | \n", " \n", "Feature | \n", " \n", "
---|---|
\n", " +8.864\n", " | \n", "\n", " countvectorizer: Highlighted in text (sum)\n", " | \n", " \n", "
\n", " -0.083\n", " | \n", "\n", " <BIAS>\n", " | \n", " \n", "
\n", " -3.128\n", " | \n", "\n", " doclength__is_even\n", " | \n", " \n", "
\n", " countvectorizer: as i recall from my bout with kidney stones, there isn't any\n", "medication that can do anything about them except relieve the pain.\n", "\n", "either they pass, or they have to be broken up with sound, or they have\n", "to be extracted surgically.\n", "\n", "when i was in, the x-ray tech happened to mention that she'd had kidney\n", "stones and children, and the childbirth hurt less\n", "
\n", "\n", "\n", " \n", "\n", " \n", "\n", " \n", "\n", " \n", "\n", "\n", " \n", "\n", " \n", "\n", " \n", "\n", " \n", "\n", " \n", "\n", " \n", "\n", "\n", " \n", "\n", " \n", "\n", " \n", "\n", " \n", "\n", " \n", "\n", " \n", "\n", "\n", "\n" ], "text/plain": [ "Explanation(estimator=\"SGDClassifier(alpha=0.001, average=False, class_weight=None, epsilon=0.1,\\n eta0=0.0, fit_intercept=True, l1_ratio=0.15,\\n learning_rate='optimal', loss='log', n_iter=5, n_jobs=1,\\n penalty='elasticnet', power_t=0.5,\\n random_state=\n", " \n", " \n", " y=sci.med\n", " \n", "\n", "\n", " \n", " (probability 0.572, score -0.116)\n", "\n", "top features\n", "
\n", " \n", "\n", " Contribution?\n", " | \n", " \n", "Feature | \n", " \n", "
---|---|
\n", " +0.880\n", " | \n", "\n", " Highlighted in text (sum)\n", " | \n", " \n", "
\n", " -0.995\n", " | \n", "\n", " <BIAS>\n", " | \n", " \n", "
\n", " as i recall from my bout with kidney stones, there isn't any\n", "medication that can do anything about them except relieve the pain. either they pass, or they have to be broken up with sound, or they have\n", "to be extracted surgically. when i was in, the x-ray tech happened to mention that she'd had kidney\n", "stones and children, and the childbirth hurt less.\n", "
\n", "\n", "\n", " \n", "\n", " \n", "\n", " \n", "\n", " \n", "\n", "\n", " \n", "\n", " \n", "\n", " \n", "\n", " \n", "\n", " \n", "\n", " \n", "\n", "\n", " \n", "\n", " \n", "\n", " \n", "\n", " \n", "\n", " \n", "\n", " \n", "\n", "\n", "\n" ], "text/plain": [ "\n", " \n", " \n", " y=sci.med\n", " \n", "\n", "\n", " \n", " (probability 0.564, score 0.602)\n", "\n", "top features\n", "
\n", " \n", "\n", " Contribution?\n", " | \n", " \n", "Feature | \n", " \n", "
---|---|
\n", " +0.982\n", " | \n", "\n", " Highlighted in text (sum)\n", " | \n", " \n", "
\n", " -0.380\n", " | \n", "\n", " <BIAS>\n", " | \n", " \n", "
\n", " as i recall from my bout with kidney stones, there isn't any\n", "medication that can do anything about them except relieve the pain.\n", "\n", "either they pass, or they have to be broken up with sound, or they have\n", "to be extracted surgically.\n", "\n", "when i was in, the x-ray tech happened to mention that she'd had kidney\n", "stones and children, and the childbirth hurt less.\n", "
\n", "\n", "\n", " \n", "\n", " \n", "\n", " \n", "\n", " \n", "\n", "\n", " \n", "\n", " \n", "\n", " \n", "\n", " \n", "\n", " \n", "\n", " \n", "\n", "\n", " \n", "\n", " \n", "\n", " \n", "\n", " \n", "\n", " \n", "\n", " \n", "\n", "\n", "\n" ], "text/plain": [ "\n", " \n", " \n", " y=sci.med\n", " \n", "\n", "\n", " \n", " (probability 0.360, score 0.043)\n", "\n", "top features\n", "
\n", " \n", "\n", " Contribution?\n", " | \n", " \n", "Feature | \n", " \n", "
---|---|
\n", " +0.241\n", " | \n", "\n", " Highlighted in text (sum)\n", " | \n", " \n", "
\n", " -0.198\n", " | \n", "\n", " <BIAS>\n", " | \n", " \n", "
\n", " as i recall from my bout with kidney stones, there isn't any\n", "medication that can do anything about them except relieve the pain. either they pass, or they have to be broken up with sound, or they have\n", "to be extracted surgically. when i was in, the x-ray tech happened to mention that she'd had kidney\n", "stones and children, and the childbirth hurt less.\n", "
\n", "\n", "\n", " \n", "\n", " \n", "\n", " \n", "\n", " \n", "\n", "\n", " \n", "\n", " \n", "\n", " \n", "\n", " \n", "\n", " \n", "\n", " \n", "\n", "\n", " \n", "\n", " \n", "\n", " \n", "\n", " \n", "\n", " \n", "\n", " \n", "\n", "\n", "\n" ], "text/plain": [ "\n", " \n", " \n", " y=sci.med\n", " \n", "\n", "\n", " \n", " (probability 0.648, score 0.749)\n", "\n", "top features\n", "
\n", " \n", "\n", " Contribution?\n", " | \n", " \n", "Feature | \n", " \n", "
---|---|
\n", " +0.962\n", " | \n", "\n", " Highlighted in text (sum)\n", " | \n", " \n", "
\n", " -0.213\n", " | \n", "\n", " <BIAS>\n", " | \n", " \n", "
\n", " as i recall from my bout with kidney stones, there isn't any\n", "medication that can do anything about them except relieve the pain. either they pass, or they have to be broken up with sound, or they have\n", "to be extracted surgically. when i was in, the x-ray tech happened to mention that she'd had kidney\n", "stones and children, and the childbirth hurt less.\n", "
\n", "\n", "\n", " \n", "\n", " \n", "\n", " \n", "\n", " \n", "\n", "\n", " \n", "\n", " \n", "\n", " \n", "\n", " \n", "\n", " \n", "\n", " \n", "\n", "\n", " \n", "\n", " \n", "\n", " \n", "\n", " \n", "\n", " \n", "\n", " \n", "\n", "\n", "\n" ], "text/plain": [ "\n", " \n", " \n", " y=sci.med\n", " \n", "\n", "\n", " \n", " (probability 0.970, score 4.522)\n", "\n", "top features\n", "
\n", " \n", "\n", " Contribution?\n", " | \n", " \n", "Feature | \n", " \n", "
---|---|
\n", " +4.512\n", " | \n", "\n", " Highlighted in text (sum)\n", " | \n", " \n", "
\n", " +0.010\n", " | \n", "\n", " <BIAS>\n", " | \n", " \n", "
\n", " as i recall from my bout with kidney stones, there isn't any\n", "medication that can do anything about them except relieve the pain. either they pass, or they have to be broken up with sound, or they have\n", "to be extracted surgically. when i was in, the x-ray tech happened to mention that she'd had kidney\n", "stones and children, and the childbirth hurt less.\n", "
\n", "\n", "\n", " \n", "\n", " \n", "\n", " \n", "\n", " \n", "\n", "\n", " \n", "\n", " \n", "\n", " \n", "\n", " \n", "\n", " \n", "\n", " \n", "\n", "\n", " \n", "\n", " \n", "\n", " \n", "\n", " \n", "\n", " \n", "\n", " \n", "\n", "\n", "\n" ], "text/plain": [ "Weight | \n", "Feature | \n", "
---|---|
\n", " 0.5449\n", " \n", " | \n", "\n", " kidney\n", " | \n", "
\n", " 0.4551\n", " \n", " | \n", "\n", " pain\n", " | \n", "
\n", "\n", " \n", "\n", "\n", "\n" ], "text/plain": [ "