{ "metadata": { "name": "05.2_application_to_text_mining" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "code", "collapsed": false, "input": [ "%pylab inline\n", "import pylab as pl\n", "import numpy as np\n", "\n", "# Some nice default configuration for plots\n", "pl.rcParams['figure.figsize'] = 10, 7.5\n", "pl.rcParams['axes.grid'] = True\n", "pl.gray()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Text Feature Extraction for Classification and Clustering" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Outline of this section:\n", "\n", "- Turn a corpus of text documents into **feature vectors** using a **Bag of Words** representation,\n", "- Train a simple text classifier on the feature vectors,\n", "- Wrap the vectorizer and the classifier with a **pipeline**,\n", "- Cross-validation and (manual) **model selection** on the pipeline.\n", "- Qualitative **model inspection**" ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Text Classification in 20 lines of Python" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's start by implementing a canonical text classification example:\n", "\n", "- The 20 newsgroups dataset: around 18000 text posts from 20 newsgroups forums\n", "- Bag of Words features extraction with TF-IDF weighting\n", "- Naive Bayes classifier or Linear Support Vector Machine for the classifier itself" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.datasets import load_files\n", "from sklearn.feature_extraction.text import TfidfVectorizer\n", "from sklearn.naive_bayes import MultinomialNB\n", "\n", "# Load the text data\n", "categories = [\n", " 'alt.atheism',\n", " 'talk.religion.misc',\n", " 'comp.graphics',\n", " 'sci.space',\n", "]\n", "twenty_train_subset = load_files('datasets/20news-bydate-train/',\n", " categories=categories, charset='latin-1')\n", "twenty_test_subset = load_files('datasets/20news-bydate-test/',\n", " categories=categories, charset='latin-1')\n", "\n", "# Turn the text documents into vectors of word frequencies\n", "vectorizer = TfidfVectorizer(min_df=2)\n", "X_train = vectorizer.fit_transform(twenty_train_subset.data)\n", "y_train = twenty_train_subset.target\n", "\n", "# Fit a classifier on the training set\n", "classifier = MultinomialNB().fit(X_train, y_train)\n", "print(\"Training score: {0:.1f}%\".format(\n", " classifier.score(X_train, y_train) * 100))\n", "\n", "# Evaluate the classifier on the testing set\n", "X_test = vectorizer.transform(twenty_test_subset.data)\n", "y_test = twenty_test_subset.target\n", "print(\"Testing score: {0:.1f}%\".format(\n", " classifier.score(X_test, y_test) * 100))" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here is a workflow diagram summary of what happened previously:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from IPython.core.display import Image, display\n", "display(Image(filename='figures/supervised_scikit_learn.png'))" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's now decompose what we just did to understand and customize each step:" ] }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "Loading the Dataset" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's explore the dataset loading utility without passing a list of 
categories: in this case we load the full 20 newsgroups dataset in memory. The source website for the 20 newsgroups already provides a date-based train / test split, which is materialized on disk as two separate folders, `20news-bydate-train` and `20news-bydate-test`: " ] }, { "cell_type": "code", "collapsed": false, "input": [ "ls -l datasets/" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "ls -lh datasets/20news-bydate-train" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "ls -lh datasets/20news-bydate-train/alt.atheism/" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `load_files` function can load text files from a two-level folder structure, using the folder names as the category names:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "#print(load_files.__doc__)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "all_twenty_train = load_files('datasets/20news-bydate-train/',\n", "    charset='latin-1', random_state=42)\n", "all_twenty_test = load_files('datasets/20news-bydate-test/',\n", "    charset='latin-1', random_state=42)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "all_target_names = all_twenty_train.target_names\n", "all_target_names" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "all_twenty_train.target" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "all_twenty_train.target.shape" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "all_twenty_test.target.shape" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "len(all_twenty_train.data)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "type(all_twenty_train.data[0])" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "def display_sample(i, dataset):\n", "    print(\"Class name: \" + dataset.target_names[dataset.target[i]])\n", "    print(\"Text content:\\n\")\n", "    print(dataset.data[i])" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "display_sample(0, all_twenty_train)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "display_sample(1, all_twenty_train)" ], "language": "python", "metadata": {}, "outputs": [] }
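, { "cell_type": "markdown", "metadata": {}, "source": [ "Before extracting any features it can be useful to check how the posts are distributed across the 20 categories. This is a quick sketch using `numpy` only; the exact counts depend on the downloaded archive:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# Count the number of training posts in each of the 20 categories\n", "counts = np.bincount(all_twenty_train.target)\n", "for target_name, count in zip(all_target_names, counts):\n", "    print(\"{0:<30} {1}\".format(target_name, count))" ], "language": "python", "metadata": {}, "outputs": [] }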
] }, { "cell_type": "code", "collapsed": false, "input": [ "def text_size(text, charset='iso-8859-1'):\n", " return len(text.encode(charset)) * 8 * 1e-6\n", "\n", "train_size_mb = sum(text_size(text) for text in all_twenty_train.data) \n", "test_size_mb = sum(text_size(text) for text in all_twenty_test.data)\n", "\n", "print(\"Training set size: {0} MB\".format(int(train_size_mb)))\n", "print(\"Testing set size: {0} MB\".format(int(test_size_mb)))" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we only consider a small subset of the 4 categories selected from the initial example:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "train_subset_size_mb = sum(text_size(text) for text in twenty_train_subset.data) \n", "test_subset_size_mb = sum(text_size(text) for text in twenty_test_subset.data)\n", "\n", "print(\"Training set size: {0} MB\".format(int(train_subset_size_mb)))\n", "print(\"Testing set size: {0} MB\".format(int(test_subset_size_mb)))" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "Extracting Text Features" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.feature_extraction.text import TfidfVectorizer\n", "\n", "TfidfVectorizer()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "vectorizer = TfidfVectorizer(min_df=1)\n", "\n", "%time X_train = vectorizer.fit_transform(twenty_train_subset.data)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The results is not a `numpy.array` but instead a `scipy.sparse` matrix. This datastructure is quite similar to a 2D numpy array but it does not store the zeros." 
] }, { "cell_type": "code", "collapsed": false, "input": [ "X_train" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "scipy.sparse matrices also have a shape attribute to access the dimensions:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "n_samples, n_features = X_train.shape" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This dataset has around 2000 samples (the rows of the data matrix):" ] }, { "cell_type": "code", "collapsed": false, "input": [ "n_samples" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is the same value as the number of strings in the original list of text documents:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "len(twenty_train_subset.data)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The columns represent the individual token occurrences:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "n_features" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This number is the size of the vocabulary of the model extracted during fit in a Python dictionary:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "type(vectorizer.vocabulary_)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "len(vectorizer.vocabulary_)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The keys of the `vocabulary_` attribute are also called feature names and can be accessed as a list of strings." 
] }, { "cell_type": "code", "collapsed": false, "input": [ "len(vectorizer.get_feature_names())" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here are the first 10 elements (sorted in lexicographical order):" ] }, { "cell_type": "code", "collapsed": false, "input": [ "vectorizer.get_feature_names()[:10]" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's have a look at the features from the middle:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "vectorizer.get_feature_names()[n_features / 2:n_features / 2 + 10]" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In adition to the text of the documents that has been vectorized one also has access to the label information:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "y_train = twenty_train_subset.target\n", "target_names = twenty_train_subset.target_names\n", "target_names" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "y_train.shape" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "y_train" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "# We can shape that we have the same number of samples for the input data and the labels:\n", "X_train.shape[0] == y_train.shape[0]" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we have extracted a vector representation of the data, it's a good idea to project the data on the first 2D of a Principal Component Analysis to get a feel of the data. Note that the `RandomizedPCA` class can accept `scipy.sparse` matrices as input (as an alternative to numpy arrays):" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.decomposition import RandomizedPCA\n", "\n", "%time X_train_pca = RandomizedPCA(n_components=2).fit_transform(X_train)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "from itertools import cycle\n", "\n", "colors = ['b', 'g', 'r', 'c', 'm', 'y', 'k']\n", "for i, c in zip(np.unique(y_train), cycle(colors)):\n", " pl.scatter(X_train_pca[y_train == i, 0],\n", " X_train_pca[y_train == i, 1],\n", " c=c, label=target_names[i], alpha=0.5)\n", " \n", "_ = pl.legend(loc='best')" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can observe that there is a large overlap of the samples from different categories. This is to be expected as the PCA linear projection projects data from a 34118 dimensional space down to 2 dimensions: data that is linearly separable in 34118D is often no longer linearly separable in 2D.\n", " \n", "Still we can notice an interesting pattern: the newsgroups on religion and atheism occupy the much the same region and computer graphics and space science / space overlap more together than they do with the religion or atheism newsgroups." 
] }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "Training a Classifier on Text Features" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can now train a classifier, for instance a Multinomial Naive Bayesian classifier which is a fast baseline for text classification tasks:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.naive_bayes import MultinomialNB\n", "\n", "clf = MultinomialNB(alpha=0.1)\n", "clf" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "clf.fit(X_train, y_train)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can now evaluate the classifier on the testing set. Let's first use the builtin score function, which is the rate of correct classification in the test set:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "X_test = vectorizer.transform(twenty_test_subset.data)\n", "y_test = twenty_test_subset.target" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "X_test.shape" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "y_test.shape" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "clf.score(X_test, y_test)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also compute the score on the test set and observe that the model is both overfitting and underfitting a bit at the same time:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "clf.score(X_train, y_train)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "Introspecting the Behavior of the Text Vectorizer" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The text vectorizer has many parameters to customize it's behavior, in particular how it extracts tokens:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "TfidfVectorizer()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "print(TfidfVectorizer.__doc__)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The easiest way to introspect what the vectorizer is actually doing for a given test of parameters is call the `vectorizer.build_analyzer()` to get an instance of the text analyzer it uses to process the text:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "analyzer = TfidfVectorizer().build_analyzer()\n", "analyzer(\"I love scikit-learn: this is a cool Python lib!\")" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can notice that all the tokens are lowercase, that the single letter word \"I\" was dropped, and that hyphenation is used. 
, { "cell_type": "code", "collapsed": false, "input": [ "analyzer = TfidfVectorizer(\n", "    preprocessor=lambda text: text,    # disable lowercasing\n", "    token_pattern=ur'(?u)\\b[\\w-]+\\b',  # treat hyphen as a letter\n", "                                       # do not exclude single letter tokens\n", ").build_analyzer()\n", "\n", "analyzer(\"I love scikit-learn: this is a cool Python lib!\")" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The analyzer name comes from the Lucene parlance: it wraps the sequential application of:\n", "\n", "- text preprocessing (processing the text documents as a whole, e.g. lowercasing),\n", "- text tokenization (splitting the document into a sequence of tokens),\n", "- token filtering and recombination (e.g. n-grams extraction, see later).\n", "\n", "The analyzer system of scikit-learn is much more basic than Lucene's, though." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Exercise**:\n", "\n", "- Write a pre-processor callable (e.g. a Python function) to remove the headers from the text of a newsgroup post.\n", "- Vectorize the data again and measure the impact on performance of removing the header info from the dataset.\n", "- Do you expect the performance of the model to improve or decrease? What is the score of a uniform random classifier on the same dataset?\n", "\n", "Hint: the `TfidfVectorizer` class can accept Python functions to customize the `preprocessor`, `tokenizer` or `analyzer` stages of the vectorizer (a minimal example of this plumbing is sketched below).\n", "\n", "- type `TfidfVectorizer()` alone in a cell to see the default values of the parameters,\n", "\n", "- type `TfidfVectorizer.__doc__` to print the documentation of the constructor parameters, or use the `?` suffix operator on any Python class or method to read its docstring, or even the `??` operator to read the source code." ] }
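, { "cell_type": "markdown", "metadata": {}, "source": [ "Before looking at the solution, here is a minimal sketch of the plumbing only (it does not solve the exercise): any Python callable that maps a document string to another string can be passed as the `preprocessor` argument, and it will be applied to each document before tokenization. The `to_lower` function and the sample document below are just illustrative placeholders:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# Plumbing only: a custom preprocessor is any callable mapping a document to a string\n", "def to_lower(document):\n", "    return document.lower()\n", "\n", "custom_analyzer = TfidfVectorizer(preprocessor=to_lower).build_analyzer()\n", "custom_analyzer(\"From: someone@example.com\\n\\nHello World!\")" ], "language": "python", "metadata": {}, "outputs": [] }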
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Solution**:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# %load solutions/05B_strip_headers.py" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Assembling a Pipeline for Model Validation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The feature extraction class has many options to customize its behavior:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "print(TfidfVectorizer.__doc__)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In order to evaluate the impact of the parameters of the feature extraction one can chain a configured feature extraction and linear classifier (as an alternative to the naive Bayes model:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.linear_model import PassiveAggressiveClassifier\n", "from sklearn.pipeline import Pipeline\n", "\n", "pipeline = Pipeline((\n", " ('vec', TfidfVectorizer(min_df=1, max_df=0.8, use_idf=True)),\n", " ('clf', PassiveAggressiveClassifier(C=1)),\n", "))" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Such a pipeline can then be used to evaluate the performance on the test set:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "pipeline.fit(twenty_train_subset.data, twenty_train_subset.target)\n", "\n", "print(\"Train score:\")\n", "print(pipeline.score(twenty_train_subset.data, twenty_train_subset.target))\n", "print(\"Test score:\")\n", "print(pipeline.score(twenty_test_subset.data, twenty_test_subset.target))" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Introspecting Model Performance" ] }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "Displaying the Most Discriminative Features" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's collect info on the fitted components of the previously trained model:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "vec_name, vec = pipeline.steps[0]\n", "clf_name, clf = pipeline.steps[1]\n", "\n", "feature_names = vec.get_feature_names()\n", "\n", "feature_weights = clf.coef_\n", "\n", "feature_weights.shape" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By sorting the feature weights on the linear model and asking the vectorizer what their names is, one can get a clue on what the model did actually learn on the data:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "def display_important_features(feature_names, target_names, weights, n_top=30):\n", " for i, target_name in enumerate(target_names):\n", " print(\"Class: \" + target_name)\n", " print(\"\")\n", " \n", " sorted_features_indices = weights[i].argsort()[::-1]\n", " \n", " most_important = sorted_features_indices[:n_top]\n", " print(\", \".join(\"{0}: {1:.4f}\".format(feature_names[j], weights[i, j])\n", " for j in most_important))\n", " print(\"...\")\n", " \n", " least_important = sorted_features_indices[-n_top:]\n", " print(\", \".join(\"{0}: {1:.4f}\".format(feature_names[j], weights[i, j])\n", " for j in least_important))\n", " print(\"\")\n", " \n", "display_important_features(feature_names, target_names, feature_weights)" ], "language": "python", "metadata": {}, "outputs": [] }, { 
"cell_type": "heading", "level": 3, "metadata": {}, "source": [ "Displaying the per-class Classification Reports" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.metrics import classification_report\n", "\n", "predicted = pipeline.predict(twenty_test_subset.data)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "print(classification_report(twenty_test_subset.target, predicted,\n", " target_names=target_names))" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "Printing the Confusion Matrix" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The confusion matrix summarize which class where by having a look at off-diagonal entries: here we can see that articles about atheism have been wrongly classified as being about religion 57 times for instance: " ] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.metrics import confusion_matrix\n", "\n", "confusion_matrix(twenty_test_subset.target, predicted)" ], "language": "python", "metadata": {}, "outputs": [] } ], "metadata": {} } ] }