{ "metadata": { "name": "05.2_application_to_text_mining" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "code", "collapsed": false, "input": [ "%pylab inline\n", "import pylab as pl\n", "import numpy as np\n", "\n", "# Some nice default configuration for plots\n", "pl.rcParams['figure.figsize'] = 10, 7.5\n", "pl.rcParams['axes.grid'] = True\n", "pl.gray()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Text Feature Extraction for Classification and Clustering" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Outline of this section:\n", "\n", "- Turn a corpus of text documents into **feature vectors** using a **Bag of Words** representation,\n", "- Train a simple text classifier on the feature vectors,\n", "- Wrap the vectorizer and the classifier with a **pipeline**,\n", "- Cross-validation and (manual) **model selection** on the pipeline.\n", "- Qualitative **model inspection**" ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Text Classification in 20 lines of Python" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's start by implementing a canonical text classification example:\n", "\n", "- The 20 newsgroups dataset: around 18000 text posts from 20 newsgroups forums\n", "- Bag of Words features extraction with TF-IDF weighting\n", "- Naive Bayes classifier or Linear Support Vector Machine for the classifier itself" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.datasets import load_files\n", "from sklearn.feature_extraction.text import TfidfVectorizer\n", "from sklearn.naive_bayes import MultinomialNB\n", "\n", "# Load the text data\n", "categories = [\n", " 'alt.atheism',\n", " 'talk.religion.misc',\n", " 'comp.graphics',\n", " 'sci.space',\n", "]\n", "twenty_train_subset = load_files('datasets/20news-bydate-train/',\n", " categories=categories, charset='latin-1')\n", "twenty_test_subset = load_files('datasets/20news-bydate-test/',\n", " categories=categories, charset='latin-1')\n", "\n", "# Turn the text documents into vectors of word frequencies\n", "vectorizer = TfidfVectorizer(min_df=2)\n", "X_train = vectorizer.fit_transform(twenty_train_subset.data)\n", "y_train = twenty_train_subset.target\n", "\n", "# Fit a classifier on the training set\n", "classifier = MultinomialNB().fit(X_train, y_train)\n", "print(\"Training score: {0:.1f}%\".format(\n", " classifier.score(X_train, y_train) * 100))\n", "\n", "# Evaluate the classifier on the testing set\n", "X_test = vectorizer.transform(twenty_test_subset.data)\n", "y_test = twenty_test_subset.target\n", "print(\"Testing score: {0:.1f}%\".format(\n", " classifier.score(X_test, y_test) * 100))" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here is a workflow diagram summary of what happened previously:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from IPython.core.display import Image, display\n", "display(Image(filename='figures/supervised_scikit_learn.png'))" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's now decompose what we just did to understand and customize each step:" ] }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "Loading the Dataset" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's explore the dataset loading utility without passing a list of 
categories: in this case we load the full 20 newsgroups dataset in memory. The source website for the 20 newsgroups already provides a date-based train / test split, which is materialized on disk as two separate folders, `20news-bydate-train` and `20news-bydate-test`: " ] }, { "cell_type": "code", "collapsed": false, "input": [ "ls -l datasets/" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "ls -lh datasets/20news-bydate-train" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "ls -lh datasets/20news-bydate-train/alt.atheism/" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `load_files` function can load text files from a two-level folder structure, using the folder names as the category names:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "#print(load_files.__doc__)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "all_twenty_train = load_files('datasets/20news-bydate-train/',\n", "    charset='latin-1', random_state=42)\n", "all_twenty_test = load_files('datasets/20news-bydate-test/',\n", "    charset='latin-1', random_state=42)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "all_target_names = all_twenty_train.target_names\n", "all_target_names" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "all_twenty_train.target" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "all_twenty_train.target.shape" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "all_twenty_test.target.shape" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "len(all_twenty_train.data)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "type(all_twenty_train.data[0])" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "def display_sample(i, dataset):\n", "    print(\"Class name: \" + dataset.target_names[dataset.target[i]])\n", "    print(\"Text content:\\n\")\n", "    print(dataset.data[i])" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "display_sample(0, all_twenty_train)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "display_sample(1, all_twenty_train)" ], "language": "python", "metadata": {}, "outputs": [] }
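, { "cell_type": "markdown", "metadata": {}, "source": [ "Before extracting any features it can be useful to check how the posts are distributed across the 20 categories. This is a quick sketch using `numpy` only; the exact counts depend on the downloaded archive:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# Count the number of training posts in each of the 20 categories\n", "counts = np.bincount(all_twenty_train.target)\n", "for target_name, count in zip(all_target_names, counts):\n", "    print(\"{0:<30} {1}\".format(target_name, count))" ], "language": "python", "metadata": {}, "outputs": [] }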
] }, { "cell_type": "code", "collapsed": false, "input": [ "def text_size(text, charset='iso-8859-1'):\n", " return len(text.encode(charset)) * 8 * 1e-6\n", "\n", "train_size_mb = sum(text_size(text) for text in all_twenty_train.data) \n", "test_size_mb = sum(text_size(text) for text in all_twenty_test.data)\n", "\n", "print(\"Training set size: {0} MB\".format(int(train_size_mb)))\n", "print(\"Testing set size: {0} MB\".format(int(test_size_mb)))" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we only consider a small subset of the 4 categories selected from the initial example:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "train_subset_size_mb = sum(text_size(text) for text in twenty_train_subset.data) \n", "test_subset_size_mb = sum(text_size(text) for text in twenty_test_subset.data)\n", "\n", "print(\"Training set size: {0} MB\".format(int(train_subset_size_mb)))\n", "print(\"Testing set size: {0} MB\".format(int(test_subset_size_mb)))" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "Extracting Text Features" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.feature_extraction.text import TfidfVectorizer\n", "\n", "TfidfVectorizer()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "vectorizer = TfidfVectorizer(min_df=1)\n", "\n", "%time X_train = vectorizer.fit_transform(twenty_train_subset.data)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The results is not a `numpy.array` but instead a `scipy.sparse` matrix. This datastructure is quite similar to a 2D numpy array but it does not store the zeros." 
] }, { "cell_type": "code", "collapsed": false, "input": [ "X_train" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "scipy.sparse matrices also have a shape attribute to access the dimensions:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "n_samples, n_features = X_train.shape" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This dataset has around 2000 samples (the rows of the data matrix):" ] }, { "cell_type": "code", "collapsed": false, "input": [ "n_samples" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is the same value as the number of strings in the original list of text documents:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "len(twenty_train_subset.data)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The columns represent the individual token occurrences:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "n_features" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This number is the size of the vocabulary of the model extracted during fit in a Python dictionary:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "type(vectorizer.vocabulary_)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "len(vectorizer.vocabulary_)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The keys of the `vocabulary_` attribute are also called feature names and can be accessed as a list of strings." 
] }, { "cell_type": "code", "collapsed": false, "input": [ "len(vectorizer.get_feature_names())" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here are the first 10 elements (sorted in lexicographical order):" ] }, { "cell_type": "code", "collapsed": false, "input": [ "vectorizer.get_feature_names()[:10]" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's have a look at the features from the middle:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "vectorizer.get_feature_names()[n_features / 2:n_features / 2 + 10]" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In adition to the text of the documents that has been vectorized one also has access to the label information:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "y_train = twenty_train_subset.target\n", "target_names = twenty_train_subset.target_names\n", "target_names" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "y_train.shape" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "y_train" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "# We can shape that we have the same number of samples for the input data and the labels:\n", "X_train.shape[0] == y_train.shape[0]" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we have extracted a vector representation of the data, it's a good idea to project the data on the first 2D of a Principal Component Analysis to get a feel of the data. Note that the `RandomizedPCA` class can accept `scipy.sparse` matrices as input (as an alternative to numpy arrays):" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.decomposition import RandomizedPCA\n", "\n", "%time X_train_pca = RandomizedPCA(n_components=2).fit_transform(X_train)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "from itertools import cycle\n", "\n", "colors = ['b', 'g', 'r', 'c', 'm', 'y', 'k']\n", "for i, c in zip(np.unique(y_train), cycle(colors)):\n", " pl.scatter(X_train_pca[y_train == i, 0],\n", " X_train_pca[y_train == i, 1],\n", " c=c, label=target_names[i], alpha=0.5)\n", " \n", "_ = pl.legend(loc='best')" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can observe that there is a large overlap of the samples from different categories. This is to be expected as the PCA linear projection projects data from a 34118 dimensional space down to 2 dimensions: data that is linearly separable in 34118D is often no longer linearly separable in 2D.\n", " \n", "Still we can notice an interesting pattern: the newsgroups on religion and atheism occupy the much the same region and computer graphics and space science / space overlap more together than they do with the religion or atheism newsgroups." 
] }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "Training a Classifier on Text Features" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can now train a classifier, for instance a Multinomial Naive Bayesian classifier which is a fast baseline for text classification tasks:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.naive_bayes import MultinomialNB\n", "\n", "clf = MultinomialNB(alpha=0.1)\n", "clf" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "clf.fit(X_train, y_train)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can now evaluate the classifier on the testing set. Let's first use the builtin score function, which is the rate of correct classification in the test set:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "X_test = vectorizer.transform(twenty_test_subset.data)\n", "y_test = twenty_test_subset.target" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "X_test.shape" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "y_test.shape" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "clf.score(X_test, y_test)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also compute the score on the test set and observe that the model is both overfitting and underfitting a bit at the same time:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "clf.score(X_train, y_train)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "Introspecting the Behavior of the Text Vectorizer" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The text vectorizer has many parameters to customize it's behavior, in particular how it extracts tokens:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "TfidfVectorizer()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "print(TfidfVectorizer.__doc__)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The easiest way to introspect what the vectorizer is actually doing for a given test of parameters is call the `vectorizer.build_analyzer()` to get an instance of the text analyzer it uses to process the text:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "analyzer = TfidfVectorizer().build_analyzer()\n", "analyzer(\"I love scikit-learn: this is a cool Python lib!\")" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can notice that all the tokens are lowercase, that the single letter word \"I\" was dropped, and that hyphenation is used. 
, { "cell_type": "code", "collapsed": false, "input": [ "analyzer = TfidfVectorizer(\n", "    preprocessor=lambda text: text,    # disable lowercasing\n", "    token_pattern=ur'(?u)\\b[\\w-]+\\b',  # treat hyphen as a letter\n", "                                       # do not exclude single letter tokens\n", ").build_analyzer()\n", "\n", "analyzer(\"I love scikit-learn: this is a cool Python lib!\")" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The analyzer name comes from the Lucene parlance: it wraps the sequential application of:\n", "\n", "- text preprocessing (processing the text documents as a whole, e.g. lowercasing),\n", "- text tokenization (splitting the document into a sequence of tokens),\n", "- token filtering and recombination (e.g. n-grams extraction, see later).\n", "\n", "The analyzer system of scikit-learn is much more basic than Lucene's, though." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Exercise**:\n", "\n", "- Write a pre-processor callable (e.g. a Python function) to remove the headers from the text of a newsgroup post.\n", "- Vectorize the data again and measure the impact on performance of removing the header info from the dataset.\n", "- Do you expect the performance of the model to improve or decrease? What is the score of a uniform random classifier on the same dataset?\n", "\n", "Hint: the `TfidfVectorizer` class can accept Python functions to customize the `preprocessor`, `tokenizer` or `analyzer` stages of the vectorizer (a minimal example of this plumbing is sketched below).\n", "\n", "- type `TfidfVectorizer()` alone in a cell to see the default values of the parameters,\n", "\n", "- type `TfidfVectorizer.__doc__` to print the documentation of the constructor parameters, or use the `?` suffix operator on any Python class or method to read its docstring, or even the `??` operator to read the source code." ] }
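, { "cell_type": "markdown", "metadata": {}, "source": [ "Before looking at the solution, here is a minimal sketch of the plumbing only (it does not solve the exercise): any Python callable that maps a document string to another string can be passed as the `preprocessor` argument, and it will be applied to each document before tokenization. The `to_lower` function and the sample document below are just illustrative placeholders:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# Plumbing only: a custom preprocessor is any callable mapping a document to a string\n", "def to_lower(document):\n", "    return document.lower()\n", "\n", "custom_analyzer = TfidfVectorizer(preprocessor=to_lower).build_analyzer()\n", "custom_analyzer(\"From: someone@example.com\\n\\nHello World!\")" ], "language": "python", "metadata": {}, "outputs": [] }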
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Solution**:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# %load solutions/05B_strip_headers.py" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Assembling a Pipeline for Model Validation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The feature extraction class has many options to customize its behavior:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "print(TfidfVectorizer.__doc__)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In order to evaluate the impact of the parameters of the feature extraction one can chain a configured feature extraction and linear classifier (as an alternative to the naive Bayes model:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.linear_model import PassiveAggressiveClassifier\n", "from sklearn.pipeline import Pipeline\n", "\n", "pipeline = Pipeline((\n", " ('vec', TfidfVectorizer(min_df=1, max_df=0.8, use_idf=True)),\n", " ('clf', PassiveAggressiveClassifier(C=1)),\n", "))" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Such a pipeline can then be used to evaluate the performance on the test set:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "pipeline.fit(twenty_train_subset.data, twenty_train_subset.target)\n", "\n", "print(\"Train score:\")\n", "print(pipeline.score(twenty_train_subset.data, twenty_train_subset.target))\n", "print(\"Test score:\")\n", "print(pipeline.score(twenty_test_subset.data, twenty_test_subset.target))" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Introspecting Model Performance" ] }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "Displaying the Most Discriminative Features" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's collect info on the fitted components of the previously trained model:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "vec_name, vec = pipeline.steps[0]\n", "clf_name, clf = pipeline.steps[1]\n", "\n", "feature_names = vec.get_feature_names()\n", "\n", "feature_weights = clf.coef_\n", "\n", "feature_weights.shape" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By sorting the feature weights on the linear model and asking the vectorizer what their names is, one can get a clue on what the model did actually learn on the data:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "def display_important_features(feature_names, target_names, weights, n_top=30):\n", " for i, target_name in enumerate(target_names):\n", " print(\"Class: \" + target_name)\n", " print(\"\")\n", " \n", " sorted_features_indices = weights[i].argsort()[::-1]\n", " \n", " most_important = sorted_features_indices[:n_top]\n", " print(\", \".join(\"{0}: {1:.4f}\".format(feature_names[j], weights[i, j])\n", " for j in most_important))\n", " print(\"...\")\n", " \n", " least_important = sorted_features_indices[-n_top:]\n", " print(\", \".join(\"{0}: {1:.4f}\".format(feature_names[j], weights[i, j])\n", " for j in least_important))\n", " print(\"\")\n", " \n", "display_important_features(feature_names, target_names, feature_weights)" ], "language": "python", "metadata": {}, "outputs": [] }, { 
"cell_type": "heading", "level": 3, "metadata": {}, "source": [ "Displaying the per-class Classification Reports" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.metrics import classification_report\n", "\n", "predicted = pipeline.predict(twenty_test_subset.data)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "print(classification_report(twenty_test_subset.target, predicted,\n", " target_names=target_names))" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "Printing the Confusion Matrix" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The confusion matrix summarize which class where by having a look at off-diagonal entries: here we can see that articles about atheism have been wrongly classified as being about religion 57 times for instance: " ] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.metrics import confusion_matrix\n", "\n", "confusion_matrix(twenty_test_subset.target, predicted)" ], "language": "python", "metadata": {}, "outputs": [] } ], "metadata": {} } ] }