{ "metadata": { "name": "03 - Text Feature Extraction for Classification and Clustering" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Text Feature Extraction for Classification and Clustering" ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Text Classification in 15 lines of Python" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's start by implementing a canonical text classification example:\n", "\n", "- The 20 newsgroups dataset: around 18,000 text posts from 20 newsgroup forums\n", "- Bag-of-words feature extraction with TF-IDF weighting\n", "- A Naive Bayes or linear Support Vector Machine classifier" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.datasets import fetch_20newsgroups\n", "from sklearn.feature_extraction.text import TfidfVectorizer\n", "from sklearn.naive_bayes import MultinomialNB\n", "\n", "twenty_train = fetch_20newsgroups(subset='train')\n", "twenty_test = fetch_20newsgroups(subset='test')\n", "\n", "vectorizer = TfidfVectorizer()\n", "X_train = vectorizer.fit_transform(twenty_train.data)\n", "y_train = twenty_train.target\n", "\n", "classifier = MultinomialNB().fit(X_train, y_train)\n", "print(\"Training score: {0:.1f}%\".format(\n", "    classifier.score(X_train, y_train) * 100))\n", "\n", "X_test = vectorizer.transform(twenty_test.data)\n", "y_test = twenty_test.target\n", "print(\"Testing score: {0:.1f}%\".format(\n", "    classifier.score(X_test, y_test) * 100))" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's now decompose what we just did to understand and customize each step." 
] }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "Loading the Dataset" ] }, { "cell_type": "code", "collapsed": false, "input": [ "twenty_train = fetch_20newsgroups(subset='train')\n", "twenty_test = fetch_20newsgroups(subset='test')" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "target_names = twenty_train.target_names\n", "target_names" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "twenty_train.target" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "twenty_train.target.shape" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "twenty_test.target.shape" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "len(twenty_train.data)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "type(twenty_train.data[0])" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "def display_sample(i):\n", "    print(\"Class name: \" + target_names[twenty_train.target[i]])\n", "    print(\"Text content:\\n\")\n", "    print(twenty_train.data[i])" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "display_sample(0)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "display_sample(1)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's compute the (uncompressed, in-memory) size of the training and test sets in MB, assuming an 8-bit encoding (in this case, all characters can be encoded using the latin-1 charset)." 
] }, { "cell_type": "code", "collapsed": false, "input": [ "def text_size(text, charset='iso-8859-1'):\n", "    # An 8-bit charset uses one byte per character; 1 MB = 1e6 bytes\n", "    return len(text.encode(charset)) * 1e-6\n", "\n", "train_size_mb = sum(text_size(text) for text in twenty_train.data)\n", "test_size_mb = sum(text_size(text) for text in twenty_test.data)\n", "\n", "print(\"Training set size: {0} MB\".format(int(train_size_mb)))\n", "print(\"Testing set size: {0} MB\".format(int(test_size_mb)))" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "Extracting Text Features" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# TODO" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "Training a Classifier on Text Features" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# TODO" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Exercise**:\n", "\n", "- Write a pre-processor callable (e.g. a Python function) that removes the headers from the text of a newsgroup post.\n", "- Vectorize the data again and measure the impact on performance of removing the header info from the datasets.\n", "- Do you expect the performance of the model to improve or decrease? What is the score of a uniform random classifier on the same dataset?\n", "\n", "Hint: the `TfidfVectorizer` class can accept Python functions to customize the `preprocessor`, `tokenizer` or `analyzer` stages of the vectorizer. Don't forget to use the IPython `?` suffix operator on any Python class or method to read the docstring, or even the `??` operator to read the source code." 
] }, { "cell_type": "code", "collapsed": false, "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Setting Up a Pipeline for Cross Validation" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# TODO\n", "# Use a subset of the data to make this fast to compute" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "More Complex Text Features" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# TODO" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Tips for Improving the Predictive Performance" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# TODO: print the confusion matrix\n", "# TODO: display the most important features for each linear model\n", "# TODO: analyze misclassifications by ranking them by violation of the decision / probability thresholds\n", "# TODO: semi-supervised learning / active learning\n", "# TODO: K-Means features from large unlabeled datasets" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Clustering and Topic Extraction" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# TODO: quick presentation of minibatch k-means + NMF + display important cluster terms + word clouds" ], "language": "python", "metadata": {}, "outputs": [] } ], "metadata": {} } ] }