{
 "metadata": {
  "name": "03 - Text Feature Extraction for Classification and Clustering"
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "heading",
     "level": 1,
     "metadata": {},
     "source": [
      "Text Feature Extraction for Classification and Clustering"
     ]
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Text Classification in 15 lines of Python"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Let's start by implementing a canonical text classification example:\n",
      "\n",
      "- The 20 newsgroups dataset: around 18000 text posts from 20 newsgroups forums\n",
      "- Bag of Words features extraction with TF-IDF weighting\n",
      "- Naive Bayes classifier or Linear Support Vector Machine for the classifier itself"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from sklearn.datasets import fetch_20newsgroups\n",
      "from sklearn.feature_extraction.text import TfidfVectorizer\n",
      "from sklearn.naive_bayes import MultinomialNB\n",
      "\n",
      "twenty_train = fetch_20newsgroups(subset='train')\n",
      "twenty_test = fetch_20newsgroups(subset='test')\n",
      "\n",
      "vectorizer = TfidfVectorizer()\n",
      "X_train = vectorizer.fit_transform(twenty_train.data)\n",
      "y_train = twenty_train.target\n",
      "\n",
      "classifier = MultinomialNB().fit(X_train, y_train)\n",
      "print(\"Training score: {0:.1f}%\".format(\n",
      "    classifier.score(X_train, y_train) * 100))\n",
      "\n",
      "X_test = vectorizer.transform(twenty_test.data)\n",
      "y_test = twenty_test.target\n",
      "print(\"Testing score: {0:.1f}%\".format(\n",
      "    classifier.score(X_test, y_test) * 100))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Let's now decompose what we just did to understand and customize each step."
     ]
    },
    {
     "cell_type": "heading",
     "level": 3,
     "metadata": {},
     "source": [
      "Loading the Dataset"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "twenty_train = fetch_20newsgroups(subset='train')\n",
      "twenty_test = fetch_20newsgroups(subset='test')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "target_names = twenty_train.target_names\n",
      "target_names"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "twenty_train.target"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "twenty_train.target.shape"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "twenty_test.target.shape"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "len(twenty_train.data)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "type(twenty_train.data[0])"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "def display_sample(i):\n",
      "    print(\"Class name: \" + target_names[twenty_train.target[i]])\n",
      "    print(\"Text content:\\n\")\n",
      "    print(twenty_train.data[i])"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "display_sample(0)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "display_sample(1)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Let's compute the (uncompressed, in-memory) size of the training and test sets in MB assuming an 8 bit encoding (in this case, all chars can be encoded using the latin-1 charset)."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "def text_size(text, charset='iso-8859-1'):\n",
      "    return len(text.encode(charset)) * 8 * 1e-6\n",
      "\n",
      "train_size_mb = sum(text_size(text) for text in twenty_train.data) \n",
      "test_size_mb = sum(text_size(text) for text in twenty_test.data)\n",
      "\n",
      "print(\"Training set size: {0} MB\".format(int(train_size_mb)))\n",
      "print(\"Testing set size: {0} MB\".format(int(test_size_mb)))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "heading",
     "level": 3,
     "metadata": {},
     "source": [
      "Extracting Text Features"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# TODO"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "heading",
     "level": 3,
     "metadata": {},
     "source": [
      "Training a Classifier on Text Features"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# TODO"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "**Exercise**:\n",
      "\n",
      "- Write a pre-processor callable (e.g. a python function) to remove the headers of the text a newsgroup post.\n",
      "- Vectorize the data again and measure the impact on performance of removing the header info from the datasets.\n",
      "- Do you expect the performance of the model to improve or decrease? What is the score of a uniform random classifier on the same dataset?\n",
      "\n",
      "Hint: the `TfidfVectorizer` class can accept python functions to customize the `preprocessor`, `tokenizer` or `analyzer` stages of the vectorizer. Don't forget to use the IPython `?` suffix operator on a any Python class or method to read the docstring or even the `??` operator to read the source code."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Setting Up a Pipeline for Cross Validation"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# TODO\n",
      "# Use a subset of the data to make this fast to compute"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "More Complex Text Features"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# TODO"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Tips for Improving the Predictive Performance"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# TODO: Print the confusion matrix\n",
      "# TODO: display the most important features for each linear model\n",
      "# TODO: analyze mis-classifications by ranking by violations of the decision / probas thresholds\n",
      "# TODO: semi supervised learning / active learning\n",
      "# TODO: K-Means features from large unlabeled datasets"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Clustering and Topics Extraction"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# TODO: quick presentation of minibatch kmeans + NMF + display important cluster terms + word clouds"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    }
   ],
   "metadata": {}
  }
 ]
}