{ "cells": [ { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "# Out-of-core Learning - Large Scale Text Classification for Sentiment Analysis" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "## Scalability Issues" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "The `sklearn.feature_extraction.text.CountVectorizer` and `sklearn.feature_extraction.text.TfidfVectorizer` classes suffer from a number of scalability issues that all stem from the internal usage of the `vocabulary_` attribute (a Python dictionary) used to map the unicode string feature names to the integer feature indices.\n", "\n", "The main scalability issues are:\n", "\n", "- **Memory usage of the text vectorizer**: all the string representations of the features are loaded in memory\n", "- **Parallelization problems for text feature extraction**: the `vocabulary_` would be a shared state: complex synchronization and overhead\n", "- **Impossibility to do online or out-of-core / streaming learning**: the `vocabulary_` needs to be learned from the data: its size cannot be known before making one pass over the full dataset\n", " \n", " \n", "To better understand the issue let's have a look at how the `vocabulary_` attribute work. 
At `fit` time the tokens of the corpus are uniquely identified by an integer index, and this mapping is stored in the vocabulary:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "from sklearn.feature_extraction.text import CountVectorizer\n", "\n", "vectorizer = CountVectorizer(min_df=1)\n", "\n", "vectorizer.fit([\n", " \"The cat sat on the mat.\",\n", "])\n", "vectorizer.vocabulary_" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "The vocabulary is used at `transform` time to build the occurrence matrix:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "X = vectorizer.transform([\n", " \"The cat sat on the mat.\",\n", " \"This cat is a nice cat.\",\n", "]).toarray()\n", "\n", "print(len(vectorizer.vocabulary_))\n", "print(vectorizer.get_feature_names())\n", "print(X)" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "Let's refit with a slightly larger corpus:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "vectorizer = CountVectorizer(min_df=1)\n", "\n", "vectorizer.fit([\n", " \"The cat sat on the mat.\",\n", " \"The quick brown fox jumps over the lazy dog.\",\n", "])\n", "vectorizer.vocabulary_" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "The `vocabulary_` is growing (roughly logarithmically) with the size of the training corpus. 
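To see this growth concretely, here is a small sketch (not part of the original notebook; it simply combines the tiny example sentences used above into one corpus) that refits a `CountVectorizer` on progressively larger slices of the corpus and prints the vocabulary size after each step:

```python
from sklearn.feature_extraction.text import CountVectorizer

# The three example sentences used in the cells above.
corpus = [
    "The cat sat on the mat.",
    "The quick brown fox jumps over the lazy dog.",
    "This cat is a nice cat.",
]

# Refit on progressively larger prefixes of the corpus:
# the vocabulary can only be built after seeing the documents,
# and it keeps growing as new documents introduce new tokens.
for n_docs in range(1, len(corpus) + 1):
    vectorizer = CountVectorizer(min_df=1)
    vectorizer.fit(corpus[:n_docs])
    print(n_docs, "documents ->", len(vectorizer.vocabulary_), "features")
```

Note that the dimensionality of the output space (and hence the meaning of each column index) changes every time a new document adds a token, which is exactly why the vocabulary cannot be fixed before a full pass over the data.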
Note that we could not have built the vocabularies in parallel on the 2 text documents: since they share some words, this would require some kind of shared data structure or synchronization barrier, which is complicated to set up, especially if we want to distribute the processing on a cluster.\n", "\n", "With this new vocabulary, the dimensionality of the output space is now larger:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "X = vectorizer.transform([\n", " \"The cat sat on the mat.\",\n", " \"This cat is a nice cat.\",\n", "]).toarray()\n", "\n", "print(len(vectorizer.vocabulary_))\n", "print(vectorizer.get_feature_names())\n", "print(X)" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "## The IMDb movie dataset" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "To illustrate the scalability issues of the vocabulary-based vectorizers, let's load a more realistic dataset for a classical text classification task: sentiment analysis on text documents. The goal is to tell apart negative from positive movie reviews from the [Internet Movie Database](http://www.imdb.com) (IMDb).\n", "\n", "In the following sections, we will work with a [large subset](http://ai.stanford.edu/~amaas/data/sentiment/) of movie reviews from the IMDb that was collected by Maas et al. \n", "\n", "- A. L. Maas, R. E. Daly, P. T. Pham, D. Huang, A. Y. Ng, and C. Potts. Learning Word Vectors for Sentiment Analysis. In the proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 142–150, Portland, Oregon, USA, June 2011. Association for Computational Linguistics. \n", "\n", "This dataset contains 50,000 movie reviews, which were split into 25,000 training samples and 25,000 test samples. 
The reviews are labeled as either negative (neg) or positive (pos). Here, *positive* means that a movie received more than 6 stars on IMDb, and *negative* means that it received fewer than 5 stars.\n", "\n", "\n", "Assuming that the `../fetch_data.py` script was run successfully, the following files should be available:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "deletable": true, "editable": true }, "outputs": [], "source": [ "import os\n", "\n", "train_path = os.path.join('datasets', 'IMDb', 'aclImdb', 'train')\n", "test_path = os.path.join('datasets', 'IMDb', 'aclImdb', 'test')" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "Now, let's load them into our active session via scikit-learn's `load_files` function:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "from sklearn.datasets import load_files\n", "\n", "train = load_files(container_path=train_path,\n", " categories=['pos', 'neg'])\n", "\n", "test = load_files(container_path=test_path,\n", " categories=['pos', 'neg'])" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "