{ "cells": [ { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "# Out-of-core Learning - Large Scale Text Classification for Sentiment Analysis" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "## Scalability Issues" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "The `sklearn.feature_extraction.text.CountVectorizer` and `sklearn.feature_extraction.text.TfidfVectorizer` classes suffer from a number of scalability issues that all stem from the internal usage of the `vocabulary_` attribute (a Python dictionary) used to map the unicode string feature names to the integer feature indices.\n", "\n", "The main scalability issues are:\n", "\n", "- **Memory usage of the text vectorizer**: all the string representations of the features are loaded in memory\n", "- **Parallelization problems for text feature extraction**: the `vocabulary_` would be a shared state: complex synchronization and overhead\n", "- **Impossibility to do online or out-of-core / streaming learning**: the `vocabulary_` needs to be learned from the data: its size cannot be known before making one pass over the full dataset\n", " \n", " \n", "To better understand the issue let's have a look at how the `vocabulary_` attribute work. At `fit` time the tokens of the corpus are uniquely indentified by a integer index and this mapping stored in the vocabulary:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "from sklearn.feature_extraction.text import CountVectorizer\n", "\n", "vectorizer = CountVectorizer(min_df=1)\n", "\n", "vectorizer.fit([\n", " \"The cat sat on the mat.\",\n", "])\n", "vectorizer.vocabulary_" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "The vocabulary is used at `transform` time to build the occurrence matrix:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "X = vectorizer.transform([\n", " \"The cat sat on the mat.\",\n", " \"This cat is a nice cat.\",\n", "]).toarray()\n", "\n", "print(len(vectorizer.vocabulary_))\n", "print(vectorizer.get_feature_names())\n", "print(X)" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "Let's refit with a slightly larger corpus:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "vectorizer = CountVectorizer(min_df=1)\n", "\n", "vectorizer.fit([\n", " \"The cat sat on the mat.\",\n", " \"The quick brown fox jumps over the lazy dog.\",\n", "])\n", "vectorizer.vocabulary_" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "The `vocabulary_` is the (logarithmically) growing with the size of the training corpus. 
Note that we could not have built the vocabularies in parallel on the 2 text documents: since they share some words, this would require some kind of shared data structure or synchronization barrier, which is complicated to set up, especially if we want to distribute the processing on a cluster.\n", "\n", "With this new vocabulary, the dimensionality of the output space is now larger:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "X = vectorizer.transform([\n", "    \"The cat sat on the mat.\",\n", "    \"This cat is a nice cat.\",\n", "]).toarray()\n", "\n", "print(len(vectorizer.vocabulary_))\n", "print(vectorizer.get_feature_names())\n", "print(X)" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "## The IMDb movie dataset" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "To illustrate the scalability issues of the vocabulary-based vectorizers, let's load a more realistic dataset for a classical text classification task: sentiment analysis on text documents. The goal is to distinguish negative from positive movie reviews from the [Internet Movie Database](http://www.imdb.com) (IMDb).\n", "\n", "In the following sections, we will work with a [large subset](http://ai.stanford.edu/~amaas/data/sentiment/) of movie reviews from the IMDb that has been collected by Maas et al.\n", "\n", "- A. L. Maas, R. E. Daly, P. T. Pham, D. Huang, A. Y. Ng, and C. Potts. Learning Word Vectors for Sentiment Analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 142–150, Portland, Oregon, USA, June 2011. Association for Computational Linguistics.\n", "\n", "This dataset contains 50,000 movie reviews, which were split into 25,000 training samples and 25,000 test samples. The reviews are labeled as either negative (neg) or positive (pos). Here, *positive* means that a movie received >6 stars on IMDb, and *negative* means that it received <5 stars.\n", "\n", "\n", "Assuming that the `../fetch_data.py` script was run successfully, the following files should be available:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "deletable": true, "editable": true }, "outputs": [], "source": [ "import os\n", "\n", "train_path = os.path.join('datasets', 'IMDb', 'aclImdb', 'train')\n", "test_path = os.path.join('datasets', 'IMDb', 'aclImdb', 'test')" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "Now, let's load them into our active session via scikit-learn's `load_files` function:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "from sklearn.datasets import load_files\n", "\n", "train = load_files(container_path=train_path,\n", "                   categories=['pos', 'neg'])\n", "\n", "test = load_files(container_path=test_path,\n", "                  categories=['pos', 'neg'])" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "
\n", " NOTE:\n", " \n", "
" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "The `load_files` function loaded the datasets into `sklearn.datasets.base.Bunch` objects, which are Python dictionaries:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "train.keys()" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "In particular, we are only interested in the `data` and `target` arrays." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "import numpy as np\n", "\n", "for label, data in zip(('TRAINING', 'TEST'), (train, test)):\n", " print('\\n\\n%s' % label)\n", " print('Number of documents:', len(data['data']))\n", " print('\\n1st document:\\n', data['data'][0])\n", " print('\\n1st label:', data['target'][0])\n", " print('\\nClass names:', data['target_names'])\n", " print('Class count:', \n", " np.unique(data['target']), ' -> ',\n", " np.bincount(data['target']))" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "As we can see above the `'target'` array consists of integers `0` and `1`, where `0` stands for negative and `1` stands for positive." ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "## The Hashing Trick" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "Remember the bag of word representation using a vocabulary based vectorizer:\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "To workaround the limitations of the vocabulary-based vectorizers, one can use the hashing trick. Instead of building and storing an explicit mapping from the feature names to the feature indices in a Python dict, we can just use a hash function and a modulus operation:" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "More info and reference for the original papers on the Hashing Trick in the [following site](http://www.hunch.net/~jl/projects/hash_reps/index.html) as well as a description specific to language [here](http://blog.someben.com/2013/01/hashing-lang/)." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "from sklearn.utils.murmurhash import murmurhash3_bytes_u32\n", "\n", "# encode for python 3 compatibility\n", "for word in \"the cat sat on the mat\".encode(\"utf-8\").split():\n", " print(\"{0} => {1}\".format(\n", " word, murmurhash3_bytes_u32(word, 0) % 2 ** 20))" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "This mapping is completely stateless and the dimensionality of the output space is explicitly fixed in advance (here we use a modulo `2 ** 20` which means roughly 1M dimensions). The makes it possible to workaround the limitations of the vocabulary based vectorizer both for parallelizability and online / out-of-core learning." 
] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "The `HashingVectorizer` class is an alternative to the `CountVectorizer` (or `TfidfVectorizer` class with `use_idf=False`) that internally uses the murmurhash hash function:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "from sklearn.feature_extraction.text import HashingVectorizer\n", "\n", "h_vectorizer = HashingVectorizer(encoding='latin-1')\n", "h_vectorizer" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "It shares the same \"preprocessor\", \"tokenizer\" and \"analyzer\" infrastructure:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "analyzer = h_vectorizer.build_analyzer()\n", "analyzer('This is a test sentence.')" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "We can vectorize our datasets into a scipy sparse matrix exactly as we would have done with the `CountVectorizer` or `TfidfVectorizer`, except that we can directly call the `transform` method: there is no need to `fit` as `HashingVectorizer` is a stateless transformer:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "deletable": true, "editable": true }, "outputs": [], "source": [ "docs_train, y_train = train['data'], train['target']\n", "docs_valid, y_valid = test['data'][:12500], test['target'][:12500]\n", "docs_test, y_test = test['data'][12500:], test['target'][12500:]" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "The dimensionality of the output is fixed ahead of time to `n_features=2 ** 20` by default (nearly 1M features) to minimize the rate of collisions on most classification problems while having reasonably sized linear models (1M weights in the `coef_` attribute):" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "h_vectorizer.transform(docs_train)" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "Now, let's compare the computational efficiency of the `HashingVectorizer` to the `CountVectorizer`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "h_vec = HashingVectorizer(encoding='latin-1')\n", "%timeit -n 1 -r 3 h_vec.fit(docs_train, y_train)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "count_vec = CountVectorizer(encoding='latin-1')\n", "%timeit -n 1 -r 3 count_vec.fit(docs_train, y_train)" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "As we can see, the `HashingVectorizer` is much faster than the `CountVectorizer` in this case."
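] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "Part of the difference comes from the state the two vectorizers have to maintain: assuming the two timing cells above have been run, the fitted `CountVectorizer` keeps the whole `vocabulary_` dict in memory, while the `HashingVectorizer` stores no vocabulary at all:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "# the CountVectorizer has to build and store a vocabulary_ dict,\n", "# while the HashingVectorizer keeps no such state\n", "print('CountVectorizer vocabulary size:', len(count_vec.vocabulary_))\n", "print('HashingVectorizer has a vocabulary_:', hasattr(h_vec, 'vocabulary_'))"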
] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "Finally, let us train a LogisticRegression classifier on the IMDb training subset:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "from sklearn.linear_model import LogisticRegression\n", "from sklearn.pipeline import Pipeline\n", "\n", "h_pipeline = Pipeline([\n", " ('vec', HashingVectorizer(encoding='latin-1')),\n", " ('clf', LogisticRegression(random_state=1)),\n", "])\n", "\n", "h_pipeline.fit(docs_train, y_train)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "print('Train accuracy', h_pipeline.score(docs_train, y_train))\n", "print('Validation accuracy', h_pipeline.score(docs_valid, y_valid))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "import gc\n", "\n", "del count_vec\n", "del h_pipeline\n", "\n", "gc.collect()" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "# Out-of-Core learning" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "Out-of-Core learning is the task of training a machine learning model on a dataset that does not fit into memory or RAM. This requires the following conditions:\n", " \n", "- a **feature extraction** layer with **fixed output dimensionality**\n", "- knowing the list of all classes in advance (in this case we only have positive and negative reviews)\n", "- a machine learning **algorithm that supports incremental learning** (the `partial_fit` method in scikit-learn).\n", "\n", "In the following sections, we will set up a simple batch-training function to train an `SGDClassifier` iteratively. 
" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "But first, let us load the file names into a Python list:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "train_path = os.path.join('datasets', 'IMDb', 'aclImdb', 'train')\n", "train_pos = os.path.join(train_path, 'pos')\n", "train_neg = os.path.join(train_path, 'neg')\n", "\n", "fnames = [os.path.join(train_pos, f) for f in os.listdir(train_pos)] +\\\n", " [os.path.join(train_neg, f) for f in os.listdir(train_neg)]\n", "\n", "fnames[:3]" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "Next, let us create the target label array:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "y_train = np.zeros((len(fnames), ), dtype=int)\n", "y_train[:12500] = 1\n", "np.bincount(y_train)" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "Now, we implement the `batch_train function` as follows:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "deletable": true, "editable": true }, "outputs": [], "source": [ "from sklearn.base import clone\n", "\n", "def batch_train(clf, fnames, labels, iterations=25, batchsize=1000, random_seed=1):\n", " vec = HashingVectorizer(encoding='latin-1')\n", " idx = np.arange(labels.shape[0])\n", " c_clf = clone(clf)\n", " rng = np.random.RandomState(seed=random_seed)\n", " \n", " for i in range(iterations):\n", " rnd_idx = rng.choice(idx, size=batchsize)\n", " documents = []\n", " for i in rnd_idx:\n", " with open(fnames[i], 'r', encoding='latin-1') as f:\n", " documents.append(f.read())\n", " X_batch = vec.transform(documents)\n", " batch_labels = labels[rnd_idx]\n", " c_clf.partial_fit(X=X_batch, \n", " y=batch_labels, \n", " classes=[0, 1])\n", " \n", " return c_clf" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "Note that we are not using `LogisticRegression` as in the previous section, but we will use a `SGDClassifier` with a logistic cost function instead. SGD stands for `stochastic gradient descent`, an optimization alrogithm that optimizes the weight coefficients iteratively sample by sample, which allows us to feed the data to the classifier chunk by chuck." ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "And we train the `SGDClassifier`; using the default settings of the `batch_train` function, it will train the classifier on 25*1000=25000 documents. 
(Depending on your machine, this may take >2 min)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "from sklearn.linear_model import SGDClassifier\n", "\n", "sgd = SGDClassifier(loss='log', random_state=1, max_iter=1000)\n", "\n", "sgd = batch_train(clf=sgd,\n", "                  fnames=fnames,\n", "                  labels=y_train)" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "Finally, let us evaluate its performance:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "vec = HashingVectorizer(encoding='latin-1')\n", "sgd.score(vec.transform(docs_test), y_test)" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "### Limitations of the Hashing Vectorizer" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "Using the `HashingVectorizer` makes it possible to implement streaming and parallel text classification, but it can also introduce some issues:\n", "\n", "- The collisions can introduce too much noise in the data and degrade prediction quality.\n", "- The `HashingVectorizer` does not provide \"Inverse Document Frequency\" reweighting (lack of a `use_idf=True` option).\n", "- There is no easy way to invert the mapping and find the feature names from the feature index.\n", "\n", "The collision issues can be controlled by increasing the `n_features` parameter.\n", "\n", "The IDF weighting might be reintroduced by appending a `TfidfTransformer` instance on the output of the vectorizer (see the short sketch below). However, computing the `idf_` statistic used for the feature reweighting will require at least one additional pass over the training set before being able to start training the classifier: this breaks the online learning scheme.\n", "\n", "The lack of an inverse mapping (the `get_feature_names()` method of `TfidfVectorizer`) is even harder to work around. That would require extending the `HashingVectorizer` class to add a \"trace\" mode to record the mapping of the most important features to provide statistical debugging information.\n", "\n", "In the meantime, to debug feature extraction issues, it is recommended to use `TfidfVectorizer(use_idf=False)` on a small-ish subset of the dataset to simulate a `HashingVectorizer()` instance that has the `get_feature_names()` method and no collision issues." ] }, 
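{ "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "As a minimal sketch (the pipeline and variable names below are only illustrative), the IDF weighting can be reintroduced by chaining a `TfidfTransformer` after the `HashingVectorizer`; note that fitting the transformer computes `idf_`, which is exactly the extra pass over the data discussed above:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "from sklearn.feature_extraction.text import TfidfTransformer\n", "from sklearn.pipeline import Pipeline\n", "\n", "# Sketch: hashed term counts followed by IDF reweighting. Fitting the\n", "# TfidfTransformer requires an extra pass over the data to compute idf_,\n", "# so this variant is no longer a purely online / streaming scheme.\n", "hashing_tfidf = Pipeline([\n", "    ('hash', HashingVectorizer(encoding='latin-1')),\n", "    ('tfidf', TfidfTransformer()),\n", "])\n", "\n", "# use a small subset of the training documents to keep the demo quick\n", "hashing_tfidf.fit_transform(train['data'][:1000])" ] },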
\n", " EXERCISE:\n", " \n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "deletable": true, "editable": true }, "outputs": [], "source": [ "# %load solutions/23_batchtrain.py" ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.4" } }, "nbformat": 4, "nbformat_minor": 2 }