{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Sebastian Raschka 30/05/2015 \n", "\n", "CPython 3.4.3\n", "IPython 3.1.0\n", "\n", "scikit-learn 0.16.1\n", "joblib 0.8.4\n", "numpy 1.9.2\n", "nltk 3.0.0\n" ] } ], "source": [ "%load_ext watermark\n", "%watermark -a 'Sebastian Raschka' -d -v -p scikit-learn,joblib,numpy,nltk" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Out-of-core Learning and Model Persistence using scikit-learn" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When we are applying machine learning algorithms to real-world applications, our computer hardware often still constitutes the major bottleneck of the learning process. Of course, we all have access to supercomputers, Amazon EC2, Apache Spark, etc. However, out-of-core learning via Stochastic Gradient Descent can still be attractive if we'd want to update our model on-the-fly (\"online-learning\"), and in this notebook, I want to provide some examples of how we can implement an \"out-of-core\" approach using scikit-learn. \n", "I compiled the following code examples for personal reference, and I don't intend it to be a comprehensive reference for the underlying theory, but nonetheless, I decided to share it since it may be useful to one or the other!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Sections\n", "\n", "- [The IMDb Movie Review Dataset](#The-IMDb-Movie-Review-Dataset)\n", "- [Preprocessing Text Data](#Preprocessing-Text-Data)\n", "- [Out-of-core learning](#Out-of-core-learning)\n", "- [Model Persistence](#Model-Persistence)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The IMDb Movie Review Dataset" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[[back to top](#Sections)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this section, we will train a simple logistic regression model to classify movie reviews from the 50k IMDb review dataset that has been collected by Maas et. al.\n", "\n", "> AL Maas, RE Daly, PT Pham, D Huang, AY Ng, and C Potts. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Lin- guistics: Human Language Technologies, pages 142–150, Portland, Oregon, USA, June 2011. Association for Computational Linguistics\n", "\n", "[Source: http://ai.stanford.edu/~amaas/data/sentiment/]\n", "\n", "The dataset consists of 50,000 movie reviews from the original \"train\" and \"test\" subdirectories. The class labels are binary (1=positive and 0=negative) and contain 25,000 positive and 25,000 negative movie reviews, respectively.\n", "For simplicity, I assembled the reviews in a single CSV file.\n" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
reviewsentimentset
49995Towards the end of the movie, I felt it was to...0train
49996This is the kind of movie that my enemies cont...0train
49997I saw 'Descent' last night at the Stockholm Fi...0train
49998Some films that you pick up for a pound turn o...0train
49999This is one of the dumbest films, I've ever se...0train
\n", "
" ], "text/plain": [ " review sentiment set\n", "49995 Towards the end of the movie, I felt it was to... 0 train\n", "49996 This is the kind of movie that my enemies cont... 0 train\n", "49997 I saw 'Descent' last night at the Stockholm Fi... 0 train\n", "49998 Some films that you pick up for a pound turn o... 0 train\n", "49999 This is one of the dumbest films, I've ever se... 0 train" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "df = pd.read_csv('https://raw.githubusercontent.com/rasbt/pattern_classification/master/data/50k_imdb_movie_reviews.csv')\n", "df.tail()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the following sections, we will define some simple function to process the text data and read it from the CSV file in minibatches to train a logistic regression classifier via stochastic gradient descent. However, before we proceed to the next section, let us shuffle the class labels." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import numpy as np\n", "np.random.seed(0)\n", "df = df.reindex(np.random.permutation(df.index))\n", "df[['review', 'sentiment']].to_csv('/Users/sebastian/Desktop/shuffled_movie_data.csv', index=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Preprocessing Text Data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[[back to top](#Sections)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, let us define a simple `tokenizer` that splits the text into individual word tokens. Furthermore, we will use some simple regular expression to remove HTML markup and all non-letter characters but \"emoticons,\" convert the text to lower case, remove stopwords, and apply the Porter stemming algorithm to convert the words into their root form." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import numpy as np\n", "from nltk.stem.porter import PorterStemmer\n", "import re\n", "from nltk.corpus import stopwords\n", "\n", "stop = stopwords.words('english')\n", "porter = PorterStemmer()\n", "\n", "def tokenizer(text):\n", " text = re.sub('<[^>]*>', '', text)\n", " emoticons = re.findall('(?::|;|=)(?:-)?(?:\\)|\\(|D|P)', text.lower())\n", " text = re.sub('[\\W]+', ' ', text.lower()) + ' '.join(emoticons).replace('-', '')\n", " text = [w for w in text.split() if w not in stop]\n", " tokenized = [porter.stem(w) for w in text]\n", " return text" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's give it at try:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "['test', ':)', ':)']" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tokenizer('This :) is a test! :-)
')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Out-of-core learning" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[[back to top](#Sections)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, we define a generator that returns the document body and the corresponding class label:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def stream_docs(path):\n", " with open(path, 'r') as csv:\n", " next(csv) # skip header\n", " for line in csv:\n", " text, label = line[:-3], int(line[-2])\n", " yield text, label" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To conform that the `stream_docs` function fetches the documents as intended, let us execute the following code snippet before we implement the `get_minibatch` function:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "('\"In 1974, the teenager Martha Moxley (Maggie Grace) moves to the high-class area of Belle Haven, Greenwich, Connecticut. On the Mischief Night, eve of Halloween, she was murdered in the backyard of her house and her murder remained unsolved. Twenty-two years later, the writer Mark Fuhrman (Christopher Meloni), who is a former LA detective that has fallen in disgrace for perjury in O.J. Simpson trial and moved to Idaho, decides to investigate the case with his partner Stephen Weeks (Andrew Mitchell) with the purpose of writing a book. The locals squirm and do not welcome them, but with the support of the retired detective Steve Carroll (Robert Forster) that was in charge of the investigation in the 70\\'s, they discover the criminal and a net of power and money to cover the murder.

\"\"Murder in Greenwich\"\" is a good TV movie, with the true story of a murder of a fifteen years old girl that was committed by a wealthy teenager whose mother was a Kennedy. The powerful and rich family used their influence to cover the murder for more than twenty years. However, a snoopy detective and convicted perjurer in disgrace was able to disclose how the hideous crime was committed. The screenplay shows the investigation of Mark and the last days of Martha in parallel, but there is a lack of the emotion in the dramatization. My vote is seven.

Title (Brazil): Not Available\"',\n", " 1)" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "next(stream_docs(path='/Users/sebastian/Desktop/shuffled_movie_data.csv'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After we confirmed that our `stream_docs` functions works, we will now implement a `get_minibatch` function to fetch a specified number (`size`) of documents:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def get_minibatch(doc_stream, size):\n", " docs, y = [], []\n", " for _ in range(size):\n", " text, label = next(doc_stream)\n", " docs.append(text)\n", " y.append(label)\n", " return docs, y" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we will make use of the \"hashing trick\" through scikit-learns [HashingVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.HashingVectorizer.html) to create a bag-of-words model of our documents. I don't want to go into the details of the bag-of-words model for document classification, but if you are interested, you can take a look at one of my articles, [Naive Bayes and Text Classification I - Introduction and Theory](http://arxiv.org/abs/1410.5329), where I explained the concepts behind bag-of-words, tokenization, stemming, etc." ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.feature_extraction.text import HashingVectorizer\n", "vect = HashingVectorizer(decode_error='ignore', \n", " n_features=2**21,\n", " preprocessor=None, \n", " tokenizer=tokenizer)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using the [SGDClassifier]() from scikit-learn, we will can instanciate a logistic regression classifier that learns from the documents incrementally using stochastic gradient descent. If you are curious about how this optimization algorithm works, please see my article on [artificial neurons](http://sebastianraschka.com/Articles/2015_singlelayer_neurons.html#Online-Learning-via-Stochastic-Gradient-Descent)." ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from sklearn.linear_model import SGDClassifier\n", "clf = SGDClassifier(loss='log', random_state=1, n_iter=1)\n", "doc_stream = stream_docs(path='/Users/sebastian/Desktop/shuffled_movie_data.csv')" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "0% 100%\n", "[##############################] | ETA[sec]: 0.000 \n", "Total time elapsed: 182.407 sec\n" ] } ], "source": [ "import pyprind\n", "pbar = pyprind.ProgBar(45)\n", "\n", "classes = np.array([0, 1])\n", "for _ in range(45):\n", " X_train, y_train = get_minibatch(doc_stream, size=1000)\n", " X_train = vect.transform(X_train)\n", " clf.partial_fit(X_train, y_train, classes=classes)\n", " pbar.update()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Depending on your machine, it will take about 2-3 minutes to stream the documents and learn the weights for the logistic regression model to classify \"new\" movie reviews. 
Executing the preceding code, we used the first 45,000 movie reviews to train the classifier, which means that we have 5,000 reviews left for testing:" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Accuracy: 0.868\n" ] } ], "source": [ "X_test, y_test = get_minibatch(doc_stream, size=5000)\n", "X_test = vect.transform(X_test)\n", "print('Accuracy: %.3f' % clf.score(X_test, y_test))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I think that the predictive performance, an accuarcy of ~87%, is quite \"reasonable\" given that we \"only\" used the default parameters and didn't do any hyperparameter optimization. \n", "\n", "After we estimated the model perfomance, let us use those last 5,000 test samples to update our model." ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": false }, "outputs": [], "source": [ "clf = clf.partial_fit(X_test, y_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "
" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "# Model Persistence" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[[back to top](#Sections)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the previous section, we successfully trained a model to predict the sentiment of a movie review. Unfortunately, if we'd close this IPython notebook at this point, we'd have to go through the whole learning process again and again if we'd want to make a prediction on \"new data.\"\n", "\n", "So, to reuse this model, we could use the [`pickle`](https://docs.python.org/3.5/library/pickle.html) module to \"serialize a Python object structure\". Or even better, we could use the [`joblib`](https://pypi.python.org/pypi/joblib) library, which handles large NumPy arrays more efficiently." ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "['./outofcore_modelpersistence/clf.pkl',\n", " './outofcore_modelpersistence/clf.pkl_01.npy',\n", " './outofcore_modelpersistence/clf.pkl_02.npy',\n", " './outofcore_modelpersistence/clf.pkl_03.npy',\n", " './outofcore_modelpersistence/clf.pkl_04.npy']" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import joblib\n", "import os\n", "if not os.path.exists('./pkl_objects'):\n", " os.mkdir('./pkl_objects')\n", " \n", "joblib.dump(vect, './outofcore_modelpersistence/vectorizer.pkl')\n", "joblib.dump(clf, './outofcore_modelpersistence/clf.pkl')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using the code above, we \"pickled\" the `HashingVectorizer` and the `SGDClassifier` so that we can re-use those objects later. However, `pickle` and `joblib` have a known issue with `pickling` objects or functions from a `__main__` block and we'd get an `AttributeError: Can't get attribute [x] on ` if we'd unpickle it later. Thus, to pickle the `tokenizer` function, we can write it to a file and import it to get the `namespace` \"right\"." 
] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Writing tokenizer.py\n" ] } ], "source": [ "%%writefile tokenizer.py\n", "from nltk.stem.porter import PorterStemmer\n", "import re\n", "from nltk.corpus import stopwords\n", "\n", "stop = stopwords.words('english')\n", "porter = PorterStemmer()\n", "\n", "def tokenizer(text):\n", " text = re.sub('<[^>]*>', '', text)\n", " emoticons = re.findall('(?::|;|=)(?:-)?(?:\\)|\\(|D|P)', text.lower())\n", " text = re.sub('[\\W]+', ' ', text.lower()) + ' '.join(emoticons).replace('-', '')\n", " text = [w for w in text.split() if w not in stop]\n", " tokenized = [porter.stem(w) for w in text]\n", " return text" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "['./outofcore_modelpersistence/tokenizer.pkl']" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from tokenizer import tokenizer\n", "joblib.dump(tokenizer, './outofcore_modelpersistence/tokenizer.pkl')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, let us restart this IPython notebook and check if the we can load our serialized objects:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import joblib\n", "tokenizer = joblib.load('./outofcore_modelpersistence/tokenizer.pkl')\n", "vect = joblib.load('./outofcore_modelpersistence/vectorizer.pkl')\n", "clf = joblib.load('./outofcore_modelpersistence/clf.pkl')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After loading the `tokenizer`, `HashingVectorizer`, and the tranined logistic regression model, we can use it to make predictions on new data, which can be useful, for example, if we'd want to embed our classifier into a web application -- a topic for another IPython notebook." ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([0])" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "example = ['I did not like this movie']\n", "X = vect.transform(example)\n", "clf.predict(X)" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([1])" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "example = ['I loved this movie']\n", "X = vect.transform(example)\n", "clf.predict(X)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.4.3" } }, "nbformat": 4, "nbformat_minor": 0 }