{ "metadata": { "name": "", "signature": "sha256:b4431d413654ca43dc1e3f8061db08aaa3a5f1a14ae73c48ebfd719a9a1b907d" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "### Persistence of Vectorizer and Classifier" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Right now, we are doing first training and then using the vectorizer, feature selector and classifier immediately after training. However, in practice this scenario is one of the least likely occurring one. You generally train your classifier once, and then you want to use that as much as you'd like. In order to do so, we need to persist the vectorizer, feature selector and classifier. \n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Pipeline Serialization" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In last notebook, I showed that `Pipeline` structure gives a nice way to combine and put together a single component for all vectorizer, feature selector and also classifier(for the sequential pipeline). Instead of serializing two and possibly three structures, we would use `pipeline` to handle the serialization and deserialization. Without loss of generality, the things that I will show in this notebook is applicable to independent and separate components of the system(namely vectorizer, fature selector and classifier) as well. However, `pipeline` is preffered way to persist your whole __machine learning pipeline__." ] }, { "cell_type": "code", "collapsed": false, "input": [ "%matplotlib inline\n", "import csv\n", "import matplotlib.pyplot as plt\n", "import numpy as np\n", "import os\n", "from sklearn import cross_validation\n", "from sklearn import ensemble\n", "from sklearn.feature_extraction import text\n", "from sklearn import feature_extraction\n", "from sklearn import feature_selection\n", "from sklearn import linear_model\n", "from sklearn import metrics\n", "from sklearn import naive_bayes\n", "from sklearn import pipeline\n", "from sklearn import svm\n", "from sklearn import tree\n", "from sklearn import externals\n", "\n", "_DATA_DIR = 'data'\n", "_NYT_DATA_PATH = os.path.join(_DATA_DIR, 'nyt_title_data.csv')\n", "_SERIALIZATION_DIR = 'serializations'\n", "_SERIALIZED_PIPELINE_NAME = 'pipe.pickle'\n", "_SERIALIZATION_PATH = os.path.join(_SERIALIZATION_DIR, _SERIALIZED_PIPELINE_NAME)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 13 }, { "cell_type": "code", "collapsed": false, "input": [ "with open(_NYT_DATA_PATH) as nyt:\n", " nyt_data = []\n", " nyt_labels = []\n", " csv_reader = csv.reader(nyt)\n", " for line in csv_reader:\n", " nyt_labels.append(int(line[0]))\n", " nyt_data.append(line[1])" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 2 }, { "cell_type": "code", "collapsed": false, "input": [ "X = np.array([''.join(el) for el in nyt_data])\n", "y = np.array([el for el in nyt_labels])\n", "\n", "X_train, X_test, y_train, y_test = cross_validation.train_test_split(X, y)\n", "\n", "vectorizer = text.TfidfVectorizer(min_df=2, \n", " ngram_range=(1, 2), \n", " stop_words='english', \n", " strip_accents='unicode', \n", " norm='l2')" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 3 }, { "cell_type": "code", "collapsed": false, "input": [ "pipe = pipeline.Pipeline([(\"vectorizer\", vectorizer), (\"svm\", linear_model.RidgeClassifier())])" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 4 }, { 
"cell_type": "code", "collapsed": false, "input": [ "pipe.fit(X_train, y_train)" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 5, "text": [ "Pipeline(steps=[('vectorizer', TfidfVectorizer(analyzer=u'word', binary=False, charset=None,\n", " charset_error=None, decode_error=u'strict',\n", " dtype=, encoding=u'utf-8', input=u'content',\n", " lowercase=True, max_df=1.0, max_features=None, min_df=2,\n", " ngram_range=(1, 2)...copy_X=True, fit_intercept=True,\n", " max_iter=None, normalize=False, solver='auto', tol=0.001))])" ] } ], "prompt_number": 5 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Up to here, everything is same with previous notebook, there should be no surprises." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Create a Serialization Directory if it does not exist already" ] }, { "cell_type": "code", "collapsed": false, "input": [ "if not os.path.exists(_SERIALIZATION_DIR):\n", " os.makedirs(_SERIALIZATION_DIR)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 8 }, { "cell_type": "code", "collapsed": false, "input": [ "externals.joblib.dump(pipe, _SERIALIZATION_PATH)" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 12, "text": [ "['serializations/pipe.pickle',\n", " 'serializations/pipe.pickle_01.npy',\n", " 'serializations/pipe.pickle_02.npy',\n", " 'serializations/pipe.pickle_03.npy',\n", " 'serializations/pipe.pickle_04.npy',\n", " 'serializations/pipe.pickle_05.npy']" ] } ], "prompt_number": 12 }, { "cell_type": "markdown", "metadata": {}, "source": [ "> joblib.dump returns a list of filenames. Each individual numpy array contained in the clf object is serialized as a separate file on the filesystem. All files are required in the same folder when reloading the model with joblib.load." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By this point, the serialization is complete and ready for deployment. If we were not using `Pipeline`, we would need at least two serialization for vectorizer and classifier(feature seelector would be third if one uses that). In order to deploy the model, let's deserialize the `pipeline` in a very similar manner." ] }, { "cell_type": "code", "collapsed": false, "input": [ "pipe = externals.joblib.load(_SERIALIZATION_PATH)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 14 }, { "cell_type": "code", "collapsed": false, "input": [ "pipe" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 15, "text": [ "Pipeline(steps=[('vectorizer', TfidfVectorizer(analyzer=u'word', binary=False, charset=None,\n", " charset_error=None, decode_error=u'strict',\n", " dtype=, encoding=u'utf-8', input=u'content',\n", " lowercase=True, max_df=1.0, max_features=None, min_df=2,\n", " ngram_range=(1, 2)...copy_X=True, fit_intercept=True,\n", " max_iter=None, normalize=False, solver='auto', tol=0.001))])" ] } ], "prompt_number": 15 }, { "cell_type": "markdown", "metadata": {}, "source": [ "We successfully persisted our pipeline and loaded into the namespace again, ready to apply our test set." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> Never, ever unpickle untrusted data! `Pickle` (the serialization method) in Python which `joblib` uses under the hood has a lot of issues in terms security and vulnerability. 
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's try the pipeline on the test dataset if it works. " ] }, { "cell_type": "code", "collapsed": false, "input": [ "y_test = pipe.predict(X_test)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 16 }, { "cell_type": "code", "collapsed": false, "input": [ "y_test" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 17, "text": [ "array([19, 19, 19, 19, 19, 16, 20, 16, 16, 16, 19, 19, 20, 19, 20, 15, 20,\n", " 16, 16, 12, 16, 16, 19, 16, 20, 20, 19, 20, 16, 16, 20, 16, 19, 19,\n", " 20, 19, 19, 19, 20, 20, 19, 19, 16, 29, 19, 12, 20, 29, 19, 19, 15,\n", " 19, 20, 20, 19, 12, 16, 19, 19, 12, 16, 19, 19, 16, 29, 20, 3, 15,\n", " 19, 12, 19, 3, 15, 3, 16, 19, 16, 29, 15, 19, 20, 19, 29, 19, 15,\n", " 19, 19, 16, 20, 16, 29, 16, 19, 19, 20, 19, 16, 16, 16, 19, 3, 20,\n", " 16, 16, 19, 16, 19, 16, 12, 16, 20, 19, 20, 20, 20, 12, 19, 19, 19,\n", " 29, 19, 16, 19, 19, 20, 15, 29, 19, 16, 20, 20, 16, 19, 19, 19, 19,\n", " 19, 3, 19, 19, 16, 15, 15, 19, 19, 19, 3, 3, 19, 20, 3, 3, 20,\n", " 19, 20, 3, 16, 20, 16, 16, 19, 20, 20, 20, 16, 19, 15, 16, 19, 20,\n", " 16, 19, 20, 12, 20, 19, 19, 19, 16, 19, 15, 29, 15, 3, 16, 19, 19,\n", " 16, 20, 19, 19, 19, 15, 16, 20, 19, 29, 19, 19, 29, 19, 29, 20, 12,\n", " 19, 29, 19, 19, 19, 19, 29, 19, 16, 16, 19, 20, 20, 19, 3, 20, 16,\n", " 3, 19, 16, 20, 19, 20, 20, 19, 16, 20, 19, 16, 20, 16, 20, 3, 19,\n", " 15, 16, 15, 19, 16, 20, 19, 20, 12, 19, 19, 20, 16, 19, 12, 16, 16,\n", " 15, 12, 19, 20, 19, 16, 20, 19, 19, 19, 19, 16, 12, 16, 19, 16, 16,\n", " 19, 20, 19, 19, 19, 20, 15, 20, 16, 19, 3, 16, 29, 19, 19, 20, 19,\n", " 12, 16, 29, 19, 20, 19, 20, 19, 19, 19, 16, 20, 20, 19, 16, 19, 20,\n", " 29, 16, 19, 16, 16, 16, 19, 19, 19, 3, 20, 20, 19, 3, 3, 20, 29,\n", " 16, 19, 19, 16, 19, 16, 19, 20, 19, 16, 19, 20, 19, 16, 15, 3, 20,\n", " 15, 19, 16, 15, 3, 12, 19, 19, 15, 20, 19, 3, 20, 16, 19, 16, 20,\n", " 20, 15, 16, 19, 16, 19, 20, 20, 12, 16, 19, 3, 29, 12, 19, 16, 19,\n", " 15, 20, 3, 16, 19, 19, 16, 19, 19, 15, 16, 19, 3, 20, 19, 19, 20,\n", " 20, 3, 19, 16, 19, 19, 12, 19, 16, 3, 16, 19, 20, 19, 19, 19, 16,\n", " 20, 19, 16, 19, 16, 19, 29, 16, 16, 19, 3, 16, 16, 16, 3, 29, 16,\n", " 20, 19, 16, 19, 12, 29, 16, 20, 15, 16, 20, 16, 20, 19, 19, 16, 19,\n", " 16, 20, 19, 19, 12, 19, 20, 16, 3, 20, 20, 3, 16, 15, 19, 3, 16,\n", " 3, 20, 20, 20, 20, 16, 16, 3, 19, 15, 12, 16, 16, 16, 19, 16, 19,\n", " 19, 16, 20, 20, 16, 19, 19, 19, 19, 19, 19, 16, 16, 3, 19, 20, 16,\n", " 3, 20, 16, 12, 19, 12, 16, 19, 19, 19, 19, 16, 20, 19, 19, 20, 16,\n", " 16, 16, 3, 29, 12, 19, 16, 16, 16, 16, 19, 19, 15, 20, 12, 20, 19,\n", " 19, 20, 19, 16, 16, 15, 19, 19, 20, 20, 19, 20, 20, 16])" ] } ], "prompt_number": 17 }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Takeaways\n", "- `Pipeline` is not just useful for abstraction but also easier to maintain, persist and deploy.\n", "- Do not use a loss-compression technique to compress serialization.(Tarballs would work fine)\n", "- If you want to create exact environment of your training set, use always `virtualenv` and note the version numbers of the libraries that you are using.(See the Notebook 0)\n", "- Some of the algorithms support `partial_fit` function for online learning. 
"cell_type": "markdown", "metadata": {}, "source": [ "### Takeaways\n", "- `Pipeline` is useful not just as an abstraction; it is also easier to maintain, persist, and deploy.\n", "- Do not use a lossy compression technique on the serialized files. (Tarballs work fine.)\n", "- If you want to recreate the exact environment you trained in, always use `virtualenv` and note the version numbers of the libraries you are using. (See Notebook 0.)\n", "- Some of the algorithms support a `partial_fit` function for online learning ([SGD Classifier](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html), [Perceptron](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Perceptron.html#sklearn.linear_model.Perceptron), [MultinomialNB](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html#sklearn.naive_bayes.MultinomialNB)). If you have incremental data with which you want to improve your classifier over time, you can persist your models and then use `partial_fit` to update them as new data arrives; see the sketch below. Works like a charm!" ] }, { 
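"cell_type": "markdown", "metadata": {}, "source": [ "A hedged sketch of that last bullet. `Pipeline` itself does not expose `partial_fit`, so a stateless `HashingVectorizer` (which needs no fitting, and hence nothing extra to persist) pairs naturally with an online learner such as `SGDClassifier`. The `n_features` value is an arbitrary illustration, and `new_X`/`new_y` are placeholders for a future batch of data:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# Online-learning sketch: hashing features are stateless, so only the\n", "# classifier needs to be persisted and incrementally updated.\n", "hasher = text.HashingVectorizer(n_features=2 ** 18)\n", "online_clf = linear_model.SGDClassifier()\n", "\n", "# The first call to partial_fit must receive the full set of classes.\n", "online_clf.partial_fit(hasher.transform(X_train), y_train, classes=np.unique(y))\n", "\n", "# Later, when a new batch (new_X, new_y) arrives:\n", "# online_clf.partial_fit(hasher.transform(new_X), new_y)" ], "language": "python", "metadata": {}, "outputs": [] } ], "metadata": {} } ] }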