{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "deletable": true, "editable": true }, "outputs": [], "source": [ "%matplotlib inline\n", "import numpy as np\n", "import matplotlib.pyplot as plt" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "# Pipelining estimators" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "In this section we study how different estimators maybe be chained." ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "## A simple example: feature extraction and selection before an estimator" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "### Feature extraction: vectorizer" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "For some types of data, for instance text data, a feature extraction step must be applied to convert it to numerical features.\n", "To illustrate we load the SMS spam dataset we used earlier." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "deletable": true, "editable": true }, "outputs": [], "source": [ "import os\n", "\n", "with open(os.path.join(\"datasets\", \"smsspam\", \"SMSSpamCollection\")) as f:\n", " lines = [line.strip().split(\"\\t\") for line in f.readlines()]\n", "text = [x[1] for x in lines]\n", "y = [x[0] == \"ham\" for x in lines]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "deletable": true, "editable": true }, "outputs": [], "source": [ "from sklearn.model_selection import train_test_split\n", "\n", "text_train, text_test, y_train, y_test = train_test_split(text, y)" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "Previously, we applied the feature extraction manually, like so:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "from sklearn.feature_extraction.text import TfidfVectorizer\n", "from sklearn.linear_model import LogisticRegression\n", "\n", "vectorizer = TfidfVectorizer()\n", "vectorizer.fit(text_train)\n", "\n", "X_train = vectorizer.transform(text_train)\n", "X_test = vectorizer.transform(text_test)\n", "\n", "clf = LogisticRegression()\n", "clf.fit(X_train, y_train)\n", "\n", "clf.score(X_test, y_test)" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "The situation where we learn a transformation and then apply it to the test data is very common in machine learning.\n", "Therefore scikit-learn has a shortcut for this, called pipelines:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "from sklearn.pipeline import make_pipeline\n", "\n", "pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression())\n", "pipeline.fit(text_train, y_train)\n", "pipeline.score(text_test, y_test)" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "As you can see, this makes the code much shorter and easier to handle. Behind the scenes, exactly the same as above is happening. 
, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "When calling ``fit`` on the pipeline, it will call ``fit`` on each step in turn.\n", "\n", "After the first step is fit, it will use the ``transform`` method of the first step to create a new representation.\n", "This will then be fed to the ``fit`` of the next step, and so on.\n", "Finally, on the last step, only ``fit`` is called.\n", "\n", "![pipeline](figures/pipeline.svg)\n", "\n", "If we call ``score``, only ``transform`` will be called on each step - this could be the test set, after all! Then, on the last step, ``score`` is called with the new representation. The same goes for ``predict``." ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "Building pipelines not only simplifies the code; it is also important for model selection.\n", "Say we want to grid-search ``C`` to tune our logistic regression above.\n", "\n", "Let's say we do it like this:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "# This illustrates a common mistake. Don't use this code!\n", "from sklearn.model_selection import GridSearchCV\n", "\n", "vectorizer = TfidfVectorizer()\n", "vectorizer.fit(text_train)\n", "\n", "X_train = vectorizer.transform(text_train)\n", "X_test = vectorizer.transform(text_test)\n", "\n", "clf = LogisticRegression()\n", "grid = GridSearchCV(clf, param_grid={'C': [.1, 1, 10, 100]}, cv=5)\n", "grid.fit(X_train, y_train)" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "### What did we do wrong?" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "Here, we ran the grid search with cross-validation on ``X_train``. However, the ``TfidfVectorizer`` was fit on all of ``X_train``,\n", "not only on the training folds! So it could use knowledge of the frequency of the words in the test folds. This is called \"contamination\" of the test set, and it leads to overly optimistic estimates of generalization performance, or to badly selected parameters.\n", "We can fix this with the pipeline, though:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "from sklearn.model_selection import GridSearchCV\n", "\n", "pipeline = make_pipeline(TfidfVectorizer(),\n", "                         LogisticRegression())\n", "\n", "grid = GridSearchCV(pipeline,\n", "                    param_grid={'logisticregression__C': [.1, 1, 10, 100]}, cv=5)\n", "\n", "grid.fit(text_train, y_train)\n", "grid.score(text_test, y_test)" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "Note that we need to tell the pipeline at which step we want to set the parameter ``C``.\n", "We can do this using the special ``__`` syntax: the name before the ``__`` is the name of the step (with ``make_pipeline``, simply the lowercased class name), and the part after the ``__`` is the parameter we want to set with the grid search." ] }
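, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "If you are unsure which parameter names a pipeline accepts, ``get_params`` lists all of them - a quick way to check the available ``__`` names:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "# The step names ('tfidfvectorizer', 'logisticregression') appear as\n", "# prefixes of the tunable parameters, e.g. 'logisticregression__C'.\n", "print(sorted(pipeline.get_params().keys()))" ] }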
] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "Another benefit of using pipelines is that we can now also search over parameters of the feature extraction with ``GridSearchCV``:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [], "source": [ "from sklearn.model_selection import GridSearchCV\n", "\n", "pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression())\n", "\n", "params = {'logisticregression__C': [.1, 1, 10, 100],\n", " \"tfidfvectorizer__ngram_range\": [(1, 1), (1, 2), (2, 2)]}\n", "grid = GridSearchCV(pipeline, param_grid=params, cv=5)\n", "grid.fit(text_train, y_train)\n", "print(grid.best_params_)\n", "grid.score(text_test, y_test)" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "
\n", " EXERCISE:\n", " \n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "deletable": true, "editable": true }, "outputs": [], "source": [ "# %load solutions/15A_ridge_grid.py" ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.4" } }, "nbformat": 4, "nbformat_minor": 2 }