{ "metadata": { "celltoolbar": "Slideshow", "name": "" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Introduction to Scikit-Learn \n", "\n", "View this IPython Notebook: \n", "\n", " j.mp/sklearn\n", " \n", "Everything is in a Github repo: \n", " \n", " github.com/tdhopper/\n", " \n", "View slides with:\n", "\n", " ipython nbconvert Intro\\ to\\ Scikit-Learn.ipynb --to slides --post serve" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "
# Introduction to Scikit-Learn\n", "\n", "__Research Triangle Analysts (1/16/13)__\n", "\n", "Software Engineer at [parse.ly](http://www.parse.ly)\n", "\n", "@tdhopper\n", "\n", "tdhopper@gmail.com\n
" ] }, { "cell_type": "heading", "level": 1, "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "What is Scikit-Learn?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\"Machine Learning in Python\"" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "* Classification \n", "* Regression\n", "* Clustering\n", "* Dimensionality Reduction\n", "* Model Selection\n", "* Preprocessing\n", "\n", "See more: [http://scikit-learn.org/stable/user_guide.html](http://scikit-learn.org/stable/user_guide.html)" ] }, { "cell_type": "heading", "level": 1, "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Why scikit-learn?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Six reasons why [Ben Lorica (@bigdata)](http://strata.oreilly.com/2013/12/six-reasons-why-i-recommend-scikit-learn.html) recommends scikit-learn\n", "\n", "One: __Commitment to documentation and usability__\n", "\n", "> One of the reasons I started using scikit-learn was because of its nice documentation (which I hold up as an example for other communities and projects to emulate). \n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Six reasons why [Ben Lorica (@bigdata)](http://strata.oreilly.com/2013/12/six-reasons-why-i-recommend-scikit-learn.html) recommends scikit-learn\n", "\n", "Two: __Models are chosen and implemented by a dedicated team of experts__\n", "\n", "> Scikit-learn\u2019s stable of contributors includes experts in machine-learning and software development." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Six reasons why [Ben Lorica (@bigdata)](http://strata.oreilly.com/2013/12/six-reasons-why-i-recommend-scikit-learn.html) recommends scikit-learn\n", "\n", "Three: __Covers most machine-learning tasks__\n", "\n", "> Scan the list of things available in scikit-learn and you quickly realize that it includes tools for many of the standard machine-learning tasks (such as clustering, classification, regression, etc.).\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Six reasons why [Ben Lorica (@bigdata)](http://strata.oreilly.com/2013/12/six-reasons-why-i-recommend-scikit-learn.html) recommends scikit-learn\n", "\n", "Four: __Python and Pydata__\n", "\n", "> An impressive set of Python data tools (pydata) have emerged over the last few years.\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Six reasons why [Ben Lorica (@bigdata)](http://strata.oreilly.com/2013/12/six-reasons-why-i-recommend-scikit-learn.html) recommends scikit-learn\n", "\n", "Five: __Focus__\n", "\n", "> Scikit-learn is a machine-learning library. 
Its goal is to provide a set of common algorithms to Python users through a consistent interface.\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Six reasons why [Ben Lorica (@bigdata)](http://strata.oreilly.com/2013/12/six-reasons-why-i-recommend-scikit-learn.html) recommends scikit-learn\n", "\n", "Six: __scikit-learn scales to most data problems__\n", "\n", "> Many problems can be tackled using a single (big memory) server, and well-designed software that runs on a single machine can blow away distributed systems.\n" ] }, { "cell_type": "heading", "level": 1, "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "This talk is _not_...\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "...an introduction to Python" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "...an introduction to machine learning" ] }, { "cell_type": "heading", "level": 1, "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Example\n" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn import datasets\n", "from numpy import logical_or\n", "from sklearn.lda import LDA\n", "from sklearn.metrics import confusion_matrix" ], "language": "python", "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "prompt_number": 4 }, { "cell_type": "code", "collapsed": false, "input": [ "iris = datasets.load_iris()\n", "subset = logical_or(iris.target == 0, iris.target == 1)\n", "\n", "X = iris.data[subset]\n", "y = iris.target[subset]" ], "language": "python", "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [], "prompt_number": 5 }, { "cell_type": "code", "collapsed": false, "input": [ "print X[0:5,:]" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "[[ 5.1 3.5 1.4 0.2]\n", " [ 4.9 3. 1.4 0.2]\n", " [ 4.7 3.2 1.3 0.2]\n", " [ 4.6 3.1 1.5 0.2]\n", " [ 5. 
3.6 1.4 0.2]]\n" ] } ], "prompt_number": 6 }, { "cell_type": "code", "collapsed": false, "input": [ "print y[0:5]" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "[0 0 0 0 0]\n" ] } ], "prompt_number": 7 }, { "cell_type": "code", "collapsed": false, "input": [ "# Linear Discriminant Analysis\n", "lda = LDA(2)\n", "lda.fit(X, y)\n", "\n", "confusion_matrix(y, lda.predict(X))" ], "language": "python", "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 8, "text": [ "array([[50, 0],\n", " [ 0, 50]])" ] } ], "prompt_number": 8 }, { "cell_type": "heading", "level": 1, "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "The Scikit-learn API" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The main \"interfaces\" in scikit-learn are (one class can implement multiple interfaces): \n", "\n", "__Estimator__: \n", "\n", "    estimator = obj.fit(data, targets) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__Predictor__: \n", "\n", "    prediction = obj.predict(data) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__Transformer__:\n", "\n", "    new_data = obj.transform(data) \n", " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__Model__:\n", "\n", "    score = obj.score(data)" ] }, { "cell_type": "heading", "level": 2, "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Scikit-learn API: the Estimator" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "All estimators implement the __fit__ method:\n", "\n", "    estimator.fit(X, y)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "-" } }, "source": [ " \n", "> An estimator is an object that __fits a model__ based on some training data and is __capable of inferring__ some properties on new data." 
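] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Because an estimator is, at bottom, just an object with a __fit__ method, you can sketch one by hand. The class below is a hypothetical illustration added to these notes (it is not part of scikit-learn): __fit__ learns the column means of the data, stores them in a trailing-underscore attribute (the scikit-learn convention for learned values), and returns `self`." ] }, { "cell_type": "code", "collapsed": false, "input": [ "import numpy as np\n", "\n", "# Hypothetical, hand-rolled estimator (not a scikit-learn class):\n", "# fit() learns something from the training data, stores it in a\n", "# trailing-underscore attribute, and returns self.\n", "class ColumnMeans(object):\n", "    def fit(self, X, y=None):\n", "        self.means_ = np.asarray(X).mean(axis=0)\n", "        return self\n", "\n", "print ColumnMeans().fit(X, y).means_" ], "language": "python", "metadata": {}, "outputs": [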
] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.linear_model import LogisticRegression" ], "language": "python", "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "prompt_number": 9 }, { "cell_type": "code", "collapsed": false, "input": [ "# Create Model\n", "model = LogisticRegression()\n", "# Fit Model\n", "model.fit(X, y)" ], "language": "python", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 10, "text": [ "LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,\n", " intercept_scaling=1, penalty='l2', random_state=None, tol=0.0001)" ] } ], "prompt_number": 10 }, { "cell_type": "heading", "level": 2, "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "(Almost) everything is an estimator" ] }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "Unsupervised Learning" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.cluster import KMeans" ], "language": "python", "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "prompt_number": 11 }, { "cell_type": "code", "collapsed": false, "input": [ "# Create Model\n", "kmeans = KMeans(n_clusters = 2)\n", "# Fit Model\n", "kmeans.fit(X)" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 12, "text": [ "KMeans(copy_x=True, init='k-means++', max_iter=300, n_clusters=2, n_init=10,\n", " n_jobs=1, precompute_distances=True, random_state=None, tol=0.0001,\n", " verbose=0)" ] } ], "prompt_number": 12 }, { "cell_type": "heading", "level": 3, "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Dimensionality Reduction" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.decomposition import PCA" ], "language": "python", "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "prompt_number": 13 }, { "cell_type": "code", "collapsed": false, "input": [ "# Create Model \n", "pca = PCA(n_components=2)\n", "# Fit Model\n", "pca.fit(X)" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 14, "text": [ "PCA(copy=True, n_components=2, whiten=False)" ] } ], "prompt_number": 14 }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "The __fit__ method takes a $y$ parameter even if it isn't needed (though $y$ is ignored). This is important later." 
] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.decomposition import PCA" ], "language": "python", "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "prompt_number": 15 }, { "cell_type": "code", "collapsed": false, "input": [ "pca = PCA(n_components=2)\n", "pca.fit(X, y)" ], "language": "python", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 16, "text": [ "PCA(copy=True, n_components=2, whiten=False)" ] } ], "prompt_number": 16 }, { "cell_type": "heading", "level": 3, "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Feature Selection" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.feature_selection import SelectKBest\n", "from sklearn.metrics import matthews_corrcoef" ], "language": "python", "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "prompt_number": 17 }, { "cell_type": "code", "collapsed": false, "input": [ "# Create Model\n", "kbest = SelectKBest(k = 3)\n", "# Fit Model\n", "kbest.fit(X, y)" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 18, "text": [ "SelectKBest(k=1, score_func=)" ] } ], "prompt_number": 18 }, { "cell_type": "heading", "level": 2, "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "(Almost) everything is an estimator!" ] }, { "cell_type": "code", "collapsed": false, "input": [ "model = LogisticRegression()\n", "model.fit(X, y)\n", "\n", "kbest = SelectKBest(k = 1)\n", "kbest.fit(X, y)\n", "\n", "kmeans = KMeans(n_clusters = 2)\n", "kmeans.fit(X, y)\n", "\n", "pca = PCA(n_components=2)\n", "pca.fit(X, y)" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 83, "text": [ "PCA(copy=True, n_components=2, whiten=False)" ] } ], "prompt_number": 83 }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "__What can we do with an estimator?__ \n", "\n", "Inference!" 
] }, { "cell_type": "code", "collapsed": false, "input": [ "model = LogisticRegression()\n", "model.fit(X, y)\n", "print model.coef_" ], "language": "python", "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "[[-0.40731745 -1.46092371 2.24004724 1.00841492]]\n" ] } ], "prompt_number": 19 }, { "cell_type": "code", "collapsed": false, "input": [ "kmeans = KMeans(n_clusters = 2)\n", "kmeans.fit(X)\n", "print kmeans.cluster_centers_" ], "language": "python", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "[[ 5.936 2.77 4.26 1.326]\n", " [ 5.006 3.418 1.464 0.244]]\n" ] } ], "prompt_number": 20 }, { "cell_type": "code", "collapsed": false, "input": [ "pca = PCA(n_components=2)\n", "pca.fit(X, y)\n", "print pca.explained_variance_" ], "language": "python", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "[ 2.73946394 0.22599044]\n" ] } ], "prompt_number": 21 }, { "cell_type": "code", "collapsed": false, "input": [ "kbest = SelectKBest(k = 1)\n", "kbest.fit(X, y)\n", "print kbest.get_support()" ], "language": "python", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "[False False True False]\n" ] } ], "prompt_number": 22 }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "__Is that it?__" ] }, { "cell_type": "heading", "level": 2, "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Scikit-learn API: the Predictor" ] }, { "cell_type": "code", "collapsed": false, "input": [ "model = LogisticRegression()\n", "model.fit(X, y)\n", "\n", "X_test = [[ 5.006, 3.418, 1.464, 0.244], [ 5.936, 2.77 , 4.26 , 1.326]]\n", "\n", "model.predict(X_test)" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 23, "text": [ "array([0, 1])" ] } ], "prompt_number": 23 }, { "cell_type": "code", "collapsed": false, "input": [ "print model.predict_proba(X_test)" ], "language": "python", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "[[ 0.97741151 0.02258849]\n", " [ 0.01544837 0.98455163]]\n" ] } ], "prompt_number": 24 }, { "cell_type": "heading", "level": 2, "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Scikit-learn API: the Transformer" ] }, { "cell_type": "code", "collapsed": false, "input": [ "pca = PCA(n_components=2)\n", "pca.fit(X)\n", "\n", "print pca.transform(X)[0:5,:]" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "[[-1.65441341 -0.20660719]\n", " [-1.63509488 0.2988347 ]\n", " [-1.82037547 0.27141696]\n", " [-1.66207305 0.43021683]\n", " [-1.70358916 -0.21574051]]\n" ] } ], "prompt_number": 25 }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "__fit_transform__ is also available (and is sometimes faster)." 
] }, { "cell_type": "code", "collapsed": false, "input": [ "pca = PCA(n_components=2)\n", "print pca.fit_transform(X)[0:5,:]" ], "language": "python", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "[[-1.65441341 -0.20660719]\n", " [-1.63509488 0.2988347 ]\n", " [-1.82037547 0.27141696]\n", " [-1.66207305 0.43021683]\n", " [-1.70358916 -0.21574051]]\n" ] } ], "prompt_number": 54 }, { "cell_type": "code", "collapsed": false, "input": [ "kbest = SelectKBest(k = 1)\n", "kbest.fit(X, y)\n", "\n", "print kbest.transform(X)[0:5,:]" ], "language": "python", "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "[[ 1.4]\n", " [ 1.4]\n", " [ 1.3]\n", " [ 1.5]\n", " [ 1.4]]\n" ] } ], "prompt_number": 26 }, { "cell_type": "heading", "level": 2, "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Scikit-learn API: the Model" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.cross_validation import KFold\n", "from numpy import arange\n", "from random import shuffle\n", "from sklearn.dummy import DummyClassifier" ], "language": "python", "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "prompt_number": 27 }, { "cell_type": "code", "collapsed": false, "input": [ "model = DummyClassifier()\n", "model.fit(X, y)\n", "\n", "model.score(X, y)" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 86, "text": [ "0.48999999999999999" ] } ], "prompt_number": 86 }, { "cell_type": "heading", "level": 1, "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Building Pipelines" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.pipeline import Pipeline" ], "language": "python", "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "prompt_number": 87 }, { "cell_type": "code", "collapsed": false, "input": [ "pipe = Pipeline([\n", " (\"select\", SelectKBest(k = 3)),\n", " (\"pca\", PCA(n_components = 1)),\n", " (\"classify\", LogisticRegression())\n", " ])\n", "\n", "pipe.fit(X, y)\n", "\n", "pipe.predict(X)" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 55, "text": [ "array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", " 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", " 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\n", " 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\n", " 1, 1, 1, 1, 1, 1, 1, 1])" ] } ], "prompt_number": 55 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Intermediate steps of the pipeline must be __Estimators__ and __Transformers__.\n", "\n", "The final estimator needs only to be an __Estimator__." 
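] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As an added illustration (not in the original talk): a pipeline takes on the interface of its final step. With a classifier last, the pipeline above acts as a Predictor; if the last step is a Transformer such as PCA, the whole pipeline can be used as a Transformer." ] }, { "cell_type": "code", "collapsed": false, "input": [ "# Added sketch: with a Transformer (PCA) as the final step, the whole\n", "# pipeline behaves like a Transformer, so transform() is available.\n", "reduce_pipe = Pipeline([\n", "    ('select', SelectKBest(k = 3)),\n", "    ('pca', PCA(n_components = 1))\n", "])\n", "\n", "reduce_pipe.fit(X, y)\n", "print reduce_pipe.transform(X)[0:5, :]" ], "language": "python", "metadata": {}, "outputs": [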
] }, { "cell_type": "heading", "level": 2, "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Text Pipeline" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.datasets import fetch_20newsgroups\n", "from sklearn.feature_extraction.text import CountVectorizer\n", "from sklearn.feature_extraction.text import TfidfTransformer\n", "from sklearn.linear_model import SGDClassifier\n" ], "language": "python", "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "prompt_number": 78 }, { "cell_type": "code", "collapsed": false, "input": [ "news = fetch_20newsgroups()\n", "data = news.data\n", "category = news.target" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 71 }, { "cell_type": "code", "collapsed": false, "input": [ "len(data)" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 72, "text": [ "11314" ] } ], "prompt_number": 72 }, { "cell_type": "code", "collapsed": false, "input": [ "print \" \".join(news.target_names)" ], "language": "python", "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "alt.atheism comp.graphics comp.os.ms-windows.misc comp.sys.ibm.pc.hardware comp.sys.mac.hardware comp.windows.x misc.forsale rec.autos rec.motorcycles rec.sport.baseball rec.sport.hockey sci.crypt sci.electronics sci.med sci.space soc.religion.christian talk.politics.guns talk.politics.mideast talk.politics.misc talk.religion.misc\n" ] } ], "prompt_number": 92 }, { "cell_type": "code", "collapsed": false, "input": [ "print data[8]" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "From: holmes7000@iscsvax.uni.edu\n", "Subject: WIn 3.0 ICON HELP PLEASE!\n", "Organization: University of Northern Iowa\n", "Lines: 10\n", "\n", "I have win 3.0 and downloaded several icons and BMP's but I can't figure out\n", "how to change the \"wallpaper\" or use the icons. Any help would be appreciated.\n", "\n", "\n", "Thanx,\n", "\n", "-Brando\n", "\n", "PS Please E-mail me\n", "\n", "\n" ] } ], "prompt_number": 99 }, { "cell_type": "code", "collapsed": false, "input": [ "pipe = Pipeline([\n", " ('vect', CountVectorizer(max_features = 100)),\n", " ('tfidf', TfidfTransformer()),\n", " ('clf', SGDClassifier()),\n", "])\n", "\n", "pipe.fit(data, category)" ], "language": "python", "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 100, "text": [ "Pipeline(steps=[('vect', CountVectorizer(analyzer=u'word', binary=False, charset=None,\n", " charset_error=None, decode_error=u'strict',\n", " dtype=, encoding=u'utf-8', input=u'content',\n", " lowercase=True, max_df=1.0, max_features=100, min_df=1,\n", " ngram_range=(1, 1), prepr..., penalty='l2', power_t=0.5,\n", " random_state=None, shuffle=False, verbose=0, warm_start=False))])" ] } ], "prompt_number": 100 }, { "cell_type": "heading", "level": 2, "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Pandas Pipelines!" 
] }, { "cell_type": "code", "collapsed": false, "input": [ "import pandas as pd\n", "import numpy as np\n", "import sklearn.preprocessing, sklearn.decomposition, sklearn.linear_model, sklearn.pipeline, sklearn.metrics\n", "from sklearn_pandas import DataFrameMapper, cross_val_score" ], "language": "python", "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "prompt_number": 107 }, { "cell_type": "code", "collapsed": false, "input": [ "data = pd.DataFrame({\n", " 'pet': ['cat', 'dog', 'dog', 'fish', 'cat', 'dog', 'cat', 'fish'],\n", " 'children': [4., 6, 3, 3, 2, 3, 5, 4],\n", " 'salary': [90, 24, 44, 27, 32, 59, 36, 27]\n", "})" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 117 }, { "cell_type": "code", "collapsed": false, "input": [ "mapper = DataFrameMapper([\n", " ('pet', sklearn.preprocessing.LabelBinarizer()),\n", " ('children', sklearn.preprocessing.StandardScaler()),\n", " ('salary', None)\n", "])" ], "language": "python", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "prompt_number": 111 }, { "cell_type": "code", "collapsed": false, "input": [ "mapper.fit_transform(data)" ], "language": "python", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 113, "text": [ "array([[ 1. , 0. , 0. , 0.20851441, 90. ],\n", " [ 0. , 1. , 0. , 1.87662973, 24. ],\n", " [ 0. , 1. , 0. , -0.62554324, 44. ],\n", " [ 0. , 0. , 1. , -0.62554324, 27. ],\n", " [ 1. , 0. , 0. , -1.4596009 , 32. ],\n", " [ 0. , 1. , 0. , -0.62554324, 59. ],\n", " [ 1. , 0. , 0. , 1.04257207, 36. ],\n", " [ 0. , 0. , 1. , 0.20851441, 27. ]])" ] } ], "prompt_number": 113 }, { "cell_type": "code", "collapsed": false, "input": [ "mapper = DataFrameMapper([\n", " ('pet', sklearn.preprocessing.LabelBinarizer()),\n", " ('children', sklearn.preprocessing.StandardScaler()),\n", " ('salary', None)\n", "])\n", "\n", "pipe = Pipeline([\n", " (\"mapper\", mapper),\n", " (\"pca\", PCA(n_components=2))\n", "])\n", "pipe.fit_transform(data) # 'data' is a data frame, not a numpy array!" ], "language": "python", "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 157, "text": [ "array([[ -4.76269151e+01, 4.25991055e-01],\n", " [ 1.83856756e+01, 1.86178138e+00],\n", " [ -1.62747544e+00, -5.06199939e-01],\n", " [ 1.53796381e+01, -8.10331853e-01],\n", " [ 1.03575109e+01, -1.52528125e+00],\n", " [ -1.66260441e+01, -4.27845667e-01],\n", " [ 6.37295205e+00, 9.68066902e-01],\n", " [ 1.53846579e+01, 1.38193738e-02]])" ] } ], "prompt_number": 157 }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Pandas pipelines require [sklearn-pandas](https://github.com/paulgb/sklearn-pandas) module by [@paulgb](http://www.twitter.com/paulgb)." 
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Also by Paul:\n", " \n", "[![](facebook_map.png)](https://www.facebook.com/note.php?note_id=469716398919)" ] }, { "cell_type": "heading", "level": 1, "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Model Evaluation and Selection" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.grid_search import GridSearchCV, RandomizedSearchCV\n", "from sklearn import datasets\n", "from sklearn.ensemble import RandomForestClassifier" ], "language": "python", "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "prompt_number": 212 }, { "cell_type": "code", "collapsed": false, "input": [ "# Create sample dataset\n", "X, y = datasets.make_classification(n_samples = 1000, n_features = 40, n_informative = 6, n_classes = 2)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 137 }, { "cell_type": "code", "collapsed": false, "input": [ "# Pipeline for Feature Selection to Random Forest\n", "pipe = Pipeline([\n", " (\"select\", SelectKBest()),\n", " (\"classify\", RandomForestClassifier())\n", "])" ], "language": "python", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "prompt_number": 162 }, { "cell_type": "code", "collapsed": false, "input": [ "# Define parameter grid\n", "param_grid = {\n", " \"select__k\" : [1, 6, 20, 40],\n", " \"classify__n_estimators\" : [1, 10, 100],\n", " \n", "}\n", "gs = GridSearchCV(pipe, param_grid)" ], "language": "python", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "prompt_number": 175 }, { "cell_type": "code", "collapsed": false, "input": [ "# Search over grid\n", "gs.fit(X, y)\n", "\n", "gs.best_params_" ], "language": "python", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 183, "text": [ "{'classify__n_estimators': 10, 'select__k': 6}" ] } ], "prompt_number": 183 }, { "cell_type": "code", "collapsed": false, "input": [ "print gs.best_estimator_.predict(X.mean(axis = 0))" ], "language": "python", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "[1]\n" ] } ], "prompt_number": 192 }, { "cell_type": "heading", "level": 2, "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Curse of Dimensionality" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Search space grows exponentially with number of parameters." 
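] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To make that concrete, the short added illustration below (not in the original talk) just counts candidate settings: the 4 x 3 grid above has 12 combinations, while the exhaustive grid tried later in this notebook has 39 x 99 = 3861, and __GridSearchCV__ refits the pipeline for every one of them on each cross-validation fold." ] }, { "cell_type": "code", "collapsed": false, "input": [ "# Added illustration: count the parameter combinations GridSearchCV must try\n", "# (each one is fit once per cross-validation fold).\n", "small_grid = {'select__k': [1, 6, 20, 40], 'classify__n_estimators': [1, 10, 100]}\n", "large_grid = {'select__k': range(1, 40), 'classify__n_estimators': range(1, 100)}\n", "\n", "def n_candidates(grid):\n", "    total = 1\n", "    for values in grid.values():\n", "        total *= len(values)\n", "    return total\n", "\n", "print n_candidates(small_grid), n_candidates(large_grid)" ], "language": "python", "metadata": {}, "outputs": [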
] }, { "cell_type": "code", "collapsed": false, "input": [ "gs.grid_scores_" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 185, "text": [ "[mean: 0.72600, std: 0.02773, params: {'classify__n_estimators': 1, 'select__k': 1},\n", " mean: 0.78200, std: 0.00631, params: {'classify__n_estimators': 1, 'select__k': 6},\n", " mean: 0.74400, std: 0.02580, params: {'classify__n_estimators': 1, 'select__k': 20},\n", " mean: 0.70600, std: 0.05772, params: {'classify__n_estimators': 1, 'select__k': 40},\n", " mean: 0.73800, std: 0.02372, params: {'classify__n_estimators': 10, 'select__k': 1},\n", " mean: 0.90000, std: 0.01539, params: {'classify__n_estimators': 10, 'select__k': 6},\n", " mean: 0.86400, std: 0.01047, params: {'classify__n_estimators': 10, 'select__k': 20},\n", " mean: 0.81200, std: 0.02247, params: {'classify__n_estimators': 10, 'select__k': 40},\n", " mean: 0.73600, std: 0.02229, params: {'classify__n_estimators': 100, 'select__k': 1},\n", " mean: 0.89200, std: 0.01520, params: {'classify__n_estimators': 100, 'select__k': 6},\n", " mean: 0.89000, std: 0.01769, params: {'classify__n_estimators': 100, 'select__k': 20},\n", " mean: 0.87000, std: 0.02366, params: {'classify__n_estimators': 100, 'select__k': 40}]" ] } ], "prompt_number": 185 }, { "cell_type": "heading", "level": 2, "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Curse of Dimensionality: Parallelization " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__GridSearch__ on 1 core:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "param_grid = {\n", " \"select__k\" : [1, 5, 10, 15, 20, 25, 30, 35, 40],\n", " \"classify__n_estimators\" : [1, 5, 10, 25, 50, 75, 100],\n", " \n", "}\n", "gs = GridSearchCV(pipe, param_grid, n_jobs = 1)\n", "%timeit gs.fit(X, y)\n", "print" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "1 loops, best of 3: 6.31 s per loop\n", "\n" ] } ], "prompt_number": 207 }, { "cell_type": "markdown", "metadata": {}, "source": [ "__GridSearch__ on 7 cores:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "gs = GridSearchCV(pipe, param_grid, n_jobs = 7)\n", "%timeit gs.fit(X, y)\n", "print" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "1 loops, best of 3: 1.81 s per loop\n", "\n" ] } ], "prompt_number": 208 }, { "cell_type": "heading", "level": 2, "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Curse of Dimensionality: Randomization" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__GridSearchCV__ might be very slow:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "param_grid = {\n", " \"select__k\" : range(1, 40),\n", " \"classify__n_estimators\" : range(1, 100), \n", "}" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 220 }, { "cell_type": "code", "collapsed": false, "input": [ "gs = GridSearchCV(pipe, param_grid, n_jobs = 7)\n", "gs.fit(X, y)\n", "print \"Best CV score\", gs.best_score_\n", "print gs.best_params_" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "0.924\n", "{'classify__n_estimators': 59, 'select__k': 9}\n" ] } ], "prompt_number": 221 }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can instead randomly sample from the parameter space with __RandomizedSearchCV__:" ] }, { "cell_type": "code", "collapsed": 
false, "input": [ "gs = RandomizedSearchCV(pipe, param_grid, n_jobs = 7, n_iter = 10)\n", "gs.fit(X, y)\n", "print \"Best CV score\", gs.best_score_\n", "print gs.best_params_" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "0.894\n", "{'classify__n_estimators': 58, 'select__k': 7}\n" ] } ], "prompt_number": 229 }, { "cell_type": "heading", "level": 1, "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Conclusions" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "-" } }, "source": [ "* Scikit-learn has an elegant API and is built in a beautiful language." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "* Pipelines allow complex chains of operations to be easily computed.\n", " * This helps ensure correct cross validation (see _Elements of Statistical Learning_ 7.10.2). " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "* Pipelines combined with grid search permit easy model selection." ] } ], "metadata": {} } ] }