{ "metadata": { "celltoolbar": "Slideshow", "name": "" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Introduction to Scikit-Learn \n", "\n", "View this IPython Notebook: \n", "\n", " j.mp/sklearn\n", " \n", "Everything is in a Github repo: \n", " \n", " github.com/tdhopper/\n", " \n", "View slides with:\n", "\n", " ipython nbconvert Intro\\ to\\ Scikit-Learn.ipynb --to slides --post serve" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "
# Introduction to Scikit-Learn\n", "\n", "__Research Triangle Analysts (1/16/13)__\n", "\n", "Software Engineer at [parse.ly](http://www.parse.ly)\n", "\n", "@tdhopper\n", "\n", "tdhopper@gmail.com\n
" ] }, { "cell_type": "heading", "level": 1, "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "What is Scikit-Learn?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\"Machine Learning in Python\"" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "* Classification \n", "* Regression\n", "* Clustering\n", "* Dimensionality Reduction\n", "* Model Selection\n", "* Preprocessing\n", "\n", "See more: [http://scikit-learn.org/stable/user_guide.html](http://scikit-learn.org/stable/user_guide.html)" ] }, { "cell_type": "heading", "level": 1, "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Why scikit-learn?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Six reasons why [Ben Lorica (@bigdata)](http://strata.oreilly.com/2013/12/six-reasons-why-i-recommend-scikit-learn.html) recommends scikit-learn\n", "\n", "One: __Commitment to documentation and usability__\n", "\n", "> One of the reasons I started using scikit-learn was because of its nice documentation (which I hold up as an example for other communities and projects to emulate). \n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Six reasons why [Ben Lorica (@bigdata)](http://strata.oreilly.com/2013/12/six-reasons-why-i-recommend-scikit-learn.html) recommends scikit-learn\n", "\n", "Two: __Models are chosen and implemented by a dedicated team of experts__\n", "\n", "> Scikit-learn\u2019s stable of contributors includes experts in machine-learning and software development." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Six reasons why [Ben Lorica (@bigdata)](http://strata.oreilly.com/2013/12/six-reasons-why-i-recommend-scikit-learn.html) recommends scikit-learn\n", "\n", "Three: __Covers most machine-learning tasks__\n", "\n", "> Scan the list of things available in scikit-learn and you quickly realize that it includes tools for many of the standard machine-learning tasks (such as clustering, classification, regression, etc.).\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Six reasons why [Ben Lorica (@bigdata)](http://strata.oreilly.com/2013/12/six-reasons-why-i-recommend-scikit-learn.html) recommends scikit-learn\n", "\n", "Four: __Python and Pydata__\n", "\n", "> An impressive set of Python data tools (pydata) have emerged over the last few years.\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Six reasons why [Ben Lorica (@bigdata)](http://strata.oreilly.com/2013/12/six-reasons-why-i-recommend-scikit-learn.html) recommends scikit-learn\n", "\n", "Five: __Focus__\n", "\n", "> Scikit-learn is a machine-learning library. 
Its goal is to provide a set of common algorithms to Python users through a consistent interface.\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Six reasons why [Ben Lorica (@bigdata)](http://strata.oreilly.com/2013/12/six-reasons-why-i-recommend-scikit-learn.html) recommends scikit-learn\n", "\n", "Six: __scikit-learn scales to most data problems__\n", "\n", "> Many problems can be tackled using a single (big memory) server, and well-designed software that runs on a single machine can blow away distributed systems.\n" ] }, { "cell_type": "heading", "level": 1, "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "This talk is _not_...\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "...an introduction to Python" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "...an introduction to machine learning" ] }, { "cell_type": "heading", "level": 1, "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Example\n" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn import datasets\n", "from numpy import logical_or\n", "from sklearn.lda import LDA\n", "from sklearn.metrics import confusion_matrix" ], "language": "python", "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "prompt_number": 4 }, { "cell_type": "code", "collapsed": false, "input": [ "iris = datasets.load_iris()\n", "subset = logical_or(iris.target == 0, iris.target == 1)\n", "\n", "X = iris.data[subset]\n", "y = iris.target[subset]" ], "language": "python", "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [], "prompt_number": 5 }, { "cell_type": "code", "collapsed": false, "input": [ "print X[0:5,:]" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "[[ 5.1 3.5 1.4 0.2]\n", " [ 4.9 3. 1.4 0.2]\n", " [ 4.7 3.2 1.3 0.2]\n", " [ 4.6 3.1 1.5 0.2]\n", " [ 5. 
3.6 1.4 0.2]]\n" ] } ], "prompt_number": 6 }, { "cell_type": "code", "collapsed": false, "input": [ "print y[0:5]" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "[0 0 0 0 0]\n" ] } ], "prompt_number": 7 }, { "cell_type": "code", "collapsed": false, "input": [ "# Linear Discriminant Analysis\n", "lda = LDA(2)\n", "lda.fit(X, y)\n", "\n", "confusion_matrix(y, lda.predict(X))" ], "language": "python", "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 8, "text": [ "array([[50, 0],\n", " [ 0, 50]])" ] } ], "prompt_number": 8 }, { "cell_type": "heading", "level": 1, "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "The Scikit-learn API" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The main \"interfaces\" in scikit-learn are (one class can implement multiple interfaces): \n", "\n", "__Estimator__: \n", "\n", "    estimator = obj.fit(data, targets) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__Predictor__: \n", "\n", "    prediction = obj.predict(data) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__Transformer__:\n", "\n", "    new_data = obj.transform(data) \n", " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__Model__:\n", "\n", "    score = obj.score(data)" ] }, { "cell_type": "heading", "level": 2, "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Scikit-learn API: the Estimator" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "All estimators implement the __fit__ method:\n", "\n", "    estimator.fit(X, y)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "-" } }, "source": [ " \n", "> An estimator is an object that __fits a model__ based on some training data and is __capable of inferring__ some properties on new data." 
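] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Because an estimator is, at bottom, just an object with a __fit__ method, you can sketch one by hand. The class below is a hypothetical illustration added to these notes (it is not part of scikit-learn): __fit__ learns the column means of the data, stores them in a trailing-underscore attribute (the scikit-learn convention for learned values), and returns `self`." ] }, { "cell_type": "code", "collapsed": false, "input": [ "import numpy as np\n", "\n", "# Hypothetical, hand-rolled estimator (not a scikit-learn class):\n", "# fit() learns something from the training data, stores it in a\n", "# trailing-underscore attribute, and returns self.\n", "class ColumnMeans(object):\n", "    def fit(self, X, y=None):\n", "        self.means_ = np.asarray(X).mean(axis=0)\n", "        return self\n", "\n", "print ColumnMeans().fit(X, y).means_" ], "language": "python", "metadata": {}, "outputs": [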
] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.linear_model import LogisticRegression" ], "language": "python", "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "prompt_number": 9 }, { "cell_type": "code", "collapsed": false, "input": [ "# Create Model\n", "model = LogisticRegression()\n", "# Fit Model\n", "model.fit(X, y)" ], "language": "python", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 10, "text": [ "LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,\n", " intercept_scaling=1, penalty='l2', random_state=None, tol=0.0001)" ] } ], "prompt_number": 10 }, { "cell_type": "heading", "level": 2, "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "(Almost) everything is an estimator" ] }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "Unsupervised Learning" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.cluster import KMeans" ], "language": "python", "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "prompt_number": 11 }, { "cell_type": "code", "collapsed": false, "input": [ "# Create Model\n", "kmeans = KMeans(n_clusters = 2)\n", "# Fit Model\n", "kmeans.fit(X)" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 12, "text": [ "KMeans(copy_x=True, init='k-means++', max_iter=300, n_clusters=2, n_init=10,\n", " n_jobs=1, precompute_distances=True, random_state=None, tol=0.0001,\n", " verbose=0)" ] } ], "prompt_number": 12 }, { "cell_type": "heading", "level": 3, "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Dimensionality Reduction" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.decomposition import PCA" ], "language": "python", "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "prompt_number": 13 }, { "cell_type": "code", "collapsed": false, "input": [ "# Create Model \n", "pca = PCA(n_components=2)\n", "# Fit Model\n", "pca.fit(X)" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 14, "text": [ "PCA(copy=True, n_components=2, whiten=False)" ] } ], "prompt_number": 14 }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "The __fit__ method takes a $y$ parameter even if it isn't needed (though $y$ is ignored). This is important later." 
] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.decomposition import PCA" ], "language": "python", "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "prompt_number": 15 }, { "cell_type": "code", "collapsed": false, "input": [ "pca = PCA(n_components=2)\n", "pca.fit(X, y)" ], "language": "python", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 16, "text": [ "PCA(copy=True, n_components=2, whiten=False)" ] } ], "prompt_number": 16 }, { "cell_type": "heading", "level": 3, "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Feature Selection" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.feature_selection import SelectKBest\n", "from sklearn.metrics import matthews_corrcoef" ], "language": "python", "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "prompt_number": 17 }, { "cell_type": "code", "collapsed": false, "input": [ "# Create Model\n", "kbest = SelectKBest(k = 3)\n", "# Fit Model\n", "kbest.fit(X, y)" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 18, "text": [ "SelectKBest(k=1, score_func=)" ] } ], "prompt_number": 18 }, { "cell_type": "heading", "level": 2, "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "(Almost) everything is an estimator!" ] }, { "cell_type": "code", "collapsed": false, "input": [ "model = LogisticRegression()\n", "model.fit(X, y)\n", "\n", "kbest = SelectKBest(k = 1)\n", "kbest.fit(X, y)\n", "\n", "kmeans = KMeans(n_clusters = 2)\n", "kmeans.fit(X, y)\n", "\n", "pca = PCA(n_components=2)\n", "pca.fit(X, y)" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 83, "text": [ "PCA(copy=True, n_components=2, whiten=False)" ] } ], "prompt_number": 83 }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "__What can we do with an estimator?__ \n", "\n", "Inference!" 
] }, { "cell_type": "code", "collapsed": false, "input": [ "model = LogisticRegression()\n", "model.fit(X, y)\n", "print model.coef_" ], "language": "python", "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "[[-0.40731745 -1.46092371 2.24004724 1.00841492]]\n" ] } ], "prompt_number": 19 }, { "cell_type": "code", "collapsed": false, "input": [ "kmeans = KMeans(n_clusters = 2)\n", "kmeans.fit(X)\n", "print kmeans.cluster_centers_" ], "language": "python", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "[[ 5.936 2.77 4.26 1.326]\n", " [ 5.006 3.418 1.464 0.244]]\n" ] } ], "prompt_number": 20 }, { "cell_type": "code", "collapsed": false, "input": [ "pca = PCA(n_components=2)\n", "pca.fit(X, y)\n", "print pca.explained_variance_" ], "language": "python", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "[ 2.73946394 0.22599044]\n" ] } ], "prompt_number": 21 }, { "cell_type": "code", "collapsed": false, "input": [ "kbest = SelectKBest(k = 1)\n", "kbest.fit(X, y)\n", "print kbest.get_support()" ], "language": "python", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "[False False True False]\n" ] } ], "prompt_number": 22 }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "__Is that it?__" ] }, { "cell_type": "heading", "level": 2, "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Scikit-learn API: the Predictor" ] }, { "cell_type": "code", "collapsed": false, "input": [ "model = LogisticRegression()\n", "model.fit(X, y)\n", "\n", "X_test = [[ 5.006, 3.418, 1.464, 0.244], [ 5.936, 2.77 , 4.26 , 1.326]]\n", "\n", "model.predict(X_test)" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 23, "text": [ "array([0, 1])" ] } ], "prompt_number": 23 }, { "cell_type": "code", "collapsed": false, "input": [ "print model.predict_proba(X_test)" ], "language": "python", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "[[ 0.97741151 0.02258849]\n", " [ 0.01544837 0.98455163]]\n" ] } ], "prompt_number": 24 }, { "cell_type": "heading", "level": 2, "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Scikit-learn API: the Transformer" ] }, { "cell_type": "code", "collapsed": false, "input": [ "pca = PCA(n_components=2)\n", "pca.fit(X)\n", "\n", "print pca.transform(X)[0:5,:]" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "[[-1.65441341 -0.20660719]\n", " [-1.63509488 0.2988347 ]\n", " [-1.82037547 0.27141696]\n", " [-1.66207305 0.43021683]\n", " [-1.70358916 -0.21574051]]\n" ] } ], "prompt_number": 25 }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "__fit_transform__ is also available (and is sometimes faster)." 
] }, { "cell_type": "code", "collapsed": false, "input": [ "pca = PCA(n_components=2)\n", "print pca.fit_transform(X)[0:5,:]" ], "language": "python", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "[[-1.65441341 -0.20660719]\n", " [-1.63509488 0.2988347 ]\n", " [-1.82037547 0.27141696]\n", " [-1.66207305 0.43021683]\n", " [-1.70358916 -0.21574051]]\n" ] } ], "prompt_number": 54 }, { "cell_type": "code", "collapsed": false, "input": [ "kbest = SelectKBest(k = 1)\n", "kbest.fit(X, y)\n", "\n", "print kbest.transform(X)[0:5,:]" ], "language": "python", "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "[[ 1.4]\n", " [ 1.4]\n", " [ 1.3]\n", " [ 1.5]\n", " [ 1.4]]\n" ] } ], "prompt_number": 26 }, { "cell_type": "heading", "level": 2, "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Scikit-learn API: the Model" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.cross_validation import KFold\n", "from numpy import arange\n", "from random import shuffle\n", "from sklearn.dummy import DummyClassifier" ], "language": "python", "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "prompt_number": 27 }, { "cell_type": "code", "collapsed": false, "input": [ "model = DummyClassifier()\n", "model.fit(X, y)\n", "\n", "model.score(X, y)" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 86, "text": [ "0.48999999999999999" ] } ], "prompt_number": 86 }, { "cell_type": "heading", "level": 1, "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Building Pipelines" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.pipeline import Pipeline" ], "language": "python", "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "prompt_number": 87 }, { "cell_type": "code", "collapsed": false, "input": [ "pipe = Pipeline([\n", " (\"select\", SelectKBest(k = 3)),\n", " (\"pca\", PCA(n_components = 1)),\n", " (\"classify\", LogisticRegression())\n", " ])\n", "\n", "pipe.fit(X, y)\n", "\n", "pipe.predict(X)" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 55, "text": [ "array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", " 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n", " 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\n", " 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\n", " 1, 1, 1, 1, 1, 1, 1, 1])" ] } ], "prompt_number": 55 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Intermediate steps of the pipeline must be __Estimators__ and __Transformers__.\n", "\n", "The final estimator needs only to be an __Estimator__." 
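] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As an added illustration (not in the original talk): a pipeline takes on the interface of its final step. With a classifier last, the pipeline above acts as a Predictor; if the last step is a Transformer such as PCA, the whole pipeline can be used as a Transformer." ] }, { "cell_type": "code", "collapsed": false, "input": [ "# Added sketch: with a Transformer (PCA) as the final step, the whole\n", "# pipeline behaves like a Transformer, so transform() is available.\n", "reduce_pipe = Pipeline([\n", "    ('select', SelectKBest(k = 3)),\n", "    ('pca', PCA(n_components = 1))\n", "])\n", "\n", "reduce_pipe.fit(X, y)\n", "print reduce_pipe.transform(X)[0:5, :]" ], "language": "python", "metadata": {}, "outputs": [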
] }, { "cell_type": "heading", "level": 2, "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Text Pipeline" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.datasets import fetch_20newsgroups\n", "from sklearn.feature_extraction.text import CountVectorizer\n", "from sklearn.feature_extraction.text import TfidfTransformer\n", "from sklearn.linear_model import SGDClassifier\n" ], "language": "python", "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "prompt_number": 78 }, { "cell_type": "code", "collapsed": false, "input": [ "news = fetch_20newsgroups()\n", "data = news.data\n", "category = news.target" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 71 }, { "cell_type": "code", "collapsed": false, "input": [ "len(data)" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 72, "text": [ "11314" ] } ], "prompt_number": 72 }, { "cell_type": "code", "collapsed": false, "input": [ "print \" \".join(news.target_names)" ], "language": "python", "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "alt.atheism comp.graphics comp.os.ms-windows.misc comp.sys.ibm.pc.hardware comp.sys.mac.hardware comp.windows.x misc.forsale rec.autos rec.motorcycles rec.sport.baseball rec.sport.hockey sci.crypt sci.electronics sci.med sci.space soc.religion.christian talk.politics.guns talk.politics.mideast talk.politics.misc talk.religion.misc\n" ] } ], "prompt_number": 92 }, { "cell_type": "code", "collapsed": false, "input": [ "print data[8]" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "From: holmes7000@iscsvax.uni.edu\n", "Subject: WIn 3.0 ICON HELP PLEASE!\n", "Organization: University of Northern Iowa\n", "Lines: 10\n", "\n", "I have win 3.0 and downloaded several icons and BMP's but I can't figure out\n", "how to change the \"wallpaper\" or use the icons. Any help would be appreciated.\n", "\n", "\n", "Thanx,\n", "\n", "-Brando\n", "\n", "PS Please E-mail me\n", "\n", "\n" ] } ], "prompt_number": 99 }, { "cell_type": "code", "collapsed": false, "input": [ "pipe = Pipeline([\n", " ('vect', CountVectorizer(max_features = 100)),\n", " ('tfidf', TfidfTransformer()),\n", " ('clf', SGDClassifier()),\n", "])\n", "\n", "pipe.fit(data, category)" ], "language": "python", "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 100, "text": [ "Pipeline(steps=[('vect', CountVectorizer(analyzer=u'word', binary=False, charset=None,\n", " charset_error=None, decode_error=u'strict',\n", " dtype=, encoding=u'utf-8', input=u'content',\n", " lowercase=True, max_df=1.0, max_features=100, min_df=1,\n", " ngram_range=(1, 1), prepr..., penalty='l2', power_t=0.5,\n", " random_state=None, shuffle=False, verbose=0, warm_start=False))])" ] } ], "prompt_number": 100 }, { "cell_type": "heading", "level": 2, "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Pandas Pipelines!" 
] }, { "cell_type": "code", "collapsed": false, "input": [ "import pandas as pd\n", "import numpy as np\n", "import sklearn.preprocessing, sklearn.decomposition, sklearn.linear_model, sklearn.pipeline, sklearn.metrics\n", "from sklearn_pandas import DataFrameMapper, cross_val_score" ], "language": "python", "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "prompt_number": 107 }, { "cell_type": "code", "collapsed": false, "input": [ "data = pd.DataFrame({\n", " 'pet': ['cat', 'dog', 'dog', 'fish', 'cat', 'dog', 'cat', 'fish'],\n", " 'children': [4., 6, 3, 3, 2, 3, 5, 4],\n", " 'salary': [90, 24, 44, 27, 32, 59, 36, 27]\n", "})" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 117 }, { "cell_type": "code", "collapsed": false, "input": [ "mapper = DataFrameMapper([\n", " ('pet', sklearn.preprocessing.LabelBinarizer()),\n", " ('children', sklearn.preprocessing.StandardScaler()),\n", " ('salary', None)\n", "])" ], "language": "python", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "prompt_number": 111 }, { "cell_type": "code", "collapsed": false, "input": [ "mapper.fit_transform(data)" ], "language": "python", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 113, "text": [ "array([[ 1. , 0. , 0. , 0.20851441, 90. ],\n", " [ 0. , 1. , 0. , 1.87662973, 24. ],\n", " [ 0. , 1. , 0. , -0.62554324, 44. ],\n", " [ 0. , 0. , 1. , -0.62554324, 27. ],\n", " [ 1. , 0. , 0. , -1.4596009 , 32. ],\n", " [ 0. , 1. , 0. , -0.62554324, 59. ],\n", " [ 1. , 0. , 0. , 1.04257207, 36. ],\n", " [ 0. , 0. , 1. , 0.20851441, 27. ]])" ] } ], "prompt_number": 113 }, { "cell_type": "code", "collapsed": false, "input": [ "mapper = DataFrameMapper([\n", " ('pet', sklearn.preprocessing.LabelBinarizer()),\n", " ('children', sklearn.preprocessing.StandardScaler()),\n", " ('salary', None)\n", "])\n", "\n", "pipe = Pipeline([\n", " (\"mapper\", mapper),\n", " (\"pca\", PCA(n_components=2))\n", "])\n", "pipe.fit_transform(data) # 'data' is a data frame, not a numpy array!" ], "language": "python", "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 157, "text": [ "array([[ -4.76269151e+01, 4.25991055e-01],\n", " [ 1.83856756e+01, 1.86178138e+00],\n", " [ -1.62747544e+00, -5.06199939e-01],\n", " [ 1.53796381e+01, -8.10331853e-01],\n", " [ 1.03575109e+01, -1.52528125e+00],\n", " [ -1.66260441e+01, -4.27845667e-01],\n", " [ 6.37295205e+00, 9.68066902e-01],\n", " [ 1.53846579e+01, 1.38193738e-02]])" ] } ], "prompt_number": 157 }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Pandas pipelines require [sklearn-pandas](https://github.com/paulgb/sklearn-pandas) module by [@paulgb](http://www.twitter.com/paulgb)." 
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Also by Paul:\n", " \n", "[![](facebook_map.png)](https://www.facebook.com/note.php?note_id=469716398919)" ] }, { "cell_type": "heading", "level": 1, "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Model Evaluation and Selection" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.grid_search import GridSearchCV, RandomizedSearchCV\n", "from sklearn import datasets\n", "from sklearn.ensemble import RandomForestClassifier" ], "language": "python", "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "prompt_number": 212 }, { "cell_type": "code", "collapsed": false, "input": [ "# Create sample dataset\n", "X, y = datasets.make_classification(n_samples = 1000, n_features = 40, n_informative = 6, n_classes = 2)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 137 }, { "cell_type": "code", "collapsed": false, "input": [ "# Pipeline for Feature Selection to Random Forest\n", "pipe = Pipeline([\n", " (\"select\", SelectKBest()),\n", " (\"classify\", RandomForestClassifier())\n", "])" ], "language": "python", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "prompt_number": 162 }, { "cell_type": "code", "collapsed": false, "input": [ "# Define parameter grid\n", "param_grid = {\n", " \"select__k\" : [1, 6, 20, 40],\n", " \"classify__n_estimators\" : [1, 10, 100],\n", " \n", "}\n", "gs = GridSearchCV(pipe, param_grid)" ], "language": "python", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "prompt_number": 175 }, { "cell_type": "code", "collapsed": false, "input": [ "# Search over grid\n", "gs.fit(X, y)\n", "\n", "gs.best_params_" ], "language": "python", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 183, "text": [ "{'classify__n_estimators': 10, 'select__k': 6}" ] } ], "prompt_number": 183 }, { "cell_type": "code", "collapsed": false, "input": [ "print gs.best_estimator_.predict(X.mean(axis = 0))" ], "language": "python", "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "[1]\n" ] } ], "prompt_number": 192 }, { "cell_type": "heading", "level": 2, "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Curse of Dimensionality" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Search space grows exponentially with number of parameters." 
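] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To make that concrete, the short added illustration below (not in the original talk) just counts candidate settings: the 4 x 3 grid above has 12 combinations, while the exhaustive grid tried later in this notebook has 39 x 99 = 3861, and __GridSearchCV__ refits the pipeline for every one of them on each cross-validation fold." ] }, { "cell_type": "code", "collapsed": false, "input": [ "# Added illustration: count the parameter combinations GridSearchCV must try\n", "# (each one is fit once per cross-validation fold).\n", "small_grid = {'select__k': [1, 6, 20, 40], 'classify__n_estimators': [1, 10, 100]}\n", "large_grid = {'select__k': range(1, 40), 'classify__n_estimators': range(1, 100)}\n", "\n", "def n_candidates(grid):\n", "    total = 1\n", "    for values in grid.values():\n", "        total *= len(values)\n", "    return total\n", "\n", "print n_candidates(small_grid), n_candidates(large_grid)" ], "language": "python", "metadata": {}, "outputs": [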
] }, { "cell_type": "code", "collapsed": false, "input": [ "gs.grid_scores_" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 185, "text": [ "[mean: 0.72600, std: 0.02773, params: {'classify__n_estimators': 1, 'select__k': 1},\n", " mean: 0.78200, std: 0.00631, params: {'classify__n_estimators': 1, 'select__k': 6},\n", " mean: 0.74400, std: 0.02580, params: {'classify__n_estimators': 1, 'select__k': 20},\n", " mean: 0.70600, std: 0.05772, params: {'classify__n_estimators': 1, 'select__k': 40},\n", " mean: 0.73800, std: 0.02372, params: {'classify__n_estimators': 10, 'select__k': 1},\n", " mean: 0.90000, std: 0.01539, params: {'classify__n_estimators': 10, 'select__k': 6},\n", " mean: 0.86400, std: 0.01047, params: {'classify__n_estimators': 10, 'select__k': 20},\n", " mean: 0.81200, std: 0.02247, params: {'classify__n_estimators': 10, 'select__k': 40},\n", " mean: 0.73600, std: 0.02229, params: {'classify__n_estimators': 100, 'select__k': 1},\n", " mean: 0.89200, std: 0.01520, params: {'classify__n_estimators': 100, 'select__k': 6},\n", " mean: 0.89000, std: 0.01769, params: {'classify__n_estimators': 100, 'select__k': 20},\n", " mean: 0.87000, std: 0.02366, params: {'classify__n_estimators': 100, 'select__k': 40}]" ] } ], "prompt_number": 185 }, { "cell_type": "heading", "level": 2, "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Curse of Dimensionality: Parallelization " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__GridSearch__ on 1 core:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "param_grid = {\n", " \"select__k\" : [1, 5, 10, 15, 20, 25, 30, 35, 40],\n", " \"classify__n_estimators\" : [1, 5, 10, 25, 50, 75, 100],\n", " \n", "}\n", "gs = GridSearchCV(pipe, param_grid, n_jobs = 1)\n", "%timeit gs.fit(X, y)\n", "print" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "1 loops, best of 3: 6.31 s per loop\n", "\n" ] } ], "prompt_number": 207 }, { "cell_type": "markdown", "metadata": {}, "source": [ "__GridSearch__ on 7 cores:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "gs = GridSearchCV(pipe, param_grid, n_jobs = 7)\n", "%timeit gs.fit(X, y)\n", "print" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "1 loops, best of 3: 1.81 s per loop\n", "\n" ] } ], "prompt_number": 208 }, { "cell_type": "heading", "level": 2, "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Curse of Dimensionality: Randomization" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__GridSearchCV__ might be very slow:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "param_grid = {\n", " \"select__k\" : range(1, 40),\n", " \"classify__n_estimators\" : range(1, 100), \n", "}" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 220 }, { "cell_type": "code", "collapsed": false, "input": [ "gs = GridSearchCV(pipe, param_grid, n_jobs = 7)\n", "gs.fit(X, y)\n", "print \"Best CV score\", gs.best_score_\n", "print gs.best_params_" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "0.924\n", "{'classify__n_estimators': 59, 'select__k': 9}\n" ] } ], "prompt_number": 221 }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can instead randomly sample from the parameter space with __RandomizedSearchCV__:" ] }, { "cell_type": "code", "collapsed": 
false, "input": [ "gs = RandomizedSearchCV(pipe, param_grid, n_jobs = 7, n_iter = 10)\n", "gs.fit(X, y)\n", "print \"Best CV score\", gs.best_score_\n", "print gs.best_params_" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "0.894\n", "{'classify__n_estimators': 58, 'select__k': 7}\n" ] } ], "prompt_number": 229 }, { "cell_type": "heading", "level": 1, "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Conclusions" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "-" } }, "source": [ "* Scikit-learn has an elegant API and is built in a beautiful language." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "* Pipelines allow complex chains of operations to be easily computed.\n", " * This helps ensure correct cross validation (see _Elements of Statistical Learning_ 7.10.2). " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "* Pipelines combined with grid search permit easy model selection." ] } ], "metadata": {} } ] }