{ "metadata": { "name": "" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "Scikit-learn\n", "=============\n", "Machine learning for the masses!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What?\n", "------\n", "**Algorithms**\n", "\n", "- Classification\n", "- Regression\n", "- Dimensionality reduction\n", "- Manifold learning\n", "- Feature selection\n", "- Semisupervised learning\n", "- Clustering\n", "\n", "**Tools**\n", "\n", "- Preprocessing\n", "- Pipelining\n", "- Model evaluation\n", "- Model selection\n", "\n", "**Features**\n", "\n", "- Sparse data\n", "- Dense data\n", "- Multi-core\n", "- Out-of-core\n", "- Cloud tools available" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Get some data to play with" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.datasets import load_digits\n", "digits = load_digits()\n", "digits.keys()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "digits.images.shape" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "digits.data.shape" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "digits.target.shape" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Data is always a numpy array (or sparse matrix) of shape (n_samples, n_features)**" ] }, { "cell_type": "code", "collapsed": false, "input": [ "print(digits.images[0])\n" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "print(digits.target[0])" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "import matplotlib.pyplot as plt\n", "%matplotlib inline\n", "\n", "plt.matshow(digits.images[0], cmap=plt.cm.Greys)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Split the data to get going" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.cross_validation import train_test_split\n", "X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Really Simple API\n", "-------------------\n", "1) Import your model class" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.svm import LinearSVC" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "2) Instantiate an object and set the parameters" ] }, { "cell_type": "code", "collapsed": false, "input": [ "svm = LinearSVC(C=0.1)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "3) Fit the model" ] }, { "cell_type": "code", "collapsed": false, "input": [ "svm.fit(X_train, y_train)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "4) Apply / evaluate" ] }, { "cell_type": "code", "collapsed": false, "input": [ "print(svm.predict(X_train))\n", "print(y_train)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "svm.score(X_train, y_train)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "svm.score(X_test, y_test)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And again\n", "---------" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.ensemble import RandomForestClassifier" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "rf = RandomForestClassifier(n_estimators=50)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "rf.fit(X_train, y_train)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "rf.score(X_train, y_train)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "rf.score(X_test, y_test)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "#%load from github" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "#!/usr/bin/python\n", "\n", "\"\"\"\n", "=====================\n", "Classifier comparison\n", "=====================\n", "\n", "A comparison of a several classifiers in scikit-learn on synthetic datasets.\n", "The point of this example is to illustrate the nature of decision boundaries\n", "of different classifiers.\n", "This should be taken with a grain of salt, as the intuition conveyed by\n", "these examples does not necessarily carry over to real datasets.\n", "\n", "Particularly in high-dimensional spaces, data can more easily be separated\n", "linearly and the simplicity of classifiers such as naive Bayes and linear SVMs\n", "might lead to better generalization than is achieved by other classifiers.\n", "\n", "The plots show training points in solid colors and testing points\n", "semi-transparent. The lower right shows the classification accuracy on the test\n", "set.\n", "\"\"\"\n", "print(__doc__)\n", "\n", "\n", "# Code source: Ga\u00ebl Varoquaux\n", "# Andreas M\u00fcller\n", "# Modified for documentation by Jaques Grobler\n", "# License: BSD 3 clause\n", "\n", "import numpy as np\n", "import pylab as pl\n", "from matplotlib.colors import ListedColormap\n", "from sklearn.cross_validation import train_test_split\n", "from sklearn.preprocessing import StandardScaler\n", "from sklearn.datasets import make_moons, make_circles, make_classification\n", "from sklearn.neighbors import KNeighborsClassifier\n", "from sklearn.svm import SVC\n", "from sklearn.tree import DecisionTreeClassifier\n", "from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier\n", "from sklearn.naive_bayes import GaussianNB\n", "from sklearn.lda import LDA\n", "from sklearn.qda import QDA\n", "\n", "h = .02 # step size in the mesh\n", "\n", "names = [\"Nearest Neighbors\", \"Linear SVM\", \"RBF SVM\", \"Decision Tree\",\n", " \"Random Forest\", \"AdaBoost\", \"Naive Bayes\", \"LDA\", \"QDA\"]\n", "classifiers = [\n", " KNeighborsClassifier(3),\n", " SVC(kernel=\"linear\", C=0.025),\n", " SVC(gamma=2, C=1),\n", " DecisionTreeClassifier(max_depth=5),\n", " RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1),\n", " AdaBoostClassifier(),\n", " GaussianNB(),\n", " LDA(),\n", " QDA()]\n", "\n", "X, y = make_classification(n_features=2, n_redundant=0, n_informative=2,\n", " random_state=1, n_clusters_per_class=1)\n", "rng = np.random.RandomState(2)\n", "X += 2 * rng.uniform(size=X.shape)\n", "linearly_separable = (X, y)\n", "\n", "datasets = [make_moons(noise=0.3, random_state=0),\n", " make_circles(noise=0.2, factor=0.5, random_state=1),\n", " linearly_separable\n", " ]\n", "\n", "figure = pl.figure(figsize=(27, 9))\n", "i = 1\n", "# iterate over datasets\n", "for ds in datasets:\n", " # preprocess dataset, split into training and test part\n", " X, y = ds\n", " X = StandardScaler().fit_transform(X)\n", " X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.4)\n", "\n", " x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5\n", " y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5\n", " xx, yy = np.meshgrid(np.arange(x_min, x_max, h),\n", " np.arange(y_min, y_max, h))\n", "\n", " # just plot the dataset first\n", " cm = pl.cm.RdBu\n", " cm_bright = ListedColormap(['#FF0000', '#0000FF'])\n", " ax = pl.subplot(len(datasets), len(classifiers) + 1, i)\n", " # Plot the training points\n", " ax.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap=cm_bright)\n", " # and testing points\n", " ax.scatter(X_test[:, 0], X_test[:, 1], c=y_test, cmap=cm_bright, alpha=0.6)\n", " ax.set_xlim(xx.min(), xx.max())\n", " ax.set_ylim(yy.min(), yy.max())\n", " ax.set_xticks(())\n", " ax.set_yticks(())\n", " i += 1\n", "\n", " # iterate over classifiers\n", " for name, clf in zip(names, classifiers):\n", " ax = pl.subplot(len(datasets), len(classifiers) + 1, i)\n", " clf.fit(X_train, y_train)\n", " score = clf.score(X_test, y_test)\n", "\n", " # Plot the decision boundary. For that, we will assign a color to each\n", " # point in the mesh [x_min, m_max]x[y_min, y_max].\n", " if hasattr(clf, \"decision_function\"):\n", " Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()])\n", " else:\n", " Z = clf.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 1]\n", "\n", " # Put the result into a color plot\n", " Z = Z.reshape(xx.shape)\n", " ax.contourf(xx, yy, Z, cmap=cm, alpha=.8)\n", "\n", " # Plot also the training points\n", " ax.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap=cm_bright)\n", " # and testing points\n", " ax.scatter(X_test[:, 0], X_test[:, 1], c=y_test, cmap=cm_bright,\n", " alpha=0.6)\n", "\n", " ax.set_xlim(xx.min(), xx.max())\n", " ax.set_ylim(yy.min(), yy.max())\n", " ax.set_xticks(())\n", " ax.set_yticks(())\n", " ax.set_title(name)\n", " ax.text(xx.max() - .3, yy.min() + .3, ('%.2f' % score).lstrip('0'),\n", " size=15, horizontalalignment='right')\n", " i += 1\n", "\n", "figure.subplots_adjust(left=.02, right=.98)\n", "pl.show()\n" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Tasks\n", "======\n", "1. Train a KNeighbors classifier on the digits dataset and compute the test accuracy.\n", "2. Visualize some of the mistakes." ] } ], "metadata": {} } ] }