{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Andreas Mueller, Kyle Kastner, Sebastian Raschka \n", "last updated: 2016-06-23 \n", "\n", "CPython 3.5.1\n", "IPython 4.2.0\n", "\n", "numpy 1.11.0\n", "scipy 0.17.1\n", "matplotlib 1.5.1\n", "pillow 3.2.0\n", "scikit-learn 0.17.1\n" ] } ], "source": [ "%load_ext watermark\n", "%watermark -d -u -a 'Andreas Mueller, Kyle Kastner, Sebastian Raschka' -v -p numpy,scipy,matplotlib,pillow,scikit-learn" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "%matplotlib inline\n", "import matplotlib.pyplot as plt\n", "import numpy as np" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# SciPy 2016 Scikit-learn Tutorial" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Supervised Learning Part 1 -- Classification" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To visualize the workings of machine learning algorithms, it is often helpful to study two-dimensional or one-dimensional data, that is data with only one or two features. While in practice, datasets usually have many more features, it is hard to plot high-dimensional data in on two-dimensional screens.\n", "\n", "We will illustrate some very simple examples before we move on to more \"real world\" data sets." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "First, we will look at a two class classification problem in two dimensions. We use the synthetic data generated by the ``make_blobs`` function." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.datasets import make_blobs\n", "\n", "X, y = make_blobs(centers=2, random_state=0)\n", "\n", "print('X ~ n_samples x n_features:', X.shape)\n", "print('y ~ n_samples:', y.shape)\n", "\n", "print('\\nFirst 5 samples:\\n', X[:5, :])\n", "print('\\nFirst 5 labels:', y[:5])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As the data is two-dimensional, we can plot each sample as a point in a two-dimensional coordinate system, with the first feature being the x-axis and the second feature being the y-axis." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "plt.scatter(X[y == 0, 0], X[y == 0, 1], \n", " c='blue', s=40, label='0')\n", "plt.scatter(X[y == 1, 0], X[y == 1, 1], \n", " c='red', s=40, label='1', marker='s')\n", "\n", "plt.xlabel('first feature')\n", "plt.ylabel('second feature')\n", "plt.legend(loc='upper right');" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Classification is a supervised task, and since we are interested in its performance on unseen data, we split our data into two parts:\n", "\n", "1. a training set that the learning algorithm uses to fit the model\n", "2. a test set to evaluate the generalization performance of the model\n", "\n", "The ``train_test_split`` function from the ``cross_validation`` module does that for us -- we will use it to split a dataset into 75% training data and 25% test data.\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.cross_validation import train_test_split\n", "\n", "X_train, X_test, y_train, y_test = train_test_split(X, y,\n", " test_size=0.25,\n", " random_state=1234,\n", " stratify=y)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### The scikit-learn estimator API\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Every algorithm is exposed in scikit-learn via an ''Estimator'' object. (All models in scikit-learn have a very consistent interface). For instance, we first import the logistic regression class." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.linear_model import LogisticRegression" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we instantiate the estimator object." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "classifier = LogisticRegression()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "X_train.shape" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "y_train.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To built the model from our data, that is to learn how to classify new points, we call the ``fit`` function with the training data, and the corresponding training labels (the desired output for the training data point):" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "classifier.fit(X_train, y_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(Some estimator methods such as `fit` return `self` by default. Thus, after executing the code snippet above, you will see the default parameters of this particular instance of `LogisticRegression`. Another way of retrieving the estimator's ininitialization parameters is to execute `classifier.get_params()`, which returns a parameter dictionary.)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can then apply the model to unseen data and use the model to predict the estimated outcome using the ``predict`` method:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "prediction = classifier.predict(X_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can compare these against the true labels:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "print(prediction)\n", "print(y_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can evaluate our classifier quantitatively by measuring what fraction of predictions is correct. This is called **accuracy**:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "np.mean(prediction == y_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There is also a convenience function , ``score``, that all scikit-learn classifiers have to compute this directly from the test data:\n", " " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "classifier.score(X_test, y_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It is often helpful to compare the generalization performance (on the test set) to the performance on the training set:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "classifier.score(X_train, y_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "LogisticRegression is a so-called linear model,\n", "that means it will create a decision that is linear in the input space. In 2d, this simply means it finds a line to separate the blue from the red:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from figures import plot_2d_separator\n", "\n", "plt.scatter(X[y == 0, 0], X[y == 0, 1], \n", " c='blue', s=40, label='0')\n", "plt.scatter(X[y == 1, 0], X[y == 1, 1], \n", " c='red', s=40, label='1', marker='s')\n", "\n", "plt.xlabel(\"first feature\")\n", "plt.ylabel(\"second feature\")\n", "plot_2d_separator(classifier, X)\n", "plt.legend(loc='upper right');" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Estimated parameters**: All the estimated model parameters are attributes of the estimator object ending by an underscore. Here, these are the coefficients and the offset of the line:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "print(classifier.coef_)\n", "print(classifier.intercept_)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Another classifier: K Nearest Neighbors\n", "------------------------------------------------\n", "Another popular and easy to understand classifier is K nearest neighbors (kNN). It has one of the simplest learning strategies: given a new, unknown observation, look up in your reference database which ones have the closest features and assign the predominant class.\n", "\n", "The interface is exactly the same as for ``LogisticRegression above``." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.neighbors import KNeighborsClassifier" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This time we set a parameter of the KNeighborsClassifier to tell it we only want to look at one nearest neighbor:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "knn = KNeighborsClassifier(n_neighbors=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We fit the model with out training data" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "knn.fit(X_train, y_train)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "plt.scatter(X[y == 0, 0], X[y == 0, 1], \n", " c='blue', s=40, label='0')\n", "plt.scatter(X[y == 1, 0], X[y == 1, 1], \n", " c='red', s=40, label='1', marker='s')\n", "\n", "plt.xlabel(\"first feature\")\n", "plt.ylabel(\"second feature\")\n", "plot_2d_separator(knn, X)\n", "plt.legend(loc='upper right');" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "knn.score(X_test, y_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Exercise\n", "=========\n", "Apply the KNeighborsClassifier to the ``iris`` dataset. Play with different values of the ``n_neighbors`` and observe how training and test score change." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.1" } }, "nbformat": 4, "nbformat_minor": 0 }