{ "metadata": { "name": "03_basic_principles_of_machine_learning" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "code", "collapsed": false, "input": [ "%pylab inline\n", "import pylab as pl\n", "import numpy as np" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Basic principles of machine learning" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here is where we start diving into the field of machine learning.\n", "\n", "By the end of this section you will\n", "\n", "- Know the basic categories of supervised learning, including classification and regression problems.\n", "- Know the basic categories of unsupervised learning, including dimensionality reduction and clustering.\n", "- Know the basic syntax of the Scikit-learn **estimator** interface.\n", "- Know how features are extracted from real-world data.\n", "\n", "In addition, we will go over several basic tools within scikit-learn which can be used to accomplish the above tasks." ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Problem setting" ] }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "A simple definition of machine learning" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Machine Learning (ML) is about building programs with **tunable parameters** (typically an\n", "array of floating point values) that are adjusted automatically so as to improve\n", "their behavior by **adapting to previously seen data.**\n", "\n", "In most ML applications, the data is in a 2D array of shape ``[n_samples x n_features]``,\n", "where the number of features is the same for each object, and each feature column refers\n", "to a related piece of information about each sample.\n", "\n", "Machine learning can be broken into two broad regimes:\n", "*supervised learning* and *unsupervised learning*.\n", "We\u2019ll introduce these concepts here, and discuss them in more detail below." ] }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "Introducing the scikit-learn estimator object" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Every algorithm is exposed in scikit-learn via an ''Estimator'' object. For instance a linear regression is:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.linear_model import LinearRegression" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Estimator parameters**: All the parameters of an estimator can be set when it is instantiated:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "model = LinearRegression(normalize=True)\n", "print model.normalize" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "print model" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Estimated parameters**: When data is fitted with an estimator, parameters are estimated from the data at hand. All the estimated parameters are attributes of the estimator object ending by an underscore:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "x = np.array([0, 1, 2])\n", "y = np.array([0, 1, 2])" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "_ = pl.plot(x, y, marker='o')" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "X = x[:, np.newaxis] # The input data for sklearn is 2D: (samples == 3 x features == 1)\n", "X" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "model.fit(X, y) \n", "model.coef_" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "Supervised Learning: Classification and regression" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In **Supervised Learning**, we have a dataset consisting of both features and labels.\n", "The task is to construct an estimator which is able to predict the label of an object\n", "given the set of features. A relatively simple example is predicting the species of \n", "iris given a set of measurements of its flower. This is a relatively simple task. \n", "Some more complicated examples are:\n", "\n", "- given a multicolor image of an object through a telescope, determine\n", " whether that object is a star, a quasar, or a galaxy.\n", "- given a photograph of a person, identify the person in the photo.\n", "- given a list of movies a person has watched and their personal rating\n", " of the movie, recommend a list of movies they would like\n", " (So-called *recommender systems*: a famous example is the [Netflix Prize](http://en.wikipedia.org/wiki/Netflix_prize)).\n", "\n", "What these tasks have in common is that there is one or more unknown\n", "quantities associated with the object which needs to be determined from other\n", "observed quantities.\n", "\n", "Supervised learning is further broken down into two categories, **classification** and **regression**.\n", "In classification, the label is discrete, while in regression, the label is continuous. For example,\n", "in astronomy, the task of determining whether an object is a star, a galaxy, or a quasar is a\n", "classification problem: the label is from three distinct categories. On the other hand, we might\n", "wish to estimate the age of an object based on such observations: this would be a regression problem,\n", "because the label (age) is a continuous quantity." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Classification**: K nearest neighbors (kNN) is one of the simplest learning strategies: given a new, unknown observation, look up in your reference database which ones have the closest features and assign the predominant class.\n", "\n", "Let's try it out on our iris classification problem:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn import neighbors, datasets\n", "iris = datasets.load_iris()\n", "X, y = iris.data, iris.target\n", "knn = neighbors.KNeighborsClassifier(n_neighbors=1)\n", "knn.fit(X, y)\n", "# What kind of iris has 3cm x 5cm sepal and 4cm x 2cm petal?\n", "print iris.target_names[knn.predict([[3, 5, 4, 2]])]" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "# A plot of the sepal space and the prediction of the KNN\n", "from helpers import plot_iris_knn\n", "plot_iris_knn()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Exercise**: Now use as an estimator on the same problem: sklearn.svm.SVC.\n", "\n", "> Note that you don't have to know what it is do use it.\n", "\n", "> If you finish early, do the same plot as above." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Regression**: The simplest possible regression setting is the linear regression one:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# Create some simple data\n", "import numpy as np\n", "np.random.seed(0)\n", "X = np.random.random(size=(20, 1))\n", "y = 3 * X.squeeze() + 2 + np.random.normal(size=20)\n", "\n", "# Fit a linear regression to it\n", "from sklearn.linear_model import LinearRegression\n", "model = LinearRegression(fit_intercept=True)\n", "model.fit(X, y)\n", "print \"Model coefficient: %.5f, and intercept: %.5f\" % (model.coef_, model.intercept_)\n", "\n", "# Plot the data and the model prediction\n", "X_test = np.linspace(0, 1, 100)[:, np.newaxis]\n", "y_test = model.predict(X_test)\n", "import pylab as pl\n", "pl.plot(X.squeeze(), y, 'o')\n", "pl.plot(X_test.squeeze(), y_test)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "Unsupervised Learning" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Unsupervised Learning** addresses a different sort of problem. Here the data has no labels,\n", "and we are interested in finding similarities between the objects in question. In a sense,\n", "you can think of unsupervised learning as a means of discovering labels from the data itself.\n", "Unsupervised learning comprises tasks such as *dimensionality reduction*, *clustering*, and\n", "*density estimation*. For example, in the iris data discussed above, we can used unsupervised\n", "methods to determine combinations of the measurements which best display the structure of the\n", "data. As we\u2019ll see below, such a projection of the data can be used to visualize the\n", "four-dimensional dataset in two dimensions. Some more involved unsupervised learning problems are:\n", "\n", "- given detailed observations of distant galaxies, determine which features or combinations of\n", " features summarize best the information.\n", "- given a mixture of two sound sources (for example, a person talking over some music),\n", " separate the two (this is called the [blind source separation](http://en.wikipedia.org/wiki/Blind_signal_separation) problem).\n", "- given a video, isolate a moving object and categorize in relation to other moving objects which have been seen.\n", "\n", "Sometimes the two may even be combined: e.g. Unsupervised learning can be used to find useful\n", "features in heterogeneous data, and then these features can be used within a supervised\n", "framework." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**An example PCA for visualization** Principle Component Analysis (PCA) is a dimension reduction technique that can find the combinations of variables that explain the most variance.\n", "\n", "Consider the iris dataset. It cannot be visualized in a single 2D plot, as it has 4 features. We are going to extract 2 combinations of sepal and petal dimensions to visualize it:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "X, y = iris.data, iris.target\n", "from sklearn.decomposition import PCA\n", "pca = PCA(n_components=2)\n", "pca.fit(X)\n", "X_reduced = pca.transform(X)\n", "print \"Reduced dataset shape:\", X_reduced.shape\n", "\n", "import pylab as pl\n", "pl.scatter(X_reduced[:, 0], X_reduced[:, 1], c=y)\n", "\n", "print \"Meaning of the 2 components:\"\n", "for component in pca.components_:\n", " print \" + \".join(\"%.3f x %s\" % (value, name)\n", " for value, name in zip(component, iris.feature_names))" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Clustering**: Clustering groups together observations that are homogeneous with respect to a given criterion, finding ''clusters'' in the data.\n", "\n", "Note that these clusters will uncover relevent hidden structure of the data only if the criterion used highlights it." ] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.cluster import KMeans\n", "k_means = KMeans(n_clusters=3, random_state=0) # Fixing the RNG in kmeans\n", "k_means.fit(X_reduced)\n", "y_pred = k_means.predict(X_reduced)\n", "\n", "pl.scatter(X_reduced[:, 0], X_reduced[:, 1], c=y_pred)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "A recap on Scikit-learn's estimator interface" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Scikit-learn strives to have a uniform interface across all methods,\n", "and we\u2019ll see examples of these below. Given a scikit-learn *estimator*\n", "object named `model`, the following methods are available:\n", "\n", "- Available in **all Estimators**\n", " + `model.fit()` : fit training data. For supervised learning applications,\n", " this accepts two arguments: the data `X` and the labels `y` (e.g. `model.fit(X, y)`).\n", " For unsupervised learning applications, this accepts only a single argument,\n", " the data `X` (e.g. `model.fit(X)`).\n", "- Available in **supervised estimators**\n", " + `model.predict()` : given a trained model, predict the label of a new set of data.\n", " This method accepts one argument, the new data `X_new` (e.g. `model.predict(X_new)`),\n", " and returns the learned label for each object in the array.\n", " + `model.predict_proba()` : For classification problems, some estimators also provide\n", " this method, which returns the probability that a new observation has each categorical label.\n", " In this case, the label with the highest probability is returned by `model.predict()`.\n", " + `model.score()` : for classification or regression problems, most (all?) estimators implement\n", " a score method. Scores are between 0 and 1, with a larger score indicating a better fit.\n", "- Available in **unsupervised estimators**\n", " + `model.transform()` : given an unsupervised model, transform new data into the new basis.\n", " This also accepts one argument `X_new`, and returns the new representation of the data based\n", " on the unsupervised model.\n", " + `model.fit_transform()` : some estimators implement this method,\n", " which more efficiently performs a fit and a transform on the same input data." ] }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "Regularization: what it is and why it is necessary" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Train errors** Suppose you are using a 1-nearest neighbor estimator. How many errors do you expect on your train set?\n", "\n", "This tells us that:\n", "- Train set error is not a good measurement of prediction performance. You need to leave out a test set.\n", "- In general, we should accept errors on the train set." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**An example of regularization** The core idea behind regularization is that we are going to prefer models that are simpler, for a certain definition of ''simpler'', even if they lead to more errors on the train set.\n", "\n", "As an example, let's generate with a 9th order polynomial." ] }, { "cell_type": "code", "collapsed": false, "input": [ "rng = np.random.RandomState(0)\n", "x = 2 * rng.rand(100) - 1\n", "\n", "f = lambda t: 1.2 * t ** 2 + .1 * t ** 3 - .4 * t ** 5 - .5 * t ** 9\n", "y = f(x) + .4 * rng.normal(size=100)\n", "\n", "pl.figure()\n", "pl.scatter(x, y, s=4)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And now, let's fit a 4th order and a 9th order polynomial to the data. For this we need to engineer features: the n_th powers of x:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "x_test = np.linspace(-1, 1, 100)\n", "\n", "pl.figure()\n", "pl.scatter(x, y, s=4)\n", "\n", "X = np.array([x**i for i in range(5)]).T\n", "X_test = np.array([x_test**i for i in range(5)]).T\n", "order4 = LinearRegression()\n", "order4.fit(X, y)\n", "pl.plot(x_test, order4.predict(X_test), label='4th order')\n", "\n", "X = np.array([x**i for i in range(10)]).T\n", "X_test = np.array([x_test**i for i in range(10)]).T\n", "order9 = LinearRegression()\n", "order9.fit(X, y)\n", "pl.plot(x_test, order9.predict(X_test), label='9th order')\n", "\n", "pl.legend(loc='best')\n", "pl.axis('tight')\n", "pl.title('Fitting a 4th and a 9th order polynomial')" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With your naked eyes, which model do you prefer, the 4th order one, or the 9th order one?\n", "\n", "Let's look at the ground truth:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "pl.figure()\n", "pl.scatter(x, y, s=4)\n", "pl.plot(x_test, f(x_test), label=\"truth\")\n", "pl.axis('tight')\n", "pl.title('Ground truth (9th order polynomial)')" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Regularization is ubiquitous in machine learning. Most scikit-learn estimators have a parameter to tune the amount of regularization. For instance, with k-NN, it is 'k', the number of nearest neighbors used to make the decision. k=1 amounts to no regularization: 0 error on the training set, whereas large k will push toward smoother decision boundaries in the feature space." ] }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "Exercise: Interactive Demo on linearly separable data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Run the **svm_gui.py** file in the repository: https://github.com/jakevdp/sklearn_scipy2013" ] }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "Flow chart: how do I choose what to do with my data set?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] } ], "metadata": {} } ] }