{ "metadata": { "name": "03_machine_learning_101" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Machine Learning 101: General Concepts" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here is where we start diving into the field of machine learning.\n", "\n", "By the end of this section you will\n", "\n", "- Know the basic categories of supervised learning, including classification and regression problems.\n", "- Know the basic categories of unsupervised learning, including dimensionality reduction and clustering.\n", "- Know the basic syntax of the Scikit-learn **estimator** interface.\n", "- Know how features are extracted from real-world data.\n", "\n", "In addition, we will go over several basic tools within scikit-learn which can be used to accomplish the above tasks." ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Quick Review:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We saw before the basic definition of Machine Learning:\n", "\n", "Machine Learning (ML) is about building programs with **tunable parameters** (typically an\n", "array of floating point values) that are adjusted automatically so as to improve\n", "their behavior by **adapting to previously seen data.**\n", "\n", "In most ML applications, the data is in a 2D array of shape ``[n_samples x n_features]``,\n", "where the number of features is the same for each object, and each feature column refers\n", "to a related piece of information about each sample." ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Supervised Learning, Unsupervised Learning, and Scikit-learn Estimators" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Machine learning can be broken into two broad regimes:\n", "*supervised learning* and *unsupervised learning*.\n", "We\u2019ll introduce these concepts here, and discuss them in more detail below." ] }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "Supervised Learning" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In **Supervised Learning**, we have a dataset consisting of both features and labels.\n", "The task is to construct an estimator which is able to predict the label of an object\n", "given the set of features. A relatively simple example is predicting the species of \n", "iris given a set of measurements of its flower. This is a relatively simple task. \n", "Some more complicated examples are:\n", "\n", "- given a multicolor image of an object through a telescope, determine\n", " whether that object is a star, a quasar, or a galaxy.\n", "- given a photograph of a person, identify the person in the photo.\n", "- given a list of movies a person has watched and their personal rating\n", " of the movie, recommend a list of movies they would like\n", " (So-called *recommender systems*: a famous example is the [Netflix Prize](http://en.wikipedia.org/wiki/Netflix_prize)).\n", "\n", "What these tasks have in common is that there is one or more unknown\n", "quantities associated with the object which needs to be determined from other\n", "observed quantities.\n", "\n", "Supervised learning is further broken down into two categories, **classification** and **regression**.\n", "In classification, the label is discrete, while in regression, the label is continuous. For example,\n", "in astronomy, the task of determining whether an object is a star, a galaxy, or a quasar is a\n", "classification problem: the label is from three distinct categories. On the other hand, we might\n", "wish to estimate the age of an object based on such observations: this would be a regression problem,\n", "because the label (age) is a continuous quantity." ] }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "Unsupervised Learning" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Unsupervised Learning** addresses a different sort of problem. Here the data has no labels,\n", "and we are interested in finding similarities between the objects in question. In a sense,\n", "you can think of unsupervised learning as a means of discovering labels from the data itself.\n", "Unsupervised learning comprises tasks such as *dimensionality reduction*, *clustering*, and\n", "*density estimation*. For example, in the iris data discussed above, we can used unsupervised\n", "methods to determine combinations of the measurements which best display the structure of the\n", "data. As we\u2019ll see below, such a projection of the data can be used to visualize the\n", "four-dimensional dataset in two dimensions. Some more involved unsupervised learning problems are:\n", "\n", "- given detailed observations of distant galaxies, determine which features or combinations of\n", " features are most important in distinguishing between galaxies.\n", "- given a mixture of two sound sources (for example, a person talking over some music),\n", " separate the two (this is called the [blind source separation](http://en.wikipedia.org/wiki/Blind_signal_separation) problem).\n", "- given a video, isolate a moving object and categorize in relation to other moving objects which have been seen.\n", "\n", "Sometimes the two may even be combined: e.g. Unsupervised learning can be used to find useful\n", "features in heterogeneous data, and then these features can be used within a supervised\n", "framework." ] }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "Scikit-learn's interface" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In scikit-learn, almost all operations are done through an estimator object.\n", "\n", "For example, a linear regression estimator can be instantiated as follows:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.linear_model import LinearRegression\n", "model = LinearRegression()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "print model" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Scikit-learn strives to have a uniform interface across all methods,\n", "and we\u2019ll see examples of these below. Given a scikit-learn *estimator*\n", "object named `model`, the following methods are available:\n", "\n", "- *Available in all Estimators*\n", " + `model.fit()` : fit training data. For supervised learning applications,\n", " this accepts two arguments: the data `X` and the labels `y` (e.g. `model.fit(X, y)`).\n", " For unsupervised learning applications, this accepts only a single argument,\n", " the data `X` (e.g. `model.fit(X)`).\n", "- *Available in supervised estimators*\n", " + `model.predict()` : given a trained model, predict the label of a new set of data.\n", " This method accepts one argument, the new data `X_new` (e.g. `model.predict(X_new)`),\n", " and returns the learned label for each object in the array.\n", " + `model.predict_proba()` : For classification problems, some estimators also provide\n", " this method, which returns the probability that a new observation has each categorical label.\n", " In this case, the label with the highest probability is returned by `model.predict()`.\n", " + `model.score()` : for classification or regression problems, most (all?) estimators implement\n", " a score method. Scores are between 0 and 1, with a larger score indicating a better fit.\n", "- *Available in unsupervised estimators*\n", " + `model.transform()` : given an unsupervised model, transform new data into the new basis.\n", " This also accepts one argument `X_new`, and returns the new representation of the data based\n", " on the unsupervised model.\n", " + `model.fit_transform()` : some estimators implement this method,\n", " which more efficiently performs a fit and a transform on the same input data." ] }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "Diagrams of Supervised and Unsupervised Learning" ] }, { "cell_type": "code", "collapsed": false, "input": [ "%pylab inline" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "from figures import plot_supervised_chart, plot_unsupervised_chart" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "plot_supervised_chart(annotate=False)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "plot_supervised_chart(annotate=True)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "plot_unsupervised_chart()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*(Aside: these charts are generated in matplotlib. You can see the code using the %load magic)*" ] }, { "cell_type": "code", "collapsed": false, "input": [ "%load figures/ML_flow_chart.py" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 12 }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Returning to Feature Extraction" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Recall that previously, we have seen two types of features:\n", "\n", "- The iris dataset had measured features: (the lengths of petals, sepals, etc.)\n", "- The digits and faces datasets were pixel values (the images were pre-aligned)\n", "\n", "How might we handle other types of features?" ] }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "Categorical Features" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Sometimes we have categorical features: for example, imagine the dataset included\n", "the colors:\n", "\n", " color in [red, blue, purple]\n", "\n", "Often it is best for categorical features to have their own dimenions:\n", "\n", "The enriched iris feature set would hence be in this case:\n", "\n", "- sepal length in cm\n", "- sepal width in cm\n", "- petal length in cm\n", "- petal width in cm\n", "- color#purple (1.0 or 0.0)\n", "- color#blue (1.0 or 0.0)\n", "- color#red (1.0 or 0.0)" ] }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "Unstructured data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Most often, data does not come in a nice, structured, CSV file where every\n", "column measures the same thing. In this case, we must be more imaginitive\n", "in how we extract features.\n", "\n", "Here is an overview of strategies to turn unstructed data items into arrays of numerical features." ] }, { "cell_type": "heading", "level": 4, "metadata": {}, "source": [ "Text documents:\t" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Count the frequency of each word or pair of consecutive words in each document. This approach is called **Bag of Words**\n", "\n", "*Note:* we include other file formats such as HTML and PDF in this category:\n", "an ad-hoc preprocessing step is required to extract the plain text in\n", "UTF-8 encoding for instance.\n", "\n", "For a tutorial on text processing in scikit-learn, see\n", "http://scikit-learn.github.com/scikit-learn-tutorial/working_with_text_data.html" ] }, { "cell_type": "heading", "level": 4, "metadata": {}, "source": [ "Images:\t" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Rescale the picture to a fixed size and take all the raw pixels values (with or without luminosity normalization)\n", "\n", "- Take some transformation of the signal (gradients in each pixel, wavelets transforms...)\n", "\n", "- Compute the Euclidean, Manhattan or cosine similarities of the sample to a set reference prototype images aranged\n", " in a code book. The code book may have been previously extracted from the same dataset using an unsupervised\n", " learning algorithm on the raw pixel signal. Each feature value is the distance to one element of the code book.\n", "\n", "- Perform local feature extraction: split the picture into small regions and perform feature extraction locally in each area,\n", " Then combine all the features of the individual areas into a single array." ] }, { "cell_type": "heading", "level": 4, "metadata": {}, "source": [ "Sounds:\t" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Same type of strategies as for images; the difference its it's a 1D rather than 2D space.\n", "\n", "For more information on feature extraction in scikit-learn, see\n", "http://scikit-learn.org/stable/modules/feature_extraction.html" ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Looking Ahead" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the next couple notebooks, we will explore a simple examples of classification,\n", "regression, dimensionality reduction, and clustering using the datasets we've\n", "seen." ] } ], "metadata": {} } ] }