{ "metadata": { "name": "", "signature": "sha256:8b06016cb00e9a03216952e7cfa1e96f17427e3aa4bed3679536c77259d3544a" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "code", "collapsed": false, "input": [ "# Enable matplotlib inline mode, so figures will appear in the notebook\n", "%matplotlib inline" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "What is machine learning?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Machine Learning is about building programs with **tunable parameters** \n", "that are adjusted automatically so as to improve\n", "their behavior by **adapting to previously seen data.**\n", "\n", "Machine Learning can be considered a subfield of **Artificial Intelligence** since its\n", "algorithms can be seen as building blocks to make computers learn to behave more\n", "intelligently by **generalizing** rather than just storing and retrieving data items\n", "like a database system would do.\n", "\n", "We'll take a look at two very simple machine learning tasks here.\n", "The first is a **classification** task: the figure shows a\n", "collection of two-dimensional data, colored according to two different class\n", "labels. A classification algorithm may be used to draw a dividing boundary\n", "between the two clusters of points:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# Import the example plot from the figures directory\n", "from figures import plot_sgd_separator\n", "plot_sgd_separator()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By drawing this separating line, we have learned a model which can **generalize** to new\n", "data: if you were to drop another, unlabeled point onto the plane, this algorithm\n", "could now **predict** whether it's a blue or a red point." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The next simple task we'll look at is a **regression** task: a simple best-fit line\n", "to a set of data:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from figures import plot_linear_regression\n", "plot_linear_regression()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Again, this is an example of fitting a model to data, but our focus here is that the model can make\n", "generalizations about new data. The model has been **learned** from the training\n", "data, and can be used to predict the result for test data:\n", "here, we might be given an x-value, and the model would\n", "allow us to predict the y-value." ] }, { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Data in scikit-learn" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Most machine learning algorithms implemented in scikit-learn expect data to be stored in a\n", "**two-dimensional array or matrix**. The arrays can be\n", "either ``numpy`` arrays, or in some cases ``scipy.sparse`` matrices.\n", "The shape of the array is expected to be `[n_samples, n_features]`\n", "\n", "- **n_samples:** The number of samples: each sample is an item to process (e.g. classify).\n", " A sample can be a document, a picture, a sound, a video, an astronomical object,\n", " a row in a database or CSV file,\n", " or whatever you can describe with a fixed set of quantitative traits.\n", "- **n_features:** The number of features or distinct traits that can be used to describe each\n", " item in a quantitative manner. Features are generally real-valued, but may be boolean or\n", " discrete-valued in some cases.\n", "\n", "The number of features must be fixed in advance. However, it can be very high dimensional\n", "(e.g. millions of features) with most of them being zeros for a given sample. 
This is a case\n", "where `scipy.sparse` matrices can be useful, in that they are\n", "much more memory-efficient than numpy arrays." ] }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "A Simple Example: the Iris Dataset" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As an example of a simple dataset, we're going to take a look at the iris data stored by scikit-learn.\n", "The data consists of measurements of three different species of irises,\n", "which we can picture here:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from IPython.display import Image, display\n", "display(Image(filename='figures/iris_setosa.jpg'))\n", "print(\"Iris Setosa\\n\")\n", "\n", "display(Image(filename='figures/iris_versicolor.jpg'))\n", "print(\"Iris Versicolor\\n\")\n", "\n", "display(Image(filename='figures/iris_virginica.jpg'))\n", "print(\"Iris Virginica\")" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "Quick Question:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**If we want to design an algorithm to recognize iris species, what might the data be?**\n", "\n", "Remember: we need a 2D array of size `[n_samples x n_features]`.\n", "\n", "- What would the `n_samples` refer to?\n", "\n", "- What might the `n_features` refer to?\n", "\n", "Remember that there must be a **fixed** number of features for each sample, and feature\n", "number ``i`` must be a similar kind of quantity for each sample." ] }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "Loading the Iris Data with Scikit-learn" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Scikit-learn has a very straightforward set of data on these iris species. The data consist of\n", "the following:\n", "\n", "- Features in the Iris dataset:\n", "\n", " 1. sepal length in cm\n", " 2. sepal width in cm\n", " 3. 
petal length in cm\n", " 4. petal width in cm\n", "\n", "- Target classes to predict:\n", "\n", " 1. Iris Setosa\n", " 2. Iris Versicolour\n", " 3. Iris Virginica" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "``scikit-learn`` embeds a copy of the iris CSV file along with a helper function to load it into numpy arrays:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.datasets import load_iris\n", "iris = load_iris()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The features of each sample flower are stored in the ``data`` attribute of the dataset:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "n_samples, n_features = iris.data.shape\n", "print(n_samples)\n", "print(n_features)\n", "print(iris.data[0])" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The information about the class of each sample is stored in the ``target`` attribute of the dataset:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "print(iris.data.shape)\n", "print(iris.target.shape)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "print(iris.target)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The names of the classes are stored in the last attribute, namely ``target_names``:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "print(iris.target_names)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This data is four dimensional, but we can visualize two of the dimensions\n", "at a time using a simple scatter-plot. 
Inline mode is already enabled, so we only need to\n", "import matplotlib's pyplot module:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from matplotlib import pyplot as plt" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "x_index = 0\n", "y_index = 1\n", "\n", "# this formatter will label the colorbar with the correct target names\n", "formatter = plt.FuncFormatter(lambda i, *args: iris.target_names[int(i)])\n", "\n", "plt.scatter(iris.data[:, x_index], iris.data[:, y_index], c=iris.target)\n", "plt.colorbar(ticks=[0, 1, 2], format=formatter)\n", "plt.xlabel(iris.feature_names[x_index])\n", "plt.ylabel(iris.feature_names[y_index])" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Exercise**: Can you choose x_index and y_index to find a plot where it is easier to separate the different classes of irises?" ] }, { "cell_type": "code", "collapsed": false, "input": [], "language": "python", "metadata": {}, "outputs": [] } ], "metadata": {} } ] }