{ "metadata": { "name": "02A_representation_of_data" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Representation and Visualization of Data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Machine learning is about creating models from data: for that reason, we'll start by\n", "discussing how data can be represented in order to be understood by the computer. Along\n", "with this, we'll build on our matplotlib examples from the previous section and show some\n", "examples of how to visualize data.\n", "\n", "By the end of this section you should:\n", "\n", "- Know the internal data representation of scikit-learn.\n", "- Know how to use scikit-learn's dataset loaders to load example data.\n", "- Know how to turn image & text data into data matrices for learning.\n", "- Know how to use matplotlib to help visualize different types of data." ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Data in scikit-learn" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Data in scikit-learn, with very few exceptions, is assumed to be stored as a\n", "**two-dimensional array**, of size `[n_samples, n_features]`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Most machine learning algorithms implemented in scikit-learn expect data to be stored in a\n", "**two-dimensional array or matrix**. The arrays can be\n", "either ``numpy`` arrays, or in some cases ``scipy.sparse`` matrices.\n", "The size of the array is expected to be `[n_samples, n_features]`\n", "\n", "- **n_samples:** The number of samples: each sample is an item to process (e.g. 
classify).\n", " A sample can be a document, a picture, a sound, a video, an astronomical object,\n", " a row in a database or CSV file,\n", " or whatever you can describe with a fixed set of quantitative traits.\n", "- **n_features:** The number of features or distinct traits that can be used to describe each\n", " item in a quantitative manner. Features are generally real-valued, but may be boolean or\n", " discrete-valued in some cases.\n", "\n", "The number of features must be fixed in advance. However, it can be very high-dimensional\n", "(e.g. millions of features) with most of them being zeros for a given sample. This is a case\n", "where `scipy.sparse` matrices can be useful, in that they are\n", "much more memory-efficient than numpy arrays." ] }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "A Simple Example: the Iris Dataset" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As an example of a simple dataset, we're going to take a look at the iris data stored by scikit-learn.\n", "The data consists of measurements of three different species of irises. 
We can picture the three species here:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from IPython.core.display import Image, display\n", "display(Image(filename='figures/iris_setosa.jpg'))\n", "print \"Iris Setosa\\n\"\n", "\n", "display(Image(filename='figures/iris_versicolor.jpg'))\n", "print \"Iris Versicolor\\n\"\n", "\n", "display(Image(filename='figures/iris_virginica.jpg'))\n", "print \"Iris Virginica\"" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "Quick Question:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**If we want to design an algorithm to recognize iris species, what might the data be?**\n", "\n", "Remember: we need a 2D array of size `[n_samples, n_features]`.\n", "\n", "- What would the `n_samples` refer to?\n", "\n", "- What might the `n_features` refer to?\n", "\n", "Remember that there must be a **fixed** number of features for each sample, and feature\n", "number ``i`` must be a similar kind of quantity for each sample." ] }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "Loading the Iris Data with Scikit-learn" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Scikit-learn has a very straightforward set of data on these iris species. The data consist of\n", "the following:\n", "\n", "- Features in the Iris dataset:\n", "\n", " 1. sepal length in cm\n", " 2. sepal width in cm\n", " 3. petal length in cm\n", " 4. petal width in cm\n", "\n", "- Target classes to predict:\n", "\n", " 1. Iris Setosa\n", " 2. Iris Versicolour\n", " 3. 
Iris Virginica" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "``scikit-learn`` embeds a copy of the iris CSV file along with a helper function to load it into numpy arrays:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.datasets import load_iris\n", "iris = load_iris()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The resulting dataset is a ``Bunch`` object: you can see what's available using\n", "the method ``keys()``:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "iris.keys()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The features of each sample flower are stored in the ``data`` attribute of the dataset:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "n_samples, n_features = iris.data.shape\n", "print n_samples\n", "print n_features\n", "print iris.data[0]" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The information about the class of each sample is stored in the ``target`` attribute of the dataset:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "print iris.data.shape\n", "print iris.target.shape" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "print iris.target" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The names of the classes are stored in the ``target_names`` attribute:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "print iris.target_names" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This data is four-dimensional, but we can visualize two of the dimensions\n", "at a time using a simple scatter-plot. 
Again, we'll start by enabling\n", "pylab inline mode:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# note: this also imports numpy as np, imports matplotlib.pyplot as plt, and others\n", "%pylab inline" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "x_index = 0\n", "y_index = 1\n", "\n", "# this formatter will label the colorbar with the correct target names\n", "formatter = plt.FuncFormatter(lambda i, *args: iris.target_names[int(i)])\n", "\n", "plt.scatter(iris.data[:, x_index], iris.data[:, y_index], c=iris.target)\n", "plt.colorbar(ticks=[0, 1, 2], format=formatter)\n", "plt.xlabel(iris.feature_names[x_index])\n", "plt.ylabel(iris.feature_names[y_index])" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "Quick Exercise:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Change** `x_index` **and** `y_index` **in the above script\n", "and find a combination of two features\n", "which maximally separates the three classes.**\n", "\n", "This exercise is a preview of **dimensionality reduction**, which we'll see later." ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Other Available Data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Scikit-learn makes available a host of datasets for testing learning algorithms.\n", "They come in three flavors:\n", "\n", "- **Packaged Data:** these small datasets are packaged with the scikit-learn installation,\n", " and can be loaded using the tools in ``sklearn.datasets.load_*``\n", "- **Downloadable Data:** these larger datasets are available for download, and scikit-learn\n", " includes tools which streamline this process. These tools can be found in\n", " ``sklearn.datasets.fetch_*``\n", "- **Generated Data:** there are several datasets which are generated from models based on a\n", " random seed. 
These are available via the ``sklearn.datasets.make_*`` functions.\n", "\n", "You can explore the available dataset loaders, fetchers, and generators using IPython's\n", "tab-completion functionality. After importing the ``datasets`` submodule from ``sklearn``,\n", "type\n", "\n", "    datasets.load_\n", "\n", "or\n", "\n", "    datasets.fetch_\n", "\n", "or\n", "\n", "    datasets.make_\n", "\n", "to see a list of available functions." ] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn import datasets" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The data downloaded using the ``fetch_`` scripts are stored locally,\n", "within a subdirectory of your home directory.\n", "You can use the following to determine where it is:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.datasets import get_data_home\n", "get_data_home()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "!ls $HOME/scikit_learn_data/" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Be warned: many of these datasets are quite large, and can take a long time to download\n", "(especially on conference wifi)!\n", "\n", "If you start a download within the IPython notebook\n", "and you want to kill it, you can use IPython's \"kernel interrupt\" feature, available in the menu or using\n", "the shortcut ``Ctrl-m i``.\n", "\n", "You can press ``Ctrl-m h`` for a list of all IPython keyboard shortcuts." ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Loading Digits Data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we'll take a look at another dataset, one where we have to put a bit\n", "more thought into how to represent the data. 
We can explore the data in\n", "a manner similar to the above:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.datasets import load_digits\n", "digits = load_digits()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "digits.keys()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "n_samples, n_features = digits.data.shape\n", "print(n_samples, n_features)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "print digits.data[0]\n", "print digits.target" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The target here is just the digit represented by the data. The data is an array of\n", "length 64... but what does this data mean?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There's a clue in the fact that we have two versions of the data array:\n", "``data`` and ``images``. Let's take a look at them:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "print digits.data.shape\n", "print digits.images.shape" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see that they're related by a simple reshaping:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "print np.all(digits.images.reshape((1797, 64)) == digits.data)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*Aside... numpy and memory efficiency:*\n", "\n", "*You might wonder whether duplicating the data is a problem. In this case, the memory\n", "overhead is very small. 
Even though the arrays are different shapes, they point to the\n", "same memory block, which we can see by doing a bit of digging into the guts of numpy:*" ] }, { "cell_type": "code", "collapsed": false, "input": [ "print digits.data.__array_interface__['data']\n", "print digits.images.__array_interface__['data']" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*The long integer here is a memory address: the fact that the two are the same tells\n", "us that the two arrays are looking at the same data.*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's visualize the data. It's little bit more involved than the simple scatter-plot\n", "we used above, but we can do it rather quickly." ] }, { "cell_type": "code", "collapsed": false, "input": [ "# set up the figure\n", "fig = plt.figure(figsize=(6, 6)) # figure size in inches\n", "fig.subplots_adjust(left=0, right=1, bottom=0, top=1, hspace=0.05, wspace=0.05)\n", "\n", "# plot the digits: each image is 8x8 pixels\n", "for i in range(64):\n", " ax = fig.add_subplot(8, 8, i + 1, xticks=[], yticks=[])\n", " ax.imshow(digits.images[i], cmap=plt.cm.binary, interpolation='nearest')\n", " \n", " # label the image with the target value\n", " ax.text(0, 7, str(digits.target[i]))" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We see now what the features mean. Each feature is a real-valued quantity representing the\n", "darkness of a pixel in an 8x8 image of a hand-written digit.\n", "\n", "Even though each sample has data that is inherently two-dimensional, the data matrix flattens\n", "this 2D data into a **single vector**, which can be contained in one **row** of the data matrix." 
] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Generated Data: the S-Curve" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One dataset often used as an example of a simple nonlinear dataset is the S-curve:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.datasets import make_s_curve\n", "data, colors = make_s_curve(n_samples=1000)\n", "print(data.shape)\n", "print(colors.shape)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "from mpl_toolkits.mplot3d import Axes3D\n", "ax = plt.axes(projection='3d')\n", "ax.scatter(data[:, 0], data[:, 1], data[:, 2], c=colors)\n", "ax.view_init(10, -60)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This example is typically used with an unsupervised learning method called Locally\n", "Linear Embedding. We'll explore unsupervised learning in detail later in the tutorial." ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Exercise: working with the faces dataset" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we'll take a moment for you to explore the datasets yourself.\n", "Later on we'll be using the Olivetti faces dataset.\n", "Take a moment to fetch the data (about 1.4MB), and visualize the faces.\n", "You can copy the code used to visualize the digits above, and modify it for this data."
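] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you get stuck, here is one possible sketch of a reusable grid-plotting helper (the function name and the random placeholder data are our own inventions; substitute ``faces.images`` once you have fetched the dataset):" ] }, { "cell_type": "code", "collapsed": false, "input": [ "def plot_image_grid(images, n_rows=4, n_cols=4, cmap=plt.cm.bone):\n", "    # plot the first n_rows * n_cols images in a tight grid\n", "    fig = plt.figure(figsize=(6, 6))\n", "    fig.subplots_adjust(left=0, right=1, bottom=0, top=1, hspace=0.05, wspace=0.05)\n", "    for i in range(n_rows * n_cols):\n", "        ax = fig.add_subplot(n_rows, n_cols, i + 1, xticks=[], yticks=[])\n", "        ax.imshow(images[i], cmap=cmap, interpolation='nearest')\n", "    return fig\n", "\n", "# demo with random placeholder data in place of the real faces\n", "plot_image_grid(np.random.random((16, 64, 64)))" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Try writing your own version first; the helper above is only one way to structure it."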
] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.datasets import fetch_olivetti_faces" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "# fetch the faces data\n" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "# Use a script like above to plot the faces image data.\n", "# hint: plt.cm.bone is a good colormap for this data\n" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "Solution:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "%load solutions/02A_faces_plot.py" ], "language": "python", "metadata": {}, "outputs": [] } ], "metadata": {} } ] }