{ "metadata": { "name": "01_setup_and_introduction" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Introduction to Machine Learning in Python" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this section we'll go through some preliminary topics, as well as some of the\n", "requirements for this tutorial.\n", "\n", "By the end of this section you should:\n", "\n", "- Know what sort of tasks qualify as Machine Learning problems.\n", "- Have seen some simple examples of machine learning.\n", "- Know the basics of creating and manipulating numpy arrays.\n", "- Know the basics of scatter plots in matplotlib." ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "What is Machine Learning?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this section we will begin to explore the basic principles of machine learning.\n", "Machine Learning is about building programs with **tunable parameters** (typically an\n", "array of floating-point values) that are adjusted automatically so as to improve\n", "their behavior by **adapting to previously seen data.**\n", "\n", "Machine Learning can be considered a subfield of **Artificial Intelligence** since its\n", "algorithms can be seen as building blocks to make computers learn to behave more\n", "intelligently by somehow **generalizing** rather than just storing and retrieving data items\n", "like a database system would do.\n", "\n", "We'll take a look at two very simple machine learning tasks here.\n", "The first is a **classification** task: the figure below shows a\n", "collection of two-dimensional data, colored according to two different class\n", "labels. 
A classification algorithm may be used to draw a dividing boundary\n", "between the two clusters of points:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# Start pylab inline mode, so figures will appear in the notebook\n", "%pylab inline" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "# Import the example plot from the figures directory\n", "from figures import plot_sgd_separator\n", "plot_sgd_separator()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This may seem like a trivial task, but it is a simple version of a very important concept.\n", "By drawing this separating line, we have learned a model which can **generalize** to new\n", "data: if you were to drop another, unlabeled point onto the plane, this algorithm\n", "could now **predict** whether it's a blue or a red point.\n", "\n", "If you'd like to see the source code used to generate this, you can either open the\n", "code in the `figures` directory, or you can load the code using the `%load` magic command:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "%load figures/sgd_separator.py" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The next simple task we'll look at is a **regression** task: fitting a simple best-fit line\n", "to a set of data:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from figures import plot_linear_regression\n", "plot_linear_regression()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Again, this is an example of fitting a model to data, such that the model can make\n", "generalizations about new data. 
The model has been **learned** from the training\n", "data, and can be used to predict the result for test data:\n", "here, we might be given an x-value, and the model would\n", "allow us to predict the y-value. Again, this might seem like a trivial problem,\n", "but it is a basic example of a type of operation that is fundamental to\n", "machine learning tasks." ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Numpy Arrays" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Manipulating `numpy` arrays is an important part of doing machine learning\n", "(or, really, any type of scientific computation) in Python. This will likely\n", "be a review for most: we'll quickly go through some of the most important features." ] }, { "cell_type": "code", "collapsed": false, "input": [ "import numpy as np\n", "\n", "# Generating a random array\n", "X = np.random.random((3, 5))  # a 3 x 5 array\n", "\n", "print(X)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "# Accessing elements\n", "\n", "# get a single element\n", "print(X[0, 0])\n", "\n", "# get a row\n", "print(X[1])\n", "\n", "# get a column\n", "print(X[:, 1])" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "# Transposing an array\n", "print(X.T)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "# Turning a row vector into a column vector: start with a 1D array\n", "y = np.linspace(0, 12, 5)\n", "print(y)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "# make into a column vector\n", "print(y[:, np.newaxis])" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There is much, much more to know, but these few operations are fundamental to what we'll\n", "do during this tutorial." 
] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Scipy Sparse Matrices" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We won't make very much use of these in this tutorial, but sparse matrices are very useful\n", "in some situations. For example, in some machine learning tasks, especially those associated\n", "with textual analysis, the data may be mostly zeros. Storing all these zeros explicitly is very\n", "inefficient. We can create and manipulate sparse matrices as follows:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from scipy import sparse\n", "\n", "# Create a random array (most entries are set to zero in the next cell)\n", "X = np.random.random((10, 5))\n", "print(X)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "# set the majority of elements to zero\n", "X[X < 0.7] = 0\n", "print(X)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "# turn X into a csr (Compressed-Sparse-Row) matrix\n", "X_csr = sparse.csr_matrix(X)\n", "print(X_csr)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "# convert the sparse matrix to a dense array\n", "print(X_csr.toarray())" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The CSR representation can be very efficient for computations, but it is not\n", "as good for adding elements. 
For that, the LIL (List of Lists) representation\n", "is better:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# Create an empty LIL matrix and add some items\n", "X_lil = sparse.lil_matrix((5, 5))\n", "\n", "for i, j in np.random.randint(0, 5, (15, 2)):\n", "    X_lil[i, j] = i + j\n", "\n", "print(X_lil)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "print(X_lil.toarray())" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Often, once an LIL matrix is created, it is useful to convert it to CSR format\n", "(many scikit-learn algorithms require CSR or CSC format)." ] }, { "cell_type": "code", "collapsed": false, "input": [ "print(X_lil.tocsr())" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are several other sparse formats that can be useful for various problems:\n", "\n", "- `CSC` (compressed sparse column)\n", "- `BSR` (block sparse row)\n", "- `COO` (coordinate)\n", "- `DIA` (diagonal)\n", "- `DOK` (dictionary of keys)\n", "\n", "The ``scipy.sparse`` submodule also has a lot of functions for sparse matrices\n", "including linear algebra, sparse solvers, graph algorithms, and much more." ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Matplotlib" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Another important part of machine learning is visualization of data. The most common\n", "tool for this in Python is `matplotlib`. It is an extremely flexible package;\n", "we will go over only some basics here.\n", "\n", "First, something specific to the IPython notebook. We can turn on the \"IPython inline\" mode,\n", "which will make plots show up inline in the notebook." 
] }, { "cell_type": "code", "collapsed": false, "input": [ "%pylab inline" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "# When you run %pylab inline, the following import happens automatically\n", "# but it's often useful to be explicit:\n", "import matplotlib.pyplot as plt" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "# plotting a line\n", "x = np.linspace(0, 10, 100)\n", "plt.plot(x, np.sin(x))" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "# scatter-plot points\n", "x = np.random.normal(size=500)\n", "y = np.random.normal(size=500)\n", "plt.scatter(x, y)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "# showing images\n", "x = np.linspace(1, 12, 100)\n", "y = x[:, np.newaxis]\n", "\n", "# broadcasting the row x against the column y gives a 2D array\n", "im = y * np.sin(x) * np.cos(y)\n", "print(im.shape)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "# imshow - note that origin is at the top-left by default!\n", "plt.imshow(im)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "# Contour plot - note that origin here is at the bottom-left by default!\n", "plt.contour(im)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "# 3D plotting\n", "# importing Axes3D registers the '3d' projection (required by older matplotlib versions)\n", "from mpl_toolkits.mplot3d import Axes3D\n", "ax = plt.axes(projection='3d')\n", "xgrid, ygrid = np.meshgrid(x, y.ravel())\n", "ax.plot_surface(xgrid, ygrid, im, cmap=plt.cm.jet, cstride=2, rstride=2, linewidth=0)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are many, many more plot types available. 
One useful way to explore these is by\n", "looking at the matplotlib gallery: http://matplotlib.org/gallery.html\n", "\n", "You can test these examples out easily in the notebook: simply copy the ``Source Code``\n", "link on each page, and put it in a notebook using the ``%load`` magic.\n", "For example:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "%load http://matplotlib.org/mpl_examples/pylab_examples/ellipse_collection.py" ], "language": "python", "metadata": {}, "outputs": [] } ], "metadata": {} } ] }