{ "metadata": { "name": "", "signature": "sha256:56baeb5ce435707b27330ea7f9e40794333e8d7dbb5a6a93b3bdf781a3f281ff" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "[![](https://bytebucket.org/davis68/resources/raw/f7c98d2b95e961fae257707e22a58fa1a2c36bec/logos/baseline_cse_wdmk.png?token=be4cc41d4b2afe594f5b1570a3c5aad96a65f0d6)](http://cse.illinois.edu/)" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from __future__ import print_function\n", "\n", "import numpy as np\n", "import matplotlib as mpl\n", "import matplotlib.pyplot as plt\n", "import matplotlib.cm as cm\n", "%matplotlib inline" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "##Machine Learning \n", "_Machine learning_ is the science of getting computers to act without telling them to do so. This is where we have got our self-driving cars, effective web search, speech recognition etc. \n", "\n", "\n", "\n", "Machine learning is expanding its reach to various domains. \n", "\n", "**Spam Detection** : When gmail detects your spam mail automatically, it is only the result of machine learning techniques.\n", "\n", "** Credit Card Frauds** : Identifying 'unusual' activities on a credit card is often a machine learning problem involving anomaly detection.\n", "\n", "** Face Detection ** : When Facebook automatically recognizes faces of your friends in a photo, a machine learning process is what's running in the background.\n", "\n", "** Product Recommendation ** : Whether Netflix was suggesting you watch House of Cards or Amazon was insisting you finally buy those Bose headphones, it's not magic, but machine learning.\n", "\n", "** Customer segmentation ** : Using the data for usage during a trial period of a product, identifying the users who will go onto subscribe for the paid version is a learning problem.\n", "\n", "** Medical Diagnosis ** : Using a database of symptoms and treatments of patients, a popular machine learning problem is to predict if a patient has a particular illness." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "Today's lesson will walk you through some of machine learning algorithms and demonstrate some working problems in the real world.\n", "\n", "## Contents\n", "- [What is Machine Learning?](#intro)\n", "- [Types of Machine Learning Algorithms](#algo)\n", "- [scikit-learn](#sklearn)\n", " - [Representing Data](#data)\n", " - [Loading a dataset](#load)\n", "- [Supervised Learning](#super)\n", " - [Classification](#class)\n", " - [KNN](#knn)\n", " - [OLS](#ols)\n", " - [SVM](#svm)\n", "- [Unsupervised Learning](#unsuper)\n", " - [K-Means](#kmeans)\n", " - [PCA](#pca)\n", " - [Neural Networks](#ann)\n", "- [More examples](#realworld)\n", " - [Color space segmentation](#colorspace)\n", " - [Image Compression](#image)\n", "- [References](#refs)\n", "- [Credits](#credits)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "### Types of Machine Learning algorithms\n", "\n", "All of the problems listed above, are not exactly similar to each other. For instance the customer segmentation problem aims to *cluster* the customers into two sets: the ones who will pay for the version and the ones who won't. The face recognition problem aims to *classify* a face from amongst your friends list. 
There are broadly two types of machine learning algorithms.\n", "\n", "- **Supervised learning**: For example, when you rate a movie on Netflix, the suggestion which follows is predicted from a database of your past ratings (known as the training data). When the quantity to predict is continuous (say, a stock price), the problem falls under *regression*. When the target is a class label (as in the Netflix problem), the learning problem is called a *classification* problem.\n", "\n", "- **Unsupervised learning**: In the case of unsupervised learning, there is no labeled data set which can be used for making further predictions. The learning algorithm tries to find patterns or associations in the given data set. Identifying clusters in the data (as in the customer segmentation problem) and reducing the dimensionality of the data fall in the unsupervised category.\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "<a id='sklearn'></a>\n", "\n", "### Machine Learning in Python\n", "\n", "Before 2007, Python had no built-in machine learning package. David Cournapeau (a core developer of `numpy`/`scipy`) developed **scikit-learn** as part of a Google Summer of Code project. Later, Matthieu Brucher joined in, and the first release was made in January 2010. The project now has about 30 contributors and many sponsors, including Google and the Python Software Foundation.\n", "\n", "`scikit-learn` provides a range of supervised and unsupervised learning algorithms via a Python interface. Some of its dependencies: `scipy`, `numpy`, `pandas`, `matplotlib`.\n", "\n", "Today's lesson will go over some of these algorithms. We start at the most basic level, by first learning how to represent data in `scikit-learn`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "<a id='data'></a>\n", "\n", "### scikit-learn: Representing Data\n", "Most machine learning algorithms implemented in scikit-learn expect data to be stored in a two-dimensional array or matrix. The arrays can be either numpy arrays, or in some cases scipy.sparse matrices. The size of the array is expected to be `[n_samples, n_features]`.\n", "\n", "- `n_samples`: Each sample is an item to process (e.g. classify). A sample can be a document, a picture, a sound, a video, an astronomical object, a row in a database or CSV file, or whatever you can describe with a fixed set of quantitative traits.\n", "\n", "- `n_features`: The number of features or distinct traits that can be used to describe each item in a quantitative manner. Features are generally real-valued, but may be boolean or discrete-valued in some cases. The number of features must be fixed in advance. However, it can be very high-dimensional (e.g. millions of features), with most of them being zero for a given sample. This is a case where scipy.sparse matrices can be useful, in that they are much more memory-efficient than numpy arrays." ] },
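{ "cell_type": "markdown", "metadata": {}, "source": [ "As a minimal sketch of this layout (the toy numbers below are made up for illustration), here are three samples described by two features each, first as a dense `numpy` array and then as a `scipy.sparse` matrix:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import numpy as np\n", "from scipy import sparse\n", "\n", "# Three samples, two features each: shape is [n_samples, n_features].\n", "X_dense = np.array([[5.1, 3.5],\n", "                    [4.9, 3.0],\n", "                    [6.2, 3.4]])\n", "print(X_dense.shape)\n", "\n", "# The same data as a sparse matrix; only nonzero entries are stored.\n", "X_sparse = sparse.csr_matrix(X_dense)\n", "print(X_sparse.nnz) # number of stored (nonzero) entries" ], "language": "python", "metadata": {}, "outputs": [] },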
{ "cell_type": "markdown", "metadata": {}, "source": [ "<a id='load'></a>\n", "\n", "### Loading a dataset\n", "\n", "Before we begin with machine learning, let's start at the very beginning by understanding how to load a dataset. We are going to start with the `iris` dataset stored in `scikit-learn`.\n", "\n", "**Iris Setosa** \n", "**Iris Versicolor** \n", "**Iris Virginica** \n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Question**\n", "If we want to design an algorithm to recognize iris species, what might the data be?\n", "(Remember: we need a 2D array of size [`n_samples` x `n_features`].)\n", "\n", "- What would the `n_samples` refer to?\n", "- What might the `n_features` refer to?\n", "\n", "Remember that there must be a fixed number of features for each sample, and feature number `i` must be a similar kind of quantity for each sample." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`scikit-learn` has a very straightforward set of data on these iris species. The data consists of the following:\n", "\n", "Features in the Iris dataset:\n", "\n", "- sepal length in cm\n", "- sepal width in cm\n", "- petal length in cm\n", "- petal width in cm\n", "\n", "Target classes to predict:\n", "\n", "- Iris Setosa\n", "- Iris Versicolour\n", "- Iris Virginica\n", "\n", "`scikit-learn` embeds a copy of the iris CSV file along with a helper function to load it into numpy arrays:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn import datasets\n", "iris = datasets.load_iris() # 150 observations of the iris flower: sepal length, sepal width, petal length, and petal width" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The data is stored in the `.data` attribute:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "#iris.data.shape\n", "#iris.target_names\n", "iris.feature_names" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This data is four-dimensional, but we can visualize two of the dimensions at a time using a simple scatter plot:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "x_index = 0\n", "y_index = 1\n", "\n", "# this formatter will label the colorbar with the correct target names\n", "formatter = plt.FuncFormatter(lambda i, *args: iris.target_names[int(i)])\n", "\n", "plt.scatter(iris.data[:, x_index], iris.data[:, y_index],\n", "            c=iris.target)\n", "plt.colorbar(ticks=[0, 1, 2], format=formatter)\n", "plt.xlabel(iris.feature_names[x_index])\n", "plt.ylabel(iris.feature_names[y_index])\n", "plt.show()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Exercise**\n", "Change `x_index` and `y_index` in the above script and find a combination of two parameters which maximally separates the three classes.\n", "This exercise is a preview of dimensionality reduction, which we'll see later." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Other datasets**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "- Packaged Data: these small datasets are packaged with the scikit-learn installation, and can be loaded using the functions `sklearn.datasets.load_*`\n", "- Downloadable Data: these larger datasets are available for download, and scikit-learn includes tools which streamline this process. These tools can be found in `sklearn.datasets.fetch_*`\n", "- Generated Data: there are several datasets which are generated from models based on a random seed. These are available via the functions `sklearn.datasets.make_*`; a small example follows." ] },
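{ "cell_type": "markdown", "metadata": {}, "source": [ "For instance, here is a minimal sketch using one of the generators (the parameter values are arbitrary choices for illustration): `make_circles` produces two concentric rings of labeled points, which is handy for testing non-linear classifiers." ] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn import datasets\n", "\n", "# Generate 200 labeled points arranged in two concentric circles.\n", "X, y = datasets.make_circles(n_samples=200, noise=0.05, factor=0.5)\n", "print(X.shape) # (200, 2) -- [n_samples, n_features]\n", "\n", "plt.scatter(X[:, 0], X[:, 1], c=y)\n", "plt.show()" ], "language": "python", "metadata": {}, "outputs": [] },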
{ "cell_type": "code", "collapsed": false, "input": [ "datasets.make_circles" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Another packaged dataset is the handwritten digits; here is a digit image displayed with different colormaps:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import pylab as pl\n", "mpl.rcParams['figure.figsize']=[16,5.4]\n", "\n", "digits = datasets.load_digits()\n", "fig, (ax0, ax1, ax2) = plt.subplots(ncols=3, sharex=True)\n", "ax0.matshow(digits.images[0])\n", "ax1.matshow(digits.images[0], cmap=pl.cm.gray_r)\n", "ax2.imshow(digits.images[2], cmap=pl.cm.gray_r)\n", "plt.show()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Estimator**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Every algorithm is exposed in scikit-learn via an \"Estimator\" object. For instance, linear regression is implemented as follows:\n" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.linear_model import LinearRegression" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Estimator parameters: all the parameters of an estimator can be set when it is instantiated, and they have suitable default values:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "model = LinearRegression(normalize=True)\n", "print(model.normalize)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "print(model)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Estimated model parameters: when data is fit with an estimator, parameters are estimated from the data at hand. All the estimated parameters are attributes of the estimator object, ending with an underscore:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "mpl.rcParams['figure.figsize']=[6,3]\n", "\n", "x = np.array([0, 1, 2])\n", "y = np.array([0, 1, 2])\n", "\n", "plt.plot(x, y, 'o')\n", "plt.xlim(-0.5, 2.5)\n", "plt.ylim(-0.5, 2.5)\n", "plt.show()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "# The input data for sklearn is 2D: (samples == 3 x features == 1)\n", "X = x[:, np.newaxis]\n", "print(X)\n", "print(y)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "model.fit(X, y)\n", "print(model.coef_)\n", "print(model.intercept_)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "model.residues_ # residual sum of squares of the fit" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The model found a line with slope 1 and intercept 0." ] },
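{ "cell_type": "markdown", "metadata": {}, "source": [ "Once fit, the same estimator object can make predictions for new inputs via `predict`. A minimal sketch (the query points below are arbitrary):" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# Predict y for new, unseen x values using the fitted line.\n", "X_new = np.array([[1.5], [4.0]]) # arbitrary query points, shape [n_samples, 1]\n", "print(model.predict(X_new)) # expect roughly [1.5, 4.0] for the y = x line" ], "language": "python", "metadata": {}, "outputs": [] },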
{ "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "<a id='super'></a>\n", "\n", "### Supervised Learning\n", "\n", "In supervised learning, the computer is presented with a set of sample inputs (the \"training\" data) along with their outputs. The idea is to develop a rule or model that maps the inputs to the outputs, so that when the computer is presented with a new input, it can classify it into a particular category. For example, spam filtering is an instance where the learning algorithm is presented with email messages labeled beforehand as \"spam\" or \"not spam\", in order to produce a program that labels unseen messages as either spam or not.\n", "\n", "Essentially we have a dataset consisting of both features and labels. The task is to construct an estimator which is able to predict the label of an object given the set of features.\n", "\n", "Supervised learning is further broken down into two categories, **classification** and **regression**. In classification, the label is discrete, while in regression, the label is continuous." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`scikit-learn` supports a number of supervised learning algorithms, and we will illustrate them using examples of the following:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- KNN (K Nearest Neighbours)\n", "- Linear Regression\n", "- Support Vector Machines\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "<a id='class'></a>\n", "\n", "### Classification\n", "\n", "A classification algorithm may be used to draw a dividing boundary between two clusters of points." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "<a id='knn'></a>\n", "\n", "**KNN Classification**\n", "\n", "K nearest neighbors (kNN) is one of the simplest learning strategies: given a new, unknown observation, look up the observations in your reference (training) set with the closest features and assign the predominant class." ] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn import neighbors, datasets\n", "\n", "iris = datasets.load_iris()\n", "X, y = iris.data, iris.target\n", "\n", "# create the model\n", "knn = neighbors.KNeighborsClassifier(n_neighbors=2)\n", "\n", "# fit the model\n", "knn.fit(X, y)\n", "\n", "# What kind of iris has 3cm x 5cm sepal and 4cm x 2cm petal?\n", "# call the \"predict\" method:\n", "result = knn.predict([[3, 5, 4, 2],])\n", "\n", "print(iris.target_names[result])" ], "language": "python", "metadata": {}, "outputs": [] },
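{ "cell_type": "markdown", "metadata": {}, "source": [ "How well does this classifier generalize? A minimal sketch (the 80/20 split and `random_state` are arbitrary choices; where `train_test_split` lives depends on your scikit-learn version): hold out part of the data, fit on the rest, and score on the held-out part." ] }, { "cell_type": "code", "collapsed": false, "input": [ "try:\n", "    from sklearn.model_selection import train_test_split # newer scikit-learn\n", "except ImportError:\n", "    from sklearn.cross_validation import train_test_split # older scikit-learn\n", "\n", "# Hold out 20% of the samples for testing.\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)\n", "\n", "knn.fit(X_train, y_train)\n", "print(knn.score(X_test, y_test)) # fraction of held-out samples classified correctly" ], "language": "python", "metadata": {}, "outputs": [] },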
{ "cell_type": "markdown", "metadata": {}, "source": [ "<a id='ols'></a>\n", "\n", "**Ordinary Least Squares (OLS)**\n", "\n", "OLS fits a linear model by choosing the coefficients $\\alpha = (\\alpha_1, \\alpha_2, \\ldots, \\alpha_n)$ that minimize the residual sum of squares between the observed responses in the dataset and the responses predicted by the linear approximation. Mathematically, it solves a problem of the form:\n", "$$\\min_{\\alpha} \\| X\\alpha - y \\|_2^2 \\text{.}$$" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn import linear_model\n", "clf = linear_model.LinearRegression()\n", "clf.fit([[0, 0], [1, 1], [2, 2]], [0, 1, 2])" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "clf.coef_" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "# Create some simple data\n", "import numpy as np\n", "np.random.seed(0)\n", "X = np.random.random(size=(20, 1))\n", "y = 3 * X.squeeze() + 2 + np.random.normal(size=20)\n", "\n", "# Fit a linear regression to it\n", "model = LinearRegression(fit_intercept=True)\n", "model.fit(X, y)\n", "print(\"Model coefficient: %.5f, and intercept: %.5f\"\n", "      % (model.coef_, model.intercept_))\n", "\n", "# Plot the data and the model prediction\n", "X_test = np.linspace(0, 1, 100)[:, np.newaxis]\n", "y_test = model.predict(X_test)\n", "\n", "plt.plot(X.squeeze(), y, 'x')\n", "plt.plot(X_test.squeeze(), y_test);\n" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The model has been learned from the training data, and can be used to predict the result of test data: here, we might be given an x value, and the model would allow us to predict the y value." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "<a id='svm'></a>\n", "\n", "**Support Vector Machines (SVM)**\n", "\n", "Support vector machines (SVMs) are a set of supervised learning methods used for classification, regression, and outlier detection.\n", "\n", "They are particularly advantageous for:\n", "\n", "- high-dimensional spaces;\n", "- problems where the number of dimensions exceeds the number of samples.\n", "\n", "They are also memory-efficient (the decision function uses only a subset of the training points, the support vectors) and versatile (different kernel functions can be specified).\n", "\n", "The support vector machines in `scikit-learn` support both dense (`numpy.ndarray`) and sparse (any `scipy.sparse`) sample vectors as input. However, to use an SVM to make predictions for sparse data, it must have been fit on such data.\n" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn import svm\n", "clf = svm.LinearSVC()\n", "clf.fit(iris.data, iris.target) # learn from the data" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "clf.predict([[ 5.0, 3.6, 1.3, 0.25]])" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "clf.coef_ # access the coefficients" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn import svm\n", "svc = svm.SVC(kernel='linear')\n", "svc = svc.fit(iris.data, iris.target)\n" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "svc.coef_" ], "language": "python", "metadata": {}, "outputs": [] },
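{ "cell_type": "markdown", "metadata": {}, "source": [ "As a quick sanity check (a sketch, not part of the original example), `score` reports the mean accuracy of the fitted classifier on whatever data it is given. Scoring on the training data itself, as here, gives an optimistic estimate; generalization should really be measured on held-out data, as we did for kNN." ] }, { "cell_type": "code", "collapsed": false, "input": [ "# Mean accuracy on the training data itself -- an optimistic estimate.\n", "print(svc.score(iris.data, iris.target))" ], "language": "python", "metadata": {}, "outputs": [] },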
{ "cell_type": "code", "collapsed": false, "input": [ "import numpy as np\n", "import matplotlib.pyplot as plt\n", "from sklearn import svm, datasets\n", "\n", "# import some data to play with\n", "iris = datasets.load_iris()\n", "X = iris.data[:, :2] # we only take the first two features. We could\n", "                     # avoid this ugly slicing by using a two-dim dataset\n", "y = iris.target\n", "\n", "h = .02 # step size in the mesh\n", "\n", "# we create an instance of SVM and fit our data. We do not scale our\n", "# data since we want to plot the support vectors\n", "C = 1.0 # SVM regularization parameter\n", "svc = svm.SVC(kernel='linear', C=C).fit(X, y)\n", "rbf_svc = svm.SVC(kernel='rbf', gamma=0.7, C=C).fit(X, y)\n", "poly_svc = svm.SVC(kernel='poly', degree=3, C=C).fit(X, y)\n", "lin_svc = svm.LinearSVC(C=C).fit(X, y)\n", "\n", "# create a mesh to plot in\n", "x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1\n", "y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1\n", "xx, yy = np.meshgrid(np.arange(x_min, x_max, h),\n", "                     np.arange(y_min, y_max, h))\n", "\n", "# title for the plots\n", "titles = ['SVC with linear kernel',\n", "          'LinearSVC (linear kernel)',\n", "          'SVC with RBF kernel',\n", "          'SVC with polynomial (degree 3) kernel']\n", "\n", "mpl.rcParams['figure.figsize']=[12,12]\n", "for i, clf in enumerate((svc, lin_svc, rbf_svc, poly_svc)):\n", "    # Plot the decision boundary. For that, we will assign a color to each\n", "    # point in the mesh [x_min, x_max]x[y_min, y_max].\n", "    plt.subplot(2, 2, i + 1)\n", "    plt.subplots_adjust(wspace=0.4, hspace=0.4)\n", "\n", "    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])\n", "\n", "    # Put the result into a color plot\n", "    Z = Z.reshape(xx.shape)\n", "    plt.contourf(xx, yy, Z, cmap=plt.cm.Paired, alpha=0.8)\n", "\n", "    # Plot also the training points\n", "    plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Paired)\n", "    plt.xlabel('Sepal length')\n", "    plt.ylabel('Sepal width')\n", "    plt.xlim(xx.min(), xx.max())\n", "    plt.ylim(yy.min(), yy.max())\n", "    plt.xticks(())\n", "    plt.yticks(())\n", "    plt.title(titles[i])\n", "\n", "plt.show()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "<a id='unsuper'></a>\n", "\n", "### Unsupervised Learning\n", "\n", "In the case of unsupervised learning, the idea is to find hidden patterns in the data. In a sense, you can think of unsupervised learning as a means of discovering labels from the data itself. Unsupervised learning comprises tasks such as dimensionality reduction, clustering, and density estimation.\n", "Several approaches to unsupervised learning are supported in `scikit-learn`, such as:\n", "\n", "- K-means clustering\n", "- Principal Component Analysis\n", "- Neural networks" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "<a id='kmeans'></a>\n", "\n", "### K-means Clustering\n", "Clustering is the task of grouping a set of objects such that objects in the same group are more similar to one another than to objects in other groups. The simplest clustering algorithm is k-means. It divides a set into k clusters, assigning each observation to the cluster with the nearest mean (in n-dimensional space); the means are then recomputed. This operation is run iteratively until the assignments converge, for at most `max_iter` rounds. A small synthetic sketch appears below, followed by k-means on the iris data." ] },
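{ "cell_type": "markdown", "metadata": {}, "source": [ "A minimal sketch on generated data (the cluster count and blob parameters are arbitrary choices): `make_blobs` draws labeled Gaussian clusters, and `KMeans` recovers them without ever seeing the labels." ] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.datasets import make_blobs\n", "from sklearn.cluster import KMeans\n", "\n", "# Three Gaussian blobs in the plane; the true labels are discarded.\n", "X_blob, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=0)\n", "\n", "km = KMeans(n_clusters=3)\n", "blob_labels = km.fit_predict(X_blob) # cluster index for each point\n", "\n", "mpl.rcParams['figure.figsize']=[6,4]\n", "plt.scatter(X_blob[:, 0], X_blob[:, 1], c=blob_labels)\n", "plt.scatter(km.cluster_centers_[:, 0], km.cluster_centers_[:, 1],\n", "            marker='x', s=100, c='k') # learned cluster means\n", "plt.show()" ], "language": "python", "metadata": {}, "outputs": [] },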
] }, { "cell_type": "code", "collapsed": false, "input": [ "print(__doc__)\n", "\n", "\n", "# Code source: Ga\u00ebl Varoquaux\n", "# Modified for documentation by Jaques Grobler\n", "# License: BSD 3 clause\n", "\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "from mpl_toolkits.mplot3d import Axes3D\n", "\n", "\n", "from sklearn.cluster import KMeans\n", "from sklearn import datasets\n", "\n", "np.random.seed(5)\n", "\n", "centers = [[1, 1], [-1, -1], [1, -1]]\n", "iris = datasets.load_iris()\n", "X = iris.data\n", "y = iris.target\n", "\n", "estimators = {'k_means_iris_3': KMeans(n_clusters=3),\n", " 'k_means_iris_8': KMeans(n_clusters=8),\n", " 'k_means_iris_bad_init': KMeans(n_clusters=3, n_init=1,\n", " init='random')}\n", "\n", "\n", "fignum = 1\n", "for name, est in estimators.items():\n", " fig = plt.figure(fignum, figsize=(4, 3))\n", " plt.clf()\n", " ax = Axes3D(fig, rect=[0, 0, .95, 1], elev=48, azim=134)\n", "\n", " plt.cla()\n", " est.fit(X)\n", " labels = est.labels_\n", "\n", " ax.scatter(X[:, 3], X[:, 0], X[:, 2], c=labels.astype(np.float))\n", "\n", " ax.w_xaxis.set_ticklabels([])\n", " ax.w_yaxis.set_ticklabels([])\n", " ax.w_zaxis.set_ticklabels([])\n", " ax.set_xlabel('Petal width')\n", " ax.set_ylabel('Sepal length')\n", " ax.set_zlabel('Petal length')\n", " fignum = fignum + 1\n", "\n", "# Plot the ground truth\n", "fig = plt.figure(fignum, figsize=(4, 3))\n", "plt.clf()\n", "ax = Axes3D(fig, rect=[0, 0, .95, 1], elev=48, azim=134)\n", "\n", "plt.cla()\n", "\n", "for name, label in [('Setosa', 0),\n", " ('Versicolour', 1),\n", " ('Virginica', 2)]:\n", " ax.text3D(X[y == label, 3].mean(),\n", " X[y == label, 0].mean() + 1.5,\n", " X[y == label, 2].mean(), name,\n", " horizontalalignment='center',\n", " bbox=dict(alpha=.5, edgecolor='w', facecolor='w'))\n", "# Reorder the labels to have colors matching the cluster results\n", "y = np.choose(y, [1, 2, 0]).astype(np.float)\n", "ax.scatter(X[:, 3], X[:, 0], X[:, 2], c=y)\n", "\n", "ax.w_xaxis.set_ticklabels([])\n", "ax.w_yaxis.set_ticklabels([])\n", "ax.w_zaxis.set_ticklabels([])\n", "ax.set_xlabel('Petal width')\n", "ax.set_ylabel('Sepal length')\n", "ax.set_zlabel('Petal length')\n", "plt.show()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "###Principal Component Analysis(PCA)\n", "PCA is extensively used in reducing dimensionality of data. PCA finds the directions in which the data is not flat and it can reduce the dimensionality of the data by projecting on a subspace. 
{ "cell_type": "markdown", "metadata": {}, "source": [ "<a id='pca'></a>\n", "\n", "### Principal Component Analysis (PCA)\n", "PCA is extensively used for reducing the dimensionality of data. PCA finds the orthogonal directions along which the data varies most, and it can reduce the dimensionality of the data by projecting onto the subspace those directions span. It finds the combination of variables that exhibits the maximum variance." ] }, { "cell_type": "code", "collapsed": false, "input": [ "mpl.rcParams['figure.figsize']=[12,5]\n", "np.random.seed(1)\n", "X = np.dot(np.random.random(size=(2, 2)), np.random.normal(size=(2, 200))).T\n", "plt.plot(X[:, 0], X[:, 1], 'og')\n", "plt.axis('equal')" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.decomposition import PCA\n", "pca = PCA(n_components=2)\n", "pca.fit(X)\n", "print(pca.explained_variance_)\n", "print(pca.components_)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "mpl.rcParams['figure.figsize']=[12,5]\n", "plt.plot(X[:, 0], X[:, 1], 'og', alpha=0.3)\n", "plt.axis('equal')\n", "for length, vector in zip(pca.explained_variance_, pca.components_):\n", "    v = vector * 3 * np.sqrt(length)\n", "    plt.plot([0, v[0]], [0, v[1]], '-k', lw=3)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "clf = PCA(0.95) # keep only enough components to explain 95% of the variance\n", "X_trans = clf.fit_transform(X)\n", "print(X.shape)\n", "print(X_trans.shape)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "X_new = clf.inverse_transform(X_trans)\n", "plt.plot(X[:, 0], X[:, 1], 'og', alpha=0.2)\n", "plt.plot(X_new[:, 0], X_new[:, 1], 'og', alpha=0.8)\n", "plt.axis('equal');" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "X, y = iris.data, iris.target\n", "from sklearn.decomposition import PCA\n", "pca = PCA(n_components=2)\n", "pca.fit(X)\n", "X_reduced = pca.transform(X)\n", "print(\"Reduced dataset shape:\", X_reduced.shape)\n", "\n", "import pylab as pl\n", "mpl.rcParams['figure.figsize']=[12,3]\n", "pl.scatter(X_reduced[:, 0], X_reduced[:, 1], c=y)\n", "\n", "print(\"Meaning of the 2 components:\")\n", "for component in pca.components_:\n", "    print(\" + \".join(\"%.3f x %s\" % (value, name)\n", "                     for value, name in zip(component,\n", "                                            iris.feature_names)))" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "X, y = iris.data, iris.target\n", "from sklearn.cluster import KMeans\n", "k_means = KMeans(n_clusters=5, random_state=0) # Fixing the RNG in kmeans\n", "k_means.fit(X)\n", "y_pred = k_means.predict(X)\n", "\n", "pl.scatter(X_reduced[:, 0], X_reduced[:, 1], c=y_pred);\n" ], "language": "python", "metadata": {}, "outputs": [] },
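{ "cell_type": "markdown", "metadata": {}, "source": [ "How much of the iris data's variance do the two retained components actually carry? A quick sketch using the fitted PCA's `explained_variance_ratio_` attribute:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# Fraction of the total variance captured by each retained component.\n", "print(pca.explained_variance_ratio_)\n", "print(pca.explained_variance_ratio_.sum()) # total fraction retained" ], "language": "python", "metadata": {}, "outputs": [] },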
{ "cell_type": "markdown", "metadata": {}, "source": [ "<a id='ann'></a>\n", "\n", "### Neural Networks\n", "\n", "Neural networks, or artificial neural networks (ANNs) as they are commonly known, are used in both machine learning and pattern recognition.\n", "\n", "For example, a neural network for handwriting recognition is defined by a set of input neurons which may be activated by the pixels of an input image. After being weighted and transformed by a function (determined by the network's designer), the activations of these neurons are passed on to other neurons. This process is repeated until finally an output neuron is activated, determining which character was read.\n", "\n", "Typically, an ANN is represented by three layers. The hidden layer's job is to transform the inputs into something that the output layer can use. The output of a neuron is a function of the weighted sum of its inputs plus a bias; a small sketch of a single neuron follows.\n", "\n", "A restricted Boltzmann machine (RBM) is a generative stochastic artificial neural network that can learn a probability distribution over its set of inputs. It is an example of an unsupervised neural-network model supported by `scikit-learn`." ] },
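{ "cell_type": "markdown", "metadata": {}, "source": [ "A minimal sketch of that computation (the weights, bias, and inputs below are made-up numbers): the neuron's output is an activation function, here the logistic sigmoid, applied to the weighted sum of its inputs plus a bias." ] }, { "cell_type": "code", "collapsed": false, "input": [ "# One artificial neuron: output = sigmoid(w . x + b)\n", "w = np.array([0.5, -1.2, 0.8]) # made-up weights, one per input\n", "b = 0.1 # made-up bias\n", "x = np.array([1.0, 0.5, -0.5]) # made-up input activations\n", "\n", "z = np.dot(w, x) + b # weighted sum of the inputs plus the bias\n", "output = 1.0 / (1.0 + np.exp(-z)) # logistic sigmoid activation\n", "print(output)" ], "language": "python", "metadata": {}, "outputs": [] },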
\n", "\n", "But there are many others, such as [$Lab$](https://en.wikipedia.org/wiki/Lab_color_space) and [$XYZ$](https://en.wikipedia.org/wiki/CIE_1931_color_space). These color spaces do not necessarily cover the same range of [perceptible colors]() (or [imperceptible ones!](https://en.wikipedia.org/wiki/Impossible_color)), but transformations between spaces [can still be defined](http://www.brucelindbloom.com/Math.html). We will use this below to convert between $RGB$ and $Lab$.\n", "\n", "$Lab$ was designed to replicate human vision, and exploits the fact that in a sense there are [four fundamental colors](https://en.wikipedia.org/wiki/Opponent_process) that the human eye can perceive: a red-green axis $a$ and a blue-yellow axis $b$. Adding luminosity $L$ to these chromaticity axes yields a three-parameter color space that is actually more expressive than can be represented by $RGB$ triplets." ] }, { "cell_type": "code", "collapsed": false, "input": [ "import numpy as np\n", "import scipy as sp\n", "import matplotlib as mpl\n", "import matplotlib.pyplot as plt\n", "import matplotlib.cm as cm\n", "\n", "from sklearn.feature_extraction import image\n", "from sklearn.cluster import spectral_clustering\n", "from sklearn.cluster import KMeans\n", "%matplotlib inline\n", "\n", "from colormath.color_objects import LabColor, sRGBColor\n", "from colormath.color_conversions import convert_color\n", "rgbize = np.vectorize(sRGBColor)\n", "convertize = np.vectorize(convert_color)\n", "\n", "# Inspired by an example at http://www.mathworks.com/help/images/examples/color-based-segmentation-using-k-means-clustering.html\n", "# Read image and convert it from RGB to Lab representation.\n", "from pylab import imread, imshow, gray, mean\n", "sat_rgb = imread('satellite.png')\n", "plt.figure()\n", "plt.xticks([])\n", "plt.yticks([])\n", "imshow(sat_rgb)\n", "sat_rgb_cs = rgbize(sat_rgb[:,:,0],sat_rgb[:,:,1],sat_rgb[:,:,2])\n", "sat_lab_cs = convertize(sat_rgb_cs, LabColor)\n", "\n", "sat_lab = np.ones((sat_lab_cs.shape[0], sat_lab_cs.shape[1], 4))\n", "for i in range(sat_lab_cs.shape[0]):\n", " for j in range(sat_lab_cs.shape[1]):\n", " sat_lab[i,j,0] = sat_lab_cs[i,j].lab_l/200+100 #rgb_r\n", " sat_lab[i,j,1] = sat_lab_cs[i,j].lab_a/200+100 #rgb_g\n", " sat_lab[i,j,2] = sat_lab_cs[i,j].lab_b/200+100 #rgb_b" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "mpl.rcParams['figure.figsize']=(15,3)\n", "\n", "# Full-color image\n", "plt.figure()\n", "plt.subplot(1,4,1)\n", "plt.xticks([])\n", "plt.yticks([])\n", "plt.xlabel(r'$RGB$ Full-Color Image')\n", "imshow(sat_rgb)\n", "\n", "# Red channel\n", "cdict = {'red': ((0.0, 0.0, 0.0),\n", " (1.0, 1.0, 1.0)),\n", " 'green': ((0.0, 0.0, 0.0),\n", " (1.0, 0.0, 0.0)),\n", " 'blue': ((0.0, 0.0, 0.0),\n", " (1.0, 0.0, 0.0))}\n", "BkRd = mpl.colors.LinearSegmentedColormap('BkRd',cdict,256)\n", "plt.subplot(1,4,2)\n", "plt.xticks([])\n", "plt.yticks([])\n", "plt.xlabel(r'$R$')\n", "imshow(sat_rgb[:,:,0], cmap = BkRd)\n", "\n", "# Green channel\n", "cdict = {'red': ((0.0, 0.0, 0.0),\n", " (1.0, 0.0, 0.0)),\n", " 'green': ((0.0, 0.0, 0.0),\n", " (1.0, 1.0, 1.0)),\n", " 'blue': ((0.0, 0.0, 0.0),\n", " (1.0, 0.0, 0.0))}\n", "BkGn = mpl.colors.LinearSegmentedColormap('BkGn',cdict,256)\n", "plt.subplot(1,4,3)\n", "plt.xticks([])\n", "plt.yticks([])\n", "plt.xlabel(r'$G$')\n", "imshow(sat_rgb[:,:,1], cmap = BkGn)\n", "\n", "# Blue Channel\n", "cdict = {'red': ((0.0, 0.0, 0.0),\n", " (1.0, 0.0, 0.0)),\n", " 
{ "cell_type": "code", "collapsed": false, "input": [ "mpl.rcParams['figure.figsize']=(15,3)\n", "\n", "# Full-color image\n", "plt.figure()\n", "plt.subplot(1,4,1)\n", "plt.xticks([])\n", "plt.yticks([])\n", "plt.xlabel(r'$RGB$ Full-Color Image')\n", "imshow(sat_rgb)\n", "\n", "# Red channel\n", "cdict = {'red': ((0.0, 0.0, 0.0),\n", "                 (1.0, 1.0, 1.0)),\n", "         'green': ((0.0, 0.0, 0.0),\n", "                   (1.0, 0.0, 0.0)),\n", "         'blue': ((0.0, 0.0, 0.0),\n", "                  (1.0, 0.0, 0.0))}\n", "BkRd = mpl.colors.LinearSegmentedColormap('BkRd',cdict,256)\n", "plt.subplot(1,4,2)\n", "plt.xticks([])\n", "plt.yticks([])\n", "plt.xlabel(r'$R$')\n", "imshow(sat_rgb[:,:,0], cmap = BkRd)\n", "\n", "# Green channel\n", "cdict = {'red': ((0.0, 0.0, 0.0),\n", "                 (1.0, 0.0, 0.0)),\n", "         'green': ((0.0, 0.0, 0.0),\n", "                   (1.0, 1.0, 1.0)),\n", "         'blue': ((0.0, 0.0, 0.0),\n", "                  (1.0, 0.0, 0.0))}\n", "BkGn = mpl.colors.LinearSegmentedColormap('BkGn',cdict,256)\n", "plt.subplot(1,4,3)\n", "plt.xticks([])\n", "plt.yticks([])\n", "plt.xlabel(r'$G$')\n", "imshow(sat_rgb[:,:,1], cmap = BkGn)\n", "\n", "# Blue channel\n", "cdict = {'red': ((0.0, 0.0, 0.0),\n", "                 (1.0, 0.0, 0.0)),\n", "         'green': ((0.0, 0.0, 0.0),\n", "                   (1.0, 0.0, 0.0)),\n", "         'blue': ((0.0, 0.0, 0.0),\n", "                  (1.0, 1.0, 1.0))}\n", "BkBl = mpl.colors.LinearSegmentedColormap('BkBl',cdict,256)\n", "plt.subplot(1,4,4)\n", "plt.xticks([])\n", "plt.yticks([])\n", "plt.xlabel(r'$B$')\n", "imshow(sat_rgb[:,:,2], cmap = BkBl)\n", "\n", "# Composite Lab image as RGB\n", "plt.figure()\n", "plt.subplot(1,4,1)\n", "plt.xticks([])\n", "plt.yticks([])\n", "plt.xlabel(r'Composite $Lab$ as $RGB$')\n", "imshow(sat_lab)\n", "\n", "# L - luminosity\n", "plt.subplot(1,4,2)\n", "plt.xticks([])\n", "plt.yticks([])\n", "plt.xlabel(r'$L$')\n", "imshow(sat_lab[:,:,0], cmap = cm.Greys_r)\n", "\n", "# a - red-green axis\n", "cdict = {'red': ((0.0, 0.0, 0.0),\n", "                 (1.0, 1.0, 1.0)),\n", "         'green': ((0.0, 1.0, 1.0),\n", "                   (1.0, 0.0, 0.0)),\n", "         'blue': ((0.0, 0.0, 0.0),\n", "                  (1.0, 0.0, 0.0))}\n", "RdGr = mpl.colors.LinearSegmentedColormap('RdGr',cdict,256)\n", "plt.subplot(1,4,3)\n", "plt.xticks([])\n", "plt.yticks([])\n", "plt.xlabel(r'$a$')\n", "imshow(sat_lab[:,:,1], cmap = RdGr)\n", "\n", "# b - blue-yellow axis\n", "cdict = {'red': ((0.0, 0.0, 0.0),\n", "                 (1.0, 1.0, 1.0)),\n", "         'green': ((0.0, 0.0, 0.0),\n", "                   (1.0, 1.0, 1.0)),\n", "         'blue': ((0.0, 1.0, 1.0),\n", "                  (1.0, 0.0, 0.0))}\n", "BlYw = mpl.colors.LinearSegmentedColormap('BlYw',cdict,256)\n", "plt.subplot(1,4,4)\n", "plt.xticks([])\n", "plt.yticks([])\n", "plt.xlabel(r'$b$')\n", "imshow(sat_lab[:,:,2], cmap = BlYw)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "# Classify each color into clusters using the K-means algorithm.\n", "from sklearn.cluster import KMeans\n", "ab = sat_lab[:,:,1:3]\n", "ab = np.reshape(ab,(sat_lab.shape[0]*sat_lab.shape[1],2))\n", "\n", "# How many major colors do you perceive?\n", "n_colors = 3\n", "\n", "# Cluster, repeating 10x to avoid local minima.\n", "kmeans = KMeans(n_clusters=n_colors, n_init=10)\n", "cluster_index = kmeans.fit_predict(ab)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "# Classify pixels by K-means cluster.\n", "pixel_labels = np.reshape(cluster_index,(sat_lab.shape[0],sat_lab.shape[1]))\n", "\n", "mpl.rcParams['figure.figsize']=(5,5)\n", "plt.figure()\n", "plt.xticks([])\n", "plt.yticks([])\n", "imshow(pixel_labels, cmap = cm.Blues_r)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": true, "input": [ "# Segment the original image by color cluster.\n", "mpl.rcParams['figure.figsize']=(16,8)\n", "sat_seg = np.zeros((n_colors,sat_rgb.shape[0],sat_rgb.shape[1],sat_rgb.shape[2]))\n", "\n", "for k in range(n_colors):\n", "    color_index = np.where(pixel_labels == k)\n", "    sat_seg[k,color_index[0],color_index[1]] = sat_rgb[np.where(pixel_labels == k)]\n", "\n", "    plt.subplot(2, int(np.ceil(n_colors/2.0)), k+1)\n", "    plt.xticks([])\n", "    plt.yticks([])\n", "    imshow(sat_seg[k])\n", "plt.tight_layout()" ], "language": "python", "metadata": {}, "outputs": [] },
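{ "cell_type": "markdown", "metadata": {}, "source": [ "How much of the image does each color cluster cover? A quick sketch counting pixels per cluster:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# Count the pixels assigned to each of the n_colors clusters.\n", "counts = np.bincount(cluster_index)\n", "for k in range(n_colors):\n", "    print(\"cluster %d: %d pixels (%.1f%%)\"\n", "          % (k, counts[k], 100.0 * counts[k] / counts.sum()))" ], "language": "python", "metadata": {}, "outputs": [] },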
{ "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "<a id='image'></a>\n", "\n", "### Image compression using clustering\n", "K-means in the previous example was used to identify polar caps. A related use of K-means is image compression. The size of an image usually relates to the number of distinct values used to represent it (as we saw earlier). This example compresses an image by reducing the number of values used to represent it (5 clusters in this case). The primary purpose of K-means here is to identify the right 5 values so that the image can still be represented faithfully." ] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.datasets import load_sample_image\n", "c1 = load_sample_image(\"china.jpg\")\n", "plt.imshow(c1);" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.datasets import load_sample_image\n", "c = load_sample_image(\"china.jpg\")\n", "c = c.astype(np.float32)\n", "X = c.reshape((-1, 1)) # one scalar channel value per row; k-means quantizes these intensities\n", "k_means = KMeans(n_clusters=5)\n", "k_means.fit(X)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "values = k_means.cluster_centers_.squeeze()\n", "labels = k_means.labels_\n", "c_compressed = np.choose(labels, values) # replace each value by its cluster center\n", "c_compressed.shape = c.shape" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "mpl.rcParams['figure.figsize']=[12,5.5]\n", "fig, (ax0, ax1) = plt.subplots(ncols=2, sharex=True)\n", "ax0.imshow(c / 255)            # rescale floats into [0, 1] for display\n", "ax1.imshow(c_compressed / 255)\n", "plt.show()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "<a id='refs'></a>\n", "\n", "## References\n", "- [scikit-learn Documentation](http://scikit-learn.org/stable/)\n", "- [Machine Learning using Astronomical Data](http://www.astroml.org/)\n", "- [scikit-learn tutorial](http://www.astroml.org/sklearn_tutorial/)\n", "- [Stock Market Example](http://scikit-learn.org/stable/auto_examples/applications/plot_stock_market.html#example-applications-plot-stock-market-py)\n", "- [PyCon14](http://nbviewer.ipython.org/github/jakevdp/sklearn_pycon2014/blob/master/notebooks/04_supervised_in_depth.ipynb)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "<a id='credits'></a>\n", "\n", "## Credits\n", "\n", "Lakshmi Rao, Abhishek Sharma, and Neal Davis developed these materials for [Computational Science and Engineering](http://cse.illinois.edu/) at the University of Illinois at Urbana\u2013Champaign.\n", "\n", "\n", "This content is available under a [Creative Commons Attribution 3.0 Unported License](https://creativecommons.org/licenses/by/3.0/).\n", "\n", "[![](https://bytebucket.org/davis68/resources/raw/f7c98d2b95e961fae257707e22a58fa1a2c36bec/logos/baseline_cse_wdmk.png?token=be4cc41d4b2afe594f5b1570a3c5aad96a65f0d6)](http://cse.illinois.edu/)" ] } ], "metadata": {} } ] }