{ "metadata": { "name": "", "signature": "sha256:2d39e721aea6e9d12dd0c22db50ac39a6dcf52de1c53e6c29d980f1c9e338d9c" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "[![](https://bytebucket.org/davis68/resources/raw/f7c98d2b95e961fae257707e22a58fa1a2c36bec/logos/baseline_cse_wdmk.png?token=be4cc41d4b2afe594f5b1570a3c5aad96a65f0d6)](http://cse.illinois.edu/)" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from __future__ import print_function\n", "\n", "import numpy as np\n", "import matplotlib as mpl\n", "import matplotlib.pyplot as plt\n", "import matplotlib.cm as cm\n", "%matplotlib inline" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Machine Learning" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "_Machine learning_ is the science of getting computers to act without being explicitly programmed. It is behind self-driving cars, effective web search, speech recognition, and more. You probably use it every day without even realizing it!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Today we will explore two of the main branches of machine learning in Python, through examples:\n", "\n", "- Supervised learning: E.g. Classification\n", "\n", "- Unsupervised learning: E.g. 
Clustering" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**A representation of the classifiers in scikit-learn**" ] }, { "cell_type": "code", "collapsed": false, "input": [ "print(__doc__)\n", "\n", "\n", "# Code source: Ga\u00ebl Varoquaux\n", "#              Andreas M\u00fcller\n", "# Modified for documentation by Jaques Grobler\n", "# License: BSD 3 clause\n", "\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "from matplotlib.colors import ListedColormap\n", "# Note: the next import and the two discriminant_analysis imports below\n", "# assume scikit-learn >= 0.18, where these modules replaced\n", "# sklearn.cross_validation, sklearn.lda and sklearn.qda\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.preprocessing import StandardScaler\n", "from sklearn.datasets import make_moons, make_circles, make_classification\n", "from sklearn.neighbors import KNeighborsClassifier\n", "from sklearn.svm import SVC\n", "from sklearn.tree import DecisionTreeClassifier\n", "from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier\n", "from sklearn.naive_bayes import GaussianNB\n", "from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA\n", "from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis as QDA\n", "%matplotlib inline\n", "h = .02  # step size in the mesh\n", "\n", "names = [\"Nearest Neighbors\", \"Linear SVM\", \"RBF SVM\", \"Decision Tree\",\n", "         \"Random Forest\", \"AdaBoost\", \"Naive Bayes\", \"LDA\", \"QDA\"]\n", "classifiers = [\n", "    KNeighborsClassifier(3),\n", "    SVC(kernel=\"linear\", C=0.025),\n", "    SVC(gamma=2, C=1),\n", "    DecisionTreeClassifier(max_depth=5),\n", "    RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1),\n", "    AdaBoostClassifier(),\n", "    GaussianNB(),\n", "    LDA(),\n", "    QDA()]\n", "\n", "X, y = make_classification(n_features=2, n_redundant=0, n_informative=2,\n", "                           random_state=1, n_clusters_per_class=1)\n", "rng = np.random.RandomState(2)\n", "X += 2 * rng.uniform(size=X.shape)\n", "linearly_separable = (X, y)\n", "\n", "datasets = [make_moons(noise=0.3, random_state=0),\n", "            make_circles(noise=0.2, factor=0.5, random_state=1),\n", "            linearly_separable\n", "            ]\n", "\n", "figure = plt.figure(figsize=(20, 7))\n", "i = 
1\n", "# iterate over datasets\n", "for ds in datasets:\n", "    # preprocess dataset, split into training and test part\n", "    X, y = ds\n", "    X = StandardScaler().fit_transform(X)\n", "    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.4)\n", "\n", "    x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5\n", "    y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5\n", "    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),\n", "                         np.arange(y_min, y_max, h))\n", "\n", "    # just plot the dataset first\n", "    cm = plt.cm.RdBu\n", "    cm_bright = ListedColormap(['#FF0000', '#0000FF'])\n", "    ax = plt.subplot(len(datasets), len(classifiers) + 1, i)\n", "    # Plot the training points\n", "    ax.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap=cm_bright)\n", "    # and testing points\n", "    ax.scatter(X_test[:, 0], X_test[:, 1], c=y_test, cmap=cm_bright, alpha=0.6)\n", "    ax.set_xlim(xx.min(), xx.max())\n", "    ax.set_ylim(yy.min(), yy.max())\n", "    ax.set_xticks(())\n", "    ax.set_yticks(())\n", "    i += 1\n", "\n", "    # iterate over classifiers\n", "    for name, clf in zip(names, classifiers):\n", "        ax = plt.subplot(len(datasets), len(classifiers) + 1, i)\n", "        clf.fit(X_train, y_train)\n", "        score = clf.score(X_test, y_test)\n", "\n", "        # Plot the decision boundary. 
For that, we will assign a color to each\n", "        # point in the mesh [x_min, x_max]x[y_min, y_max].\n", "        if hasattr(clf, \"decision_function\"):\n", "            Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()])\n", "        else:\n", "            Z = clf.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 1]\n", "\n", "        # Put the result into a color plot\n", "        Z = Z.reshape(xx.shape)\n", "        ax.contourf(xx, yy, Z, cmap=cm, alpha=.8)\n", "\n", "        # Plot also the training points\n", "        ax.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap=cm_bright)\n", "        # and testing points\n", "        ax.scatter(X_test[:, 0], X_test[:, 1], c=y_test, cmap=cm_bright,\n", "                   alpha=0.6)\n", "\n", "        ax.set_xlim(xx.min(), xx.max())\n", "        ax.set_ylim(yy.min(), yy.max())\n", "        ax.set_xticks(())\n", "        ax.set_yticks(())\n", "        ax.set_title(name)\n", "        ax.text(xx.max() - .3, yy.min() + .3, ('%.2f' % score).lstrip('0'),\n", "                size=15, horizontalalignment='right')\n", "        i += 1\n", "\n", "figure.subplots_adjust(left=.02, right=.98)\n", "plt.show()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Representing data in scikit-learn**\n", "\n", "Most machine learning algorithms implemented in scikit-learn expect data to be stored in a two-dimensional array or matrix. The arrays can be either numpy arrays or, in some cases, scipy.sparse matrices. The size of the array is expected to be [n_samples, n_features]:\n", "\n", "- `n_samples`: The number of samples. Each sample is an item to process (e.g. classify). A sample can be a document, a picture, a sound, a video, an astronomical object, a row in a database or CSV file, or whatever you can describe with a fixed set of quantitative traits.\n", "\n", "- `n_features`: The number of features or distinct traits that can be used to describe each item in a quantitative manner. Features are generally real-valued, but may be boolean or discrete-valued in some cases. The number of features must be fixed in advance. 
However it can be very high dimensional (e.g. millions of features) with most of them being zeros for a given sample. This is a case where scipy.sparse matrices can be useful, in that they are much more memory-efficient than numpy arrays." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Loading a dataset**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before we begin with machine learning, let's start at its very root by understanding how to load a dataset. We are going to start with the `iris` dataset stored in `scikit-learn`. It can be visualized as follows:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from IPython.core.display import Image, display\n", "display(Image('http://www.twofrog.com/images/iris38a.jpg'))\n", "print (\"Iris Setosa\\n\")\n", "\n", "display(Image('http://www.gatewaygardens.com/_ccLib/image/plants/DETA-185.jpg'))\n", "print (\"Iris Versicolor\\n\")\n", "\n", "display(Image('http://mirlab.org/jang/books/dcpr/image/Iris-virginica-3_1.jpg'))\n", "print (\"Iris Virginica\")\n", "\n" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Question**\n", "If we want to design an algorithm to recognize iris species, what might the data be?\n", "(Remember: we need a 2D array of size [`n_samples` x `n_features`].)\n", "\n", "- What would the `n_samples` refer to?\n", "- What might the `n_features` refer to?\n", "\n", "Remember that there must be a fixed number of features for each sample, and feature number `i` must be a similar kind of quantity for each sample." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Scikit-learn has a very straightforward set of data on these iris species. 
The data consists of the following:\n", "\n", "Features in the Iris dataset:\n", "\n", "- sepal length in cm\n", "- sepal width in cm\n", "- petal length in cm\n", "- petal width in cm\n", "\n", "Target classes to predict:\n", "\n", "- Iris Setosa\n", "- Iris Versicolour\n", "- Iris Virginica\n", "\n", "`scikit-learn` embeds a copy of the iris CSV file along with a helper function to load it into numpy arrays:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn import datasets\n", "# 150 observations of iris flowers: sepal length, sepal width, petal length and petal width\n", "iris = datasets.load_iris()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The feature matrix is stored in the `.data` attribute, the class labels in `.target`, and the corresponding names in `.feature_names` and `.target_names`:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "#iris.data.shape\n", "#iris.target_names\n", "iris.feature_names" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This data is four-dimensional, but we can visualize two of the dimensions at a time using a simple scatter plot:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "x_index = 0\n", "y_index = 1\n", "\n", "# this formatter will label the colorbar with the correct target names\n", "formatter = plt.FuncFormatter(lambda i, *args: iris.target_names[int(i)])\n", "\n", "plt.scatter(iris.data[:, x_index], iris.data[:, y_index],\n", "            c=iris.target)\n", "plt.colorbar(ticks=[0, 1, 2], format=formatter)\n", "plt.xlabel(iris.feature_names[x_index])\n", "plt.ylabel(iris.feature_names[y_index])\n", "plt.show()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Exercise**\n", "\n", "Change `x_index` and `y_index` in the script above and find the combination of two features that best separates the three classes.\n", "\n", "This exercise is a preview of dimensionality reduction, which we'll see later."
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Other datasets**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "scikit-learn provides three kinds of datasets:\n", "\n", "- Packaged Data: these small datasets are packaged with the scikit-learn installation, and can be loaded using the tools in `sklearn.datasets.load_*`\n", "- Downloadable Data: these larger datasets are available for download, and scikit-learn includes tools which streamline this process. These tools can be found in `sklearn.datasets.fetch_*`\n", "- Generated Data: several datasets are generated from models based on a random seed. These are available in `sklearn.datasets.make_*`" ] }, { "cell_type": "code", "collapsed": false, "input": [ "datasets.make_circles" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "import pylab as pl\n", "mpl.rcParams['figure.figsize']=[16,5.4]\n", "\n", "digits = datasets.load_digits()\n", "fig, (ax0, ax1, ax2) = plt.subplots(ncols=3, sharex=True)\n", "ax0.matshow(digits.images[0])\n", "ax1.matshow(digits.images[0], cmap=pl.cm.gray_r)\n", "ax2.imshow(digits.images[2], cmap=pl.cm.gray_r)\n", "fig.show()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Estimator**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Every algorithm is exposed in scikit-learn via an *Estimator* object. 
For instance, linear regression is provided by the following class:\n" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.linear_model import LinearRegression" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Estimator parameters: all the parameters of an estimator can be set when it is instantiated, and each has a suitable default value:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# `normalize` is no longer accepted by recent scikit-learn releases,\n", "# so this example sets `fit_intercept` instead\n", "model = LinearRegression(fit_intercept=True)\n", "print (model.fit_intercept)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "print (model)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Estimated model parameters: when an estimator is fit to data, its parameters are estimated from the data at hand. All estimated parameters are attributes of the estimator object whose names end with an underscore:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "mpl.rcParams['figure.figsize']=[6,3]\n", "\n", "x = np.array([0, 1, 2])\n", "y = np.array([0, 1, 2])\n", "\n", "plt.plot(x, y, 'o')\n", "plt.xlim(-0.5, 2.5)\n", "plt.ylim(-0.5, 2.5)\n", "plt.show()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "# The input data for sklearn is 2D: (samples == 3 x features == 1)\n", "X = x[:, np.newaxis]\n", "print (X)\n", "print (y)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "model.fit(X, y)\n", "print (model.coef_)\n", "print (model.intercept_)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "# residual sum of squares of the fit (the old `residues_` attribute\n", "# is not available in recent scikit-learn releases)\n", "np.sum((model.predict(X) - y) ** 2)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The model found a line with slope 1 and intercept 0." ] }, { "cell_type": 
"markdown", "metadata": {}, "source": [ "---\n", "# Learning" ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Supervised Learning" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In supervised learning, the computer gets presented with a set of sample data (or \"training\" data as it is called) along with its output. The idea is to develop a rule or a model that maps the sample output to the input. So, when the computer is presented with a new input, it can classify it into a particular category. For example, spam filtering is an instance where the learning algorithm is presented with emails messages labeled beforehand as \"spam\" or \"not spam\", to produce a computer program that labels unseen messages as either spam or not.\n", "\n", "Essentially we have a dataset consisting of both features and labels. The task is to construct an estimator which is able to predict the label of an object given the set of features. \n", "\n", "Supervised learning is further broken down into two categories, **classification** and **regression**. In classification, the label is discrete, while in regression, the label is continuous." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " `scikit-learn` supports a number of supervised learning algorithms, and we will try to elucidate using examples using the following:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- KNN (K Nearest Neighbours)\n", "- Linear Regression\n", "- Support Vector Machines\n" ] }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "Classification" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The figure shows a collection of two-dimensional data, colored according to two different class labels. 
A classification algorithm may be used to draw a dividing boundary between the two clusters of points:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![]()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**KNN Classification**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "K nearest neighbors (kNN) is one of the simplest learning strategies: given a new, unknown observation, look up the k samples in your reference database whose features are closest to it, and assign the predominant class among them." ] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn import neighbors, datasets\n", "\n", "iris = datasets.load_iris()\n", "X, y = iris.data, iris.target\n", "\n", "# create the model\n", "knn = neighbors.KNeighborsClassifier(n_neighbors=2)\n", "\n", "# fit the model\n", "knn.fit(X, y)\n", "\n", "# What kind of iris has 3cm x 5cm sepal and 4cm x 2cm petal?\n", "# call the \"predict\" method:\n", "result = knn.predict([[3, 5, 4, 2]])\n", "\n", "print(iris.target_names[result])" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Ordinary Least Squares (OLS)**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "OLS fits a linear model by choosing the coefficients $\\alpha = (\\alpha_1, \\alpha_2, \\ldots, \\alpha_n)$ that minimize the residual sum of squares between the observed responses in the dataset and the responses predicted by the linear approximation. 
Mathematically, it solves a problem of the form:\n", "$$\\min_{\\alpha} \\|X\\alpha - y\\|_2^2 \\text{.}$$" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn import linear_model\n", "clf = linear_model.LinearRegression()\n", "clf.fit([[0, 0], [1, 1], [2, 2]], [0, 1, 2])" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "clf.coef_" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "# Create some simple data\n", "import numpy as np\n", "from sklearn.linear_model import LinearRegression\n", "np.random.seed(0)\n", "X = np.random.random(size=(20, 1))\n", "y = 3 * X.squeeze() + 2 + np.random.normal(size=20)\n", "\n", "# Fit a linear regression to it\n", "model = LinearRegression(fit_intercept=True)\n", "model.fit(X, y)\n", "# coef_ is an array with one entry per feature, so take its single element\n", "print (\"Model coefficient: %.5f, and intercept: %.5f\"\n", "       % (model.coef_[0], model.intercept_))\n", "\n", "# Plot the data and the model prediction\n", "X_test = np.linspace(0, 1, 100)[:, np.newaxis]\n", "y_test = model.predict(X_test)\n", "\n", "plt.plot(X.squeeze(), y, 'x')\n", "plt.plot(X_test.squeeze(), y_test);\n" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The model has been learned from the training data and can be used to predict the result for test data: given an x value, the model lets us predict the corresponding y value."
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Support Vector Machines (SVM)**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Support vector machines (SVMs) are a set of supervised learning methods used for classification, regression and outlier detection.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Their main advantages are:\n", "\n", "- They are effective in high-dimensional spaces\n", "- They remain effective when the number of dimensions is greater than the number of samples\n", "- They are memory efficient, because the decision function uses only a subset of the training points (the support vectors)\n", "- They are versatile: different kernel functions can be specified for the decision function\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The support vector machines in `scikit-learn` support both dense (`numpy.ndarray`) and sparse (any `scipy.sparse`) sample vectors as input. However, to use an SVM to make predictions for sparse data, it must have been fit on such data." ] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn import svm\n", "clf = svm.LinearSVC()\n", "clf.fit(iris.data, iris.target)  # learn from the data" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "clf.predict([[5.0, 3.6, 1.3, 0.25]])" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "clf.coef_  # access the coefficients" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn import svm\n", "svc = svm.SVC(kernel='linear')\n", "svc = svc.fit(iris.data, iris.target)\n" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "svc.coef_" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "print(__doc__)\n", "\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "from sklearn import svm, datasets\n", "\n", "# import some data to play with\n", "iris = datasets.load_iris()\n", "X = 
iris.data[:, :2]  # we only take the first two features. We could\n", "                      # avoid this ugly slicing by using a two-dim dataset\n", "y = iris.target\n", "\n", "h = .02  # step size in the mesh\n", "\n", "# we create an instance of SVM and fit our data. We do not scale our\n", "# data since we want to plot the support vectors\n", "C = 1.0  # SVM regularization parameter\n", "svc = svm.SVC(kernel='linear', C=C).fit(X, y)\n", "rbf_svc = svm.SVC(kernel='rbf', gamma=0.7, C=C).fit(X, y)\n", "poly_svc = svm.SVC(kernel='poly', degree=3, C=C).fit(X, y)\n", "lin_svc = svm.LinearSVC(C=C).fit(X, y)\n", "\n", "# create a mesh to plot in\n", "x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1\n", "y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1\n", "xx, yy = np.meshgrid(np.arange(x_min, x_max, h),\n", "                     np.arange(y_min, y_max, h))\n", "\n", "# title for the plots\n", "titles = ['SVC with linear kernel',\n", "          'LinearSVC (linear kernel)',\n", "          'SVC with RBF kernel',\n", "          'SVC with polynomial (degree 3) kernel']\n", "\n", "mpl.rcParams['figure.figsize']=[12,12]\n", "for i, clf in enumerate((svc, lin_svc, rbf_svc, poly_svc)):\n", "    # Plot the decision boundary. 
For that, we will assign a color to each\n", "    # point in the mesh [x_min, x_max]x[y_min, y_max].\n", "    plt.subplot(2, 2, i + 1)\n", "    plt.subplots_adjust(wspace=0.4, hspace=0.4)\n", "\n", "    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])\n", "\n", "    # Put the result into a color plot\n", "    Z = Z.reshape(xx.shape)\n", "    plt.contourf(xx, yy, Z, cmap=plt.cm.Paired, alpha=0.8)\n", "\n", "    # Plot also the training points\n", "    plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Paired)\n", "    plt.xlabel('Sepal length')\n", "    plt.ylabel('Sepal width')\n", "    plt.xlim(xx.min(), xx.max())\n", "    plt.ylim(yy.min(), yy.max())\n", "    plt.xticks(())\n", "    plt.yticks(())\n", "    plt.title(titles[i])\n", "\n", "plt.show()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Unsupervised Learning" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In unsupervised learning, the idea is to find hidden patterns in the data. In a sense, you can think of unsupervised learning as a means of discovering labels from the data itself. Unsupervised learning comprises tasks such as dimensionality reduction, clustering, and density estimation.\n", "\n", "`scikit-learn` supports several approaches to unsupervised learning, such as:\n", "\n", "- K-means clustering\n", "- Principal Component Analysis\n", "- Neural networks (restricted Boltzmann machines)" ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Clustering" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Clustering is the task of grouping a set of objects so that the objects in each group are similar to one another in some way." ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Principal Component Analysis (PCA)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "PCA is widely used to reduce the dimensionality of data. 
It finds the directions along which the data varies the most (that is, the directions in which the data is not flat) and reduces the dimensionality of the data by projecting it onto the subspace spanned by those directions. Each direction is a linear combination of the original variables, chosen to exhibit the maximum variance." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![](http://shapeofdata.files.wordpress.com/2013/02/pca22.png)" ] }, { "cell_type": "code", "collapsed": false, "input": [ "mpl.rcParams['figure.figsize']=[12,5]\n", "np.random.seed(1)\n", "X = np.dot(np.random.random(size=(2, 2)), np.random.normal(size=(2, 200))).T\n", "plt.plot(X[:, 0], X[:, 1], 'og')\n", "plt.axis('equal')" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.decomposition import PCA\n", "pca = PCA(n_components=2)\n", "pca.fit(X)\n", "print(pca.explained_variance_)\n", "print(pca.components_)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "mpl.rcParams['figure.figsize']=[12,5]\n", "plt.plot(X[:, 0], X[:, 1], 'og', alpha=0.3)\n", "plt.axis('equal')\n", "# draw each principal axis, scaled by 3 standard deviations along it\n", "for length, vector in zip(pca.explained_variance_, pca.components_):\n", "    v = vector * 3 * np.sqrt(length)\n", "    plt.plot([0, v[0]], [0, v[1]], '-k', lw=3)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "clf = PCA(0.95)  # keep just enough components to explain 95% of the variance\n", "X_trans = clf.fit_transform(X)\n", "print(X.shape)\n", "print(X_trans.shape)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "X_new = clf.inverse_transform(X_trans)\n", "plt.plot(X[:, 0], X[:, 1], 'og', alpha=0.2)\n", "plt.plot(X_new[:, 0], X_new[:, 1], 'og', alpha=0.8)\n", "plt.axis('equal');" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "X, y = iris.data, iris.target\n", "from sklearn.decomposition import PCA\n", "pca = PCA(n_components=2)\n", 
"pca.fit(X)\n", "X_reduced = pca.transform(X)\n", "print (\"Reduced dataset shape:\", X_reduced.shape)\n", "\n", "import pylab as pl\n", "mpl.rcParams['figure.figsize']=[12,3]\n", "pl.scatter(X_reduced[:, 0], X_reduced[:, 1], c=y)\n", "\n", "print (\"Meaning of the 2 components:\")\n", "for component in pca.components_:\n", " print(\" + \".join(\"%.3f x %s\" % (value, name)\n", " for value, name in zip(component,\n", " iris.feature_names)))" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "K-means" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The simplest clustering algorithm is k-means. This divides a set into k clusters, assigning each observation to a cluster so as to minimize the distance of that observation (in n-dimensional space) to the cluster\u2019s mean; the means are then recomputed. This operation is run iteratively until the clusters converge, for a maximum for max_iter rounds." ] }, { "cell_type": "code", "collapsed": false, "input": [ "X, y = iris.data, iris.target\n", "from sklearn.cluster import KMeans\n", "k_means = KMeans(n_clusters=5, random_state=0) # Fixing the RNG in kmeans\n", "k_means.fit(X)\n", "y_pred = k_means.predict(X)\n", "\n", "pl.scatter(X_reduced[:, 0], X_reduced[:, 1], c=y_pred);\n" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "print(__doc__)\n", "\n", "\n", "# Code source: Ga\u00ebl Varoquaux\n", "# Modified for documentation by Jaques Grobler\n", "# License: BSD 3 clause\n", "\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "from mpl_toolkits.mplot3d import Axes3D\n", "\n", "\n", "from sklearn.cluster import KMeans\n", "from sklearn import datasets\n", "\n", "np.random.seed(5)\n", "\n", "centers = [[1, 1], [-1, -1], [1, -1]]\n", "iris = datasets.load_iris()\n", "X = iris.data\n", "y = iris.target\n", "\n", "estimators = {'k_means_iris_3': 
KMeans(n_clusters=3),\n", "              'k_means_iris_8': KMeans(n_clusters=8),\n", "              'k_means_iris_bad_init': KMeans(n_clusters=3, n_init=1,\n", "                                              init='random')}\n", "\n", "\n", "fignum = 1\n", "for name, est in estimators.items():\n", "    fig = plt.figure(fignum, figsize=(4, 3))\n", "    plt.clf()\n", "    # create the 3D axes through the figure so that it is attached to it\n", "    ax = fig.add_subplot(111, projection='3d', elev=48, azim=134)\n", "\n", "    plt.cla()\n", "    est.fit(X)\n", "    labels = est.labels_\n", "\n", "    # the np.float alias is gone from recent NumPy; the builtin float is equivalent\n", "    ax.scatter(X[:, 3], X[:, 0], X[:, 2], c=labels.astype(float))\n", "\n", "    ax.xaxis.set_ticklabels([])\n", "    ax.yaxis.set_ticklabels([])\n", "    ax.zaxis.set_ticklabels([])\n", "    ax.set_xlabel('Petal width')\n", "    ax.set_ylabel('Sepal length')\n", "    ax.set_zlabel('Petal length')\n", "    fignum = fignum + 1\n", "\n", "# Plot the ground truth\n", "fig = plt.figure(fignum, figsize=(4, 3))\n", "plt.clf()\n", "ax = fig.add_subplot(111, projection='3d', elev=48, azim=134)\n", "\n", "plt.cla()\n", "\n", "for name, label in [('Setosa', 0),\n", "                    ('Versicolour', 1),\n", "                    ('Virginica', 2)]:\n", "    ax.text3D(X[y == label, 3].mean(),\n", "              X[y == label, 0].mean() + 1.5,\n", "              X[y == label, 2].mean(), name,\n", "              horizontalalignment='center',\n", "              bbox=dict(alpha=.5, edgecolor='w', facecolor='w'))\n", "# Reorder the labels to have colors matching the cluster results\n", "y = np.choose(y, [1, 2, 0]).astype(float)\n", "ax.scatter(X[:, 3], X[:, 0], X[:, 2], c=y)\n", "\n", "ax.xaxis.set_ticklabels([])\n", "ax.yaxis.set_ticklabels([])\n", "ax.zaxis.set_ticklabels([])\n", "ax.set_xlabel('Petal width')\n", "ax.set_ylabel('Sepal length')\n", "ax.set_zlabel('Petal length')\n", "plt.show()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Neural Networks**\n", "\n", "Neural networks, or artificial neural networks (ANNs) as they are commonly known, are used in both machine learning and pattern recognition.\n", "\n", "For example, a neural network for handwriting recognition is defined by 
a set of input neurons which may be activated by the pixels of an input image. After being weighted and transformed by a function (determined by the network's designer), the activations of these neurons are then passed on to other neurons. This process is repeated until, finally, an output neuron is activated. This determines which character was read.\n", "\n", "Typically, an ANN is drawn with three layers: an input layer, a hidden layer, and an output layer. The hidden layer's job is to transform the inputs into something that the output layer can use. The output of a neuron is a function of the weighted sum of its inputs plus a bias.\n", "\n", "![](http://colah.github.io/posts/2014-03-NN-Manifolds-Topology/img/example_network.svg)\n", "\n", "A restricted Boltzmann machine (RBM) is a generative stochastic artificial neural network that can learn a probability distribution over its set of inputs. It is an example of an unsupervised neural network model that `scikit-learn` supports (as `sklearn.neural_network.BernoulliRBM`)." ] }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "Some Real World Examples" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Image compression using clustering**" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.datasets import load_sample_image\n", "c1 = load_sample_image(\"china.jpg\")\n", "plt.imshow(c1);" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.datasets import load_sample_image\n", "c = load_sample_image(\"china.jpg\")\n", "#plt.imshow(c)\n", "c = c.astype(np.float32)\n", "#plt.imshow(c)\n", "# flatten the image into a single column: one sample per channel value\n", "X = c.reshape((-1, 1))\n", "k_means = KMeans(n_clusters=5)\n", "k_means.fit(X)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "values = k_means.cluster_centers_.squeeze()\n", "labels = k_means.labels_\n", "c_compressed = np.choose(labels, values)\n", "c_compressed.shape = c.shape" 
], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "mpl.rcParams['figure.figsize']=[12,5.5]\n", "fig, (ax0, ax1) = plt.subplots(ncols=2, sharex=True)\n", "ax0.imshow(c)\n", "ax1.imshow(c_compressed) \n", "fig.show()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**K-means clustering on PCA reduced Data**" ] }, { "cell_type": "code", "collapsed": false, "input": [ "print(__doc__)\n", "\n", "from time import time\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "\n", "from sklearn import metrics\n", "from sklearn.cluster import KMeans\n", "from sklearn.datasets import load_digits\n", "from sklearn.decomposition import PCA\n", "from sklearn.preprocessing import scale\n", "\n", "np.random.seed(42)\n", "\n", "digits = load_digits()\n", "data = scale(digits.data)\n", "\n", "n_samples, n_features = data.shape\n", "n_digits = len(np.unique(digits.target))\n", "labels = digits.target\n", "\n", "sample_size = 300\n", "\n", "print(\"n_digits: %d, \\t n_samples %d, \\t n_features %d\"\n", " % (n_digits, n_samples, n_features))\n", "\n", "\n", "print(79 * '_')\n", "print('% 9s' % 'init'\n", " ' time inertia homo compl v-meas ARI AMI silhouette')\n", "\n", "\n", "def bench_k_means(estimator, name, data):\n", " t0 = time()\n", " estimator.fit(data)\n", " print('% 9s %.2fs %i %.3f %.3f %.3f %.3f %.3f %.3f'\n", " % (name, (time() - t0), estimator.inertia_,\n", " metrics.homogeneity_score(labels, estimator.labels_),\n", " metrics.completeness_score(labels, estimator.labels_),\n", " metrics.v_measure_score(labels, estimator.labels_),\n", " metrics.adjusted_rand_score(labels, estimator.labels_),\n", " metrics.adjusted_mutual_info_score(labels, estimator.labels_),\n", " metrics.silhouette_score(data, estimator.labels_,\n", " metric='euclidean',\n", " sample_size=sample_size)))\n", "\n", "bench_k_means(KMeans(init='k-means++', 
n_clusters=n_digits, n_init=10),\n", "              name=\"k-means++\", data=data)\n", "\n", "bench_k_means(KMeans(init='random', n_clusters=n_digits, n_init=10),\n", "              name=\"random\", data=data)\n", "\n", "# In this case the seeding of the centers is deterministic, hence we run the\n", "# k-means algorithm only once with n_init=1\n", "pca = PCA(n_components=n_digits).fit(data)\n", "bench_k_means(KMeans(init=pca.components_, n_clusters=n_digits, n_init=1),\n", "              name=\"PCA-based\",\n", "              data=data)\n", "print(79 * '_')\n", "\n", "###############################################################################\n", "# Visualize the results on PCA-reduced data\n", "\n", "reduced_data = PCA(n_components=2).fit_transform(data)\n", "kmeans = KMeans(init='k-means++', n_clusters=n_digits, n_init=10)\n", "kmeans.fit(reduced_data)\n", "\n", "# Step size of the mesh. Decrease to increase the quality of the VQ.\n", "h = .02\n", "\n", "# Plot the decision boundary. For that, we will assign a color to each\n", "# point in the mesh [x_min, x_max]x[y_min, y_max].\n", "x_min, x_max = reduced_data[:, 0].min() + 1, reduced_data[:, 0].max() - 1\n", "y_min, y_max = reduced_data[:, 1].min() + 1, reduced_data[:, 1].max() - 1\n", "xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))\n", "\n", "# Obtain labels for each point in mesh. 
Use last trained model.\n", "Z = kmeans.predict(np.c_[xx.ravel(), yy.ravel()])\n", "\n", "# Put the result into a color plot\n", "Z = Z.reshape(xx.shape)\n", "plt.figure(1)\n", "plt.clf()\n", "plt.imshow(Z, interpolation='nearest',\n", " extent=(xx.min(), xx.max(), yy.min(), yy.max()),\n", " cmap=plt.cm.Paired,\n", " aspect='auto', origin='lower')\n", "\n", "plt.plot(reduced_data[:, 0], reduced_data[:, 1], 'k.', markersize=2)\n", "# Plot the centroids as a white X\n", "centroids = kmeans.cluster_centers_\n", "plt.scatter(centroids[:, 0], centroids[:, 1],\n", " marker='x', s=169, linewidths=3,\n", " color='w', zorder=10)\n", "plt.title('K-means clustering on the digits dataset (PCA-reduced data)\\n'\n", " 'Centroids are marked with white cross')\n", "plt.xlim(x_min, x_max)\n", "plt.ylim(y_min, y_max)\n", "plt.xticks(())\n", "plt.yticks(())\n", "plt.show()\n" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 4, "metadata": {}, "source": [ "A clustering example from the real world" ] }, { "cell_type": "code", "collapsed": false, "input": [ "print(__doc__)\n", "\n", "# Author: Gael Varoquaux gael.varoquaux@normalesup.org\n", "# License: BSD 3 clause\n", "\n", "import datetime\n", "\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "from matplotlib import finance\n", "from matplotlib.collections import LineCollection\n", "\n", "from sklearn import cluster, covariance, manifold\n", "\n", "###############################################################################\n", "# Retrieve the data from Internet\n", "\n", "# Choose a time period reasonably calm (not too long ago so that we get\n", "# high-tech firms, and before the 2008 crash)\n", "d1 = datetime.datetime(2003, 1, 1)\n", "d2 = datetime.datetime(2010, 1, 1)\n", "\n", "# kraft symbol has now changed from KFT to MDLZ in yahoo\n", "symbol_dict = {\n", " 'TOT': 'Total',\n", " 'XOM': 'Exxon',\n", " 'CVX': 'Chevron',\n", " 'COP': 'ConocoPhillips',\n", " 'VLO': 
'Valero Energy',\n", "    'MSFT': 'Microsoft',\n", "    'IBM': 'IBM',\n", "    'TWX': 'Time Warner',\n", "    'CMCSA': 'Comcast',\n", "    'CVC': 'Cablevision',\n", "    'YHOO': 'Yahoo',\n", "    'DELL': 'Dell',\n", "    'HPQ': 'HP',\n", "    'AMZN': 'Amazon',\n", "    'TM': 'Toyota',\n", "    'CAJ': 'Canon',\n", "    'MTU': 'Mitsubishi',\n", "    'SNE': 'Sony',\n", "    'F': 'Ford',\n", "    'HMC': 'Honda',\n", "    'NAV': 'Navistar',\n", "    'NOC': 'Northrop Grumman',\n", "    'BA': 'Boeing',\n", "    'KO': 'Coca-Cola',\n", "    'MMM': '3M',\n", "    'MCD': \"McDonald's\",\n", "    'PEP': 'Pepsi',\n", "    'MDLZ': 'Kraft Foods',\n", "    'K': 'Kellogg',\n", "    'UN': 'Unilever',\n", "    'MAR': 'Marriott',\n", "    'PG': 'Procter & Gamble',\n", "    'CL': 'Colgate-Palmolive',\n", "    'GE': 'General Electric',\n", "    'WFC': 'Wells Fargo',\n", "    'JPM': 'JPMorgan Chase',\n", "    'AIG': 'AIG',\n", "    'AXP': 'American Express',\n", "    'BAC': 'Bank of America',\n", "    'GS': 'Goldman Sachs',\n", "    'AAPL': 'Apple',\n", "    'SAP': 'SAP',\n", "    'CSCO': 'Cisco',\n", "    'TXN': 'Texas Instruments',\n", "    'XRX': 'Xerox',\n", "    'LMT': 'Lockheed Martin',\n", "    'WMT': 'Wal-Mart',\n", "    'WAG': 'Walgreen',\n", "    'HD': 'Home Depot',\n", "    'GSK': 'GlaxoSmithKline',\n", "    'PFE': 'Pfizer',\n", "    'SNY': 'Sanofi-Aventis',\n", "    'NVS': 'Novartis',\n", "    'KMB': 'Kimberly-Clark',\n", "    'R': 'Ryder',\n", "    'GD': 'General Dynamics',\n", "    'RTN': 'Raytheon',\n", "    'CVS': 'CVS',\n", "    'CAT': 'Caterpillar',\n", "    'DD': 'DuPont de Nemours'}\n", "\n", "symbols, names = np.array(list(symbol_dict.items())).T\n", "\n", "quotes = [finance.quotes_historical_yahoo(symbol, d1, d2, asobject=True)\n", "          for symbol in symbols]\n", "\n", "# Named *_prices so that the built-in open() is not shadowed\n", "open_prices = np.array([q.open for q in quotes]).astype(float)\n", "close_prices = np.array([q.close for q in quotes]).astype(float)\n", "\n", "# The daily variations of the quotes are what carry most information\n", "variation = close_prices - open_prices\n", "\n", "###############################################################################\n", "# Learn a graphical structure from the 
correlations\n", "edge_model = covariance.GraphLassoCV()\n", "\n", "# standardize the time series: using correlations rather than covariance\n", "# is more efficient for structure recovery\n", "X = variation.copy().T\n", "X /= X.std(axis=0)\n", "edge_model.fit(X)\n", "\n", "###############################################################################\n", "# Cluster using affinity propagation\n", "\n", "_, labels = cluster.affinity_propagation(edge_model.covariance_)\n", "n_labels = labels.max()\n", "\n", "for i in range(n_labels + 1):\n", " print('Cluster %i: %s' % ((i + 1), ', '.join(names[labels == i])))\n", "\n", "###############################################################################\n", "# Find a low-dimension embedding for visualization: find the best position of\n", "# the nodes (the stocks) on a 2D plane\n", "\n", "# We use a dense eigen_solver to achieve reproducibility (arpack is\n", "# initiated with random vectors that we don't control). In addition, we\n", "# use a large number of neighbors to capture the large-scale structure.\n", "node_position_model = manifold.LocallyLinearEmbedding(\n", " n_components=2, eigen_solver='dense', n_neighbors=6)\n", "\n", "embedding = node_position_model.fit_transform(X.T).T\n", "\n", "###############################################################################\n", "# Visualization\n", "plt.figure(1, facecolor='w', figsize=(10, 8))\n", "plt.clf()\n", "ax = plt.axes([0., 0., 1., 1.])\n", "plt.axis('off')\n", "\n", "# Display a graph of the partial correlations\n", "partial_correlations = edge_model.precision_.copy()\n", "d = 1 / np.sqrt(np.diag(partial_correlations))\n", "partial_correlations *= d\n", "partial_correlations *= d[:, np.newaxis]\n", "non_zero = (np.abs(np.triu(partial_correlations, k=1)) > 0.02)\n", "\n", "# Plot the nodes using the coordinates of our embedding\n", "plt.scatter(embedding[0], embedding[1], s=100 * d ** 2, c=labels,\n", " cmap=plt.cm.spectral)\n", "\n", "# Plot the edges\n", 
"start_idx, end_idx = np.where(non_zero)\n", "# a sequence of (*line0*, *line1*, *line2*), where::\n", "# linen = (x0, y0), (x1, y1), ... (xm, ym)\n", "segments = [[embedding[:, start], embedding[:, stop]]\n", "            for start, stop in zip(start_idx, end_idx)]\n", "values = np.abs(partial_correlations[non_zero])\n", "lc = LineCollection(segments,\n", "                    zorder=0, cmap=plt.cm.hot_r,\n", "                    norm=plt.Normalize(0, .7 * values.max()))\n", "lc.set_array(values)\n", "lc.set_linewidths(15 * values)\n", "ax.add_collection(lc)\n", "\n", "# Add a label to each node. The challenge here is that we want to\n", "# position the labels to avoid overlap with other labels\n", "for index, (name, label, (x, y)) in enumerate(\n", "        zip(names, labels, embedding.T)):\n", "\n", "    dx = x - embedding[0]\n", "    dx[index] = 1\n", "    dy = y - embedding[1]\n", "    dy[index] = 1\n", "    this_dx = dx[np.argmin(np.abs(dy))]\n", "    this_dy = dy[np.argmin(np.abs(dx))]\n", "    if this_dx > 0:\n", "        horizontalalignment = 'left'\n", "        x = x + .002\n", "    else:\n", "        horizontalalignment = 'right'\n", "        x = x - .002\n", "    if this_dy > 0:\n", "        verticalalignment = 'bottom'\n", "        y = y + .002\n", "    else:\n", "        verticalalignment = 'top'\n", "        y = y - .002\n", "    plt.text(x, y, name, size=10,\n", "             horizontalalignment=horizontalalignment,\n", "             verticalalignment=verticalalignment,\n", "             bbox=dict(facecolor='w',\n", "                       edgecolor=plt.cm.spectral(label / float(n_labels)),\n", "                       alpha=.6))\n", "\n", "plt.xlim(embedding[0].min() - .15 * embedding[0].ptp(),\n", "         embedding[0].max() + .10 * embedding[0].ptp(),)\n", "plt.ylim(embedding[1].min() - .03 * embedding[1].ptp(),\n", "         embedding[1].max() + .03 * embedding[1].ptp())\n", "\n", "plt.show()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Digit Classification using Neural Networks**" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from __future__ import print_function\n", "\n", "print(__doc__)\n", "\n", 
"# Authors: Yann N. Dauphin, Vlad Niculae, Gabriel Synnaeve\n", "# License: BSD\n", "\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "\n", "from scipy.ndimage import convolve\n", "from sklearn import linear_model, datasets, metrics\n", "from sklearn.cross_validation import train_test_split\n", "from sklearn.neural_network import BernoulliRBM\n", "from sklearn.pipeline import Pipeline\n", "\n", "\n", "###############################################################################\n", "# Setting up\n", "\n", "def nudge_dataset(X, Y):\n", " \"\"\"\n", " This produces a dataset 5 times bigger than the original one,\n", " by moving the 8x8 images in X around by 1px to left, right, down, up\n", " \"\"\"\n", " direction_vectors = [\n", " [[0, 1, 0],\n", " [0, 0, 0],\n", " [0, 0, 0]],\n", "\n", " [[0, 0, 0],\n", " [1, 0, 0],\n", " [0, 0, 0]],\n", "\n", " [[0, 0, 0],\n", " [0, 0, 1],\n", " [0, 0, 0]],\n", "\n", " [[0, 0, 0],\n", " [0, 0, 0],\n", " [0, 1, 0]]]\n", "\n", " shift = lambda x, w: convolve(x.reshape((8, 8)), mode='constant',\n", " weights=w).ravel()\n", " X = np.concatenate([X] +\n", " [np.apply_along_axis(shift, 1, X, vector)\n", " for vector in direction_vectors])\n", " Y = np.concatenate([Y for _ in range(5)], axis=0)\n", " return X, Y\n", "\n", "# Load Data\n", "digits = datasets.load_digits()\n", "X = np.asarray(digits.data, 'float32')\n", "X, Y = nudge_dataset(X, digits.target)\n", "X = (X - np.min(X, 0)) / (np.max(X, 0) + 0.0001) # 0-1 scaling\n", "\n", "X_train, X_test, Y_train, Y_test = train_test_split(X, Y,\n", " test_size=0.2,\n", " random_state=0)\n", "\n", "# Models we will use\n", "logistic = linear_model.LogisticRegression()\n", "rbm = BernoulliRBM(random_state=0, verbose=True)\n", "\n", "classifier = Pipeline(steps=[('rbm', rbm), ('logistic', logistic)])\n", "\n", "###############################################################################\n", "# Training\n", "\n", "# Hyper-parameters. 
These were set by cross-validation,\n", "# using a GridSearchCV. Here we are not performing cross-validation to\n", "# save time.\n", "rbm.learning_rate = 0.06\n", "rbm.n_iter = 20\n", "# More components tend to give better prediction performance, but larger\n", "# fitting time\n", "rbm.n_components = 100\n", "logistic.C = 6000.0\n", "\n", "# Training RBM-Logistic Pipeline\n", "classifier.fit(X_train, Y_train)\n", "\n", "# Training Logistic regression\n", "logistic_classifier = linear_model.LogisticRegression(C=100.0)\n", "logistic_classifier.fit(X_train, Y_train)\n", "\n", "###############################################################################\n", "# Evaluation\n", "\n", "print()\n", "print(\"Logistic regression using RBM features:\\n%s\\n\" % (\n", " metrics.classification_report(\n", " Y_test,\n", " classifier.predict(X_test))))\n", "\n", "print(\"Logistic regression using raw pixel features:\\n%s\\n\" % (\n", " metrics.classification_report(\n", " Y_test,\n", " logistic_classifier.predict(X_test))))\n", "\n", "###############################################################################\n", "# Plotting\n", "\n", "plt.figure(figsize=(4.2, 4))\n", "for i, comp in enumerate(rbm.components_):\n", " plt.subplot(10, 10, i + 1)\n", " plt.imshow(comp.reshape((8, 8)), cmap=plt.cm.gray_r,\n", " interpolation='nearest')\n", " plt.xticks(())\n", " plt.yticks(())\n", "plt.suptitle('100 components extracted by RBM', fontsize=16)\n", "plt.subplots_adjust(0.08, 0.02, 0.92, 0.85, 0.08, 0.23)\n", "\n", "plt.show()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**References**\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- [scikit-learn Documentation](http://scikit-learn.org/stable/)\n", "- [Machine Learning using Astronomical Data](http://www.astroml.org/)\n", "- [scikit-learn tutorial](http://www.astroml.org/sklearn_tutorial/)\n", "- [Stock Market 
Example](http://scikit-learn.org/stable/auto_examples/applications/plot_stock_market.html#example-applications-plot-stock-market-py)\n", "- [PyCon14](http://nbviewer.ipython.org/github/jakevdp/sklearn_pycon2014/blob/master/notebooks/04_supervised_in_depth.ipynb)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## Credits\n", "\n", "Lakshmi Rao, Abhishek Sharma, and Neal Davis developed these materials for [Computational Science and Engineering](http://cse.illinois.edu/) at the University of Illinois at Urbana\u2013Champaign.\n", "\n", "\n", "This content is available under a [Creative Commons Attribution 3.0 Unported License](https://creativecommons.org/licenses/by/3.0/).\n", "\n", "[![](https://bytebucket.org/davis68/resources/raw/f7c98d2b95e961fae257707e22a58fa1a2c36bec/logos/baseline_cse_wdmk.png?token=be4cc41d4b2afe594f5b1570a3c5aad96a65f0d6)](http://cse.illinois.edu/)" ] } ], "metadata": {} } ] }