{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import matplotlib\n", "import matplotlib.pyplot as plt\n", "import numpy as np\n", "import os\n", "import pandas as pd\n", "import scipy.stats\n", "import sklearn.datasets\n", "import sklearn.preprocessing\n", "\n", "%matplotlib notebook\n", "\n", "plt.xkcd()\n", "\n", "None" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Singular Value Decomposition for Data Visualization\n", "\n", "## Displaying high-dimensional data using reduced-rank matrices\n", "\n", "A data visualization goes a long way in helping you understand the underlying dataset. If the data is highly dimensional, you can use Singular Value Decomposition (SVD) to find a reduced-rank approximation of the data that can be visualized easily." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Example 1: the Iris dataset\n", "\n", "We start off with the [Iris flower dataset](https://en.wikipedia.org/wiki/Iris_flower_data_set). The data is multivariate, with 150 measurements of 4 features (length and width cm of both sepal and petal) on 3 distinct Iris species. Of the 150 measurements, there are 50 measurements each for _Iris setosa_, _Iris versicolor_, and _Iris virginica_.\n", "\n", "[Scikit Learn's `datasets`](http://scikit-learn.org/stable/datasets/) includes the Iris dataset, so let's load that up and start exploring." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Iris dataset has 150 rows and 4 columns\n", "\n", "Here are the first 5 rows of the data:\n", "\n", " sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)\n", "0 5.1 3.5 1.4 0.2\n", "1 4.9 3.0 1.4 0.2\n", "2 4.7 3.2 1.3 0.2\n", "3 4.6 3.1 1.5 0.2\n", "4 5.0 3.6 1.4 0.2\n", "\n", "Some simple statistics on the Iris dataset:\n", "\n", " sepal length (cm) sepal width (cm) petal length (cm) \\\n", "count 150.000000 150.000000 150.000000 \n", "mean 5.843333 3.054000 3.758667 \n", "std 0.828066 0.433594 1.764420 \n", "min 4.300000 2.000000 1.000000 \n", "25% 5.100000 2.800000 1.600000 \n", "50% 5.800000 3.000000 4.350000 \n", "75% 6.400000 3.300000 5.100000 \n", "max 7.900000 4.400000 6.900000 \n", "\n", " petal width (cm) \n", "count 150.000000 \n", "mean 1.198667 \n", "std 0.763161 \n", "min 0.100000 \n", "25% 0.300000 \n", "50% 1.300000 \n", "75% 1.800000 \n", "max 2.500000 \n", "\n" ] } ], "source": [ "iris = sklearn.datasets.load_iris()\n", "\n", "df_iris = pd.DataFrame(iris.data, columns=iris.feature_names)\n", "\n", "print('Iris dataset has {} rows and {} columns\\n'.format(*df_iris.shape))\n", "\n", "print('Here are the first 5 rows of the data:\\n\\n{}\\n'.format(df_iris.head(5)))\n", "\n", "print('Some simple statistics on the Iris dataset:\\n\\n{}\\n'.format(df_iris.describe()))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we are exploring the dataset, it would be nice to view the data in order to get an idea of how the 3 species might be distributed with respect to one another in terms of their features. Perhaps we are interested in finding clusters, or maybe we would like to find a way to make class predictions?\n", "\n", "However, since the data has 4 dimensions, we would be hard-pressed to come up with a good way to graph the data in 4D that we could easily understand.\n", "\n", "_But what if we could reduce or compress the data so that we could work in 3 dimensions or less?_\n", "\n", "[Singular value decomposition](http://mathworld.wolfram.com/SingularValueDecomposition.html) lets us do just that." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Singular value decomposition\n", "\n", "Singular value decomposition factorizes an $\\mathbb{R}^{m \\times n}$ matrix $X$ into\n", "\n", "* matrix $U \\in \\mathbb{R}^{m \\times m}$ are the left-singular vectors of $X$, where columns are the set of orthonormal eigenvectors of $X \\, X^{\\intercal}$\n", "* diagonal matrix $\\Sigma$ with entries $\\sigma \\in \\mathbb{R}$ that are the non-negative singular values of $X$\n", "* matrix $V \\in \\mathbb{R}^{n \\times n}$ are the right-singular vectors $X$, where the columns are the set of orthonormal eigenvectors of $X^{\\intercal} \\, X$\n", "\n", "such that \n", "\n", "\\begin{align}\n", " X &= U \\, \\Sigma \\, V^{\\intercal}\n", "\\end{align}\n", "\n", "We can use [`numpy.linalg.svd`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.linalg.svd.html) to factorize the Iris data matrix into three components $U$, $\\Sigma$, and $V^{\\intercal}$." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": true }, "outputs": [], "source": [ "U_iris, S_iris, Vt_iris = np.linalg.svd(df_iris)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### $U$: left-singular vectors of $X$\n", "\n", "The rows of the $U$ correspond to the rows of original data matrix $X$, while the columns are the set of ordered, orthornormal eigenvectors of $X \\, X^{\\intercal}$." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "matrix U has 150 rows, 150 columns\n", "\n", " 0 1 2 3 4 5 6 \\\n", "0 -0.061617 0.129969 -0.000056 0.001058 -0.073179 -0.081571 -0.069303 \n", "1 -0.058072 0.111371 0.068439 0.052115 -0.127666 -0.142615 -0.129771 \n", "2 -0.056763 0.118295 0.002311 0.009078 -0.010367 0.012712 0.021947 \n", "3 -0.056654 0.105608 0.004218 -0.042215 -0.040903 -0.009878 -0.022402 \n", "4 -0.061230 0.131431 -0.033908 -0.033254 0.979924 -0.021636 -0.019800 \n", "\n", " 7 8 9 ... 140 141 142 \\\n", "0 -0.071247 -0.062021 -0.066137 ... -0.102203 -0.108392 -0.081006 \n", "1 -0.102197 -0.078402 -0.068525 ... 0.027873 0.029108 0.037220 \n", "2 -0.040583 -0.045632 -0.084427 ... 0.059291 -0.040847 0.089081 \n", "3 -0.043891 -0.037508 -0.067710 ... 0.176345 0.268409 0.019061 \n", "4 -0.016901 -0.013404 -0.013042 ... 0.001694 0.004298 -0.000120 \n", "\n", " 143 144 145 146 147 148 149 \n", "0 -0.097894 -0.104334 -0.103992 -0.088545 -0.093464 -0.096022 -0.080992 \n", "1 0.034247 0.003214 0.033050 0.082646 0.036676 -0.032446 0.012727 \n", "2 0.100846 0.113471 -0.011387 -0.062039 0.030825 0.195235 0.135568 \n", "3 0.067487 0.176562 0.233969 0.115344 0.085373 0.095596 -0.041733 \n", "4 -0.000255 -0.001583 0.003997 0.007958 0.001071 -0.007744 -0.004753 \n", "\n", "[5 rows x 150 columns]\n" ] } ], "source": [ "print('matrix U has {} rows, {} columns\\n'.format(*U_iris.shape))\n", "\n", "print('{}'.format(pd.DataFrame(U_iris).head(5)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### $V$: right-singular vectors of $X$\n", "\n", "[`numpy.linalg.svd`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.linalg.svd.html) actually returns $V^{\\intercal}$ instead of $V$, so it is the _columns_ of $V^{\\intercal}$ that correspond to the columns of original data matrix $X$. Hence, the _rows_ are the set of ordered, orthornormal eigenvectors of $X^{\\intercal} \\, X$." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "matrix Vt has 4 rows, 4 columns\n", "\n", " 0 1 2 3\n", "0 -0.751168 -0.379788 -0.513151 -0.167879\n", "1 0.285831 0.544890 -0.708899 -0.344758\n", "2 0.499424 -0.675025 -0.054720 -0.540299\n", "3 0.323455 -0.321243 -0.480775 0.749023\n" ] } ], "source": [ "print('matrix Vt has {} rows, {} columns\\n'.format(*Vt_iris.shape))\n", "\n", "print('{}'.format(pd.DataFrame(Vt_iris).head()))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### $\\Sigma$: singular values of $X$\n", "\n", "The elements $\\sigma_{i}$ of diagonal matrix $\\Sigma$ are the non-zero singular values of matrix $X$, which are really just the square roots of the non-zero eigenvalues of $X^{\\intercal} \\, X$ (and also for $X \\, X^{\\intercal}$). These singular values can be used to determine the amount of variance $X^{\\prime}$ explains of the original data matrix $X$ when reducing the dimensions to find a lower rank approximation.\n", "\n", "\\begin{align}\n", " X^{\\prime}_{k} &= U_{k} \\, \\Sigma_{k} \\, V^{\\intercal}_{k} \\\\\n", " &\\approx X_{r} & \\text{ where } rank(X^{\\prime}) = k \\lt rank(X) = r\n", "\\end{align}\n", "\n", "The amount of variance that the reduced rank approximation $X^{\\prime}_{k}$ explains of $X_{r}$ is\n", "\n", "\\begin{align}\n", " \\text{cum. variance explained} &= \\frac{\\sum_{j=1}^{k} \\sigma_{j}^{2}}{\\sum_{i=1}^{r} \\sigma_{i}^{2}}\n", "\\end{align}\n", "\n", "**NOTE**: [`numpy.linalg.svd`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.linalg.svd.html) actually returns a $\\Sigma$ that is not a diagonal _matrix_, but a _list_ of the entries on the diagonal." ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": true }, "outputs": [], "source": [ "num_sv_iris = np.arange(1, S_iris.size+1)\n", "\n", "cum_var_explained_iris = [np.sum(np.square(S_iris[0:n])) / np.sum(np.square(S_iris)) for n in num_sv_iris]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's have a look at the cumulative variance explained visually as a function of the number of singular values used when reducing rank to find a lower-ranked matrix $X^{\\prime}$ to approximate $X$. This will inform us as to how many dimensions we should use." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "application/javascript": [ "/* Put everything inside the global mpl namespace */\n", "window.mpl = {};\n", "\n", "\n", "mpl.get_websocket_type = function() {\n", " if (typeof(WebSocket) !== 'undefined') {\n", " return WebSocket;\n", " } else if (typeof(MozWebSocket) !== 'undefined') {\n", " return MozWebSocket;\n", " } else {\n", " alert('Your browser does not have WebSocket support.' +\n", " 'Please try Chrome, Safari or Firefox ≥ 6. ' +\n", " 'Firefox 4 and 5 are also supported but you ' +\n", " 'have to enable WebSockets in about:config.');\n", " };\n", "}\n", "\n", "mpl.figure = function(figure_id, websocket, ondownload, parent_element) {\n", " this.id = figure_id;\n", "\n", " this.ws = websocket;\n", "\n", " this.supports_binary = (this.ws.binaryType != undefined);\n", "\n", " if (!this.supports_binary) {\n", " var warnings = document.getElementById(\"mpl-warnings\");\n", " if (warnings) {\n", " warnings.style.display = 'block';\n", " warnings.textContent = (\n", " \"This browser does not support binary websocket messages. \" +\n", " \"Performance may be slow.\");\n", " }\n", " }\n", "\n", " this.imageObj = new Image();\n", "\n", " this.context = undefined;\n", " this.message = undefined;\n", " this.canvas = undefined;\n", " this.rubberband_canvas = undefined;\n", " this.rubberband_context = undefined;\n", " this.format_dropdown = undefined;\n", "\n", " this.image_mode = 'full';\n", "\n", " this.root = $('
');\n", " this._root_extra_style(this.root)\n", " this.root.attr('style', 'display: inline-block');\n", "\n", " $(parent_element).append(this.root);\n", "\n", " this._init_header(this);\n", " this._init_canvas(this);\n", " this._init_toolbar(this);\n", "\n", " var fig = this;\n", "\n", " this.waiting = false;\n", "\n", " this.ws.onopen = function () {\n", " fig.send_message(\"supports_binary\", {value: fig.supports_binary});\n", " fig.send_message(\"send_image_mode\", {});\n", " if (mpl.ratio != 1) {\n", " fig.send_message(\"set_dpi_ratio\", {'dpi_ratio': mpl.ratio});\n", " }\n", " fig.send_message(\"refresh\", {});\n", " }\n", "\n", " this.imageObj.onload = function() {\n", " if (fig.image_mode == 'full') {\n", " // Full images could contain transparency (where diff images\n", " // almost always do), so we need to clear the canvas so that\n", " // there is no ghosting.\n", " fig.context.clearRect(0, 0, fig.canvas.width, fig.canvas.height);\n", " }\n", " fig.context.drawImage(fig.imageObj, 0, 0);\n", " };\n", "\n", " this.imageObj.onunload = function() {\n", " fig.ws.close();\n", " }\n", "\n", " this.ws.onmessage = this._make_on_message_function(this);\n", "\n", " this.ondownload = ondownload;\n", "}\n", "\n", "mpl.figure.prototype._init_header = function() {\n", " var titlebar = $(\n", " '
');\n", " var titletext = $(\n", " '
');\n", " titlebar.append(titletext)\n", " this.root.append(titlebar);\n", " this.header = titletext[0];\n", "}\n", "\n", "\n", "\n", "mpl.figure.prototype._canvas_extra_style = function(canvas_div) {\n", "\n", "}\n", "\n", "\n", "mpl.figure.prototype._root_extra_style = function(canvas_div) {\n", "\n", "}\n", "\n", "mpl.figure.prototype._init_canvas = function() {\n", " var fig = this;\n", "\n", " var canvas_div = $('
');\n", "\n", " canvas_div.attr('style', 'position: relative; clear: both; outline: 0');\n", "\n", " function canvas_keyboard_event(event) {\n", " return fig.key_event(event, event['data']);\n", " }\n", "\n", " canvas_div.keydown('key_press', canvas_keyboard_event);\n", " canvas_div.keyup('key_release', canvas_keyboard_event);\n", " this.canvas_div = canvas_div\n", " this._canvas_extra_style(canvas_div)\n", " this.root.append(canvas_div);\n", "\n", " var canvas = $('');\n", " canvas.addClass('mpl-canvas');\n", " canvas.attr('style', \"left: 0; top: 0; z-index: 0; outline: 0\")\n", "\n", " this.canvas = canvas[0];\n", " this.context = canvas[0].getContext(\"2d\");\n", "\n", " var backingStore = this.context.backingStorePixelRatio ||\n", "\tthis.context.webkitBackingStorePixelRatio ||\n", "\tthis.context.mozBackingStorePixelRatio ||\n", "\tthis.context.msBackingStorePixelRatio ||\n", "\tthis.context.oBackingStorePixelRatio ||\n", "\tthis.context.backingStorePixelRatio || 1;\n", "\n", " mpl.ratio = (window.devicePixelRatio || 1) / backingStore;\n", "\n", " var rubberband = $('');\n", " rubberband.attr('style', \"position: absolute; left: 0; top: 0; z-index: 1;\")\n", "\n", " var pass_mouse_events = true;\n", "\n", " canvas_div.resizable({\n", " start: function(event, ui) {\n", " pass_mouse_events = false;\n", " },\n", " resize: function(event, ui) {\n", " fig.request_resize(ui.size.width, ui.size.height);\n", " },\n", " stop: function(event, ui) {\n", " pass_mouse_events = true;\n", " fig.request_resize(ui.size.width, ui.size.height);\n", " },\n", " });\n", "\n", " function mouse_event_fn(event) {\n", " if (pass_mouse_events)\n", " return fig.mouse_event(event, event['data']);\n", " }\n", "\n", " rubberband.mousedown('button_press', mouse_event_fn);\n", " rubberband.mouseup('button_release', mouse_event_fn);\n", " // Throttle sequential mouse events to 1 every 20ms.\n", " rubberband.mousemove('motion_notify', mouse_event_fn);\n", "\n", " rubberband.mouseenter('figure_enter', mouse_event_fn);\n", " rubberband.mouseleave('figure_leave', mouse_event_fn);\n", "\n", " canvas_div.on(\"wheel\", function (event) {\n", " event = event.originalEvent;\n", " event['data'] = 'scroll'\n", " if (event.deltaY < 0) {\n", " event.step = 1;\n", " } else {\n", " event.step = -1;\n", " }\n", " mouse_event_fn(event);\n", " });\n", "\n", " canvas_div.append(canvas);\n", " canvas_div.append(rubberband);\n", "\n", " this.rubberband = rubberband;\n", " this.rubberband_canvas = rubberband[0];\n", " this.rubberband_context = rubberband[0].getContext(\"2d\");\n", " this.rubberband_context.strokeStyle = \"#000000\";\n", "\n", " this._resize_canvas = function(width, height) {\n", " // Keep the size of the canvas, canvas container, and rubber band\n", " // canvas in synch.\n", " canvas_div.css('width', width)\n", " canvas_div.css('height', height)\n", "\n", " canvas.attr('width', width * mpl.ratio);\n", " canvas.attr('height', height * mpl.ratio);\n", " canvas.attr('style', 'width: ' + width + 'px; height: ' + height + 'px;');\n", "\n", " rubberband.attr('width', width);\n", " rubberband.attr('height', height);\n", " }\n", "\n", " // Set the figure to an initial 600x600px, this will subsequently be updated\n", " // upon first draw.\n", " this._resize_canvas(600, 600);\n", "\n", " // Disable right mouse context menu.\n", " $(this.rubberband_canvas).bind(\"contextmenu\",function(e){\n", " return false;\n", " });\n", "\n", " function set_focus () {\n", " canvas.focus();\n", " canvas_div.focus();\n", " }\n", "\n", " window.setTimeout(set_focus, 100);\n", "}\n", "\n", "mpl.figure.prototype._init_toolbar = function() {\n", " var fig = this;\n", "\n", " var nav_element = $('
')\n", " nav_element.attr('style', 'width: 100%');\n", " this.root.append(nav_element);\n", "\n", " // Define a callback function for later on.\n", " function toolbar_event(event) {\n", " return fig.toolbar_button_onclick(event['data']);\n", " }\n", " function toolbar_mouse_event(event) {\n", " return fig.toolbar_button_onmouseover(event['data']);\n", " }\n", "\n", " for(var toolbar_ind in mpl.toolbar_items) {\n", " var name = mpl.toolbar_items[toolbar_ind][0];\n", " var tooltip = mpl.toolbar_items[toolbar_ind][1];\n", " var image = mpl.toolbar_items[toolbar_ind][2];\n", " var method_name = mpl.toolbar_items[toolbar_ind][3];\n", "\n", " if (!name) {\n", " // put a spacer in here.\n", " continue;\n", " }\n", " var button = $('');\n", " button.click(method_name, toolbar_event);\n", " button.mouseover(tooltip, toolbar_mouse_event);\n", " nav_element.append(button);\n", " }\n", "\n", " // Add the status bar.\n", " var status_bar = $('');\n", " nav_element.append(status_bar);\n", " this.message = status_bar[0];\n", "\n", " // Add the close button to the window.\n", " var buttongrp = $('
');\n", " var button = $('');\n", " button.click(function (evt) { fig.handle_close(fig, {}); } );\n", " button.mouseover('Stop Interaction', toolbar_mouse_event);\n", " buttongrp.append(button);\n", " var titlebar = this.root.find($('.ui-dialog-titlebar'));\n", " titlebar.prepend(buttongrp);\n", "}\n", "\n", "mpl.figure.prototype._root_extra_style = function(el){\n", " var fig = this\n", " el.on(\"remove\", function(){\n", "\tfig.close_ws(fig, {});\n", " });\n", "}\n", "\n", "mpl.figure.prototype._canvas_extra_style = function(el){\n", " // this is important to make the div 'focusable\n", " el.attr('tabindex', 0)\n", " // reach out to IPython and tell the keyboard manager to turn it's self\n", " // off when our div gets focus\n", "\n", " // location in version 3\n", " if (IPython.notebook.keyboard_manager) {\n", " IPython.notebook.keyboard_manager.register_events(el);\n", " }\n", " else {\n", " // location in version 2\n", " IPython.keyboard_manager.register_events(el);\n", " }\n", "\n", "}\n", "\n", "mpl.figure.prototype._key_event_extra = function(event, name) {\n", " var manager = IPython.notebook.keyboard_manager;\n", " if (!manager)\n", " manager = IPython.keyboard_manager;\n", "\n", " // Check for shift+enter\n", " if (event.shiftKey && event.which == 13) {\n", " this.canvas_div.blur();\n", " event.shiftKey = false;\n", " // Send a \"J\" for go to next cell\n", " event.which = 74;\n", " event.keyCode = 74;\n", " manager.command_mode();\n", " manager.handle_keydown(event);\n", " }\n", "}\n", "\n", "mpl.figure.prototype.handle_save = function(fig, msg) {\n", " fig.ondownload(fig, null);\n", "}\n", "\n", "\n", "mpl.find_output_cell = function(html_output) {\n", " // Return the cell and output element which can be found *uniquely* in the notebook.\n", " // Note - this is a bit hacky, but it is done because the \"notebook_saving.Notebook\"\n", " // IPython event is triggered only after the cells have been serialised, which for\n", " // our purposes (turning an active figure into a static one), is too late.\n", " var cells = IPython.notebook.get_cells();\n", " var ncells = cells.length;\n", " for (var i=0; i= 3 moved mimebundle to data attribute of output\n", " data = data.data;\n", " }\n", " if (data['text/html'] == html_output) {\n", " return [cell, data, j];\n", " }\n", " }\n", " }\n", " }\n", "}\n", "\n", "// Register the function which deals with the matplotlib target/channel.\n", "// The kernel may be null if the page has been refreshed.\n", "if (IPython.notebook.kernel != null) {\n", " IPython.notebook.kernel.comm_manager.register_target('matplotlib', mpl.mpl_figure_comm);\n", "}\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "fig = plt.figure(figsize=(7.0,5.5))\n", "ax = fig.add_subplot(111)\n", "\n", "plt.scatter(setosa_x,\n", " setosa_y,\n", " marker='o',\n", " color='#66c2a5',\n", " label='Iris-setosa',\n", " zorder=1000)\n", "\n", "plt.scatter(versicolor_x,\n", " versicolor_y,\n", " marker='D',\n", " color='#fc8d62',\n", " label='Iris-versicolor',\n", " zorder=1000)\n", "\n", "plt.scatter(virginica_x,\n", " virginica_y,\n", " marker='^',\n", " color='#8da0cb',\n", " label='Iris-virginica',\n", " zorder=1000)\n", "\n", "plt.legend(loc='upper left', scatterpoints=1, fontsize=8)\n", "\n", "ax.set_xlabel(r'singular value $\\sigma_{1}$')\n", "ax.set_ylabel(r'singular value $\\sigma_{2}$')\n", "ax.set_title('2D plot of Iris dataset',\n", " fontsize=14,\n", " y=1.03)\n", "ax.set_facecolor('0.98')\n", "\n", "plt.grid(alpha=0.6, zorder=1)\n", "plt.tight_layout()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There!\n", "\n", "Now that we are viewing the originally 4D data with 2 dimensions using the first 2 singular value columns of the $U$ left singular vectors matrix, we can see that there should be a very clear separation for the _Iris setosa_ class and the others. On the other hand, the demarcation between _Iris versicolor_ and _Iris virginica_ might not be as clear cut.\n", "\n", "Nevertheless, since this 2D reduced-rank matrix representation $X^{\\prime}$ explains nearly 99.8% of the variance in the original dataset, we can be pretty certain that clustering and classification should be possible." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Example 2: Countries and Primary Languages\n", "\n", "### Or, What to do when your data is categorical?\n", "\n", "In the previous example, we were exploring the Iris dataset which is a matrix $X \\in \\mathbb{R}^{150 \\times 4}$. Singular value decomposition helped us to find a reduced-rank matrix $X^{\\prime} \\in \\mathbb{R}^{150 \\times 2}$ that accurately approximated the original data matrix and let us visualize the 4-dimensional data using 2 dimensions.\n", "\n", "Let's now consider another dataset where the values are not in $\\mathbb{R}$, but are _categorical_.\n", "\n", "For this example, we explore a fictional survey of 1000 respondents from each of five countries (Canada, USA, England, Italy and Switzerland), asking them what their primary language is (among English, French, Spanish, German and Italian). So in our data we have categories for both _country_ and _language_.\n", "\n", "We read in the data from file using [`pandas.dataframe.read_csv`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html)." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "data has 5000 measurements for 2 variables\n", "\n", "Here are the first 10 rows...\n", "\n", " country language\n", "0 2 5\n", "1 4 2\n", "2 2 3\n", "3 1 1\n", "4 5 5\n", "5 4 5\n", "6 5 4\n", "7 4 5\n", "8 5 4\n", "9 5 4\n", "...\n" ] } ], "source": [ "fin = os.path.join(os.getcwd(), \n", " 'data', \n", " 'country_language.csv')\n", "\n", "df_countries = pd.read_csv(fin, dtype='category')\n", "\n", "print(\"data has {} measurements for {} variables\\n\".format(*df_countries.shape))\n", "\n", "print(\"Here are the first 10 rows...\\n\\n{}\\n...\".format(df_countries.head(10)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Correspondence Analysis\n", "\n", "#### Contingency table $F$\n", "\n", "Our next step in exploring the data is to break out the data in terms of the 2 categories.\n", "\n", "Here we convert the raw observations into a [contingency table](http://onlinestatbook.com/2/chi_square/contingency.html) $F$, with the countries as rows and the languages as columns. [`pandas.crosstab`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.crosstab.html) will do just that." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " English French Spanish German Italian\n", "Canada 688 280 10 11 11\n", "USA 730 31 190 8 41\n", "England 798 74 38 31 59\n", "Italy 17 13 11 15 944\n", "Switzerland 15 222 20 648 95\n" ] } ], "source": [ "countries = ['Canada', 'USA', 'England', 'Italy', 'Switzerland']\n", "languages = ['English', 'French', 'Spanish', 'German', 'Italian']\n", "\n", "F = pd.crosstab(df_countries.country, df_countries.language, margins=False)\n", "F.index = countries\n", "F.columns = languages\n", "\n", "print(\"{}\".format(F))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now say that we are interested in seeing the relation between the countries and the languages.\n", "\n", "However, we cannot blindly apply singular value decomposition to contingency table $F$ above.\n", "\n", "Since we are working with 2 distinct categories, we can apply [correspondence analysis](http://www.mathematica-journal.com/2010/09/an-introduction-to-correspondence-analysis/) to transform the contingency table into a form where we can use singular value decomposition. At that point, we should be able to find a reduced-rank matrix that approximates the original data, and that in turn would let us graphically represent the the relations beween countries and languages.\n", "\n", "The idea is to use the $\\chi^{2}$ distance between rows and columns as our basis for singular value decomposition, as the [$\\chi^{2}$ distribution](https://en.wikipedia.org/wiki/Chi-squared_distribution) lets us calculate the independence of qualitative variables." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Correspondence matrix $P$\n", "\n", "We start by first calculating the correspondence matrix $P$ where\n", "\n", "\\begin{align}\n", " P &= \\left[ \\frac{f_{ij}}{\\sum_{i=1}^{I} \\sum_{j=1}^{J} f_{ij}} \\right] \\text{ for } f_{ij} \\in F\n", "\\end{align}" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "correspondence matrix P:\n", "\n", " English French Spanish German Italian\n", "Canada 0.1376 0.0560 0.0020 0.0022 0.0022\n", "USA 0.1460 0.0062 0.0380 0.0016 0.0082\n", "England 0.1596 0.0148 0.0076 0.0062 0.0118\n", "Italy 0.0034 0.0026 0.0022 0.0030 0.1888\n", "Switzerland 0.0030 0.0444 0.0040 0.1296 0.0190\n" ] } ], "source": [ "P = F / F.sum().sum()\n", "\n", "print('correspondence matrix P:\\n\\n{}'.format(P))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Row centroid $p_{i+}$\n", "\n", "Using correspondence matrix $P$, we next obtain the _row centroid_ $p_{i+}$. The row centroid can be interpreted as the marginal frequency distribution over the sum of the countries (rows), and reflects the fact that there were equally 1000 respondents per country in our fictional study.\n", "\n", "The row centroid $p_{i+}$ is derived from correspondence matrix $P$ with\n", "\n", "\\begin{align}\n", " p_{i+} &= \\sum_{j=1}^{J} p_{ij}\n", "\\end{align}" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "row centroid (marginal frequency distribution over countries):\n", "\n", "Canada 0.2\n", "USA 0.2\n", "England 0.2\n", "Italy 0.2\n", "Switzerland 0.2\n", "dtype: float64\n" ] } ], "source": [ "row_centroid = P.sum(axis=1)\n", "\n", "print('row centroid (marginal frequency distribution over countries):\\n\\n{}'.format(row_centroid))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Column centroid $p_{+j}$\n", "\n", "Similarly, we obtain the _column centroid_ $p_{+j}$ from $P$. The column centroid can be interpreted as the marginal frequency distribution over the sum of the languages (columns).\n", "\n", "The column centroid $p_{+j}$ is derived from correspondence matrix $P$ with\n", "\n", "\\begin{align}\n", " p_{+j} &= \\sum_{i=1}^{I} p_{ij}\n", "\\end{align}" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "column centroid (marginal frequency distribution over languages):\n", "\n", "English 0.4496\n", "French 0.1240\n", "Spanish 0.0538\n", "German 0.1426\n", "Italian 0.2300\n", "dtype: float64\n" ] } ], "source": [ "col_centroid = P.sum(axis=0)\n", "\n", "print('column centroid (marginal frequency distribution over languages):\\n\\n{}'.format(col_centroid))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### $\\chi^{2}$ distances between countries and languages\n", "\n", "So rather than using the contingency table $F$ as the basis for singular value decomposition, we can look at the $\\chi^{2}$ distances between rows and columns for visualizing the relation between countries and languages.\n", "\n", "The $\\chi^{2}$ statistic is given by\n", "\n", "\\begin{align}\n", " \\chi^{2} &= \\sum_{i=1}^{I} \\sum_{j=1}^{J} \\frac{(O_{ij} - E_{ij})^{2}}{E_{ij}} \\\\\n", " &= N \\, \\sum_{i=1}^{I} \\sum_{j=1}^{J} \\frac{(p_{ij} - \\mu_{ij})^{2}}{\\mu_{ij}} \\\\\n", " &= N \\, \\Lambda^{2} \\\\\n", " \\\\\n", " \\Rightarrow \\Lambda^{2} &= \\frac{\\chi^{2}}{N} \\\\\n", " \\Lambda &= \\left[ \\frac{p_{ij} - \\mu_{ij}}{\\sqrt{\\mu_{ij}}} \\right] = \\left[ \\frac{p_{ij} - p_{i+}\\,p_{+j}}{\\sqrt{p_{i+}\\,p_{+j}}} \\right]\n", "\\end{align}\n", "\n", "Inertia matrix $\\Lambda$ measures the distribution of the individual profiles (rows/columns) around the average profile (centroids). Thus inertia represents the observed deviation from independence.\n", "\n", "Through its relation with the statistic $\\chi^{2}$, inertia $\\Lambda$ (a matrix of _standardized residuals_) provides the basis for using singular value decomposition." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "inertia Lambda:\n", "\n", " English French Spanish German Italian\n", "Canada 0.159004 0.19812 -0.084450 -0.155852 -0.204219\n", "USA 0.187016 -0.11811 0.262604 -0.159404 -0.176243\n", "England 0.232370 -0.06350 -0.030464 -0.132166 -0.159458\n", "Italy -0.288528 -0.14097 -0.082522 -0.151114 0.665808\n", "Switzerland -0.289862 0.12446 -0.065169 0.598536 -0.125888\n" ] } ], "source": [ "Mu_ij = row_centroid.values.reshape((P.index.size,1)) * col_centroid.values.reshape((1,P.columns.size))\n", "\n", "Lambda = (P - Mu_ij) / np.sqrt(Mu_ij)\n", "\n", "print('inertia Lambda:\\n\\n{}'.format(Lambda))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Factorizing inertia matrix $\\Lambda$ with singular value decomposition\n", "\n", "Now that we have transformed contingency table $F$ into inertia matrix $\\Lambda$ where element $\\lambda_{ij} \\in \\mathbb{R}$, we can use singular value decomposition to factorize $\\Lambda$ instead of the original raw data matrix." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "U_lambda, S_lambda, Vt_lambda = np.linalg.svd(Lambda)\n", "\n", "num_sv_lambda = np.arange(1, S_lambda.size+1)\n", "\n", "cum_var_explained_lambda = [np.sum(np.square(S_lambda[0:n])) / np.sum(np.square(S_lambda)) for n in num_sv_lambda]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Once again, we look at the cumulative variance explained visually as a function of the number of singular values used when reducing rank to find a lower-ranked inertia matrix $\\Lambda^{\\prime}$ to approximate $\\Lambda$. " ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "application/javascript": [ "/* Put everything inside the global mpl namespace */\n", "window.mpl = {};\n", "\n", "\n", "mpl.get_websocket_type = function() {\n", " if (typeof(WebSocket) !== 'undefined') {\n", " return WebSocket;\n", " } else if (typeof(MozWebSocket) !== 'undefined') {\n", " return MozWebSocket;\n", " } else {\n", " alert('Your browser does not have WebSocket support.' +\n", " 'Please try Chrome, Safari or Firefox ≥ 6. ' +\n", " 'Firefox 4 and 5 are also supported but you ' +\n", " 'have to enable WebSockets in about:config.');\n", " };\n", "}\n", "\n", "mpl.figure = function(figure_id, websocket, ondownload, parent_element) {\n", " this.id = figure_id;\n", "\n", " this.ws = websocket;\n", "\n", " this.supports_binary = (this.ws.binaryType != undefined);\n", "\n", " if (!this.supports_binary) {\n", " var warnings = document.getElementById(\"mpl-warnings\");\n", " if (warnings) {\n", " warnings.style.display = 'block';\n", " warnings.textContent = (\n", " \"This browser does not support binary websocket messages. \" +\n", " \"Performance may be slow.\");\n", " }\n", " }\n", "\n", " this.imageObj = new Image();\n", "\n", " this.context = undefined;\n", " this.message = undefined;\n", " this.canvas = undefined;\n", " this.rubberband_canvas = undefined;\n", " this.rubberband_context = undefined;\n", " this.format_dropdown = undefined;\n", "\n", " this.image_mode = 'full';\n", "\n", " this.root = $('
');\n", " this._root_extra_style(this.root)\n", " this.root.attr('style', 'display: inline-block');\n", "\n", " $(parent_element).append(this.root);\n", "\n", " this._init_header(this);\n", " this._init_canvas(this);\n", " this._init_toolbar(this);\n", "\n", " var fig = this;\n", "\n", " this.waiting = false;\n", "\n", " this.ws.onopen = function () {\n", " fig.send_message(\"supports_binary\", {value: fig.supports_binary});\n", " fig.send_message(\"send_image_mode\", {});\n", " if (mpl.ratio != 1) {\n", " fig.send_message(\"set_dpi_ratio\", {'dpi_ratio': mpl.ratio});\n", " }\n", " fig.send_message(\"refresh\", {});\n", " }\n", "\n", " this.imageObj.onload = function() {\n", " if (fig.image_mode == 'full') {\n", " // Full images could contain transparency (where diff images\n", " // almost always do), so we need to clear the canvas so that\n", " // there is no ghosting.\n", " fig.context.clearRect(0, 0, fig.canvas.width, fig.canvas.height);\n", " }\n", " fig.context.drawImage(fig.imageObj, 0, 0);\n", " };\n", "\n", " this.imageObj.onunload = function() {\n", " fig.ws.close();\n", " }\n", "\n", " this.ws.onmessage = this._make_on_message_function(this);\n", "\n", " this.ondownload = ondownload;\n", "}\n", "\n", "mpl.figure.prototype._init_header = function() {\n", " var titlebar = $(\n", " '
');\n", " var titletext = $(\n", " '
');\n", " titlebar.append(titletext)\n", " this.root.append(titlebar);\n", " this.header = titletext[0];\n", "}\n", "\n", "\n", "\n", "mpl.figure.prototype._canvas_extra_style = function(canvas_div) {\n", "\n", "}\n", "\n", "\n", "mpl.figure.prototype._root_extra_style = function(canvas_div) {\n", "\n", "}\n", "\n", "mpl.figure.prototype._init_canvas = function() {\n", " var fig = this;\n", "\n", " var canvas_div = $('
');\n", "\n", " canvas_div.attr('style', 'position: relative; clear: both; outline: 0');\n", "\n", " function canvas_keyboard_event(event) {\n", " return fig.key_event(event, event['data']);\n", " }\n", "\n", " canvas_div.keydown('key_press', canvas_keyboard_event);\n", " canvas_div.keyup('key_release', canvas_keyboard_event);\n", " this.canvas_div = canvas_div\n", " this._canvas_extra_style(canvas_div)\n", " this.root.append(canvas_div);\n", "\n", " var canvas = $('');\n", " canvas.addClass('mpl-canvas');\n", " canvas.attr('style', \"left: 0; top: 0; z-index: 0; outline: 0\")\n", "\n", " this.canvas = canvas[0];\n", " this.context = canvas[0].getContext(\"2d\");\n", "\n", " var backingStore = this.context.backingStorePixelRatio ||\n", "\tthis.context.webkitBackingStorePixelRatio ||\n", "\tthis.context.mozBackingStorePixelRatio ||\n", "\tthis.context.msBackingStorePixelRatio ||\n", "\tthis.context.oBackingStorePixelRatio ||\n", "\tthis.context.backingStorePixelRatio || 1;\n", "\n", " mpl.ratio = (window.devicePixelRatio || 1) / backingStore;\n", "\n", " var rubberband = $('');\n", " rubberband.attr('style', \"position: absolute; left: 0; top: 0; z-index: 1;\")\n", "\n", " var pass_mouse_events = true;\n", "\n", " canvas_div.resizable({\n", " start: function(event, ui) {\n", " pass_mouse_events = false;\n", " },\n", " resize: function(event, ui) {\n", " fig.request_resize(ui.size.width, ui.size.height);\n", " },\n", " stop: function(event, ui) {\n", " pass_mouse_events = true;\n", " fig.request_resize(ui.size.width, ui.size.height);\n", " },\n", " });\n", "\n", " function mouse_event_fn(event) {\n", " if (pass_mouse_events)\n", " return fig.mouse_event(event, event['data']);\n", " }\n", "\n", " rubberband.mousedown('button_press', mouse_event_fn);\n", " rubberband.mouseup('button_release', mouse_event_fn);\n", " // Throttle sequential mouse events to 1 every 20ms.\n", " rubberband.mousemove('motion_notify', mouse_event_fn);\n", "\n", " rubberband.mouseenter('figure_enter', mouse_event_fn);\n", " rubberband.mouseleave('figure_leave', mouse_event_fn);\n", "\n", " canvas_div.on(\"wheel\", function (event) {\n", " event = event.originalEvent;\n", " event['data'] = 'scroll'\n", " if (event.deltaY < 0) {\n", " event.step = 1;\n", " } else {\n", " event.step = -1;\n", " }\n", " mouse_event_fn(event);\n", " });\n", "\n", " canvas_div.append(canvas);\n", " canvas_div.append(rubberband);\n", "\n", " this.rubberband = rubberband;\n", " this.rubberband_canvas = rubberband[0];\n", " this.rubberband_context = rubberband[0].getContext(\"2d\");\n", " this.rubberband_context.strokeStyle = \"#000000\";\n", "\n", " this._resize_canvas = function(width, height) {\n", " // Keep the size of the canvas, canvas container, and rubber band\n", " // canvas in synch.\n", " canvas_div.css('width', width)\n", " canvas_div.css('height', height)\n", "\n", " canvas.attr('width', width * mpl.ratio);\n", " canvas.attr('height', height * mpl.ratio);\n", " canvas.attr('style', 'width: ' + width + 'px; height: ' + height + 'px;');\n", "\n", " rubberband.attr('width', width);\n", " rubberband.attr('height', height);\n", " }\n", "\n", " // Set the figure to an initial 600x600px, this will subsequently be updated\n", " // upon first draw.\n", " this._resize_canvas(600, 600);\n", "\n", " // Disable right mouse context menu.\n", " $(this.rubberband_canvas).bind(\"contextmenu\",function(e){\n", " return false;\n", " });\n", "\n", " function set_focus () {\n", " canvas.focus();\n", " canvas_div.focus();\n", " }\n", "\n", " window.setTimeout(set_focus, 100);\n", "}\n", "\n", "mpl.figure.prototype._init_toolbar = function() {\n", " var fig = this;\n", "\n", " var nav_element = $('
')\n", " nav_element.attr('style', 'width: 100%');\n", " this.root.append(nav_element);\n", "\n", " // Define a callback function for later on.\n", " function toolbar_event(event) {\n", " return fig.toolbar_button_onclick(event['data']);\n", " }\n", " function toolbar_mouse_event(event) {\n", " return fig.toolbar_button_onmouseover(event['data']);\n", " }\n", "\n", " for(var toolbar_ind in mpl.toolbar_items) {\n", " var name = mpl.toolbar_items[toolbar_ind][0];\n", " var tooltip = mpl.toolbar_items[toolbar_ind][1];\n", " var image = mpl.toolbar_items[toolbar_ind][2];\n", " var method_name = mpl.toolbar_items[toolbar_ind][3];\n", "\n", " if (!name) {\n", " // put a spacer in here.\n", " continue;\n", " }\n", " var button = $('');\n", " button.click(method_name, toolbar_event);\n", " button.mouseover(tooltip, toolbar_mouse_event);\n", " nav_element.append(button);\n", " }\n", "\n", " // Add the status bar.\n", " var status_bar = $('');\n", " nav_element.append(status_bar);\n", " this.message = status_bar[0];\n", "\n", " // Add the close button to the window.\n", " var buttongrp = $('
');\n", " var button = $('');\n", " button.click(function (evt) { fig.handle_close(fig, {}); } );\n", " button.mouseover('Stop Interaction', toolbar_mouse_event);\n", " buttongrp.append(button);\n", " var titlebar = this.root.find($('.ui-dialog-titlebar'));\n", " titlebar.prepend(buttongrp);\n", "}\n", "\n", "mpl.figure.prototype._root_extra_style = function(el){\n", " var fig = this\n", " el.on(\"remove\", function(){\n", "\tfig.close_ws(fig, {});\n", " });\n", "}\n", "\n", "mpl.figure.prototype._canvas_extra_style = function(el){\n", " // this is important to make the div 'focusable\n", " el.attr('tabindex', 0)\n", " // reach out to IPython and tell the keyboard manager to turn it's self\n", " // off when our div gets focus\n", "\n", " // location in version 3\n", " if (IPython.notebook.keyboard_manager) {\n", " IPython.notebook.keyboard_manager.register_events(el);\n", " }\n", " else {\n", " // location in version 2\n", " IPython.keyboard_manager.register_events(el);\n", " }\n", "\n", "}\n", "\n", "mpl.figure.prototype._key_event_extra = function(event, name) {\n", " var manager = IPython.notebook.keyboard_manager;\n", " if (!manager)\n", " manager = IPython.keyboard_manager;\n", "\n", " // Check for shift+enter\n", " if (event.shiftKey && event.which == 13) {\n", " this.canvas_div.blur();\n", " event.shiftKey = false;\n", " // Send a \"J\" for go to next cell\n", " event.which = 74;\n", " event.keyCode = 74;\n", " manager.command_mode();\n", " manager.handle_keydown(event);\n", " }\n", "}\n", "\n", "mpl.figure.prototype.handle_save = function(fig, msg) {\n", " fig.ondownload(fig, null);\n", "}\n", "\n", "\n", "mpl.find_output_cell = function(html_output) {\n", " // Return the cell and output element which can be found *uniquely* in the notebook.\n", " // Note - this is a bit hacky, but it is done because the \"notebook_saving.Notebook\"\n", " // IPython event is triggered only after the cells have been serialised, which for\n", " // our purposes (turning an active figure into a static one), is too late.\n", " var cells = IPython.notebook.get_cells();\n", " var ncells = cells.length;\n", " for (var i=0; i= 3 moved mimebundle to data attribute of output\n", " data = data.data;\n", " }\n", " if (data['text/html'] == html_output) {\n", " return [cell, data, j];\n", " }\n", " }\n", " }\n", " }\n", "}\n", "\n", "// Register the function which deals with the matplotlib target/channel.\n", "// The kernel may be null if the page has been refreshed.\n", "if (IPython.notebook.kernel != null) {\n", " IPython.notebook.kernel.comm_manager.register_target('matplotlib', mpl.mpl_figure_comm);\n", "}\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "import pylab\n", "from mpl_toolkits.mplot3d import Axes3D, proj3d\n", "\n", "fig = pylab.figure(figsize=(7.5,5.5))\n", "ax = fig.add_subplot(111, projection='3d')\n", "\n", "ax.scatter(country_x, country_y, country_z, marker='s', s=50, c='#2171b5')\n", "cntry_labels = []\n", "for i,(x,y,z) in enumerate(zip(country_x, country_y, country_z)):\n", " x2, y2, _ = proj3d.proj_transform(x,y,z, ax.get_proj())\n", " \n", " label = pylab.annotate(Lambda.index[i],\n", " xy=(x2,y2),\n", " xytext=(-2,2),\n", " textcoords='offset points',\n", " ha='right',\n", " va='bottom',\n", " color='#2171b5',\n", " alpha=0.9)\n", " cntry_labels.append(label)\n", "\n", "ax.scatter(lang_x, lang_y, lang_z, marker='o', s=50, c='#fc4e2a')\n", "lang_labels = []\n", "for i,(x,y,z) in enumerate(zip(lang_x, lang_y, lang_z)):\n", " x2, y2, _ = proj3d.proj_transform(x,y,z, ax.get_proj())\n", " \n", " label = pylab.annotate(Lambda.columns[i],\n", " xy=(x2,y2),\n", " xytext=(-2,2),\n", " textcoords='offset points',\n", " ha='right',\n", " va='bottom',\n", " color='#fc4e2a',\n", " alpha=0.4)\n", " lang_labels.append(label)\n", "\n", "def update_position(e):\n", " for i,(x,y,z) in enumerate(zip(country_x, country_y, country_z)):\n", " x2, y2, _ = proj3d.proj_transform(x,y,z, ax.get_proj())\n", " cntry_labels[i].xy = x2, y2\n", " for i,(x,y,z) in enumerate(zip(lang_x, lang_y, lang_z)):\n", " x2, y2, _ = proj3d.proj_transform(x,y,z, ax.get_proj())\n", " lang_labels[i].xy = x2, y2\n", " fig.canvas.draw()\n", "fig.canvas.mpl_connect('button_release_event', update_position)\n", "\n", "ax.set_xlabel(r'singular value $\\sigma_{1}$')\n", "ax.set_xticks([-0.5, 0.0, 0.5])\n", "ax.set_ylabel(r'singular value $\\sigma_{2}$')\n", "ax.set_yticks([-0.5, 0.0, 0.4])\n", "ax.set_zlabel(r'singular value $\\sigma_{3}$')\n", "ax.set_zticks([-0.5, 0.0, 0.5])\n", "ax.set_title('3D plot of Countries/Languages dataset',\n", " fontsize=16,\n", " y=1.1)\n", "\n", "plt.tight_layout()\n", "\n", "pylab.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And there we are!\n", "\n", "You can see how the anglophone countries Canada, USA and England are in close proximity of English and with each other, with Spanish being close to the USA while French is closer to Canada. German is close to Switzerland, with French somewhat in proximity. Italian, however, is out close to Italy, both being located largely in isolation from the other countries and languages.\n", "\n", "This should match up with your intuition from contingency table $F$." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Helpful Resources\n", "\n", "* [Making sense of principal component analysis, eigenvectors & eigenvalues](https://stats.stackexchange.com/questions/2691/making-sense-of-principal-component-analysis-eigenvectors-eigenvalues/140579#140579)\n", "* [Correspondence analysis is a useful tool to uncover the relationships among categorical variables](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3718710/)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Appendix A\n", "\n", "## PCA and SVD\n", "\n", "Principal components analysis (PCA) and singular value decomposition are closely related, and you may often hear both these terms used in the same breath. \n", "\n", "Here is a quick mathematical treatment explaining how PCA and SVD are related.\n", "\n", "Consider data matrix $X \\in \\mathbb{R}^{m \\times n}$ where $m > n$, and all $x_{ij}$ are centered about the column means. With principal components analysis, we have\n", "\n", "\\begin{align}\n", " \\text{covariance matrix } C &= \\frac{1}{m} \\, X^{\\intercal} \\, X & \\text{from PCA} \\\\\n", " &= \\frac{1}{m} \\, V \\, \\Gamma \\, V^{\\intercal} & \\text{by eigendecomposition of } X^{\\intercal} \\, X \\\\\n", " \\\\\n", " \\text{ but } X &= U \\, \\Sigma V^{\\intercal} & \\text{from SVD} \\\\\n", " \\\\\n", " \\Rightarrow C &= \\frac{1}{m} \\, V \\, \\Sigma \\, U^{\\intercal} \\, U \\, \\Sigma V^{\\intercal} \\\\\n", " &= \\frac{1}{m} \\, V \\, \\Sigma^{2} \\, V^{\\intercal} \\\\\n", "\\end{align}\n", "\n", "So we see that:\n", "\n", "1. the singular values in $\\Sigma$ obtained via SVD are really just the square roots of the eigenvalues in $\\Gamma$ of the covariance matrix in PCA.\n", "1. if you mean-center your raw data matrix $X$ and then calculate SVD, you are doing the same thing as PCA.\n", "1. the above example shows covariance of $X$ with respect to its columns ($X^{\\intercal} \\, X$); it also applies for covariance of $X$ with respect to rows ($X \\, X^{\\intercal}$).\n", "\n", "#### Iris dataset: PCA & SVD" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "from sklearn.decomposition import PCA\n", "\n", "pca = PCA()\n", "pca.fit(iris.data)\n", "\n", "# don't forget to mean-center the data before SVD\n", "X = iris.data - np.mean(iris.data, axis=0)\n", "U, S, Vt = np.linalg.svd(X)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Compare the eigenvalues of $\\Gamma$ derived from PCA with the singular values from $\\Sigma$ derived with SVD: $\\Gamma = \\Sigma^{2}$?" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "eigenvalues from PCA:\n", "[ 633.72611525 36.33653574 11.77858621 3.55245407]\n", "\n", "squared singular values from SVD:\n", "[ 629.50127448 36.09429217 11.70006231 3.52877104]\n" ] } ], "source": [ "Cov_pca = pca.get_covariance()\n", "\n", "print('eigenvalues from PCA:\\n{}\\n'.format(np.linalg.eigvals(Cov_pca * X.shape[0])))\n", "\n", "print('squared singular values from SVD:\\n{}'.format(np.square(S)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Can we obtain the covariance matrix $C$ derived with PCA, but using $\\frac{1}{m} \\, V \\, \\Sigma^{2} \\, V^{\\intercal}$ from SVD?" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "covariance matrix C derived from PCA:\n", "[[ 0.68569351 -0.03926846 1.27368233 0.5169038 ]\n", " [-0.03926846 0.18800403 -0.32171275 -0.11798121]\n", " [ 1.27368233 -0.32171275 3.11317942 1.29638747]\n", " [ 0.5169038 -0.11798121 1.29638747 0.58241432]]\n", "\n", "covariance matrix using S and Vt from SVD:\n", "[[ 0.68112222 -0.03900667 1.26519111 0.51345778]\n", " [-0.03900667 0.18675067 -0.319568 -0.11719467]\n", " [ 1.26519111 -0.319568 3.09242489 1.28774489]\n", " [ 0.51345778 -0.11719467 1.28774489 0.57853156]]\n", "\n", "Are these matrices equivalent (element-wise closeness comparison)?\n", "True\n" ] } ], "source": [ "print('covariance matrix C derived from PCA:\\n{}\\n'.format(Cov_pca))\n", "\n", "Cov_svd = (1. / X.shape[0]) * Vt.T.dot(np.diag(np.square(S))).dot(Vt) \n", "print('covariance matrix using S and Vt from SVD:\\n{}\\n'.format(Cov_svd))\n", "\n", "allclose = np.allclose(Cov_pca, Cov_svd, atol=1e-1)\n", "print('Are these matrices equivalent (element-wise closeness comparison)?\\n{}'.format(allclose))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.3" } }, "nbformat": 4, "nbformat_minor": 2 }