{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "You can read an overview of this Numerical Linear Algebra course in [this blog post](http://www.fast.ai/2017/07/17/num-lin-alg/). The course was originally taught in the [University of San Francisco MS in Analytics](https://www.usfca.edu/arts-sciences/graduate-programs/analytics) graduate program. Course lecture videos are [available on YouTube](https://www.youtube.com/playlist?list=PLtmWHNX-gukIc92m1K0P6bIOnZb-mg0hY) (note that the notebook numbers and video numbers do not line up, since some notebooks took longer than 1 video to cover).\n", "\n", "You can ask questions about the course on [our fast.ai forums](http://forums.fast.ai/c/lin-alg)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 2. Topic Modeling with NMF and SVD" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Topic modeling is a great way to get started with matrix factorizations. We start with a **term-document matrix**:\n", "\n", "\n", "(source: [Introduction to Information Retrieval](http://player.slideplayer.com/15/4528582/#))\n", "\n", "We can decompose this into one tall thin matrix times one wide short matrix (possibly with a diagonal matrix in between).\n", "\n", "Notice that this representation does not take into account word order or sentence structure. It's an example of a **bag of words** approach." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Motivation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Consider the most extreme case - reconstructing the matrix using an outer product of two vectors. Clearly, in most cases we won't be able to reconstruct the matrix exactly. But if we had one vector with the relative frequency of each vocabulary word out of the total word count, and one with the average number of words per document, then that outer product would be as close as we can get.\n", "\n", "Now consider increasing that matrices to two columns and two rows. The optimal decomposition would now be to cluster the documents into two groups, each of which has as different a distribution of words as possible to each other, but as similar as possible amongst the documents in the cluster. We will call those two groups \"topics\". And we would cluster the words into two groups, based on those which most frequently appear in each of the topics. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### In today's class" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We'll take a dataset of documents in several different categories, and find topics (consisting of groups of words) for them. Knowing the actual categories helps us evaluate if the topics we find make sense.\n", "\n", "We will try this with two different matrix factorizations: **Singular Value Decomposition (SVD)** and **Non-negative Matrix Factorization (NMF)**" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import numpy as np\n", "from sklearn.datasets import fetch_20newsgroups\n", "from sklearn import decomposition\n", "from scipy import linalg\n", "import matplotlib.pyplot as plt" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": true }, "outputs": [], "source": [ "%matplotlib inline\n", "np.set_printoptions(suppress=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Additional Resources" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- [Data source](http://scikit-learn.org/stable/datasets/twenty_newsgroups.html): Newsgroups are discussion groups on Usenet, which was popular in the 80s and 90s before the web really took off. This dataset includes 18,000 newsgroups posts with 20 topics.\n", "- [Chris Manning's book chapter](https://nlp.stanford.edu/IR-book/pdf/18lsi.pdf) on matrix factorization and LSI \n", "- Scikit learn [truncated SVD LSI details](http://scikit-learn.org/stable/modules/decomposition.html#lsa)\n", "\n", "### Other Tutorials\n", "- [Scikit-Learn: Out-of-core classification of text documents](http://scikit-learn.org/stable/auto_examples/applications/plot_out_of_core_classification.html): uses [Reuters-21578](https://archive.ics.uci.edu/ml/datasets/reuters-21578+text+categorization+collection) dataset (Reuters articles labeled with ~100 categories), HashingVectorizer\n", "- [Text Analysis with Topic Models for the Humanities and Social Sciences](https://de.dariah.eu/tatom/index.html): uses [British and French Literature dataset](https://de.dariah.eu/tatom/datasets.html) of Jane Austen, Charlotte Bronte, Victor Hugo, and more" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Set up data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Scikit Learn comes with a number of built-in datasets, as well as loading utilities to load several standard external datasets. This is a [great resource](http://scikit-learn.org/stable/datasets/), and the datasets include Boston housing prices, face images, patches of forest, diabetes, breast cancer, and more. We will be using the newsgroups dataset.\n", "\n", "Newsgroups are discussion groups on Usenet, which was popular in the 80s and 90s before the web really took off. This dataset includes 18,000 newsgroups posts with 20 topics. " ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": true }, "outputs": [], "source": [ "categories = ['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']\n", "remove = ('headers', 'footers', 'quotes')\n", "newsgroups_train = fetch_20newsgroups(subset='train', categories=categories, remove=remove)\n", "newsgroups_test = fetch_20newsgroups(subset='test', categories=categories, remove=remove)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "((2034,), (2034,))" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "newsgroups_train.filenames.shape, newsgroups_train.target.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's look at some of the data. Can you guess which category these messages are in?" ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Hi,\n", "\n", "I've noticed that if you only save a model (with all your mapping planes\n", "positioned carefully) to a .3DS file that when you reload it after restarting\n", "3DS, they are given a default position and orientation. But if you save\n", "to a .PRJ file their positions/orientation are preserved. Does anyone\n", "know why this information is not stored in the .3DS file? Nothing is\n", "explicitly said in the manual about saving texture rules in the .PRJ file. \n", "I'd like to be able to read the texture rule information, does anyone have \n", "the format for the .PRJ file?\n", "\n", "Is the .CEL file format available from somewhere?\n", "\n", "Rych\n", "\n", "\n", "Seems to be, barring evidence to the contrary, that Koresh was simply\n", "another deranged fanatic who thought it neccessary to take a whole bunch of\n", "folks with him, children and all, to satisfy his delusional mania. Jim\n", "Jones, circa 1993.\n", "\n", "\n", "Nope - fruitcakes like Koresh have been demonstrating such evil corruption\n", "for centuries.\n", "\n", " >In article <1993Apr19.020359.26996@sq.sq.com>, msb@sq.sq.com (Mark Brader) \n", "\n", "MB> So the\n", "MB> 1970 figure seems unlikely to actually be anything but a perijove.\n", "\n", "JG>Sorry, _perijoves_...I'm not used to talking this language.\n", "\n", "Couldn't we just say periapsis or apoapsis?\n", "\n", " \n" ] } ], "source": [ "print(\"\\n\".join(newsgroups_train.data[:3]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "hint: definition of *perijove* is the point in the orbit of a satellite of Jupiter nearest the planet's center " ] }, { "cell_type": "code", "execution_count": 249, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array(['comp.graphics', 'talk.religion.misc', 'sci.space'], \n", " dtype='\n", "(source: [Facebook Research: Fast Randomized SVD](https://research.fb.com/fast-randomized-svd/))\n", "\n", "SVD is an **exact decomposition**, since the matrices it creates are big enough to fully cover the original matrix. SVD is extremely widely used in linear algebra, and specifically in data science, including:\n", "\n", "- semantic analysis\n", "- collaborative filtering/recommendations ([winning entry for Netflix Prize](https://datajobs.com/data-science-repo/Recommender-Systems-%5BNetflix%5D.pdf))\n", "- calculate Moore-Penrose pseudoinverse\n", "- data compression\n", "- principal component analysis (will be covered later in course)" ] }, { "cell_type": "code", "execution_count": 344, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 27.2 s, sys: 812 ms, total: 28 s\n", "Wall time: 27.9 s\n" ] } ], "source": [ "%time U, s, Vh = linalg.svd(vectors, full_matrices=False)" ] }, { "cell_type": "code", "execution_count": 345, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(2034, 2034) (2034,) (2034, 26576)\n" ] } ], "source": [ "print(U.shape, s.shape, Vh.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Confirm this is a decomposition of the input." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Answer" ] }, { "cell_type": "code", "execution_count": 346, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 346, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Exercise: confrim that U, s, Vh is a decomposition of the var Vectors\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Confirm that U, V are orthonormal" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Answer" ] }, { "cell_type": "code", "execution_count": 246, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 246, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Exercise: Confirm that U, Vh are orthonormal\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Topics" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What can we say about the singular values s?" ] }, { "cell_type": "code", "execution_count": 96, "metadata": {}, "outputs": [ { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXoAAAD8CAYAAAB5Pm/hAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAGWpJREFUeJzt3X2MHPd93/H3d/bp7ngkRYrHE0NSJhXTD3TcSAkjCVDs\nOpYdMU5qygmg0G1SthUgFFACG00RSDGSpn+wcerGTZzaTZTECOM4llkkqQgjqS1TclwDtWXKkWRR\nFMPTAy1SR/JESXw43tPufvvH/HZv73ize0fePszw8wIOO/ebmd0vZ4+f+e1vfztr7o6IiGRX1O0C\nRESkvRT0IiIZp6AXEck4Bb2ISMYp6EVEMk5BLyKScQp6EZGMU9CLiGScgl5EJOPy3S4AYN26db5l\ny5ZulyEikipPPvnka+4+1Gq7ngj6LVu2cOjQoW6XISKSKmZ2fDHbaehGRCTjFPQiIhmnoBcRyTgF\nvYhIxinoRUQyTkEvIpJxCnoRkYxLddCfOjfJp792lBfGLna7FBGRnpXqoD99fpLPPDbC8bPj3S5F\nRKRnpTroa/T95iIiyVId9GbxrYJeRCRZuoMe63YJIiI9L9VBX6MOvYhIslQHvalDLyLSUqqDvsY1\nSC8ikigTQS8iIskyEfTqz4uIJEt10Gt6pYhIa+kOek2vFBFpKdVBP0tdehGRJIsOejPLmdk/mtlX\nwu9rzexRMzsWbtc0bPugmY2Y2VEzu6sdhceP0657FhHJjqX06D8GHGn4/QHgoLtvAw6G3zGz7cBu\n4F3ATuBzZpZbnnIXpjF6EZFkiwp6M9sE/Czwpw3Nu4B9YXkfcHdD+8PuPuXuLwEjwK3LU+78utpx\nryIi2bLYHv3vA78OVBvaht19NCyfAobD8kbglYbtToS2tlGHXkQkWcugN7OfA864+5NJ23j80dQl\n5a2Z3Wdmh8zs0NjY2FJ2nb2PMOtGQzciIskW06O/A/iwmb0MPAy838z+EjhtZhsAwu2ZsP1JYHPD\n/ptC2xzu/pC773D3HUNDQ1dUvIZuRERaaxn07v6gu29y9y3Eb7I+5u6/BBwA9oTN9gCPhOUDwG4z\nK5nZVmAb8MSyV95YowZvREQS5a9i308C+83sXuA4cA+Aux82s/3Ac0AZuN/dK1dd6QLUoRcRaW1J\nQe/u3wC+EZbPAncmbLcX2HuVtS2hrk49kohI+qT6k7H1a910twwRkZ6W6qDX4I2ISGspD/qYvnhE\nRCRZqoNe0ytFRFpLddCLiEhrqQ56dehFRFpLddDXaIheRCRZqoPewiC9PhkrIpIs3UHf7QJERFIg\n1UFfo6EbEZFkqQ56Ta8UEWkt1UFfox69iEiyVAe9aZReRKSlVAd9jTr0IiLJUh309atXauxGRCRR\nqoNeRERay0TQqz8vIpIs1UGv6ZUiIq2lOujr1KUXEUmU6qDXtW5ERFpLd9B3uwARkRRIddDXaHal\niEiyVAe93owVEWkt1UFfow69iEiyVAe9rnUjItJaqoO+RmP0IiLJUh309WvdaPBGRCRRuoO+2wWI\niKRAqoO+RkM3IiLJ0h306tKLiLSU7qAP1KEXEUmW6qDX9EoRkdZSHfR1GqQXEUmU6qCfnV4pIiJJ\n0h303S5ARCQFUh30NRq5ERFJluqgN12+UkSkpZZBb2Z9ZvaEmT1tZofN7D+H9rVm9qiZHQu3axr2\nedDMRszsqJnd1c5/AICrSy8ikmgxPfop4P3u/qPAzcBOM7sdeAA46O7bgIPhd8xsO7AbeBewE/ic\nmeXaUXytP6+YFxFJ1jLoPXYx/FoIPw7sAvaF9n3A3WF5F/Cwu0+5+0vACHDrslYdaORGRKS1RY3R\nm1nOzJ4CzgCPuvt3gGF3Hw2bnAKGw/JG4JWG3U+Etvn3eZ+ZHTKzQ2NjY1f8DwC9GSsi0syigt7d\nK+5+M7AJuNXMfmTeemeJIyju/pC773D3HUNDQ0vZtU6fjBURaW1Js27c/U3gceKx99NmtgEg3J4J\nm50ENjfstim0tY069CIiyRYz62bIzK4Ly/3AB4HngQPAnrDZHuCRsHwA2G1mJTPbCmwDnljuwuPi\n2nKvIiKZkl/ENhuAfWHmTATsd/evmNn/A/ab2b3AceAeAHc/bGb7geeAMnC/u1faUXz9EggapBcR\nSdQy6N39GeCWBdrPAncm7LMX2HvV1bUQhaRXzouIJEv3J2PDbVVJLyKSKNVBX+/Rd7kOEZFeluqg\nr43Rq0cvIpIsE0GvnBcRSZbqoJ99M1ZJLyKSJBNBX1XOi4gkSnXQa9aNiEhr6Q56jdGLiLSU8qA3\nzDRGLyLSTKqDHuJxeo3Ri4gkS33QGxqjFxFpJvVBH5npk7EiIk2kPujN1KMXEWkmE0GvnBcRSZb6\noI/MNOtGRKSJTAS9Zt2IiCRLfdBr1o2ISHPpD3qN0YuINJX6oI8ijdGLiDST+qCPh266XYWISO9K\nfdDHH5hS0ouIJEl90JsZlWq3qxAR6V2pD/p8ZFSqSnoRkSSpD/pcZJQ1SC8ikij1QZ/PGRUFvYhI\novQHvXr0IiJNZSDoIyoVBb2ISJLUB73G6EVEmkt90Mdj9Jp1IyKSJPVBrx69iEhz6Q96M8oaoxcR\nSZT+oI+Mii5qJiKSKBNBX9XQjYhIokwEvXr0IiLJUh/0kalHLyLSTOqDXp+MFRFprmXQm9lmM3vc\nzJ4zs8Nm9rHQvtbMHjWzY+F2TcM+D5rZiJkdNbO72voPiHStGxGRZhbToy8Dv+bu24HbgfvNbDvw\nAHDQ3bcBB8PvhHW7gXcBO4HPmVmuHcVDPL1SXw4uIpKsZdC7+6i7fy8sXwCOABuBXcC+sNk+4O6w\nvAt42N2n3P0lYAS4dbkLr8np6pUiIk0taYzezLYAtwDfAYbdfTSsOgUMh+WNwCsNu50IbW2RMwW9\niEgziw56MxsE/hr4uLufb1zn7g5L++JWM7vPzA6Z2aGxsbGl7DqHpleKiDS3qKA3swJxyH/R3f8m\nNJ82sw1h/QbgTGg/CWxu2H1TaJvD3R9y9x3uvmNoaOhK6w/TK694dxGRzFvMrBsD/gw44u6fblh1\nANgTlvcAjzS07zazkpltBbYBTyxfyXPlNetGRKSp/CK2uQP4ZeD7ZvZUaPsN4JPAfjO7FzgO3APg\n7ofNbD/wHPGMnfvdvbLslQeR5tGLiDTVMujd/VuAJay+M2GfvcDeq6hr0XIRml4pItJE6j8Zq1k3\nIiLNpT/oo0jXuhERaSIDQY/G6EVEmkh90EeaRy8i0lTqgz6nyxSLiDSV+qDPq0cvItJU6oM+igx3\n1KsXEUmQ+qDPWTzFX716EZGFpT7ooygEvXr0IiILSn3Q5xX0IiJNpT7ocyHoNZdeRGRhqQ/6Qi7+\nJ6hHLyKysNQH/WyPXhelFxFZSOqDXmP0IiLNpT7o6z36ioJeRGQhqQ/62hi93owVEVlY6oM+Vx+6\n0Ri9iMhCUh/0eU2vFBFpKvVBrzF6EZHmUh/0+Zx69CIizaQ/6KPaB6Y0Ri8ispAMBL2GbkREmkl9\n0OtaNyIizaU+6DVGLyLSXPqDXmP0IiJNpT7oNb1SRKS51Ae9hm5ERJpLfdCvKOYBOD8x0+VKRER6\nU+qDfsPqPgBGz012uRIRkd6U+qDP5yIGijnGp8rdLkVEpCelPugBVpTyjE8r6EVEFpKJoB8s5bk4\nVel2GSIiPSkTQb+ipKEbEZEk2Qj6Yp6LCnoRkQVlIuhvWN3HS6+Nd7sMEZGelImg37SmnzfGp7td\nhohIT8pE0A8U85SrznRZ17sREZmvZdCb2efN7IyZPdvQttbMHjWzY+F2TcO6B81sxMyOmtld7Sq8\n0UAxB8AlTbEUEbnMYnr0fw7snNf2AHDQ3bcBB8PvmNl2YDfwrrDP58wst2zVJqhdBmF8WlMsRUTm\naxn07v5N4PV5zbuAfWF5H3B3Q/vD7j7l7i8BI8Cty1Rrov7Qo59Qj15E5DJXOkY/7O6jYfkUMByW\nNwKvNGx3IrRdxszuM7NDZnZobGzsCsuIrSjFQT+uD02JiFzmqt+MdXcHlnyNYHd/yN13uPuOoaGh\nq6phoD50ox69iMh8Vxr0p81sA0C4PRPaTwKbG7bbFNraqv5mrHr0IiKXudKgPwDsCct7gEca2neb\nWcnMtgLbgCeursTWaj36SzMKehGR+fKtNjCzLwHvA9aZ2QngPwGfBPab2b3AceAeAHc/bGb7geeA\nMnC/u7c9fWtj9Jd0GQQRkcu0DHp3/2jCqjsTtt8L7L2aopZqoKDplSIiSTLxyVhNrxQRSZaJoC/m\nI4q5iAsauhERuUwmgh5gw3V9vPqmvjdWRGS+zAT9jWsH+MFZXapYRGS+TAX98dcvdbsMEZGek5mg\nX7+yjzcvzVCpLvlDuiIimZaZoNelikVEFpadoK99aEpz6UVE5shM0K8bLAHwsr47VkRkjswE/R1v\nXUcxF/HY82dabywicg3JTNAPlvLcNLSCF8YudrsUEZGekpmgB9i0ZoAfaIqliMgcmQr6G9cO8Mrr\nE8TfhSIiIpCxoH/nhpVMzFR45sS5bpciItIzMhX0d7x1HQCPH9UbsiIiNZkK+h+6rp9t6wd5+pU3\nu12KiEjPyFTQA7xn2xD/99hrjJ6b6HYpIiI9IXNB/2/v2IIDnzk40u1SRER6QuaCfvPaAX7pthv5\n8nd/wOFX9aasiEjmgh7g4x94G4OlPL/6pX9kqqxr34jItS2TQb9mRZH//os38+LYOJ9+9J+6XY6I\nSFdlMugB7nznMPfs2MQf/8OLfOqrz+tDVCJyzcp3u4B22vuRd2MYn338BSIzPnbnNvK5zJ7bREQW\nlOmgL+Qifufn381MpcofPjbC0VMX+MxHb6GvkOt2aSIiHZP57m0UGb93z4/yGx96B1977jT3feFJ\nXn1Tc+xF5NqR+aAHMDPue+8P818+8m6+/eJZfuq/fYNPffV5LkzOdLs0EZG2uyaCvuZf3nYjj/3a\nP2fnj9zAZx9/gfd96ht84dvHmalUu12aiEjbXFNBD/E16/9g9y0c+JU7+OH1g/zm/36Wu37/mxx4\n+lXGp/TF4iKSPdYL0w537Njhhw4d6vjjujtfP3KG3/n7I7w4Nk4pH/HB7cP84k9s5vabrqegGToi\n0sPM7El339Fqu0zPumnFzPjg9mF+6u1DPPHy6/yfZ0/xyFOv8pVnRukv5Hj/O9Zz201ruW3r9bx1\n/SC5yLpdsojIkl3TPfqFTM5UePz5M3xr5DW+evg0r12cAmBFMcc7NqziHTes5J0bVvHODat4+w0r\nGSxd0+dKEemixfboFfRNuDsn3pjgiZde55kTb3Jk9AJHTp3nwuTsWP5brh/gbcMr2bxmgE1r+sPP\nABvX9LO6v9DF6kUk6zR0swzMjM1rB9i8doBf+PFNQBz+J9+c4PnRCxwZPc+RU+c5dvoi3zr2GhMz\ncy+gtrKU54bVfdywuo/hVX0MryoxvKqP9Sv7WL+qxPqVJdYNlvQBLhFpKwX9EpkZm9YMsGnNAB/Y\nPlxvd3deH5/m5JsTnHhjghNvXOLkGxOcOj/JqXOTHDt9kbGLU1Sql7+CGijmWLuiyPUrilw/WKov\nXzdQZM1AgdX9BVbXbvsLrOovMFjME+k9AxFZBAX9MjEzrh8scf1giX+26boFt6lU45PB6fOTnLkw\nyZnzU5wdn+b18WnOXoyXT5+f5Mjoec6OTzNdTp7fHxmsKOVZ1VdgsJRnsC/PilKelaU8g6V4eUUp\nR38xx0Ahx0AxHy8XQ1sxHy8XZtv6CznMdPIQyRoFfQflImNoZYmhlSVgddNt3Z2JmQpvXJrh3KUZ\nzk3McG5imvMTZc5Pxr9fmCxzcarMxckyF6bitpNvXGJ8qsLFqTLj02WW+hZMfyEO/b58RF8hRzEf\nUchFFPMRxVxEIdyW8hGFnMXt+YhiLkchb5TCtvV9wnIp7Dd/XXGB+y7ma/cfaaaTyDJoW9Cb2U7g\nD4Ac8Kfu/sl2PVYWmVnodefZeF3/Fd2HuzNVrnJpusKl6TIT05WwXGFipjy7XL+N2ybLFSZnqkzO\nVJguV5mpVJmuVJkuV7k0EbdNlyvMVDxeDutqt8spF1l8EsgZxXxugRPM7Imj1OSkVDvB5HNGPjLy\nUbyci4xCFJ9Q4nXxciGsq22Xj4x8LiJnVt82srg9F9W2nV2u7RtFxLeGXi1J17Ql6M0sB3wW+CBw\nAviumR1w9+fa8XiyMDOjr5CjrxC/B9AJ7s5MxeOTwwIngXp7ucpUpcrMAuumytWGk0jtZBOftObc\nRyVenipXuTBZ5uy8k1JtXW3bbk8wm3NCMCMXTiD1E0bOZk8kUUS04Mlj7nJkFk5YETmDXBTF7Qvs\nG5mRiwiPEZGL4ov+1R4zqt3W26i3zVlfb2tYb2G/xvVmRBFz1897jKhejy1ci06Qy6JdPfpbgRF3\nfxHAzB4GdgEK+owzM4r5uMe9otTtauYqV6qUq0656lQqTrk6+3ttXaUan6Ti2/j3crVKOWxfqUIl\n3JarVarulMN2FQ/bVzxuD/dXqT1mw/7lqlOtzt2mvp3X6mu8n2q9tomZJvs2LMe1Veu1Ves1d/uZ\nWJrIZk+SjSeU+SeP2RMSc06U9ZPevO0aT5j5XESh9qquvhy/mitEs68GC+FVXyFsl4+M/mKOFcX4\nfbGBYo7BUp6BUo61A8We+f6LdgX9RuCVht9PALe16bFEFiX+j9vtKrrP3al6PDmg6t5wIvA5J4XL\n22ZPZrX1jfcxZ717OKk1rPfZE1F9fcNjXF4Llz1uZd62s/cZTsBOOAnGJ9XaiXK2zZkqV6h4OOFW\nZk+MM5XZE/pMJT5JzlScmeqVvRpc3V9g/crWvZ33vX2IT/zs9it4Jheva2/Gmtl9wH0AN954Y7fK\nELnmmFkY5tGQyGLVXk3VXv3NVGZf6U3MVBifit/fujhV5tJ0PEHi6RPnuDTd+kKJw6v62l5/u4L+\nJLC54fdNoa3O3R8CHoL4k7FtqkNE5KrFwzxLezn4y22q5Uq0awDpu8A2M9tqZkVgN3CgTY8lIiJN\ntKVH7+5lM/sV4KvE0ys/7+6H2/FYIiLSXNvG6N3974C/a9f9i4jI4vTG3B8REWkbBb2ISMYp6EVE\nMk5BLyKScQp6EZGM64mvEjSzMeD4VdzFOuC1ZSpnuaimxevFunqxJujNunqxJujNupa7pre4+1Cr\njXoi6K+WmR1azPcmdpJqWrxerKsXa4LerKsXa4LerKtbNWnoRkQk4xT0IiIZl5Wgf6jbBSxANS1e\nL9bVizVBb9bVizVBb9bVlZoyMUYvIiLJstKjFxGRBKkOejPbaWZHzWzEzB7o4ONuNrPHzew5Mzts\nZh8L7b9tZifN7Knw86GGfR4MdR41s7vaWNvLZvb98PiHQttaM3vUzI6F2zWdqsvM3t5wPJ4ys/Nm\n9vFuHCsz+7yZnTGzZxvalnxszOzHwzEeMbPP2FV8qWlCTZ8ys+fN7Bkz+1szuy60bzGziYZj9kcd\nrGnJz9dy1tSkri831PSymT0V2jt1rJKyoKt/V5dx91T+EF/++AXgJqAIPA1s79BjbwB+LCyvBP4J\n2A78NvAfF9h+e6ivBGwNdefaVNvLwLp5bf8VeCAsPwD8bqfranjOTgFv6caxAt4L/Bjw7NUcG+AJ\n4HbAgL8HfmaZa/ppIB+Wf7ehpi2N2827n3bXtOTnazlrSqpr3vrfA36rw8cqKQu6+nc1/yfNPfr6\nF5C7+zRQ+wLytnP3UXf/Xli+ABwh/p7cJLuAh919yt1fAkaI6++UXcC+sLwPuLtLdd0JvODuzT4c\n17aa3P2bwOsLPN6ij42ZbQBWufu3Pf7f+RcN+yxLTe7+NXevfQfdt4m/oS1RJ2pqoiPHqVVdofd7\nD/ClZvfRhmOVlAVd/buaL81Bv9AXkDcL27Ywsy3ALcB3QtOvhpfcn294udbJWh34upk9afH38gIM\nu/toWD4FDHehLoi/aazxP2K3jxUs/dhsDMudqu/fEffuaraGoYh/MLP3NNTaiZqW8nx1+ji9Bzjt\n7sca2jp6rOZlQU/9XaU56LvOzAaBvwY+7u7ngf9JPJR0MzBK/FKy037S3W8Gfga438ze27gy9BY6\nPtXK4q+U/DDwv0JTLxyrObp1bJKY2SeAMvDF0DQK3Bie3/8A/JWZrepQOT33fM3zUeZ2Ijp6rBbI\ngrpe+LtKc9C3/ALydjKzAvET+0V3/xsAdz/t7hV3rwJ/wuyQQ8dqdfeT4fYM8LehhtPhpWHtpeuZ\nTtdFfOL5nrufDvV1/VgFSz02J5k7lNKW+szs3wA/B/yrEBSEl/tnw/KTxOO7b+tETVfwfHXkOAGY\nWR74eeDLDfV27FgtlAX02N9VmoO+a19AHsYD/ww44u6fbmjf0LDZR4Da7IADwG4zK5nZVmAb8Rsv\ny13XCjNbWVsmflPv2fD4e8Jme4BHOllXMKfH1e1j1WBJxya8HD9vZreHv4N/3bDPsjCzncCvAx92\n90sN7UNmlgvLN4WaXuxQTUt6vjpRU4MPAM+7e33oo1PHKikL6LW/q+V6V7cbP8CHiN/lfgH4RAcf\n9yeJX4o9AzwVfj4EfAH4fmg/AGxo2OcToc6jLOO76fPquon4Hf2ngcO1YwJcDxwEjgFfB9Z2uK4V\nwFlgdUNbx48V8YlmFJghHgO990qODbCDOOheAP4H4YOHy1jTCPE4bu1v64/Ctr8QntengO8B/6KD\nNS35+VrOmpLqCu1/Dvz7edt26lglZUFX/67m/+iTsSIiGZfmoRsREVkEBb2ISMYp6EVEMk5BLyKS\ncQp6EZGMU9CLiGScgl5EJOMU9CIiGff/AQqkXAuonCReAAAAAElFTkSuQmCC\n", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "plt.plot(s);" ] }, { "cell_type": "code", "execution_count": 97, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[]" ] }, "execution_count": 97, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXoAAAD9CAYAAACyYrxEAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAIABJREFUeJzt3Xl01fWd//HnOzd7gLAkhCQ3FZAogqyJSBUc616xgAqW\nThdnnFZnxrHaX8+vp3bmTO2ZX6edni62ndFWbXvsatlESt23wZ0m7AGRsCgJARK2AIGs798f96LB\nqrmBJN+be1+Pc3LyvZ/7/Sbv3AOv+72f7+f7+Zi7IyIiiSsl6AJERKR3KehFRBKcgl5EJMEp6EVE\nEpyCXkQkwSnoRUQSXMxBb2YhM1tjZiuij+8xs1ozWxv9urbTvnebWbWZbTGzq3ujcBERiU1qN/a9\nE9gMDOrU9iN3/37nncxsHLAAGA8UAc+a2Tnu3n6mxYqISPfFdEZvZmFgFvBQDLvPAR5x92Z33wFU\nA9NOv0QRETkTsXbd3At8Deh4X/sdZrbezH5pZkOibcXArk771ETbREQkAF123ZjZdcA+d680s0s7\nPXU/8B+AR7//ALgl1l9sZrcCtwLk5OSUjR07thtli4hIZWVlg7vnd7VfLH30FwOzoxdbM4FBZvZb\nd//cyR3M7EFgRfRhLVDS6fhwtO0U7v4A8ABAeXm5V1RUxFCKiIicZGZvx7Jfl1037n63u4fdfSSR\ni6zPu/vnzKyw027XAxuj28uBBWaWYWajgFJgVbeqFxGRHtOdUTfv9z0zm0yk62YncBuAu1eZ2UJg\nE9AG3K4RNyIiwbF4mKZYXTciIt1nZpXuXt7VfrozVkQkwSnoRUQSnIJeRCTBKehFRBJcvw763YeO\n85+Pb6bhaHPQpYiIxK1+HfRHm9t4YOV2lq35q/uxREQkql8H/TkFA5kUzmVRRQ3xMExURCQe9eug\nB5hXXsKWvUfYUHs46FJEROJSvw/62ROLSE9NYVFFTdCliIjEpX4f9LnZaVw9fgSPra3lRKtmWhAR\neb9+H/QA88vCNJ5o45lNe4MuRUQk7iRE0F88Jo/C3EwWVar7RkTk/RIi6EMpxryyMC9trafu8PGg\nyxERiSsJEfQA88rCuMPS1RpTLyLSWcIE/VnDcpg2aiiLKnZpTL2ISCcJE/QQuSi7c38TFW8fDLoU\nEZG4kVBBf+2EQrLTQyyq2BV0KSIicSOhgj4nI5VZEwr58/o6mlragi5HRCQuJFTQA8wvL+FYSzuP\nb9gTdCkiInEh4YL+gpFDGDksW903IiJRCRf0ZpEx9W/sOMA7+5uCLkdEJHAJF/QAN0wNYwaLK3VW\nLyKSkEFfNDiLGWPyWLK6lo4OjakXkeQWc9CbWcjM1pjZiujjoWb2jJltjX4f0mnfu82s2sy2mNnV\nvVF4V+aXl1B76DivbtsfxK8XEYkb3TmjvxPY3Onx14Hn3L0UeC76GDMbBywAxgPXAPeZWahnyo3d\nVeMKGJSZyiJ134hIkosp6M0sDMwCHurUPAd4OLr9MDC3U/sj7t7s7juAamBaz5Qbu8y0ELMnF/Hk\nxj0cPt7a179eRCRuxHpGfy/wNaCjU1uBu9dFt/cABdHtYqDzaXRNtO0UZnarmVWYWUV9fX33qo7R\n/LISmts6WLF+d6/8fBGR/qDLoDez64B97l75Yft4ZBaxbl31dPcH3L3c3cvz8/O7c2jMJoZzOadg\nAAu1zKCIJLFYzugvBmab2U7gEeAyM/stsNfMCgGi3/dF968FSjodH4629TkzY35ZCet2HWLr3iNB\nlCAiErgug97d73b3sLuPJHKR9Xl3/xywHLg5utvNwGPR7eXAAjPLMLNRQCmwqscrj9HcKcWEUkyr\nT4lI0jqTcfTfBa40s63AFdHHuHsVsBDYBDwJ3O7uga3anT8wg8vGDmfp6lpa2zu6PkBEJMF0K+jd\n/UV3vy66vd/dL3f3Une/wt0PdNrv2+5+truf6+5P9HTR3TW/LEzD0Wb+d0vvXPQVEYlnCXln7Pt9\nYuxw8gaka0y9iCSlpAj6tFAKcycX89zmfew/2hx0OSIifSopgh4iUyK0dTjL1mpMvYgkl6QJ+nNH\nDGRiOFeLh4tI0kmaoIfIRdk39xyhandj0KWIiPSZpAr62ZOKSU9N0epTIpJUkiroc7PTuGpcAY+t\n201zW2BD+0VE+lRSBT1ELsoeamrl2U37ut5ZRCQBJF3QzxiTR2FupsbUi0jSSLqgD6UYN0wtZuVb\n9ew5fCLockREel3SBT3AvLISOhyWrtFEZyKS+JIy6Efl5XDByCEsrqjRmHoRSXhJGfQQWX1qe8Mx\nVr9zMOhSRER6VdIG/bUTC8lKC7FIq0+JSIJL2qAfkJHKtRMKWbG+jqaWtqDLERHpNUkb9AA3lYc5\n2tzGkxv3BF2KiEivSeqgnzZqKGcNy1b3jYgktKQOejNj3tQwr23fz64DTUGXIyLSK5I66AFuLAtj\nBou1eLiIJKikD/qiwVnMGJPH4soaOjo0pl5EEk/SBz3AvLIwtYeO89r2/UGXIiLS4xT0wNXjRzAw\nM1Xz1ItIQuoy6M0s08xWmdk6M6sys29F2+8xs1ozWxv9urbTMXebWbWZbTGzq3vzD+gJmWkhZk8q\n4omNe2g80Rp0OSIiPSqWM/pm4DJ3nwRMBq4xs+nR537k7pOjX48DmNk4YAEwHrgGuM/MQr1Qe4+a\nX15Cc1sHK9bVBV2KiEiP6jLoPeJo9GFa9OujrlrOAR5x92Z33wFUA9POuNJeNimcS+nwAZqnXkQS\nTkx99GYWMrO1wD7gGXd/I/rUHWa23sx+aWZDom3FQOe0rIm2xTUzY355mDXvHKJ635GgyxER6TEx\nBb27t7v7ZCAMTDOz84H7gdFEunPqgB905xeb2a1mVmFmFfX19d0su3fMnVJMKMVYpDH1IpJAujXq\nxt0PAS8A17j73ugbQAfwIO91z9QCJZ0OC0fb3v+zHnD3cncvz8/PP73qe9jwgZl84tx8lq6upa29\nI+hyRER6RCyjbvLNbHB0Owu4EnjTzAo77XY9sDG6vRxYYGYZZjYKKAVW9WzZvWdeWQn1R5pZuTU+\nPmWIiJyp1Bj2KQQejo6cSQEWuvsKM/uNmU0mcmF2J3AbgLtXmdlCYBPQBtzu7u29Un0vuGzscIbm\npLOooobLxhYEXY6IyBnrMujdfT0w5QPaP/8Rx3wb+PaZlRaM9NQU5k4u5jev7+TAsRaG5qQHXZKI\nyBnRnbEfYH55mNZ257G1f3VpQUSk31HQf4DzCgdxfvEgzVMvIglBQf8hbiovYVNdI1W7DwddiojI\nGVHQf4jZk4pID6XorF5E+j0F/YcYnJ3OleMLeGxtLS1tGlMvIv2Xgv4jzC8Lc7Cplec27w26FBGR\n06ag/wgzS/MZMShTUyKISL+moP8IoRTjhqnFvLhlH/saTwRdjojIaVHQd2FeWZgOh6VrNKZeRPon\nBX0XRucPoPysISyq2IW7Fg8Xkf5HQR+D+eVhttUfY82uQ0GXIiLSbQr6GMyaWERWWkhj6kWkX1LQ\nx2BARiqfnDCCFet2c7yl30zEKSICKOhjNr+shCPNbTxVtSfoUkREukVBH6MLRw2lZGgWCyu0eLiI\n9C8K+hilpBjzppbw6rb97DrQFHQ5IiIxU9B3w41lxZjBktW6KCsi/YeCvhvCQ7K56OxhLK6soaND\nY+pFpH9Q0HfT/LISag4e5/Ud+4MuRUQkJgr6brp6/AgGZqSyWGPqRaSfUNB3U1Z6iE9NLuLxjXUc\nOdEadDkiIl1S0J+G+WVhTrR28Of1dUGXIiLSpS6D3swyzWyVma0zsyoz+1a0faiZPWNmW6Pfh3Q6\n5m4zqzazLWZ2dW/+AUGYXDKYMcMHaJ56EekXYjmjbwYuc/dJwGTgGjObDnwdeM7dS4Hnoo8xs3HA\nAmA8cA1wn5mFeqP4oJgZ88vCVL59kG31R4MuR0TkI3UZ9B5xMs3Sol8OzAEejrY/DMyNbs8BHnH3\nZnffAVQD03q06jhw/dRiQinGYp3Vi0ici6mP3sxCZrYW2Ac84+5vAAXufrKTeg9QEN0uBjrPE1AT\nbUsowwdmcuk5+SxdXUO7xtSLSByLKejdvd3dJwNhYJqZnf++553IWX7MzOxWM6sws4r6+vruHBo3\n5peH2dvYzMqt/bN+EUkO3Rp14+6HgBeI9L3vNbNCgOj3fdHdaoGSToeFo23v/1kPuHu5u5fn5+ef\nTu2Bu2xsAUNz0jWmXkTiWiyjbvLNbHB0Owu4EngTWA7cHN3tZuCx6PZyYIGZZZjZKKAUWNXThceD\n9NQU5kwu4plNeznU1BJ0OSIiHyiWM/pC4AUzWw/8hUgf/Qrgu8CVZrYVuCL6GHevAhYCm4Angdvd\nPWFX65hfVkJLewePrd0ddCkiIh/I4mHB6/Lycq+oqAi6jNM26ycvYQYr7pgZdCkikkTMrNLdy7va\nT3fG9oAFF5SwsbaRHz69hXh44xQR6Sw16AISwd9eeBZVuxv5yfPVNJ5o49+vG0dKigVdlogIoKDv\nEaEU4zs3TGBARioPvbyDo81tfPeGCaSG9IFJRIKnoO8hZsa/zjqPgZlp/OjZtzjW3Ma9CyaTkZpQ\nsz+ISD+kU84eZGbceUUp/37dOJ7YuIcvPlxBU0tb0GWJSJJT0PeCW2aM4ns3TuSV6ga+8ItVNGre\nehEJkIK+l9x0QQk//cxU1tUc4jMPvM7+o81BlyQiSUpB34tmTSzkwS+Us63+KDf9/DXqDh8PuiQR\nSUIK+l526bnD+fUtF7K3sZl597/GzoZjQZckIklGQd8Hpo0ayh++NJ2mljbm//w1tuw5EnRJIpJE\nFPR9ZEI4l4W3fZwUg08/8Bprdx0KuiQRSRIK+j5UWjCQxf94EYMy0/jsg6/z2rb9QZckIklAQd/H\nSoZms+gfP07R4Cxu/tUqntu8N+iSRCTBKegDUDAokz/e9nHGjhjIbb+p5LG1f7Uui4hIj1HQB2Ro\nTjq/++KFTD1rCHf9cS2/f+OdoEsSkQSloA/QwMw0fn3LNC49J59vPLqBn//vtqBLEpEEpKAPWGZa\niJ9/vpzrJhbynSfe5PtPaU57EelZmr0yDqSnpvDjBVMYkJHKf79QzdFmzWkvIj1HQR8nTs5pPzAz\nlQdf2sGRE238142a015EzpyCPo6YGd+4NjKn/Q+ficxp/+PPaE57ETkzOl2MM2bGly+PzGn/ZJXm\ntBeRM6egj1O3zBjF9+ZF5rT//C9Wcfi45rQXkdPTZdCbWYmZvWBmm8ysyszujLbfY2a1ZrY2+nVt\np2PuNrNqM9tiZlf35h+QyG4qL+G//3Yq66Nz2jdoTnsROQ2xnNG3AV9193HAdOB2MxsXfe5H7j45\n+vU4QPS5BcB44BrgPjNTJ/NpunZCZE777Q2ROe13H9Kc9iLSPV0GvbvXufvq6PYRYDNQ/BGHzAEe\ncfdmd98BVAPTeqLYZHVyTvv6xmbm/+w1dmhOexHphm710ZvZSGAK8Ea06Q4zW29mvzSzIdG2YmBX\np8Nq+Og3BonBtFFD+cOt0zne2s78n73Gm3sagy5JRPqJmIPezAYAS4C73L0RuB8YDUwG6oAfdOcX\nm9mtZlZhZhX19fXdOTRpnV+cy8LbphNKgU///HXWvHMw6JJEpB+IKejNLI1IyP/O3ZcCuPted293\n9w7gQd7rnqkFSjodHo62ncLdH3D3cncvz8/PP5O/IamMGR6Z0z43K43PPvQGr25rCLokEYlzsYy6\nMeAXwGZ3/2Gn9sJOu10PbIxuLwcWmFmGmY0CSoFVPVeynJzTPjwki7/71V94dpPmtBeRDxfLGf3F\nwOeBy943lPJ7ZrbBzNYDnwC+AuDuVcBCYBPwJHC7u7f3TvnJq2BQJn+8NTqn/W81p72IfDiLh5kS\ny8vLvaKiIugy+qUjJ1r54sMVrNp5gP8393w+e+FZQZckIn3EzCrdvbyr/XRnbD83MDONh2+ZxifO\nHc6/PrqRHzy9hcNNuotWRN6joE8AmWkhfva5MuZMLuKnz1cz7T+f5c5H1vBKdQMdHcF/YhORYKnr\nJoG4O1W7G1lYsYtla2ppPNFG8eAs5peHmVcWJjwkO+gSRaQHxdp1o6BPUCda23l6014W/mUXr0SH\nYM4Yk8f88hKuGldAZppmpRDp7xT08q5dB5pYsrqGRRU11B46zqDMVOZOKeam8hLOL84NujwROU0K\nevkrHR3Oa9v3s7BiF09s3ENLWwfjCgdxU3mYOZOLGZKTHnSJItINCnr5SIebWlm+rpaFFTVsqD1M\neiiFK8cX8OnyEi4ek0dI69WKxD0FvcRs08kLuGtrOdTUSlFuJvPKwswrK+Fjw3QBVyReKeil25rb\n2nl20z4WVuxi5dZ63OGis4dxU3kJ15w/QhdwReKMgl7OyO5Dx1lSWcOiyhreOdDEwMxUZk8q4tMX\nlDChOJfIFEgiEiQFvfSIjg7njR0HWFSxi8c31nGitYOxIwYyv7yEuZOLGDYgI+gSRZKWgl56XOOJ\nVv60bjcLK2pYt+sQaSHjivMKuKm8hEvOydcFXJE+pqCXXrVlzxEWVexi6ZpaDhxrYcSgTG4sK2Z+\nWQkj83KCLk8kKSjopU+0tHXw/Jt7WVhRw4tb9tHhkWUPvzRzNFeOKwi6PJGEpqCXPre38QRLVtfw\nx7/s4u39TdwwpZhvzh5PblZa0KWJJCRNUyx9rmBQJv986Rie/T9/w52Xl/LYut188t6VvFKt5Q5F\ngqSglx6XFkrhK1eew5J/uojMtBCffegNvvWnKk60aqExkSAo6KXXTC4ZzJ+/PJO/u2gkv3plJ7N+\n8hLraw4FXZZI0lHQS6/KSg9xz+zx/OYfpnGsuZ0b7nuVHz+7ldb2jqBLE0kaCnrpEzNL83nqrku4\nbmIhP3r2Lebd/yrb6o8GXZZIUlDQS5/JzU7j3gVT+J+/ncrbB5qY9ZOXePjVnVruUKSXKeilz82a\nWMjTd13C9NHD+ObyKr7wy1XUHT4edFkiCUtBL4EYPiiTX/3dBXz7+vOpfPsgV/1oJcvW1BIP93WI\nJJoug97MSszsBTPbZGZVZnZntH2omT1jZluj34d0OuZuM6s2sy1mdnVv/gHSf5kZn73wLJ64cybn\nFAzkrj+u5V9+v4aDx1qCLk0kocRyRt8GfNXdxwHTgdvNbBzwdeA5dy8Fnos+JvrcAmA8cA1wn5lp\nInP5UCPzclh428f5v1efy9Ob9nDVvSt5Ycu+oMsSSRhdBr2717n76uj2EWAzUAzMAR6O7vYwMDe6\nPQd4xN2b3X0HUA1M6+nCJbGEUozbPzGGZbdfzNDsdP7+V3/hG49u4FhzW9ClifR73eqjN7ORwBTg\nDaDA3euiT+0BTs5gVQzs6nRYTbRNpEvji3J57F8u5tZLRvOHVe9w7U9eovLtA0GXJdKvxRz0ZjYA\nWALc5e6NnZ/zyBW0bl1FM7NbzazCzCrq6+u7c6gkuMy0EN+49jwe+dJ02juc+T97je89+SYtbbrJ\nSuR0xBT0ZpZGJOR/5+5Lo817zaww+nwhcLJTtRYo6XR4ONp2Cnd/wN3L3b08Pz//dOuXBHbh6GE8\ncedM5peVcN+L25jzP6+wZc+RoMsS6XdiGXVjwC+Aze7+w05PLQdujm7fDDzWqX2BmWWY2SigFFjV\ncyVLMhmYmcZ/zZvIg18op/7ICT7105d5YOU22nWTlUjMYjmjvxj4PHCZma2Nfl0LfBe40sy2AldE\nH+PuVcBCYBPwJHC7u2vaQjkjV44r4Km7LuHSc/P5z8ff5DMPvs6uA01BlyXSL2jhEelX3J0lq2u5\nZ3kV7s43PzWe+eVhIh88RZKLFh6RhGRmzCsL8+RdM5kQzuVrS9bzpV9XUn+kOejSROKWgl76pfCQ\nbH7/xen826zzWLm1nmvuXclTVXuCLkskLinopd9KSTG+OHM0K+6YwYjcTG77TSVfXbiOxhOtQZcm\nElcU9NLvnVMwkEf/+WLuuGwMj66p4ZP3vsRr2/YHXZZI3FDQS0JIT03hq1edy+J/uoj01BQ+8+Dr\n/MeKTZpCQQSNupEE1NTSxncef5PfvP42aSGj7KwhzCzNZ8aYPM4vziWUohE6khhiHXWjoJeEVfn2\nQZ6q2sNLWxvYXBeZtWNwdhoXnT3s3eAvGZodcJUipy/WoE/ti2JEglB21hDKzoosk1B/pJlXtzWw\n8q0GXq6u5/ENkRE6I4dlM6M0j5ml+Xz87GEMykwLsmSRXqEzekk67k71vqO8tLWBl6sbeH37fppa\n2gmlGJPCucwozWdmaR6TSwaTFtJlLIlf6roRiVFLWwer3znIy1sbeKm6gQ01h+hwGJCRyvTRw5hZ\nmseM0jxG5+XoDlyJKwp6kdN0uKmVV7dFQv+lrfXsOhBZuLwoNzPSt1+ax8Vj8hiakx5wpZLsFPQi\nPeTt/cci3TxbG3h1WwONJ9owg/FFg5gxJp9LSvMoGzmEjFStmCl9S0Ev0gva2jtYX3uYl6PBv/qd\ng7R1OJlpKUwbNYyZY/KYeU4e5xYMVDeP9DoFvUgfONrcxhvb9/PS1kg3z7b6YwDkD8xgxpg8ZozJ\nY2ZpHsMHZQZcqSQiDa8U6QMDMlK5/LwCLj8vsmRy3eHj73bzrHyrnkfXRBZXu2DkEG6YGmbWxEIN\n4ZQ+pzN6kV7S0eFs3tPIi1vqWbq6hm31x8hITeGq8SO4cWoxM0vzdZeunBF13YjEEXdnXc1hlq6u\nYfm63RxqamX4wAyun1LMDVPDnDtiYNAlSj+koBeJU81t7bzw5j4WV9by4pZ9tHU45xcP4sapYWZP\nKmLYgIygS5R+QkEv0g80HG1m+drdLFldQ9XuRlJTjEvPHc68smIuG1tAeqruzJUPp6AX6Wfe3NPI\n0tW1PLqmlvojzQzOTmP2pCJunBpmYjhXwzXlryjoRfqptvYOXqpuYOnqWp6u2kNzWwdjhg/ghqnF\nXD+lmMLcrKBLlDihoBdJAIePt/L4hjqWVNZQ8fZBzGDGmDxunBrm6vEjyErX3bjJrMeC3sx+CVwH\n7HP386Nt9wBfAuqju33D3R+PPnc38A9AO/Bld3+qqyIU9CJd29lwjKWra1iyupbaQ8fJSQ9x7YRC\nbiwLM23kUFI0VDPp9GTQXwIcBX79vqA/6u7ff9++44A/ANOAIuBZ4Bx3b/+o36GgF4ldR4ezaucB\nllTW8PiGOo61tBMeksUNU8PcOLWYs4blBF2i9JEeuzPW3Vea2cgYf+8c4BF3bwZ2mFk1kdB/Lcbj\nRaQLKSnG9NHDmD56GN+aM56nqvawpLKWnz6/lZ88t1V34cpfOZMpEO4wsy8AFcBX3f0gUAy83mmf\nmmibiPSC7PRUrp8S5vopYeoOH+fRNbUsqazh7qUbuGd51bt34c4Yk0eqFlFJWqcb9PcD/wF49PsP\ngFu68wPM7FbgVoCPfexjp1mGiJxUmJvFP186hn/6m7NZV3OYJZWRu3D/tG43wwdmMHdKZNTOeYWD\ngi5V+lhMo26iXTcrTvbRf9hz0QuxuPt3os89Bdzj7h/ZdaM+epHe8UF34Y4dMZC5U4qZPamIosEa\nqtmf9ejwyvcHvZkVuntddPsrwIXuvsDMxgO/572Lsc8BpboYKxK8/Ueb+fOGOpatqWX1O4cwg+mj\nhjF3ShHXnF9Ibpb68/ubnhx18wfgUiAP2At8M/p4MpGum53AbZ2C/1+JdOO0AXe5+xNdFaGgF+lb\nOxuO8dja3SxbW8uOhmOkp6ZwxXnDmTu5mEvPHa6pF/oJ3TAlIl06OavmsjW1/GndbvYfa2Fwdhqz\nJhRy/ZRiys4aoqkX4piCXkS6pbW9g5erG1i2ppanqvZworWD8JAs5k4uZu6UYsYMHxB0ifI+CnoR\nOW1Hm9t4umoPj66p5ZXqBjocJhTnMndKMZ+aVMjwgVoaMR4o6EWkR+w7coI/rYtcxN1Qe5gUg4vH\n5HH9lGKuHj+CnAytSBoUBb2I9LjqfUdYtiZyEbfm4HGy0kJcNb6AuVOKmambsvqcgl5Eeo27U/n2\nQR5dU8uK9XUcPt5K3oB0rptYxNwpxUzS/Pl9QkEvIn2ipa2DF7fsY9naWp7dvI+Wtg5G5eVEL+IW\naZK1XqSgF5E+d/h4K09urGPZmt28vmM/7jD1Y4O5fkoxsyYWMTQnPegSE4qCXkQCtfvQcZav282j\nq2vZsvcIqSnG35yTzxXjChiclUZWeojs9FSy00NkpoXITo98ZaWHSA+lqOsnBgp6EYkbm+saWbam\nlsfW7mZP44ku9w+lGNlpoeibQYis9FSy0lLITk99ty07PURWWuq7bw6Rx6FT3kDe3Tct9ZR9EmWR\nlh6bj15E5EydVziI8woH8bVrxlJ78DjHWtpoamnnRGs7TS3tNLW0cbwlsn28NfK4qaX9lLbj0f0a\njjZH93mvraOb56sZqSkMy0lnVH4Oo/MGMDo/h1F5OZydP4CiwVmEEuSN4CQFvYj0mVCK8bFh2T36\nM92d5rYOjr/7JvHeG0BTa6c3i5NvHtG2fUea2d5wjGVrazlyou3dn5eemsLIYdmMzhsQfSPIYXT+\nAEbn5TCkn15jUNCLSL9mZmSmRfr5h5zG8e7O/mMtbK8/xvb6o+xoOMa2+mO8te8Iz27eS1unjwtD\nstMYFQ3+yCeAyPbHhmaTmRa/C7Ur6EUkqZkZeQMyyBuQwbRRQ095rq29g10Hj7Oj4Sjb6yNvADsa\njrLyrXoWV9Z0+hkQHpLFqLzImf/Z+TmR7fwcRgzKDPyagIJeRORDpIZSGJUX6b+/bOypzx1tbmNH\n/TG2R98EtjdEPhFU7DxAU8t7S3BkpYUYmZfD6He7gd57E+irNX0V9CIip2FARioTwrlMCOee0u7u\n7G1sfu8NIPopYGPtYZ7YUHfKheO8ARnMnVzEv103rldrVdCLiPQgM2NEbiYjcjO56Oy8U55raevg\nnQMnu4AinwAK+2A5RwW9iEgfSU9NYczwgYwZPrBPf6+mmhMRSXAKehGRBKegFxFJcAp6EZEEp6AX\nEUlwCnoRkQSnoBcRSXAKehGRBBcXC4+YWT3w9hn8iDygoYfK6e/0WpxKr8d79FqcKhFej7PcPb+r\nneIi6M8h1tCiAAACwklEQVSUmVXEsspKMtBrcSq9Hu/Ra3GqZHo91HUjIpLgFPQiIgkuUYL+gaAL\niCN6LU6l1+M9ei1OlTSvR0L00YuIyIdLlDN6ERH5EP066M3sGjPbYmbVZvb1oOsJkpmVmNkLZrbJ\nzKrM7M6gawqamYXMbI2ZrQi6lqCZ2WAzW2xmb5rZZjP7eNA1BcnMvhL9f7LRzP5gZplB19Sb+m3Q\nm1kI+B/gk8A44DNm1rvrccW3NuCr7j4OmA7cnuSvB8CdwOagi4gTPwaedPexwCSS+HUxs2Lgy0C5\nu58PhIAFwVbVu/pt0APTgGp33+7uLcAjwJyAawqMu9e5++ro9hEi/5GLg60qOGYWBmYBDwVdS9DM\nLBe4BPgFgLu3uPuhYKsKXCqQZWapQDawO+B6elV/DvpiYFenxzUkcbB1ZmYjgSnAG8FWEqh7ga8B\nHUEXEgdGAfXAr6JdWQ+ZWU7QRQXF3WuB7wPvAHXAYXd/Otiqeld/Dnr5AGY2AFgC3OXujUHXEwQz\nuw7Y5+6VQdcSJ1KBqcD97j4FOAYk7TUtMxtC5NP/KKAIyDGzzwVbVe/qz0FfC5R0ehyOtiUtM0sj\nEvK/c/elQdcToIuB2Wa2k0iX3mVm9ttgSwpUDVDj7ic/4S0mEvzJ6gpgh7vXu3srsBS4KOCaelV/\nDvq/AKVmNsrM0olcTFkecE2BMTMj0ge72d1/GHQ9QXL3u9097O4jify7eN7dE/qM7aO4+x5gl5md\nG226HNgUYElBeweYbmbZ0f83l5PgF6dTgy7gdLl7m5n9C/AUkavmv3T3qoDLCtLFwOeBDWa2Ntr2\nDXd/PMCaJH7cAfwuelK0Hfj7gOsJjLu/YWaLgdVERqutIcHvktWdsSIiCa4/d92IiEgMFPQiIglO\nQS8ikuAU9CIiCU5BLyKS4BT0IiIJTkEvIpLgFPQiIgnu/wPmVE6l8LHNJQAAAABJRU5ErkJggg==\n", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "plt.plot(s[:10])" ] }, { "cell_type": "code", "execution_count": 52, "metadata": { "collapsed": true }, "outputs": [], "source": [ "num_top_words=8\n", "\n", "def show_topics(a):\n", " top_words = lambda t: [vocab[i] for i in np.argsort(t)[:-num_top_words-1:-1]]\n", " topic_words = ([top_words(t) for t in a])\n", " return [' '.join(t) for t in topic_words]" ] }, { "cell_type": "code", "execution_count": 347, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['critus ditto propagandist surname galacticentric kindergarten surreal imaginative',\n", " 'jpeg gif file color quality image jfif format',\n", " 'graphics edu pub mail 128 3d ray ftp',\n", " 'jesus god matthew people atheists atheism does graphics',\n", " 'image data processing analysis software available tools display',\n", " 'god atheists atheism religious believe religion argument true',\n", " 'space nasa lunar mars probe moon missions probes',\n", " 'image probe surface lunar mars probes moon orbit',\n", " 'argument fallacy conclusion example true ad argumentum premises',\n", " 'space larson image theory universe physical nasa material']" ] }, "execution_count": 347, "metadata": {}, "output_type": "execute_result" } ], "source": [ "show_topics(Vh[:10])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We get topics that match the kinds of clusters we would expect! This is despite the fact that this is an **unsupervised algorithm** - which is to say, we never actually told the algorithm how our documents are grouped." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will return to SVD in **much more detail** later. For now, the important takeaway is that we have a tool that allows us to exactly factor a matrix into orthogonal columns and orthogonal rows." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Non-negative Matrix Factorization (NMF)" ] }, { "cell_type": "markdown", "metadata": { "heading_collapsed": true }, "source": [ "#### Motivation" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "\n", "\n", "(source: [NMF Tutorial](http://perso.telecom-paristech.fr/~essid/teach/NMF_tutorial_ICME-2014.pdf))\n", "\n", "A more interpretable approach:\n", "\n", "\n", "\n", "(source: [NMF Tutorial](http://perso.telecom-paristech.fr/~essid/teach/NMF_tutorial_ICME-2014.pdf))" ] }, { "cell_type": "markdown", "metadata": { "heading_collapsed": true }, "source": [ "#### Idea" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "Rather than constraining our factors to be *orthogonal*, another idea would to constrain them to be *non-negative*. NMF is a factorization of a non-negative data set $V$: $$V = W H$$ into non-negative matrices $W,\\; H$. Often positive factors will be **more easily interpretable** (and this is the reason behind NMF's popularity). \n", "\n", "\n", "\n", "(source: [NMF Tutorial](http://perso.telecom-paristech.fr/~essid/teach/NMF_tutorial_ICME-2014.pdf))\n", "\n", "Nonnegative matrix factorization (NMF) is a non-exact factorization that factors into one skinny positive matrix and one short positive matrix. NMF is NP-hard and non-unique. There are a number of variations on it, created by adding different constraints. " ] }, { "cell_type": "markdown", "metadata": { "heading_collapsed": true }, "source": [ "#### Applications of NMF" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "- [Face Decompositions](http://scikit-learn.org/stable/auto_examples/decomposition/plot_faces_decomposition.html#sphx-glr-auto-examples-decomposition-plot-faces-decomposition-py)\n", "- [Collaborative Filtering, eg movie recommendations](http://www.quuxlabs.com/blog/2010/09/matrix-factorization-a-simple-tutorial-and-implementation-in-python/)\n", "- [Audio source separation](https://pdfs.semanticscholar.org/cc88/0b24791349df39c5d9b8c352911a0417df34.pdf)\n", "- [Chemistry](http://ieeexplore.ieee.org/document/1532909/)\n", "- [Bioinformatics](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-015-0485-4) and [Gene Expression](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2623306/)\n", "- Topic Modeling (our problem!)\n", "\n", "\n", "\n", "(source: [NMF Tutorial](http://perso.telecom-paristech.fr/~essid/teach/NMF_tutorial_ICME-2014.pdf))" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "**More Reading**:\n", "\n", "- [The Why and How of Nonnegative Matrix Factorization](https://arxiv.org/pdf/1401.5226.pdf)" ] }, { "cell_type": "markdown", "metadata": { "heading_collapsed": true }, "source": [ "### NMF from sklearn" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "First, we will use [scikit-learn's implementation of NMF](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.NMF.html):" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": true, "hidden": true }, "outputs": [], "source": [ "m,n=vectors.shape\n", "d=5 # num topics" ] }, { "cell_type": "code", "execution_count": 363, "metadata": { "collapsed": true, "hidden": true }, "outputs": [], "source": [ "clf = decomposition.NMF(n_components=d, random_state=1)\n", "\n", "W1 = clf.fit_transform(vectors)\n", "H1 = clf.components_" ] }, { "cell_type": "code", "execution_count": 296, "metadata": { "hidden": true }, "outputs": [ { "data": { "text/plain": [ "['jpeg image gif file color images format quality',\n", " 'edu graphics pub mail 128 ray ftp send',\n", " 'space launch satellite nasa commercial satellites year market',\n", " 'jesus matthew prophecy people said messiah david isaiah',\n", " 'image data available software processing ftp edu analysis',\n", " 'god atheists atheism religious believe people religion does']" ] }, "execution_count": 296, "metadata": {}, "output_type": "execute_result" } ], "source": [ "show_topics(H1)" ] }, { "cell_type": "markdown", "metadata": { "heading_collapsed": true }, "source": [ "### TF-IDF" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "[Topic Frequency-Inverse Document Frequency](http://www.tfidf.com/) (TF-IDF) is a way to normalize term counts by taking into account how often they appear in a document, how long the document is, and how commmon/rare the term is.\n", "\n", "TF = (# occurrences of term t in document) / (# of words in documents)\n", "\n", "IDF = log(# of documents / # documents with term t in it)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "hidden": true }, "outputs": [], "source": [ "vectorizer_tfidf = TfidfVectorizer(stop_words='english')\n", "vectors_tfidf = vectorizer_tfidf.fit_transform(newsgroups_train.data) # (documents, vocab)" ] }, { "cell_type": "code", "execution_count": 263, "metadata": { "collapsed": true, "hidden": true }, "outputs": [], "source": [ "W1 = clf.fit_transform(vectors_tfidf)\n", "H1 = clf.components_" ] }, { "cell_type": "code", "execution_count": 255, "metadata": { "hidden": true }, "outputs": [ { "data": { "text/plain": [ "['don people just think like know say religion',\n", " 'thanks graphics files image file program windows format',\n", " 'space nasa launch shuttle orbit lunar moon earth',\n", " 'ico bobbe tek beauchaine bronx manhattan sank queens',\n", " 'god jesus bible believe atheism christian does belief',\n", " 'objective morality values moral subjective science absolute claim']" ] }, "execution_count": 255, "metadata": {}, "output_type": "execute_result" } ], "source": [ "show_topics(H1)" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "hidden": true }, "outputs": [ { "data": { "text/plain": [ "[]" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXcAAAD8CAYAAACMwORRAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAIABJREFUeJzt3XmUHGd5LvDn9Sh2SMBhsTC+XpCdCBxBbC5MBOFwCZvx\nlhOZYwdksyUBdA0xCQEuCAzGAUNsjBcEsibyAsgYy5tsCUu2bMmWJWsZaSRLo3Wk2aRZJM0izb5P\nv/ePrh7V9FR3fdVd1VVd/fzO0VFPdXX1V11Vb331raKqICKieDkt7AQQEZH/GNyJiGKIwZ2IKIYY\n3ImIYojBnYgohhjciYhiiMGdiCiGGNyJiGKIwZ2IKIamhfXFZ511ls6YMSOsryciKkrbt2/vUNXp\nbuuFFtxnzJiBqqqqsL6eiKgoichhk/VYLENEFEMM7kREMcTgTkQUQwzuREQxxOBORBRDDO5ERDHE\n4E5EFENGwV1ErhCRGhGpFZH5Gdb5iIjsFJG9IvKKv8kkonwdOt6LyvrOsJNBBeLaiUlEygAsBHAZ\ngGYA20Rkharus63zRgD3AbhCVY+IyFuDSjAR5eaye9YDABpvvzrklFAhmOTcZwOoVdV6VR0BsBTA\nnLR1bgCwTFWPAICqtvmbTCIi8sIkuJ8LoMn2d7O1zO4dAN4kIutEZLuIfMFpQyIyT0SqRKSqvb09\ntxQTEZErvypUpwF4H4CrAVwO4Ici8o70lVR1saqWq2r59Omu494QEVGOTIJ7C4DzbX+fZy2zawaw\nWlX7VbUDwHoAl/qTRIqqBzbU47UjJ8NOBhE5MAnu2wDMFJELReR0AHMBrEhbZzmAD4nINBH5MwDv\nB7Df36RS1Ny2cj8+dd+msJNBRA5cW8uo6piI3ARgNYAyAA+p6l4RudF6v0JV94vI8wCqASQAPKCq\ne4JMOBERZWY0nruqrgKwKm1ZRdrfdwK407+kERFRrthDlYgohhjciagkdQ+Ohp2EQDG4E1HJeba6\nFZf+1wvY2dQVdlICw+BORCVnY21yjJ29rd0hpyQ4DO5ERDHE4E5EFEMM7kREMcTgTkQUQwzuREQx\nxOBORBRDDO5ERDHE4E5EFEMM7kRUslTDTkFwGNyJqOSIhJ2C4DG4ExHFEIM7kQc1x3oxPDYedjKI\nXDG4Exlq6xnC5feuxw+f4SRjFH0M7kSGeoaS439vP8xJwSn6GNyJiGKIwZ2IKIYY3ImIYojBnYgo\nhhjciYhiyCi4i8gVIlIjIrUiMt/h/Y+ISLeI7LT+3eJ/UomIyNQ0txVEpAzAQgCXAWgGsE1EVqjq\nvrRVN6jqPwSQRiIi8sgk5z4bQK2q1qvqCIClAOYEmywiIsqHSXA/F0CT7e9ma1m6D4pItYg8JyLv\n8iV1RESUE9diGUM7AFygqn0ichWAZwDMTF9JROYBmAcAF1xwgU9fTURE6Uxy7i0Azrf9fZ61bIKq\n9qhqn/V6FYA/EZGz0jekqotVtVxVy6dPn55HsomIKBuT4L4NwEwRuVBETgcwF8AK+woi8jaR5AjJ\nIjLb2m6n34klIiIzrsUyqjomIjcBWA2gDMBDqrpXRG603q8AcB2Ar4rIGIBBAHNV4zzHCRFRtBmV\nuVtFLavSllXYXv8awK/9TRoREeWKPVSJiGKIwZ2ISlacy44Z3Ik8inNAKBUlMD82gzuRuVIICRQX\nDO5ERDHE4E5EFEMM7kREMcTgTkQUQwzuREQxxOBORBRDDO5ERDHE4E5EFEMM7kREMcTgTkQUQwzu\nREQxxOBORBRDDO5ERDHE4E5EFEMM7kREMcTgTkQUQwzuREQxxOBO5BXn2aMiwOBOZEg4y178aHzv\n1AzuRFRySuFGbRTcReQKEakRkVoRmZ9lvb8VkTERuc6/JBIRkVeuwV1EygAsBHAlgFkArheRWRnW\nuwPAC34nkoiIvDHJuc8GUKuq9ao6AmApgDkO630dwFMA2nxMHxER5cAkuJ8LoMn2d7O1bIKInAvg\nUwAW+Zc0IiLKlV8VqvcC+K6qJrKtJCLzRKRKRKra29t9+moiIko3zWCdFgDn2/4+z1pmVw5gqSSr\noM8CcJWIjKnqM/aVVHUxgMUAUF5eHt82SEREITMJ7tsAzBSRC5EM6nMB3GBfQVUvTL0Wkd8CeDY9\nsBMRUeG4BndVHRORmwCsBlAG4CFV3SsiN1rvVwScRiIi8sgk5w5VXQVgVdoyx6Cuqv+cf7KIiCgf\n7KFKRBRDDO5ERDHE4E5EFEMM7kREMcTgTkQUQwzuREQxxOBO5BG7VlMxYHAnMlQC8ztQjDC4ExHF\nEIM7kSEWx1AxYXAn8ojFM/ER5xs2gzsRlRwpgVs0gzsRUQwxuBMRxRCDO5HPrl20Cfetqw07GVTi\nGNyJfLb98En8/PmasJNBJY7BnYgohhjciYhiiMGdiCiGGNyJiGKIwZ2oiC3Z3IgZ81eia2Ak7KRQ\nxDC4ExWxP1QeAQC0dg2FnBKKGgZ3IqIYYnAnIooho+AuIleISI2I1IrIfIf354hItYjsFJEqEfmQ\n/0klioY4jyRI8THNbQURKQOwEMBlAJoBbBORFaq6z7baWgArVFVF5BIAjwO4OIgEE4Ul/uMIUpyY\n5NxnA6hV1XpVHQGwFMAc+wqq2qeqqQzNn4OZGyKiUJkE93MBNNn+braWTSIinxKRAwBWAvhXpw2J\nyDyr2Kaqvb09l/QSEZEB3ypUVfVpVb0YwDUAfpJhncWqWq6q5dOnT/frq4mIKI1JcG8BcL7t7/Os\nZY5UdT2Ai0TkrDzTRkREOTIJ7tsAzBSRC0XkdABzAaywryAifyUiYr1+L4AzAHT6nVgiIjLjGtxV\ndQzATQBWA9gP4HFV3SsiN4rIjdZq1wLYIyI7kWxZ8xlbBSsR0RTdg6O4d81BjCfCCxVxjlKuTSEB\nQFVXAViVtqzC9voOAHf4mzQiirMf/3EfntrRjL8+50xc/q63FfS7pQTatbKHKhGFYnB0DAAwNh7j\n7HOIGNyJiGKIwZ2IKIYY3IkiqLNvGDubusJORkEoO7QHgsGdKILmLNyIaxZuDDsZgRKO1hMoBnei\nCGo+ORh2EqjIMbgT5aClaxDv+fELqG/vCzspRI4Y3Ily8OyuVnQNjGLptib3lSmrOHckChODOxGF\ng0XugWJwJ/KII2tQMWBwJzIkpdBnPU9Do+PoGRoNOxkEBnci8tGVv9yAS259wdNn+BwUDAZ3IvJN\nQ0e/8bp8DgoWgzu5qmvvw6X/9QJau9j2mqhYMLiTq0e2HEH34ChW7T4adlKIyBCDOxGFiq2PgsHg\nTkShYOujYDG4kyteg0TFh8GdKAY4bG5h7ThyEu/7yYvoHohum34GdzIWpaLRsfEEhsfGw05G6Fi0\nkZ9cy/t/tfYQOvtHsP3ICZ9T5B8Gd3IVxfBxwwOVeOcPng87GZSHMM+r1HePJRSj44kQUxIcBncy\nFqVH/60N0c0xFRJbmuTntpX7MfPm58JORiAY3MkVn/yjj7MaUTqj4C4iV4hIjYjUish8h/c/KyLV\nIrJbRDaJyKX+J5WoeI2NJ7B4fR2GRllPkC5ODx8Hj/fiF6trIvFE5RrcRaQMwEIAVwKYBeB6EZmV\ntloDgL9X1b8B8BMAi/1OKIUvAudr0Vq2owU/W3UAC9YeCjspvvq/D1fhY79Yl9Nn4/hEeP3iLfj1\ny7XoHgy/Fc00g3VmA6hV1XoAEJGlAOYA2JdaQVU32dbfAuA8PxNJ4WKLjPwNjIwBAPqHx0JOib9W\n7z0edhIiJVU5G4ViMpNimXMB2OcSa7aWZfIlAPGsoShxzLjnrxR/w46+Ydz94kEkEqWz91FofGCS\nczcmIh9FMrh/KMP78wDMA4ALLrjAz6+mAIWfByl+pfz0890nq7H2QBv+7qK34O/+8i1T3o9CIPRL\nlI6zSc69BcD5tr/Ps5ZNIiKXAHgAwBxV7XTakKouVtVyVS2fPn16LuklCl18QlFhDFqVyIm0Spvo\nhMF4Mgnu2wDMFJELReR0AHMBrLCvICIXAFgG4POqetD/ZFIUlHqFqkkwqmrM3v6+FH9D7nM4XItl\nVHVMRG4CsBpAGYCHVHWviNxovV8B4BYAbwFwn/VYMqaq5cElmwqKWSxjrx3pclweoaf10KT/BBGI\nf76L0nE2KnNX1VUAVqUtq7C9/jKAL/ubNKL4iEJOjswMj40jkQBed3pZ2EnJC3uokrE4VXxR4WQ6\nbyKUyZ3kk/esx1/fkt+4RVG4UhjcyVUU2uxGgckFmzGQ8SeMbjRPc7hzIOfPRmkXGdzJmL1ooXco\n/B54YbFfwFHoZk7khME9B4mE4rndR0umU4ZTrvMHz+wpfEJiIIpFWx19wzjWPRTY9t3uf3G8P0bh\nps/gnoNHth7BVx/ZgceqmtxXjqnOvhHPn3l06xE8sKE+gNSEJ73TSqZrOkqP6+nKb1uDD/z32sC/\nJ714L0odfvwSpX1icM9BW08yl9PeOxxySorL95btxm0r94edjIKpajyBxo7+sJMRuvDzsOFq7x3G\ngWM9Bf9eX4cfKDURePIqiOjkRaLPnnG7rmIzAKDx9qsnlpXKOeMkQpnawKUOc82xXlx+73oAk8+D\nQmDOPQcldI6WtERC8fi2Jk/TsJVy8Pbisw9swdOvJUcxidNvlh4bfrupMYxkAGBwpxyVQi7syR3N\n+M5T1Vi8Pl71BAXnELw31joOP1UwQZeNO92wmk8O4PdbDgf6vXYM7nnI1PJhaHQc//rbbahr7ytw\nishP3QPJ5p4n+71XHk9RCndDF6XwC2Q7zNffvwU/eGZPwZoRM7jnwuVCrWw4gZcOtOHWFXsLlKDC\niELzLgrf1oYTnoqqSpk9VHT1J4N6oVpQM7jnoVRiXalnOr0cZrd17e9vP3wCI2PFFSR3N3fj0/+z\nGT9//oDxZ9za9gd1GfUNj6G2rXBPz6qKDquJsOM+F/g6YnDPQZRiXW1bLz7/YCUnXo649HPm4PFe\nXLtoM362qriahnb0J5v/HjzuPWgWug345x+sxCfufiWQbbd2Te309ZuNjVOWOe4xc+7RF4WM+3/9\ncR82HOpAZUP2ccT9UCpPKumCCEknrHL8fUcL3/650MI6bzINv5yvo92DqDneO2X5QYdldoXOFDK4\n5yCKxRRBlodz4LDJfP2lS/SGWcw6es0r2J1iRaGGoGBwLxIn+0cwY/5KLN85eYbDQj7qlnoc8vOX\nLobb5S3L9+CJDENs5HIuZDpV41JR73YpFrpYisHd0PDYOBa+XDu5AizDSWl6COva+zBj/kq8dOC4\n67r1Vjf2TJ0igrw8nM7JKI2hQcFYsvkw/t+T1ZOW5XLU4xG6c+P01FuoexmHHzD0wIYG3Lm6BmdM\nO823YopUmeCz1UfxsYvPzmkbDLHFJf3C9usRPYzRJhs8jpsT/3PVtodOjWUk41uBYM7d0MDIGABg\ncORUq5RMB8nzwfPjaBfgjLEHJj8u1LtfqMGL+9yfWsLmqSlkplEhJf1vf0JdLtu5Zbk/wzX3D4/5\nsp2oaut1HgY51xspK1QjKpVbV/hXoerHZk7lBoKsUA3Ggpdq8ZUlVQFtPaomH6cwipuXbM6/C/z6\ng+3G14FbmXpUi23aeryN+upcfDl1WaHqGBjcDfldxNzQ0Y9vPbELQH45w0LmBnY1n2paxiJ3c6o6\npSivkJ1r/JTPE0e+58yrhzrQ0jWY30YKxOmaZoVqxNmDaz4TM7yw95in73U7L4LMDIyMJzf+0oG2\ngnxf3Gw41DFl2fef3h1CSvxV6Cayn3uwEh+/a11Bv9MLp1/DaRnL3CMmdZCCKP7I5zGNrVaiQ1Vx\non/qo3z3YOaBoor5HmlcLOPjdw6NFsdwDU6XNMvco8rnIGrfnC/1qQFGCdOyRFNxadecbsnmw7h/\nQ8OU5UHef8P8Lb3vl/MHvvNkNY50DuSdnmyGRsfROzSKtt4h4xnUMv20ppXmyWXhNYU0Cu4icoWI\n1IhIrYjMd3j/YhHZLCLDIvJt/5MZHare7sCb6jrwrlueR09Aw3wWIjfg93es3H3U5y1GwysH2z1/\nxq/gXKgikpzauRvs4qPbjuSwZXOfvGc9/ubWFzD7p2vxtz9dE+h32dmPbyEaP9i5BncRKQOwEMCV\nAGYBuF5EZqWtdgLAvwP4he8pjICFL9dOlJFnOixtPUO44t71aLVV+KgC9645hP6RcexrnTyGiP1i\n9HJ9+9H8ct6SKsyYv9LDJ6bKJ5T42YSu5lj28TyiwORYN50YwMKXa0PNiT+1vRlbPYxR5PWGki2n\nn9rtldVHMWP+SnQN+DCGvs2RE8E+GQBpxzkCBW4mOffZAGpVtV5VRwAsBTDHvoKqtqnqNgCFGYW+\nwO5cXYMDDkHEfgAfr2rCgWO9eKTSuZlZtmt2xa5WzHbJTUycNqp4bvfRiZ6yE7kB6wu+t2w3PnLn\ny1m39YLHtuVRK9a3PwWFPWSufZambIE5Y9d76/8v/W4b7lxdg+aT4bUG+dYTu/Dp/9mcdR37Hno9\nL7JdA6lr6YFXk79nXXvmDlIdfaeKVVQVy3Y0T/RDWbK5EbNueT7te3MLtKk0barryHnU1TDrxEyC\n+7kA7ANMNFvLPBOReSJSJSJV7e3eH2EjQdXxpE4NwD+po49Mroht6x3C1Qs24Gj34JRttBmWA+5q\n7sZXH9mBu16sSX3LpPcf3XoEjT6XX/r9yJ9P5nRkLIFLbn1h4u8gr51M2x4bV/zqpdopy/PJqw2M\nRGvI5nyf7HJlcm58+OenMi9Vh0/im4/vwo+WJyfGuWX5Xt9+S1Xg0PFe3HB/pS8T76TGei+Uglao\nqupiVS1X1fLp06cX8qt9Yz/37Cfi3S8enPI+MDlAPL6tCXtbe3yZR7ElwBxeIqHYcKg9a47n5Zrc\nb857W3uwPofyaQAYcZgBaMb8lfjxH/cBAH60fE8ggWlLfSf6rOKklq7BidcmJlWeh/+0nhfJ8Nro\ns9k+oJP+y7quPXinjkN65sh+7ub6mytOtXQ6ZPVLeOa1lomJvbN+Ntt3RmhsmRYA59v+Ps9aVpJU\nc3zUCvGiPtzZj7LTBOe96c+M1v/Npkb85Nl9qPjce3HFu8/xPXf88JbDeNh2g6s51ou+4VG87+1v\nznmbD21swLwPX4TfZeh9ubmuE1sbTuDkwAje8KfT8K1PvtN4292Do5i7eAvOf/Prsq6XT7+HYuTn\neaEAjnQOYFdTsqOc8aYDvK4Sk24QydffeGynp21kGuivEEyC+zYAM0XkQiSD+lwANwSaqiLhdF5N\n7UEqGdfN97tbuwaxZv9x1+3//Z3rAACNt19ttO0jncnyzmPdybE1gg5Ol9+7HoB5+jK58ffbM753\n/f1bJv3tJbgPW+X6TSecn5Zq2/pw7aJNOPvMM1y35bUz2sfvWoe13/qISTIBAGv2HUdZmeCj73yr\n8WeCZnruX1exaeJ1KgP1eFUTZp1zJt597l8E/v3peofG8PozyozXD6PdfzauwV1Vx0TkJgCrAZQB\neEhV94rIjdb7FSLyNgBVAM4EkBCRbwCYparxn2Ymjb2SNZnLd17Pa+4/ff2ugRF88PaXPKcvbipe\nqZt4HdRAVm5H6vdbDqN7cDRrZ6WUKaNCWn9nOh2yVSw6+bI1Vk++N0ozHs/hLO+pKroGpv5+37GG\nHPa6P34Xf3ndXNZSmSi1c1fVVar6DlX9S1X9qbWsQlUrrNfHVPU8VT1TVd9ovY5lYDdp4tRo5Xzt\nrTr8PqAbazsnp8tg+5+6b6O/iQhBeoCwT1MXVOWq+yQMBtvIMw3ffGwnPvdApfH6QYzBsq3xhOdh\nfgEYnZz5jpmUvZllPhefeUqiVvzGHqousp0Yjm8p8Ms1hwAA1c3dBey44L79oOaUDEptWx9W7GrN\nvtKkYYinXl61bX2Bt/xwa000uTey9/Ngxa5WLHutBa/WdqBveAzDY+6tQX6xusZ1Ha/+qWIzfmRr\nNeL1ZurladVt1VTZfCZuv3KPwVNWUCLTianU3beubtLf2YpaAIfWMhku/EaHHNDYeMJTK4yghd2w\n4xN3v4J/f/S1rOvY0zjkEPS2H3bulJNIqHGOzi0kuccscW3nfupvxXhi8lL7b/DuH63GtYs2Id3g\nqFnQjyrT6yZlzsLkU2iz9YQynGXMGaejvMyoxYu3Fjf2m1cUhthgcHfx6NbJ3aInNYU0CH+v1k4d\nEXDhy3WTWouk/MfSnXj3j1Y7bmdva3fW7/HjXFq2oxkz5q/EYI4dNgohWyBN79B0uLMf333KefTF\ni76/yjFIOn+nS87caCunpAfv5DZObcWtmeielqklntcu2ox//FVhi91M99v41LRt0CSTr6oTE49s\nrk8vpsz/glDDdHjebpTK3GmybLkKzdDJyUS2MVduftqf2XOyWbA2WZx0zOMkBWHKdhH/cHn2jic7\nbMVU4wnFi/uOO27P7XCedpp5sUxCgVcOnho62elKdwr+JmqOF3YoBs+NAlze99rb2HScp5yDqTq+\nNPto+Bl3Bnc36edvzh0iAj7Yqz2OD+8k1bN1SoCL2vgDNtl647rE3EkqXqnDV5ZUeR6aATALMMet\nG+aT25sxNj71ZLA/BUb45w5MZUOn+0ppEqoZbzC+jLQK9fRUFrWmkAzuLtJz6Wp0O3cY5tO/JDl6\nZqdLxaMHrUUy2026o92T57w8zUOUTI3pYh+3ZIJrobv79u09ivtHTtWr5Hpe/MOvNuDhzY05fjrp\n+T3H8qpsdmvamJLKkSey5HDq2ry3wlGXNJxarzDh1B4rgmu9Y47B3aORsQQGrYvT6fE523H7/ZZg\nhzX1S7a21eW3vZhxcLQg3XD/FsyYv9LTE5BJzn3HkZOu67i2hnF9HzjNdqX952O7sm7D5J60p6XH\ntdgpZV9rD3Y6tC55bk9+Qy9nS+ez1ae2nRp0b7mVAWnrcZ54Otu2X3M4Tm6NG/LlNQZHbZgJBncX\nTsUyC6xBo1Y5lJHvbnGu+OwaGMExg5M6iuw/QUffSEHK/9Ntqks+tj9R1eSyZlJ9e59RmfDe1vy7\nY5gFmPDKWq5asAHXLCxsZWtqyF57X4++4TF09A1j9s/WGm3jRP+pgbY+/+DWKe8ni00yFMuo82sv\nJn3OYCPGFcysUI2+4bEE7l1zEGv3nyqnrWw44ZhjdCpnJWf/+dhOLEprgpry3B6zuoXr79/iqcw9\nHwddxpQXydIU0uG0KPTcpLnK2rAAyUng7SN4AkBnhpER01toCWTSHAhOTYT9DJKX3Dq1lVqytUz2\nY7Fk8+GJ8eejVldiMrZMSTucpcJOkZyMI+7cTtrOvmG85fXu46qYyjbqXqXhZBKDI+OeytxTbn56\nDz77/rejsr4Tt63cD8B5JEq7tbaJwzPJdKPZ3dKN7YdPFTlE4XHeVLafN5FQNHT0TV4f5uXfWxs6\n8Y6z32CQCOfFXsvZe4ay9y/JtLV11uiorV1DkZvPmDl3j45222dacj7kTmOzD0Sw7bi9DX+2yQjc\nAs77bluD2rZozYikMMvl7zh8EjfcvwXjickB/JuPnyoX/6NbL1kD2W401y7alHOuz2kSmbBM6vQD\n51xvwrC1461/3OcaLBOqk5pP+t0T+StLqrB8p/kAuNEK7Qzunq3e672pHBDs+OtObn7aufNOwlYJ\n/L1lu3HgWPLRt/nk1CeU1LVqMuVZQ8fkzx863utpyja/9brkxFKefq0Fm+o6p4z46GcmbGh03NuF\nH3KUqKz33iwRwKRZpFSTUwfaiXjLUbsVq2Wd2cmHMncA+M3GRgDJoUScrpFsnOrkConBPQ9ezhkv\nJ3X6E8E91kQgXjxS6dwy5//YZrEBsnfbnkiPwfelN3O77J71rlO2RYk9/cNjuRXpZPL9DDfaMGVr\njvd1lyEfUtx+olvSWvOcHBj1FGirm116ZRtuZ8nmRqP16tv7sr5/9YJXs2/A9nt8/+nd+NojOxxX\nY4VqEQjqINm3OzaewC/X+leunz5aYM/QKJ6oasq6L9naJ6dkW2U8objj+QOmSZz6/Tn22PTipO3p\n5LZn9/uac+8dGjMujzVtux11TkfsRY8dxH66an/W98cNGyks22FWtPKxu17J+n62IZ3TW+5sODR1\n2JFCY4VqHjx1Rshh1ermLvzjr3NvwtZ0YgC/ein7jeE7T1bjaPcQ7vnMpTl/D5D8LTbWdmDFzlbc\ncd0lk95bsaslY+sXE24XuR8OHj+Va3Ma9ydo9oDuV8VcfXtfxqKB7JNV5y/TteFnhuhrf8g8OYud\nl58zn4H7zHuoFibrXnLBfWdTF9559hvwutPNZ1jJxFuxjId1NZl/e8mgFUY26UUwTlK9Ok/2Z8mV\nGCT+q7ZH0PTg7tRpx0nTiQGc/+apUwE++GqD0eej7HBn9h6Yfk9qDgA33F+ZsW9FtkM6dUIRxZb6\nqfUn9ptQtgp5O5OnQFPpcxrY5fo1pv0o0kkEG7CWVLFMe+8wrlm4Ed9+0izY+MlrheqidXWodLig\nsvFa4WP342f3TVm24VA7/uU3WzOW35vwUqRicjMqVtma1NoJkudpvsYT6jgEckplfedEj1E3D285\nPGWawnTps4JlCq6FulHbr4V6DxOM5FPXYpxzL1CZe0nl3FO5C7eB/k15OUjZRnxMt7GuM6cy6h88\n42/P0Zdrsg89m4n9kfyi76/yKznFzcOgUt9+Iv/Mh9skJzdkmdUpfXyd9IrRlE9XnKowt/cmBTIX\nPbhOvuKTy+5ZP/Hay2iTuXZ8O94zFLm8e9Hl3IfHxrFiV2tOg+9MzIrk050zqIk1vvjQ1K7WJtbl\nGIz91tqd+zALQc+aFBbTAPPRX6wLNiE+Sq+ctzc/PXA0Ou3vvci1vqOy4UTkRoUsupz7vWsOYdG6\nOrz+jDJ87OKzPX029cjlZ7kfTVXoNv3FoLkEfpOrFmyYeG0y01EU5Voso6roHza7gad3mAtK0eXc\nj1sVRLube1xnJ0pnD+4n+0cwY/7KKTMtpTR29OOBDfX5JbZEubUXJooq+0BnXiRU8ZtNZvUJn7h7\nfV71Y6aKLrhPswrF7llz0L1TgU19e99EV+KEAk3Wj/uHDJWFn1m8eWJsEfJm/rLoddohMnH7c7n1\nx7h/Q4PXBDStAAAF4ElEQVSn4t5CFFsVXXAvy1Ljoap4bNsR9Drcfa9asAH/bR24REKRasTh9BTW\n1js0MXMOEZHfFrj0P/GDUXAXkStEpEZEakVkvsP7IiILrPerReS9/ic1Kb1MrGtgBOsPtqO+vQ9P\nVDXju09N7vZb394HVcWQrZv9yYGRiQrZ6uZuDI2OJwO+FfH/+aFtQSWfiAj1WSbE8YtrhaqIlAFY\nCOAyAM0AtonIClW1N4y+EsBM69/7ASyy/vddes79PT9+cco6Gw51YO3+49hx5CQWvlyHa97zvya9\nn1BM6tJ/8Q+fn3h91z9din1H85/AgYgok6Ba2tmZtJaZDaBWVesBQESWApgDwB7c5wBYosns8BYR\neaOInKOqvg+Llq1Yxu5Lv6uaeO00v2imZoPf8qGNMRFR2EyKZc4FYO+T22wt87qOL6YVanodIqIi\nVtAKVRGZJyJVIlLV3p5bh5uPvvOtPqeKiKiw5n34osC/w6RYpgXA+ba/z7OWeV0HqroYwGIAKC8v\nz6kn0Qf/6iw03n51Lh8lIioZJjn3bQBmisiFInI6gLkAVqStswLAF6xWMx8A0B1EeTsREZlxzbmr\n6piI3ARgNYAyAA+p6l4RudF6vwLAKgBXAagFMADgX4JLMhERuTEaW0ZVVyEZwO3LKmyvFcC/+Zs0\nIiLKVdH1UCUiIncM7kREMcTgTkQUQwzuREQxxOBORBRDkst0db58sUg7gMM5fvwsAB0+JieqSmE/\nuY/xUAr7CERjP9+uqtPdVgotuOdDRKpUtTzsdAStFPaT+xgPpbCPQHHtJ4tliIhiiMGdiCiGijW4\nLw47AQVSCvvJfYyHUthHoIj2syjL3ImIKLtizbkTEVEWRRfc3SbrjjoRaRSR3SKyU0SqrGVvFpEX\nReSQ9f+bbOt/z9rXGhG53Lb8fdZ2aq3JyUObokpEHhKRNhHZY1vm2z6JyBki8pi1vFJEZhRy/6w0\nOO3jrSLSYh3LnSJyle29YtzH80XkZRHZJyJ7ReQ/rOVxO5aZ9jNWxxOqWjT/kBxyuA7ARQBOB7AL\nwKyw0+VxHxoBnJW27OcA5luv5wO4w3o9y9rHMwBcaO17mfXeVgAfACAAngNwZYj79GEA7wWwJ4h9\nAvA1ABXW67kAHovIPt4K4NsO6xbrPp4D4L3W6zcAOGjtS9yOZab9jNXxLLac+8Rk3ao6AiA1WXex\nmwPgd9br3wG4xrZ8qaoOq2oDkuPlzxaRcwCcqapbNHn2LLF9puBUdT2AE2mL/dwn+7aeBPDxQj+p\nZNjHTIp1H4+q6g7rdS+A/UjOhRy3Y5lpPzMpyv0stuBesIm4A6QA1ojIdhGZZy07W0/NXHUMwNnW\n60z7e671On15lPi5TxOfUdUxAN0A3hJMsj37uohUW8U2qeKKot9HqxjhfwOoRIyPZdp+AjE6nsUW\n3OPgQ6r6HgBXAvg3Efmw/U0rBxCrJkxx3CfLIiSLCN8D4CiAu8JNjj9E5PUAngLwDVXtsb8Xp2Pp\nsJ+xOp7FFtyNJuKOMlVtsf5vA/A0kkVNx61HPFj/t1mrZ9rfFut1+vIo8XOfJj4jItMA/AWAzsBS\nbkhVj6vquKomANyP5LEEingfReRPkAx4j6jqMmtx7I6l037G7XgWW3A3maw7skTkz0XkDanXAD4J\nYA+S+/BFa7UvAlhuvV4BYK5V834hgJkAtlqPyD0i8gGrHO8Lts9EhZ/7ZN/WdQBesnKQoUoFPMun\nkDyWQJHuo5WmBwHsV9W7bW/F6lhm2s+4Hc+C1t768Q/JibgPIlljfXPY6fGY9ouQrHXfBWBvKv1I\nlsWtBXAIwBoAb7Z95mZrX2tgaxEDoBzJk68OwK9hdUgLab8eRfIxdhTJcscv+blPAP4UwBNIVmRt\nBXBRRPbxYQC7AVQjeTGfU+T7+CEki1yqAey0/l0Vw2OZaT9jdTzZQ5WIKIaKrViGiIgMMLgTEcUQ\ngzsRUQwxuBMRxRCDOxFRDDG4ExHFEIM7EVEMMbgTEcXQ/wfAybDICFnzbQAAAABJRU5ErkJggg==\n", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "plt.plot(clf.components_[0])" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "hidden": true }, "outputs": [ { "data": { "text/plain": [ "43.71292605795277" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "clf.reconstruction_err_" ] }, { "cell_type": "markdown", "metadata": { "heading_collapsed": true }, "source": [ "### NMF in summary" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "Benefits: Fast and easy to use!\n", "\n", "Downsides: took years of research and expertise to create" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "Notes:\n", "- For NMF, matrix needs to be at least as tall as it is wide, or we get an error with fit_transform\n", "- Can use df_min in CountVectorizer to only look at words that were in at least k of the split texts" ] }, { "cell_type": "markdown", "metadata": { "heading_collapsed": true }, "source": [ "### NMF from scratch in numpy, using SGD" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "#### Gradient Descent" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "The key idea of standard **gradient descent**:\n", "1. Randomly choose some weights to start\n", "2. Loop:\n", " - Use weights to calculate a prediction\n", " - Calculate the derivative of the loss\n", " - Update the weights\n", "3. Repeat step 2 lots of times. Eventually we end up with some decent weights.\n", "\n", "**Key**: We want to decrease our loss and the derivative tells us the direction of **steepest descent**. \n", "\n", "Note that *loss*, *error*, and *cost* are all terms used to describe the same thing.\n", "\n", "Let's take a look at the [Gradient Descent Intro notebook](gradient-descent-intro.ipynb) (originally from the [fast.ai deep learning course](https://github.com/fastai/courses))." ] }, { "cell_type": "markdown", "metadata": { "heading_collapsed": true, "hidden": true }, "source": [ "#### Stochastic Gradient Descent (SGD)" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "**Stochastic gradient descent** is an incredibly useful optimization method (it is also the heart of deep learning, where it is used for backpropagation).\n", "\n", "For *standard* gradient descent, we evaluate the loss using **all** of our data which can be really slow. In *stochastic* gradient descent, we evaluate our loss function on just a sample of our data (sometimes called a *mini-batch*). We would get different loss values on different samples of the data, so this is *why it is stochastic*. It turns out that this is still an effective way to optimize, and it's much more efficient!\n", "\n", "We can see how this works in this [excel spreadsheet](graddesc.xlsm) (originally from the [fast.ai deep learning course](https://github.com/fastai/courses)).\n", "\n", "**Resources**:\n", "- [SGD Lecture from Andrew Ng's Coursera ML course](https://www.coursera.org/learn/machine-learning/lecture/DoRHJ/stochastic-gradient-descent)\n", "- fast.ai wiki page on SGD\n", "- [Gradient Descent For Machine Learning](http://machinelearningmastery.com/gradient-descent-for-machine-learning/) (Jason Brownlee- Machine Learning Mastery)\n", "- [An overview of gradient descent optimization algorithms](http://sebastianruder.com/optimizing-gradient-descent/)" ] }, { "cell_type": "markdown", "metadata": { "heading_collapsed": true, "hidden": true }, "source": [ "#### Applying SGD to NMF" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true, "hidden": true }, "source": [ "**Goal**: Decompose $V\\;(m \\times n)$ into $$V \\approx WH$$ where $W\\;(m \\times d)$ and $H\\;(d \\times n)$, $W,\\;H\\;>=\\;0$, and we've minimized the Frobenius norm of $V-WH$.\n", "\n", "**Approach**: We will pick random positive $W$ & $H$, and then use SGD to optimize." ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "**To use SGD, we need to know the gradient of the loss function.**\n", "\n", "**Sources**:\n", "- Optimality and gradients of NMF: http://users.wfu.edu/plemmons/papers/chu_ple.pdf\n", "- Projected gradients: https://www.csie.ntu.edu.tw/~cjlin/papers/pgradnmf.pdf" ] }, { "cell_type": "code", "execution_count": 272, "metadata": { "collapsed": true, "hidden": true }, "outputs": [], "source": [ "lam=1e3\n", "lr=1e-2\n", "m, n = vectors_tfidf.shape" ] }, { "cell_type": "code", "execution_count": 252, "metadata": { "collapsed": true, "hidden": true }, "outputs": [], "source": [ "W1 = clf.fit_transform(vectors)\n", "H1 = clf.components_" ] }, { "cell_type": "code", "execution_count": 253, "metadata": { "hidden": true }, "outputs": [ { "data": { "text/plain": [ "['jpeg image gif file color images format quality',\n", " 'edu graphics pub mail 128 ray ftp send',\n", " 'space launch satellite nasa commercial satellites year market',\n", " 'jesus matthew prophecy people said messiah david isaiah',\n", " 'image data available software processing ftp edu analysis',\n", " 'god atheists atheism religious believe people religion does']" ] }, "execution_count": 253, "metadata": {}, "output_type": "execute_result" } ], "source": [ "show_topics(H1)" ] }, { "cell_type": "code", "execution_count": 265, "metadata": { "collapsed": true, "hidden": true }, "outputs": [], "source": [ "mu = 1e-6\n", "def grads(M, W, H):\n", " R = W@H-M\n", " return R@H.T + penalty(W, mu)*lam, W.T@R + penalty(H, mu)*lam # dW, dH" ] }, { "cell_type": "code", "execution_count": 266, "metadata": { "collapsed": true, "hidden": true }, "outputs": [], "source": [ "def penalty(M, mu):\n", " return np.where(M>=mu,0, np.min(M - mu, 0))" ] }, { "cell_type": "code", "execution_count": 267, "metadata": { "collapsed": true, "hidden": true }, "outputs": [], "source": [ "def upd(M, W, H, lr):\n", " dW,dH = grads(M,W,H)\n", " W -= lr*dW; H -= lr*dH" ] }, { "cell_type": "code", "execution_count": 268, "metadata": { "collapsed": true, "hidden": true }, "outputs": [], "source": [ "def report(M,W,H): \n", " print(np.linalg.norm(M-W@H), W.min(), H.min(), (W<0).sum(), (H<0).sum())" ] }, { "cell_type": "code", "execution_count": 348, "metadata": { "collapsed": true, "hidden": true }, "outputs": [], "source": [ "W = np.abs(np.random.normal(scale=0.01, size=(m,d)))\n", "H = np.abs(np.random.normal(scale=0.01, size=(d,n)))" ] }, { "cell_type": "code", "execution_count": 349, "metadata": { "hidden": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "44.4395133509 5.67503308167e-07 2.49717354504e-07 0 0\n" ] } ], "source": [ "report(vectors_tfidf, W, H)" ] }, { "cell_type": "code", "execution_count": 350, "metadata": { "collapsed": true, "hidden": true }, "outputs": [], "source": [ "upd(vectors_tfidf,W,H,lr)" ] }, { "cell_type": "code", "execution_count": 351, "metadata": { "hidden": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "44.4194155587 -0.00186845669883 -0.000182969569359 509 788\n" ] } ], "source": [ "report(vectors_tfidf, W, H)" ] }, { "cell_type": "code", "execution_count": 352, "metadata": { "hidden": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "44.4071645597 -0.00145791197281 -0.00012862260312 343 1174\n", "44.352156176 -0.000549676823494 -9.16363641124e-05 218 4689\n", "44.3020593384 -0.000284017335617 -0.000130903875061 165 9685\n", "44.2468609535 -0.000279317810433 -0.000182173029912 169 16735\n", "44.199218 -0.000290092649623 -0.000198140867356 222 25109\n" ] } ], "source": [ "for i in range(50): \n", " upd(vectors_tfidf,W,H,lr)\n", " if i % 10 == 0: report(vectors_tfidf,W,H)" ] }, { "cell_type": "code", "execution_count": 281, "metadata": { "hidden": true }, "outputs": [ { "data": { "text/plain": [ "['cview file image edu files use directory temp',\n", " 'moral like time does don software new years',\n", " 'god jesus bible believe objective exist atheism belief',\n", " 'thanks graphics program know help looking windows advance',\n", " 'space nasa launch shuttle orbit station moon lunar',\n", " 'people don said think ico tek bobbe bronx']" ] }, "execution_count": 281, "metadata": {}, "output_type": "execute_result" } ], "source": [ "show_topics(H)" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "This is painfully slow to train! Lots of parameter fiddling and still slow to train (or explodes)." ] }, { "cell_type": "markdown", "metadata": { "heading_collapsed": true }, "source": [ "### PyTorch" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "[PyTorch](http://pytorch.org/) is a Python framework for tensors and dynamic neural networks with GPU acceleration. Many of the core contributors work on Facebook's AI team. In many ways, it is similar to Numpy, only with the increased parallelization of using a GPU.\n", "\n", "From the [PyTorch documentation](http://pytorch.org/tutorials/beginner/blitz/tensor_tutorial.html):\n", "\n", "\n", "\n", "**Further learning**: If you are curious to learn what *dynamic* neural networks are, you may want to watch [this talk](https://www.youtube.com/watch?v=Z15cBAuY7Sc) by Soumith Chintala, Facebook AI researcher and core PyTorch contributor.\n", "\n", "If you want to learn more PyTorch, you can try this [tutorial](http://pytorch.org/tutorials/beginner/deep_learning_60min_blitz.html) or this [learning by examples](http://pytorch.org/tutorials/beginner/pytorch_with_examples.html)." ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "**Note about GPUs**: If you are not using a GPU, you will need to remove the .cuda() from the methods below. GPU usage is not required for this course, but I thought it would be of interest to some of you. To learn how to create an AWS instance with a GPU, you can watch the [fast.ai setup lesson](http://course.fast.ai/lessons/aws.html)." ] }, { "cell_type": "code", "execution_count": 282, "metadata": { "collapsed": true, "hidden": true }, "outputs": [], "source": [ "import torch\n", "import torch.cuda as tc\n", "from torch.autograd import Variable" ] }, { "cell_type": "code", "execution_count": 283, "metadata": { "collapsed": true, "hidden": true }, "outputs": [], "source": [ "def V(M): return Variable(M, requires_grad=True)" ] }, { "cell_type": "code", "execution_count": 284, "metadata": { "collapsed": true, "hidden": true }, "outputs": [], "source": [ "v=vectors_tfidf.todense()" ] }, { "cell_type": "code", "execution_count": 285, "metadata": { "collapsed": true, "hidden": true }, "outputs": [], "source": [ "t_vectors = torch.Tensor(v.astype(np.float32)).cuda()" ] }, { "cell_type": "code", "execution_count": 286, "metadata": { "collapsed": true, "hidden": true }, "outputs": [], "source": [ "mu = 1e-5" ] }, { "cell_type": "code", "execution_count": 287, "metadata": { "collapsed": true, "hidden": true }, "outputs": [], "source": [ "def grads_t(M, W, H):\n", " R = W.mm(H)-M\n", " return (R.mm(H.t()) + penalty_t(W, mu)*lam, \n", " W.t().mm(R) + penalty_t(H, mu)*lam) # dW, dH\n", "\n", "def penalty_t(M, mu):\n", " return (M]" ] }, "execution_count": 292, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXcAAAD8CAYAAACMwORRAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAHrRJREFUeJzt3XmcXGWd7/HPj4RFZVNoGA1ggoJOXEBpEeciKiIEvHe4\njjqCzqiMGlHw6jjz0gguiI46LLJDiJmALBIUwppAQmLIYhbSHUJn7XRn7yS9ZOslne708rt/VFWn\nurq66lT36a6qk+/79corVeecOvU8fep86znPec4pc3dERCRajsh3AUREJHwKdxGRCFK4i4hEkMJd\nRCSCFO4iIhGkcBcRiSCFu4hIBCncRUQiSOEuIhJBI/P1xieffLKPHj06X28vIlKUysvLd7l7Sbbl\n8hbuo0ePpqysLF9vLyJSlMxsS5Dl1C0jIhJBCncRkQhSuIuIRJDCXUQkghTuIiIRpHAXEYkghbuI\nSAQp3EWKWOOBDl54Y0e+iyEFKG8XMYnI4P3wyRXMWVfP+95xPGeWHJvv4kgBydpyN7MpZlZvZqv6\nmf8VM6sws5VmtsjMzgm/mCKSzvZ9BwBo6+jOc0mk0ATplnkYGJdh/ibgE+7+AeBXwKQQyiUiIoOQ\ntVvG3eeb2egM8xclPV0CnDb4YomIyGCEfUL1G8BLIa9TRERyFNoJVTP7FLFwvzDDMuOB8QBnnHFG\nWG8tIiIpQmm5m9kHgcnAle6+u7/l3H2Su5e6e2lJSdbbEYuIyAANOtzN7AxgGvCv7r5+8EUSEZHB\nytotY2ZPAJ8ETjazGuAXwJEA7j4R+DlwEnC/mQF0unvpUBVYRESyCzJa5uos878JfDO0EomIyKDp\n9gMiIhGkcBcRiSCFu4hIBCncRUQiSOEuIhJBCncRkQhSuIuIRJDCXUQkghTuIiIRpHAXEYkghbuI\nSAQp3EVEIkjhLiISQQp3EZEIUriLiESQwl1EJIIU7iIiEaRwFxGJIIW7iEgEKdxFRCJI4S4iEkEK\ndxGRCFK4i4hEUNZwN7MpZlZvZqv6mW9mdreZVZtZhZl9OPxiiohILoK03B8GxmWYfzlwVvzfeOCB\nwRdLREQGI2u4u/t8YE+GRa4EHvGYJcCJZvb2sAooIiK5C6PPfRSwLel5TXyaiIjkybCeUDWz8WZW\nZmZlDQ0Nw/nWIiKHlTDCfTtwetLz0+LT+nD3Se5e6u6lJSUlIby1iIikE0a4Pw98NT5q5gKg0d13\nhrBeEREZoJHZFjCzJ4BPAiebWQ3wC+BIAHefCMwArgCqgVbgmqEqrIiIBJM13N396izzHbgutBKJ\niMig6QpVEZEIUriLiESQwl1EJIIU7iIiEaRwFxGJIIW7iEgEKdxFRCJI4S4iEkEKdxGRCFK4i4hE\nkMJdRCSCFO4iIhGkcBcRiSCFu4hIBCncRQpQY2sHlbXN+S6GFDGFu0gB+sLERVx25/x8F0OKmMJd\npABV1bfkuwhS5BTuIiIRpHAXEYkghbuISAQp3EUiwPF8F0EKjMJdpIiZWb6LIAVK4S4iEkGBwt3M\nxplZpZlVm9mENPNPMLMXzOwNM1ttZteEX1QRSeWu7hhJL2u4m9kI4D7gcmAscLWZjU1Z7Dpgjbuf\nA3wSuN3Mjgq5rCLSD0PdM9JbkJb7+UC1u29094PAVODKlGUcOM5iHYDHAnuAzlBLKiIigQUJ91HA\ntqTnNfFpye4F/h7YAawEvu/u3aGUUEREchbWCdXLgBXAO4BzgXvN7PjUhcxsvJmVmVlZQ0NDSG8t\nIiKpgoT7duD0pOenxacluwaY5jHVwCbgvakrcvdJ7l7q7qUlJSUDLbOIiGQRJNyXAWeZ2Zj4SdKr\ngOdTltkKfBrAzE4F3gNsDLOgIiIS3MhsC7h7p5ldD8wERgBT3H21mV0bnz8R+BXwsJmtBAz4sbvv\nGsJyi4hIBlnDHcDdZwAzUqZNTHq8A7g03KKJiMhA6QpVEZEIUriLiESQwl1EJIIU7iIiEaRwFxGJ\nIIW7iEgEKdxFRCJI4S4iEkEKdxGRCFK4i4hEkMJdRCSCFO4iIhGkcBcRiSCFu4hIBCncRUQiSOEu\nIhJBCncRkQhSuIuIRJDCXUQkghTuIiIRpHAXEYkghbuISAQp3EVEIkjhLiISQYHC3czGmVmlmVWb\n2YR+lvmkma0ws9VmNi/cYoqISC5GZlvAzEYA9wGfAWqAZWb2vLuvSVrmROB+YJy7bzWzU4aqwCIi\nkl2Qlvv5QLW7b3T3g8BU4MqUZb4MTHP3rQDuXh9uMUVEJBdBwn0UsC3peU18WrKzgbea2atmVm5m\nX023IjMbb2ZlZlbW0NAwsBKLiEhWYZ1QHQmcB3wWuAz4mZmdnbqQu09y91J3Ly0pKQnprUVEJFXW\nPndgO3B60vPT4tOS1QC73X0/sN/M5gPnAOtDKaWIiOQkSMt9GXCWmY0xs6OAq4DnU5Z5DrjQzEaa\n2ZuBjwJrwy2qiIgElbXl7u6dZnY9MBMYAUxx99Vmdm18/kR3X2tmLwMVQDcw2d1XDWXBRUSkf0G6\nZXD3GcCMlGkTU57fCtwaXtFERGSgdIWqiEgEKdxFRCJI4S4iEkEKdxGRCFK4i4hEkMJdRCSCFO4i\nIhGkcBcRiSCFu4hIBCncRUQiSOEuIhJBCncRkQhSuIuIRJDCXUQkghTuIiI56Op2fvbsKjbv2p/v\nomSkcBcRycGaHU08umQL1/1peb6LkpHCXSQCHM93EQ47ZvkuQWYKd5EiZoWeMJI3CncRkQhSuIuI\nRJDCXUQkghTuIiID4AV+DlvhLiKSg2I5hx0o3M1snJlVmlm1mU3IsNxHzKzTzL4QXhFFRApHobfY\nE7KGu5mNAO4DLgfGAleb2dh+lvtvYFbYhRSR9LxYkiaCCr0FH6Tlfj5Q7e4b3f0gMBW4Ms1y3wOe\nBupDLJ+IBGAUeNLIsAsS7qOAbUnPa+LTepjZKOBzwAOZVmRm482szMzKGhoaci2riIgEFNYJ1TuB\nH7t7d6aF3H2Su5e6e2lJSUlIby0iMvwKvUdsZIBltgOnJz0/LT4tWSkwNX4p9MnAFWbW6e7PhlJK\nEZECUeh97QlBwn0ZcJaZjSEW6lcBX05ewN3HJB6b2cPAiwp2EZH8yRru7t5pZtcDM4ERwBR3X21m\n18bnTxziMoqISI6CtNxx9xnAjJRpaUPd3b8++GKJiBSmQu9rT9AVqiIiA1Dofe8KdxGRASj0FrzC\nXUQkB4XeYk9QuIuIRJDCXUQkghTuIiIRpHAXEYkghbuISAQp3EVEIkjhLiISQQp3EZEIUriLiESQ\nwl1EJIIU7iIiEaRwFxEZAN04TEREhp3CXUQio6G5nXnrG/JdjIKgcBeRyPjSpMV8bcprw/JehX7r\nX4W7iETGxob9+S5CwVC4i4hEkMJdRPJm9Y5GRk+YTvmWvaGu14dhKItGy4iI9OPVytjJz9lr6/Jc\nkuhRuGfxYsUOzr7xJdo6uvJdFJHIKvRWcDEKFO5mNs7MKs2s2swmpJn/FTOrMLOVZrbIzM4Jv6j5\n8buX1nGwq5uG5vZ8F0UkcoZqxIm+LAKEu5mNAO4DLgfGAleb2diUxTYBn3D3DwC/AiaFXVBJr62j\ni5mra/NdDJFBcZTGYQvScj8fqHb3je5+EJgKXJm8gLsvcvfEGZElwGnhFjP/CrUlcPOLa/j2o+Us\n3xruCSmR4WAU+GDxIhYk3EcB25Ke18Sn9ecbwEuDKVQhKfQLFbbtaQWgua0zzyURGYQCbTxlUujZ\nMDLMlZnZp4iF+4X9zB8PjAc444wzwnzrIafDRpHwJQIy7L1rOPbWQj2aTwjSct8OnJ70/LT4tF7M\n7IPAZOBKd9+dbkXuPsndS929tKSkZCDlHXY6bBQZOmHvXYXemh5OQcJ9GXCWmY0xs6OAq4Dnkxcw\nszOAacC/uvv68IuZf4X+LS1SzIbjoqPDTdZuGXfvNLPrgZnACGCKu682s2vj8ycCPwdOAu632Fdn\np7uXDl2xh49aAiJDZ+iGQjrhHxcUl0B97u4+A5iRMm1i0uNvAt8Mt2hyuPv1i2uYvHATm3/32XwX\nRYaYGu7h0xWqAemzN/wmL9yU7yLIENM5raGjcM9CH73C09nVzVenvEbZ5j35LopIRm0dXTS1deTl\nvRXuUnR2NrYxf30DP3hyRb6LIgUq25H2gYNdTJy3ga7uoT0mv+T38/jgTbOG9D36o3AXKQJtHV3s\nbx/chWqPLdnCL19YHVKJwhVWxAY90r5j9np+99I6nn29z6juUNXsPTCk689E4R6QhmoVnsNpk5z/\nX7N53y9mDmodP312FQ/9bXM4BQpJvkajJa7oPhDhu70q3LOwgJ++bz9axuQFG4e4NHK4aor47SUG\n+kVd29jGlt19f1ov2/oOhyHOCveQzFxdx6+nr813MYbNss17OOeXs2g8kJ+TRRItA729xwW/ncMn\nbn2153nQxtjhQOEeUJg9AI0HOtiz/2CIaxx6+1p7l/eu2VU0HuigomZfTusp37KXD908i8bW/H4p\nbN3dyq4W3aM/bC+v2tnns5LJcIVx+ZY9TK/Y2Wf6YPbrQu8VVLhnMRQfvXN+OYsP/+qVIVjz0Ji5\nupZzb36F1zb1HXqY6+H03XOq2NvawfJtA79FcRh5cNGtcyn99ezBr0h61DW1ce1jy7n2sfKcXxv2\n+ZPUI4HPP7CY6/60vOf5YD5CxXJwoHAP6HA6eZdq6cZYqCe30gf9AT+M/54QG/0yesJ0nlsxtKM1\nhtPBzm4gtxEiod84LNcXDGDHLpYsULhnUyTf0kMpU39orp/zYmn1DLXaxjYAbp8VyfvsDalHl2zh\nAzcNbuRQGJ/DQv8oK9wDK5Kv6xBs3d1KfVNbn+np+kcHOkR0uO+PP2n+Bu6bWz2s75lJovaH+5dd\nLvV/tbKebXta+dmzqwrix2k6urrzXYSMQv2xjig6HPe9i26dC5Dxhl2JoM+55T7QQqWRyxfLb2as\nA+C6T707xBIMXKLsh+PnK8Hd2bHvQM9jd6er2xk5In2b8+sPLeOopHmZtn/Qj8ZAmhgrtsW6J6vq\nWwbw6uGjlntAizYc+v2R6voWRk+YzsaGwt64YUm3o+QzlKIw3O1Qy31wdVlX2wzACxU7qG/ue7RV\naLq6nR8/VUFVXTOPLN7CHxYcujncr6ev5d03vpTxlgAHQ2otZ7phWV1TW8YrV/cOYKRb4nzEcFK4\nB/Tz5w5dtp04CfZimqFVufj4LX/Ny0YfjOdWbKdmb+uhCRHprfqn+//G+EfKhu39El+YYX1NPfDq\nBs7/rzmUbynsm6ltaGjhybJtfPfx5X1GXz26eAsQvLsjjBOb6dbxlclL+cGTK2gZ5O0ekp390+H/\nWWl1ywzCYD9c2/YcoKGlnVEnvimcAg2D709dQclxR/O+dxwP5NZ33tzWwdzKhkGX4Y34YXGY3yvL\nt6Yfr19Z28zZpx4b6tHCwc5u/jA/fjVzyAch62qbOe+dbwt3pUMk+U/qDO/5h0zvVRc/39TfEUSx\nHDiq5T4AmbbtL19YzWd+Pw+IXay0dmdTxnWl9hu2tHcy4ekKmvNwm9DGAx2sr2vud36i3g3N7RyR\n6HPPIWFvm1nZ89g9doJs9ITpNDQHv5ioraOL7z6+PPuC/Vi6cTflWw6Nsa9vauPRxZvTLjt3XT2X\n3Tmfp5eHO1xx8sKNPFm2LdCy2/a08tsZwa98Ttfd0NTW0evCosvvWtBr/qzVtfwlYHlSfeCmmVx8\n+6sDeu0RKSmZeNod8EOVbqkwR8Gk69N//y9mcluRjHBSyz2LTC22dK3W5BszXTVpCWt3NuX0S0IP\nLdzE1GXbOOW4o/nhpe/pNe/p8houGXsqJ7zpyMDry8WXHlzc04ebUNvYxrravl9Qib9KLndMbU/q\ngqpramfWmloAVm7fx8XvPTXQOgY7QuFLk5b0ev7tx8p5vZ9W+4b4OZU1O5rgvNi0v1Xv4tzTT+Qt\nR+e+67S0d3LLy+sYccShz1S2LLr+idd7jlQg9gWc6/ZPveVsaoNj/KOxi46+WHp6TuuF2A24Bjpy\npVfL3aGto7vncbbXuYdzM79067AMDZfBdNWsq23i2dd38ONx7xmW80ZquQ9EwA2T2IleeGNHzm/R\nlfLJWrOjif/4yxv86Kk3Ar1++74DjJ4wnWnLa3qm7d1/MOMOkRrsELt3x5L4RUzJ1U58yHPZwY5I\nCrXHlmxhRHyF3QPM652Ngz+BmHobhJb2TloPxuqWaFkmWpI79h3gK5OX8h9/DrYNUj04bwOPLN7S\nqwGQ2Mkbmtv5c5rWc0fKOZl/vHdhxvcImhnr65pZvrX3VcJd3T7gwLx7ThUfv2VuTq/JVtSOru7A\nJy9Tj/4S1ejo6u5Vp32tB7n49lepqut/MMRQ5e4/T1zMxHkbhu0mcAr3LJK38+gJ0we0jsSJonRS\n96VEAKY2UBO3Jq1rSt+FsSE+gifRyq6Kd6/88M9vcMMzK9m0az8f+tUr/HHR5gHUIKY2aez70vjJ\nsFyiIHWfsZTwzKSto4tP3/5q2lsgJMxdV8/Xprw24C8ciB12j/35TMq37OnZyRPrS9xPfd76gZ03\n6ExzmJN49/GPlvGjpyrY2Zj56s4tu1szzg/q0jvm80/3L+o17V03zODfHl7W83zm6to+J2ib2jrS\n3izu96/031XR1e1Upwwb7Hbv1S2zIWnkWeLWxt+f+jofynCbjuS/5k+mVQC9u6VqG9s468aXuCTe\nTQpw7s2vsLFhP4s37iabsMcK9JxEH6Y+e4V7jpKDI1OGJH8RdGRomr6edMi9v72zZ/2pAZX4QKzY\nto87XlnP3HX1veY/Hz86eG5F7P/kQ/8/Ld3K5vhtUQdzQvPBeX1vadyeZrRPe2cXP5lW0efGXKl9\nrLPX1gF976nd0t7Z64jjB1NfZ+mmPWxo2M/NL67pt3zXPLyMeesbaGhp79Mq7U9/P4Dx+QcW95Q/\nkcmJuma6B/iBg109J+RSpdunE3+SRMvzT0u30h2gr+vxpVti3UUB3iOTbXt6f1nMrWxg255W1tc1\n8+1Hy/n8A4t7zf/gTbNidwMNeOO3tTubuH1WJZf8fh7V9c09w4c3NOzvdXSaPNQ4YcbK2p7Hf6ve\n1fM43X53sKvvxMRnfkND31sC96wr6fHcynrKt+zN2OeeTi43ShtOCvccveenLx/a+AFf8/rWff1e\nHdmY9MF4X9LJmtQz9ck77V1zqrjm4WW9xtknLqxYHd/hR6QE6b1/jb3/vPUNnHtzeD/79f+eeL3P\ntNtmVvLEa9v49ydXcMvL69jV0s76umYWVB36Ykmu3W2zKnu9/oZpK/lhUtfHsyt28LUprwHQFOAW\nw//nnoV9WqX9aT3Yf1DfN3cDkP7IoqJmH6MnTGfRhl1sbIgdNS2oauDLk5fw0d/MYVdLO1V1zdTs\nbc14B9D1dS29WrX3/LWaM2+YkbXcNz6ziivuXtBn+l1zqnoer6tt4rcvZT4Zm64r5eO3zOXSO+b3\nPF+zo4kDB7tYtb2xZ9oNz6zMWsaava1cftcC7n819nfc2djWqyst0RDJ5oZnVvKVyUv7TM+Wvble\nZHTNQ8v4/AOL2Bv/4kqsvrvbufyuBcxY2Xfo8zOv13Duza/0/G02NLRwz5wqDhzs4o4MRzPDQSdU\ns2jr7L3zp15E0dLeGai1dOvMSl5atZP/TDlJunl3K7fPquwzZj65VdPV7exv7xtCF98+r8+0Xc3t\nLNm4m60pLbLkESL7Wjv45wcX09nVzbTv/i+a2jp67bgJQbuh3J3VO5p4/6gTAHouTFlQtYsFVbt6\ndu5kySf19rT0Dr/aDH3pewO0GFO7rjJ1owQ5RE58z945+1Bw/uO9fwNiXUEnvvkoAB5fsrXn5Gzq\nHScznVS/8t6FvPUtR2UvSAA7G9u4c/Z6vnHhGL78h6Wh3Fr6irsXcOnYU5m1pq5n2vSVO/no4s1p\nl3eHjQ0tPLms9zmErm4f0MjPPy3dmnZ68siu+Snb2HF+9uyqrOvu6na27zuQtpXe2eXsb+9kxBHG\n2p1N/GBq39/sfbo8NpLqf9+zkN987gPcMXs9Dc3t1DW38diS9OUeLoHC3czGAXcBI4DJ7v67lPkW\nn38F0Ap83d0HPl4tz+ZW1tN0oIOquha27enbB5poHd89p4q751Rx1MhgB0Crtjfx9YeW9Zr2Pws3\npV22raOb0ROm8++XnM32fa38uawm7XKp1uxs4qqUESHpJPddX/f4chZU7cqwdGY/eqqCv5TX8Md/\nO5+PnXlSzq/v7Ha27m7lolvn8tDXP0J1jlf+bmxo4anyvn+fFyt2sGZHU9ovl4R9Ab4sEjt+ohsp\n1cT4+l9eXZt2fkJ/d0vcf7CL/QeD30kxW7fNnbOrqGtqD/U3A5KDPSH5wr5k2/cdSNvwCDrEEYI1\nLH78dEWv51t27+9pfF18W9/3T+fX09f2+yM7F/x2DkDP/p3u6tiFSd1FNzyzkuOPiUXqwkHsT2Gx\nbP1KZjYCWA98BqgBlgFXu/uapGWuAL5HLNw/Ctzl7h/NtN7S0lIvKxu6KwIXb9jNcceM7GlNjp4w\nnS+cdxq3ffGcrK8d6InTYvSNC8f0+wWTq1OOO5r6HMasF4sxJ7+Fad/5h7Qn9y5+7yn8NeX8RzpP\nf+cf+PwDwbqKoupjZ57EuPf/Hb94Ppwf6X7HCcewI4QRU8Ot4qZLOf6YgQ9nNrNydy/NulyAcP8Y\ncJO7XxZ//hMAd/9t0jIPAq+6+xPx55XAJ9293+vzBxPus1bXctHZJTyyeDPvPuXYnjHSX5y4iGWb\n93L8MSN7hhu9dsOnOeX4Y3oC+56rP8Tl7/87fvrsKr5/yVm8/YTY1aEbGlp4xwlv4pgjj2DMT7L3\neYqIDFQu176kChruQbplRgHJnWc1xFrn2ZYZBQzu5itp3DW7ijtm9z5RMfII6zXMLHkc6fm/mdNr\n2e8lnQCcumwbR44wuro9p4txREQG4zuPlfPAv5w3pO8xrKNlzGy8mZWZWVlDw8CG5P3LBWcAcO7p\nJwKxw+JvXXQmn37vKf2+5lsfH8O7St7Cp95TAsClY2Mt/Uv+/hS+9fEzufYT7+KdJ72Zk489umee\niMhQ+dJHcr8aOFdBWu7bgeSSnBaflusyuPskYBLEumVyKmncSccePaBDmhs/Ozbj/B+Ne+9AiiMi\nUpCCtNyXAWeZ2RgzOwq4Cng+ZZnnga9azAVAY6b+dhERGVpZW+7u3mlm1wMziQ2FnOLuq83s2vj8\nicAMYiNlqokNhbxm6IosIiLZBBrn7u4ziAV48rSJSY8duC7coomIyEDp9gMiIhGkcBcRiSCFu4hI\nBCncRUQiSOEuIhJBWe8tM2RvbNYA9P8TRZmdDOT/tmtD73Cop+oYDYdDHaEw6vlOdy/JtlDewn0w\nzKwsyI1zit3hUE/VMRoOhzpCcdVT3TIiIhGkcBcRiaBiDfdJ+S7AMDkc6qk6RsPhUEcoonoWZZ+7\niIhkVqwtdxERyaDowt3MxplZpZlVm9mEfJcnV2a22cxWmtkKMyuLT3ubmb1iZlXx/9+atPxP4nWt\nNLPLkqafF19PtZndHf+R8rwwsylmVm9mq5KmhVYnMzvazJ6MT19qZqOHs37xMqSr401mtj2+LVfE\nf0s4Ma8Y63i6mc01szVmttrMvh+fHrVt2V89I7U9cfei+UfslsMbgDOBo4A3gLH5LleOddgMnJwy\n7RZgQvzxBOC/44/Hxut4NDAmXvcR8XmvARcABrwEXJ7HOl0EfBhYNRR1Ar4LTIw/vgp4skDqeBPw\nn2mWLdY6vh34cPzxccD6eF2iti37q2ektmextdzPB6rdfaO7HwSmAlfmuUxhuBL4Y/zxH4H/mzR9\nqru3u/smYvfLP9/M3g4c7+5LPPbpeSTpNcPO3ecDe1Imh1mn5HU9BXx6uI9U+qljf4q1jjvdfXn8\ncTOwlthvIUdtW/ZXz/4UZT2LLdz7+yHuYuLAbDMrN7Px8Wmn+qFfrqoFEj/k2l99R8Ufp04vJGHW\nqec17t4JNAInDU2xc/Y9M6uId9skuiuKvo7xboQPAUuJ8LZMqSdEaHsWW7hHwYXufi5wOXCdmV2U\nPDPeAojUEKYo1inuAWJdhOcCO4Hb81uccJjZscDTwA/cvSl5XpS2ZZp6Rmp7Flu4B/oh7kLm7tvj\n/9cDzxDraqqLH+IR/78+vnh/9d0ef5w6vZCEWaee15jZSOAEYPeQlTwgd69z9y537wb+QGxbQhHX\n0cyOJBZ4j7v7tPjkyG3LdPWM2vYstnAP8mPdBcvM3mJmxyUeA5cCq4jV4Wvxxb4GPBd//DxwVfzM\n+xjgLOC1+CFyk5ldEO/H+2rSawpFmHVKXtcXgL/GW5B5lQi8uM8R25ZQpHWMl+l/gLXu/vukWZHa\nlv3VM2rbc1jP3obxj9gPca8ndsb6xnyXJ8eyn0nsrPsbwOpE+Yn1xc0BqoDZwNuSXnNjvK6VJI2I\nAUqJffg2APcSvyAtT/V6gthhbAexfsdvhFkn4BjgL8ROZL0GnFkgdXwUWAlUENuZ317kdbyQWJdL\nBbAi/u+KCG7L/uoZqe2pK1RFRCKo2LplREQkAIW7iEgEKdxFRCJI4S4iEkEKdxGRCFK4i4hEkMJd\nRCSCFO4iIhH0/wE3G3RhEFjGRgAAAABJRU5ErkJggg==\n", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "plt.plot(t_H.cpu().numpy()[0])" ] }, { "cell_type": "code", "execution_count": 1328, "metadata": { "hidden": true }, "outputs": [ { "data": { "text/plain": [ "0.43389660120010376" ] }, "execution_count": 1328, "metadata": {}, "output_type": "execute_result" } ], "source": [ "t_W.mm(t_H).max()" ] }, { "cell_type": "code", "execution_count": 1329, "metadata": { "hidden": true }, "outputs": [ { "data": { "text/plain": [ "0.9188119769096375" ] }, "execution_count": 1329, "metadata": {}, "output_type": "execute_result" } ], "source": [ "t_vectors.max()" ] }, { "cell_type": "markdown", "metadata": { "heading_collapsed": true }, "source": [ "### PyTorch: autograd" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "Above, we used our knowledge of what the gradient of the loss function was to do SGD from scratch in PyTorch. However, PyTorch has an automatic differentiation package, [autograd](http://pytorch.org/docs/autograd.html) which we could use instead. This is really useful, in that we can use autograd on problems where we don't know what the derivative is. \n", "\n", "The approach we use below is very general, and would work for almost any optimization problem.\n", "\n", "In PyTorch, Variables have the same API as tensors, but Variables remember the operations used on to create them. This lets us take derivatives." ] }, { "cell_type": "markdown", "metadata": { "heading_collapsed": true, "hidden": true }, "source": [ "#### PyTorch Autograd Introduction" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "Example taken from [this tutorial](http://pytorch.org/tutorials/beginner/former_torchies/autograd_tutorial.html) in the official documentation." ] }, { "cell_type": "code", "execution_count": 375, "metadata": { "hidden": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Variable containing:\n", " 1 1\n", " 1 1\n", "[torch.FloatTensor of size 2x2]\n", "\n" ] } ], "source": [ "x = Variable(torch.ones(2, 2), requires_grad=True)\n", "print(x)" ] }, { "cell_type": "code", "execution_count": 376, "metadata": { "hidden": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", " 1 1\n", " 1 1\n", "[torch.FloatTensor of size 2x2]\n", "\n" ] } ], "source": [ "print(x.data)" ] }, { "cell_type": "code", "execution_count": 377, "metadata": { "hidden": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Variable containing:\n", " 0 0\n", " 0 0\n", "[torch.FloatTensor of size 2x2]\n", "\n" ] } ], "source": [ "print(x.grad)" ] }, { "cell_type": "code", "execution_count": 378, "metadata": { "hidden": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Variable containing:\n", " 3 3\n", " 3 3\n", "[torch.FloatTensor of size 2x2]\n", "\n" ] } ], "source": [ "y = x + 2\n", "print(y)" ] }, { "cell_type": "code", "execution_count": 383, "metadata": { "hidden": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Variable containing:\n", " 27 27\n", " 27 27\n", "[torch.FloatTensor of size 2x2]\n", " Variable containing:\n", " 108\n", "[torch.FloatTensor of size 1]\n", "\n" ] } ], "source": [ "z = y * y * 3\n", "out = z.sum()\n", "print(z, out)" ] }, { "cell_type": "code", "execution_count": 382, "metadata": { "hidden": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Variable containing:\n", " 18 18\n", " 18 18\n", "[torch.FloatTensor of size 2x2]\n", "\n" ] } ], "source": [ "out.backward()\n", "print(x.grad)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "hidden": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "#### Using Autograd for NMF" ] }, { "cell_type": "code", "execution_count": 167, "metadata": { "collapsed": true, "hidden": true }, "outputs": [], "source": [ "lam=1e6" ] }, { "cell_type": "code", "execution_count": 168, "metadata": { "collapsed": true, "hidden": true }, "outputs": [], "source": [ "pW = Variable(tc.FloatTensor(m,d), requires_grad=True)\n", "pH = Variable(tc.FloatTensor(d,n), requires_grad=True)\n", "pW.data.normal_(std=0.01).abs_()\n", "pH.data.normal_(std=0.01).abs_();" ] }, { "cell_type": "code", "execution_count": 357, "metadata": { "collapsed": true, "hidden": true }, "outputs": [], "source": [ "def report():\n", " W,H = pW.data, pH.data\n", " print((M-pW.mm(pH)).norm(2).data[0], W.min(), H.min(), (W<0).sum(), (H<0).sum())\n", "\n", "def penalty(A):\n", " return torch.pow((A<0).type(tc.FloatTensor)*torch.clamp(A, max=0.), 2)\n", "\n", "def penalize(): return penalty(pW).mean() + penalty(pH).mean()\n", "\n", "def loss(): return (M-pW.mm(pH)).norm(2) + penalize()*lam" ] }, { "cell_type": "code", "execution_count": 358, "metadata": { "collapsed": true, "hidden": true }, "outputs": [], "source": [ "M = Variable(t_vectors).cuda()" ] }, { "cell_type": "code", "execution_count": 359, "metadata": { "hidden": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "43.66044616699219 -0.0002547535696066916 -0.00046720390673726797 319 8633\n" ] } ], "source": [ "opt = torch.optim.Adam([pW,pH], lr=1e-3, betas=(0.9,0.9))\n", "lr = 0.05\n", "report()" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "How to apply SGD, using autograd:" ] }, { "cell_type": "code", "execution_count": 361, "metadata": { "hidden": true, "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "43.628597259521484 -0.022899555042386055 -0.26526615023612976 692 82579\n", "43.62860107421875 -0.021287493407726288 -0.2440912425518036 617 77552\n", "43.628597259521484 -0.020111067220568657 -0.22828206419944763 576 77726\n", "43.628604888916016 -0.01912039890885353 -0.21654289960861206 553 84411\n", "43.62861251831055 -0.018248897045850754 -0.20736189186573029 544 75546\n", "43.62862014770508 -0.01753264293074608 -0.19999365508556366 491 78949\n", "43.62862777709961 -0.016773322597146034 -0.194113627076149 513 83822\n", "43.628639221191406 -0.01622121036052704 -0.18905577063560486 485 74101\n", "43.62863540649414 -0.01574397087097168 -0.18498440086841583 478 85987\n", "43.628639221191406 -0.015293922275304794 -0.18137598037719727 487 74023\n" ] } ], "source": [ "for i in range(1000): \n", " opt.zero_grad()\n", " l = loss()\n", " l.backward()\n", " opt.step()\n", " if i % 100 == 99: \n", " report()\n", " lr *= 0.9 # learning rate annealling" ] }, { "cell_type": "code", "execution_count": 362, "metadata": { "hidden": true }, "outputs": [ { "data": { "text/plain": [ "['god jesus bible believe atheism christian belief does',\n", " 'thanks graphics files image file windows program format',\n", " 'just don think people like ve graphics religion',\n", " 'objective morality values moral subjective science absolute claim',\n", " 'ico bobbe tek beauchaine bronx manhattan sank queens',\n", " 'space nasa shuttle launch orbit lunar moon data']" ] }, "execution_count": 362, "metadata": {}, "output_type": "execute_result" } ], "source": [ "h = pH.data.cpu().numpy()\n", "show_topics(h)" ] }, { "cell_type": "code", "execution_count": 174, "metadata": { "hidden": true, "scrolled": false }, "outputs": [ { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXcAAAD8CAYAAACMwORRAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAHSBJREFUeJzt3XucHGWd7/HPjwSQVcBVsopcNuGICl7wEgPrC128IaB7\ncjwvdYE9Xji6nByE5fg6qxuOCqwXVAQUBA0BMYJIVASJJCEhkARCEpLJ/TqTyY3M5DIzJJnJbTK3\n3/mja3pqerqne2aqu6effN+vV17pqq6uep7p7m8/9dRTVebuiIhIWI4rdwFERCR5CncRkQAp3EVE\nAqRwFxEJkMJdRCRACncRkQAp3EVEAqRwFxEJUN5wN7OHzKzBzNbmeN7M7B4zqzWz1Wb2/uSLKSIi\nAzGygGWmAPcCD+d4/nLg3OjfhcCvov/7ddppp/no0aMLKqSIiKQsW7asyd1H5Vsub7i7+wtmNrqf\nRcYDD3vqOgaLzez1Zna6u+/qb72jR4+mqqoq3+ZFRCTGzLYXslwSfe5nADti03XRvGyFutbMqsys\nqrGxMYFNi4hINiU9oOruk919rLuPHTUq716FiIgMUhLhXg+cFZs+M5onIiJlkkS4TwO+FI2auQho\nztffLiIixZX3gKqZPQZcApxmZnXALcDxAO4+CZgBXAHUAoeBa4pVWBERKUwho2WuyvO8A19PrEQi\nIjJkOkNVRCRACnepSLPW7abhQGu5iyEybCncpeIc7ejkfz2yjKsfeLncRREZthTuUnG67+m+Y+/h\n8hZEZBhTuIuIBEjhLiISIIW7iEiAFO4iIgFSuIuIBEjhLiISIIW7iEiAFO4iIgFSuIuIBEjhLiIS\nIIW7iEiAFO4iIgFSuIuIBEjhLiISIIW7VCwvdwFEhjGFu4hIgBTuIiIBUriLiARI4S4iEiCFu4hI\ngBTuIiIBUriLiARI4S4iEiCFu4hIgBTuIiIBUriLiARI4S4iEiCFu4hIgAoKdzO7zMyqzazWzCZm\nef5UM/urma0ys3Vmdk3yRRURkULlDXczGwHcB1wOnA9cZWbnZyz2dWC9u18AXALcaWYnJFxWEREp\nUCEt93FArbtvcfc2YCowPmMZB042MwNeB+wFOhItqYiIFKyQcD8D2BGbrovmxd0LnAfsBNYAN7p7\nVyIlFMlFd+sQySmpA6qfAlYCbwHeC9xrZqdkLmRm15pZlZlVNTY2JrRpOdaYlbsEIsNfIeFeD5wV\nmz4zmhd3DfCEp9QCW4F3ZK7I3Se7+1h3Hztq1KjBlllERPIoJNyXAuea2ZjoIOmVwLSMZV4BPg5g\nZm8C3g5sSbKgIiJSuJH5FnD3DjO7HpgFjAAecvd1ZjYhen4S8H1gipmtAQz4D3dvKmK5RUSkH3nD\nHcDdZwAzMuZNij3eCVyabNFERGSwdIaqiEiAFO4iIgFSuIuIBEjhLiISIIW7iEiAFO4iIgFSuIuI\nBEjhLiISIIW7iEiAFO4iIgFSuEvFcl3QXSQnhbtUHEMXdBfJR+EuIhIghbuISIAU7iIiAVK4i4gE\nSOEuIhIghbuISIAU7iIiAVK4i4gESOEuIhIghbuISIAU7iIiAVK4i4gESOEuIhIghbuISIAU7iIi\nAVK4i4gESOEuIhIghbuISIAU7iIiAVK4i4gEqKBwN7PLzKzazGrNbGKOZS4xs5Vmts7M5idbTBER\nGYiR+RYwsxHAfcAngTpgqZlNc/f1sWVeD/wSuMzdXzGzvytWgUVEJL9CWu7jgFp33+LubcBUYHzG\nMlcDT7j7KwDu3pBsMUVEZCAKCfczgB2x6bpoXtzbgL81s3lmtszMvpRUAUVEZODydssMYD0fAD4O\nnAQsMrPF7l4TX8jMrgWuBTj77LMT2rSIiGQqpOVeD5wVmz4zmhdXB8xy90Pu3gS8AFyQuSJ3n+zu\nY9197KhRowZbZhERyaOQcF8KnGtmY8zsBOBKYFrGMk8BF5vZSDP7G+BCYEOyRRURkULl7ZZx9w4z\nux6YBYwAHnL3dWY2IXp+krtvMLNngNVAF/Cgu68tZsFFRCS3gvrc3X0GMCNj3qSM6Z8CP02uaCIi\nMlg6Q1VEJEAKdxGRACncpWK5l7sEIsOXwl1EJEAKdxGRACncRUQCpHAXEQmQwl1EJEAKdxGRACnc\nRUQCpHAXEQmQwl1EJEAKdxGRACncRUQCpHAXEQmQwl1EJEAKdxGRACncRUQCpHCXiqXLuYvkpnAX\nEQmQwl1EJEAKdxGRACncRUQCpHAXEQmQwl1EJEAKdxGRACncRUQCpHAXEQmQwl1EJEAKd5E8Gg8c\n5bt/WUt7Z1e5iyJSMIW7SB63TFvLI4u3M2f9nnIXRaRgCneRPLrUYJcKVFC4m9llZlZtZrVmNrGf\n5T5oZh1m9rnkiigiIgOVN9zNbARwH3A5cD5wlZmdn2O5nwCzky6kiIgMTCEt93FArbtvcfc2YCow\nPstyNwB/BhoSLJ+IiAxCIeF+BrAjNl0XzUszszOAzwK/Sq5oIv1z1+06RHJJ6oDqz4H/cPd+Dz2Z\n2bVmVmVmVY2NjQltWkREMo0sYJl64KzY9JnRvLixwFQzAzgNuMLMOtz9L/GF3H0yMBlg7NixanaJ\niBRJIeG+FDjXzMaQCvUrgavjC7j7mO7HZjYFeDoz2EVEpHTyhru7d5jZ9cAsYATwkLuvM7MJ0fOT\nilxGkWFBu5pSSQppuePuM4AZGfOyhrq7f2XoxRIZPlK9jSKVRWeoiogESOEuIhIghbuISIAU7iIi\nAVK4i4gESOEuIhIghbuISIAU7iIiAVK4ixRIF6GUSqJwF8lDZ6hKJVK4S8VxXeVFJC+Fu4hIgBTu\nIiIBUriLiARI4S4iEiCFu4hIgBTuIiIBUriLiARI4S5SII2vl0qicBfJw9ApqlJ5FO5ltnBzExt3\nt5S7GCISmJHlLsCx7uoHXgZg248/XeaSiEhI1HIXEQmQwl1EJEAKdxGRACncRUQCpHAXEQmQwl0q\nVqlPKdJt9qSSKNxFRAKkcBfJRyeoSgVSuIuIBEjhLiISoILC3cwuM7NqM6s1s4lZnv8XM1ttZmvM\nbKGZXZB8UUXKRAdSpQLlDXczGwHcB1wOnA9cZWbnZyy2FfhHd3838H1gctIFDc26nc2Mnji93MWQ\nATD1vUsFKaTlPg6odfct7t4GTAXGxxdw94Xuvi+aXAycmWwxw7NgU1O5iyAiASsk3M8AdsSm66J5\nuXwVmDmUQomIyNAkeslfM/soqXC/OMfz1wLXApx99tlJblpERGIKabnXA2fFps+M5vViZu8BHgTG\nu/ur2Vbk7pPdfay7jx01atRgyhsMHaOrPDpDVSpJIeG+FDjXzMaY2QnAlcC0+AJmdjbwBPBFd69J\nvpgiZaQDqVKB8oa7u3cA1wOzgA3AH919nZlNMLMJ0WI3A28EfmlmK82sqmglrjA79h7mztnVeBGa\nffX7j/DzOTVFWbeIVLaC+tzdfQYwI2PepNjjrwFfS7ZoYfjXh6vYuPsAn33fGZwz6nWJrvu63y1j\nVV0zl7/rdN7+5pMTXbeIVDadoVpkbR1dAHRlNK6TaGy3tnevWy13EelN4V5sReyv7T6pRtkuIpkU\n7iVTvAT2Y3TsjX7URHJTuBdZd8O9GEFkUdNdIScimRTuRWZFvCBJOUfoNbS08oOn19OZeTBBRIYF\nhXuJFCMCy9nn/q0/r+bBBVtZtDnr+WoiUmYK9yIrZrfM/sPtqXWXoc+9o9OHvO3dza3cPWdTxYzT\nr4xSiqQo3IusmJeJrd9/BICj0XDLSnP975fzszk1bNh1oNxF6ZdOUJVKpHAvkcwWbpKt7Urq9z7Q\n2s4ji7fj7hxu6wQ0Tl+kGBK9KqT0ZRR/REslZeMtT63jiRX1vDXhs3VFpDe13IusFHfvqaRx7q8e\nagOgtb2zzCWRcrty8iK+8YeV5S5GsBTuMiQD3WtIj/CpoB8kKY7FW/by5Ip6Vu7YX+6iBEnhXiKZ\nIZhoV0oZcnKweyTFHD0klal+35FyFyFICvcAVFJO6qxakdJQuJdIMbshKiko0y33spZChhN10RWH\nRssUWSlaquX4cgy0Pjv3H6Gzy0tygFlEFO5FV4osK+cw90LD+kM/fh6AT5z3dwAVc1ZqXH9lPtLW\nyUknjChhaUT6p26ZhDy5oo79h9vKsu3KCspoT6bMpRiIfBd/e27DHs67+RlWvLKvRCUqrxdqGnnw\nxS2Jra+UH99n1u5m2qqdpdtgGSncE7C16RDf+MMqbnhsRZ/nSnFxr3IG5WCHQhai4UArD764Zdj/\neL24qQmAFa8cG0P6vvTQEn4wfUO5izEoE363jH/L8j0NkcI9AUc7UifkNLQc7fNcSfqYE86+nfuP\n8NE75rFzf+4hakOtVyF5fcPvV/CD6RvY1HBwaBsrk5bWdkZPnM5fj5GWogwvCvcSKepomYTXPXXp\nDrY2HeKPVTsSXS/Ej0HkL3PzkdRVL7uvQFmoT9/zIt/806qBFSwBDrR3dnH3nE20tneyvekwAPe/\nsBmAu2ZX8/iyupKXS8rnh9PXc+XkRWXZtg6oJihbyOraMr2Voptq3c4W1u1s4aefv6B4G4mJ78VM\nXfIKP5tTQ0dXF5ee/2agp673PF8LwOc+cGZJyiXl98CLW8u2bbXcE2D9jIkpybVlBhGU+w+3se9Q\n6Q8Ax/9W63e1lHz7xXYkumbOkbZO1tQ3l7k0lanp4FGORFcM7ejsoqOzMi9pXW4K9yE42tFZloBM\nwnu/9yzv+/6zZdt+Be1sMG9jAwAPxlph7s7SbXtzHuw1g//35Jpo2YFvs27fYe6cXZ3oweRHFm3j\nlqfWJra+pGTWcOwP5vCF+1NdGe/5z9lc9KPnS1qehpZWtjYdKuk2i0HhPgRfnVJVcEBmfoCT/NIm\nHZT3PLcp4TX2yNYt85lfLODg0Y70dM2eA+xqHj7XGzkQlS3eEn9m7W4+P2kRU5f2HJdoOng0se6m\nCb9bxi+er6U2wYPJ331qHb9dtD2x9WXTfKSdHXsP513ugRf6H0rZ/bc+3NZJ08G+AxWyeX7jHkZP\nnF7Q38zdaWhpzfrcuNue46N3zCtom5m2v3oob91KReE+BAtqm/Iuk6tXZqgh0Ba7+1L8h+Ku2dVc\n/cDioa28AN3D/wZajZlrd0ev6/3KV17tCYRLf/YC/1Di1tpAbY8CbFushfereZvTj/ONjT/c1sGl\nP5uf84qIR9tT7+9wvw9L3b7DvbpNPvOLF/nw7XPzvu6HM3oPpTza0cmt09bRHN06MpeNu1tydtM8\nvXoXQEFXmXx8WR3jbnsu8StSXjV5MT+csSE9GKCcFO4J6i+wM1vqy4d4wsvu5p5WR3zN9zxfy8IS\n3rR6sIcUSjUm3N17/RAmJX33qIw/QLaPQG3jwT7v/8tb91Kz5yC3zcg+XryYl0Y+0NreK0Tn1zTy\n9OqBD9fc3dzKxT+Zy+2zqtPzduwd+B6Xu/OXFfVMWbiNH0xfn3O5bU2HuOznL/LjmRv55bzanOPV\nC9krfnnrXiC1l5ik7j3Q4XBuhsI9Af020iz7GZmt7bkDZ/x9L3HhbXMK3mau4XUbdrXwz/cvKvjG\nGE+v3kltQ98Pu7vzjT+szLu729XldBXY1Pz1goGNIpjy0lZ+OmtjtJ3CX3fn7Bre9p2Zg745yOiJ\n07POX7Yt9eN8nFmvg8TZMr+to6tXH667c81vlgKwZOteXt7S98c48yD9vkNtQ+qqind7vfvW2Vzw\nvdnp6S8/tITrf7+CrU2HsoZSw4FWPj9pIY0HenePdHeXLNiUfw+2W/OR9qx/0/ZouOuf+hkqujc6\nA7xq+z5uf6Y6faZpW0cX7t7vwIZMxcreXNeSWrCpiYaW1l57esWmcC/AM2t3886bn+kVEPHH7f0c\nzc/1cTucETat7Z2Mnjid0ROns2rHfvZknBDVfKSdw209X9ARx/Ws+dn1e7Ju4z//uo6Xt+4teC/h\n+t+v4BN3vdBn/mNLdvDkino+cdf8fl//wR/OSV9DprPLBzTKIV8L9da/rue+ualuj5/GWopXP7CY\nO2LTvdbpztSlrwDQeOAoDQey97Fmenb9HuZWN/T7vj4XHWTdnPGDN3v97qzLx29iviKjKyBX6z1u\n3G1zhtRV9a5bZuVd5qN3zGPMTTPSI1W6PbxwO0u37eOxJa9kfd36XS08uaKw8ft7svRzm1lBLd3j\n0sHZs2zdvsO87Tszex37iGtt7+R935vNnPV7uOmJ1en5W5pS79ukeZu5ddq6nK8dqJ49rt7ueraa\ncbc9xyWD7MsfDIV7ASb8bhmH2jrZFXWFLNzcxDu++0z6+Ssn9+3jzuwKyPzsrsr4gq/tZ9jc6InT\nueA/Z/f6ch83kDGWsW3P3dgw4AM+G2JDFvvUK/r/xU2NvHqojd3Rl/ezv3yJt3575oC2U6h5NQ3p\nxws3v8q9c2uzLvfn5fXpH8Ev3L+IcT98Lv1ca3snjy+ryxoq//pwVbplnc/s9Xt6/Wisrsv+Pv45\n1iLNvAZRfzs73cVrH+CJXEOReY2c7h/epoNHc+49PPjiVuZV97wvmX/XQ0c7WJjjGNXGXS1896ns\nARvX3Z7pjK17S2OqJTxjza6sr6nbd5h9h9u5beYGHlvS8wPQ3S24pekQUxZuy/ratujHfW19M79e\nsJWNu1vo7PKcN6Tv7HL2R91dfbthS39pCp3ENACr6/Yz5rTX9tkNPdDa06J2d+r2HeHDt8/l3z5+\nbqz7pP8vZ32WU/0/P2khf5rwofR0zxmbXekQzWVPS2t6N/Xg0Q52NR/hxsdWsmTb3qzL9zfC4bkN\nPXsGdz5bzU2Xn9dnmS/+ekmv6Vwhl0tDy1He+Zbe8zbuzt4f2v2FzjRzzS4uOueN6ekZa3al94B2\nxY5RdHR2pX+cTxh5HP/1gowND1D3gTwg/cWvzujLjb+/d8yq6fXcmvrmVLdC7Ac718leW5sOcepJ\nxzNn/R7ec9apvOPNpwDwz/cvYnPjQaq+88k+5bvr2Zo+8wC2NB7kja87sc/8TncaDrTywAtbuO6S\nt6b3mB5etJ2Hc4y26exynt/YE+5NB9v41uOr2Lj7AAsnfozrHl3O/JpGfvOVD/Z57S9jB6Jz6ejs\nSjdo4t1y1/9+OQAtR9pzHOTP0ZQu0JbGg3zmFwvS02987Ql0urPy5ktp6+jiukeXMWdDA/O/eUmv\nz8Grh9p4Ynn94DaakGDCva2ji+Yj7fzPKUv50X9/Ny/VNvGRt43ivNNPGdJ6u68bA3Dj1JWcd/op\nOVsJmxoOMuamGenpgQwpvHFq3xsFL922r08ftrtz24yNPPRS7z7r2oaDjDq554s6v7ox/fjaR5b1\nu213zzrCYW19M3OrG3r9kNTtPdJrt709oYOV10xZyrYffzrvcrn+9k0Hj/K/H13Oe848NT0vHjbd\n3L3XyVPTVtb3Cvf4HlS2FtotT63l5n96Z95yzov9/TPXle3krYNHOzj5Ncenl831w/bRO+Zx8okj\n08MzAZ6+4eL0AcLW9k5ec3zvSw/n+hx+7M75nHPaa/vM393cyjW/WUpHlxd8hmWXe89BZlJdgnOj\nv8Gv5m9mfk3qcbzcA9HW2ZW+1lH879cSNaxWxRoTf1y6gy+MPYvOLu/VldmfO2ZV8++fejtbGnu6\n2XbuP9LnOFP3Dd4Pt3Uw/t6X0tc9emJ5PQ2xYxI3Tl3Za4837p9+sYC/3nBxQeUaioLC3cwuA+4G\nRgAPuvuPM5636PkrgMPAV9x9ecJlzSnzAE33L+2PZm7k8x84k/HvPYOLzz1tUOv+5p9W95q+9Gd9\n+6QLcfBoKhB3N7dy6knHF/y6tox+3/NufibrwdjM/vDWjk72FXAJ4tETp3PiyOy9c3M2NDBnQ++A\nnL5mF68e6vkQf+3hKqZc07c11m3dzmbe+ZZTcz4fl9l67RYPjesezf6xOhSFRr49hviPL6Tq2NHZ\nxcgRx+HuvVpp8a63br9dtH1QY8W78vQpP7+xgdNPPYmZa3fx5IqeFt8V97zYZ9nMgMxW5pNPHMmC\niR/L+1nbkuUA3zcfX51lyf7V7DlIzZ6eIIy3Ym9/pueYyOIsB48LMXPNbv5vgdcLqtq+j6dW1vdq\nMOVruN87t5bL3vXmXn/Lrz+6nM059hI//JO56aAHuPu5TZz1hpPS07mCHSjZmcuW70CGmY0AaoBP\nAnXAUuAqd18fW+YK4AZS4X4hcLe7X9jfeseOHetVVVVDKvyNU1fw1MrChnAV0iqMa23vzPrlHor/\ncdHZ/G5x9oNSuTx9w8W9PnCV6CsfGp2zXzMpb3jtCewdwtnCE/7xvzBpfv7ugUpyxbvfTNPBNpZs\nzd4VN1TvP/v1Re1LfvRrF/IvD75ctPWX00DzKM7Mlrn72LzLFRDu/wDc6u6fiqZvAnD3H8WWuR+Y\n5+6PRdPVwCXunn0fmqGF+/3zN/OjmRsH9dpvXfZ27pxdwxcv+nueXb8na1+3iEgxrb71Uk55TeF7\n8HGFhnsho2XOAOLjjOqieQNdJhF3z9k06GCH1C5iZ5czZeE2BbuIlMV7bp2df6EhKulQSDO71syq\nzKyqsbEx/wuy+MqHRidbKBGREvvuZ84v+jYKOaBaD5wVmz4zmjfQZXD3ycBkSHXLDKikkVP/5vgh\n9VeJiBwLCmm5LwXONbMxZnYCcCUwLWOZacCXLOUioLm//nYRESmuvC13d+8ws+uBWaSGQj7k7uvM\nbEL0/CRgBqmRMrWkhkJeU7wii4hIPgWNc3f3GaQCPD5vUuyxA19PtmgiIjJYuraMiEiAFO4iIgFS\nuIuIBEjhLiISIIW7iEiA8l5bpmgbNmsEBnsr9tOAwu/tVbmOhXqqjmE4FuoIw6Oef+/uo/ItVLZw\nHwozqyrkwjmV7liop+oYhmOhjlBZ9VS3jIhIgBTuIiIBqtRwn1zuApTIsVBP1TEMx0IdoYLqWZF9\n7iIi0r9KbbmLiEg/Ki7czewyM6s2s1ozm1ju8gyUmW0zszVmttLMqqJ5bzCzZ81sU/T/38aWvymq\na7WZfSo2/wPRemrN7B7LdmfpEjGzh8yswczWxuYlViczO9HM/hDNf9nMRpeyflEZstXxVjOrj97L\nldG9hLufq8Q6nmVmc81svZmtM7Mbo/mhvZe56hnU+4m7V8w/Upcc3gycA5wArALOL3e5BliHbcBp\nGfNuByZGjycCP4kenx/V8URgTFT3EdFzS4CLAANmApeXsU4fAd4PrC1GnYDrgEnR4yuBPwyTOt4K\n/HuWZSu1jqcD748enwzURHUJ7b3MVc+g3s9Ka7mPA2rdfYu7twFTgfFlLlMSxgO/jR7/FvhvsflT\n3f2ou28ldb38cWZ2OnCKuy/21Kfn4dhrSs7dXwD2ZsxOsk7xdT0OfLzUeyo56phLpdZxl7svjx4f\nADaQuhdyaO9lrnrmUpH1rLRwL9mNuIvIgTlmtszMro3mvcl77ly1G3hT9DhXfc+IHmfOH06SrFP6\nNe7eATQDbyxOsQfsBjNbHXXbdHdXVHwdo26E9wEvE/B7mVFPCOj9rLRwD8HF7v5e4HLg62b2kfiT\nUQsgqCFMIdYp8itSXYTvBXYBd5a3OMkws9cBfwb+j7u3xJ8L6b3MUs+g3s9KC/eCbsQ9nLl7ffR/\nA/Akqa6mPdEuHtH/DdHiuepbHz3OnD+cJFmn9GvMbCRwKvBq0UpeIHff4+6d7t4FPEDqvYQKrqOZ\nHU8q8B519yei2cG9l9nqGdr7WWnhXsjNuoctM3utmZ3c/Ri4FFhLqg5fjhb7MvBU9HgacGV05H0M\ncC6wJNpFbjGzi6J+vC/FXjNcJFmn+Lo+BzwftSDLqjvwIp8l9V5ChdYxKtOvgQ3uflfsqaDey1z1\nDO39LOnR2yT+kboRdw2pI9bfLnd5Blj2c0gddV8FrOsuP6m+uOeATcAc4A2x13w7qms1sRExwFhS\nH77NwL1EJ6SVqV6PkdqNbSfV7/jVJOsEvAb4E6kDWUuAc4ZJHR8B1gCrSX2ZT6/wOl5MqstlNbAy\n+ndFgO9lrnoG9X7qDFURkQBVWreMiIgUQOEuIhIghbuISIAU7iIiAVK4i4gESOEuIhIghbuISIAU\n7iIiAfr/05ddF5ayNa4AAAAASUVORK5CYII=\n", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "plt.plot(h[0]);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Comparing Approaches" ] }, { "cell_type": "markdown", "metadata": { "heading_collapsed": true }, "source": [ "#### Scikit-Learn's NMF\n", "- Fast\n", "- No parameter tuning\n", "- Relies on decades of academic research, took experts a long time to implement\n", "\n", "\n", "source: [Python Nimfa Documentation](http://nimfa.biolab.si/)\n", "\n", "#### Using PyTorch and SGD\n", "- Took us an hour to implement, didn't have to be NMF experts\n", "- Parameters were fiddly\n", "- Not as fast (tried in numpy and was so slow we had to switch to PyTorch)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Truncated SVD" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We saved a lot of time when we calculated NMF by only calculating the subset of columns we were interested in. Is there a way to get this benefit with SVD? Yes there is! It's called truncated SVD. We are just interested in the vectors corresponding to the **largest** singular values." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "(source: [Facebook Research: Fast Randomized SVD](https://research.fb.com/fast-randomized-svd/))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Shortcomings of classical algorithms for decomposition:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Matrices are \"stupendously big\"\n", "- Data are often **missing or inaccurate**. Why spend extra computational resources when imprecision of input limits precision of the output?\n", "- **Data transfer** now plays a major role in time of algorithms. Techniques the require fewer passes over the data may be substantially faster, even if they require more flops (flops = floating point operations).\n", "- Important to take advantage of **GPUs**.\n", "\n", "(source: [Halko](https://arxiv.org/abs/0909.4061))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Advantages of randomized algorithms:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- inherently stable\n", "- performance guarantees do not depend on subtle spectral properties\n", "- needed matrix-vector products can be done in parallel\n", "\n", "(source: [Halko](https://arxiv.org/abs/0909.4061))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Randomized SVD" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Reminder: full SVD is **slow**. This is the calculation we did above using Scipy's Linalg SVD:" ] }, { "cell_type": "code", "execution_count": 384, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(2034, 26576)" ] }, "execution_count": 384, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vectors.shape" ] }, { "cell_type": "code", "execution_count": 344, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 27.2 s, sys: 812 ms, total: 28 s\n", "Wall time: 27.9 s\n" ] } ], "source": [ "%time U, s, Vh = linalg.svd(vectors, full_matrices=False)" ] }, { "cell_type": "code", "execution_count": 345, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(2034, 2034) (2034,) (2034, 26576)\n" ] } ], "source": [ "print(U.shape, s.shape, Vh.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Fortunately, there is a faster way:" ] }, { "cell_type": "code", "execution_count": 175, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 144 ms, sys: 8 ms, total: 152 ms\n", "Wall time: 154 ms\n" ] } ], "source": [ "%time u, s, v = decomposition.randomized_svd(vectors, 5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The runtime complexity for SVD is $\\mathcal{O}(\\text{min}(m^2 n,\\; m n^2))$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Question**: How can we speed things up? (without new breakthroughs in SVD research)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Idea**: Let's use a smaller matrix (with smaller $n$)!\n", "\n", "Instead of calculating the SVD on our full matrix $A$ which is $m \\times n$, let's use $B = A Q$, which is just $m \\times r$ and $r << n$\n", "\n", "We haven't found a better general SVD method, we are just using the method we have on a smaller matrix." ] }, { "cell_type": "code", "execution_count": 175, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 144 ms, sys: 8 ms, total: 152 ms\n", "Wall time: 154 ms\n" ] } ], "source": [ "%time u, s, v = decomposition.randomized_svd(vectors, 5)" ] }, { "cell_type": "code", "execution_count": 177, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "((2034, 5), (5,), (5, 26576))" ] }, "execution_count": 177, "metadata": {}, "output_type": "execute_result" } ], "source": [ "u.shape, s.shape, v.shape" ] }, { "cell_type": "code", "execution_count": 178, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['jpeg image edu file graphics images gif data',\n", " 'jpeg gif file color quality image jfif format',\n", " 'space jesus launch god people satellite matthew atheists',\n", " 'jesus god matthew people atheists atheism does graphics',\n", " 'image data processing analysis software available tools display']" ] }, "execution_count": 178, "metadata": {}, "output_type": "execute_result" } ], "source": [ "show_topics(v)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here are some results from [Facebook Research](https://research.fb.com/fast-randomized-svd/):\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Johnson-Lindenstrauss Lemma**: ([from wikipedia](https://en.wikipedia.org/wiki/Johnson%E2%80%93Lindenstrauss_lemma)) a small set of points in a high-dimensional space can be embedded into a space of much lower dimension in such a way that distances between the points are nearly preserved.\n", "\n", "It is desirable to be able to reduce dimensionality of data in a way that preserves relevant structure. The Johnsonâ€“Lindenstrauss lemma is a classic result of this type." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Implementing our own Randomized SVD" ] }, { "cell_type": "code", "execution_count": 112, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from scipy import linalg" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The method randomized_range_finder finds an orthonormal matrix whose range approximates the range of A (step 1 in our algorithm above). To do so, we use the LU and QR factorizations, both of which we will be covering in depth later.\n", "\n", "I am using the [scikit-learn.extmath.randomized_svd source code](https://github.com/scikit-learn/scikit-learn/blob/14031f65d144e3966113d3daec836e443c6d7a5b/sklearn/utils/extmath.py) as a guide." ] }, { "cell_type": "code", "execution_count": 182, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# computes an orthonormal matrix whose range approximates the range of A\n", "# power_iteration_normalizer can be safe_sparse_dot (fast but unstable), LU (imbetween), or QR (slow but most accurate)\n", "def randomized_range_finder(A, size, n_iter=5):\n", " Q = np.random.normal(size=(A.shape[1], size))\n", " \n", " for i in range(n_iter):\n", " Q, _ = linalg.lu(A @ Q, permute_l=True)\n", " Q, _ = linalg.lu(A.T @ Q, permute_l=True)\n", " \n", " Q, _ = linalg.qr(A @ Q, mode='economic')\n", " return Q" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And here's our randomized SVD method:" ] }, { "cell_type": "code", "execution_count": 236, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def randomized_svd(M, n_components, n_oversamples=10, n_iter=4):\n", " \n", " n_random = n_components + n_oversamples\n", " \n", " Q = randomized_range_finder(M, n_random, n_iter)\n", " \n", " # project M to the (k + p) dimensional space using the basis vectors\n", " B = Q.T @ M\n", " \n", " # compute the SVD on the thin matrix: (k + p) wide\n", " Uhat, s, V = linalg.svd(B, full_matrices=False)\n", " del B\n", " U = Q @ Uhat\n", " \n", " return U[:, :n_components], s[:n_components], V[:n_components, :]" ] }, { "cell_type": "code", "execution_count": 237, "metadata": { "collapsed": true }, "outputs": [], "source": [ "u, s, v = randomized_svd(vectors, 5)" ] }, { "cell_type": "code", "execution_count": 238, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 136 ms, sys: 0 ns, total: 136 ms\n", "Wall time: 137 ms\n" ] } ], "source": [ "%time u, s, v = randomized_svd(vectors, 5)" ] }, { "cell_type": "code", "execution_count": 239, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "((2034, 5), (5,), (5, 26576))" ] }, "execution_count": 239, "metadata": {}, "output_type": "execute_result" } ], "source": [ "u.shape, s.shape, v.shape" ] }, { "cell_type": "code", "execution_count": 247, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['jpeg image edu file graphics images gif data',\n", " 'edu graphics data space pub mail 128 3d',\n", " 'space jesus launch god people satellite matthew atheists',\n", " 'space launch satellite commercial nasa satellites market year',\n", " 'image data processing analysis software available tools display']" ] }, "execution_count": 247, "metadata": {}, "output_type": "execute_result" } ], "source": [ "show_topics(v)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Write a loop to calculate the error of your decomposition as you vary the # of topics. Plot the result" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Answer" ] }, { "cell_type": "code", "execution_count": 248, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#Exercise: Write a loop to calculate the error of your decomposition as you vary the # of topics\n" ] }, { "cell_type": "code", "execution_count": 242, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[]" ] }, "execution_count": 242, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXoAAAD8CAYAAAB5Pm/hAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAIABJREFUeJzt3Xt4XPV95/H3V3dpJOsykmVZliUbZIMNxmBjIAmXlHBJ\noDFtWup0Q9mELvv0oS2w7abQ7LZpdllILzzZtCWNc6u3uRA3TYoLhUIcLiEBOzZ2fMVYWJYtWbYu\nlmRbN+vy3T/mSB5fZEmW5BnNfF7PM885c+Yc6asD/sxvfvM7v2PujoiIJK6UWBcgIiJTS0EvIpLg\nFPQiIglOQS8ikuAU9CIiCU5BLyKS4BT0IiIJTkEvIpLgFPQiIgkuLdYFABQXF3tVVVWsyxARmVY2\nb97c4u4lo+0XF0FfVVXFpk2bYl2GiMi0YmZ1Y9lPXTciIglOQS8ikuAU9CIiCU5BLyKS4BT0IiIJ\nTkEvIpLgFPQiIgluWgd9Q3s3f/PyHg60dsW6FBGRuDWtg/5Ydx9/+5MatjW0x7oUEZG4Na2DvjKc\nA0CdWvQiIiOa1kGfk5HGzLxM9rd0xroUEZG4Na2DHqAqHFKLXkTkPKZ/0BfnsL9VLXoRkZFM+6Cv\nDIdoOt5LZ29/rEsREYlL0z7oq8IhQF/IioiMZExBb2YPm9kOM9tpZo8E24rM7BUz2xssC6P2f9zM\nasxsj5ndMVXFQ/TIG3XfiIicy6hBb2ZXAP8FWAFcBdxtZpcCjwHr3b0aWB88x8wWAauAxcCdwDNm\nljo15Z8K+v1q0YuInNNYWvSXAxvcvcvd+4HXgV8HVgJrgn3WAPcE6yuBZ929191rgRoibxJTIi8r\nneLcDLXoRURGMJag3wHcaGZhM8sBPgZUAKXu3hjscxgoDdbLgYNRx9cH26ZMVThErcbSi4ic06hB\n7+67gS8CLwMvAVuBgTP2ccDH84vN7EEz22Rmm5qbm8dz6FkqNZZeRGREY/oy1t2/4e7L3P0moA14\nDzhiZmUAwbIp2L2BSIt/yJxg25k/c7W7L3f35SUlo97E/LyqwjkcPtZD98mB0XcWEUkyYx11MzNY\nziXSP/9dYB1wf7DL/cBzwfo6YJWZZZrZPKAa2DiZRZ+psjgyxPLAUbXqRUTOlDbG/f7FzMJAH/CQ\nu7eb2VPAWjN7AKgD7gVw951mthbYBfQH+09pU7tqeORNJwtn5U3lrxIRmXbGFPTufuM5trUCt46w\n/xPAExMrbewqhy+a0heyIiJnmvZXxgLkZ6dTFMqgtkVdNyIiZ0qIoIfIhVNq0YuInC1hgl7TFYuI\nnFvCBH1lOIdDHd309GmIpYhItIQJ+nnFIdyhvk2tehGRaAkT9EMjb/SFrIjI6RIm6Ks0XbGIyDkl\nTNAX5GSQn52u2wqKiJwhYYIeIq16jbwRETldQgV9ZTikFr2IyBkSKuirikM0tHVzsn8w1qWIiMSN\nxAr6cA6DDgc1xFJEZFhCBb0mNxMROVtCBf3wdMUaSy8iMiyhgr4olEFeZppa9CIiURIq6M2MquIQ\ntRpiKSIyLKGCHjRdsYjImRIu6KvCIerbuukb0BBLERFIwKCvDOcwMOg0tHXHuhQRkbgwpqA3s0fN\nbKeZ7TCz75lZlpkVmdkrZrY3WBZG7f+4mdWY2R4zu2Pqyj9bVXFkiKWukBURiRg16M2sHPhDYLm7\nXwGkAquAx4D17l4NrA+eY2aLgtcXA3cCz5hZ6tSUf7aq4bH0+kJWRATG3nWTBmSbWRqQAxwCVgJr\ngtfXAPcE6yuBZ929191rgRpgxeSVfH7FuRmEMlKpbVGLXkQExhD07t4A/DVwAGgEOtz9ZaDU3RuD\n3Q4DpcF6OXAw6kfUB9suCjOjMhzSyBsRkcBYum4KibTS5wGzgZCZfSp6H3d3wMfzi83sQTPbZGab\nmpubx3PoqKqKNV2xiMiQsXTdfASodfdmd+8Dfgh8ADhiZmUAwbIp2L8BqIg6fk6w7TTuvtrdl7v7\n8pKSkon8DWepDIc42NZFv4ZYioiMKegPANebWY6ZGXArsBtYB9wf7HM/8Fywvg5YZWaZZjYPqAY2\nTm7Z51cVzqFvwGns6LmYv1ZEJC6ljbaDu28wsx8A7wD9wBZgNZALrDWzB4A64N5g/51mthbYFez/\nkLsPTFH951Q1fKPwTiqKci7mrxYRiTujBj2Au/858OdnbO4l0ro/1/5PAE9MrLQLNzSWPvKF7OR2\nC4mITDcJd2UswMy8TLLSU9ivL2RFRBIz6M2MKg2xFBEBEjToITLnjVr0IiIJHPRVxSEOtHYxMDiu\n4f0iIgkncYM+HOLkwCCNHZrFUkSSW8IGfWVw/1hdISsiyS5hg35oLL2mKxaRZJewQT9rRhaZaSlq\n0YtI0kvYoE9JMSrDOZquWESSXsIGPaDpikVESPCgrwpHpise1BBLEUliCR30leEQvf2DHDmuWSxF\nJHkldNAPj7xp0ReyIpK8EjvoiyNj6TXEUkSSWUIHfVl+NhmpKQp6EUlqCR30qSlGRVE2deq6EZEk\nltBBD5F+erXoRSSZJXzQR8bSd+GuIZYikpwSPuirinPo7hug+XhvrEsREYmJUYPezBaa2daoxzEz\ne8TMiszsFTPbGywLo4553MxqzGyPmd0xtX/C+UXfKFxEJBmNGvTuvsfdl7r7UmAZ0AX8CHgMWO/u\n1cD64DlmtghYBSwG7gSeMbPUKap/VENBr8nNRCRZjbfr5lbgfXevA1YCa4Lta4B7gvWVwLPu3uvu\ntUANsGIyir0QswuySEsxfSErIklrvEG/CvhesF7q7o3B+mGgNFgvBw5GHVMfbIuJtNQUKopy1KIX\nkaQ15qA3swzg48A/n/maR4a0jGtYi5k9aGabzGxTc3PzeA4dt8iNwtWiF5HkNJ4W/UeBd9z9SPD8\niJmVAQTLpmB7A1ARddycYNtp3H21uy939+UlJSXjr3wcqsIh9rd0aoiliCSl8QT9JznVbQOwDrg/\nWL8feC5q+yozyzSzeUA1sHGihU5EVTiHzpMDtJw4GcsyRERiYkxBb2Yh4Dbgh1GbnwJuM7O9wEeC\n57j7TmAtsAt4CXjI3Qcms+jxqiweGnmj7hsRST5pY9nJ3TuB8BnbWomMwjnX/k8AT0y4ukly6kbh\nXSyvKopxNSIiF1fCXxkLMKcwm9QUU4teRJJSUgR9emoKcwqzdXWsiCSlpAh6ODW5mYhIskmaoK8K\nxtJriKWIJJukCfrKcIjjPf20dfXFuhQRkYsqaYK+Kqz7x4pIckqaoK8cGmKpL2RFJMkkTdBXFGWT\nYpGx9CIiySRpgj4zLZXZBdkaSy8iSSdpgh6GbhSuFr2IJJekCvrKcI5a9CKSdJIq6KvCIdq7+mjv\n0iyWIpI8kivoi09NbiYikiySK+iDsfTqvhGRZJJUQV9RlIMZ7G9Ri15EkkdSBX1WeiplM7LUoheR\npJJUQQ+RK2Q1DYKIJJOkC/qqYo2lF5HkknxBH87haOdJOro1i6WIJIex3hy8wMx+YGbvmtluM7vB\nzIrM7BUz2xssC6P2f9zMasxsj5ndMXXlj9/Q5GYH1KoXkSQx1hb9/wVecvfLgKuA3cBjwHp3rwbW\nB88xs0XAKmAxcCfwjJmlTnbhF6qqWNMVi0hyGTXozSwfuAn4BoC7n3T3dmAlsCbYbQ1wT7C+EnjW\n3XvdvRaoAVZMduEXam6RxtKLSHIZS4t+HtAMfMvMtpjZ180sBJS6e2Owz2GgNFgvBw5GHV8fbIsL\nORlplM7IpFZj6UUkSYwl6NOAa4CvuPvVQCdBN80Qj9yIdVw3YzWzB81sk5ltam5uHs+hE1YVDqlF\nLyJJYyxBXw/Uu/uG4PkPiAT/ETMrAwiWTcHrDUBF1PFzgm2ncffV7r7c3ZeXlJRcaP0XRNMVi0gy\nGTXo3f0wcNDMFgabbgV2AeuA+4Nt9wPPBevrgFVmlmlm84BqYOOkVj1BlcU5tJzo5URvf6xLERGZ\ncmlj3O8PgO+YWQawD/g0kTeJtWb2AFAH3Avg7jvNbC2RN4N+4CF3H5j0yiegKhhiWdfayeLZ+TGu\nRkRkao0p6N19K7D8HC/dOsL+TwBPTKCuKVUZzGK5v6VLQS8iCS/proyFUy16jaUXkWSQlEEfykyj\nJC9TI29EJCkkZdBDZM4bjbwRkWSQtEFfqbH0IpIkkjboq8I5HDnWS9dJDbEUkcSWvEFfPDTEUt03\nIpLYkjfoo8bSi4gksqQN+rlDY+nVoheRBJe0QT8jK51wKEMtehFJeEkb9BC5Qna/pisWkQSX1EEf\nmcVSLXoRSWzJHfTFIRo7eujpi6s510REJlVSB/3Q5GYHjqr7RkQSV1IH/SUluQD8YHN9jCsREZk6\nSR30i2fP4JMr5rL6jX189fX3Y12OiMiUGOuNRxKSmfG/77mC4z19PPniu8zITueTK+bGuiwRkUmV\n1EEPkJpiPH3vUk709vOnP9pObmYav3rV7FiXJSIyaZK662ZIRloKX/lPy7i2sohHv7+VV/c0jX6Q\niMg0oaAPZGek8vX/vJzLyvL4vW9vZmPt0ViXJCIyKcYU9Ga238y2m9lWM9sUbCsys1fMbG+wLIza\n/3EzqzGzPWZ2x1QVP9lmZKWz5tMrKC/I5oF//AU7GjpiXZKIyISNp0X/YXdf6u5DNwl/DFjv7tXA\n+uA5ZrYIWAUsBu4EnjGz1EmseUqFczP5pweuY0Z2Or/zzY3UNJ2IdUkiIhMyka6blcCaYH0NcE/U\n9mfdvdfda4EaYMUEfs9FN7sgm2//7nWkmHHfNzZQ36YLqkRk+hpr0DvwYzPbbGYPBttK3b0xWD8M\nlAbr5cDBqGPrg23TyrziEP/vMyvo7O3nvm9spPl4b6xLEhG5IGMN+g+5+1Lgo8BDZnZT9Ivu7kTe\nDMbMzB40s01mtqm5uXk8h140i2bP4FufvpbDHT38zjc30tHdF+uSRETGbUxB7+4NwbIJ+BGRrpgj\nZlYGECyHxiQ2ABVRh88Jtp35M1e7+3J3X15SUnLhf8EUW1ZZxOrfWUZN03E+84+/0D1mRWTaGTXo\nzSxkZnlD68DtwA5gHXB/sNv9wHPB+jpglZllmtk8oBrYONmFX0w3Vpfw5VVXs+VAG//1nzbT26/Z\nLkVk+hhLi74UeNPMfkkksF9w95eAp4DbzGwv8JHgOe6+E1gL7AJeAh5y92mfjB+9soynPrGEn+5t\n4dHvb2VgcFw9VSIiMTPqFAjuvg+46hzbW4FbRzjmCeCJCVcXZ+5dXsHxnn7+1/O7yM3cxhc/sQQz\ni3VZIiLnlfRz3YzXAx+aR0d3H19ev5cZWel87q7LFfYiEtcU9Bfg0Y9Uc6y7j6+/WUt+djp/cGt1\nrEsSERmRgv4CmBl/dvcijvf08zevvMeR4z08/tHLCWXqdIpI/FEyXaCUFOOLn7iSolA6X3+zljfe\na+GvfmMJ180Px7o0EZHTaPbKCUhLTeFzdy3i+w/egBms+trbfOHfdtF9ctoPMhKRBKKgnwQr5hXx\n4sM3ct/1lXzzZ7Xc9eWfsrmuLdZliYgACvpJk5ORxhdWXsF3fvc6evsH+c1/+DlPvribnj617kUk\nthT0k+yDlxbz0iM3cu/yCr76+j5+9W/fZHu95rUXkdhR0E+BvKx0nvrEEr716Ws51tPHPc/8jKdf\n3sPJ/sFYlyYiSUhBP4U+vHAmLz9yMyuXzubLP6nhnr//Gbsbj8W6LBFJMgr6KZafk87T9y5l9X3L\naDrey8f/7k3+7id76R9Q615ELg4F/UVy++JZvPzoTdyxeBZ//fJ7fOIrP6em6XisyxKRJKCgv4iK\nQhn83W9fw9//9jUcONrFx778Jl99/X217kVkSinoY+CuJWW8/OjN3LKghCdffJfbv/QGz287xKCm\nPhaRKaCgj5GSvEy+et8yVt+3jLQU4/e/u4W7/vZNfvLuESJ3ZhQRmRwK+hgyM25fPIsXH76JL/3W\nUrpO9vOZf9zEJ77yc956vzXW5YlIglDQx4HUFOOeq8v58X+7mf/za1dyqL2HT37tbT719Q1sPdge\n6/JEZJqzeOgmWL58uW/atCnWZcSNnr4Bvv12HV957X1aO09y26JS/uj2BVw2a0asSxOROGJmm919\n+aj7KejjV2dvP9/6WS1ffWMfJ3r7+fhVs3n0IwuoKg7FujQRiQNjDfoxd92YWaqZbTGz54PnRWb2\nipntDZaFUfs+bmY1ZrbHzO64sD9BQplp/P6vVPPTz36Y37v5El7eeYRbn36dx3+4jUPt3bEuT0Sm\nifH00T8M7I56/hiw3t2rgfXBc8xsEbAKWAzcCTxjZqmTU25yKsjJ4LN3Xsbrn72F+66v5F82N3DL\nX7/GF/5tFy0nemNdnojEuTEFvZnNAe4Cvh61eSWwJlhfA9wTtf1Zd+9191qgBlgxOeUmt5l5WXz+\n44t59b/fwq8tLWfNW/u56S9f5X/+6w72HNZVtiJybmNt0X8J+CwQfQlnqbs3BuuHgdJgvRw4GLVf\nfbBNJkl5QTZf/I0lvPLoTdx5xSy+v+kgd3zpDX7zH37Oc1sb6O3XHPgicsqoQW9mdwNN7r55pH08\n8o3uuL7VNbMHzWyTmW1qbm4ez6ESmF+Sy9P3LuXtx2/lTz92GU3He3n42a3c8ORPePLF3Rxo7Yp1\niSISB0YddWNmTwL3Af1AFjAD+CFwLXCLuzeaWRnwmrsvNLPHAdz9yeD4/wA+7+5vjfQ7NOpmcgwO\nOm/WtPCdDXX8eHcTg+7cVF3Cp66v5Fcum0lqisW6RBGZRFMyvNLMbgH+2N3vNrO/Alrd/Skzewwo\ncvfPmtli4LtE+uVnE/mittrdR+xPUNBPvsaObp7deJBnf3GAI8d6mZ2fxaoVc1l1bQUzZ2TFujwR\nmQQXI+jDwFpgLlAH3OvuR4P9Pgd8hsingEfc/cXz/VwF/dTpGxhk/e4mvrOhjp/ubSEtxbh9cSmf\nuq6SGy4JY6ZWvsh0pQum5Cy1LZ18d0Md/7y5nvauPuYXh/jt6+aycmk5JXmZsS5PRMZJQS8j6ukb\n4IVtjXx7Qx1bDrSTYnD9/DB3L5nNHYtLCecq9EWmAwW9jMmew8d5ftshnt/WSG1LJ6kpxgcuCXPX\nlWXcsXgWhaGMWJcoIiNQ0Mu4uDu7GyOh/8L2Rupau0hLMT54aTF3LSnjjkWzyM9Jj3WZIhJFQS8X\nzN3ZeegYz29r5Plth6hv6yY91bixuoS7rizjtsWlzMhS6IvEmoJeJoW7s62+gxe2N/LCtkYa2rvJ\nSE3hpgWRlv5HLi8lT6EvEhMKepl07s6Wg+28sK2Rf9/eSGNHDxlpKVw/P8yHF5Zwy8KZzNMUyiIX\njYJeptTgoLPlYBsvbDvMa3ua2NfSCUBlOIcPL5zJzQtLuGF+mKx0TVwqMlUU9HJR1bV28tqeZl7b\n08Rb+1rp6RskMy2FGy4Jc8uCSGtfN0wRmVwKeomZnr4BNtQe5dV3m3j9vWZqg9b+vOIQNy8o4ZaF\nJVyv1r7IhCnoJW7sb+nktT1NvPZeM2+930pv/yBZ6SncMD/MLQtn8sFLi7mkJKTpGETGSUEvcamn\nb4C397UOd/PsD6ZSLs7N5Lr5RVw/P8wN84u4pCRXwS8yirEGfdrFKEZkSFZ6KrcsnMktC2cCi6lr\n7eTtfa28ve8ob73fygvbIveyKc7N4Lp5Ya4Pwv/SmQp+kQuloJeYqgyHqAyH+K1r5+LuHDjaxYZ9\nR4Pwb+WF7ZHgD4cyhlv8188PU63gFxkzBb3EDTMbDv57r63A3alv6+atIPQ37DvKv28/DESCf8W8\nSPAvqyzksll5pKWO5173IslDQS9xy8yoKMqhoiiHe5dXAHDwaNdwV8/b+1p5cUck+LPTU1kyJ59r\nKgu5Zm4hV88toFizcIoACnqZZoaC/zeD4K9v6+KdA+28U9fGlgNtfO2NffQPRgYYzC3K4Zq5BVxT\nWcjVFYVcVpZHulr9koQU9DKtzSnMYU5hDh+/ajYQGdWzo6GDdw60seVAO2/ta+Vftx4CICs9hSXl\nBVxdWcA1cyMtf91wRZKBhldKQnN3DnX0sOVAG+/UtfPOgTZ2HuqgbyDy//2cwmyuqijgyvJ8lpTn\ns7g8n/xsTdIm04OGV4oQ6ecvL8imvCCbu5ecavXvPHSMLUGrf1t9+/CwToCqcA5XlOezZE4+V5RH\nHpqWWaazUYPezLKAN4DMYP8fuPufm1kR8H2gCthP5ObgbcExjwMPAAPAH7r7f0xJ9SIXICs9lWWV\nhSyrLBze1tZ5kh2HOthW38GOhg62HGjn+ajwn1cc4sryfK4sHwr/GZqeWaaNUbtuLDJYOeTuJ8ws\nHXgTeBj4deCouz9lZo8Bhe7+J2a2CPgesAKYDfwYWODuAyP9DnXdSDw62nmS7Q2R4N9W386OhmM0\ntHcPvz6/OMSVc/JZVDaDy4OH+vzlYpq0rhuPvBOcCJ6mBw8HVgK3BNvXAK8BfxJsf9bde4FaM6sh\nEvpvje9PEImtolAGNy8o4eYFJcPbWk/0sr2hg+31HWxv6GBj7VGeC77shchUDpeX5Z0W/vNLQhrt\nIzE1pj56M0sFNgOXAn/v7hvMrNTdhz7bHgZKg/Vy4O2ow+uDbSLTXjg3M2oKh4i2zpPsbjzG7sPH\nI8vGY3zrZ/s5OTAIQEZqCtWluVxeNoPLZp16E9CN1+ViGVPQB90uS82sAPiRmV1xxutuZuMavmNm\nDwIPAsydO3c8h4rElcJQBh+4tJgPXFo8vK1vYJB9zZ3Dwb+r8Riv7WnmB5vrh/eZNSOLy8vyWDhr\nBgtn5bKgNI9LSnI1fbNMunGNunH3djN7FbgTOGJmZe7eaGZlQFOwWwNQEXXYnGDbmT9rNbAaIn30\nF1K8SLxKT01h4aw8Fs7K456rT32gbT7ey+7GY7x7+Bi7GyOfAN6saRke7pliUBUOsaA0jwWluSyY\nlcfC0jyqitX9IxduLKNuSoC+IOSzgduALwLrgPuBp4Llc8Eh64DvmtnTRL6MrQY2TkHtItNOSV4m\nJXkl3BTV7983MMj+lk7eO3KCPUeO897h47zXdJyXdx0muMiX9FRjfnEk+BfMPPUGUFGUQ2qKJneT\n8xtLi74MWBP006cAa939eTN7C1hrZg8AdcC9AO6+08zWAruAfuCh8424EUl26akpVJfmUV2ax12U\nDW/v6Rvg/eYT7I16A9h6sI1/++WpL3+z0lOYX5zLpTNPf1SFQ2Sk6ROAROjKWJFpprO3n71NJ3gv\nCP+9TSeoaTpx2tDP1BSjsiiHS4bCvySyvGRmLrmZuk4yUejKWJEEFcpMY2lFAUsrCk7b3nWyn33N\nndQEwV/TdIKa5hO8+m7T8ERvAGX5WZHQHwr/klwuKQlRkpepOf4TlIJeJEHkZKQNT9kQrW9gkLrW\nLmqaTvB+86k3gbWbDtJ18lSvam5mGvOKQ8wvCQXLXOYXR9ZD+hQwrem/nkiCS09NGe67jzY46DQe\n62Ff8wlqWzrZ19zJvpZONte1se6Xh4ju1Z01I+u0N4FLSnKZXxKivCBbN3yZBhT0IkkqJeXUhG83\nVpec9lpP3wB1rV3saz7BvuBNoLblBC9sb6S9q294v/TUyM1h5gV3Bqsqzokswzl6E4gjCnoROUtW\neurwdQBnaus8yb6WE8OfAOpaO9nf0sVb+1pP6wpKSzHmFGYPB3/0G8Gcwmwy03Rh2MWioBeRcSkM\nZbAsVMSyyqLTtrs7zSd6qWvtYn9LZ2TZGlm+U9fG8d7+4X1TDGYXZFMVDjE3nMPcolOPiqIc3RNg\nkinoRWRSmBkz87KYmZfFtVVnvwm0dfUFwR/5BFDX2kltaxcv7TjM0c6Tp+2fn50eCf4z3gTmFuVQ\nlp+lLqFxUtCLyJQzM4pCGRSFMrhmbuFZrx/v6ePg0W4OHO3iwNHOYNnNrkPHeHnn4eEpIiByjUB5\nQfZw67+iKDu4pWQ2cwqzKcnVMNEzKehFJObystJZNDudRbNnnPXawKBz+FgPB1q7OHi0i7qjnRwI\n3hRe2tFIW9SXwwCZaSmUF54e/sn+RqCgF5G4lho1OuiGS8Jnvd7Z209Dezf1bV3Ut3UHj8j6joaO\ns7qFznwjGPrZswuyKS/MpjQvM+G6hhT0IjKthTLTgtk+zx4hBON/I0hNMWbNyGJ2QVYk/IfeBII3\ngtkF2dNuGonpVa2IyDiN9kbQdbKfQ+09HGrvpqG9e3jZ0NbNOwfaeGFb42lTSADMyEqjvDCH8oIs\nyvKzKSvIYnZ+NmX5kTeH0hlZcTWpnIJeRJJaTkbaOa8cHjIw6DQf7z3tTeBQ8Khv6+YX+9vo6D79\newKzyG0ly/Kzgkc2swtOX868iF1ECnoRkfNITTFm5WcxKz+LZZVnjxiCSPdQY0cPjR3dNLb3cChq\n+X5zJ2/ubaHz5OmztacYzMzL4u4lZfyPuxdN6d+goBcRmaBQ5vk/Fbg7x3r6Odxx6k2gsaObQ+09\nlBVkT3l9CnoRkSlmZuRnp5OfnX7OaSWmWvx8WyAiIlNCQS8ikuAU9CIiCW7UoDezCjN71cx2mdlO\nM3s42F5kZq+Y2d5gWRh1zONmVmNme8zsjqn8A0RE5PzG0qLvB/7I3RcB1wMPmdki4DFgvbtXA+uD\n5wSvrQIWA3cCz5iZJp4WEYmRUYPe3Rvd/Z1g/TiwGygHVgJrgt3WAPcE6yuBZ929191rgRpgxWQX\nLiIiYzOuPnozqwKuBjYApe7eGLx0GCgN1suBg1GH1QfbREQkBsYc9GaWC/wL8Ii7H4t+zd0d8HMe\nOPLPe9DMNpnZpubm5vEcKiIi4zCmC6bMLJ1IyH/H3X8YbD5iZmXu3mhmZUBTsL0BqIg6fE6w7TTu\nvhpYHfz8ZjOru8C/AaAYaJnA8VNN9U2M6psY1Tcx8Vxf5Vh2skhj/Dw7RGboXwMcdfdHorb/FdDq\n7k+Z2WNAkbt/1swWA98l0i8/m8gXtdXuPnCOHz8pzGyTuy+fqp8/UapvYlTfxKi+iYn3+sZiLC36\nDwL3AdvNbGuw7U+Bp4C1ZvYAUAfcC+DuO81sLbCLyIidh6Yy5EVE5PxGDXp3fxMY6b5bt45wzBPA\nExOoS0RDmS5rAAADPElEQVREJkmiXBm7OtYFjEL1TYzqmxjVNzHxXt+oRu2jFxGR6S1RWvQiIjKC\naR30ZnZnMJ9OTTDyJ+bMbL+ZbTezrWa2Kdg24rxAF6Geb5pZk5ntiNoWN/MUjVDf582sITiHW83s\nYzGsL67nejpPfXFxDs0sy8w2mtkvg/r+ItgeL+dvpPri4vxNGneflg8gFXgfmA9kAL8EFsVBXfuB\n4jO2/SXwWLD+GPDFi1jPTcA1wI7R6gEWBecxE5gXnN/UGNT3eeCPz7FvLOorA64J1vOA94I64uIc\nnqe+uDiHRAZy5Abr6USuqr8+js7fSPXFxfmbrMd0btGvAGrcfZ+7nwSeJTLPTjwaaV6gKefubwBH\nx1jPRZ+naIT6RhKL+uJ6rqfz1DeSi12fu/uJ4Gl68HDi5/yNVN9IpuVcXtM56ON1Th0Hfmxmm83s\nwWDbSPMCxcp0mKfoD8xsW9C1M/SxPqb1xftcT2fUB3FyDs0sNbgGpwl4xd3j6vyNUB/EyfmbDNM5\n6OPVh9x9KfBRIlM63xT9okc+/8XNUKd4qyfwFSJdckuBRuBvYlvO5M/1NNnOUV/cnEN3Hwj+TcwB\nVpjZFWe8HtPzN0J9cXP+JsN0Dvoxzalzsbl7Q7BsAn5E5GPdEYvMB4SdPi9QrIxUT1ycU3c/Evzj\nGwS+xqmPxjGpz84z11PwekzP4bnqi7dzGNTUDrxK5D4VcXP+zlVfPJ6/iZjOQf8LoNrM5plZBpGb\nnayLZUFmFjKzvKF14HZgR1DX/cFu9wPPxabCYSPVsw5YZWaZZjYPqAY2XuzihgIg8GtEzmFM6jMz\nA74B7Hb3p6NeiotzOFJ98XIOzazEzAqC9WzgNuBd4uf8nbO+eDl/kybW3wZP5AF8jMgog/eBz8VB\nPfOJfCP/S2DnUE1AmMjkbnuBHxOZAO5i1fQ9Ih89+4j0Jz5wvnqAzwXncw/w0RjV90/AdmAbkX9Y\nZTGs70NEuhW2AVuDx8fi5Ryep764OIfAEmBLUMcO4M+C7fFy/kaqLy7O32Q9dGWsiEiCm85dNyIi\nMgYKehGRBKegFxFJcAp6EZEEp6AXEUlwCnoRkQSnoBcRSXAKehGRBPf/Ab3xUQZhjfpjAAAAAElF\nTkSuQmCC\n", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "plt.plot(range(0,n*step,step), error)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Further Resources**:\n", "- [a whole course on randomized algorithms](http://www.cs.ubc.ca/~nickhar/W12/)" ] }, { "cell_type": "markdown", "metadata": { "heading_collapsed": true }, "source": [ "### More Details" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "Here is a process to calculate a truncated SVD, described in [Finding Structure with Randomness: Probabilistic Algorithms for Constructing Approximate Matrix Decompositions](https://arxiv.org/pdf/0909.4061.pdf) and [summarized in this blog post](https://research.fb.com/fast-randomized-svd/):\n", "\n", "1\\. Compute an approximation to the range of $A$. That is, we want $Q$ with $r$ orthonormal columns such that $$A \\approx QQ^TA$$\n", "\n", "\n", "2\\. Construct $B = Q^T A$, which is small ($r\\times n$)\n", "\n", "\n", "3\\. Compute the SVD of $B$ by standard methods (fast since $B$ is smaller than $A$), $B = S\\,\\Sigma V^T$\n", "\n", "4\\. Since $$A \\approx Q Q^T A = Q (S\\,\\Sigma V^T)$$ if we set $U = QS$, then we have a low rank approximation $A \\approx U \\Sigma V^T$." ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "#### So how do we find $Q$ (in step 1)?" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "To estimate the range of $A$, we can just take a bunch of random vectors $w_i$, evaluate the subspace formed by $Aw_i$. We can form a matrix $W$ with the $w_i$ as it's columns. Now, we take the QR decomposition of $AW = QR$, then the columns of $Q$ form an orthonormal basis for $AW$, which is the range of $A$.\n", "\n", "Since the matrix $AW$ of the product has far more rows than columns and therefore, approximately, orthonormal columns. This is simple probability - with lots of rows, and few columns, it's unlikely that the columns are linearly dependent." ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "#### The QR Decomposition" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "We will be learning about the QR decomposition **in depth** later on. For now, you just need to know that $A = QR$, where $Q$ consists of orthonormal columns, and $R$ is upper triangular. Trefethen says that the QR decomposition is the most important idea in numerical linear algebra! We will definitely be returning to it." ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "#### How should we choose $r$?" ] }, { "cell_type": "markdown", "metadata": { "hidden": true }, "source": [ "Suppose our matrix has 100 columns, and we want 5 columns in U and V. To be safe, we should project our matrix onto an orthogonal basis with a few more rows and columns than 5 (let's use 15). At the end, we will just grab the first 5 columns of U and V\n", "\n", "So even although our projection was only approximate, by making it a bit bigger than we need, we can make up for the loss of accuracy (since we're only taking a subset later). " ] }, { "cell_type": "code", "execution_count": 175, "metadata": { "hidden": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 144 ms, sys: 8 ms, total: 152 ms\n", "Wall time: 154 ms\n" ] } ], "source": [ "%time u, s, v = decomposition.randomized_svd(vectors, 5)" ] }, { "cell_type": "code", "execution_count": 176, "metadata": { "hidden": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 2.38 s, sys: 592 ms, total: 2.97 s\n", "Wall time: 2.96 s\n" ] } ], "source": [ "%time u, s, v = decomposition.randomized_svd(vectors.todense(), 5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## End" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.1" } }, "nbformat": 4, "nbformat_minor": 2 }