{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Homework 5: Matrix Factorization goes to the Movies" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For this HW, we'll explore matrix factorization and its application to Movie rating prediction. Matrix factorization is the most widely used approach for ratings prediction and many related tasks, and it proved its mettle in the Netflix challenge in 2008-2009, being the dominant model component (the actual winning model was a blend of many models) in the winning entry and most highly-competitive entries." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Start your EC2 instance\n", "\n", "Follow the directions in lab 8 for starting your instance. Make sure you coordinate with your team-mates in starting up and shutting down. Please dont leave instances running when you're not using them. \n", "\n", "You should be able to do this HW concurrently with other team members, but some teams reported problems with network reliability in lab with multiple connections. This will depend on the time of day and the overall load on EC2. The longest cell in this lab (eigenvalue anaysis) is about 2 minutes, so avoid doing that concurrently." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Downloading this notebook. \n", "\n", "The easiest way to download this notebook is start a terminal connection to the instance. Then in a *local* (laptop) browser window, open this notebook from the link in the course page and right click on the download link at the top of this page. Then select \"copy link location\" (or whatever your browser option is for copying a link). Then go to your Amazon EC2 terminal window. cd to the directory where you would like to save this notebook (e.g. ~/hw5) and type\n", "<pre>wget <paste></pre>\n", "i.e. type wget and then right click your mouse and select paste to paste the URL for this document. wget will copy hw5.ipynb to that location. \n", "\n", "Now in your ec2 instance terminal window do\n", "<pre>bidmach notebook</pre>\n", "to start interacting with this notebook. As in lab, if you have X11 installed on your laptop, it will bring up a browser window. Close that, and instead point your local (laptop) browser at this URL:\n", "<pre>http://localhost:8888</pre>\n", "or another port address if the notebook server starts on another port (look at its startup message in the ec2 terminal window).\n", "\n", "### Reminder\n", "Dont use your VM for any of this HW. You should work directly from your laptop to the EC2 instance. \n", "\n", "First we import BIDMach's classes. 
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import BIDMat.{CMat,CSMat,DMat,Dict,FMat,FND,GMat,GDMat,GIMat,GLMat,GSMat,GSDMat,HMat,IDict,Image,IMat,LMat,Mat,SMat,SBMat,SDMat}\n", "import BIDMat.MatFunctions._\n", "import BIDMat.SciFunctions._\n", "import BIDMat.Solvers._\n", "import BIDMat.JPlotting._\n", "import BIDMach.Learner\n", "import BIDMach.models.{FM,GLM,KMeans,KMeansw,LDA,LDAgibbs,Model,NMF,SFA,RandomForest}\n", "import BIDMach.networks.{DNN}\n", "import BIDMach.datasources.{DataSource,MatDS,FilesDS,SFilesDS}\n", "import BIDMach.mixins.{CosineSim,Perplexity,Top,L1Regularizer,L2Regularizer}\n", "import BIDMach.updaters.{ADAGrad,Batch,BatchNorm,IncMult,IncNorm,Telescoping}\n", "import BIDMach.causal.{IPTW}\n", "\n", "Mat.plotInline = true\n", "Mat.checkMKL\n", "Mat.checkCUDA" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Loading Movie Data\n", "The movie data is from the Movielens website of the Grouplens project at U. Minnesota:\n", "http://grouplens.org/datasets/movielens/\n", "The numerical ratings data has been pre-processed and split into test/train subsets. Its already installed on your instance. Lets load it:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "val dir = \"/opt/BIDMach/data/movielens/\"\n", "val train = loadSMat(dir + \"train.smat.lz4\")\n", "val test = loadSMat(dir + \"test.smat.lz4\")\n", "val nmovies = train.nrows\n", "val nusers = train.ncols\n", "val nratings = train.nnz + test.nnz\n", "(nmovies, nusers, nratings)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Lets do some quick exploration of this dataset. A natural question is what is the distribution of movie ratings? Lets take a look. Note that the train and test data are in sparse (nmovies x nusers) matrices. Zeros in the matrices are missing ratings, not zero scores (the smallest legal score is 0.5). To access the non-zeros of a sparse matrix we use the \"contents\" method." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "hist(test.contents, 30)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> TODO: List two interesting features of this distribution. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Another natural question is what is the distribution of frequencies of ratings of movies? We say many datasets with power law behavior. " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "// Make all values be 1 with !=, then sum across each row to get movie rating count\n", "val counts = sum(train != 0, 2) + sum(test != 0, 2)\n", "val scounts = sortdown(counts) // Sort them down\n", "val nc = sum(scounts > 0).v.toInt // Restrict to movies with non-zero counts\n", "plot(row(1 to nc),scounts(0->nc,0)+1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Power law perhaps? But to be sure we need to examine the distribution with a loglog plot." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "loglog(row(1 to nc),scounts(0->nc,0)+1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Not really power law. Not enough movies with very many ratings, and not enough movies with very few. 
There are economic forces at play - successful movies already reach a large fraction of the potential audience so we expect saturation at the top. At the other end, it's less clear. While movies need a minimum expected viewership to be released, many underachieve even low expectations. It's less clear whether the weak tail in this dataset is from the movies themselves or from the MovieLens authors' decisions to cull less frequently-reviewed movies." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Ratings Prediction" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's define a learner for this dataset. The model type is SFA for Sparse Factor Analysis. SFA factorizes the observed movie data $S$ as $$S \\approx M^T * U$$\n", "where $S$ is an nmovies x nusers matrix with numerical ratings such that $S_{i,j}$ is the rating of movie i by user j. $M$ has dimension dim x nmovies, where dim is the hidden dimension. $U$ has dimension dim x nusers. \n", "\n", "A key component of sparse matrix factorization is that the error in the prediction is only evaluated at elements $S_{i,j}$ for which we actually have a rating. Other elements of $S$ are left as zeros (which means that zero cannot be a legal rating - e.g. ratings could be offset so they are all >= 1). So if we write $P = M^T*U$ the total squared error that we try to minimize is\n", "\n", "$$\\sum_{(i,j) \\in R}\\left(S_{ij}-P_{ij}\\right)^2$$\n", "\n", "where $R$ is the set of actual ratings. \n", "\n", "Next we create a learner object. Since we are building a factorization $S\\approx M^T* U$, we would like to get two results back: $M$ and $U$. BIDMach's data model assumes that input data has dimensions nfeatures x npoints. It returns a model whose size is set by the number of features. This will be the matrix $M$ in this case. To get back $U$ as well, we need to create a suitable-sized matrix first, called ufact here. " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "val dim = 128\n", "val ufact = zeros(dim, test.ncols);\n", "val (nn,opts) = SFA.learner(train, ufact, dim)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It's very important for matrix factorization models to carefully set the regularization parameters - the likelihood factors that bias the movie and user coefficient matrices toward zero. Matrix factorization data is extremely sparse: most users rate a very small fraction of the available items. The number of parameters in the model is often high compared to the number of data values (non-zeros), and this makes the model easy to overfit. Regularization reduces the level of overfitting by nudging model parameters toward zero. \n", "\n", "The total function minimized by SFA with regularization is\n", "$$\\sum_{(i,j)\\in R}\\left(S_{i,j}-P_{i,j}\\right)^2 + \\lambda_M || M ||^2 + \\lambda_U || U ||^2$$\n", "where $|| X ||^2$ is the sum of the squares of all the elements of $X$, and $\\lambda_M$ is the regularization parameter for $M$ and $\\lambda_U$ is the regularization parameter for $U$."
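 ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The next cells set these regularizers and train the model. Before that, here is an optional, purely illustrative sketch (feel free to skip it) of how you could sweep a few values of $\\lambda_M$ and compare the resulting test RMSE, reusing the learner and predictor calls that appear later in this notebook. The lambda values and the variable names (ufact2, nn2 and so on) are hypothetical choices, and the loop retrains the model once per value, so it takes several times longer than a single training run." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "// Optional sketch (not part of the assignment): compare test RMSE for a few movie-side regularizer values.\n", "// The lambda values below are illustrative guesses, not recommended settings.\n", "val lambdas = Array(0.01f, 0.025f, 0.05f)\n", "for (lambda <- lambdas) {\n", "  val ufact2 = zeros(dim, test.ncols)      // fresh user factor matrix for this run\n", "  val (nn2, opts2) = SFA.learner(train, ufact2, dim)\n", "  opts2.lambdam = lambda                   // movie-side regularizer being tested\n", "  opts2.lambdau = 1f                       // keep the user-side regularizer fixed\n", "  opts2.batchSize = 1000\n", "  nn2.train\n", "  val ptest2 = (test != 0)                 // 1s at the test non-zero positions\n", "  val (pp2, popts2) = SFA.predictor(nn2.model, train, ufact2, ptest2)\n", "  popts2.batchSize = 10000\n", "  pp2.predict                              // fills ptest2 with predicted ratings\n", "  val diff2 = test.contents - ptest2.contents\n", "  println(\"lambdam %f gives test RMSE %f\" format (lambda, sqrt((diff2 ∙ diff2) / diff2.length).v))\n", "}"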
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "opts.lambdam = 0.025f // Movie matrix regularizer\n", "opts.lambdau = 1f // User matrix regularizer\n", "opts.batchSize = 1000 // Adjust batch size - smaller is generally slower but more accurate" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "nn.train" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A quick overview of BIDMach's log output. As we discussed in lecture, there are several aspects of performance. The log output is designed to allow monitoring of all of these. The columns are\n", "1. Percent of dataset processed in this pass over the dataset. The pass number is separately printed at the start of the pass. \n", "2. The log likelihood. This is typically normalized to a per-feature or per-input-point value so that its value is independent of dataset or minibatch size. Since its the log of a probability, its inevitably negative. \n", "3. The computational throughput in gigflops. \n", "4. The elapsed time in seconds.\n", "5. The total GB processed.\n", "6. The data processing rate in MB/s\n", "7. When using a GPU (true by default), the amount (fraction) of GPU memory available. \n", "\n", "## Statistical Efficiency\n", "\n", "For instance, to interpret the statistical efficiency of the algorithm we compare the log loss with the amount of data consumed. its easiest to do this with a graph. Every learner contains a \"results\" field with a history of the data consumed and the loss. The first row of results is the loss over time. The second row is the amount of data consumed (number of points processed). Lets plot it:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "plot(nn.results(1,?),nn.results(0,?))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> TODO: The likelihood plot has a noticeable \"kink\" at the end of each pass over the dataset. Investigate and explain. HINT: The learner computes and prints scores on every 11 minibatches by default, but always scores the last minibatch of each pass. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Hardware Efficiency\n", "The hardware efficiency can be read directly from the gflops number. While its not easy to interpret at first, various types of calculation have well-defined performance limits. Here were using certain sparse matrix products which have a limit on this hardware of about 30 gflops. We're getting slightly more than half of that limit, which is a good amount in practice. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Computational Efficiency\n", "Computational efficiency as we use the term here means the amount of work (flops) per model update. You can get it by dividing the gflops log value by the data throughput value. This particular model has a fairly low efficiency score - it does several iterative updates to the user factor for each model update. This is necessary to be able to scale to very large datasets. Otherwise you have to save the user factor data between passes over the dataset, and its larger than the dataset itself. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now lets evaluate the model on some test data. Note train/test breakdown is a little different from other problems. 
Notice the following:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "train.dims" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "test.dims" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Both matrices have the same number of movies and users. This is to model user history. Typically when making a movie prediction, you have some history for each user to predict from which is saved in the training matrix. The test/train split is therefore done on the set of non-zeros for the given movies and users, giving a test set with a different set of (movie, user) pairs. \n", "\n", "This still gives some movies for which you have no user history (cold-start problem). The model makes a prediction for those based solely on overall average movie ratings. \n", "\n", "The predictor takes the model, the training data once again, the user factor matrix, and a matrix to hold the movie predictions (ptest). These matrices are necessary because the predictor is designed to work if needed with a different set of users. It actually recomputes the user factor using the ratings in its second argument, and then uses those to predict the values in its final argument (ptest). Here train, ufact and ptest are for the same set of users as were used in training, but in general a different set could be used. " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "val ptest = (test != 0) // form a sparse matrix with same non-zeros as test, but set to 1's\n", "val (pp,popts) = SFA.predictor(nn.model, train, ufact, ptest) // The predictor as explained above. \n", "popts.batchSize = 10000" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Batch size has no effect on prediction peformance. We made it larger simply to reduce the amount of printout." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "pp.predict" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To score the model we compute the root-mean-squared error (RMSE) between the predictions and actual ratings on the test set. Since predictions and test matrices are sparse but we only care about their non-zeros, its convenient to use the contents method which returns a dense matrix containing only the non-zeros of a sparse matrix. The ∙ operator is the dot product which computes the sum of the element-wise products between two vectors." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "val diff = test.contents - ptest.contents;\n", "sqrt((diff ∙ diff) / diff.length)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is quite a good score. It means that typical predictions were less than a score point away from the actual score on a 0 to 5 point scale. Movielens does allow half-point scores, but more than 70% of users scores are integral. \n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Understanding Model Factors" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By using a matrix factorization model, we're assuming that user preferences lie in a low-dimensional linear space. But what are the main dimensions of this space? Can we articulate the main dimensions in terms of movies? 
These kinds of questions are usually answered with PCA (Principal Component Analysis) https://en.wikipedia.org/wiki/Principal_component_analysis \n", "\n", "PCA in turn uses SVD (Singular Value Decomposition) https://en.wikipedia.org/wiki/Singular_value_decomposition to decompose a matrix. SVD is designed for dense matrices, and our data is sparse. But we do have a dense matrix of *predictions* P:\n", "$$P = M^T * U$$\n", "so if we do an SVD of $P$, we should recover an ordered (in decreasing order of importance for explaining the data) list of the components of the predictor in the left singular vectors. But wait, don't try this! The matrix $P$ is huge:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "print(\"P is %d by %d and has %d nonzeros\" format (nmovies, nusers, 1L*nmovies*nusers))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's use our knowledge of the SVD algorithm to reduce the work we have to do. Let's first shrink the matrices $P$ and $M$ since some movies never got rated in the dataset. We want $P_{trimmed} = M_{trimmed}^T * U$. " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "val inz = find(sum(train,2) + sum(test,2) != 0);\n", "val m_trimmed = FMat(nn.modelmat)(?,inz)\n", "val nmovies_trimmed = m_trimmed.ncols" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "m_trimmed.dims" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That's good, we reduced the number of movies by 6x. Next from \n", "https://en.wikipedia.org/wiki/Singular_value_decomposition\n", "we notice that there is a relation between singular vectors and eigenvectors. We want the singular vectors that match the movies dimension of $P_{trimmed}$. These are the left singular vectors. From the article, we note that the left singular vectors of a matrix $M$ are the eigenvectors of $M M^T$. So our goal is to compute the eigenvectors of\n", "\n", "$$X = P_{trimmed} * P_{trimmed}^T = M_{trimmed}^T U U^T M_{trimmed}$$\n", "\n", "We can save the transpose operations using BIDMach's *^ and ^* operators:\n", "\n", "A *^ B = A * B.t\n", "\n", "A ^* B = A.t * B\n", "\n", "which suggests the expression:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "// val x = m_trimmed ^* ufact *^ ufact * m_trimmed" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "But wait again, this formula involves some very large intermediate results. Matrix multiply of an (m x k) and (k x n) matrix takes about 2mkn operations. What's the cost of the calculation above? \n", "\n", "> TODO: In the cell below, calculate how many arithmetic steps are needed for the three multiplies (ignore the additions) in the formula above.\n", "\n", "> WARNING: Int values in Scala and most other languages have 32 bits and overflow (without error) at values larger than about two billion. They will overflow here, so make sure you use Double or Long types. " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Luckily, matrix multiply is associative so we can use an arbitrary arrangement of brackets around subsequences of matrices in the formula above. 
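\n", "\n", "As a quick illustration of the cost rule (with made-up dimensions, not the ones from this problem), multiplying a hypothetical (1000 x 128) matrix by a (128 x 2000) matrix takes about 2 * 1000 * 128 * 2000 = 512 million operations. In Scala it is safest to evaluate such counts with Long or Double literals, for example\n", "<pre>2.0 * 1000 * 128 * 2000</pre>\n", "so that the product does not overflow the Int range. 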
\n", "\n", "> TODO: In the cell below, give a more efficient bracketization, and in the cell below that, compute its computational cost: Hint: you should be able to get at least an order of magnitude reduction in cost" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> TODO: What's the cost of computing the expression above?" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we have our matrix X, we are almost ready to compute eigenvectors. One last tweak is important: our calculations so far have used floating point arithmetic. This is fine for coarse operations like gradient updates to compute a model. But for many numerical algorithms such as eigenvalue routines, single-precision is not enough. The cell below converts X from single to double precision:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "val dx = DMat(x)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next we compute eigenvalues and eigenvectors using a scientific library (LAPACK) routines. There are actually several eigenvalue routines in BIDMach. feig is \"fast eigenvalues\". Lets also track its performance with flip/gflop, which start and stop performance bookkeeping." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "flip; val (eigvals, eigvecs) = feig(dx); val gf=gflop" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The first number is the gigaflops achieved, and the second is the time taken in seconds. Notice that the gflops througput is pretty modest for this routine - it was run on the cpu first of all, and second it used double precision. Matrix algebra on the CPU is quite a bit slower than on the GPU. Compare with dense matrix multiply on the GPU:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "val gdx = GMat(dx)\n", "flip; val b = gdx * gdx; val gf=gflop" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are several reasons why we dont compute eigenvalues on the GPU:\n", "* There is no standard library routine for it, and that's because:\n", "* Eigenvalue calculations are more naturally sequential than matrix multiply. Its difficult to run the calculation entirely on the GPU, and most research implementations of GPU eigenvalues use hybrid computation with both CPU and GPU.\n", "* GPUs have much more modest double precision performance than single precision. Since we need doubles here, the advantage (if any) of using the GPU would be even less." ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "Back to the analysis, lets look at the largest eigenvalues since those correspond to the strongest factors. The feig routine returns them in sorted order ascending. So the first ones are the smallest. 
We want the largest values, so we reverse this list: " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "reverse(eigvals)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It will be convenient to access the last columns of the eigenvector matrix in reverse order as well, so we create some suitable indices:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "val revinds = irow(nmovies_trimmed-1 to nmovies_trimmed-20 by -1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's check that the eigenanalysis actually worked, i.e. produced eigenvalues and eigenvectors (sometimes it doesn't, e.g. with single-precision arithmetic on this data). By definition of an eigenvector $v$, $M v = \\lambda v$, where $\\lambda$ is the eigenvalue associated with $v$. So if we do $M v / v$ (element-wise division) we should get a column of copies of the eigenvalue $\\lambda$. \n", "\n", "As a data scientist, you should always check your partial results, especially when you use a complex black box like an eigenvalue routine. Even well-engineered software breaks. " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "dx * eigvecs(?,revinds) / eigvecs(?,revinds)" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "> TODO: describe the columns of the matrix above. Did eigenanalysis work? " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Getting Movie Meta-data\n", "\n", "There is some meta-data like titles and genres in the data directory. Let's grab it now. We'll write a short custom data parser for this. It's very easy in Scala. First we use Scala's \"process\" package to perform some shell commands without leaving the IPython prompt." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import scala.sys.process._\n", "import java.io.File\n", "val dir2 = dir + \"ml-10M100K/\"\n", "// Do a unix \"ls\" on the data directory for Movielens\n", "(\"ls \"+dir2) !" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "// Run the unix \"head\" command on the movies.dat file\n", "(\"head \"+dir2+\"movies.dat\") !" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we know what the data looks like, let's parse it. We use Scala's Source class for this." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import scala.io.Source\n", "val movielines = Source.fromFile(dir2+\"movies.dat\").getLines.toArray // get lines of the file into an array of String\n", "val movietable = movielines.map((x:String) => x.split(\":\")) // Split the strings at separator, now Array[Array[String]]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we have the data as an nlines Array, each of whose elements is an Array with ncolumns of String. Let's extract the individual columns of this array into integer (IMat) and String (CSMat) matrices. 
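\n", "\n", "One note on the column indices used in the next cell: movies.dat uses \"::\" as its field separator, so splitting on a single \":\" leaves an empty string at each odd position. For example, a line of the form (illustrative, not necessarily an exact line from the file)\n", "<pre>\"1::Toy Story (1995)::Adventure|Animation\".split(\":\")\n", "// yields the elements \"1\", \"\", \"Toy Story (1995)\", \"\", \"Adventure|Animation\"</pre>\n", "which is why the id, title and genres end up at indices 0, 2 and 4. Splitting on the two-character separator, x.split(\"::\"), would also work and is more robust for titles that contain a colon, but then the title and genres would be at indices 1 and 2. 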
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "val len = movietable.length\n", "val movienums = new IMat(len, 1, movietable.map((x:Array[String]) => x(0).toInt)) // Parse first column to Int\n", "// CSMat = matrix of Strings\n", "val movienames0 = new CSMat(len, 1, movietable.map((x:Array[String]) => x(2))) // Save third column as name\n", "val moviegenres0 = new CSMat(len, 1, movietable.map((x:Array[String]) => x(4))) // Save fifth column as genre\n", "val moviemeta = CSMat(nmovies,2); // This will contain both movie title and genre\n", "moviemeta(movienums-1, 0) = movienames0; // Correctly map the title to the corresponding movie number (one-based) \n", "moviemeta(movienums-1, 1) = moviegenres0; // Correctly map the genre to the corresponding movie number (one-based)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we're ready to peek at the movies corresponding to the various factors. The strongest factor, i.e. largest eigenvalue, is at the end of the array, at position nmovies_trimmed - 1. Lets get the corresponding eigenvector, and look at its largest and smallest values. These correspond to the movies that have the most influence (positive or negative) on that factor. When looking up the metadata, we have to undo the trimming we did with the matrix inz. " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "val ivec = nmovies_trimmed - 1\n", "println(\"eigval %f\" format eigvals(ivec,0)) // Print the eigenvalue\n", "val pvec = eigvecs(?,ivec) // Get the corresponding eigenvector\n", "val (sbig,isort) = sort2(pvec) // Sort its values and return the indices of the sorted values\n", "println(moviemeta(inz(isort(0->20)),0).t) // Print the numerically smallest movies (biggest negative magnitude)\n", "println()\n", "println(moviemeta(inz(isort(revinds)),0).t) // Print the numerically largest movies (biggest positive magnitude)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before trying to figure out what this data is telling us, let's think critially about this particular eigenvalue - notice that it is two orders of magnitude bigger than the next few values which have similar magnitudes. That suggests that it is modeling something qualitatively different. You'll also find that all its components have the same sign. Finally, if you only had one vector to predict movie scores as accurately as possible, what would it be?\n", "\n", "This special component is actually modeling the average movie score. So the range you see is from the worst movies (in this particular dataset) through to the best. This component captures average user preference rather than the variation between users. \n", "\n", "Now lets look at the other components. These are necessarily orthogonal to the first component, and will be capturing inter-user factors." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "val ivec = nmovies_trimmed - 2\n", "val pvec = eigvecs(?,ivec)\n", "println(\"eigval %f\" format eigvals(ivec,0))\n", "val (sbig,ibig) = sort2(pvec)\n", "println(moviemeta(inz(ibig(0->20)),0).t)\n", "println()\n", "println(moviemeta(inz(ibig(revinds)),0).t)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> TODO: Describe the types of moves at the two extremes of this factor. btw, expect to see some \"leakage\" of the first factor into this one. 
PCA is imperfect and finds orthogonal factors rather than independent factors. " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "val ivec = nmovies_trimmed - 3\n", "val pvec = eigvecs(?,ivec)\n", "println(\"eigval %f\" format eigvals(ivec,0))\n", "val (sbig,ibig) = sort2(pvec)\n", "println(moviemeta(inz(ibig(0->20)),0).t)\n", "println()\n", "println(moviemeta(inz(ibig(revinds)),0).t)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> TODO: Describe the two extremes of this factor. Try to ignore movies that \"leak\" through from the previous two factors. " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "val ivec = nmovies_trimmed - 4\n", "val pvec = eigvecs(?,ivec)\n", "println(\"eigval %f\" format eigvals(ivec,0))\n", "val (sbig,ibig) = sort2(pvec)\n", "println(moviemeta(inz(ibig(0->20)),0).t)\n", "println()\n", "println(moviemeta(inz(ibig(revinds)),0).t)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> TODO: Try to describe the extreme movies of this factor. By now things will be getting very leaky (in terms of influence of the previous 3 factors) so you will have to find the new signal in a lot of noise. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Submit this HW\n", "\n", "Save this notebook on your EC2 instance and copy it to your laptop for submission. It's probably easiest to do this using scp from your laptop. The submission link is <a href=\"https://bcourses.berkeley.edu/courses/1377158/assignments/6675867\">here</a>. \n", "\n", "Every person in the team should submit their own version (since the HW has some narrative questions in it; don't copy those). " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Shutdown your Instance\n", "\n", "Please shut down your instance when you're done with it. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Optional: Eigenvalues on a budget" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The eigenvalue calculation was the most expensive part of the pipeline we used. It also has poor scaling since its cost grows as nmovies^3 and the storage as nmovies^2. Accurate eigenvalues are important for many numerical problems. On the other hand many large datasets are naturally noisy, and high precision isn't important. For SVD and PCA, we only want a few of the largest-magnitude eigenvalues. For that we can use a much simpler method called the **Power Method**. If you've had a linear algebra course, you already know how to do this :-). \n", "\n", "To use the power method you really only need a couple of elementary facts about eigenvalues and eigenvectors. Recall that by definition, when you multiply an eigenvector by a matrix you get a multiple (by the eigenvalue) of the eigenvector, i.e.\n", "$$M v = \\lambda v$$\n", "\n", "**First Fact** The product of a matrix and its transpose (like the matrix X here) is *positive semi-definite*, meaning its eigenvalues are all non-negative. \n", "\n", "**Second Fact** The eigenvectors for distinct eigenvalues of a positive semi-definite matrix are orthogonal, and we can form a complete basis using the eigenvectors. That is, we can write *any* vector $u$ as\n", "$$u = \\sum_{i=1}^N a_i v_i$$\n", "for some scalar coefficients $a_i$ and eigenvectors $v_i$. \n", "\n", "Now suppose we multiply $u$ by the matrix $X$ whose eigenvalues we want. 
The result is\n", "$$Xu = \\sum_{i=1}^N a_i Xv_i = \\sum_{i=1}^N a_i \\lambda_i v_i$$\n", "So every eigenvector $v_i$ was scaled by its eigenvalue $\\lambda_i$. The vector with the largest eigenvalue grew faster than all the others. Suppose we repeated this k times:\n", "$$X^ku = \\sum_{i=1}^N a_i X^kv_i = \\sum_{i=1}^N a_i \\lambda_i^k v_i$$\n", "and to be clear $X^ku$ means $X*(X*(X,\\ldots,(X*u),\\ldots))$. Now the coefficient of the eigenvector with largest eigenvalue is much larger than the others. If $u$ began as a random vector, we can get the strongest eigenvalue by simply multiplying by $X$ repeatedly. This method is actually used at large scale, and otherwise known as **Pagerank**.\n", "\n", "Pagerank only gives the largest eigenvalue and vector. If we want to get a few more, we can use the same idea but restrict ourselves to vectors which are orthogonal to the strongest eigenvector. That is, we compute the strongest eigenvector $v_1$ using the power method above, and then power a random vector $u$ after orthogonalizing it to $v_1$:\n", "$$u^{\\prime} = u - v_1 * (u \\cdot v_1)/(v_1 \\cdot v_1)$$\n", "After a few power cycles on such a $u$, the strongest eigenvector will be $v_2$ (since we removed $v_1$) the eigenvector corresponding to the second largest magnitude eigenvalue. \n", "\n", "We can compute $v_3$ in a similar way by orthogonalizing to $v_1$ and $v_2$. We can do all these orthogonalizations in a single step by computing the inverse of the triangular matrix which is the pairwise inner products of the estimated eigenvectors. \n", "\n", "This method is very simple and has, in numerical terms, slow convergence. On the other hand, for noisy data applications it gets to good results in just a few iterations. " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "val k = 5 // Number of eigenvalues/vectors to compute\n", "val p = normrnd(0,1,inz.length, k) // random matrix with k columns" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "flip\n", "val niter = 20\n", "for (i <- 0 until niter) {\n", " val xp = x * p // Compute X * p, boosting the strong eigenvectors\n", " println(xp(0->5,?) / p(0->5,?)); println // Print the ratios of X * p / p to track progress\n", " xp ~ xp / sqrt(xp ∙ xp) // Normalize the eigenvector estimates to unit magnitude\n", " val dots = xp ^* xp // Compute all pairwise dot products\n", " p ~ xp * triinv(dots.clearLower) // Invert the upper triangular matrix of products, mult by xp, save in p\n", "}\n", "val gf=gflop" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> TODO: What was the running time and gflops for the power method? Enter in this cell" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> TODO: What was the running time and storage required as a function of nmovies for the method above? Enter in this cell." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, the method above is quite fast but still not optimal. Its weakness is that the matrix x was computed explicitly which takes O(nmovies^2) time and space. You can run this iteration quite a bit faster by computing x*p *without computing x*. That is, use the definition of x from an earlier cell, and associate matrix multiplies for x*p to use the least number of arithmetic steps. Be sure to precompute any fixed values. \n", "\n", "> TODO: fill in the line below with a more efficient formula for xp. 
" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "flip\n", "val niter = 20\n", "for (i <- 0 until niter) {\n", " val xp = x * p // TODO: modify this line to compute xp faster\n", " println(xp(0->5,?) / p(0->5,?)); println // Print the ratios of X * p / p to track progress\n", " xp ~ xp / sqrt(xp ∙ xp) // Normalize the eigenvector estimates to unit magnitude\n", " val dots = xp ^* xp // Compute all pairwise dot products\n", " p ~ xp * triinv(dots.clearLower) // Invert the upper triangular matrix of products, mult by xp, save in p\n", "}\n", "val gf=gflop" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> TODO: What was the running time of your modified code? Enter it here. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> TODO: What is the number of operations in terms of nmovies? Enter it here." ] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "name": "scala", "version": "2.11.2" } }, "nbformat": 4, "nbformat_minor": 0 }