{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "# SVD for Movie Recommendations\n",
    "In this notebook, I'll detail a basic version of model-based collaborative filtering for recommendations by employing it on the MovieLens 1M dataset. \n",
    "\n",
    "[In my previous attempt](https://github.com/khanhnamle1994/movielens/blob/master/Content_Based_and_Collaborative_Filtering_Models.ipynb), I used user-based and item-based collaborative filtering to make movie recommendations from users' ratings data. I can only try them on a very small data sample (20,000 ratings), and ended up getting pretty high Root Mean Squared Error (bad recommendations). Memory-based collaborative filtering approaches that compute distance relationships between items or users have these two major issues:\n",
    "\n",
    "1. It doesn't scale particularly well to massive datasets, especially for real-time recommendations based on user behavior similarities - which takes a lot of computations.\n",
    "2. Ratings matrices may be overfitting to noisy representations of user tastes and preferences. When we use distance based \"neighborhood\" approaches on raw data, we match to sparse low-level details that we assume represent the user's preference vector instead of the vector itself.\n",
    "\n",
    "Thus I'd need to apply **Dimensionality Reduction** technique to derive the tastes and preferences from the raw data, otherwise known as doing low-rank matrix factorization. Why reduce dimensions?\n",
    "\n",
    "* I can discover hidden correlations / features in the raw data.\n",
    "* I can remove redundant and noisy features that are not useful.\n",
    "* I can interpret and visualize the data easier.\n",
    "* I can also access easier data storage and processing.\n",
    "\n",
    "With that goal in mind, I'll introduce Singular Vector Decomposition (SVD) to you, a powerful dimensionality reduction technique that is used heavily in modern model-based CF recommender system.\n",
    "\n",
    "![dim-red](images/dimensionality-reduction.jpg)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Loading the Dataset\n",
    "Let's load the 3 data files just like last time."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Import libraries\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "# Reading ratings file\n",
    "ratings = pd.read_csv('ratings.csv', sep='\\t', encoding='latin-1', usecols=['user_id', 'movie_id', 'rating', 'timestamp'])\n",
    "\n",
    "# Reading users file\n",
    "users = pd.read_csv('users.csv', sep='\\t', encoding='latin-1', usecols=['user_id', 'gender', 'zipcode', 'age_desc', 'occ_desc'])\n",
    "\n",
    "# Reading movies file\n",
    "movies = pd.read_csv('movies.csv', sep='\\t', encoding='latin-1', usecols=['movie_id', 'title', 'genres'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's take a look at the movies and ratings dataframes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>movie_id</th>\n",
       "      <th>title</th>\n",
       "      <th>genres</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>Toy Story (1995)</td>\n",
       "      <td>Animation|Children's|Comedy</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2</td>\n",
       "      <td>Jumanji (1995)</td>\n",
       "      <td>Adventure|Children's|Fantasy</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>3</td>\n",
       "      <td>Grumpier Old Men (1995)</td>\n",
       "      <td>Comedy|Romance</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>4</td>\n",
       "      <td>Waiting to Exhale (1995)</td>\n",
       "      <td>Comedy|Drama</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>5</td>\n",
       "      <td>Father of the Bride Part II (1995)</td>\n",
       "      <td>Comedy</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   movie_id                               title                        genres\n",
       "0         1                    Toy Story (1995)   Animation|Children's|Comedy\n",
       "1         2                      Jumanji (1995)  Adventure|Children's|Fantasy\n",
       "2         3             Grumpier Old Men (1995)                Comedy|Romance\n",
       "3         4            Waiting to Exhale (1995)                  Comedy|Drama\n",
       "4         5  Father of the Bride Part II (1995)                        Comedy"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "movies.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>user_id</th>\n",
       "      <th>movie_id</th>\n",
       "      <th>rating</th>\n",
       "      <th>timestamp</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>1193</td>\n",
       "      <td>5</td>\n",
       "      <td>978300760</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>661</td>\n",
       "      <td>3</td>\n",
       "      <td>978302109</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1</td>\n",
       "      <td>914</td>\n",
       "      <td>3</td>\n",
       "      <td>978301968</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1</td>\n",
       "      <td>3408</td>\n",
       "      <td>4</td>\n",
       "      <td>978300275</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>1</td>\n",
       "      <td>2355</td>\n",
       "      <td>5</td>\n",
       "      <td>978824291</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   user_id  movie_id  rating  timestamp\n",
       "0        1      1193       5  978300760\n",
       "1        1       661       3  978302109\n",
       "2        1       914       3  978301968\n",
       "3        1      3408       4  978300275\n",
       "4        1      2355       5  978824291"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ratings.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Also let's count the number of unique users and movies."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Number of users = 6040 | Number of movies = 3706\n"
     ]
    }
   ],
   "source": [
    "n_users = ratings.user_id.unique().shape[0]\n",
    "n_movies = ratings.movie_id.unique().shape[0]\n",
    "print 'Number of users = ' + str(n_users) + ' | Number of movies = ' + str(n_movies)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now I want the format of my ratings matrix to be one row per user and one column per movie. To do so, I'll pivot *ratings* to get that and call the new variable *Ratings* (with a capital *R)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th>movie_id</th>\n",
       "      <th>1</th>\n",
       "      <th>2</th>\n",
       "      <th>3</th>\n",
       "      <th>4</th>\n",
       "      <th>5</th>\n",
       "      <th>6</th>\n",
       "      <th>7</th>\n",
       "      <th>8</th>\n",
       "      <th>9</th>\n",
       "      <th>10</th>\n",
       "      <th>...</th>\n",
       "      <th>3943</th>\n",
       "      <th>3944</th>\n",
       "      <th>3945</th>\n",
       "      <th>3946</th>\n",
       "      <th>3947</th>\n",
       "      <th>3948</th>\n",
       "      <th>3949</th>\n",
       "      <th>3950</th>\n",
       "      <th>3951</th>\n",
       "      <th>3952</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>user_id</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>5.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>...</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>...</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>...</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>...</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>2.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>...</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 3706 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "movie_id  1     2     3     4     5     6     7     8     9     10    ...   \\\n",
       "user_id                                                               ...    \n",
       "1          5.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  ...    \n",
       "2          0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  ...    \n",
       "3          0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  ...    \n",
       "4          0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  ...    \n",
       "5          0.0   0.0   0.0   0.0   0.0   2.0   0.0   0.0   0.0   0.0  ...    \n",
       "\n",
       "movie_id  3943  3944  3945  3946  3947  3948  3949  3950  3951  3952  \n",
       "user_id                                                               \n",
       "1          0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  \n",
       "2          0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  \n",
       "3          0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  \n",
       "4          0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  \n",
       "5          0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  \n",
       "\n",
       "[5 rows x 3706 columns]"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "Ratings = ratings.pivot(index = 'user_id', columns ='movie_id', values = 'rating').fillna(0)\n",
    "Ratings.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Last but not least, I need to de-normalize the data (normalize by each users mean) and convert it from a dataframe to a numpy array."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "R = Ratings.as_matrix()\n",
    "user_ratings_mean = np.mean(R, axis = 1)\n",
    "Ratings_demeaned = R - user_ratings_mean.reshape(-1, 1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "With my ratings matrix properly formatted and normalized, I'm ready to do some dimensionality reduction. But first, let's go over the math."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Model-Based Collaborative Filtering\n",
    "*Model-based Collaborative Filtering* is based on *matrix factorization (MF)* which has received greater exposure, mainly as an unsupervised learning method for latent variable decomposition and dimensionality reduction. Matrix factorization is widely used for recommender systems where it can deal better with scalability and sparsity than Memory-based CF:\n",
    "\n",
    "* The goal of MF is to learn the latent preferences of users and the latent attributes of items from known ratings (learn features that describe the characteristics of ratings) to then predict the unknown ratings through the dot product of the latent features of users and items. \n",
    "* When you have a very sparse matrix, with a lot of dimensions, by doing matrix factorization, you can restructure the user-item matrix into low-rank structure, and you can represent the matrix by the multiplication of two low-rank matrices, where the rows contain the latent vector. \n",
    "* You fit this matrix to approximate your original matrix, as closely as possible, by multiplying the low-rank matrices together, which fills in the entries missing in the original matrix.\n",
    "\n",
    "For example, let's check the sparsity of the ratings dataset:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The sparsity level of MovieLens1M dataset is 95.5%\n"
     ]
    }
   ],
   "source": [
    "sparsity = round(1.0 - len(ratings) / float(n_users * n_movies), 3)\n",
    "print 'The sparsity level of MovieLens1M dataset is ' +  str(sparsity * 100) + '%'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Support Vector Decomposition (SVD)\n",
    "A well-known matrix factorization method is *Singular value decomposition (SVD)*. At a high level, SVD is an algorithm that decomposes a matrix $A$ into the best lower rank (i.e. smaller/simpler) approximation of the original matrix $A$. Mathematically, it decomposes A into a two unitary matrices and a diagonal matrix:\n",
    "\n",
    "![svd](images/svd.png)\n",
    "\n",
    "where $A$ is the input data matrix (users's ratings), $U$ is the left singular vectors (user \"features\" matrix), $\\Sigma$ is the diagonal matrix of singular values (essentially weights/strengths of each concept), and  $V^{T}$ is the right singluar vectors (movie \"features\" matrix). $U$ and $V^{T}$ are column orthonomal, and represent different things. $U$ represents how much users \"like\" each feature and $V^{T}$ represents how relevant each feature is to each movie.\n",
    "\n",
    "To get the lower rank approximation, I take these matrices and keep only the top $k$ features, which can be thought of as the underlying tastes and preferences vectors."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Setting Up SVD\n",
    "Scipy and Numpy both have functions to do the singular value decomposition. I'm going to use the Scipy function *svds* because it let's me choose how many latent factors I want to use to approximate the original ratings matrix (instead of having to truncate it after)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from scipy.sparse.linalg import svds\n",
    "U, sigma, Vt = svds(Ratings_demeaned, k = 50)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As I'm going to leverage matrix multiplication to get predictions, I'll convert the $\\Sigma$ (now are values) to the diagonal matrix form."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "sigma = np.diag(sigma)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Making Predictions from the Decomposed Matrices\n",
    "I now have everything I need to make movie ratings predictions for every user. I can do it all at once by following the math and matrix multiply $U$, $\\Sigma$, and $V^{T}$ back to get the rank $k=50$ approximation of $A$.\n",
    "\n",
    "But first, I need to add the user means back to get the actual star ratings prediction."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "all_user_predicted_ratings = np.dot(np.dot(U, sigma), Vt) + user_ratings_mean.reshape(-1, 1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "With the predictions matrix for every user, I can build a function to recommend movies for any user. I return the list of movies the user has already rated, for the sake of comparison."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th>movie_id</th>\n",
       "      <th>1</th>\n",
       "      <th>2</th>\n",
       "      <th>3</th>\n",
       "      <th>4</th>\n",
       "      <th>5</th>\n",
       "      <th>6</th>\n",
       "      <th>7</th>\n",
       "      <th>8</th>\n",
       "      <th>9</th>\n",
       "      <th>10</th>\n",
       "      <th>...</th>\n",
       "      <th>3943</th>\n",
       "      <th>3944</th>\n",
       "      <th>3945</th>\n",
       "      <th>3946</th>\n",
       "      <th>3947</th>\n",
       "      <th>3948</th>\n",
       "      <th>3949</th>\n",
       "      <th>3950</th>\n",
       "      <th>3951</th>\n",
       "      <th>3952</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>4.288861</td>\n",
       "      <td>0.143055</td>\n",
       "      <td>-0.195080</td>\n",
       "      <td>-0.018843</td>\n",
       "      <td>0.012232</td>\n",
       "      <td>-0.176604</td>\n",
       "      <td>-0.074120</td>\n",
       "      <td>0.141358</td>\n",
       "      <td>-0.059553</td>\n",
       "      <td>-0.195950</td>\n",
       "      <td>...</td>\n",
       "      <td>0.027807</td>\n",
       "      <td>0.001640</td>\n",
       "      <td>0.026395</td>\n",
       "      <td>-0.022024</td>\n",
       "      <td>-0.085415</td>\n",
       "      <td>0.403529</td>\n",
       "      <td>0.105579</td>\n",
       "      <td>0.031912</td>\n",
       "      <td>0.050450</td>\n",
       "      <td>0.088910</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0.744716</td>\n",
       "      <td>0.169659</td>\n",
       "      <td>0.335418</td>\n",
       "      <td>0.000758</td>\n",
       "      <td>0.022475</td>\n",
       "      <td>1.353050</td>\n",
       "      <td>0.051426</td>\n",
       "      <td>0.071258</td>\n",
       "      <td>0.161601</td>\n",
       "      <td>1.567246</td>\n",
       "      <td>...</td>\n",
       "      <td>-0.056502</td>\n",
       "      <td>-0.013733</td>\n",
       "      <td>-0.010580</td>\n",
       "      <td>0.062576</td>\n",
       "      <td>-0.016248</td>\n",
       "      <td>0.155790</td>\n",
       "      <td>-0.418737</td>\n",
       "      <td>-0.101102</td>\n",
       "      <td>-0.054098</td>\n",
       "      <td>-0.140188</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1.818824</td>\n",
       "      <td>0.456136</td>\n",
       "      <td>0.090978</td>\n",
       "      <td>-0.043037</td>\n",
       "      <td>-0.025694</td>\n",
       "      <td>-0.158617</td>\n",
       "      <td>-0.131778</td>\n",
       "      <td>0.098977</td>\n",
       "      <td>0.030551</td>\n",
       "      <td>0.735470</td>\n",
       "      <td>...</td>\n",
       "      <td>0.040481</td>\n",
       "      <td>-0.005301</td>\n",
       "      <td>0.012832</td>\n",
       "      <td>0.029349</td>\n",
       "      <td>0.020866</td>\n",
       "      <td>0.121532</td>\n",
       "      <td>0.076205</td>\n",
       "      <td>0.012345</td>\n",
       "      <td>0.015148</td>\n",
       "      <td>-0.109956</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>0.408057</td>\n",
       "      <td>-0.072960</td>\n",
       "      <td>0.039642</td>\n",
       "      <td>0.089363</td>\n",
       "      <td>0.041950</td>\n",
       "      <td>0.237753</td>\n",
       "      <td>-0.049426</td>\n",
       "      <td>0.009467</td>\n",
       "      <td>0.045469</td>\n",
       "      <td>-0.111370</td>\n",
       "      <td>...</td>\n",
       "      <td>0.008571</td>\n",
       "      <td>-0.005425</td>\n",
       "      <td>-0.008500</td>\n",
       "      <td>-0.003417</td>\n",
       "      <td>-0.083982</td>\n",
       "      <td>0.094512</td>\n",
       "      <td>0.057557</td>\n",
       "      <td>-0.026050</td>\n",
       "      <td>0.014841</td>\n",
       "      <td>-0.034224</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>1.574272</td>\n",
       "      <td>0.021239</td>\n",
       "      <td>-0.051300</td>\n",
       "      <td>0.246884</td>\n",
       "      <td>-0.032406</td>\n",
       "      <td>1.552281</td>\n",
       "      <td>-0.199630</td>\n",
       "      <td>-0.014920</td>\n",
       "      <td>-0.060498</td>\n",
       "      <td>0.450512</td>\n",
       "      <td>...</td>\n",
       "      <td>0.110151</td>\n",
       "      <td>0.046010</td>\n",
       "      <td>0.006934</td>\n",
       "      <td>-0.015940</td>\n",
       "      <td>-0.050080</td>\n",
       "      <td>-0.052539</td>\n",
       "      <td>0.507189</td>\n",
       "      <td>0.033830</td>\n",
       "      <td>0.125706</td>\n",
       "      <td>0.199244</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 3706 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "movie_id      1         2         3         4         5         6     \\\n",
       "0         4.288861  0.143055 -0.195080 -0.018843  0.012232 -0.176604   \n",
       "1         0.744716  0.169659  0.335418  0.000758  0.022475  1.353050   \n",
       "2         1.818824  0.456136  0.090978 -0.043037 -0.025694 -0.158617   \n",
       "3         0.408057 -0.072960  0.039642  0.089363  0.041950  0.237753   \n",
       "4         1.574272  0.021239 -0.051300  0.246884 -0.032406  1.552281   \n",
       "\n",
       "movie_id      7         8         9         10      ...         3943  \\\n",
       "0        -0.074120  0.141358 -0.059553 -0.195950    ...     0.027807   \n",
       "1         0.051426  0.071258  0.161601  1.567246    ...    -0.056502   \n",
       "2        -0.131778  0.098977  0.030551  0.735470    ...     0.040481   \n",
       "3        -0.049426  0.009467  0.045469 -0.111370    ...     0.008571   \n",
       "4        -0.199630 -0.014920 -0.060498  0.450512    ...     0.110151   \n",
       "\n",
       "movie_id      3944      3945      3946      3947      3948      3949  \\\n",
       "0         0.001640  0.026395 -0.022024 -0.085415  0.403529  0.105579   \n",
       "1        -0.013733 -0.010580  0.062576 -0.016248  0.155790 -0.418737   \n",
       "2        -0.005301  0.012832  0.029349  0.020866  0.121532  0.076205   \n",
       "3        -0.005425 -0.008500 -0.003417 -0.083982  0.094512  0.057557   \n",
       "4         0.046010  0.006934 -0.015940 -0.050080 -0.052539  0.507189   \n",
       "\n",
       "movie_id      3950      3951      3952  \n",
       "0         0.031912  0.050450  0.088910  \n",
       "1        -0.101102 -0.054098 -0.140188  \n",
       "2         0.012345  0.015148 -0.109956  \n",
       "3        -0.026050  0.014841 -0.034224  \n",
       "4         0.033830  0.125706  0.199244  \n",
       "\n",
       "[5 rows x 3706 columns]"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "preds = pd.DataFrame(all_user_predicted_ratings, columns = Ratings.columns)\n",
    "preds.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now I write a function to return the movies with the highest predicted rating that the specified user hasn't already rated. Though I didn't use any explicit movie content features (such as genre or title), I'll merge in that information to get a more complete picture of the recommendations."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def recommend_movies(predictions, userID, movies, original_ratings, num_recommendations):\n",
    "    \n",
    "    # Get and sort the user's predictions\n",
    "    user_row_number = userID - 1 # User ID starts at 1, not 0\n",
    "    sorted_user_predictions = preds.iloc[user_row_number].sort_values(ascending=False) # User ID starts at 1\n",
    "    \n",
    "    # Get the user's data and merge in the movie information.\n",
    "    user_data = original_ratings[original_ratings.user_id == (userID)]\n",
    "    user_full = (user_data.merge(movies, how = 'left', left_on = 'movie_id', right_on = 'movie_id').\n",
    "                     sort_values(['rating'], ascending=False)\n",
    "                 )\n",
    "\n",
    "    print 'User {0} has already rated {1} movies.'.format(userID, user_full.shape[0])\n",
    "    print 'Recommending highest {0} predicted ratings movies not already rated.'.format(num_recommendations)\n",
    "    \n",
    "    # Recommend the highest predicted rating movies that the user hasn't seen yet.\n",
    "    recommendations = (movies[~movies['movie_id'].isin(user_full['movie_id'])].\n",
    "         merge(pd.DataFrame(sorted_user_predictions).reset_index(), how = 'left',\n",
    "               left_on = 'movie_id',\n",
    "               right_on = 'movie_id').\n",
    "         rename(columns = {user_row_number: 'Predictions'}).\n",
    "         sort_values('Predictions', ascending = False).\n",
    "                       iloc[:num_recommendations, :-1]\n",
    "                      )\n",
    "\n",
    "    return user_full, recommendations"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's try to recommend 20 movies for user with ID 1310."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "User 1310 has already rated 24 movies.\n",
      "Recommending highest 20 predicted ratings movies not already rated.\n"
     ]
    }
   ],
   "source": [
    "already_rated, predictions = recommend_movies(preds, 1310, movies, ratings, 20)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>user_id</th>\n",
       "      <th>movie_id</th>\n",
       "      <th>rating</th>\n",
       "      <th>timestamp</th>\n",
       "      <th>title</th>\n",
       "      <th>genres</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>1310</td>\n",
       "      <td>2248</td>\n",
       "      <td>5</td>\n",
       "      <td>974781573</td>\n",
       "      <td>Say Anything... (1989)</td>\n",
       "      <td>Comedy|Drama|Romance</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>1310</td>\n",
       "      <td>2620</td>\n",
       "      <td>5</td>\n",
       "      <td>974781573</td>\n",
       "      <td>This Is My Father (1998)</td>\n",
       "      <td>Drama|Romance</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>1310</td>\n",
       "      <td>3683</td>\n",
       "      <td>5</td>\n",
       "      <td>974781935</td>\n",
       "      <td>Blood Simple (1984)</td>\n",
       "      <td>Drama|Film-Noir</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>15</th>\n",
       "      <td>1310</td>\n",
       "      <td>1704</td>\n",
       "      <td>5</td>\n",
       "      <td>974781573</td>\n",
       "      <td>Good Will Hunting (1997)</td>\n",
       "      <td>Drama</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1310</td>\n",
       "      <td>1293</td>\n",
       "      <td>5</td>\n",
       "      <td>974781839</td>\n",
       "      <td>Gandhi (1982)</td>\n",
       "      <td>Drama</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12</th>\n",
       "      <td>1310</td>\n",
       "      <td>3101</td>\n",
       "      <td>4</td>\n",
       "      <td>974781573</td>\n",
       "      <td>Fatal Attraction (1987)</td>\n",
       "      <td>Thriller</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>1310</td>\n",
       "      <td>1343</td>\n",
       "      <td>4</td>\n",
       "      <td>974781534</td>\n",
       "      <td>Cape Fear (1991)</td>\n",
       "      <td>Thriller</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>20</th>\n",
       "      <td>1310</td>\n",
       "      <td>2000</td>\n",
       "      <td>4</td>\n",
       "      <td>974781892</td>\n",
       "      <td>Lethal Weapon (1987)</td>\n",
       "      <td>Action|Comedy|Crime|Drama</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>18</th>\n",
       "      <td>1310</td>\n",
       "      <td>3526</td>\n",
       "      <td>4</td>\n",
       "      <td>974781892</td>\n",
       "      <td>Parenthood (1989)</td>\n",
       "      <td>Comedy|Drama</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>17</th>\n",
       "      <td>1310</td>\n",
       "      <td>3360</td>\n",
       "      <td>4</td>\n",
       "      <td>974781935</td>\n",
       "      <td>Hoosiers (1986)</td>\n",
       "      <td>Drama</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>13</th>\n",
       "      <td>1310</td>\n",
       "      <td>3111</td>\n",
       "      <td>4</td>\n",
       "      <td>974782001</td>\n",
       "      <td>Places in the Heart (1984)</td>\n",
       "      <td>Drama</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>23</th>\n",
       "      <td>1310</td>\n",
       "      <td>1097</td>\n",
       "      <td>4</td>\n",
       "      <td>974781534</td>\n",
       "      <td>E.T. the Extra-Terrestrial (1982)</td>\n",
       "      <td>Children's|Drama|Fantasy|Sci-Fi</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>1310</td>\n",
       "      <td>1196</td>\n",
       "      <td>4</td>\n",
       "      <td>974781701</td>\n",
       "      <td>Star Wars: Episode V - The Empire Strikes Back...</td>\n",
       "      <td>Action|Adventure|Drama|Sci-Fi|War</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>1310</td>\n",
       "      <td>1185</td>\n",
       "      <td>4</td>\n",
       "      <td>974781839</td>\n",
       "      <td>My Left Foot (1989)</td>\n",
       "      <td>Drama</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>1310</td>\n",
       "      <td>3685</td>\n",
       "      <td>4</td>\n",
       "      <td>974781935</td>\n",
       "      <td>Prizzi's Honor (1985)</td>\n",
       "      <td>Comedy|Drama|Romance</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>1310</td>\n",
       "      <td>2243</td>\n",
       "      <td>4</td>\n",
       "      <td>974782001</td>\n",
       "      <td>Broadcast News (1987)</td>\n",
       "      <td>Comedy|Drama|Romance</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1310</td>\n",
       "      <td>1299</td>\n",
       "      <td>4</td>\n",
       "      <td>974781701</td>\n",
       "      <td>Killing Fields, The (1984)</td>\n",
       "      <td>Drama|War</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>16</th>\n",
       "      <td>1310</td>\n",
       "      <td>144</td>\n",
       "      <td>3</td>\n",
       "      <td>974781573</td>\n",
       "      <td>Brothers McMullen, The (1995)</td>\n",
       "      <td>Comedy</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>19</th>\n",
       "      <td>1310</td>\n",
       "      <td>1960</td>\n",
       "      <td>3</td>\n",
       "      <td>974782001</td>\n",
       "      <td>Last Emperor, The (1987)</td>\n",
       "      <td>Drama|War</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1310</td>\n",
       "      <td>2988</td>\n",
       "      <td>3</td>\n",
       "      <td>974781935</td>\n",
       "      <td>Melvin and Howard (1980)</td>\n",
       "      <td>Drama</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    user_id  movie_id  rating  timestamp  \\\n",
       "5      1310      2248       5  974781573   \n",
       "6      1310      2620       5  974781573   \n",
       "7      1310      3683       5  974781935   \n",
       "15     1310      1704       5  974781573   \n",
       "1      1310      1293       5  974781839   \n",
       "12     1310      3101       4  974781573   \n",
       "11     1310      1343       4  974781534   \n",
       "20     1310      2000       4  974781892   \n",
       "18     1310      3526       4  974781892   \n",
       "17     1310      3360       4  974781935   \n",
       "13     1310      3111       4  974782001   \n",
       "23     1310      1097       4  974781534   \n",
       "10     1310      1196       4  974781701   \n",
       "9      1310      1185       4  974781839   \n",
       "8      1310      3685       4  974781935   \n",
       "4      1310      2243       4  974782001   \n",
       "3      1310      1299       4  974781701   \n",
       "16     1310       144       3  974781573   \n",
       "19     1310      1960       3  974782001   \n",
       "0      1310      2988       3  974781935   \n",
       "\n",
       "                                                title  \\\n",
       "5                              Say Anything... (1989)   \n",
       "6                            This Is My Father (1998)   \n",
       "7                                 Blood Simple (1984)   \n",
       "15                           Good Will Hunting (1997)   \n",
       "1                                       Gandhi (1982)   \n",
       "12                            Fatal Attraction (1987)   \n",
       "11                                   Cape Fear (1991)   \n",
       "20                               Lethal Weapon (1987)   \n",
       "18                                  Parenthood (1989)   \n",
       "17                                    Hoosiers (1986)   \n",
       "13                         Places in the Heart (1984)   \n",
       "23                  E.T. the Extra-Terrestrial (1982)   \n",
       "10  Star Wars: Episode V - The Empire Strikes Back...   \n",
       "9                                 My Left Foot (1989)   \n",
       "8                               Prizzi's Honor (1985)   \n",
       "4                               Broadcast News (1987)   \n",
       "3                          Killing Fields, The (1984)   \n",
       "16                      Brothers McMullen, The (1995)   \n",
       "19                           Last Emperor, The (1987)   \n",
       "0                            Melvin and Howard (1980)   \n",
       "\n",
       "                               genres  \n",
       "5                Comedy|Drama|Romance  \n",
       "6                       Drama|Romance  \n",
       "7                     Drama|Film-Noir  \n",
       "15                              Drama  \n",
       "1                               Drama  \n",
       "12                           Thriller  \n",
       "11                           Thriller  \n",
       "20          Action|Comedy|Crime|Drama  \n",
       "18                       Comedy|Drama  \n",
       "17                              Drama  \n",
       "13                              Drama  \n",
       "23    Children's|Drama|Fantasy|Sci-Fi  \n",
       "10  Action|Adventure|Drama|Sci-Fi|War  \n",
       "9                               Drama  \n",
       "8                Comedy|Drama|Romance  \n",
       "4                Comedy|Drama|Romance  \n",
       "3                           Drama|War  \n",
       "16                             Comedy  \n",
       "19                          Drama|War  \n",
       "0                               Drama  "
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Top 20 movies that User 1310 has rated \n",
    "already_rated.head(20)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>movie_id</th>\n",
       "      <th>title</th>\n",
       "      <th>genres</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1618</th>\n",
       "      <td>1674</td>\n",
       "      <td>Witness (1985)</td>\n",
       "      <td>Drama|Romance|Thriller</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1880</th>\n",
       "      <td>1961</td>\n",
       "      <td>Rain Man (1988)</td>\n",
       "      <td>Drama</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1187</th>\n",
       "      <td>1210</td>\n",
       "      <td>Star Wars: Episode VI - Return of the Jedi (1983)</td>\n",
       "      <td>Action|Adventure|Romance|Sci-Fi|War</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1216</th>\n",
       "      <td>1242</td>\n",
       "      <td>Glory (1989)</td>\n",
       "      <td>Action|Drama|War</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1202</th>\n",
       "      <td>1225</td>\n",
       "      <td>Amadeus (1984)</td>\n",
       "      <td>Drama</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1273</th>\n",
       "      <td>1302</td>\n",
       "      <td>Field of Dreams (1989)</td>\n",
       "      <td>Drama</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1220</th>\n",
       "      <td>1246</td>\n",
       "      <td>Dead Poets Society (1989)</td>\n",
       "      <td>Drama</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1881</th>\n",
       "      <td>1962</td>\n",
       "      <td>Driving Miss Daisy (1989)</td>\n",
       "      <td>Drama</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1877</th>\n",
       "      <td>1957</td>\n",
       "      <td>Chariots of Fire (1981)</td>\n",
       "      <td>Drama</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1938</th>\n",
       "      <td>2020</td>\n",
       "      <td>Dangerous Liaisons (1988)</td>\n",
       "      <td>Drama|Romance</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1233</th>\n",
       "      <td>1259</td>\n",
       "      <td>Stand by Me (1986)</td>\n",
       "      <td>Adventure|Comedy|Drama</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3011</th>\n",
       "      <td>3098</td>\n",
       "      <td>Natural, The (1984)</td>\n",
       "      <td>Drama</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2112</th>\n",
       "      <td>2194</td>\n",
       "      <td>Untouchables, The (1987)</td>\n",
       "      <td>Action|Crime|Drama</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1876</th>\n",
       "      <td>1956</td>\n",
       "      <td>Ordinary People (1980)</td>\n",
       "      <td>Drama</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1268</th>\n",
       "      <td>1296</td>\n",
       "      <td>Room with a View, A (1986)</td>\n",
       "      <td>Drama|Romance</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2267</th>\n",
       "      <td>2352</td>\n",
       "      <td>Big Chill, The (1983)</td>\n",
       "      <td>Comedy|Drama</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1278</th>\n",
       "      <td>1307</td>\n",
       "      <td>When Harry Met Sally... (1989)</td>\n",
       "      <td>Comedy|Romance</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1165</th>\n",
       "      <td>1186</td>\n",
       "      <td>Sex, Lies, and Videotape (1989)</td>\n",
       "      <td>Drama</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1199</th>\n",
       "      <td>1222</td>\n",
       "      <td>Full Metal Jacket (1987)</td>\n",
       "      <td>Action|Drama|War</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2833</th>\n",
       "      <td>2919</td>\n",
       "      <td>Year of Living Dangerously (1982)</td>\n",
       "      <td>Drama|Romance</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      movie_id                                              title  \\\n",
       "1618      1674                                     Witness (1985)   \n",
       "1880      1961                                    Rain Man (1988)   \n",
       "1187      1210  Star Wars: Episode VI - Return of the Jedi (1983)   \n",
       "1216      1242                                       Glory (1989)   \n",
       "1202      1225                                     Amadeus (1984)   \n",
       "1273      1302                             Field of Dreams (1989)   \n",
       "1220      1246                          Dead Poets Society (1989)   \n",
       "1881      1962                          Driving Miss Daisy (1989)   \n",
       "1877      1957                            Chariots of Fire (1981)   \n",
       "1938      2020                          Dangerous Liaisons (1988)   \n",
       "1233      1259                                 Stand by Me (1986)   \n",
       "3011      3098                                Natural, The (1984)   \n",
       "2112      2194                           Untouchables, The (1987)   \n",
       "1876      1956                             Ordinary People (1980)   \n",
       "1268      1296                         Room with a View, A (1986)   \n",
       "2267      2352                              Big Chill, The (1983)   \n",
       "1278      1307                     When Harry Met Sally... (1989)   \n",
       "1165      1186                    Sex, Lies, and Videotape (1989)   \n",
       "1199      1222                           Full Metal Jacket (1987)   \n",
       "2833      2919                  Year of Living Dangerously (1982)   \n",
       "\n",
       "                                   genres  \n",
       "1618               Drama|Romance|Thriller  \n",
       "1880                                Drama  \n",
       "1187  Action|Adventure|Romance|Sci-Fi|War  \n",
       "1216                     Action|Drama|War  \n",
       "1202                                Drama  \n",
       "1273                                Drama  \n",
       "1220                                Drama  \n",
       "1881                                Drama  \n",
       "1877                                Drama  \n",
       "1938                        Drama|Romance  \n",
       "1233               Adventure|Comedy|Drama  \n",
       "3011                                Drama  \n",
       "2112                   Action|Crime|Drama  \n",
       "1876                                Drama  \n",
       "1268                        Drama|Romance  \n",
       "2267                         Comedy|Drama  \n",
       "1278                       Comedy|Romance  \n",
       "1165                                Drama  \n",
       "1199                     Action|Drama|War  \n",
       "2833                        Drama|Romance  "
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Top 20 movies that User 1310 hopefully will enjoy\n",
    "predictions"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "These look like pretty good recommendations. It's good to see that, although I didn't actually use the genre of the movie as a feature, the truncated matrix factorization features \"picked up\" on the underlying tastes and preferences of the user. I've recommended some comedy, drama, and romance movies - all of which were genres of some of this user's top rated movies."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Model Evaluation\n",
    "Can't forget to evaluate our model, can we?\n",
    "\n",
    "Instead of doing manually like the last time, I will use the *[Surprise](https://pypi.python.org/pypi/scikit-surprise)* library that provided various ready-to-use powerful prediction algorithms including (SVD) to evaluate its RMSE (Root Mean Squared Error) on the MovieLens dataset. It is a Python scikit building and analyzing recommender systems."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Import libraries from Surprise package\n",
    "from surprise import Reader, Dataset, SVD, evaluate\n",
    "\n",
    "# Load Reader library\n",
    "reader = Reader()\n",
    "\n",
    "# Load ratings dataset with Dataset library\n",
    "data = Dataset.load_from_df(ratings[['user_id', 'movie_id', 'rating']], reader)\n",
    "\n",
    "# Split the dataset for 5-fold evaluation\n",
    "data.split(n_folds=5)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/Users/khanhnamle/anaconda2/lib/python2.7/site-packages/surprise/evaluate.py:66: UserWarning: The evaluate() method is deprecated. Please use model_selection.cross_validate() instead.\n",
      "  'model_selection.cross_validate() instead.', UserWarning)\n",
      "/Users/khanhnamle/anaconda2/lib/python2.7/site-packages/surprise/dataset.py:193: UserWarning: Using data.split() or using load_from_folds() without using a CV iterator is now deprecated. \n",
      "  UserWarning)\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Evaluating RMSE of algorithm SVD.\n",
      "\n",
      "------------\n",
      "Fold 1\n",
      "RMSE: 0.8731\n",
      "------------\n",
      "Fold 2\n",
      "RMSE: 0.8743\n",
      "------------\n",
      "Fold 3\n",
      "RMSE: 0.8742\n",
      "------------\n",
      "Fold 4\n",
      "RMSE: 0.8724\n",
      "------------\n",
      "Fold 5\n",
      "RMSE: 0.8737\n",
      "------------\n",
      "------------\n",
      "Mean RMSE: 0.8736\n",
      "------------\n",
      "------------\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "CaseInsensitiveDefaultDict(list,\n",
       "                           {'rmse': [0.873147136065789,\n",
       "                             0.8742574033899374,\n",
       "                             0.8742215891785942,\n",
       "                             0.8724048167991113,\n",
       "                             0.8737344531771998]})"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Use the SVD algorithm.\n",
    "svd = SVD()\n",
    "\n",
    "# Compute the RMSE of the SVD algorithm.\n",
    "evaluate(svd, data, measures=['RMSE'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "I get a mean *Root Mean Square Error* of 0.8736 which is pretty good. Let's now train on the dataset and arrive at predictions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/Users/khanhnamle/anaconda2/lib/python2.7/site-packages/surprise/prediction_algorithms/algo_base.py:51: UserWarning: train() is deprecated. Use fit() instead\n",
      "  warnings.warn('train() is deprecated. Use fit() instead', UserWarning)\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "<surprise.prediction_algorithms.matrix_factorization.SVD at 0x10df4b1d0>"
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "trainset = data.build_full_trainset()\n",
    "svd.train(trainset)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "I'll pick again user with ID 1310 and check the ratings he has given."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>user_id</th>\n",
       "      <th>movie_id</th>\n",
       "      <th>rating</th>\n",
       "      <th>timestamp</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>215928</th>\n",
       "      <td>1310</td>\n",
       "      <td>2988</td>\n",
       "      <td>3</td>\n",
       "      <td>974781935</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>215929</th>\n",
       "      <td>1310</td>\n",
       "      <td>1293</td>\n",
       "      <td>5</td>\n",
       "      <td>974781839</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>215930</th>\n",
       "      <td>1310</td>\n",
       "      <td>1295</td>\n",
       "      <td>2</td>\n",
       "      <td>974782001</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>215931</th>\n",
       "      <td>1310</td>\n",
       "      <td>1299</td>\n",
       "      <td>4</td>\n",
       "      <td>974781701</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>215932</th>\n",
       "      <td>1310</td>\n",
       "      <td>2243</td>\n",
       "      <td>4</td>\n",
       "      <td>974782001</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>215933</th>\n",
       "      <td>1310</td>\n",
       "      <td>2248</td>\n",
       "      <td>5</td>\n",
       "      <td>974781573</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>215934</th>\n",
       "      <td>1310</td>\n",
       "      <td>2620</td>\n",
       "      <td>5</td>\n",
       "      <td>974781573</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>215935</th>\n",
       "      <td>1310</td>\n",
       "      <td>3683</td>\n",
       "      <td>5</td>\n",
       "      <td>974781935</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>215936</th>\n",
       "      <td>1310</td>\n",
       "      <td>3685</td>\n",
       "      <td>4</td>\n",
       "      <td>974781935</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>215937</th>\n",
       "      <td>1310</td>\n",
       "      <td>1185</td>\n",
       "      <td>4</td>\n",
       "      <td>974781839</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>215938</th>\n",
       "      <td>1310</td>\n",
       "      <td>1196</td>\n",
       "      <td>4</td>\n",
       "      <td>974781701</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>215939</th>\n",
       "      <td>1310</td>\n",
       "      <td>1343</td>\n",
       "      <td>4</td>\n",
       "      <td>974781534</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>215940</th>\n",
       "      <td>1310</td>\n",
       "      <td>3101</td>\n",
       "      <td>4</td>\n",
       "      <td>974781573</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>215941</th>\n",
       "      <td>1310</td>\n",
       "      <td>3111</td>\n",
       "      <td>4</td>\n",
       "      <td>974782001</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>215942</th>\n",
       "      <td>1310</td>\n",
       "      <td>2313</td>\n",
       "      <td>2</td>\n",
       "      <td>974781839</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>215943</th>\n",
       "      <td>1310</td>\n",
       "      <td>1704</td>\n",
       "      <td>5</td>\n",
       "      <td>974781573</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>215944</th>\n",
       "      <td>1310</td>\n",
       "      <td>144</td>\n",
       "      <td>3</td>\n",
       "      <td>974781573</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>215945</th>\n",
       "      <td>1310</td>\n",
       "      <td>3360</td>\n",
       "      <td>4</td>\n",
       "      <td>974781935</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>215946</th>\n",
       "      <td>1310</td>\n",
       "      <td>3526</td>\n",
       "      <td>4</td>\n",
       "      <td>974781892</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>215947</th>\n",
       "      <td>1310</td>\n",
       "      <td>1960</td>\n",
       "      <td>3</td>\n",
       "      <td>974782001</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>215948</th>\n",
       "      <td>1310</td>\n",
       "      <td>2000</td>\n",
       "      <td>4</td>\n",
       "      <td>974781892</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>215949</th>\n",
       "      <td>1310</td>\n",
       "      <td>1231</td>\n",
       "      <td>2</td>\n",
       "      <td>974781963</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>215950</th>\n",
       "      <td>1310</td>\n",
       "      <td>1090</td>\n",
       "      <td>2</td>\n",
       "      <td>974781839</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>215951</th>\n",
       "      <td>1310</td>\n",
       "      <td>1097</td>\n",
       "      <td>4</td>\n",
       "      <td>974781534</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "        user_id  movie_id  rating  timestamp\n",
       "215928     1310      2988       3  974781935\n",
       "215929     1310      1293       5  974781839\n",
       "215930     1310      1295       2  974782001\n",
       "215931     1310      1299       4  974781701\n",
       "215932     1310      2243       4  974782001\n",
       "215933     1310      2248       5  974781573\n",
       "215934     1310      2620       5  974781573\n",
       "215935     1310      3683       5  974781935\n",
       "215936     1310      3685       4  974781935\n",
       "215937     1310      1185       4  974781839\n",
       "215938     1310      1196       4  974781701\n",
       "215939     1310      1343       4  974781534\n",
       "215940     1310      3101       4  974781573\n",
       "215941     1310      3111       4  974782001\n",
       "215942     1310      2313       2  974781839\n",
       "215943     1310      1704       5  974781573\n",
       "215944     1310       144       3  974781573\n",
       "215945     1310      3360       4  974781935\n",
       "215946     1310      3526       4  974781892\n",
       "215947     1310      1960       3  974782001\n",
       "215948     1310      2000       4  974781892\n",
       "215949     1310      1231       2  974781963\n",
       "215950     1310      1090       2  974781839\n",
       "215951     1310      1097       4  974781534"
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ratings[ratings['user_id'] == 1310]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now let's use SVD to predict the rating that User with ID 1310 will give to a random movie (let's say with Movie ID 1994)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Prediction(uid=1310, iid=1994, r_ui=None, est=3.348657218865324, details={u'was_impossible': False})"
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "svd.predict(1310, 1994)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For movie with ID 1994, I get an estimated prediction of 3.349. The recommender system works purely on the basis of an assigned movie ID and tries to predict ratings based on how the other users have predicted the movie.\n",
    "\n",
    "## Conclusion\n",
    "In this notebook, I attempted to build a model-based Collaborative Filtering movie recommendation sytem based on latent features from a low rank matrix factorization method called SVD. As it captures the underlying features driving the raw data, it can scale significantly better to massive datasets as well as make better recommendations based on user's tastes.\n",
    "\n",
    "However, we still likely lose some meaningful signals by using a low-rank approximation. Specifically, there's an interpretability problem as a singular vector specifies a linear combination of all input columns or rows. There's also a lack of sparsity when the singular vectors are quite dense. Thus, SVD approach is limited to linear projections."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.14"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}