{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Deep Learning for Movie Recommendation\n",
    "In this notebook, I'll build a deep learning model movie recommendations system on the MovieLens 1M dataset.\n",
    "\n",
    "This will be my 3rd attempt doing this. \n",
    "* In the [1st attempt](https://github.com/khanhnamle1994/movielens/blob/master/Content_Based_and_Collaborative_Filtering_Models.ipynb), I tried out content-based and memory-based collaboratice filtering, which rely on the calculation of users and movies' similarity scores. As no training or optimization is involved, these are easy to use approaches. But their performance decrease when we have sparse data which hinders scalability of these approaches for most of the real-world problems.\n",
    "* In the [2nd attempt](https://github.com/khanhnamle1994/movielens/blob/master/SVD_Model.ipynb), I tried out a matrix factorization model-based collaborative filtering approach called Singular Vector Decomposition, which reduces the dimension of the dataset and gives low-rank approximation of user tastes and preferences.\n",
    "\n",
    "In this post, I will use a Deep Learning / Neural Network approach that is up and coming with recent development in machine learning and AI technologies.\n",
    "\n",
    "![movixai](images/movixai.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Loading Datasets\n",
    "Similar to what I did for the previous notebooks, I loaded the 3 datasets into 3 dataframes: *ratings*, *users*, and *movies*. Additionally, to make it easy to use series from the *ratings* dataframe as training inputs and output to the Keras model, I set *max_userid* as the max value of user_id in the ratings and *max_movieid* as the max value of movie_id in the ratings."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Import libraries\n",
    "%matplotlib inline\n",
    "import math\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "# Reading ratings file\n",
    "ratings = pd.read_csv('ratings.csv', sep='\\t', encoding='latin-1', \n",
    "                      usecols=['user_id', 'movie_id', 'user_emb_id', 'movie_emb_id', 'rating'])\n",
    "max_userid = ratings['user_id'].drop_duplicates().max()\n",
    "max_movieid = ratings['movie_id'].drop_duplicates().max()\n",
    "\n",
    "# Reading ratings file\n",
    "users = pd.read_csv('users.csv', sep='\\t', encoding='latin-1', \n",
    "                    usecols=['user_id', 'gender', 'zipcode', 'age_desc', 'occ_desc'])\n",
    "\n",
    "# Reading ratings file\n",
    "movies = pd.read_csv('movies.csv', sep='\\t', encoding='latin-1', \n",
    "                     usecols=['movie_id', 'title', 'genres'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Matrix Factorization for Collaborative Filtering\n",
    "I have discussed this extensively in my 2nd notebook, but want to revisit it here as the baseline approach. The idea behind matrix factorization models is that attitudes or preferences of a user can be determined by a small number of hidden factors. We can call these factors as **Embeddings**.\n",
    "\n",
    "Intuitively, we can understand embeddings as low dimensional hidden factors for movies and users. For e.g. say we have 3 dimensional embeddings for both movies and users.\n",
    "\n",
    "For instance, for movie A, the 3 numbers in the movie embedding matrix represent 3 different characteristics about the movie, such as:\n",
    "* How recent is the movie A?\n",
    "* How much special effects are in movie A?\n",
    "* How CGI-driven is movie A? \n",
    "\n",
    "For user B, the 3 numbers in the user embedding matrix represent:\n",
    "* How much does user B like Drama movie?\n",
    "* How likely does user B to give a 5-star rating?\n",
    "* How often does user B watch movies?\n",
    "\n",
    "![matrix-factorization](images/matrix-factorization.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "I definitely would want to create a training and validation set and optimize the number of embeddings by minimizing the RMSE. Intuitively, the RMSE will decrease on the training set as the number of embeddings increases (because I'm approximating the original ratings matrix with a higher rank matrix). Here I create a training set by shuffling randomly the values from the original ratings dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Users: [1284 1682 2367 ... 2937 2076 4665] , shape = (1000209,)\n",
      "Movies: [1093 3071   44 ... 1135 1731 1953] , shape = (1000209,)\n",
      "Ratings: [4 5 4 ... 5 5 1] , shape = (1000209,)\n"
     ]
    }
   ],
   "source": [
    "# Create training set\n",
    "shuffled_ratings = ratings.sample(frac=1., random_state=RNG_SEED)\n",
    "\n",
    "# Shuffling users\n",
    "Users = shuffled_ratings['user_emb_id'].values\n",
    "print 'Users:', Users, ', shape =', Users.shape\n",
    "\n",
    "# Shuffling movies\n",
    "Movies = shuffled_ratings['movie_emb_id'].values\n",
    "print 'Movies:', Movies, ', shape =', Movies.shape\n",
    "\n",
    "# Shuffling ratings\n",
    "Ratings = shuffled_ratings['rating'].values\n",
    "print 'Ratings:', Ratings, ', shape =', Ratings.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Deep Learning Model\n",
    "The idea of using deep learning is similar to that of Model-Based Matrix Factorization. In matrix factorizaion, we decompose our original sparse matrix into product of 2 low rank orthogonal matrices. For deep learning implementation, we don’t need them to be orthogonal, we want our model to learn the values of embedding matrix itself. The user latent features and movie latent features are looked up from the embedding matrices for specific movie-user combination. These are the input values for further linear and non-linear layers. We can pass this input to multiple relu, linear or sigmoid layers and learn the corresponding weights by any optimization algorithm (Adam, SGD, etc.).\n",
    "\n",
    "### Build the Model\n",
    "I created a sparse matrix factoring algorithm in Keras for my model in [CFModel.py](https://github.com/khanhnamle1994/movielens/blob/master/CFModel.py). Here are the main components:\n",
    "\n",
    "* A left embedding layer that creates a Users by Latent Factors matrix.\n",
    "* A right embedding layer that creates a Movies by Latent Factors matrix.\n",
    "* When the input to these layers are (i) a user id and (ii) a movie id, they'll return the latent factor vectors for the user and the movie, respectively.\n",
    "* A merge layer that takes the dot product of these two latent vectors to return the predicted rating.\n",
    "\n",
    "This code is based on the approach outlined in [Alkahest](http://www.fenris.org/)'s blog post [Collaborative Filtering in Keras](http://www.fenris.org/2016/03/07/index-html).\n",
    "\n",
    "![embedding-layers](images/embedding-layers.png)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/Users/khanhnamle/anaconda2/lib/python2.7/site-packages/h5py/__init__.py:34: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.\n",
      "  from ._conv import register_converters as _register_converters\n",
      "Using TensorFlow backend.\n"
     ]
    }
   ],
   "source": [
    "# Import Keras libraries\n",
    "from keras.callbacks import Callback, EarlyStopping, ModelCheckpoint\n",
    "# Import CF Model Architecture\n",
    "from CFModel import CFModel"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Define constants\n",
    "K_FACTORS = 100 # The number of dimensional embeddings for movies and users\n",
    "TEST_USER = 2000 # A random test user (user_id = 2000)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "I then compile the model using Mean Squared Error (MSE) as the loss function and the AdaMax learning algorithm."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "CFModel.py:28: UserWarning: The `Merge` layer is deprecated and will be removed after 08/2017. Use instead layers from `keras.layers.merge`, e.g. `add`, `concatenate`, etc.\n",
      "  self.add(Merge([P, Q], mode='dot', dot_axes=1))\n"
     ]
    }
   ],
   "source": [
    "# Define model\n",
    "model = CFModel(max_userid, max_movieid, K_FACTORS)\n",
    "# Compile the model using MSE as the loss function and the AdaMax learning algorithm\n",
    "model.compile(loss='mse', optimizer='adamax')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Train the Model\n",
    "Now I need to train the model. This step will be the most-time consuming one. In my particular case, for our dataset with nearly 1 million ratings, almost 6,000 users and 4,000 movies, I trained the model in roughly 6 minutes per epoch (30 epochs ~ 3 hours) inside my Macbook Laptop CPU. I spitted the training and validataion data with ratio of 90/10."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/Users/khanhnamle/anaconda2/lib/python2.7/site-packages/keras/models.py:944: UserWarning: The `nb_epoch` argument in `fit` has been renamed `epochs`.\n",
      "  warnings.warn('The `nb_epoch` argument in `fit` '\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Train on 900188 samples, validate on 100021 samples\n",
      "Epoch 1/30\n",
      " - 387s - loss: 8.2727 - val_loss: 2.2829\n",
      "Epoch 2/30\n",
      " - 316s - loss: 1.4957 - val_loss: 1.1248\n",
      "Epoch 3/30\n",
      " - 319s - loss: 1.0059 - val_loss: 0.9370\n",
      "Epoch 4/30\n",
      " - 317s - loss: 0.8957 - val_loss: 0.8764\n",
      "Epoch 5/30\n",
      " - 316s - loss: 0.8495 - val_loss: 0.8461\n",
      "Epoch 6/30\n",
      " - 355s - loss: 0.8182 - val_loss: 0.8228\n",
      "Epoch 7/30\n",
      " - 395s - loss: 0.7921 - val_loss: 0.8045\n",
      "Epoch 8/30\n",
      " - 395s - loss: 0.7695 - val_loss: 0.7921\n",
      "Epoch 9/30\n",
      " - 394s - loss: 0.7477 - val_loss: 0.7807\n",
      "Epoch 10/30\n",
      " - 406s - loss: 0.7269 - val_loss: 0.7700\n",
      "Epoch 11/30\n",
      " - 371s - loss: 0.7060 - val_loss: 0.7614\n",
      "Epoch 12/30\n",
      " - 332s - loss: 0.6849 - val_loss: 0.7543\n",
      "Epoch 13/30\n",
      " - 319s - loss: 0.6639 - val_loss: 0.7483\n",
      "Epoch 14/30\n",
      " - 340s - loss: 0.6428 - val_loss: 0.7458\n",
      "Epoch 15/30\n",
      " - 358s - loss: 0.6218 - val_loss: 0.7428\n",
      "Epoch 16/30\n",
      " - 315s - loss: 0.6009 - val_loss: 0.7433\n",
      "Epoch 17/30\n",
      " - 314s - loss: 0.5801 - val_loss: 0.7424\n",
      "Epoch 18/30\n",
      " - 314s - loss: 0.5596 - val_loss: 0.7458\n",
      "Epoch 19/30\n",
      " - 313s - loss: 0.5396 - val_loss: 0.7481\n"
     ]
    }
   ],
   "source": [
    "# Callbacks monitor the validation loss\n",
    "# Save the model weights each time the validation loss has improved\n",
    "callbacks = [EarlyStopping('val_loss', patience=2), \n",
    "             ModelCheckpoint('weights.h5', save_best_only=True)]\n",
    "\n",
    "# Use 30 epochs, 90% training data, 10% validation data \n",
    "history = model.fit([Users, Movies], Ratings, nb_epoch=30, validation_split=.1, verbose=2, callbacks=callbacks)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Root Mean Square Error\n",
    "During the training process above, I saved the model weights each time the validation loss has improved. Thus, I can use that value to calculate the best validation Root Mean Square Error."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Minimum RMSE at epoch 17 = 0.8616\n"
     ]
    }
   ],
   "source": [
    "# Show the best validation RMSE\n",
    "min_val_loss, idx = min((val, idx) for (idx, val) in enumerate(history.history['val_loss']))\n",
    "print 'Minimum RMSE at epoch', '{:d}'.format(idx+1), '=', '{:.4f}'.format(math.sqrt(min_val_loss))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The best validation loss is *0.7424* at epoch 17. Taking the square root of that number, I got the RMSE value of *0.8616*, which is better than the RMSE from the SVD Model (*0.8736*)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Predict the Ratings\n",
    "The next step is to actually predict the ratings a random user will give to a random movie. Below I apply the freshly trained deep learning model for all the users and all the movies, using 100 dimensional embeddings for each of them. I also load pre-trained weights from *[weights.h5](https://github.com/khanhnamle1994/movielens/blob/master/weights.h5)* for the model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Use the pre-trained model\n",
    "trained_model = CFModel(max_userid, max_movieid, K_FACTORS)\n",
    "# Load weights\n",
    "trained_model.load_weights('weights.h5')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As mentioned above, my random test user is has ID 2000."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>user_id</th>\n",
       "      <th>gender</th>\n",
       "      <th>zipcode</th>\n",
       "      <th>age_desc</th>\n",
       "      <th>occ_desc</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1999</th>\n",
       "      <td>2000</td>\n",
       "      <td>M</td>\n",
       "      <td>44685</td>\n",
       "      <td>18-24</td>\n",
       "      <td>college/grad student</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      user_id gender zipcode age_desc              occ_desc\n",
       "1999     2000      M   44685    18-24  college/grad student"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Pick a random test user\n",
    "users[users['user_id'] == TEST_USER]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here I define the function to predict user's rating of unrated items, using the *rate* function inside the CFModel class in *[CFModel.py](https://github.com/khanhnamle1994/movielens/blob/master/CFModel.py)*."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Function to predict the ratings given User ID and Movie ID\n",
    "def predict_rating(user_id, movie_id):\n",
    "    return trained_model.rate(user_id - 1, movie_id - 1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here I show the top 20 movies that user 2000 has already rated, including the *predictions* column showing the values that used 2000 would have rated based on the newly defined *predict_rating* function."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>user_id</th>\n",
       "      <th>movie_id</th>\n",
       "      <th>rating</th>\n",
       "      <th>prediction</th>\n",
       "      <th>title</th>\n",
       "      <th>genres</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>2000</td>\n",
       "      <td>1639</td>\n",
       "      <td>5</td>\n",
       "      <td>3.724665</td>\n",
       "      <td>Chasing Amy (1997)</td>\n",
       "      <td>Drama|Romance</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2000</td>\n",
       "      <td>2529</td>\n",
       "      <td>5</td>\n",
       "      <td>3.803218</td>\n",
       "      <td>Planet of the Apes (1968)</td>\n",
       "      <td>Action|Sci-Fi</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2000</td>\n",
       "      <td>1136</td>\n",
       "      <td>5</td>\n",
       "      <td>4.495121</td>\n",
       "      <td>Monty Python and the Holy Grail (1974)</td>\n",
       "      <td>Comedy</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>2000</td>\n",
       "      <td>2321</td>\n",
       "      <td>5</td>\n",
       "      <td>4.010493</td>\n",
       "      <td>Pleasantville (1998)</td>\n",
       "      <td>Comedy</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>2000</td>\n",
       "      <td>2858</td>\n",
       "      <td>5</td>\n",
       "      <td>4.253924</td>\n",
       "      <td>American Beauty (1999)</td>\n",
       "      <td>Comedy|Drama</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>2000</td>\n",
       "      <td>2501</td>\n",
       "      <td>5</td>\n",
       "      <td>4.206387</td>\n",
       "      <td>October Sky (1999)</td>\n",
       "      <td>Drama</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>2000</td>\n",
       "      <td>2804</td>\n",
       "      <td>5</td>\n",
       "      <td>4.353670</td>\n",
       "      <td>Christmas Story, A (1983)</td>\n",
       "      <td>Comedy|Drama</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>2000</td>\n",
       "      <td>1688</td>\n",
       "      <td>5</td>\n",
       "      <td>3.710508</td>\n",
       "      <td>Anastasia (1997)</td>\n",
       "      <td>Animation|Children's|Musical</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>2000</td>\n",
       "      <td>1653</td>\n",
       "      <td>5</td>\n",
       "      <td>4.089375</td>\n",
       "      <td>Gattaca (1997)</td>\n",
       "      <td>Drama|Sci-Fi|Thriller</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>2000</td>\n",
       "      <td>527</td>\n",
       "      <td>5</td>\n",
       "      <td>5.046471</td>\n",
       "      <td>Schindler's List (1993)</td>\n",
       "      <td>Drama|War</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>2000</td>\n",
       "      <td>1619</td>\n",
       "      <td>5</td>\n",
       "      <td>3.688177</td>\n",
       "      <td>Seven Years in Tibet (1997)</td>\n",
       "      <td>Drama|War</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>2000</td>\n",
       "      <td>110</td>\n",
       "      <td>5</td>\n",
       "      <td>3.966857</td>\n",
       "      <td>Braveheart (1995)</td>\n",
       "      <td>Action|Drama|War</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12</th>\n",
       "      <td>2000</td>\n",
       "      <td>1193</td>\n",
       "      <td>5</td>\n",
       "      <td>4.718451</td>\n",
       "      <td>One Flew Over the Cuckoo's Nest (1975)</td>\n",
       "      <td>Drama</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>13</th>\n",
       "      <td>2000</td>\n",
       "      <td>318</td>\n",
       "      <td>5</td>\n",
       "      <td>4.743262</td>\n",
       "      <td>Shawshank Redemption, The (1994)</td>\n",
       "      <td>Drama</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>14</th>\n",
       "      <td>2000</td>\n",
       "      <td>1748</td>\n",
       "      <td>5</td>\n",
       "      <td>3.779190</td>\n",
       "      <td>Dark City (1998)</td>\n",
       "      <td>Film-Noir|Sci-Fi|Thriller</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>15</th>\n",
       "      <td>2000</td>\n",
       "      <td>1923</td>\n",
       "      <td>5</td>\n",
       "      <td>3.264066</td>\n",
       "      <td>There's Something About Mary (1998)</td>\n",
       "      <td>Comedy</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>16</th>\n",
       "      <td>2000</td>\n",
       "      <td>1259</td>\n",
       "      <td>5</td>\n",
       "      <td>4.543493</td>\n",
       "      <td>Stand by Me (1986)</td>\n",
       "      <td>Adventure|Comedy|Drama</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>17</th>\n",
       "      <td>2000</td>\n",
       "      <td>595</td>\n",
       "      <td>5</td>\n",
       "      <td>4.334908</td>\n",
       "      <td>Beauty and the Beast (1991)</td>\n",
       "      <td>Animation|Children's|Musical</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>18</th>\n",
       "      <td>2000</td>\n",
       "      <td>2028</td>\n",
       "      <td>5</td>\n",
       "      <td>4.192429</td>\n",
       "      <td>Saving Private Ryan (1998)</td>\n",
       "      <td>Action|Drama|War</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>19</th>\n",
       "      <td>2000</td>\n",
       "      <td>1907</td>\n",
       "      <td>5</td>\n",
       "      <td>4.027490</td>\n",
       "      <td>Mulan (1998)</td>\n",
       "      <td>Animation|Children's</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    user_id  movie_id  rating  prediction  \\\n",
       "0      2000      1639       5    3.724665   \n",
       "1      2000      2529       5    3.803218   \n",
       "2      2000      1136       5    4.495121   \n",
       "3      2000      2321       5    4.010493   \n",
       "4      2000      2858       5    4.253924   \n",
       "5      2000      2501       5    4.206387   \n",
       "6      2000      2804       5    4.353670   \n",
       "7      2000      1688       5    3.710508   \n",
       "8      2000      1653       5    4.089375   \n",
       "9      2000       527       5    5.046471   \n",
       "10     2000      1619       5    3.688177   \n",
       "11     2000       110       5    3.966857   \n",
       "12     2000      1193       5    4.718451   \n",
       "13     2000       318       5    4.743262   \n",
       "14     2000      1748       5    3.779190   \n",
       "15     2000      1923       5    3.264066   \n",
       "16     2000      1259       5    4.543493   \n",
       "17     2000       595       5    4.334908   \n",
       "18     2000      2028       5    4.192429   \n",
       "19     2000      1907       5    4.027490   \n",
       "\n",
       "                                     title                        genres  \n",
       "0                       Chasing Amy (1997)                 Drama|Romance  \n",
       "1                Planet of the Apes (1968)                 Action|Sci-Fi  \n",
       "2   Monty Python and the Holy Grail (1974)                        Comedy  \n",
       "3                     Pleasantville (1998)                        Comedy  \n",
       "4                   American Beauty (1999)                  Comedy|Drama  \n",
       "5                       October Sky (1999)                         Drama  \n",
       "6                Christmas Story, A (1983)                  Comedy|Drama  \n",
       "7                         Anastasia (1997)  Animation|Children's|Musical  \n",
       "8                           Gattaca (1997)         Drama|Sci-Fi|Thriller  \n",
       "9                  Schindler's List (1993)                     Drama|War  \n",
       "10             Seven Years in Tibet (1997)                     Drama|War  \n",
       "11                       Braveheart (1995)              Action|Drama|War  \n",
       "12  One Flew Over the Cuckoo's Nest (1975)                         Drama  \n",
       "13        Shawshank Redemption, The (1994)                         Drama  \n",
       "14                        Dark City (1998)     Film-Noir|Sci-Fi|Thriller  \n",
       "15     There's Something About Mary (1998)                        Comedy  \n",
       "16                      Stand by Me (1986)        Adventure|Comedy|Drama  \n",
       "17             Beauty and the Beast (1991)  Animation|Children's|Musical  \n",
       "18              Saving Private Ryan (1998)              Action|Drama|War  \n",
       "19                            Mulan (1998)          Animation|Children's  "
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "user_ratings = ratings[ratings['user_id'] == TEST_USER][['user_id', 'movie_id', 'rating']]\n",
    "user_ratings['prediction'] = user_ratings.apply(lambda x: predict_rating(TEST_USER, x['movie_id']), axis=1)\n",
    "user_ratings.sort_values(by='rating', \n",
    "                         ascending=False).merge(movies, \n",
    "                                                on='movie_id', \n",
    "                                                how='inner', \n",
    "                                                suffixes=['_u', '_m']).head(20)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "No surpise that these top movies all have 5-start rating. Some of the prediction values seem off (those with value 3.7, 3.8, 3.9 etc.)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Recommend Movies\n",
    "Here I make a recommendation list of unrated 20 movies sorted by prediction value for user 2000. Let's see it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>movie_id</th>\n",
       "      <th>prediction</th>\n",
       "      <th>title</th>\n",
       "      <th>genres</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>953</td>\n",
       "      <td>4.868923</td>\n",
       "      <td>It's a Wonderful Life (1946)</td>\n",
       "      <td>Drama</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>668</td>\n",
       "      <td>4.866858</td>\n",
       "      <td>Pather Panchali (1955)</td>\n",
       "      <td>Drama</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1423</td>\n",
       "      <td>4.859523</td>\n",
       "      <td>Hearts and Minds (1996)</td>\n",
       "      <td>Drama</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>3307</td>\n",
       "      <td>4.834415</td>\n",
       "      <td>City Lights (1931)</td>\n",
       "      <td>Comedy|Drama|Romance</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>649</td>\n",
       "      <td>4.802675</td>\n",
       "      <td>Cold Fever (Á köldum klaka) (1994)</td>\n",
       "      <td>Comedy|Drama</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>669</td>\n",
       "      <td>4.797451</td>\n",
       "      <td>Aparajito (1956)</td>\n",
       "      <td>Drama</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>326</td>\n",
       "      <td>4.784828</td>\n",
       "      <td>To Live (Huozhe) (1994)</td>\n",
       "      <td>Drama</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>3092</td>\n",
       "      <td>4.761148</td>\n",
       "      <td>Chushingura (1962)</td>\n",
       "      <td>Drama</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>3022</td>\n",
       "      <td>4.753003</td>\n",
       "      <td>General, The (1927)</td>\n",
       "      <td>Comedy</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>2351</td>\n",
       "      <td>4.720692</td>\n",
       "      <td>Nights of Cabiria (Le Notti di Cabiria) (1957)</td>\n",
       "      <td>Drama</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>926</td>\n",
       "      <td>4.719633</td>\n",
       "      <td>All About Eve (1950)</td>\n",
       "      <td>Drama</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>3306</td>\n",
       "      <td>4.718323</td>\n",
       "      <td>Circus, The (1928)</td>\n",
       "      <td>Comedy</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12</th>\n",
       "      <td>3629</td>\n",
       "      <td>4.684521</td>\n",
       "      <td>Gold Rush, The (1925)</td>\n",
       "      <td>Comedy</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>13</th>\n",
       "      <td>3415</td>\n",
       "      <td>4.683432</td>\n",
       "      <td>Mirror, The (Zerkalo) (1975)</td>\n",
       "      <td>Drama</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>14</th>\n",
       "      <td>2609</td>\n",
       "      <td>4.678223</td>\n",
       "      <td>King of Masks, The (Bian Lian) (1996)</td>\n",
       "      <td>Drama</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>15</th>\n",
       "      <td>1178</td>\n",
       "      <td>4.674256</td>\n",
       "      <td>Paths of Glory (1957)</td>\n",
       "      <td>Drama|War</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>16</th>\n",
       "      <td>2203</td>\n",
       "      <td>4.656760</td>\n",
       "      <td>Shadow of a Doubt (1943)</td>\n",
       "      <td>Film-Noir|Thriller</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>17</th>\n",
       "      <td>954</td>\n",
       "      <td>4.654399</td>\n",
       "      <td>Mr. Smith Goes to Washington (1939)</td>\n",
       "      <td>Drama</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>18</th>\n",
       "      <td>3849</td>\n",
       "      <td>4.649190</td>\n",
       "      <td>Spiral Staircase, The (1946)</td>\n",
       "      <td>Thriller</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>19</th>\n",
       "      <td>602</td>\n",
       "      <td>4.645639</td>\n",
       "      <td>Great Day in Harlem, A (1994)</td>\n",
       "      <td>Documentary</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    movie_id  prediction                                           title  \\\n",
       "0        953    4.868923                    It's a Wonderful Life (1946)   \n",
       "1        668    4.866858                          Pather Panchali (1955)   \n",
       "2       1423    4.859523                         Hearts and Minds (1996)   \n",
       "3       3307    4.834415                              City Lights (1931)   \n",
       "4        649    4.802675              Cold Fever (Á köldum klaka) (1994)   \n",
       "5        669    4.797451                                Aparajito (1956)   \n",
       "6        326    4.784828                         To Live (Huozhe) (1994)   \n",
       "7       3092    4.761148                              Chushingura (1962)   \n",
       "8       3022    4.753003                             General, The (1927)   \n",
       "9       2351    4.720692  Nights of Cabiria (Le Notti di Cabiria) (1957)   \n",
       "10       926    4.719633                            All About Eve (1950)   \n",
       "11      3306    4.718323                              Circus, The (1928)   \n",
       "12      3629    4.684521                           Gold Rush, The (1925)   \n",
       "13      3415    4.683432                    Mirror, The (Zerkalo) (1975)   \n",
       "14      2609    4.678223           King of Masks, The (Bian Lian) (1996)   \n",
       "15      1178    4.674256                           Paths of Glory (1957)   \n",
       "16      2203    4.656760                        Shadow of a Doubt (1943)   \n",
       "17       954    4.654399             Mr. Smith Goes to Washington (1939)   \n",
       "18      3849    4.649190                    Spiral Staircase, The (1946)   \n",
       "19       602    4.645639                   Great Day in Harlem, A (1994)   \n",
       "\n",
       "                  genres  \n",
       "0                  Drama  \n",
       "1                  Drama  \n",
       "2                  Drama  \n",
       "3   Comedy|Drama|Romance  \n",
       "4           Comedy|Drama  \n",
       "5                  Drama  \n",
       "6                  Drama  \n",
       "7                  Drama  \n",
       "8                 Comedy  \n",
       "9                  Drama  \n",
       "10                 Drama  \n",
       "11                Comedy  \n",
       "12                Comedy  \n",
       "13                 Drama  \n",
       "14                 Drama  \n",
       "15             Drama|War  \n",
       "16    Film-Noir|Thriller  \n",
       "17                 Drama  \n",
       "18              Thriller  \n",
       "19           Documentary  "
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "recommendations = ratings[ratings['movie_id'].isin(user_ratings['movie_id']) == False][['movie_id']].drop_duplicates()\n",
    "recommendations['prediction'] = recommendations.apply(lambda x: predict_rating(TEST_USER, x['movie_id']), axis=1)\n",
    "recommendations.sort_values(by='prediction',\n",
    "                          ascending=False).merge(movies,\n",
    "                                                 on='movie_id',\n",
    "                                                 how='inner',\n",
    "                                                 suffixes=['_u', '_m']).head(20)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Conclusion\n",
    "In this notebook, I showed how to use a simple deep learning approach to build a recommendation engine for the MovieLens 1M dataset. This model performed better than all the approaches I attempted before (content-based, user-item similarity collaborative filtering, SVD). I can certainly improve this model's performance by making it deeper with more linear and non-linear layers. I leave that task to you then!"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.14"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}