{ "cells": [ { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "# SVD for Movie Recommendations\n", "In this notebook, I'll detail a basic version of model-based collaborative filtering for recommendations by applying it to the MovieLens 1M dataset. \n", "\n", "[In my previous attempt](https://github.com/khanhnamle1994/movielens/blob/master/Content_Based_and_Collaborative_Filtering_Models.ipynb), I used user-based and item-based collaborative filtering to make movie recommendations from users' ratings data. I could only try them on a very small data sample (20,000 ratings), and ended up with a pretty high Root Mean Squared Error (bad recommendations). Memory-based collaborative filtering approaches that compute distance relationships between items or users have these two major issues:\n", "\n", "1. They don't scale particularly well to massive datasets, especially for real-time recommendations based on user behavior similarities, which take a lot of computation.\n", "2. Ratings matrices may overfit to noisy representations of user tastes and preferences. When we use distance-based \"neighborhood\" approaches on raw data, we match on sparse low-level details that we assume represent the user's preference vector instead of the vector itself.\n", "\n", "Thus I need to apply a **Dimensionality Reduction** technique to derive the tastes and preferences from the raw data, otherwise known as doing low-rank matrix factorization. 
Why reduce dimensions?\n", "\n", "* I can discover hidden correlations / features in the raw data.\n", "* I can remove redundant and noisy features that are not useful.\n", "* I can interpret and visualize the data more easily.\n", "* I can store and process the data more easily.\n", "\n", "With that goal in mind, I'll introduce Singular Value Decomposition (SVD), a powerful dimensionality reduction technique that is used heavily in modern model-based CF recommender systems.\n", "\n", "![dim-red](images/dimensionality-reduction.jpg)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Loading the Dataset\n", "Let's load the 3 data files just like last time." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Import libraries\n", "import numpy as np\n", "import pandas as pd\n", "\n", "# Reading ratings file\n", "ratings = pd.read_csv('ratings.csv', sep='\\t', encoding='latin-1', usecols=['user_id', 'movie_id', 'rating', 'timestamp'])\n", "\n", "# Reading users file\n", "users = pd.read_csv('users.csv', sep='\\t', encoding='latin-1', usecols=['user_id', 'gender', 'zipcode', 'age_desc', 'occ_desc'])\n", "\n", "# Reading movies file\n", "movies = pd.read_csv('movies.csv', sep='\\t', encoding='latin-1', usecols=['movie_id', 'title', 'genres'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's take a look at the movies and ratings dataframes." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n",
"<tr><th></th><th>movie_id</th><th>title</th><th>genres</th></tr>\n",
"<tr><th>0</th><td>1</td><td>Toy Story (1995)</td><td>Animation|Children's|Comedy</td></tr>\n",
"<tr><th>1</th><td>2</td><td>Jumanji (1995)</td><td>Adventure|Children's|Fantasy</td></tr>\n",
"<tr><th>2</th><td>3</td><td>Grumpier Old Men (1995)</td><td>Comedy|Romance</td></tr>\n",
"<tr><th>3</th><td>4</td><td>Waiting to Exhale (1995)</td><td>Comedy|Drama</td></tr>\n",
"<tr><th>4</th><td>5</td><td>Father of the Bride Part II (1995)</td><td>Comedy</td></tr>\n",
"</table>
" ], "text/plain": [ " movie_id title genres\n", "0 1 Toy Story (1995) Animation|Children's|Comedy\n", "1 2 Jumanji (1995) Adventure|Children's|Fantasy\n", "2 3 Grumpier Old Men (1995) Comedy|Romance\n", "3 4 Waiting to Exhale (1995) Comedy|Drama\n", "4 5 Father of the Bride Part II (1995) Comedy" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "movies.head()" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n",
"<tr><th></th><th>user_id</th><th>movie_id</th><th>rating</th><th>timestamp</th></tr>\n",
"<tr><th>0</th><td>1</td><td>1193</td><td>5</td><td>978300760</td></tr>\n",
"<tr><th>1</th><td>1</td><td>661</td><td>3</td><td>978302109</td></tr>\n",
"<tr><th>2</th><td>1</td><td>914</td><td>3</td><td>978301968</td></tr>\n",
"<tr><th>3</th><td>1</td><td>3408</td><td>4</td><td>978300275</td></tr>\n",
"<tr><th>4</th><td>1</td><td>2355</td><td>5</td><td>978824291</td></tr>\n",
"</table>
" ], "text/plain": [ " user_id movie_id rating timestamp\n", "0 1 1193 5 978300760\n", "1 1 661 3 978302109\n", "2 1 914 3 978301968\n", "3 1 3408 4 978300275\n", "4 1 2355 5 978824291" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ratings.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Also let's count the number of unique users and movies." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of users = 6040 | Number of movies = 3706\n" ] } ], "source": [ "n_users = ratings.user_id.unique().shape[0]\n", "n_movies = ratings.movie_id.unique().shape[0]\n", "print 'Number of users = ' + str(n_users) + ' | Number of movies = ' + str(n_movies)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now I want the format of my ratings matrix to be one row per user and one column per movie. To do so, I'll pivot *ratings* to get that and call the new variable *Ratings* (with a capital *R)." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n",
"<tr><th>movie_id</th><th>1</th><th>2</th><th>3</th><th>4</th><th>5</th><th>6</th><th>7</th><th>8</th><th>9</th><th>10</th><th>...</th><th>3943</th><th>3944</th><th>3945</th><th>3946</th><th>3947</th><th>3948</th><th>3949</th><th>3950</th><th>3951</th><th>3952</th></tr>\n",
"<tr><th>user_id</th></tr>\n",
"<tr><th>1</th><td>5.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>...</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td></tr>\n",
"<tr><th>2</th><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>...</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td></tr>\n",
"<tr><th>3</th><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>...</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td></tr>\n",
"<tr><th>4</th><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>...</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td></tr>\n",
"<tr><th>5</th><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>2.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>...</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td></tr>\n",
"</table>\n",
"<p>5 rows × 3706 columns</p>
" ], "text/plain": [ "movie_id 1 2 3 4 5 6 7 8 9 10 ... \\\n", "user_id ... \n", "1 5.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... \n", "2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... \n", "3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... \n", "4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... \n", "5 0.0 0.0 0.0 0.0 0.0 2.0 0.0 0.0 0.0 0.0 ... \n", "\n", "movie_id 3943 3944 3945 3946 3947 3948 3949 3950 3951 3952 \n", "user_id \n", "1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", "2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", "3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", "4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", "5 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", "\n", "[5 rows x 3706 columns]" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Ratings = ratings.pivot(index = 'user_id', columns ='movie_id', values = 'rating').fillna(0)\n", "Ratings.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Last but not least, I need to de-normalize the data (normalize by each users mean) and convert it from a dataframe to a numpy array." ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": true }, "outputs": [], "source": [ "R = Ratings.as_matrix()\n", "user_ratings_mean = np.mean(R, axis = 1)\n", "Ratings_demeaned = R - user_ratings_mean.reshape(-1, 1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With my ratings matrix properly formatted and normalized, I'm ready to do some dimensionality reduction. But first, let's go over the math." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Model-Based Collaborative Filtering\n", "*Model-based Collaborative Filtering* is based on *matrix factorization (MF)* which has received greater exposure, mainly as an unsupervised learning method for latent variable decomposition and dimensionality reduction. 
Matrix factorization is widely used for recommender systems, where it deals better with scalability and sparsity than Memory-based CF:\n", "\n", "* The goal of MF is to learn the latent preferences of users and the latent attributes of items from known ratings (learn features that describe the characteristics of ratings), and then predict the unknown ratings through the dot product of the latent features of users and items. \n", "* When you have a very sparse matrix with a lot of dimensions, matrix factorization lets you restructure the user-item matrix into a low-rank structure: you represent the matrix as the product of two low-rank matrices, whose rows contain the latent vectors. \n", "* You fit this product to approximate your original matrix as closely as possible; multiplying the low-rank matrices together then fills in the entries missing in the original matrix.\n", "\n", "For example, let's check the sparsity of the ratings dataset:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The sparsity level of MovieLens1M dataset is 95.5%\n" ] } ], "source": [ "sparsity = round(1.0 - len(ratings) / float(n_users * n_movies), 3)\n", "print('The sparsity level of MovieLens1M dataset is ' + str(sparsity * 100) + '%')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Singular Value Decomposition (SVD)\n", "A well-known matrix factorization method is *Singular Value Decomposition (SVD)*. At a high level, SVD is an algorithm that factorizes a matrix $A$ so that keeping only its strongest components gives the best lower-rank (i.e. smaller/simpler) approximation of the original matrix $A$. 
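As a quick illustration of what *best lower-rank approximation* means, here is a minimal sketch on a made-up 4x3 matrix. The values and the names `toy` and `rank_k_approx` are assumptions for illustration, not part of the MovieLens data. It uses NumPy's `np.linalg.svd` and truncates by hand; for each $k$, the Frobenius error of the rank-$k$ result matches the root of the sum of squared discarded singular values:

```python
import numpy as np

# Hypothetical toy ratings matrix: 4 users x 3 movies (made-up values).
toy = np.array([[5.0, 4.0, 1.0],
                [4.0, 5.0, 1.0],
                [1.0, 1.0, 5.0],
                [2.0, 1.0, 4.0]])

def rank_k_approx(A, k):
    # Rebuild A from its k largest singular values and their singular vectors.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)  # s comes back in descending order
    return np.dot(np.dot(U[:, :k], np.diag(s[:k])), Vt[:k, :])

singular_values = np.linalg.svd(toy, compute_uv=False)
for k in (1, 2, 3):
    error = np.linalg.norm(toy - rank_k_approx(toy, k))    # Frobenius norm of the residual
    discarded = np.sqrt(np.sum(singular_values[k:] ** 2))  # energy in the dropped components
    print(k, round(error, 4), round(discarded, 4))
```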
Mathematically, it decomposes $A$ into two unitary matrices and a diagonal matrix:\n", "\n", "![svd](images/svd.png)\n", "\n", "where $A$ is the input data matrix (users' ratings), $U$ holds the left singular vectors (user \"features\" matrix), $\\Sigma$ is the diagonal matrix of singular values (essentially weights/strengths of each concept), and $V^{T}$ holds the right singular vectors (movie \"features\" matrix). $U$ and $V$ have orthonormal columns, and represent different things. $U$ represents how much users \"like\" each feature and $V^{T}$ represents how relevant each feature is to each movie.\n", "\n", "To get the lower rank approximation, I take these matrices and keep only the top $k$ features, which can be thought of as the underlying tastes and preferences vectors." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Setting Up SVD\n", "Scipy and Numpy both have functions to do the singular value decomposition. I'm going to use the Scipy function *svds* because it lets me choose how many latent factors I want to use to approximate the original ratings matrix (instead of having to truncate it afterwards)." ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from scipy.sparse.linalg import svds\n", "U, sigma, Vt = svds(Ratings_demeaned, k = 50)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As I'm going to leverage matrix multiplication to get predictions, I'll convert $\\Sigma$ (currently a 1-D array of singular values) to diagonal matrix form." ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": true }, "outputs": [], "source": [ "sigma = np.diag(sigma)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Making Predictions from the Decomposed Matrices\n", "I now have everything I need to make movie ratings predictions for every user. 
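On a toy scale, the whole prediction step can be sketched as follows. The 4x4 matrix and the helper names `predict_all` and `top_unrated` are hypothetical; the sketch uses `np.linalg.svd` with manual truncation rather than `svds`, and follows this notebook's convention of averaging over all entries, zeros included:

```python
import numpy as np

# Hypothetical toy ratings (0 = not rated): 4 users x 4 movies, made-up values.
toy_ratings = np.array([[5.0, 4.0, 0.0, 1.0],
                        [4.0, 0.0, 1.0, 1.0],
                        [1.0, 1.0, 5.0, 0.0],
                        [0.0, 1.0, 4.0, 5.0]])

def predict_all(ratings_matrix, k):
    # De-mean each user's row, keep the top-k singular triplets, add the means back.
    user_means = ratings_matrix.mean(axis=1).reshape(-1, 1)
    U, s, Vt = np.linalg.svd(ratings_matrix - user_means, full_matrices=False)
    low_rank = np.dot(np.dot(U[:, :k], np.diag(s[:k])), Vt[:k, :])
    return low_rank + user_means

def top_unrated(ratings_matrix, predicted, user_row, n):
    # Column indices of the n highest-scoring movies this user has not rated yet.
    unrated = np.where(ratings_matrix[user_row] == 0)[0]
    order = np.argsort(predicted[user_row, unrated])[::-1]
    return unrated[order][:n].tolist()

predicted = predict_all(toy_ratings, k=2)
print(top_unrated(toy_ratings, predicted, user_row=0, n=1))
```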
I can do it all at once by following the math and multiplying $U$, $\\Sigma$, and $V^{T}$ back together to get the rank $k=50$ approximation of $A$.\n", "\n", "I also need to add the user means back to get the actual star ratings prediction." ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": true }, "outputs": [], "source": [ "all_user_predicted_ratings = np.dot(np.dot(U, sigma), Vt) + user_ratings_mean.reshape(-1, 1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With the predictions matrix for every user, I can build a function to recommend movies for any user. I'll also return the list of movies the user has already rated, for the sake of comparison." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n",
"<tr><th>movie_id</th><th>1</th><th>2</th><th>3</th><th>4</th><th>5</th><th>6</th><th>7</th><th>8</th><th>9</th><th>10</th><th>...</th><th>3943</th><th>3944</th><th>3945</th><th>3946</th><th>3947</th><th>3948</th><th>3949</th><th>3950</th><th>3951</th><th>3952</th></tr>\n",
"<tr><th>0</th><td>4.288861</td><td>0.143055</td><td>-0.195080</td><td>-0.018843</td><td>0.012232</td><td>-0.176604</td><td>-0.074120</td><td>0.141358</td><td>-0.059553</td><td>-0.195950</td><td>...</td><td>0.027807</td><td>0.001640</td><td>0.026395</td><td>-0.022024</td><td>-0.085415</td><td>0.403529</td><td>0.105579</td><td>0.031912</td><td>0.050450</td><td>0.088910</td></tr>\n",
"<tr><th>1</th><td>0.744716</td><td>0.169659</td><td>0.335418</td><td>0.000758</td><td>0.022475</td><td>1.353050</td><td>0.051426</td><td>0.071258</td><td>0.161601</td><td>1.567246</td><td>...</td><td>-0.056502</td><td>-0.013733</td><td>-0.010580</td><td>0.062576</td><td>-0.016248</td><td>0.155790</td><td>-0.418737</td><td>-0.101102</td><td>-0.054098</td><td>-0.140188</td></tr>\n",
"<tr><th>2</th><td>1.818824</td><td>0.456136</td><td>0.090978</td><td>-0.043037</td><td>-0.025694</td><td>-0.158617</td><td>-0.131778</td><td>0.098977</td><td>0.030551</td><td>0.735470</td><td>...</td><td>0.040481</td><td>-0.005301</td><td>0.012832</td><td>0.029349</td><td>0.020866</td><td>0.121532</td><td>0.076205</td><td>0.012345</td><td>0.015148</td><td>-0.109956</td></tr>\n",
"<tr><th>3</th><td>0.408057</td><td>-0.072960</td><td>0.039642</td><td>0.089363</td><td>0.041950</td><td>0.237753</td><td>-0.049426</td><td>0.009467</td><td>0.045469</td><td>-0.111370</td><td>...</td><td>0.008571</td><td>-0.005425</td><td>-0.008500</td><td>-0.003417</td><td>-0.083982</td><td>0.094512</td><td>0.057557</td><td>-0.026050</td><td>0.014841</td><td>-0.034224</td></tr>\n",
"<tr><th>4</th><td>1.574272</td><td>0.021239</td><td>-0.051300</td><td>0.246884</td><td>-0.032406</td><td>1.552281</td><td>-0.199630</td><td>-0.014920</td><td>-0.060498</td><td>0.450512</td><td>...</td><td>0.110151</td><td>0.046010</td><td>0.006934</td><td>-0.015940</td><td>-0.050080</td><td>-0.052539</td><td>0.507189</td><td>0.033830</td><td>0.125706</td><td>0.199244</td></tr>\n",
"</table>\n",
"<p>5 rows × 3706 columns</p>
" ], "text/plain": [ "movie_id 1 2 3 4 5 6 \\\n", "0 4.288861 0.143055 -0.195080 -0.018843 0.012232 -0.176604 \n", "1 0.744716 0.169659 0.335418 0.000758 0.022475 1.353050 \n", "2 1.818824 0.456136 0.090978 -0.043037 -0.025694 -0.158617 \n", "3 0.408057 -0.072960 0.039642 0.089363 0.041950 0.237753 \n", "4 1.574272 0.021239 -0.051300 0.246884 -0.032406 1.552281 \n", "\n", "movie_id 7 8 9 10 ... 3943 \\\n", "0 -0.074120 0.141358 -0.059553 -0.195950 ... 0.027807 \n", "1 0.051426 0.071258 0.161601 1.567246 ... -0.056502 \n", "2 -0.131778 0.098977 0.030551 0.735470 ... 0.040481 \n", "3 -0.049426 0.009467 0.045469 -0.111370 ... 0.008571 \n", "4 -0.199630 -0.014920 -0.060498 0.450512 ... 0.110151 \n", "\n", "movie_id 3944 3945 3946 3947 3948 3949 \\\n", "0 0.001640 0.026395 -0.022024 -0.085415 0.403529 0.105579 \n", "1 -0.013733 -0.010580 0.062576 -0.016248 0.155790 -0.418737 \n", "2 -0.005301 0.012832 0.029349 0.020866 0.121532 0.076205 \n", "3 -0.005425 -0.008500 -0.003417 -0.083982 0.094512 0.057557 \n", "4 0.046010 0.006934 -0.015940 -0.050080 -0.052539 0.507189 \n", "\n", "movie_id 3950 3951 3952 \n", "0 0.031912 0.050450 0.088910 \n", "1 -0.101102 -0.054098 -0.140188 \n", "2 0.012345 0.015148 -0.109956 \n", "3 -0.026050 0.014841 -0.034224 \n", "4 0.033830 0.125706 0.199244 \n", "\n", "[5 rows x 3706 columns]" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "preds = pd.DataFrame(all_user_predicted_ratings, columns = Ratings.columns)\n", "preds.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now I write a function to return the movies with the highest predicted rating that the specified user hasn't already rated. Though I didn't use any explicit movie content features (such as genre or title), I'll merge in that information to get a more complete picture of the recommendations." 
] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def recommend_movies(predictions, userID, movies, original_ratings, num_recommendations):\n", "    \n", "    # Get and sort the user's predictions (user IDs start at 1, row numbers at 0)\n", "    user_row_number = userID - 1\n", "    sorted_user_predictions = predictions.iloc[user_row_number].sort_values(ascending=False)\n", "    \n", "    # Get the user's data and merge in the movie information.\n", "    user_data = original_ratings[original_ratings.user_id == (userID)]\n", "    user_full = (user_data.merge(movies, how = 'left', left_on = 'movie_id', right_on = 'movie_id').\n", "                 sort_values(['rating'], ascending=False)\n", "                )\n", "\n", "    print('User {0} has already rated {1} movies.'.format(userID, user_full.shape[0]))\n", "    print('Recommending highest {0} predicted ratings movies not already rated.'.format(num_recommendations))\n", "    \n", "    # Recommend the highest predicted rating movies that the user hasn't rated yet.\n", "    recommendations = (movies[~movies['movie_id'].isin(user_full['movie_id'])].\n", "                       merge(pd.DataFrame(sorted_user_predictions).reset_index(), how = 'left',\n", "                             left_on = 'movie_id',\n", "                             right_on = 'movie_id').\n", "                       rename(columns = {user_row_number: 'Predictions'}).\n", "                       sort_values('Predictions', ascending = False).\n", "                       iloc[:num_recommendations, :-1]\n", "                      )\n", "\n", "    return user_full, recommendations" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's try to recommend 20 movies for the user with ID 1310." 
] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "User 1310 has already rated 24 movies.\n", "Recommending highest 20 predicted ratings movies not already rated.\n" ] } ], "source": [ "already_rated, predictions = recommend_movies(preds, 1310, movies, ratings, 20)" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n",
"<tr><th></th><th>user_id</th><th>movie_id</th><th>rating</th><th>timestamp</th><th>title</th><th>genres</th></tr>\n",
"<tr><th>5</th><td>1310</td><td>2248</td><td>5</td><td>974781573</td><td>Say Anything... (1989)</td><td>Comedy|Drama|Romance</td></tr>\n",
"<tr><th>6</th><td>1310</td><td>2620</td><td>5</td><td>974781573</td><td>This Is My Father (1998)</td><td>Drama|Romance</td></tr>\n",
"<tr><th>7</th><td>1310</td><td>3683</td><td>5</td><td>974781935</td><td>Blood Simple (1984)</td><td>Drama|Film-Noir</td></tr>\n",
"<tr><th>15</th><td>1310</td><td>1704</td><td>5</td><td>974781573</td><td>Good Will Hunting (1997)</td><td>Drama</td></tr>\n",
"<tr><th>1</th><td>1310</td><td>1293</td><td>5</td><td>974781839</td><td>Gandhi (1982)</td><td>Drama</td></tr>\n",
"<tr><th>12</th><td>1310</td><td>3101</td><td>4</td><td>974781573</td><td>Fatal Attraction (1987)</td><td>Thriller</td></tr>\n",
"<tr><th>11</th><td>1310</td><td>1343</td><td>4</td><td>974781534</td><td>Cape Fear (1991)</td><td>Thriller</td></tr>\n",
"<tr><th>20</th><td>1310</td><td>2000</td><td>4</td><td>974781892</td><td>Lethal Weapon (1987)</td><td>Action|Comedy|Crime|Drama</td></tr>\n",
"<tr><th>18</th><td>1310</td><td>3526</td><td>4</td><td>974781892</td><td>Parenthood (1989)</td><td>Comedy|Drama</td></tr>\n",
"<tr><th>17</th><td>1310</td><td>3360</td><td>4</td><td>974781935</td><td>Hoosiers (1986)</td><td>Drama</td></tr>\n",
"<tr><th>13</th><td>1310</td><td>3111</td><td>4</td><td>974782001</td><td>Places in the Heart (1984)</td><td>Drama</td></tr>\n",
"<tr><th>23</th><td>1310</td><td>1097</td><td>4</td><td>974781534</td><td>E.T. the Extra-Terrestrial (1982)</td><td>Children's|Drama|Fantasy|Sci-Fi</td></tr>\n",
"<tr><th>10</th><td>1310</td><td>1196</td><td>4</td><td>974781701</td><td>Star Wars: Episode V - The Empire Strikes Back...</td><td>Action|Adventure|Drama|Sci-Fi|War</td></tr>\n",
"<tr><th>9</th><td>1310</td><td>1185</td><td>4</td><td>974781839</td><td>My Left Foot (1989)</td><td>Drama</td></tr>\n",
"<tr><th>8</th><td>1310</td><td>3685</td><td>4</td><td>974781935</td><td>Prizzi's Honor (1985)</td><td>Comedy|Drama|Romance</td></tr>\n",
"<tr><th>4</th><td>1310</td><td>2243</td><td>4</td><td>974782001</td><td>Broadcast News (1987)</td><td>Comedy|Drama|Romance</td></tr>\n",
"<tr><th>3</th><td>1310</td><td>1299</td><td>4</td><td>974781701</td><td>Killing Fields, The (1984)</td><td>Drama|War</td></tr>\n",
"<tr><th>16</th><td>1310</td><td>144</td><td>3</td><td>974781573</td><td>Brothers McMullen, The (1995)</td><td>Comedy</td></tr>\n",
"<tr><th>19</th><td>1310</td><td>1960</td><td>3</td><td>974782001</td><td>Last Emperor, The (1987)</td><td>Drama|War</td></tr>\n",
"<tr><th>0</th><td>1310</td><td>2988</td><td>3</td><td>974781935</td><td>Melvin and Howard (1980)</td><td>Drama</td></tr>\n",
"</table>
" ], "text/plain": [ " user_id movie_id rating timestamp \\\n", "5 1310 2248 5 974781573 \n", "6 1310 2620 5 974781573 \n", "7 1310 3683 5 974781935 \n", "15 1310 1704 5 974781573 \n", "1 1310 1293 5 974781839 \n", "12 1310 3101 4 974781573 \n", "11 1310 1343 4 974781534 \n", "20 1310 2000 4 974781892 \n", "18 1310 3526 4 974781892 \n", "17 1310 3360 4 974781935 \n", "13 1310 3111 4 974782001 \n", "23 1310 1097 4 974781534 \n", "10 1310 1196 4 974781701 \n", "9 1310 1185 4 974781839 \n", "8 1310 3685 4 974781935 \n", "4 1310 2243 4 974782001 \n", "3 1310 1299 4 974781701 \n", "16 1310 144 3 974781573 \n", "19 1310 1960 3 974782001 \n", "0 1310 2988 3 974781935 \n", "\n", " title \\\n", "5 Say Anything... (1989) \n", "6 This Is My Father (1998) \n", "7 Blood Simple (1984) \n", "15 Good Will Hunting (1997) \n", "1 Gandhi (1982) \n", "12 Fatal Attraction (1987) \n", "11 Cape Fear (1991) \n", "20 Lethal Weapon (1987) \n", "18 Parenthood (1989) \n", "17 Hoosiers (1986) \n", "13 Places in the Heart (1984) \n", "23 E.T. the Extra-Terrestrial (1982) \n", "10 Star Wars: Episode V - The Empire Strikes Back... 
\n", "9 My Left Foot (1989) \n", "8 Prizzi's Honor (1985) \n", "4 Broadcast News (1987) \n", "3 Killing Fields, The (1984) \n", "16 Brothers McMullen, The (1995) \n", "19 Last Emperor, The (1987) \n", "0 Melvin and Howard (1980) \n", "\n", " genres \n", "5 Comedy|Drama|Romance \n", "6 Drama|Romance \n", "7 Drama|Film-Noir \n", "15 Drama \n", "1 Drama \n", "12 Thriller \n", "11 Thriller \n", "20 Action|Comedy|Crime|Drama \n", "18 Comedy|Drama \n", "17 Drama \n", "13 Drama \n", "23 Children's|Drama|Fantasy|Sci-Fi \n", "10 Action|Adventure|Drama|Sci-Fi|War \n", "9 Drama \n", "8 Comedy|Drama|Romance \n", "4 Comedy|Drama|Romance \n", "3 Drama|War \n", "16 Comedy \n", "19 Drama|War \n", "0 Drama " ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Top 20 movies that User 1310 has rated \n", "already_rated.head(20)" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n",
"<tr><th></th><th>movie_id</th><th>title</th><th>genres</th></tr>\n",
"<tr><th>1618</th><td>1674</td><td>Witness (1985)</td><td>Drama|Romance|Thriller</td></tr>\n",
"<tr><th>1880</th><td>1961</td><td>Rain Man (1988)</td><td>Drama</td></tr>\n",
"<tr><th>1187</th><td>1210</td><td>Star Wars: Episode VI - Return of the Jedi (1983)</td><td>Action|Adventure|Romance|Sci-Fi|War</td></tr>\n",
"<tr><th>1216</th><td>1242</td><td>Glory (1989)</td><td>Action|Drama|War</td></tr>\n",
"<tr><th>1202</th><td>1225</td><td>Amadeus (1984)</td><td>Drama</td></tr>\n",
"<tr><th>1273</th><td>1302</td><td>Field of Dreams (1989)</td><td>Drama</td></tr>\n",
"<tr><th>1220</th><td>1246</td><td>Dead Poets Society (1989)</td><td>Drama</td></tr>\n",
"<tr><th>1881</th><td>1962</td><td>Driving Miss Daisy (1989)</td><td>Drama</td></tr>\n",
"<tr><th>1877</th><td>1957</td><td>Chariots of Fire (1981)</td><td>Drama</td></tr>\n",
"<tr><th>1938</th><td>2020</td><td>Dangerous Liaisons (1988)</td><td>Drama|Romance</td></tr>\n",
"<tr><th>1233</th><td>1259</td><td>Stand by Me (1986)</td><td>Adventure|Comedy|Drama</td></tr>\n",
"<tr><th>3011</th><td>3098</td><td>Natural, The (1984)</td><td>Drama</td></tr>\n",
"<tr><th>2112</th><td>2194</td><td>Untouchables, The (1987)</td><td>Action|Crime|Drama</td></tr>\n",
"<tr><th>1876</th><td>1956</td><td>Ordinary People (1980)</td><td>Drama</td></tr>\n",
"<tr><th>1268</th><td>1296</td><td>Room with a View, A (1986)</td><td>Drama|Romance</td></tr>\n",
"<tr><th>2267</th><td>2352</td><td>Big Chill, The (1983)</td><td>Comedy|Drama</td></tr>\n",
"<tr><th>1278</th><td>1307</td><td>When Harry Met Sally... (1989)</td><td>Comedy|Romance</td></tr>\n",
"<tr><th>1165</th><td>1186</td><td>Sex, Lies, and Videotape (1989)</td><td>Drama</td></tr>\n",
"<tr><th>1199</th><td>1222</td><td>Full Metal Jacket (1987)</td><td>Action|Drama|War</td></tr>\n",
"<tr><th>2833</th><td>2919</td><td>Year of Living Dangerously (1982)</td><td>Drama|Romance</td></tr>\n",
"</table>
" ], "text/plain": [ " movie_id title \\\n", "1618 1674 Witness (1985) \n", "1880 1961 Rain Man (1988) \n", "1187 1210 Star Wars: Episode VI - Return of the Jedi (1983) \n", "1216 1242 Glory (1989) \n", "1202 1225 Amadeus (1984) \n", "1273 1302 Field of Dreams (1989) \n", "1220 1246 Dead Poets Society (1989) \n", "1881 1962 Driving Miss Daisy (1989) \n", "1877 1957 Chariots of Fire (1981) \n", "1938 2020 Dangerous Liaisons (1988) \n", "1233 1259 Stand by Me (1986) \n", "3011 3098 Natural, The (1984) \n", "2112 2194 Untouchables, The (1987) \n", "1876 1956 Ordinary People (1980) \n", "1268 1296 Room with a View, A (1986) \n", "2267 2352 Big Chill, The (1983) \n", "1278 1307 When Harry Met Sally... (1989) \n", "1165 1186 Sex, Lies, and Videotape (1989) \n", "1199 1222 Full Metal Jacket (1987) \n", "2833 2919 Year of Living Dangerously (1982) \n", "\n", " genres \n", "1618 Drama|Romance|Thriller \n", "1880 Drama \n", "1187 Action|Adventure|Romance|Sci-Fi|War \n", "1216 Action|Drama|War \n", "1202 Drama \n", "1273 Drama \n", "1220 Drama \n", "1881 Drama \n", "1877 Drama \n", "1938 Drama|Romance \n", "1233 Adventure|Comedy|Drama \n", "3011 Drama \n", "2112 Action|Crime|Drama \n", "1876 Drama \n", "1268 Drama|Romance \n", "2267 Comedy|Drama \n", "1278 Comedy|Romance \n", "1165 Drama \n", "1199 Action|Drama|War \n", "2833 Drama|Romance " ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Top 20 movies that User 1310 hopefully will enjoy\n", "predictions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These look like pretty good recommendations. It's good to see that, although I didn't actually use the genre of the movie as a feature, the truncated matrix factorization features \"picked up\" on the underlying tastes and preferences of the user. I've recommended some comedy, drama, and romance movies - all of which were genres of some of this user's top rated movies." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Model Evaluation\n", "Can't forget to evaluate our model, can we?\n", "\n", "Instead of doing it manually like last time, I will use the *[Surprise](https://pypi.python.org/pypi/scikit-surprise)* library, which provides various ready-to-use prediction algorithms (including SVD), to evaluate the RMSE (Root Mean Squared Error) on the MovieLens dataset. It is a Python scikit for building and analyzing recommender systems." ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Import libraries from Surprise package\n", "from surprise import Reader, Dataset, SVD, evaluate\n", "\n", "# Create a Reader (its default rating scale is 1 to 5)\n", "reader = Reader()\n", "\n", "# Load ratings dataset with Dataset library\n", "data = Dataset.load_from_df(ratings[['user_id', 'movie_id', 'rating']], reader)\n", "\n", "# Split the dataset for 5-fold evaluation\n", "data.split(n_folds=5)" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/Users/khanhnamle/anaconda2/lib/python2.7/site-packages/surprise/evaluate.py:66: UserWarning: The evaluate() method is deprecated. Please use model_selection.cross_validate() instead.\n", " 'model_selection.cross_validate() instead.', UserWarning)\n", "/Users/khanhnamle/anaconda2/lib/python2.7/site-packages/surprise/dataset.py:193: UserWarning: Using data.split() or using load_from_folds() without using a CV iterator is now deprecated. 
\n", " UserWarning)\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Evaluating RMSE of algorithm SVD.\n", "\n", "------------\n", "Fold 1\n", "RMSE: 0.8731\n", "------------\n", "Fold 2\n", "RMSE: 0.8743\n", "------------\n", "Fold 3\n", "RMSE: 0.8742\n", "------------\n", "Fold 4\n", "RMSE: 0.8724\n", "------------\n", "Fold 5\n", "RMSE: 0.8737\n", "------------\n", "------------\n", "Mean RMSE: 0.8736\n", "------------\n", "------------\n" ] }, { "data": { "text/plain": [ "CaseInsensitiveDefaultDict(list,\n", " {'rmse': [0.873147136065789,\n", " 0.8742574033899374,\n", " 0.8742215891785942,\n", " 0.8724048167991113,\n", " 0.8737344531771998]})" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Use the SVD algorithm.\n", "svd = SVD()\n", "\n", "# Compute the RMSE of the SVD algorithm.\n", "evaluate(svd, data, measures=['RMSE'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I get a mean *Root Mean Squared Error* of 0.8736, which is pretty good on a 1-to-5 star scale. Let's now train on the full dataset and make predictions." ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/Users/khanhnamle/anaconda2/lib/python2.7/site-packages/surprise/prediction_algorithms/algo_base.py:51: UserWarning: train() is deprecated. Use fit() instead\n", " warnings.warn('train() is deprecated. Use fit() instead', UserWarning)\n" ] }, { "data": { "text/plain": [ "" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "trainset = data.build_full_trainset()\n", "svd.train(trainset)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I'll again pick the user with ID 1310 and check the ratings they have given." ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n",
"<tr><th></th><th>user_id</th><th>movie_id</th><th>rating</th><th>timestamp</th></tr>\n",
"<tr><th>215928</th><td>1310</td><td>2988</td><td>3</td><td>974781935</td></tr>\n",
"<tr><th>215929</th><td>1310</td><td>1293</td><td>5</td><td>974781839</td></tr>\n",
"<tr><th>215930</th><td>1310</td><td>1295</td><td>2</td><td>974782001</td></tr>\n",
"<tr><th>215931</th><td>1310</td><td>1299</td><td>4</td><td>974781701</td></tr>\n",
"<tr><th>215932</th><td>1310</td><td>2243</td><td>4</td><td>974782001</td></tr>\n",
"<tr><th>215933</th><td>1310</td><td>2248</td><td>5</td><td>974781573</td></tr>\n",
"<tr><th>215934</th><td>1310</td><td>2620</td><td>5</td><td>974781573</td></tr>\n",
"<tr><th>215935</th><td>1310</td><td>3683</td><td>5</td><td>974781935</td></tr>\n",
"<tr><th>215936</th><td>1310</td><td>3685</td><td>4</td><td>974781935</td></tr>\n",
"<tr><th>215937</th><td>1310</td><td>1185</td><td>4</td><td>974781839</td></tr>\n",
"<tr><th>215938</th><td>1310</td><td>1196</td><td>4</td><td>974781701</td></tr>\n",
"<tr><th>215939</th><td>1310</td><td>1343</td><td>4</td><td>974781534</td></tr>\n",
"<tr><th>215940</th><td>1310</td><td>3101</td><td>4</td><td>974781573</td></tr>\n",
"<tr><th>215941</th><td>1310</td><td>3111</td><td>4</td><td>974782001</td></tr>\n",
"<tr><th>215942</th><td>1310</td><td>2313</td><td>2</td><td>974781839</td></tr>\n",
"<tr><th>215943</th><td>1310</td><td>1704</td><td>5</td><td>974781573</td></tr>\n",
"<tr><th>215944</th><td>1310</td><td>144</td><td>3</td><td>974781573</td></tr>\n",
"<tr><th>215945</th><td>1310</td><td>3360</td><td>4</td><td>974781935</td></tr>\n",
"<tr><th>215946</th><td>1310</td><td>3526</td><td>4</td><td>974781892</td></tr>\n",
"<tr><th>215947</th><td>1310</td><td>1960</td><td>3</td><td>974782001</td></tr>\n",
"<tr><th>215948</th><td>1310</td><td>2000</td><td>4</td><td>974781892</td></tr>\n",
"<tr><th>215949</th><td>1310</td><td>1231</td><td>2</td><td>974781963</td></tr>\n",
"<tr><th>215950</th><td>1310</td><td>1090</td><td>2</td><td>974781839</td></tr>\n",
"<tr><th>215951</th><td>1310</td><td>1097</td><td>4</td><td>974781534</td></tr>\n",
"</table>
" ], "text/plain": [ " user_id movie_id rating timestamp\n", "215928 1310 2988 3 974781935\n", "215929 1310 1293 5 974781839\n", "215930 1310 1295 2 974782001\n", "215931 1310 1299 4 974781701\n", "215932 1310 2243 4 974782001\n", "215933 1310 2248 5 974781573\n", "215934 1310 2620 5 974781573\n", "215935 1310 3683 5 974781935\n", "215936 1310 3685 4 974781935\n", "215937 1310 1185 4 974781839\n", "215938 1310 1196 4 974781701\n", "215939 1310 1343 4 974781534\n", "215940 1310 3101 4 974781573\n", "215941 1310 3111 4 974782001\n", "215942 1310 2313 2 974781839\n", "215943 1310 1704 5 974781573\n", "215944 1310 144 3 974781573\n", "215945 1310 3360 4 974781935\n", "215946 1310 3526 4 974781892\n", "215947 1310 1960 3 974782001\n", "215948 1310 2000 4 974781892\n", "215949 1310 1231 2 974781963\n", "215950 1310 1090 2 974781839\n", "215951 1310 1097 4 974781534" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ratings[ratings['user_id'] == 1310]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's use SVD to predict the rating that User with ID 1310 will give to a random movie (let's say with Movie ID 1994)." ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Prediction(uid=1310, iid=1994, r_ui=None, est=3.348657218865324, details={u'was_impossible': False})" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "svd.predict(1310, 1994)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For movie with ID 1994, I get an estimated prediction of 3.349. 
The recommender system works purely on the basis of an assigned movie ID and tries to predict ratings based on how other users have rated the movie.\n", "\n", "## Conclusion\n", "In this notebook, I attempted to build a model-based collaborative filtering movie recommendation system based on latent features from a low-rank matrix factorization method called SVD. Because it captures the underlying features driving the raw data, it can scale significantly better to massive datasets and make better recommendations based on users' tastes.\n", "\n", "However, we still likely lose some meaningful signal by using a low-rank approximation. In addition, there's an interpretability problem, since a singular vector specifies a linear combination of all input columns or rows. There's also a lack of sparsity when the singular vectors are quite dense. Finally, the SVD approach is limited to linear projections." ] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.14" } }, "nbformat": 4, "nbformat_minor": 2 }