{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Collaborative filtering\n", "\n", "* __Collaborative filtering__: Generally, the process of filtering data by combining information from many different data sources/agents. Specifically, this process as applied to building recommendation systems: making predictions (filtering) by collecting data from lots of different users about their preferences/habits (collaboration).\n", "\n", "\n", "* __Matrix decomposition/factoring__: The process of taking a single matrix and expressing it as the product of different matrices. You can think of gradient descent as a way of doing this. For example, if we have a table of data (matrix) that shows {users}x{movies}, and the matrix is filled with the users' scores for those movies, we could have two matrices -- one for movie factors, and one for user factors. We could then set the values of those two matrices via gradient descent so that their product matches (as closely as possible) the actual scores users gave the movies. So we have taken the score matrix and expressed it as the product of the movie factor and user factor matrices. (Actually, our operations are not technically matrix decomposition because we will fill in 0 values.)\n", "\n", "\n", "* __Observed features/Latent features__: (aka \"factors\" or \"variables\") Observed features are the features that are explicitly read into the model. For example, words in a text. Latent features are the \"hidden\" features -- usually \"discovered\" by some aggregate of the observed features. For example, the topic of a text. Thinking of the matrix decomposition above, each column for a movie could represent some value -- special effects, year of release, etc. And each row of the user matrix could represent how much that user values that feature. Those features, like special effects, etc., would be the latent features. In short: latent features can't be directly observed, while observed features can." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's take a look at collab_filter.xlsx as an example:\n", "\n", "\n", "\n", "This whole process could be thought of as _collaborative filtering using matrix decomposition_. (We're breaking our matrix into 2 different matrices and using it to make some predictions).\n", "\n", "The matrix above the movies and the matrix to the left of the users are embedding matrices for those things.\n", "\n", "So, in summary, the process for this shallow learning is:\n", "\n", "1. Init your user/movie/movie scores matrix using randomly initialized embedding matrices for the movies and users\n", "2. Set up your cost function\n", "3. Minimize your cost function using gradient descent, thus setting more accurate embedding matrix values" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Collaborative filtering using fast.ai" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# First do our usual imports\n", "%reload_ext autoreload\n", "%autoreload 2\n", "%matplotlib inline\n", "\n", "from fastai.learner import *\n", "from fastai.column_data import *" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Then set up the path\n", "path = \"data/ml-latest-small/\"" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " userId movieId rating timestamp\n", "0 1 31 2.5 1260759144\n", "1 1 1029 3.0 1260759179\n", "2 1 1061 3.0 1260759182\n", "3 1 1129 2.0 1260759185\n", "4 1 1172 4.0 1260759205" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Take a look at the data\n", "# We can see it contains a userId, movieId, and rating. We want to predict the rating.\n", "ratings = pd.read_csv(path+'ratings.csv')\n", "ratings.head()" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " movieId title \\\n", "0 1 Toy Story (1995) \n", "1 2 Jumanji (1995) \n", "2 3 Grumpier Old Men (1995) \n", "3 4 Waiting to Exhale (1995) \n", "4 5 Father of the Bride Part II (1995) \n", "\n", " genres \n", "0 Adventure|Animation|Children|Comedy|Fantasy \n", "1 Adventure|Children|Fantasy \n", "2 Comedy|Romance \n", "3 Comedy|Drama|Romance \n", "4 Comedy " ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# We can also get the movie names too\n", "movie_details = pd.read_csv(path+'movies.csv')\n", "movie_details.head()" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "userId\n", "547 2391\n", "564 1868\n", "624 1735\n", "15 1700\n", "73 1610\n", "452 1340\n", "468 1291\n", "380 1063\n", "311 1019\n", "30 1011\n", "294 947\n", "509 923\n", "580 922\n", "213 910\n", "212 876\n", "Name: rating, dtype: int64" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Though not required for modelling, we create a cross tab of the top users and top movies, like we had in our Excel file\n", "# First get the users who have given the most ratings\n", "group = ratings.groupby('userId')['rating'].count()\n", "topUsers = group.sort_values(ascending=False)[:15]\n", "topUsers" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "movieId\n", "356 341\n", "296 324\n", "318 311\n", "593 304\n", "260 291\n", "480 274\n", "2571 259\n", "1 247\n", "527 244\n", "589 237\n", "1196 234\n", "110 228\n", "1270 226\n", "608 224\n", "1198 220\n", "Name: rating, dtype: int64" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Now get the movies which are the highest rated\n", "group = ratings.groupby('movieId')['rating'].count()\n", "topMovies = group.sort_values(ascending=False)[:15]\n", "topMovies" ] }, { "cell_type": "code", 
"execution_count": 7, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Now join them together\n", "top_ranked = ratings.join(topUsers, rsuffix='_r', how='inner', on='userId')" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "movieId 1 110 260 296 318 356 480 527 589 593 608 \\\n", "userId \n", "15 2.0 3.0 5.0 5.0 2.0 1.0 3.0 4.0 4.0 5.0 5.0 \n", "30 4.0 5.0 4.0 5.0 5.0 5.0 4.0 5.0 4.0 4.0 5.0 \n", "73 5.0 4.0 4.5 5.0 5.0 5.0 4.0 5.0 3.0 4.5 4.0 \n", "212 3.0 5.0 4.0 4.0 4.5 4.0 3.0 5.0 3.0 4.0 NaN \n", "213 3.0 2.5 5.0 NaN NaN 2.0 5.0 NaN 4.0 2.5 2.0 \n", "294 4.0 3.0 4.0 NaN 3.0 4.0 4.0 4.0 3.0 NaN NaN \n", "311 3.0 3.0 4.0 3.0 4.5 5.0 4.5 5.0 4.5 2.0 4.0 \n", "380 4.0 5.0 4.0 5.0 4.0 5.0 4.0 NaN 4.0 5.0 4.0 \n", "452 3.5 4.0 4.0 5.0 5.0 4.0 5.0 4.0 4.0 5.0 5.0 \n", "468 4.0 3.0 3.5 3.5 3.5 3.0 2.5 NaN NaN 3.0 4.0 \n", "509 3.0 5.0 5.0 5.0 4.0 4.0 3.0 5.0 2.0 4.0 4.5 \n", "547 3.5 NaN NaN 5.0 5.0 2.0 3.0 5.0 NaN 5.0 5.0 \n", "564 4.0 1.0 2.0 5.0 NaN 3.0 5.0 4.0 5.0 5.0 5.0 \n", "580 4.0 4.5 4.0 4.5 4.0 3.5 3.0 4.0 4.5 4.0 4.5 \n", "624 5.0 NaN 5.0 5.0 NaN 3.0 3.0 NaN 3.0 5.0 4.0 \n", "\n", "movieId 1196 1198 1270 2571 \n", "userId \n", "15 5.0 4.0 5.0 5.0 \n", "30 4.0 5.0 5.0 3.0 \n", "73 5.0 5.0 5.0 4.5 \n", "212 NaN 3.0 3.0 5.0 \n", "213 5.0 3.0 3.0 4.0 \n", "294 4.0 4.5 4.0 4.5 \n", "311 3.0 4.5 4.5 4.0 \n", "380 4.0 NaN 3.0 5.0 \n", "452 4.0 4.0 4.0 2.0 \n", "468 3.0 3.5 3.0 3.0 \n", "509 5.0 5.0 3.0 4.5 \n", "547 2.5 2.0 3.5 3.5 \n", "564 5.0 5.0 3.0 3.0 \n", "580 4.0 3.5 3.0 4.5 \n", "624 5.0 5.0 5.0 2.0 " ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "top_ranked = top_ranked.join(topMovies, rsuffix='_r', how='inner', on='movieId')\n", "\n", "pd.crosstab(top_ranked.userId, top_ranked.movieId, top_ranked.rating, aggfunc=np.sum)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Collaborative filtering\n", "\n", "Now we will do the actual collaborative filtering. This is pretty similar to our previous processes." 
] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# First, get the cross validation indexes -- a random 20% of rows we can use for validation\n", "val_idxs = get_cv_idxs(len(ratings))\n", "\n", "# Weight decay. This will be covered later. Note that 2e-4 means 2 x 10^-4 (0.0002)\n", "wd = 2e-4 \n", "\n", "# This is the depth of the embedding matrix. Can be thought of as the number of latent features. (see note above)\n", "n_factors = 50 " ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Now declare our data and learner\n", "# We pass in the two columns and the thing we want to predict -- like we had in our Excel example earlier\n", "collaborative_filter_data = CollabFilterDataset.from_csv(path, 'ratings.csv', 'userId', 'movieId', 'rating')\n", "learn = collaborative_filter_data.get_learner(n_factors, val_idxs, 64, opt_fn=optim.Adam)" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "9e86229ec34141b59fd7fd885fa04a6a", "version_major": 2, "version_minor": 0 }, "text/html": [ "

Failed to display Jupyter Widget of type HBox.

\n", "

\n", " If you're reading this message in the Jupyter Notebook or JupyterLab Notebook, it may mean\n", " that the widgets JavaScript is still loading. If this message persists, it\n", " likely means that the widgets JavaScript library is either not installed or\n", " not enabled. See the Jupyter\n", " Widgets Documentation for setup instructions.\n", "

\n", "

\n", " If you're reading this message in another frontend (for example, a static\n", " rendering on GitHub or NBViewer),\n", " it may mean that your frontend doesn't currently support widgets.\n", "

\n" ], "text/plain": [ "HBox(children=(IntProgress(value=0, description='Epoch', max=3), HTML(value='')))" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "epoch trn_loss val_loss \n", " 0 0.831135 0.810703 \n", " 1 0.791689 0.780824 \n", " 2 0.617506 0.765011 \n", "\n" ] }, { "data": { "text/plain": [ "[0.76501125]" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Do the learning\n", "# These params were figured out using trials, like usual\n", "learn.fit(1e-2, 2, wds=wd, cycle_len=1, cycle_mult=2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The evaluation metric here is MSE -- mean squared error: the sum of (actual value - predicted value)^2, divided by the number of samples.\n", "So we'll take the square root to get our RMSE." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.8746427842267951" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "math.sqrt(0.765)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Movie bias" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Our bias affects the movie rating, so we can also think of it as a measure of how good/bad movies are." 
] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# First, convert the IDs to contiguous values, like we did for our model.\n", "movie_names = movie_details.set_index('movieId')['title'].to_dict()\n", "group = ratings.groupby('movieId')['rating'].count()\n", "top_movies = group.sort_values(ascending=False).index.values[:3000]\n", "top_movie_idx = np.array([collaborative_filter_data.item2idx[o] for o in top_movies])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we want to view the layers in our PyTorch model, we can just call it.\n", "\n", "So below we have a model with two embedding layers, and then two bias layers -- one for user biases, and one for item biases (in this case, items = movies).\n", "\n", "You can see the 0th element is the number of items, and the 1st element is the number of features. For example, in our user embedding layer, we have 671 users and 50 features, in our item bias layer we have 9066 movies and 1 bias for each movie, etc." ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "EmbeddingDotBias(\n", " (u): Embedding(671, 50)\n", " (i): Embedding(9066, 50)\n", " (ub): Embedding(671, 1)\n", " (ib): Embedding(9066, 1)\n", ")" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model = learn.model\n", "model.cuda()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we take our top movie IDs and pass them into the item bias layer to get the biases for the movies.\n", "\n", "Note: PyTorch lets you do this -- pass in indices to a layer to get the corresponding values.\n", "The indices must be converted to PyTorch `Variables` first. Recall that a variable is basically like a tensor that supports automatic differentiation.\n", "\n", "We then convert the resulting data to a NumPy array so that work can be done on the CPU." 
] }, { "cell_type": "code", "execution_count": 23, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Take a look at the movie bias\n", "# Input is a movie id, and output is the movie bias (a float)\n", "movie_bias = to_np(model.ib(V(top_movie_idx)))" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[ 0.85251],\n", " [ 0.89408],\n", " [ 1.31877],\n", " ...,\n", " [ 0.22685],\n", " [-0.03515],\n", " [ 0.24388]], dtype=float32)" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "movie_bias" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(3000, 1)" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "movie_bias.shape" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Zip up the movie names with their respective biases\n", "movie_ratings = [(b[0], movie_names[i]) for i,b in zip(top_movies, movie_bias)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can look at top and bottom rated movies, corrected for reviewer sentiment, and the different types of movies viewers watch." ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[(-0.9562768, 'Battlefield Earth (2000)'),\n", " (-0.7368616, 'Inspector Gadget (1999)'),\n", " (-0.73659664, 'Anaconda (1997)'),\n", " (-0.7353736, 'Speed 2: Cruise Control (1997)'),\n", " (-0.7109455, 'Wild Wild West (1999)'),\n", " (-0.6921251, 'Mighty Morphin Power Rangers: The Movie (1995)'),\n", " (-0.6649571, 'Super Mario Bros. 
(1993)'),\n", " (-0.655268, 'Batman & Robin (1997)'),\n", " (-0.63718784, 'Haunting, The (1999)'),\n", " (-0.59907967, 'Flintstones, The (1994)'),\n", " (-0.59654623, 'Superman III (1983)'),\n", " (-0.58483046, 'Congo (1995)'),\n", " (-0.5782997, 'Showgirls (1995)'),\n", " (-0.57199323, 'Little Nicky (2000)'),\n", " (-0.5705105, 'Message in a Bottle (1999)')]" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Sort by the 0th element in the tuple (the bias)\n", "sorted(movie_ratings, key=lambda o: o[0])[:15]" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[(-0.9562768, 'Battlefield Earth (2000)'),\n", " (-0.7368616, 'Inspector Gadget (1999)'),\n", " (-0.73659664, 'Anaconda (1997)'),\n", " (-0.7353736, 'Speed 2: Cruise Control (1997)'),\n", " (-0.7109455, 'Wild Wild West (1999)'),\n", " (-0.6921251, 'Mighty Morphin Power Rangers: The Movie (1995)'),\n", " (-0.6649571, 'Super Mario Bros. 
(1993)'),\n", " (-0.655268, 'Batman & Robin (1997)'),\n", " (-0.63718784, 'Haunting, The (1999)'),\n", " (-0.59907967, 'Flintstones, The (1994)'),\n", " (-0.59654623, 'Superman III (1983)'),\n", " (-0.58483046, 'Congo (1995)'),\n", " (-0.5782997, 'Showgirls (1995)'),\n", " (-0.57199323, 'Little Nicky (2000)'),\n", " (-0.5705105, 'Message in a Bottle (1999)')]" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# (Same as above)\n", "sorted(movie_ratings, key=itemgetter(0))[:15]" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[(1.3187655, 'Shawshank Redemption, The (1994)'),\n", " (1.0735388, 'Godfather, The (1972)'),\n", " (1.0717344, 'Usual Suspects, The (1995)'),\n", " (0.9121452, \"Schindler's List (1993)\"),\n", " (0.903625, 'To Kill a Mockingbird (1962)'),\n", " (0.8940818, 'Pulp Fiction (1994)'),\n", " (0.89336175, 'Fargo (1996)'),\n", " (0.887614, 'Matrix, The (1999)'),\n", " (0.8801452, 'Silence of the Lambs, The (1991)'),\n", " (0.8669827, 'Godfather: Part II, The (1974)'),\n", " (0.8619761, 'Star Wars: Episode IV - A New Hope (1977)'),\n", " (0.852508, 'Forrest Gump (1994)'),\n", " (0.84972376, 'Dark Knight, The (2008)'),\n", " (0.84826905, '12 Angry Men (1957)'),\n", " (0.8375876, 'Rear Window (1954)')]" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sorted(movie_ratings, key=lambda o: o[0], reverse=True)[:15]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Interpreting embedding matrices" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(3000, 50)" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "movie_embeddings = to_np(model.i(V(top_movie_idx)))\n", "movie_embeddings.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It's hard to interpret 50 different factors. 
We use [Principal Component Analysis (PCA)](https://plot.ly/ipython-notebooks/principal-component-analysis/) to simplify them down to 3 vectors.\n", "\n", "PCA essentially says: reduce our dimensionality down to $n$ (here, $n = 3$). It finds 3 linear combinations of our 50 embedding dimensions which capture as much variation as possible, while also making those 3 linear combinations as different from each other as possible." ] }, { "cell_type": "code", "execution_count": 37, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from sklearn.decomposition import PCA\n", "pca = PCA(n_components=3)\n", "movie_pca = pca.fit(movie_embeddings.T).components_" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(3, 3000)" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "movie_pca.shape" ] }, { "cell_type": "code", "execution_count": 40, "metadata": { "collapsed": true }, "outputs": [], "source": [ "factor0 = movie_pca[0]\n", "movie_component = [(factor, movie_names[i]) for factor,i in zip(factor0, top_movies)]" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[(0.08261366, 'Independence Day (a.k.a. ID4) (1996)'),\n", " (0.060704462, 'Armageddon (1998)'),\n", " (0.057128586, 'Lost World: Jurassic Park, The (1997)'),\n", " (0.05646295, \"Charlie's Angels (2000)\"),\n", " (0.05587433, 'X-Men (2000)'),\n", " (0.054502834, 'Grumpier Old Men (1995)'),\n", " (0.053873993, 'Pearl Harbor (2001)'),\n", " (0.05382658, 'Police Academy 4: Citizens on Patrol (1987)'),\n", " (0.052748825, 'Miss Congeniality (2000)'),\n", " (0.049758486, 'Waterworld (1995)')]" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Looking at the first component, it looks like it's something like classier movies vs. 
more lighthearted\n", "sorted(movie_component, key=itemgetter(0), reverse=True)[:10]" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[(-0.072085, 'Taxi Driver (1976)'),\n", " (-0.070109025, 'Fargo (1996)'),\n", " (-0.06869264, 'Chinatown (1974)'),\n", " (-0.06718892, 'Godfather, The (1972)'),\n", " (-0.06630573, 'Apocalypse Now (1979)'),\n", " (-0.06497336, 'Pulp Fiction (1994)'),\n", " (-0.0637139, 'Casablanca (1942)'),\n", " (-0.061072655, 'Goodfellas (1990)'),\n", " (-0.059540763, 'Shining, The (1980)'),\n", " (-0.058864854, 'Maltese Falcon, The (1941)')]" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sorted(movie_component, key=itemgetter(0))[:10]" ] }, { "cell_type": "code", "execution_count": 45, "metadata": { "collapsed": true }, "outputs": [], "source": [ "factor1 = movie_pca[1]\n", "movie_component = [(factor, movie_names[i]) for factor,i in zip(factor1, top_movies)]" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[(0.065984644, 'Mission to Mars (2000)'),\n", " (0.060737364, 'Island of Dr. 
Moreau, The (1996)'),\n", " (0.0543456, 'Tank Girl (1995)'),\n", " (0.05323038, 'Batman & Robin (1997)'),\n", " (0.050556783, \"Joe's Apartment (1996)\"),\n", " (0.04883899, 'Showgirls (1995)'),\n", " (0.048659008, 'Coneheads (1993)'),\n", " (0.04720561, 'Catwoman (2004)'),\n", " (0.046349775, 'Piano, The (1993)'),\n", " (0.04605449, 'Bringing Up Baby (1938)')]" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Looking at the second component, it looks more like CGI vs dialogue-driven\n", "sorted(movie_component, key=itemgetter(0), reverse=True)[:10]" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[(-0.13760568, 'Lord of the Rings: The Return of the King, The (2003)'),\n", " (-0.13532887, 'Lord of the Rings: The Fellowship of the Ring, The (2001)'),\n", " (-0.12333063, 'Lord of the Rings: The Two Towers, The (2002)'),\n", " (-0.103825174, 'Star Wars: Episode VI - Return of the Jedi (1983)'),\n", " (-0.09216414, 'Lethal Weapon (1987)'),\n", " (-0.09151409, 'Jurassic Park (1993)'),\n", " (-0.090675056, 'Spider-Man (2002)'),\n", " (-0.08928624,\n", " 'Raiders of the Lost Ark (Indiana Jones and the Raiders of the Lost Ark) (1981)'),\n", " (-0.08467902, 'Die Hard (1988)'),\n", " (-0.083290756, 'X2: X-Men United (2003)')]" ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sorted(movie_component, key=itemgetter(0))[:10]" ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# We can map these two components against each other\n", "idxs = np.random.choice(len(top_movies), 50, replace=False)\n", "X = factor0[idxs]\n", "Y = factor1[idxs]\n", "plt.figure(figsize=(15,15))\n", "plt.scatter(X, Y)\n", "for i, x, y in zip(top_movies[idxs], X, Y):\n", " plt.text(x,y,movie_names[i], 
 color=np.random.rand(3)*0.7, fontsize=11)\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Collaborative Filtering from scratch\n", "\n", "In this section, we'll look at implementing collaborative filtering from scratch." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Do our imports again in case we want to run from here\n", "%reload_ext autoreload\n", "%autoreload 2\n", "%matplotlib inline\n", "\n", "from fastai.learner import *\n", "from fastai.column_data import *" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Then set up the path\n", "path = \"data/ml-latest-small/\"" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " userId movieId rating timestamp\n", "0 1 31 2.5 1260759144\n", "1 1 1029 3.0 1260759179\n", "2 1 1061 3.0 1260759182\n", "3 1 1129 2.0 1260759185\n", "4 1 1172 4.0 1260759205" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ratings = pd.read_csv(path+'ratings.csv')\n", "ratings.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## PyTorch Arithmetic" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(\n", " 1 2\n", " 3 4\n", " [torch.cuda.FloatTensor of size 2x2 (GPU 0)], \n", " 2 2\n", " 10 10\n", " [torch.cuda.FloatTensor of size 2x2 (GPU 0)])" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Declare tensors (n-dimensonal matrices)\n", "a = T([[1.,2],\n", " [3,4]])\n", "\n", "b = T([[2.,2],\n", " [10,10]])\n", "\n", "a,b" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "\n", " 2 4\n", " 30 40\n", "[torch.cuda.FloatTensor of size 2x2 (GPU 0)]" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Element-wise multiplication\n", "a*b" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### CUDA\n", "To run on the graphics card, add `.cuda()` to the end of PyTorch calls. Otherwise they will run on the CPU." 
] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "\n", " 2 4\n", " 30 40\n", "[torch.cuda.FloatTensor of size 2x2 (GPU 0)]" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# This is running on the GPU\n", "a*b.cuda()" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "\n", " 6\n", " 70\n", "[torch.cuda.FloatTensor of size 2 (GPU 0)]" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Element-wise multiplication and sum across the columns\n", "# This is the tensor dot product.\n", "# I.e., the dot product of [1,2] and [2,2] = 6, and [3,4]*[10,10] = 70\n", "(a*b).sum(1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## PyTorch Modules\n", "We can build our own neural network layer to process inputs and compute activations.\n", "\n", "In PyTorch, we call this a __module__. I.e., we are going to build a PyTorch _module_. Modules can be passed in to neural nets.\n", "\n", "PyTorch modules are derived from `nn.Module` (neural network module).\n", "\n", "Modules must contain a function called `forward` that will compute the forward activations -- do the forward pass.\n", "\n", "This `forward` function is called automatically when an instance of the module is called like a function, i.e., `module(a,b)` will call `module.forward(a,b)`." 
] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# We can create a module that does dot products between tensors\n", "class DotProduct(nn.Module):\n", " def forward(self, users, movies):\n", " return (users*movies).sum(1)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": true }, "outputs": [], "source": [ "model = DotProduct()" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "\n", " 6\n", " 70\n", "[torch.cuda.FloatTensor of size 2 (GPU 0)]" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# This will call the forward function.\n", "model(a, b)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### A more complex module/fixing up index values\n", "\n", "Now, let's create a more complex module to do the work we were doing in our spreadsheet.\n", "\n", "But first, we have a slight problem: user and movie IDs are not contiguous. For example, our user ID might jump from 1000 to 1400. This means that if we want to do direct indexing via the ID, we would need to have those extra 400 rows in our tensor. So we'll do some data fixing to map the IDs to a series of sequential, contiguous IDs." 
] }, { "cell_type": "code", "execution_count": 23, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Get the unique user IDs\n", "unique_users = ratings.userId.unique()\n", "\n", "# Build a dict mapping each original ID to a sequential index using enumerate\n", "user_to_index = {o:i for i,o in enumerate(unique_users)}\n", "\n", "# Map the userIds in ratings using user_to_index\n", "ratings.userId = ratings.userId.apply(lambda x: user_to_index[x])" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Do the same for movie IDs\n", "unique_movies = ratings.movieId.unique()\n", "movie_to_index = {o:i for i,o in enumerate(unique_movies)}\n", "ratings.movieId = ratings.movieId.apply(lambda x: movie_to_index[x])" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(671, 9066)" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "number_of_users = int(ratings.userId.nunique())\n", "number_of_movies = int(ratings.movieId.nunique())\n", "\n", "number_of_users, number_of_movies" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Creating the module\n", "\n", "Now let's create our module. This will be a module that holds an embedding matrix for our users and movies.\n", "The `forward` pass will do a dot product on them.\n", "\n", "The module will use `nn.Embedding` to create the embedding matrices. Their weights are PyTorch __`variables`__. Variables support all the operations that tensors do, except they also support automatic differentiation.\n", "\n", "The weights of an embedding are stored in `.weight`. When we want to access the tensor part of that variable, we use `.weight.data`.\n", "\n", "If we put `_` at the end of a PyTorch tensor function, it performs the operation in place.\n", "\n", "We initialize our embedding matrices to random numbers, with values calculated using [He initialization](https://machinelearning.wtf/terms/he-initialization/). 
(See PyTorch's `kaiming_uniform`, which can also do He initialization: [link](http://pytorch.org/docs/master/_modules/torch/nn/init.html).)\n", "\n", "The flow of the module will be like this:\n", "\n", "1. Look up the factors for the users from the embedding matrix\n", "2. Look up the factors for the movies from the embedding matrix\n", "3. Take the dot product" ] }, { "cell_type": "code", "execution_count": 46, "metadata": { "collapsed": true }, "outputs": [], "source": [ "number_of_factors = 50" ] }, { "cell_type": "code", "execution_count": 59, "metadata": { "collapsed": true }, "outputs": [], "source": [ "class EmbeddingNet(nn.Module):\n", "    def __init__(self, number_of_users, number_of_movies):\n", "        super().__init__()\n", " \n", "        # Create embedding matrices for users and movies\n", "        self.user_embedding_matrix = nn.Embedding(number_of_users, number_of_factors)\n", "        self.movie_embedding_matrix = nn.Embedding(number_of_movies, number_of_factors)\n", " \n", "        # Initialize the embedding matrices\n", "        # .weight.data gets the tensor part of the variable\n", "        # Using _ performs the operation in place\n", "        self.user_embedding_matrix.weight.data.uniform_(0,0.05)\n", "        self.movie_embedding_matrix.weight.data.uniform_(0,0.05)\n", " \n", " \n", "    # Forward pass\n", "    # As with our structured data example, we can take in categorical and continuous variables\n", "    # (But both our users and movies are categorical)\n", "    def forward(self, categorical, continuous):\n", "        # Get the users and movies params\n", "        users,movies = categorical[:,0],categorical[:,1]\n", " \n", "        # Get the factors from our embedding matrices\n", "        user_factors,movie_factors = self.user_embedding_matrix(users), self.movie_embedding_matrix(movies)\n", " \n", "        # Take the dot product\n", "        return (user_factors*movie_factors).sum(1)\n", " " ] }, { "cell_type": "code", "execution_count": 60, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Now we want to set up our x and y for our 
crosstab\n", "# X = everything except rating and timestamp (row/column for our cross tab)\n", "# Y = ratings (result in our cross tab)\n", "x = ratings.drop(['rating', 'timestamp'],axis=1)\n", "y = ratings['rating'].astype('float32')" ] }, { "cell_type": "code", "execution_count": 61, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " userId movieId\n", "0 0 0\n", "1 0 1\n", "2 0 2\n", "3 0 3\n", "4 0 4" ] }, "execution_count": 61, "metadata": {}, "output_type": "execute_result" } ], "source": [ "x.head()" ] }, { "cell_type": "code", "execution_count": 62, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 2.5\n", "1 3.0\n", "2 3.0\n", "3 2.0\n", "4 4.0\n", "Name: rating, dtype: float32" ] }, "execution_count": 62, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y.head()" ] }, { "cell_type": "code", "execution_count": 63, "metadata": { "collapsed": true }, "outputs": [], "source": [ "val_idxs = get_cv_idxs(len(ratings))" ] }, { "cell_type": "code", "execution_count": 64, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Just use fast.ai to set up the dataloader\n", "data = ColumnarModelData.from_data_frame(path, val_idxs, x, y, ['userId', 'movieId'], 64)" ] }, { "cell_type": "code", "execution_count": 71, "metadata": { "collapsed": true }, "outputs": [], "source": [ "weight_decay=1e-5\n", "model = EmbeddingNet(number_of_users, number_of_movies).cuda()\n", "\n", "# optim creates the optimization function\n", "# model.parameters() fetches the weights from the nn.Module superclass (anything of type nn.[weight type] e.g. Embedding)\n", "opt = optim.SGD(model.parameters(), 1e-1, weight_decay=weight_decay, momentum=0.9)" ] }, { "cell_type": "code", "execution_count": 72, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "5cac37d423254c7ab4ede4cdcd7ded46", "version_major": 2, "version_minor": 0 }, "text/html": [ "

\n" ], "text/plain": [ "HBox(children=(IntProgress(value=0, description='Epoch', max=3), HTML(value='')))" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "epoch trn_loss val_loss \n", " 0 1.649158 1.637204 \n", " 1 1.117915 1.309114 \n", " 2 0.903568 1.219225 \n", "\n" ] }, { "data": { "text/plain": [ "[1.2192254]" ] }, "execution_count": 72, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Call the PyTorch training loop (we'll write our own later on)\n", "fit(model, data, 3, opt, F.mse_loss)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see that our loss is still quite high.\n", "\n", "We can manually do some learning rate annealing and call `fit` again." ] }, { "cell_type": "code", "execution_count": 73, "metadata": { "collapsed": true }, "outputs": [], "source": [ "set_lrs(opt, 0.01)" ] }, { "cell_type": "code", "execution_count": 74, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "00e265a4ae2243668da0e2af2dfb0c7e", "version_major": 2, "version_minor": 0 }, "text/html": [ "

\n" ], "text/plain": [ "HBox(children=(IntProgress(value=0, description='Epoch', max=3), HTML(value='')))" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "epoch trn_loss val_loss \n", " 0 0.685637 1.1429 \n", " 1 0.694845 1.133847 \n", " 2 0.700296 1.129204 \n", "\n" ] }, { "data": { "text/plain": [ "[1.1292036]" ] }, "execution_count": 74, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fit(model, data, 3, opt, F.mse_loss)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Bias\n", "Our loss still doesn't compete with the fast.ai library. One reason for this is lack of __bias__.\n", "\n", "Consider that one movie may tend to get particularly high ratings, or a certain user may tend to give low scores to movies. We want to account for these case-by-case variances, so we give each movie and each user a bias and add them on to our dot product. In practice, this will be like an extra row stuck on to our movie and user tensors.\n", "\n", "\n", "\n", "So now we will create a new model that takes bias into account.\n", "\n", "This will have a few other differences:\n", "\n", "1. It uses a convenience method to create embeddings\n", "2. It normalizes the scores returned from the forward pass to the range of actual ratings (`min_rating` to `max_rating`)\n", "\n", "This second step is not strictly necessary, but it will make it easier to fit parameters.\n", "\n", "The sigmoid function is called from `F`, which is PyTorch's functional library (`torch.nn.functional`)." 
] }, { "cell_type": "code", "execution_count": 77, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(0.5, 5.0)" ] }, "execution_count": 77, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# For step 2, score normalizing\n", "min_rating, max_rating = ratings.rating.min(), ratings.rating.max()\n", "min_rating, max_rating" ] }, { "cell_type": "code", "execution_count": 91, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# number_of_inputs = rows in the embedding matrix\n", "# number_of_factors = columns in the embedding matrix\n", "def get_embedding(number_of_inputs, number_of_factors):\n", "    embedding = nn.Embedding(number_of_inputs, number_of_factors)\n", "    embedding.weight.data.uniform_(-0.01, 0.01)\n", "    return embedding\n", "\n", "class EmbeddingDotBias(nn.Module):\n", "    def __init__(self, number_of_users, number_of_movies):\n", "        super().__init__()\n", " \n", "        # Initialize embedding matrices and bias vectors\n", "        (self.user_embedding_matrix, self.movie_embedding_matrix, self.user_biases, self.movie_biases) = [get_embedding(*o) for o in [\n", "            (number_of_users, number_of_factors), (number_of_movies, number_of_factors), (number_of_users, 1), (number_of_movies, 1)\n", "        ]]\n", " \n", "    def forward(self, categorical, continuous):\n", "        users, movies = categorical[:,0], categorical[:,1]\n", " \n", "        # Do our dot product\n", "        user_dot_movies = (self.user_embedding_matrix(users)*self.movie_embedding_matrix(movies)).sum(1)\n", " \n", "        # Add on our bias vectors\n", "        results = user_dot_movies + self.user_biases(users).squeeze() + self.movie_biases(movies).squeeze()\n", " \n", "        # Normalize results\n", "        results = F.sigmoid(results) * (max_rating-min_rating)+min_rating\n", " \n", "        return results" ] }, { "cell_type": "code", "execution_count": 92, "metadata": { "collapsed": true }, "outputs": [], "source": [ "cf = CollabFilterDataset.from_csv(path, 'ratings.csv', 'userId', 'movieId', 'rating')\n", "\n", "weight_decay=2e-4\n", 
"model = EmbeddingDotBias(cf.n_users, cf.n_items).cuda()\n", "opt = optim.SGD(model.parameters(), 1e-1, weight_decay=weight_decay, momentum=0.9)" ] }, { "cell_type": "code", "execution_count": 93, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "adb77c43b2ce4eff83f472de25022b6c", "version_major": 2, "version_minor": 0 }, "text/html": [ "

\n" ], "text/plain": [ "HBox(children=(IntProgress(value=0, description='Epoch', max=3), HTML(value='')))" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "epoch trn_loss val_loss \n", " 0 0.832861 0.836411 \n", " 1 0.805658 0.817018 \n", " 2 0.789209 0.810872 \n", "\n" ] }, { "data": { "text/plain": [ "[0.8108725]" ] }, "execution_count": 93, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fit(model, data, 3, opt, F.mse_loss)" ] }, { "cell_type": "code", "execution_count": 94, "metadata": { "collapsed": true }, "outputs": [], "source": [ "set_lrs(opt, 1e-2)" ] }, { "cell_type": "code", "execution_count": 95, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "64dc3f84d60d415abfaf2308e7fb2acf", "version_major": 2, "version_minor": 0 }, "text/html": [ "

\n" ], "text/plain": [ "HBox(children=(IntProgress(value=0, description='Epoch', max=3), HTML(value='')))" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "epoch trn_loss val_loss \n", " 0 0.733431 0.802443 \n", " 1 0.726335 0.800945 \n", " 2 0.756487 0.800443 \n", "\n" ] }, { "data": { "text/plain": [ "[0.800443]" ] }, "execution_count": 95, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fit(model, data, 3, opt, F.mse_loss)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Mini neural net\n", "\n", "Now, we could take our user and movie embedding values, stick them together, and feed them into a linear layer, effectively creating a neural network.\n", "\n", "\n", "\n", "To create linear layers, we will use the PyTorch `nn.Linear` class. Note, this class already has biases built into it, so there is no need for separate bias vectors." ] }, { "cell_type": "code", "execution_count": 106, "metadata": { "collapsed": true }, "outputs": [], "source": [ "class EmbeddingNet(nn.Module):\n", "    def __init__(self, number_of_users, number_of_movies, number_hidden_activations=10, p1=0.05, p2=0.5):\n", "        super().__init__()\n", " \n", "        # Set up our embedding layers\n", "        (self.user_embedding_matrix, self.movie_embedding_matrix) = [get_embedding(*o) for o in [\n", "            (number_of_users, number_of_factors), (number_of_movies, number_of_factors)\n", "        ]]\n", " \n", "        # Set up the first linear layer. 
Since we concatenate the user and movie embeddings, the input size is number_of_factors*2\n", "        self.linear_layer_1 = nn.Linear(number_of_factors*2, number_hidden_activations)\n", " \n", "        # Set up second linear layer, which will give the output\n", "        self.linear_layer_2 = nn.Linear(number_hidden_activations, 1)\n", " \n", "        self.dropout1 = nn.Dropout(p1)\n", "        self.dropout2 = nn.Dropout(p2)\n", " \n", "    def forward(self, categorical, continuous):\n", "        users, movies = categorical[:,0], categorical[:,1]\n", " \n", "        # Now, first we get the values from our embedding matrix, and concatenate the columns (dim=1)\n", "        # and then run dropout on them\n", "        x = self.dropout1(torch.cat([self.user_embedding_matrix(users),self.movie_embedding_matrix(movies)], dim=1))\n", " \n", "        # Next, feed this into our first linear layer, run it through ReLU, and perform dropout\n", "        x = self.dropout2(F.relu(self.linear_layer_1(x)))\n", " \n", "        # Lastly, we feed it into our second linear layer, run it through sigmoid and normalize\n", "        # (the output range is widened by 0.5 on each side of the min and max ratings)\n", "        return F.sigmoid(self.linear_layer_2(x)) * (max_rating-min_rating+1) + min_rating-0.5\n", " " ] }, { "cell_type": "code", "execution_count": 107, "metadata": { "collapsed": true }, "outputs": [], "source": [ "weight_decay=1e-5\n", "model = EmbeddingNet(number_of_users, number_of_movies).cuda()\n", "opt = optim.Adam(model.parameters(), 1e-3, weight_decay=weight_decay)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Note on fit\n", "When calling `fit`, we pass it a loss/cost function that it uses to measure how well the model's predictions match the targets.\n", "\n", "E.g., `F.mse_loss`." ] }, { "cell_type": "code", "execution_count": 108, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "1b5d9007c4784b26bc555910aa75096c", "version_major": 2, "version_minor": 0 }, "text/html": [ "

\n" ], "text/plain": [ "HBox(children=(IntProgress(value=0, description='Epoch', max=3), HTML(value='')))" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "epoch trn_loss val_loss \n", " 0 0.88798 0.817012 \n", " 1 0.79681 0.796811 \n", " 2 0.802571 0.79135 \n", "\n" ] }, { "data": { "text/plain": [ "[0.79135036]" ] }, "execution_count": 108, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fit(model, data, 3, opt, F.mse_loss)" ] }, { "cell_type": "code", "execution_count": 109, "metadata": { "collapsed": true }, "outputs": [], "source": [ "set_lrs(opt, 1e-3)" ] }, { "cell_type": "code", "execution_count": 110, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "b34e934abe7645dda72a331db356b5d1", "version_major": 2, "version_minor": 0 }, "text/html": [ "

\n" ], "text/plain": [ "HBox(children=(IntProgress(value=0, description='Epoch', max=3), HTML(value='')))" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "epoch trn_loss val_loss \n", " 0 0.778022 0.789235 \n", " 1 0.761803 0.789287 \n", " 2 0.765764 0.794108 \n", "\n" ] }, { "data": { "text/plain": [ "[0.7941082]" ] }, "execution_count": 110, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fit(model, data, 3, opt, F.mse_loss)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.3" } }, "nbformat": 4, "nbformat_minor": 2 }