{ "cells": [ { "cell_type": "markdown", "metadata": { "toc": true }, "source": [ "

Table of Contents

\n", "
" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "" ], "text/plain": [ "" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# code for loading the format for the notebook\n", "import os\n", "\n", "# path : store the current path to convert back to it later\n", "path = os.getcwd()\n", "os.chdir(os.path.join('..', 'notebook_format'))\n", "from formats import load_style\n", "load_style(css_style = 'custom2.css', plot_style = False)" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Author: Ethen\n", "\n", "Last updated: 2022-05-11 23:19:12\n", "\n", "Python implementation: CPython\n", "Python version : 3.7.11\n", "IPython version : 7.27.0\n", "\n", "numpy : 1.21.6\n", "pandas : 1.3.5\n", "scipy : 1.7.1\n", "sklearn: 1.0.2\n", "tqdm : 4.62.3\n", "\n" ] } ], "source": [ "os.chdir(path)\n", "\n", "# 1. magic to print version\n", "# 2. magic so that the notebook will reload external python modules\n", "%load_ext watermark\n", "%load_ext autoreload \n", "%autoreload 2\n", "\n", "import sys\n", "import numpy as np\n", "import pandas as pd\n", "from math import ceil\n", "from tqdm import trange\n", "from subprocess import call\n", "from itertools import islice\n", "from sklearn.metrics import roc_auc_score\n", "from sklearn.preprocessing import normalize\n", "from sklearn.neighbors import NearestNeighbors\n", "from scipy.sparse import csr_matrix, dok_matrix\n", "\n", "%watermark -a 'Ethen' -d -u -t -v -p numpy,pandas,scipy,sklearn,tqdm" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Bayesian Personalized Ranking" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you are new to the field recommendation system, please make sure you understand the basics of matrix factorization from this [other documentation](http://nbviewer.jupyter.org/github/ethen8181/machine-learning/blob/master/recsys/1_ALSWR.ipynb).\n", "\n", "Recall that when doing matrix factorization for implicit feedback data (users' clicks, view times), we start with a user-item matrix, $R$ where nonzero elements of the matrix are the users' interaction with the items. And what matrix factorization does is it decomposes a large matrix into products of matrices, namely, $R=U×V$.\n", "\n", "![](img/matrix_factorization.png)\n", "\n", "Matrix factorization assumes that:\n", "\n", "- Each user can be described by $d$ features. For example, feature 1 might be a referring to how much each user likes disney movies.\n", "- Each item, movie in this case, can be described by an analagous set of $d$ features. To correspond to the above example, feature 1 for the movie might be a number that says how close the movie is to a disney movie.\n", "\n", "With that notion in mind, we can denote our $d$ feature user into math by letting a user $u$ take the form of a $1 \\times d$-dimensional vector $\\textbf{x}_{u}$. Similarly, an item *i* can be represented by a $1 \\times d$-dimensional vector $\\textbf{y}_{i}$. And we would predict the interactions that the user $u$ will have for item $i$ is by doing a dot product of the two vectors\n", "\n", "$$\n", "\\begin{align}\n", "\\hat r_{ui} &= \\textbf{x}_{u} \\textbf{y}_{i}^{T} = \\sum\\limits_{d} x_{ud}y_{di}\n", "\\end{align}\n", "$$\n", "\n", "Where $\\hat r_{ui}$ represents our prediction for the true interactions $r_{ui}$. Next, we will choose a objective function to minimize the square of the difference between all interactions in our dataset ($S$) and our predictions. This produces a objective function of the form:\n", "\n", "$$\n", "\\begin{align}\n", "L &= \\sum\\limits_{u,i \\in S}( r_{ui} - \\textbf{x}_{u} \\textbf{y}_{i}^{T} )^{2}\n", "\\end{align}\n", "$$\n", "\n", "This is all well and good, but a lot of times, what we wish to optimize for is not the difference between the true interaction and the predicted interaction, but instead is the ranking of the items. Meaning given a user, what is the top-N most likely item that the user prefers. And this is what **Bayesian Personalized Ranking (BPR)** tries to accomplish. The idea is centered around sampling positive (items user has interacted with) and negative (items user hasn't interacted with) items and running pairwise comparisons." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Formulation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Suppose $U$ is the set of all users and $I$ is the set of all items, our goal is to provide user $u$ with a personalized ranking, denoted by $>_u$. As mentioned in the last section, the usual approach for item recommendation algorithm is to predict a personalized score $\\hat r_{ui}$ for an item that reflects the user's preference for that item. Then the items are ranked by sorting them according to that score and the top-N is recommended to the user. \n", "\n", "Here we'll use a different approach by using item pairs as training data and optimize for correctly ranking item pairs. From $S$, the whole dataset we try to reconstruct for each user parts of $>_u$. If the user has interacted with item $i$, i.e. $(u,i) \\in S$, then we assume that the user prefers this item over all other non-observed items. E.g. in the figure below user $u_1$ has interacted with item $i_2$ but not item $i_1$, so we assume that this user prefers item $i_2$ over $i_1$: $i_2 >_u i_1$. We will denote this generally as $i >_u j$, where the $i$ stands for the positive item and $j$ is for the negative item. For items that the user have both interacted with, we cannot infer any preference. The same is true for two items that a user has not interacted yet (e.g. item $i_1$ and $i_4$ for user $u_1$).\n", "\n", "![](img/interactions.png)\n", "\n", "Given these information, we can now get to the Bayesian part of this method. Let $\\Theta$ be the parameter of the model that determines the personalized ranking. BPR's goal is to maximize the posterior probability:\n", "\n", "$$\n", "\\begin{align}\n", "p(\\Theta | i >_u j ) \\propto p( i >_u j |\\Theta) p(\\Theta)\n", "\\end{align}\n", "$$\n", "\n", "$p( i >_u j |\\Theta)$ is the likelihood function, it captures the individual probability that a user really prefers item $i$ over item $j$. We compute this probability with the form:\n", "\n", "$$\n", "\\begin{align}\n", "p( i >_u j |\\Theta) = \\sigma \\big(\\hat r_{uij}(\\Theta) \\big)\n", "\\end{align}\n", "$$\n", "\n", "Where: $\\sigma$ is the good old logistic sigmoid:\n", "\n", "$$\n", "\\begin{align}\n", "\\sigma(x) = \\frac{1}{1 + e^{-x}}\n", "\\end{align}\n", "$$\n", "\n", "And $r_{uij}(\\Theta)$ captures the relationship between user $u$, item $i$ and item $j$, which can be further decomposed into:\n", "\n", "$$\n", "\\begin{align}\n", "\\hat r_{uij} = \\hat r_{ui} - \\hat r_{uj}\n", "\\end{align}\n", "$$\n", "\n", "For convenience we skipped the argument $\\Theta$ from $\\hat r_{uij}$. The formula above is basically saying what is the difference between the predicted interaction between the positive item $i$ and the negative item $j$. Now because of this generic framework, we can apply any standard collaborative filtering (such as the matrix factorization) techniques that can predict the interaction between user and item. Keep in mind that although it may seem like we're using the same models as in other work, here we're optimizing against another criterion as we do not try to predict a single predictor $\\hat r_{ui}$ but instead tries to classify the difference of two predictions $\\hat r_{ui} - \\hat r_{uj}$. For those that interested in diving deeper, there's a section in the original paper that showed that the BPR optimization criteria is actually optimizing AUC (Area Under the ROC curve)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So far, we have only discussed the likelihood function. In order to complete the Bayesian modeling approach of the personalized ranking task, we introduce a general prior density $p(\\Theta)$ which is a normal distribution with zero mean and variance-covariance matrix $\\Sigma (\\Theta)$, to reduce the number of unknown hyperparameters, we set: $\\Sigma (\\Theta) = \\lambda_{\\Theta} I$\n", "\n", "To sum it all up, the full form of the maximum posterior probability optimization (called BPR-Opt in the paper) can be specified as:\n", "\n", "$$\n", "\\begin{align}\n", "BPR-Opt &= \\prod_{u, i, j} p( i >_u j |\\Theta) p(\\Theta) \\\\\n", "&= ln \\big( \\prod_{u, i, j} p( i >_u j |\\Theta) p(\\Theta) \\big) \\\\\n", "&= \\sum_{u, i, j} ln \\sigma \\big(\\hat r_{ui} - \\hat r_{uj} \\big) + ln p(\\Theta) \\\\\n", "&= \\sum_{u, i, j} ln \\sigma \\big(\\hat r_{ui} - \\hat r_{uj} \\big) - \\lambda_{\\Theta} \\left\\Vert \\Theta \\right\\Vert ^{2} \\\\\n", "&= \\sum_{u, i, j} ln \\sigma \\big(\\hat r_{ui} - \\hat r_{uj} \\big) - \\frac{\\lambda_{\\Theta}}{2} \\left\\Vert x_u \\right\\Vert ^{2} - \\frac{\\lambda_{\\Theta}}{2} \\left\\Vert y_i \\right\\Vert ^{2} - \\frac{\\lambda_{\\Theta}}{2} \\left\\Vert y_j \\right\\Vert ^{2} \\\\\n", "&= \\sum_{u, i, j} ln \\sigma \\big( x_u y_i^T - x_u y_j^T \\big) - \\frac{\\lambda_{\\Theta}}{2} \\left\\Vert x_u \\right\\Vert ^{2} - \\frac{\\lambda_{\\Theta}}{2} \\left\\Vert y_i \\right\\Vert ^{2} - \\frac{\\lambda_{\\Theta}}{2} \\left\\Vert y_j \\right\\Vert ^{2} \\\\\n", "&= \\sum_{u, i, j} ln \\frac{1}{1 + e^{-(x_u y_i^T - x_u y_j^T) }} - \\frac{\\lambda_{\\Theta}}{2} \\left\\Vert x_u \\right\\Vert ^{2} - \\frac{\\lambda_{\\Theta}}{2} \\left\\Vert y_i \\right\\Vert ^{2} - \\frac{\\lambda_{\\Theta}}{2} \\left\\Vert y_j \\right\\Vert ^{2}\n", "\\end{align}\n", "$$\n", "\n", "Where:\n", "\n", "- We first take the natural log (it's a monotonic transformation, meaning taking the log does not affect the optimization process) to make the product a sum to make life easier on us\n", "- As for the $p(\\Theta)$ part, recall that because for each parameter we assume that it's a normal distribution with mean zero ($\\mu = 0$), and unit variance ($\\Sigma = I$, we ignore the $\\lambda_{\\Theta}$ for now), the formula for it is:\n", "\n", "$$\n", "\\begin{align}\n", "N(x \\mid \\mu, \\Sigma) \n", "&= \\frac{1}{(2\\pi)^{d/2}\\sqrt{|\\Sigma|}} exp(-\\frac{1}{2}(x-\\mu)^{T}\\Sigma^{-1}(x-\\mu)) \\\\\n", "&= \\frac{1}{(2\\pi)^{d/2}} exp(-\\frac{1}{2}\\Theta^{T}\\Theta)\n", "\\end{align}\n", "$$\n", "\n", "In the formula above, the only thing that depends on $\\Theta$ is the $exp(-\\frac{1}{2}\\Theta^{T}\\Theta)$ part on the right, the rest is just a multiplicative constant that we don't need to worry about, thus if we take the natural log of that formula, then the exponential goes away and our $p(\\Theta)$ can be written as $- \\frac{1}{2} \\left\\Vert \\Theta \\right\\Vert ^{2}$, and we simply multiply the $\\lambda_{\\Theta}$ back, which can be seen as the model specific regularization parameter.\n", "\n", "Last but not least, in machine learning it's probably more common to try and minimize things, thus we simply flip all the signs of the maximization formula above, leaving us with:\n", "\n", "$$\n", "\\begin{align}\n", "argmin_{x_u, y_i, y_j} \\sum_{u, i, j} -ln \\frac{1}{1 + e^{-(x_u y_i^T - x_u y_j^T) }} + \\frac{\\lambda_{\\Theta}}{2} \\left\\Vert x_u \\right\\Vert ^{2} + \\frac{\\lambda_{\\Theta}}{2} \\left\\Vert y_i \\right\\Vert ^{2} + \\frac{\\lambda_{\\Theta}}{2} \\left\\Vert y_j \\right\\Vert ^{2}\n", "\\end{align}\n", "$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Optimization" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the last section, we have derived an optimization criterion for personalized ranking. As the criterion is differentiable, gradient descent based algorithms are an obvious choice for maximization. But standard gradient descent is probably not the right choice for our problem given the size of all the possible combinations of the triplet $(u, i, j)$. To solve for this issue we use a stochastic gradient descent algorithm that chooses the triplets randomly (uniformly distributed).\n", "\n", "To solve for the function using gradient descent, we derive the gradient for the three parameters $x_u$, $y_i$, $y_j$ separately. Just a minor hint when deriving the gradient, remember that the first part of the formula requires the chain rule:\n", "\n", "$$\n", "\\begin{align}\n", "\\dfrac{\\partial}{\\partial x} ln \\sigma(x) \n", "&= \\dfrac{1}{\\sigma(x)} \\dfrac{\\partial}{\\partial x} \\sigma(x) \\\\\n", "&= \\left( 1 + e^{-x} \\right) \\dfrac{\\partial}{\\partial x} \\sigma(x) \\\\\n", "&= \\left( 1 + e^{-x} \\right) \\dfrac{\\partial}{\\partial x} \\left[ \\dfrac{1}{1 + e^{-x}} \\right] \\\\\n", "&= \\left( 1 + e^{-x} \\right) \\dfrac{\\partial}{\\partial x} \\left( 1 + \\mathrm{e}^{-x} \\right)^{-1} \\\\\n", "&= \\left( 1 + e^{-x} \\right) \\cdot -(1 + e^{-x})^{-2}(-e^{-x}) \\\\\n", "&= \\left( 1 + e^{-x} \\right) \\dfrac{e^{-x}}{\\left(1 + e^{-x}\\right)^2} \\\\\n", "&= \\dfrac{e^{-x}}{1 + e^{-x}} \\\\\n", "&= \\dfrac{1}{1 + e^x}\n", "\\end{align}\n", "$$\n", "\n", "---\n", "\n", "$$\n", "\\begin{align}\n", "\\dfrac{\\partial}{\\partial x_u}\n", "&= \\dfrac{1}{1 + e^{(x_u y_i^T - x_u y_j^T)}} \\cdot (y_j - y_i) + \\lambda x_u\n", "\\end{align}\n", "$$\n", "\n", "$$\n", "\\begin{align}\n", "\\dfrac{\\partial}{\\partial y_i}\n", "&= \\dfrac{1}{1 + e^{(x_u y_i^T - x_u y_j^T)}} \\cdot -x_u + \\lambda y_i\n", "\\end{align}\n", "$$\n", "\n", "$$\n", "\\begin{align}\n", "\\dfrac{\\partial}{\\partial y_j}\n", "&= \\dfrac{1}{1 + e^{(x_u y_i^T - x_u y_j^T)}} \\cdot x_u + \\lambda y_j\n", "\\end{align}\n", "$$\n", "\n", "After deriving the gradient the update for each parameter using vanilla gradient descent is:\n", "\n", "$$\n", "\\begin{align}\n", "\\Theta & \\Leftarrow \\Theta - \\alpha \\dfrac{\\partial}{\\partial \\Theta}\n", "\\end{align}\n", "$$\n", "\n", "Where $\\alpha$ is the learning rate." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Implementation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will again use the movielens data as an example." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "data dimension: \n", " (100000, 4)\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
user_iditem_idratingtimestamp
01962423881250949
11863023891717742
2223771878887116
3244512880606923
41663461886397596
\n", "
" ], "text/plain": [ " user_id item_id rating timestamp\n", "0 196 242 3 881250949\n", "1 186 302 3 891717742\n", "2 22 377 1 878887116\n", "3 244 51 2 880606923\n", "4 166 346 1 886397596" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "file_dir = 'ml-100k'\n", "file_path = os.path.join(file_dir, 'u.data')\n", "if not os.path.isdir(file_dir):\n", " call(['curl', '-O', 'http://files.grouplens.org/datasets/movielens/' + file_dir + '.zip'])\n", " call(['unzip', file_dir + '.zip'])\n", "\n", "# we will not be using the timestamp column\n", "names = ['user_id', 'item_id', 'rating', 'timestamp']\n", "df = pd.read_csv(file_path, sep = '\\t', names = names)\n", "print('data dimension: \\n', df.shape)\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Because BPR assumes binary implicit feedback (meaing there's only positive and negative items), here we'll assume that an item is positive only if he/she gave the item a ratings above 3 (feel free to experiment and change the threshold). The next few code chunks, creates the sparse interaction matrix and split into train and test set." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "def create_matrix(data, users_col, items_col, ratings_col, threshold = None):\n", " \"\"\"\n", " creates the sparse user-item interaction matrix,\n", " if the data is not in the format where the interaction only\n", " contains the positive items (indicated by 1), then use the \n", " threshold parameter to determine which items are considered positive\n", " \n", " Parameters\n", " ----------\n", " data : DataFrame\n", " implicit rating data\n", "\n", " users_col : str\n", " user column name\n", "\n", " items_col : str\n", " item column name\n", " \n", " ratings_col : str\n", " implicit rating column name\n", "\n", " threshold : int, default None\n", " threshold to determine whether the user-item pair is \n", " a positive feedback\n", "\n", " Returns\n", " -------\n", " ratings : scipy sparse csr_matrix, shape [n_users, n_items]\n", " user/item ratings matrix\n", "\n", " data : DataFrame\n", " implict rating data that retains only the positive feedback\n", " (if specified to do so)\n", " \"\"\"\n", " if threshold is not None:\n", " data = data[data[ratings_col] >= threshold]\n", " data[ratings_col] = 1\n", "\n", " # this ensures each user has at least 2 records to construct a valid\n", " # train and test split in downstream process, note we might purge\n", " # some users completely during this process\n", " data_user_num_items = (data\n", " .groupby('user_id')\n", " .agg(**{'num_items': ('item_id', 'count')})\n", " .reset_index())\n", " data = data.merge(data_user_num_items, on='user_id', how='inner')\n", " data = data[data['num_items'] > 1]\n", " \n", " for col in (items_col, users_col, ratings_col):\n", " data[col] = data[col].astype('category')\n", "\n", " ratings = csr_matrix((data[ratings_col],\n", " (data[users_col].cat.codes, data[items_col].cat.codes)))\n", " ratings.eliminate_zeros()\n", " return ratings, data" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "<943x1574 sparse matrix of type ''\n", "\twith 82520 stored elements in Compressed Sparse Row format>" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "items_col = 'item_id'\n", "users_col = 'user_id'\n", "ratings_col = 'rating'\n", "threshold = 3\n", "X, df = create_matrix(df, users_col, items_col, ratings_col, threshold)\n", "X" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "def create_train_test(ratings, test_size = 0.2, seed = 1234):\n", " \"\"\"\n", " split the user-item interactions matrix into train and test set\n", " by removing some of the interactions from every user and pretend\n", " that we never seen them\n", " \n", " Parameters\n", " ----------\n", " ratings : scipy sparse csr_matrix, shape [n_users, n_items]\n", " The user-item interactions matrix\n", " \n", " test_size : float between 0.0 and 1.0, default 0.2\n", " Proportion of the user-item interactions for each user\n", " in the dataset to move to the test set; e.g. if set to 0.2\n", " and a user has 10 interactions, then 2 will be moved to the\n", " test set\n", " \n", " seed : int, default 1234\n", " Seed for reproducible random splitting the \n", " data into train/test set\n", " \n", " Returns\n", " ------- \n", " train : scipy sparse csr_matrix, shape [n_users, n_items]\n", " Training set\n", " \n", " test : scipy sparse csr_matrix, shape [n_users, n_items]\n", " Test set\n", " \"\"\"\n", " assert test_size < 1.0 and test_size > 0.0\n", "\n", " # Dictionary Of Keys based sparse matrix is more efficient\n", " # for constructing sparse matrices incrementally compared with csr_matrix\n", " train = ratings.copy().todok()\n", " test = dok_matrix(train.shape)\n", " \n", " # for all the users assign randomly chosen interactions\n", " # to the test and assign those interactions to zero in the training;\n", " # when computing the interactions to go into the test set, \n", " # remember to round up the numbers (e.g. a user has 4 ratings, if the\n", " # test_size is 0.2, then 0.8 ratings will go to test, thus we need to\n", " # round up to ensure the test set gets at least 1 rating)\n", " rstate = np.random.RandomState(seed)\n", " for u in range(ratings.shape[0]):\n", " split_index = ratings[u].indices\n", " n_splits = ceil(test_size * split_index.shape[0])\n", " test_index = rstate.choice(split_index, size = n_splits, replace = False)\n", " test[u, test_index] = ratings[u, test_index]\n", " train[u, test_index] = 0\n", " \n", " train, test = train.tocsr(), test.tocsr()\n", " return train, test" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "<943x1574 sparse matrix of type ''\n", "\twith 65641 stored elements in Compressed Sparse Row format>" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X_train, X_test = create_train_test(X, test_size = 0.2, seed = 1234)\n", "X_train" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following section provides a implementation of the algorithm from scratch." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "class BPR:\n", " \"\"\"\n", " Bayesian Personalized Ranking (BPR) for implicit feedback data\n", "\n", " Parameters\n", " ----------\n", " learning_rate : float, default 0.01\n", " learning rate for gradient descent\n", "\n", " n_factors : int, default 20\n", " Number/dimension of user and item latent factors\n", "\n", " n_iters : int, default 15\n", " Number of iterations to train the algorithm\n", " \n", " batch_size : int, default 1000\n", " batch size for batch gradient descent, the original paper\n", " uses stochastic gradient descent (i.e., batch size of 1),\n", " but this can make the training unstable (very sensitive to\n", " learning rate)\n", "\n", " reg : int, default 0.01\n", " Regularization term for the user and item latent factors\n", "\n", " seed : int, default 1234\n", " Seed for the randomly initialized user, item latent factors\n", "\n", " verbose : bool, default True\n", " Whether to print progress bar while training\n", "\n", " Attributes\n", " ----------\n", " user_factors : 2d ndarray, shape [n_users, n_factors]\n", " User latent factors learnt\n", "\n", " item_factors : 2d ndarray, shape [n_items, n_factors]\n", " Item latent factors learnt\n", "\n", " References\n", " ----------\n", " S. Rendle, C. Freudenthaler, Z. Gantner, L. Schmidt-Thieme \n", " Bayesian Personalized Ranking from Implicit Feedback\n", " - https://arxiv.org/abs/1205.2618\n", " \"\"\"\n", " def __init__(self, learning_rate = 0.01, n_factors = 15, n_iters = 10, \n", " batch_size = 1000, reg = 0.01, seed = 1234, verbose = True):\n", " self.reg = reg\n", " self.seed = seed\n", " self.verbose = verbose\n", " self.n_iters = n_iters\n", " self.n_factors = n_factors\n", " self.batch_size = batch_size\n", " self.learning_rate = learning_rate\n", " \n", " # to avoid re-computation at predict\n", " self._prediction = None\n", " \n", " def fit(self, ratings):\n", " \"\"\"\n", " Parameters\n", " ----------\n", " ratings : scipy sparse csr_matrix, shape [n_users, n_items]\n", " sparse matrix of user-item interactions\n", " \"\"\"\n", " indptr = ratings.indptr\n", " indices = ratings.indices\n", " n_users, n_items = ratings.shape\n", " \n", " # ensure batch size makes sense, since the algorithm involves\n", " # for each step randomly sample a user, thus the batch size\n", " # should be smaller than the total number of users or else\n", " # we would be sampling the user with replacement\n", " batch_size = self.batch_size\n", " if n_users < batch_size:\n", " batch_size = n_users\n", " sys.stderr.write('WARNING: Batch size is greater than number of users,'\n", " 'switching to a batch size of {}\\n'.format(n_users))\n", "\n", " batch_iters = n_users // batch_size\n", " \n", " # initialize random weights\n", " rstate = np.random.RandomState(self.seed)\n", " self.user_factors = rstate.normal(size = (n_users, self.n_factors))\n", " self.item_factors = rstate.normal(size = (n_items, self.n_factors))\n", " \n", " # progress bar for training iteration if verbose is turned on\n", " loop = range(self.n_iters)\n", " if self.verbose:\n", " loop = trange(self.n_iters, desc = self.__class__.__name__)\n", " \n", " for _ in loop:\n", " for _ in range(batch_iters):\n", " sampled = self._sample(n_users, n_items, indices, indptr)\n", " sampled_users, sampled_pos_items, sampled_neg_items = sampled\n", " self._update(sampled_users, sampled_pos_items, sampled_neg_items)\n", "\n", " return self\n", " \n", " def _sample(self, n_users, n_items, indices, indptr):\n", " \"\"\"sample batches of random triplets u, i, j\"\"\"\n", " sampled_pos_items = np.zeros(self.batch_size, dtype = np.int)\n", " sampled_neg_items = np.zeros(self.batch_size, dtype = np.int)\n", " sampled_users = np.random.choice(\n", " n_users, size = self.batch_size, replace = False)\n", "\n", " for idx, user in enumerate(sampled_users):\n", " pos_items = indices[indptr[user]:indptr[user + 1]]\n", " pos_item = np.random.choice(pos_items)\n", " neg_item = np.random.choice(n_items)\n", " while neg_item in pos_items:\n", " neg_item = np.random.choice(n_items)\n", "\n", " sampled_pos_items[idx] = pos_item\n", " sampled_neg_items[idx] = neg_item\n", "\n", " return sampled_users, sampled_pos_items, sampled_neg_items\n", " \n", " def _update(self, u, i, j):\n", " \"\"\"\n", " update according to the bootstrapped user u, \n", " positive item i and negative item j\n", " \"\"\"\n", " user_u = self.user_factors[u]\n", " item_i = self.item_factors[i]\n", " item_j = self.item_factors[j]\n", " \n", " # decompose the estimator, compute the difference between\n", " # the score of the positive items and negative items; a\n", " # naive implementation might look like the following:\n", " # r_ui = np.diag(user_u.dot(item_i.T))\n", " # r_uj = np.diag(user_u.dot(item_j.T))\n", " # r_uij = r_ui - r_uj\n", " \n", " # however, we can do better, so\n", " # for batch dot product, instead of doing the dot product\n", " # then only extract the diagonal element (which is the value\n", " # of that current batch), we perform a hadamard product, \n", " # i.e. matrix element-wise product then do a sum along the column will\n", " # be more efficient since it's less operations\n", " # http://people.revoledu.com/kardi/tutorial/LinearAlgebra/HadamardProduct.html\n", " # r_ui = np.sum(user_u * item_i, axis = 1)\n", " #\n", " # then we can achieve another speedup by doing the difference\n", " # on the positive and negative item up front instead of computing\n", " # r_ui and r_uj separately, these two idea will speed up the operations\n", " # from 1:14 down to 0.36\n", " r_uij = np.sum(user_u * (item_i - item_j), axis = 1)\n", " sigmoid = np.exp(-r_uij) / (1.0 + np.exp(-r_uij))\n", " \n", " # repeat the 1 dimension sigmoid n_factors times so\n", " # the dimension will match when doing the update\n", " sigmoid_tiled = np.tile(sigmoid, (self.n_factors, 1)).T\n", "\n", " # update using gradient descent\n", " grad_u = sigmoid_tiled * (item_j - item_i) + self.reg * user_u\n", " grad_i = sigmoid_tiled * -user_u + self.reg * item_i\n", " grad_j = sigmoid_tiled * user_u + self.reg * item_j\n", " self.user_factors[u] -= self.learning_rate * grad_u\n", " self.item_factors[i] -= self.learning_rate * grad_i\n", " self.item_factors[j] -= self.learning_rate * grad_j\n", " return self\n", "\n", " def predict(self):\n", " \"\"\"\n", " Obtain the predicted ratings for every users and items\n", " by doing a dot product of the learnt user and item vectors.\n", " The result will be cached to avoid re-computing it every time\n", " we call predict, thus there will only be an overhead the first\n", " time we call it. Note, ideally you probably don't need to compute\n", " this as it returns a dense matrix and may take up huge amounts of\n", " memory for large datasets\n", " \"\"\"\n", " if self._prediction is None:\n", " self._prediction = self.user_factors.dot(self.item_factors.T)\n", "\n", " return self._prediction\n", "\n", " def _predict_user(self, user):\n", " \"\"\"\n", " returns the predicted ratings for the specified user,\n", " this is mainly used in computing evaluation metric\n", " \"\"\"\n", " user_pred = self.user_factors[user].dot(self.item_factors.T)\n", " return user_pred\n", "\n", " def recommend(self, ratings, N = 5):\n", " \"\"\"\n", " Returns the top N ranked items for given user id,\n", " excluding the ones that the user already liked\n", " \n", " Parameters\n", " ----------\n", " ratings : scipy sparse csr_matrix, shape [n_users, n_items]\n", " sparse matrix of user-item interactions \n", " \n", " N : int, default 5\n", " top-N similar items' N\n", " \n", " Returns\n", " -------\n", " recommendation : 2d ndarray, shape [number of users, N]\n", " each row is the top-N ranked item for each query user\n", " \"\"\"\n", " n_users = ratings.shape[0]\n", " recommendation = np.zeros((n_users, N), dtype = np.uint32)\n", " for user in range(n_users):\n", " top_n = self._recommend_user(ratings, user, N)\n", " recommendation[user] = top_n\n", "\n", " return recommendation\n", "\n", " def _recommend_user(self, ratings, user, N):\n", " \"\"\"the top-N ranked items for a given user\"\"\"\n", " scores = self._predict_user(user)\n", "\n", " # compute the top N items, removing the items that the user already liked\n", " # from the result and ensure that we don't get out of bounds error when \n", " # we ask for more recommendations than that are available\n", " liked = set(ratings[user].indices)\n", " count = N + len(liked)\n", " if count < scores.shape[0]:\n", "\n", " # when trying to obtain the top-N indices from the score,\n", " # using argpartition to retrieve the top-N indices in \n", " # unsorted order and then sort them will be faster than doing\n", " # straight up argort on the entire score\n", " # http://stackoverflow.com/questions/42184499/cannot-understand-numpy-argpartition-output\n", " ids = np.argpartition(scores, -count)[-count:]\n", " best_ids = np.argsort(scores[ids])[::-1]\n", " best = ids[best_ids]\n", " else:\n", " best = np.argsort(scores)[::-1]\n", "\n", " top_n = list(islice((rec for rec in best if rec not in liked), N))\n", " return top_n\n", " \n", " def get_similar_items(self, N = 5, item_ids = None):\n", " \"\"\"\n", " return the top N similar items for itemid, where\n", " cosine distance is used as the distance metric\n", " \n", " Parameters\n", " ----------\n", " N : int, default 5\n", " top-N similar items' N\n", " \n", " item_ids : 1d iterator, e.g. list or numpy array, default None\n", " the item ids that we wish to find the similar items\n", " of, the default None will compute the similar items\n", " for all the items\n", " \n", " Returns\n", " -------\n", " similar_items : 2d ndarray, shape [number of query item_ids, N]\n", " each row is the top-N most similar item id for each\n", " query item id\n", " \"\"\"\n", " # cosine distance is proportional to normalized euclidean distance,\n", " # thus we normalize the item vectors and use euclidean metric so\n", " # we can use the more efficient kd-tree for nearest neighbor search;\n", " # also the item will always to nearest to itself, so we add 1 to \n", " # get an additional nearest item and remove itself at the end\n", " normed_factors = normalize(self.item_factors)\n", " knn = NearestNeighbors(n_neighbors = N + 1, metric = 'euclidean')\n", " knn.fit(normed_factors)\n", "\n", " # returns a distance, index tuple,\n", " # we don't actually need the distance\n", " if item_ids is not None:\n", " normed_factors = normed_factors[item_ids]\n", "\n", " _, items = knn.kneighbors(normed_factors)\n", " similar_items = items[:, 1:].astype(np.uint32)\n", " return similar_items" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "scrolled": true }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "BPR: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 160/160 [00:05<00:00, 27.79it/s]\n" ] }, { "data": { "text/plain": [ "<__main__.BPR at 0x7f85683be0d0>" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# parameters were randomly chosen\n", "bpr_params = {\n", " 'reg': 0.01,\n", " 'learning_rate': 0.1,\n", " 'n_iters': 160,\n", " 'n_factors': 15,\n", " 'batch_size': 100\n", "}\n", "bpr = BPR(**bpr_params)\n", "bpr.fit(X_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Evaluation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In recommender systems, we are often interested in how well the method can rank a given set of items. And to do that we'll use AUC (Area Under ROC Curve as our evaluation metric. The best possible value that the AUC evaluation metric can take is 1, and any non-random ranking that makes sense would have an AUC > 0.5. An intuitive explanation of AUC is it specifies the probability that when we draw two examples at random, their predicted pairwise ranking is correct. The following [documentation](http://nbviewer.jupyter.org/github/ethen8181/machine-learning/blob/master/model_selection/auc/auc.ipynb) has a more detailed discussion on AUC in case you're not familiar with it." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "def auc_score(model, ratings):\n", " \"\"\"\n", " computes area under the ROC curve (AUC).\n", " The full name should probably be mean\n", " auc score as it is computing the auc\n", " for every user's prediction and actual\n", " interaction and taking the average for\n", " all users\n", " \n", " Parameters\n", " ----------\n", " model : BPR instance\n", " Trained BPR model\n", " \n", " ratings : scipy sparse csr_matrix, shape [n_users, n_items]\n", " sparse matrix of user-item interactions\n", " \n", " Returns\n", " -------\n", " auc : float 0.0 ~ 1.0\n", " \"\"\"\n", " auc = 0.0\n", " n_users, n_items = ratings.shape\n", " for user, row in enumerate(ratings):\n", " y_pred = model._predict_user(user)\n", " y_true = np.zeros(n_items)\n", " y_true[row.indices] = 1\n", " auc += roc_auc_score(y_true, y_pred)\n", "\n", " auc /= n_users\n", " return auc" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.897227024734603\n", "0.8263226783060188\n" ] } ], "source": [ "print(auc_score(bpr, X_train))\n", "print(auc_score(bpr, X_test))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Item Recommendations" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Given the trained model, we can get the most similar items by using the `get_similar_items` method, we can specify the number of most similar items by specifying the `N` argument. And this can be seen as the people who like/buy this also like/buy this functionality, since it's recommending similar items for a given item." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[ 8, 236, 124, 221, 110],\n", " [ 582, 176, 98, 425, 421],\n", " [1034, 173, 1026, 176, 87],\n", " ...,\n", " [ 422, 350, 1186, 1501, 1267],\n", " [ 779, 114, 1141, 536, 949],\n", " [ 750, 809, 1394, 551, 1465]], dtype=uint32)" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bpr.get_similar_items(N = 5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "On the other hand, we can also generate a top-N recommended item for each given user, by passing the sparse rating data and `N` to the `recommend` method." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[477, 425, 273, 274, 251],\n", " [293, 123, 319, 317, 924],\n", " [301, 285, 312, 322, 867],\n", " ...,\n", " [285, 274, 99, 621, 300],\n", " [541, 0, 287, 285, 744],\n", " [ 88, 490, 237, 476, 185]], dtype=uint32)" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bpr.recommend(X_train, N = 5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For these two methods, we can go one step further and look-up the actual item for these indices to see if they make intuitive sense. If we wish to do this, the movielens dataset has a `u.item` file that contains metadata about the movie." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Reference" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- [Wiki: Area Under the ROC Curve](http://www.recsyswiki.com/wiki/Area_Under_the_ROC_Curve)\n", "- [StackExchange: Derivative of sigmoid function](http://math.stackexchange.com/questions/78575/derivative-of-sigmoid-function-sigma-x-frac11e-x)\n", "- [Blog: What you wanted to know about AUC](http://fastml.com/what-you-wanted-to-know-about-auc/)\n", "- [Blog: Learning to Rank Sketchfab Models with LightFM](http://blog.ethanrosenthal.com/2016/11/07/implicit-mf-part-2/)\n", "- [Blog (Chinese Mandarin): BPR [Bayesian Personalized Ranking]](http://liuzhiqiangruc.iteye.com/blog/2073526)\n", "- [Github: An implementation of Bayesian Personalised Ranking in Theano](https://github.com/bbc/theano-bpr)\n", "- [Paper: S. Rendle, C. Freudenthaler, Z. Gantner, L. Schmidt-Thieme - Bayesian Personalized Ranking from Implicit Feedback](https://arxiv.org/abs/1205.2618)" ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.11" }, "toc": { "base_numbering": 1, "nav_menu": { "height": "12px", "width": "252px" }, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": true, "toc_position": { "height": "calc(100% - 180px)", "left": "10px", "top": "150px", "width": "235px" }, "toc_section_display": "block", "toc_window_display": true }, "varInspector": { "cols": { "lenName": 16, "lenType": 16, "lenVar": 40 }, "kernels_config": { "python": { "delete_cmd_postfix": "", "delete_cmd_prefix": "del ", "library": "var_list.py", "varRefreshCmd": "print(var_dic_list())" }, "r": { "delete_cmd_postfix": ") ", "delete_cmd_prefix": "rm(", "library": "var_list.r", "varRefreshCmd": "cat(var_dic_list()) " } }, "types_to_exclude": [ "module", "function", "builtin_function_or_method", "instance", "_Feature" ], "window_display": false } }, "nbformat": 4, "nbformat_minor": 2 }