{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Recommending products with RetailRocket event logs\n", "\n", "This IPython notebook illustrates the usage of the [ctpfrec](https://github.com/david-cortes/ctpfrec/) Python package for _Collaborative Topic Poisson Factorization_ in recommender systems based on sparse count data using the [RetailRocket](https://www.kaggle.com/retailrocket/ecommerce-dataset) dataset, consisting of event logs (view, add to cart, purchase) from an online catalog of products plus anonymized text descriptions of items.\n", "\n", "Collaborative Topic Poisson Factorization is a probabilistic model that tries to jointly factorize the user-item interaction matrix along with item-word text descriptions (as bag-of-words) of the items by the product of lower dimensional matrices. The package can also extend this model to add user attributes in the same format as the items’.\n", "\n", "Compared to competing methods such as BPR (Bayesian Personalized Ranking) or weighted-implicit NMF (non-negative matrix factorization of the non-probabilistic type that uses squared loss), it only requires iterating over the data for which an interaction was observed and not over data for which no interaction was observed (i.e. it doesn’t iterate over items not clicked by a user), thus being more scalable, and at the same time producing better results when fit to sparse count data (in general). Same for the word counts of items.\n", "\n", "The implementation here is based on the paper _Content-based recommendations with poisson factorization (Gopalan, P.K., Charlin, L. and Blei, D., 2014)_.\n", "\n", "For a similar package for explicit feedback data see also [cmfrec](https://github.com/david-cortes/cmfrec/). For Poisson factorization without side information see [hpfrec](https://github.com/david-cortes/hpfrec/).\n", "\n", "**Small note: if the TOC here is not clickable or the math symbols don't show properly, try visualizing this same notebook from nbviewer following [this link](http://nbviewer.jupyter.org/github/david-cortes/ctpfrec/blob/master/example/ctpfrec_retailrocket.ipynb).**\n", "\n", "** *\n", "## Sections\n", "* [1. Model description](#p1)\n", "* [2. Loading and processing the dataset](#p2)\n", "* [3. Fitting the model](#p3)\n", "* [4. Common sense checks](#p4)\n", "* [5. Comparison to model without item information](#p5)\n", "* [6. Making recommendations](#p6)\n", "* [7. References](#p7)\n", "** *\n", "\n", "## 1. Model description\n", "\n", "The model consists in producing a low-rank non-negative matrix factorization of the item-word matrix (a.k.a. 
bag-of-words, a matrix where each row represents an item and each column a word, with entries containing the number of times each word appeared in an item’s text, ideally with some pre-processing on the words such as stemming or lemmatization) by the product of two lower-rank matrices\n", "\n", "$$ W_{iw} \\approx \\Theta_{ik} \\beta_{wk}^T $$\n", "\n", "along with another low-rank matrix factorization of the user-item activity matrix (a matrix where each entry corresponds to how many times each user interacted with each item) that shares the same item-factor matrix above plus an offset based on user activity and not based on items’ words\n", "\n", "$$ Y_{ui} \\approx \\eta_{uk} (\\Theta_{ik} + \\epsilon_{ik})^T $$\n", "\n", "These matrices are assumed to come from a generative process as follows:\n", "\n", "* Items:\n", "\n", "$$ \\beta_{wk} \\sim Gamma(a,b) $$\n", "$$ \\Theta_{ik} \\sim Gamma(c,d)$$\n", "$$ W_{iw} \\sim Poisson(\\Theta_{ik} \\beta_{wk}^T) $$\n", "_(Where $W$ is the item-word count matrix, $k$ is the number of latent factors, $i$ is the number of items, $w$ is the number of words)_\n", "\n", "* User-Item interactions\n", "$$ \\eta_{uk} \\sim Gamma(e,f) $$\n", "$$ \\epsilon_{ik} \\sim Gamma(g,h) $$\n", "$$ Y_{ui} \\sim Poisson(\\eta_{uk} (\\Theta_{ik} + \\epsilon_{ik})^T) $$\n", "_(Where $u$ is the number of users, $Y$ is the user-item interaction matrix)_\n", "\n", "The model is fit using mean-field variational inference with coordinate ascent. For more details see the paper in the references.\n", "** *\n", "\n", "## 2. Loading and processing the data\n", "\n", "Reading and concatenating the data. First the event logs:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " timestamp visitorid event itemid transactionid\n", "0 1433221332117 257597 view 355908 NaN\n", "1 1433224214164 992329 view 248676 NaN\n", "2 1433221999827 111016 view 318965 NaN\n", "3 1433221955914 483717 view 253185 NaN\n", "4 1433221337106 951259 view 367447 NaN" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import numpy as np, pandas as pd\n", "\n", "events = pd.read_csv(\"events.csv\")\n", "events.head()" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "view 2664312\n", "addtocart 69332\n", "transaction 22457\n", "Name: event, dtype: int64" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "events.event.value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In order to put all user-item interactions in one scale, I will arbitrarily assign values as follows:\n", "* View: +1\n", "* Add to basket: +3\n", "* Purchase: +3\n", "\n", "Thus, if a user clicks an item, that `(user, item)` pair will have `value=1`, if she later adds it to cart and purchases it, will have `value=7` (plus any other views of the same item), and so on.\n", "\n", "The reasoning behind this scale is because the distributions of counts and sums of counts seem to still follow a nice exponential distribution with these values, but different values might give better results in terms of models fit to them." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAX4AAAD8CAYAAABw1c+bAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMi4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvhp/UCwAAFCxJREFUeJzt3X+MpdV93/H3pyzGlLX4Eehos6AuVmkiUhR+jAiWq2oWahtQVYjkWiBkqE20aYsjp6U/IPkjcVMkpzKmBbeON4EYJ8RrinEWEVyXYKaR/wDCOpjlhwlrsy67WtgA9tpjuzS43/5xz+Lr7ez8uDN3dmfP+yVdzfOc55znnHOf2c88c+69s6kqJEn9+BuHegCSpJVl8EtSZwx+SeqMwS9JnTH4JakzBr8kdcbgl6TOGPyS1BmDX5I6s+ZQDwDg5JNPrg0bNozU9vvf/z7HHXfc8g7oMOec++Cc+7CUOW/btu2Vqjplse0Oi+DfsGEDjz/++Ehtp6enmZqaWt4BHeaccx+ccx+WMuck3xqlnUs9ktQZg1+SOmPwS1JnDH5J6ozBL0mdMfglqTMGvyR1xuCXpM4Y/JLUmVUf/Nt372PDDX9yqIchSavGqg9+SdLiGPyS1BmDX5I6Y/BLUmcMfknqjMEvSZ2ZN/iTvDXJY0m+luTpJB9p5Z9O8kKSJ9rj7FaeJLcm2ZHkySTnjnsSkqSFW8j/wPU6cGFVzSQ5GvhKki+2Y/+mqu45oP4lwBnt8QvAJ9tXSdJhYN47/hqYabtHt0fN0eQy4DOt3SPACUnWLX2okqTlsKA1/iRHJXkC2As8WFWPtkM3teWcW5Ic08rWAy8ONd/VyiRJh4FUzXXzfkDl5ATgC8CvAK8CLwFvATYD36iqf5/kfuCjVfWV1uYh4N9V1eMHnGsTsAlgYmLivC1btow0gb2v7ePlH8JZ648fqf1qNDMzw9q1aw/1MFaUc+6Dc16cjRs3bquqycW2W8ga/5uq6jtJHgYurqqPteLXk/w+8K/b/m7gtKFmp7ayA8+1mcEPDCYnJ2vU/2X+tru2cvP2Ney8arT2q9H09DSjPl+rlXPug3NeGQt5V88p7U6fJMcC7wK+vn/dPkmAy4GnWpP7gKvbu3suAPZV1Z6xjF6StGgLueNfB9yZ5CgGPyjurqr7k3w5ySlAgCeAf9bqPwBcCuwAfgB8YPmHLUka1bzBX1VPAufMUn7hQeoXcN3ShyZJGgc/uStJnTH4JakzBr8kdcbgl6TOGPyS1BmDX5I6Y/BLUmcMfknqjMEvSZ0x+CWpMwa/JHXG4Jekzhj8ktQZg1+SOmPwS1JnDH5J6ozBL0mdMfglqTMGvyR1Zt7gT/LWJI8l+VqSp5N8pJWfnuTRJDuSfC7JW1r5MW1/Rzu+YbxTkCQtxkLu+F8HLqyqnwfOBi5OcgHw28AtVfV3gG8D17b61wLfbuW3tHqSpMPEvMFfAzNt9+j2KOBC4J5Wfidwedu+rO3Tjl+UJMs2YknSkixojT/JUUmeAPYCDwLfAL5TVW+0KruA9W17PfAiQDu+D/ip5Ry0JGl0axZSqap+BJyd5ATgC8DPLrXjJJuATQATExNMT0+PdJ6JY+H6s94Yuf1qNDMz09V8wTn3wjmvjAUF/35V9Z0kDwPvAE5Isqbd1Z8K7G7VdgOnAbuSrAGOB16d5Vybgc0Ak5OTNTU1NdIEbrtrKzdvX8POq0ZrvxpNT08z6vO1WjnnPjjnlbGQd/Wc0u70SXIs8C7gWeBh4L2t2jXA1rZ9X9unHf9yVdVyDlqSNLqF3PGvA+5MchSDHxR3V9X9SZ4BtiT5D8BfALe3
+rcDf5BkB/AacMUYxi1JGtG8wV9VTwLnzFL+TeD8Wcr/N/BPlmV0kqRl5yd3JakzBr8kdcbgl6TOGPyS1BmDX5I6Y/BLUmcMfknqjMEvSZ0x+CWpMwa/JHXG4Jekzhj8ktQZg1+SOmPwS1JnDH5J6ozBL0mdMfglqTMGvyR1xuCXpM7MG/xJTkvycJJnkjyd5MOt/DeT7E7yRHtcOtTmxiQ7kjyX5D3jnIAkaXHm/c/WgTeA66vqq0neBmxL8mA7dktVfWy4cpIzgSuAnwN+GvjTJH+3qn60nAOXJI1m3jv+qtpTVV9t298DngXWz9HkMmBLVb1eVS8AO4Dzl2OwkqSlW9Qaf5INwDnAo63oQ0meTHJHkhNb2XrgxaFmu5j7B4UkaQWlqhZWMVkL/E/gpqq6N8kE8ApQwG8B66rqg0k+ATxSVX/Y2t0OfLGq7jngfJuATQATExPnbdmyZaQJ7H1tHy//EM5af/xI7VejmZkZ1q5de6iHsaKccx+c8+Js3LhxW1VNLrbdQtb4SXI08Hngrqq6F6CqXh46/rvA/W13N3DaUPNTW9lPqKrNwGaAycnJmpqaWuzYAbjtrq3cvH0NO68arf1qND09zajP12rlnPvgnFfGQt7VE+B24Nmq+vhQ+bqhar8IPNW27wOuSHJMktOBM4DHlm/IkqSlWMgd/zuB9wPbkzzRyn4NuDLJ2QyWenYCvwxQVU8nuRt4hsE7gq7zHT2SdPiYN/ir6itAZjn0wBxtbgJuWsK4JElj4id3JakzBr8kdcbgl6TOGPyS1BmDX5I6Y/BLUmcMfknqjMEvSZ0x+CWpMwa/JHXG4Jekzhj8ktQZg1+SOmPwS1JnDH5J6ozBL0mdMfglqTMGvyR1xuCXpM7MG/xJTkvycJJnkjyd5MOt/KQkDyZ5vn09sZUnya1JdiR5Msm5456EJGnhFnLH/wZwfVWdCVwAXJfkTOAG4KGqOgN4qO0DXAKc0R6bgE8u+6glSSObN/irak9VfbVtfw94FlgPXAbc2ardCVzeti8DPlMDjwAnJFm37COXJI1kUWv8STYA5wCPAhNVtacdegmYaNvrgReHmu1qZZKkw8CahVZMshb4PPCrVfXdJG8eq6pKUovpOMkmBktBTExMMD09vZjmb5o4Fq4/642R269GMzMzXc0XnHMvnPPKWFDwJzmaQejfVVX3tuKXk6yrqj1tKWdvK98NnDbU/NRW9hOqajOwGWBycrKmpqZGmsBtd23l5u1r2HnVaO1Xo+npaUZ9vlYr59wH57wyFvKungC3A89W1ceHDt0HXNO2rwG2DpVf3d7dcwGwb2hJSJJ0iC3kjv+dwPuB7UmeaGW/BnwUuDvJtcC3gPe1Yw8AlwI7gB8AH1jWEUuSlmTe4K+qrwA5yOGLZqlfwHVLHJckaUz85K4kdcbgl6TOGPyS1BmDX5I6Y/BLUmcMfknqjMEvSZ0x+CWpMwa/JHXG4Jekzhj8ktQZg1+SOmPwS1JnDH5J6ozBL0mdMfglqTMGvyR1xuCXpM4Y/JLUmXmDP8kdSfYmeWqo7DeT7E7yRHtcOnTsxiQ7kjyX5D3jGrgkaTQLueP/NHDxLOW3VNXZ7fEAQJIzgSuAn2tt/muSo5ZrsJKkpZs3+Kvqz4DXFni+y4AtVfV6Vb0A7ADOX8L4JEnLbClr/B9K8mRbCjqxla0HXhyqs6uVSZIOE6mq+SslG4D7q+rvtf0J4BWggN8C1lXVB5N8Anikqv6w1bsd+GJV3TPLOTcBmwAmJibO27Jly0gT2PvaPl7+IZy1/viR2q9GMzMzrF279lAPY0U55z4458XZuHHjtqqaXGy7NaN0VlUv799O8rvA/W13N3DaUNVTW9ls59gMbAaYnJysqampUYbCbXdt5ebta9h51WjtV6Pp6WlGfb5WK+fcB+e8MkZa6kmybmj3F4H97/i5D7giyTFJTgfOAB5b2hAlSctp3jv+JJ8FpoCTk+wCfgOYSnI2g6WencAvA1TV00nuBp4B3gCuq6ofjWfokqRRzBv8VXXlLMW3z1H/JuCmpQxKkjQ+fnJXkjpj8EtSZwx+SeqMwS9JnTH4JakzBr8kdcbgl6TOGPyS1BmDX5I6Y/BLUmcMfknqjMEvSZ0x+CWpMwa/JHXG4Jekzhj8ktQZg1+SOmPwS1JnDH5J6sy8wZ/kjiR7kzw1VHZSkgeTPN++ntjKk+TWJDuSPJnk3HEOXpK0eAu54/80cPEBZTcAD1XVGcBDbR/gEuCM9tgEfHJ5hilJWi7zBn9V/Rnw2gHFlwF3tu07gcuHyj9TA48AJyRZt1yDlSQt3ahr/BNVtadtvwRMtO31wItD9Xa1MknSYWLNUk9QVZWkFtsuySYGy0FMTEwwPT09Uv8Tx8L1Z70xcvvVaGZmpqv5gnPuhXNeGaMG/8tJ1lXVnraUs7eV7wZOG6p3aiv7/1TVZmAzwOTkZE1NTY00kNvu2srN29ew86rR2q9G09PTjPp8rVbOuQ/OeWWMutRzH3BN274G2DpUfnV7d88FwL6hJSFJ0mFg3jv+JJ8FpoCTk+wCfgP4KHB3kmuBbwHva9UfAC4FdgA/AD4whjFLkpZg3uCvqisPcuiiWeoWcN1SByVJGh8/uStJnTH4JakzBr8kdcbgl6TOGPyS1BmDX5I6Y/BLUmcMfknqjMEvSZ0x+CWpMwa/JHXG4Jekzhj8ktQZg1+SOmPwS1JnDH5J6ozBL0mdMfglqTMGvyR1Zt7/c3cuSXYC3wN+BLxRVZNJTgI+B2wAdgLvq6pvL22YkqTlshx3/Bur6uyqmmz7NwAPVdUZwENtX5J0mBjHUs9lwJ1t+07g8jH0IUka0VKDv4D/kWRbkk2tbKKq9rTtl4CJJfYhSVpGqarRGyfrq2p3kr8FPAj8CnBfVZ0wVOfbVXXiLG03AZsAJiYmztuyZctIY9j72j5e/iGctf74kdqvRjMzM6xdu/ZQD2NFOec+OOfF2bhx47ahZfYFW9KLu1W1u33dm+QLwPnAy0nWVdWeJOuAvQdpuxnYDDA5OVlTU1MjjeG2u7Zy8/Y17LxqtPar0fT0NKM+X6uVc+6Dc14ZIy/1JDkuydv2bwPvBp4C7gOuadWuAbYudZCSpOWzlDv+CeALSfaf54+q6r8n+XPg7iTXAt8C3rf0YUqSlsvIwV9V3wR+fpbyV4GLljIoSdL4+MldSeqMwS9JnTH4JakzBr8kdcbgl6TOGPyS1BmDX5I6Y/BLUmcMfknqjMEvSZ0x+CWpMwa/JHXG4Jekzhj8ktQZg1+SOmPwS1JnDH5J6ozBL0mdMfglqTNjC/4kFyd5LsmOJDeMqx9J0uKMJfiTHAX8F+AS4EzgyiRnjqMvSdLijOuO/3xgR1V9s6r+D7AFuGxMfUmSFmFcwb8eeHFof1crG5sNN/zJm4/hsoPVXcw5DyeH23gkjeZQ5kuqavlPmrwXuLiqfqntvx/4har60FCdTcCmtvszwHMjdncy8MoShrsaOec+OOc+LGXOf7uqTllsozUjdjaf3cBpQ/untrI3VdVmYPNSO0ryeFVNLvU
8q4lz7oNz7sOhmPO4lnr+HDgjyelJ3gJcAdw3pr4kSYswljv+qnojyYeALwFHAXdU1dPj6EuStDjjWuqhqh4AHhjX+YcsebloFXLOfXDOfVjxOY/lxV1J0uHLP9kgSZ1Z1cG/2v4sRJLTkjyc5JkkTyf5cCs/KcmDSZ5vX09s5Ulya5vfk0nOHTrXNa3+80muGSo/L8n21ubWJJmrjxWc+1FJ/iLJ/W3/9CSPtnF+rr0JgCTHtP0d7fiGoXPc2MqfS/KeofJZvw8O1scKzfeEJPck+XqSZ5O840i/zkn+Zfu+firJZ5O89Ui7zknuSLI3yVNDZYfsus7Vx5yqalU+GLxo/A3g7cBbgK8BZx7qcc0z5nXAuW37bcBfMviTFv8RuKGV3wD8dtu+FPgiEOAC4NFWfhLwzfb1xLZ9Yjv2WKub1vaSVj5rHys4938F/BFwf9u/G7iibf8O8M/b9r8AfqdtXwF8rm2f2a7xMcDp7dofNdf3wcH6WKH53gn8Utt+C3DCkXydGXxA8wXg2KHn/p8eadcZ+AfAucBTQ2WH7LoerI9557FS/xDGcAHeAXxpaP9G4MZDPa5FzmEr8C4GH15b18rWAc+17U8BVw7Vf64dvxL41FD5p1rZOuDrQ+Vv1jtYHys0z1OBh4ALgfvbN+krwJoDryWDd4K9o22vafVy4PXdX+9g3wdz9bEC8z2eQQjmgPIj9jrz40/rn9Su2/3Ae47E6wxs4CeD/5Bd14P1Md8cVvNSz4r/WYjl1H61PQd4FJioqj3t0EvARNs+2BznKt81Szlz9LES/hPwb4H/2/Z/CvhOVb3R9ofH+ebc2vF9rf5in4u5+hi304G/An4/g+Wt30tyHEfwda6q3cDHgP8F7GFw3bZxZF/n/Q7ldR0pB1dz8K9aSdYCnwd+taq+O3ysBj+2x/pWq5XoY78k/wjYW1XbVqK/w8QaBssBn6yqc4DvM/j1/E1H4HU+kcEfYjwd+GngOODilej7cLJarutqDv55/yzE4SjJ0QxC/66qurcVv5xkXTu+Dtjbyg82x7nKT52lfK4+xu2dwD9OspPBX2m9EPjPwAlJ9n+OZHicb86tHT8eeJXFPxevztHHuO0CdlXVo23/HgY/CI7k6/wPgReq6q+q6q+Bexlc+yP5Ou93KK/rSDm4moN/1f1ZiPYK/e3As1X18aFD9wH7X9m/hsHa//7yq9sr9xcA+9qve18C3p3kxHan9W4G65p7gO8muaD1dfUB55qtj7Gqqhur6tSq2sDgGn25qq4CHgbeO8t4hsf53la/WvkV7d0gpwNnMHghbNbvg9bmYH2MVVW9BLyY5Gda0UXAMxzB15nBEs8FSf5mG9P+OR+x13nIobyuB+tjbuN8EWTcDwavaP8lg1f7f/1Qj2cB4/37DH5FexJ4oj0uZbBO+RDwPPCnwEmtfhj8hzbfALYDk0Pn+iCwoz0+MFQ+CTzV2nyCH39Ib9Y+Vnj+U/z4XT1vZ/APegfw34BjWvlb2/6OdvztQ+1/vc3rOdq7Heb6PjhYHys017OBx9u1/mMG7944oq8z8BHg621cf8DgnTlH1HUGPsvgNYy/ZvCb3bWH8rrO1cdcDz+5K0mdWc1LPZKkERj8ktQZg1+SOmPwS1JnDH5J6ozBL0mdMfglqTMGvyR15v8BJn7Kx+tzHbgAAAAASUVORK5CYII=\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "%matplotlib inline\n", "\n", "equiv = {\n", " 'view':1,\n", " 'addtocart':3,\n", " 'transaction':3\n", "}\n", "events['count']=events.event.map(equiv)\n", "events.groupby('visitorid')['count'].sum().value_counts().hist(bins=200)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " UserId ItemId Count\n", "0 0 67045 1\n", "1 0 285930 1\n", "2 0 357564 1\n", "3 1 72028 1\n", "4 2 216305 2" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "events = events.groupby(['visitorid','itemid'])['count'].sum().to_frame().reset_index()\n", "events.rename(columns={'visitorid':'UserId', 'itemid':'ItemId', 'count':'Count'}, inplace=True)\n", "events.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now creating a train and test split. For simplicity, and in order to be able to make a fair comparison with a model that doesn't use item descriptions, I will try to take only users that had >= 3 items in the training data, and items that had >= 3 users.\n", "\n", "Given the lack of user attributes and the fact that it will be compared later to a model without side information, the test set will only have users from the training data, but it's also possible to use user attributes if they follow the same format as the items', in which case the model can also recommend items to new users.\n", "\n", "In order to compare it later to a model without items' text, I will also filter the test set down to only items that were in the training set. **This is however not a model limitation, as it can also recommend items that have descriptions but no user interactions**." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(381963, 3)\n", "(68490, 3)\n" ] } ], "source": [ "from sklearn.model_selection import train_test_split\n", "\n", "events_train, events_test = train_test_split(events, test_size=.2, random_state=1)\n", "del events\n", "\n", "## In order to find users and items with at least 3 interactions each,\n", "## it's easier and faster to use a simple heuristic that first filters according to one criterion,\n", "## then according to the other, and repeats.\n", "## Finding a real subset of the data in which each item has strictly >= 3 users,\n", "## and each user has strictly >= 3 items, is a harder graph partitioning or optimization\n", "## problem. 
For a similar example of finding such subsets see also:\n", "## http://nbviewer.ipython.org/github/david-cortes/datascienceprojects/blob/master/optimization/dataset_splitting.ipynb\n", "users_filter_out = events_train.groupby('UserId')['ItemId'].agg(lambda x: len(tuple(x)))\n", "users_filter_out = np.array(users_filter_out.index[users_filter_out < 3])\n", "\n", "items_filter_out = events_train.loc[~np.in1d(events_train.UserId, users_filter_out)].groupby('ItemId')['UserId'].agg(lambda x: len(tuple(x)))\n", "items_filter_out = np.array(items_filter_out.index[items_filter_out < 3])\n", "\n", "users_filter_out = events_train.loc[~np.in1d(events_train.ItemId, items_filter_out)].groupby('UserId')['ItemId'].agg(lambda x: len(tuple(x)))\n", "users_filter_out = np.array(users_filter_out.index[users_filter_out < 3])\n", "\n", "events_train = events_train.loc[~np.in1d(events_train.UserId.values, users_filter_out)]\n", "events_train = events_train.loc[~np.in1d(events_train.ItemId.values, items_filter_out)]\n", "events_test = events_test.loc[np.in1d(events_test.UserId.values, events_train.UserId.values)]\n", "events_test = events_test.loc[np.in1d(events_test.ItemId.values, events_train.ItemId.values)]\n", "\n", "print(events_train.shape)\n", "print(events_test.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now processing the text descriptions of the items:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " timestamp itemid property value\n", "0 1435460400000 460429 categoryid 1338\n", "1 1441508400000 206783 888 1116713 960601 n277.200\n", "2 1439089200000 395014 400 n552.000 639502 n720.000 424566\n", "3 1431226800000 59481 790 n15360.000\n", "4 1431831600000 156781 917 828513" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "iteminfo = pd.read_csv(\"item_properties_part1.csv\")\n", "iteminfo2 = pd.read_csv(\"item_properties_part2.csv\")\n", "iteminfo = iteminfo.append(iteminfo2, ignore_index=True)\n", "iteminfo.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The items' descriptions contain many fields and have a mixture of words and numbers. The numeric variables, as per the documentation, are prefixed with an \"n\" and have three digits of decimal precision - I will exclude them here since this model is insensitive to numeric attributes such as price. The words are already lemmatized, and since we only have their IDs, it's not possible to do any other pre-processing on them.\n", "\n", "Although the descriptions don't say anything about it, looking at the contents and the lengths of the different fields, here I will assume that the field $283$ is the product title and the field $888$ is the product description. I will just concatenate them to obtain an overall item text, but there might be better ways of doing this (such as having different IDs for the same word when it appears in the title or the body, or multiplying those in the title by some number, etc.).\n", "\n", "As the descriptions vary over time, I will only take the most recent version for each item:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " timestamp itemid property \\\n", "0 1431226800000 0 283 \n", "1 1433041200000 0 888 \n", "2 1435460400000 1 283 \n", "3 1442113200000 1 888 \n", "4 1431226800000 2 283 \n", "\n", " value \n", "0 66094 372274 478989 \n", "1 478989 \n", "2 513325 1020281 1204938 172646 72261 30603 8980... \n", "3 172646 1154859 \n", "4 822092 325894 504272 147366 343631 648485 n600... " ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "iteminfo = iteminfo.loc[iteminfo.property.isin(('888','283'))]\n", "iteminfo = iteminfo.loc[iteminfo.groupby(['itemid','property'])['timestamp'].idxmax()]\n", "iteminfo.reset_index(drop=True, inplace=True)\n", "iteminfo.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Note that for simplicity I am completely ignoring the categories (these are easily incorporated e.g. by adding a count of +1 for each category to which an item belongs) and important factors such as the price. I am also completely ignoring all the other fields.**" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " ItemId WordId Count\n", "0 0 194496 2\n", "1 0 164052 1\n", "2 0 245598 1\n", "3 1 44188 1\n", "4 1 286671 1" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.feature_extraction.text import CountVectorizer\n", "from scipy.sparse import coo_matrix\n", "import re\n", "\n", "def concat_fields(x):\n", "    x = list(x)\n", "    out = x[0]\n", "    for i in x[1:]:\n", "        out += \" \" + i\n", "    return out\n", "\n", "## Tokenizer that keeps the hashed word IDs (tokens starting with a digit)\n", "## and drops the numeric values, which are prefixed with 'n' in this dataset\n", "class NonNumberTokenizer(object):\n", "    def __init__(self):\n", "        pass\n", "    def __call__(self, txt):\n", "        return [i for i in txt.split(\" \") if bool(re.search(\"^\\d\", i))]\n", "\n", "iteminfo = iteminfo.groupby('itemid')['value'].agg(lambda x: concat_fields(x))\n", "\n", "t = CountVectorizer(tokenizer=NonNumberTokenizer(), stop_words=None,\n", "                    dtype=np.int32, strip_accents=None, lowercase=False)\n", "bag_of_words = t.fit_transform(iteminfo)\n", "\n", "bag_of_words = coo_matrix(bag_of_words)\n", "bag_of_words = pd.DataFrame({\n", "    'ItemId' : iteminfo.index[bag_of_words.row],\n", "    'WordId' : bag_of_words.col,\n", "    'Count' : bag_of_words.data\n", "})\n", "del iteminfo\n", "bag_of_words.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this case, I will not filter it down to only the items that were in the training set, as other items can still be used to get better latent factors." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "** *\n", "\n", "## 3. Fitting the model\n", "\n", "Fitting the model - note that I'm using some enhancements (passed as arguments to the class constructor) over the original version in the paper:\n", "* Standardizing item counts so as not to favor items with longer descriptions.\n", "* Initializing $\\Theta$ and $\\beta$ through hierarchical Poisson factorization instead of latent Dirichlet allocation.\n", "* Using a small step size for the updates of the parameters obtained from hierarchical Poisson factorization at the beginning, which then grows to one with increasing iteration numbers (informally, this somewhat \"preserves\" these fits while the user parameters are adjusted to the already-fit item parameters - then, once the user parameters are aligned with them, the item and word parameters start changing too).\n", "\n", "I'll also be fitting two slightly different models: one that takes (and can make recommendations for) all the items for which there are either descriptions or user clicks, and another that uses all the items for which there are descriptions to initialize the item-related parameters but discards the ones without clicks (can only make recommendations for items that users have clicked).\n", "\n", "For more information about the parameters and what they do, see the online documentation:\n", "\n", "[http://ctpfrec.readthedocs.io](http://ctpfrec.readthedocs.io)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(381963, 3)\n", "(68490, 3)\n", "(7676561, 3)\n" ] } ], "source": [ "print(events_train.shape)\n", "print(events_test.shape)\n", "print(bag_of_words.shape)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "*****************************************\n", "Collaborative Topic Poisson Factorization\n", "*****************************************\n", "\n", "Number of users: 65913\n", "Number of items: 418301\n", "Number of words: 342260\n", "Latent 
factors to use: 70\n", "\n", "Initializing parameters...\n", "Initializing Theta and Beta through HPF...\n", "\n", "**********************************\n", "Hierarchical Poisson Factorization\n", "**********************************\n", "\n", "Number of users: 417053\n", "Number of items: 342260\n", "Latent factors to use: 70\n", "\n", "Initializing parameters...\n", "Allocating Phi matrix...\n", "Initializing optimization procedure...\n", "Iteration 10 | Norm(Theta_{10} - Theta_{0}): 3373.40234\n", "Iteration 20 | Norm(Theta_{20} - Theta_{10}): 13.27755\n", "Iteration 30 | Norm(Theta_{30} - Theta_{20}): 11.13662\n", "Iteration 40 | Norm(Theta_{40} - Theta_{30}): 5.30947\n", "Iteration 50 | Norm(Theta_{50} - Theta_{40}): 3.23760\n", "Iteration 60 | Norm(Theta_{60} - Theta_{50}): 2.57951\n", "Iteration 70 | Norm(Theta_{70} - Theta_{60}): 1.99546\n", "Iteration 80 | Norm(Theta_{80} - Theta_{70}): 1.91506\n", "Iteration 90 | Norm(Theta_{90} - Theta_{80}): 1.49374\n", "Iteration 100 | Norm(Theta_{100} - Theta_{90}): 1.17536\n", "\n", "\n", "Optimization finished\n", "Final log-likelihood: -54256333\n", "Final RMSE: 2.4187\n", "Minutes taken (optimization part): 23.7\n", "\n", "**********************************\n", "\n", "Allocating intermediate matrices...\n", "Initializing optimization procedure...\n", "Iteration 10 | train llk: -6305341 | train rmse: 2.8694\n", "Iteration 20 | train llk: -6248204 | train rmse: 2.8681\n", "Iteration 30 | train llk: -6228858 | train rmse: 2.8675\n", "Iteration 40 | train llk: -6220805 | train rmse: 2.8672\n", "Iteration 50 | train llk: -6212324 | train rmse: 2.8670\n", "Iteration 60 | train llk: -6212101 | train rmse: 2.8670\n", "\n", "\n", "Optimization finished\n", "Final log-likelihood: -6212101\n", "Final RMSE: 2.8670\n", "Minutes taken (optimization part): 15.4\n", "\n", "Producing Python dictionaries...\n", "CPU times: user 5h 10min 38s, sys: 1min 39s, total: 5h 12min 18s\n", "Wall time: 39min 46s\n" ] } ], "source": [ "%%time\n", "from ctpfrec import CTPF\n", "\n", "recommender_all_items = CTPF(k=70, step_size=lambda x: 1-1/np.sqrt(x+1),\n", " standardize_items=True, initialize_hpf=True, reindex=True,\n", " missing_items='include', allow_inconsistent_math=True, random_seed=1)\n", "recommender_all_items.fit(counts_df=events_train.copy(), words_df=bag_of_words.copy())" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "*****************************************\n", "Collaborative Topic Poisson Factorization\n", "*****************************************\n", "\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/home/david_cortes_rivera/ctpfrec/ctpfrec.py:463: UserWarning: Some words are associated only with items that are in 'words_df' but not in 'counts_df'. These will be used to initialize Beta but will be excluded from the final model. If you still wish to include them in the model, use 'missing_items='include''. 
For information about which words are used by the model, see the attribute 'word_mapping_'.\n", " warnings.warn(msg)\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Number of users: 65913\n", "Number of items: 39578\n", "Number of words: 67980\n", "Latent factors to use: 70\n", "\n", "Initializing parameters...\n", "Initializing Theta and Beta through HPF...\n", "\n", "**********************************\n", "Hierarchical Poisson Factorization\n", "**********************************\n", "\n", "Number of users: 417053\n", "Number of items: 342260\n", "Latent factors to use: 70\n", "\n", "Initializing parameters...\n", "Allocating Phi matrix...\n", "Initializing optimization procedure...\n", "Iteration 10 | Norm(Theta_{10} - Theta_{0}): 3373.40234\n", "Iteration 20 | Norm(Theta_{20} - Theta_{10}): 13.27888\n", "Iteration 30 | Norm(Theta_{30} - Theta_{20}): 11.13438\n", "Iteration 40 | Norm(Theta_{40} - Theta_{30}): 5.31399\n", "Iteration 50 | Norm(Theta_{50} - Theta_{40}): 3.23850\n", "Iteration 60 | Norm(Theta_{60} - Theta_{50}): 2.54416\n", "Iteration 70 | Norm(Theta_{70} - Theta_{60}): 1.98683\n", "Iteration 80 | Norm(Theta_{80} - Theta_{70}): 1.91646\n", "Iteration 90 | Norm(Theta_{90} - Theta_{80}): 1.50169\n", "Iteration 100 | Norm(Theta_{100} - Theta_{90}): 1.18380\n", "\n", "\n", "Optimization finished\n", "Final log-likelihood: -54259927\n", "Final RMSE: 2.4187\n", "Minutes taken (optimization part): 23.7\n", "\n", "**********************************\n", "\n", "Allocating intermediate matrices...\n", "Initializing optimization procedure...\n", "Iteration 10 | train llk: -5006436 | train rmse: 2.8536\n", "Iteration 20 | train llk: -4944714 | train rmse: 2.8482\n", "Iteration 30 | train llk: -4924733 | train rmse: 2.8460\n", "Iteration 40 | train llk: -4926419 | train rmse: 2.8454\n", "\n", "\n", "Optimization finished\n", "Final log-likelihood: -4926419\n", "Final RMSE: 2.8454\n", "Minutes taken (optimization part): 1.9\n", "\n", "Producing Python dictionaries...\n", "CPU times: user 3h 24min 4s, sys: 38.5 s, total: 3h 24min 42s\n", "Wall time: 25min 59s\n" ] } ], "source": [ "%%time\n", "recommender_clicked_items_only = CTPF(k=70, step_size=lambda x: 1-1/np.sqrt(x+1),\n", " standardize_items=True, initialize_hpf=True, reindex=True,\n", " missing_items='exclude', allow_inconsistent_math=True, random_seed=1)\n", "recommender_clicked_items_only.fit(counts_df=events_train.copy(), words_df=bag_of_words.copy())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Most of the time here was spent in fitting the model to items that no user in the training set had clicked. If using instead a random initialization, it would have taken a lot less time to fit this model (there would be only a fraction of the items - see above time spent in each procedure), but the results are slightly worse.\n", "\n", "_Disclaimer: this notebook was run on a Google cloud server with Skylake CPU using 8 cores, and memory usage tops at around 6GB of RAM for the first model (including all the objects loaded before). In a desktop computer, it would take a bit longer to fit._\n", "** *\n", "\n", "## 4. Common sense checks\n", "\n", "There are many different metrics to evaluate recommendation quality in implicit datasets, but all of them have their drawbacks. 
The idea of this notebook is to illustrate the package usage and not to introduce and compare evaluation metrics, so I will only perform some common sense checks on the test data.\n", "\n", "For implementations of evaluation metrics for implicit recommendations see other packages such as [lightFM](https://github.com/lyst/lightfm).\n", "\n", "As some common sense checks, the predictions should:\n", "* Be higher for this non-zero hold-out sample than for random items.\n", "* Produce a good discrimination between random items and those in the hold-out sample (closely related to the first point).\n", "* Be correlated with the number of events per user-item pair in the hold-out sample.\n", "* Follow an exponential distribution rather than a normal or some other symmetric distribution.\n", "\n", "Here I'll check these four conditions:\n", "\n", "#### Model with all items" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Average prediction for combinations in test set: 0.017780766\n", "Average prediction for random combinations: 0.0047758827\n" ] } ], "source": [ "events_test['Predicted'] = recommender_all_items.predict(user=events_test.UserId, item=events_test.ItemId)\n", "events_test['RandomItem'] = np.random.choice(events_train.ItemId.unique(), size=events_test.shape[0])\n", "events_test['PredictedRandom'] = recommender_all_items.predict(user=events_test.UserId,\n", "                                                               item=events_test.RandomItem)\n", "print(\"Average prediction for combinations in test set: \", events_test.Predicted.mean())\n", "print(\"Average prediction for random combinations: \", events_test.PredictedRandom.mean())" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.7079527323667897" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.metrics import roc_auc_score\n", "\n", "was_clicked = np.r_[np.ones(events_test.shape[0]), np.zeros(events_test.shape[0])]\n", "score_model = np.r_[events_test.Predicted.values, events_test.PredictedRandom.values]\n", "roc_auc_score(was_clicked[~np.isnan(score_model)], score_model[~np.isnan(score_model)])" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.12031331307638801" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.corrcoef(events_test.Count[~events_test.Predicted.isnull()], events_test.Predicted[~events_test.Predicted.isnull()])[0,1]" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "image/png": 
"iVBORw0KGgoAAAANSUhEUgAAAYcAAAD8CAYAAACcjGjIAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMi4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvhp/UCwAAEhNJREFUeJzt3H+s3XV9x/Hna60gsEFhuKZpm5RkjQuSqHgDXVyWTbJSwKz8sRjMJo1p7B/i4rIlW90/ZLo/8J/pSBxJI53t5saIztAo2jVIspis0FtBEKrjjkG4DdjNQplroqF774/7qT3yue09/fm91/t8JCfn+31/P9/veZ+Tm/s63+/5nJOqQpKkUb8wdAOSpPnHcJAkdQwHSVLHcJAkdQwHSVLHcJAkdQwHSVLHcJAkdQwHSVJn6dANnKmrr7661qxZM3QbkrRg7N+//7+r6m3jjF2w4bBmzRomJyeHbkOSFowkL4471stKkqSO4SBJ6hgOkqSO4SBJ6hgOkqSO4SBJ6hgOkqSO4SBJ6hgOkqTOgg2Hpw8eYc3Wr7Fm69eGbkWSfu4s2HCQJJ0/hoMkqWM4SJI6hoMkqWM4SJI6hoMkqWM4SJI6hoMkqWM4SJI6Y4VDkmVJvpTke0kOJPn1JFcl2ZPkuXZ/ZRubJPcmmUryVJLrR46zqY1/Lsmmkfp7kjzd9rk3Sc79U5UkjWvcM4e/Br5RVb8GvBM4AGwFHqmqtcAjbR3gFmBtu20B7gNIchVwN3AjcANw9/FAaWM+MrLfhrN7WpKkszFnOCS5AvhN4H6AqvpJVb0GbAR2tGE7gNvb8kZgZ83YCyxLsgK4GdhTVYer6lVgD7Chbbu8qvZWVQE7R44lSRrAOGcO1wD/BfxtkieSfD7JZcDyqnq5jXkFWN6WVwIvjew/3Wqnqk/PUpckDWSccFgKXA/cV1XvBv6XE5eQAGjv+Ovct/ezkmxJMplk8tjRI+f74SRp0RonHKaB6ap6rK1/iZmw+EG7JES7P9S2HwRWj+y/qtVOVV81S71TVduqaqKqJpZcesUYrUuSzsSc4VBVrwAvJXl7K90EPAvsAo7PONoEPNSWdwF3tllL64Aj7fLTbmB9kivbB9Hrgd1t2+tJ1rVZSneOHEuSNIClY477Q+CLSS4Cngc+zEywPJhkM/Ai8IE29mHgVmAKONrGUlWHk3wK2NfGfbKqDrfljwJfAC4Bvt5ukqSBjBUOVfUkMDHLpptmGVvAXSc5znZg+yz1SeC6cXqRJJ1/fkNaktQxHCRJHcNBktQxHCRJHcNBktQxHCRJHcNBktQxHCRJHcNBktQxHCRJHcNBktQxHCRJHcNBktQxHCRJHcNBktQxHCRJHcNBktQxHCRJHcNBktQxHCRJHcNBktQxHCRJHcNBktQZKxySvJDk6SRPJplstauS7EnyXLu/stWT5N4kU0meSnL9yHE2tfHPJdk0Un9PO/5U2zfn+olKksZ3OmcOv11V76qqiba+FXikqtYCj7R1gFuAte22BbgPZsIEuBu4EbgBuPt4oLQxHxnZb8MZPyNJ0lk7m8tKG4EdbXkHcPtIfWfN2AssS7ICuBnYU1WHq+pVYA+woW27vKr2VlUBO0eOJUkawLjhUMC/JNmfZEurLa+ql9vyK8DytrwSeGlk3+lWO1V9epa6JGkgS8cc9xtVdTDJrwB7knxvdGNVVZI69+39rBZMWwCWXP628/1wkrRojXXmUFUH2/0h4CvMfGbwg3ZJiHZ/qA0/CKwe2X1Vq52qvmqW+mx9bKuqiaqaWHLpFeO0Lkk6A3OGQ5LLkvzS8WVgPfBdYBdwfMbRJuChtrwLuLPNWloHHGmXn3YD65Nc2T6IXg/sbtteT7KuzVK6c+RYkqQBjHNZaTnwlTa7dCnwD1X1jST7gAeTbAZeBD7Qxj8M3ApMAUeBDwNU1eEknwL2tXGfrKrDbfmjwBeAS4Cvt5skaSBzhkNVPQ+8c5b6D4GbZqkXcNdJjrUd2D5LfRK4box+JUkXgN+QliR1DAdJUsdwkCR1DAdJUsdwkCR1DAdJUsdwkCR1DAdJUsdwkCR1DAdJUsdwkCR1DAdJUsdwkCR1DAdJUsdwkCR1DAdJUsdwkCR1DAdJUsdwkCR1DAdJUsdwkCR1DAdJUsdwkCR1xg6HJEuSPJHkq239miSPJZlK8k9JLmr1i9v6VNu+ZuQYn2j17ye5eaS+odWmkmw9d09PknQmTufM4ePAgZH1TwOfqapfBV4FNrf6ZuDVVv9MG0eSa4E7gHcAG4C/aYGzBPgccAtwLfDBNlaSNJCxwiHJKuA24PNtPcD7gC+1ITuA29vyxrZO235TG78ReKCqflxV/wlMATe021RVPV9VPwEeaGMlSQMZ98zhs8CfAv/X1n8ZeK2q3mjr08DKtrwSeAmgbT/Sxv+0/qZ9TlaXJA1kznBI8n7gUFXtvwD9zNXLliSTSSaPHT0ydDuS9HNr6Rhj3gv8bpJbgbcClwN/DSxLsrSdHawCDrbxB4HVwHSSpcAVwA9H6seN7nOy+s+oqm3ANoCLV6ytMXqXJJ2BOc8cquoTVbWqqtYw84HyN6vq94FHgd9rwzYBD7XlXW2dtv2bVVWtfkebzXQNsBZ4HNgHrG2zny5qj7HrnDw7SdIZGefM4WT+DHggyV8CTwD3t/r9wN8lmQIOM/PPnqp6JsmDwLPAG8BdVXUMIMnHgN3AEmB7VT1zFn1Jks5SZt7ULzwXr1hbKzZ9FoAX7rlt4G4kaf5Lsr+qJsYZ6zekJUkdw0GS1DEcJEkdw0GS1DEcJEkdw0GS1DEcJEkdw0GS1DEcJEkdw0GS1DEcJEkdw0GS1DEcJEkdw0GS1DEcJEkdw0GS1DEcJEkdw0GS1DEcJEkdw0GS1DEcJEkdw0GS1DEcJEmdOcMhyVuTPJ7kO0meSfIXrX5NkseSTCX5pyQXtfrFbX2qbV8zcqxPtPr3k9w8Ut/QalNJtp77pylJOh3jnDn8GHhfVb0TeBewIck64NPAZ6rqV4FXgc1t/Gbg1Vb/TBtHkmuBO4B3ABuAv0myJMkS4HPALcC1wAfbWEnSQOYMh5rxo7b6lnYr4H3Al1p9B3B7W97Y1mnbb0qSVn+gqn5cVf8JTAE3tNtUVT1fVT8BHmhjJUkDGeszh/YO/0ngELAH+A/gtap6ow2ZBla25ZXASwBt+xHgl0frb9rnZHVJ0kDGCoeqOlZV7wJWMfNO/9fOa1cnkWRLkskkk8eOHhmiBUlaFE5rtlJVvQY8Cvw6sCzJ0rZpFXCwLR8EVgO07VcAPxytv2mfk9Vne/xtVTVRVRNLLr3idFqXJJ2GcWYrvS3JsrZ8CfA7wAFmQuL32rBNwENteVdbp23/ZlVVq9/RZjNdA6wFHgf2AWvb7KeLmPnQete5eHKSpDOzdO4hrAB2tFlFvwA8WFVfTfIs8ECSvwSeAO5v4+8H/i7JFHCYmX/2VNUzSR4EngXeAO6qqmMAST4G7AaWANur6plz9gwlSactM2/qF56LV6ytFZs+C8AL99w2cDeSNP
8l2V9VE+OM9RvSkqSO4SBJ6hgOkqSO4SBJ6hgOkqSO4SBJ6hgOkqSO4SBJ6hgOkqSO4SBJ6hgOkqSO4SBJ6hgOkqSO4SBJ6hgOkqSO4SBJ6hgOkqSO4SBJ6hgOkqSO4SBJ6hgOkqSO4SBJ6hgOkqSO4SBJ6swZDklWJ3k0ybNJnkny8Va/KsmeJM+1+ytbPUnuTTKV5Kkk148ca1Mb/1ySTSP19yR5uu1zb5KcjycrSRrPOGcObwB/UlXXAuuAu5JcC2wFHqmqtcAjbR3gFmBtu20B7oOZMAHuBm4EbgDuPh4obcxHRvbbcPZPTZJ0puYMh6p6uaq+3Zb/BzgArAQ2AjvasB3A7W15I7CzZuwFliVZAdwM7Kmqw1X1KrAH2NC2XV5Ve6uqgJ0jx5IkDeC0PnNIsgZ4N/AYsLyqXm6bXgGWt+WVwEsju0232qnq07PUZ3v8LUkmk0weO3rkdFqXJJ2GscMhyS8CXwb+qKpeH93W3vHXOe6tU1XbqmqiqiaWXHrF+X44SVq0xgqHJG9hJhi+WFX/3Mo/aJeEaPeHWv0gsHpk91Wtdqr6qlnqkqSBjDNbKcD9wIGq+quRTbuA4zOONgEPjdTvbLOW1gFH2uWn3cD6JFe2D6LXA7vbtteTrGuPdefIsSRJA1g6xpj3Ah8Cnk7yZKv9OXAP8GCSzcCLwAfatoeBW4Ep4CjwYYCqOpzkU8C+Nu6TVXW4LX8U+AJwCfD1dpMkDWTOcKiqbwEn+97BTbOML+CukxxrO7B9lvokcN1cvUiSLgy/IS1J6hgOkqSO4SBJ6hgOkqSO4SBJ6hgOkqSO4SBJ6hgOkqSO4SBJ6hgOkqSO4SBJ6hgOkqSO4SBJ6hgOkqSO4SBJ6hgOkqSO4SBJ6hgOkqSO4SBJ6hgOkqSO4SBJ6hgOkqSO4SBJ6swZDkm2JzmU5LsjtauS7EnyXLu/stWT5N4kU0meSnL9yD6b2vjnkmwaqb8nydNtn3uT5Fw/SUnS6RnnzOELwIY31bYCj1TVWuCRtg5wC7C23bYA98FMmAB3AzcCNwB3Hw+UNuYjI/u9+bEkSRfYnOFQVf8KHH5TeSOwoy3vAG4fqe+sGXuBZUlWADcDe6rqcFW9CuwBNrRtl1fV3qoqYOfIsSRJAznTzxyWV9XLbfkVYHlbXgm8NDJuutVOVZ+epS5JGtBZfyDd3vHXOehlTkm2JJlMMnns6JEL8ZCStCidaTj8oF0Sot0favWDwOqRcata7VT1VbPUZ1VV26pqoqomllx6xRm2Lkmay5mGwy7g+IyjTcBDI/U726yldcCRdvlpN7A+yZXtg+j1wO627fUk69ospTtHjiVJGsjSuQYk+Ufgt4Crk0wzM+voHuDBJJuBF4EPtOEPA7cCU8BR4MMAVXU4yaeAfW3cJ6vq+IfcH2VmRtQlwNfbTZI0oDnDoao+eJJNN80ytoC7TnKc7cD2WeqTwHVz9SFJunD8hrQkqWM4SJI6hoMkqWM4SJI6hoMkqWM4SJI6hoMkqWM4SJI6hoMkqWM4SJI6hoMkqWM4SJI6hoMkqWM4SJI6hoMkqWM4SJI6hoMkqWM4SJI6hoMkqWM4SJI6hoMkqbN06AbOhTVbv/bT5RfuuW3ATiTp54NnDpKkzrwJhyQbknw/yVSSrUP3I0mL2by4rJRkCfA54HeAaWBfkl1V9ezpHstLTJJ09uZFOAA3AFNV9TxAkgeAjcBph8Mog0KSzsx8CYeVwEsj69PAjefyAUaD4kwZMJIWi/kSDmNJsgXY0lZ//OKn3//dC/r4n76Qj3Zargb+e+gm5gFfhxN8LU7wtTjh7eMOnC/hcBBYPbK+qtV+RlVtA7YBJJmsqokL09785msxw9fhBF+LE3wtTkgyOe7Y+TJbaR+wNsk1SS4C7gB2DdyTJC1a8+LMoareSPIxYDewBNheVc8M3JYkLVrzIhwAquph4OHT2GXb+eplAfK1mOHrcIKvxQm+FieM/Vqkqs5nI5KkBWi+fOYgSZpHFlw4+DMbM5JsT3IoyQWdzjsfJVmd5NEkzyZ5JsnHh+5pKEnemuTxJN9pr8VfDN3T0JIsSfJEkq8O3cuQkryQ5OkkT44za2lBXVZqP7Px74z8zAbwwTP5mY2FLslvAj8CdlbVdUP3M6QkK4AVVfXtJL8E7AduX6R/FwEuq6ofJXkL8C3g41W1d+DWBpPkj4EJ4PKqev/Q/QwlyQvARFWN9Z2PhXbm8NOf2aiqnwDHf2Zj0amqfwUOD93HfFBVL1fVt9vy/wAHmPnW/aJTM37UVt/SbgvnHeA5lmQVcBvw+aF7WWgWWjjM9jMbi/KfgGaXZA3wbuCxYTsZTruM8iRwCNhTVYv2tQA+C/wp8H9DNzIPFPAvSfa3X5s4pYUWDtJJJflF4MvAH1XV60P3M5SqOlZV72LmlwZuSLIoLzsmeT9wqKr2D93LPPEbVXU9cAtwV7s0fVILLRzG+pkNLT7t+vqXgS9W1T8P3c98UFWvAY8CG4buZSDvBX63XWt/AHhfkr8ftqXhVNXBdn8I+Aozl+lPaqGFgz+zoU77EPZ+4EBV/dXQ/QwpyduSLGvLlzAzeeN7w3Y1jKr6RFWtqqo1zPyv+GZV/cHAbQ0iyWVtsgZJLgPWA6ec6bigwqGq3gCO/8zGAeDBxfozG0n+Efg34O1JppNsHrqnAb0X+BAz7wyfbLdbh25qICuAR5M8xcybqT1VtaincAqA5cC3knwHeBz4WlV941Q7LKiprJKkC2NBnTlIki4Mw0GS1DEcJEkdw0GS1DEcJEkdw0GS1DEcJEkdw0GS1Pl/SX852vJtevUAAAAASUVORK5CYII=\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "import matplotlib.pyplot as plt\n", "%matplotlib inline\n", "\n", "_ = plt.hist(events_test.Predicted, bins=200)\n", "plt.xlim(0,5)\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Model with clicked items only" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Average prediction for combinations in test set: 0.025673132\n", "Average prediction for random combinations: 0.008127485\n" ] } ], "source": [ "events_test['Predicted'] = recommender_clicked_items_only.predict(user=events_test.UserId, item=events_test.ItemId)\n", "events_test['PredictedRandom'] = recommender_clicked_items_only.predict(user=events_test.UserId,\n", " item=events_test.RandomItem)\n", "print(\"Average prediction for combinations in test set: \", events_test.Predicted.mean())\n", "print(\"Average prediction for random combinations: \", events_test.PredictedRandom.mean())" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.6907211476157746" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "was_clicked = np.r_[np.ones(events_test.shape[0]), np.zeros(events_test.shape[0])]\n", "score_model = np.r_[events_test.Predicted.values, events_test.PredictedRandom.values]\n", "roc_auc_score(was_clicked, score_model)" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.06974015808183695" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.corrcoef(events_test.Count, events_test.Predicted)[0,1]" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "image/png": 
"iVBORw0KGgoAAAANSUhEUgAAAYcAAAD9CAYAAABX0LttAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMi4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvhp/UCwAAE5VJREFUeJzt3G+MXfV95/H3p3ZIgBZs2qxl2ZZAqhVEkUJgBK5SVd2gGANRzIMKgXbrEbJwJcgq0a7UNfvEKvQBedKkSCmSFbyxu9lQlDTCSiDuyGFVRVqDx4FAgKSeUhBjAd7GYJq1lAj63Qfz8/qG39hz/ffOxO+XdHXP+Z7vOfO7V5Y/95zzuzdVhSRJg35j1AOQJM0/hoMkqWM4SJI6hoMkqWM4SJI6hoMkqTNnOCT5WJLnBh7vJvlCksuSTCTZ356Xtv4keSjJVJLnk1w7cKzx1r8/yfhA/bokL7R9HkqSs/NyJUnDmDMcquqnVXVNVV0DXAccAb4NbAZ2V9VqYHdbB7gZWN0em4CHAZJcBmwBbgCuB7YcDZTWc/fAfuvOyKuTJJ2Sk72sdCPwT1X1GrAe2N7q24Hb2vJ6YEfN2AMsSbIcuAmYqKpDVfU2MAGsa9suqao9NfONvB0Dx5IkjcDJhsMdwDfa8rKqeqMtvwksa8srgNcH9plutRPVp2epS5JGZPGwjUkuAD4L3PfBbVVVSc7673Ak2cTMpSouvvji66688sqz/Scl6dfGvn37/qWqPjpM79DhwMy9hB9W1Vtt/a0ky6vqjXZp6GCrHwBWDey3stUOAH/0gfr/avWVs/R3qmorsBVgbGysJicnT2L4knR+S/LasL0nc1npTo5dUgLYCRydcTQOPD5Q39BmLa0BDrfLT7uAtUmWthvRa4Fdbdu7Sda0WUobBo4lSRqBoc4cklwMfBr404Hyg8BjSTYCrwG3t/oTwC3AFDMzm+4CqKpDSR4A9ra++6vqUFu+B/gacCHwZHtIkkYkC/Unu72sJEknJ8m+qhobptdvSEuSOoaDJKljOEiSOoaDJKljOEiSOoaDJKlzMt+QnldeOHCYyzd/94Q9rz546zkajST9evHMQZLUMRwkSR3DQZLUMRwkSR3DQZLUMRwkSR3DQZLUMRwkSR3DQZLUMRwkSR3DQZLUMRwkSR3DQZLUMRwkSR3DQZLUGSockixJ8s0kP0nycpLfT3JZkokk+9vz0tabJA8lmUryfJJrB44z3vr3JxkfqF+X5IW2z0NJcuZfqiRpWMOeOfwV8L2quhL4OPAysBnYXVWrgd1tHeBmYHV7bAIeBkhyGbAFuAG4HthyNFBaz90D+607vZclSTodc4ZDkkuBPwQeAaiqX1bVO8B6YHtr2w7c1pbXAztqxh5gSZLlwE3ARFUdqqq3gQlgXdt2SVXtqaoCdgwcS5I0AsOcOVwB/B/gvyd5NslXk1wMLKuqN1rPm8CytrwCeH1g/+lWO1F9epa6JGlEhgmHxcC1wMNV9Qng/3LsEhIA7RN/nfnh/aokm5JMJpl8/8jhs/3nJOm8NUw4TAPTVfV0W/8mM2HxVrskRHs+2LYfAFYN7L+y1U5UXzlLvVNVW6tqrKrGFl106RBDlySdijnDoareBF5P8rFWuhF4CdgJHJ1xNA483pZ3AhvarKU1wOF2+WkXsDbJ0nYjei2wq217N8maNktpw8CxJEkjsHjIvv8EfD3JBcArwF3MBMtjSTYCrwG3t94ngFuAKeBI66WqDiV5ANjb+u6vqkNt+R7ga8CFwJPtIUkakaHCoaqeA8Zm2XTjLL0F3Huc42wDts1SnwSuHmYskqSzz29IS5I6hoMkqWM4SJI6hoMkqWM4SJI6hoMkqWM4SJI6hoMkqWM4SJI6hoMkqWM4SJI6hoMkqWM4SJI6hoMkqWM4SJI6hoMkqWM4SJI6hoMkqWM4SJI6hoMkqWM4SJI6hoMkqTNUOCR5NckLSZ5LMtlqlyWZSLK/PS9t9SR5KMlUkueTXDtwnPHWvz/J+ED9unb8qbZvzvQLlSQN72TOHP59VV1TVWNtfTOwu6pWA7vbOsDNwOr22AQ8DDNhAmwBbgCuB7YcDZTWc/fAfutO+RVJkk7b6VxWWg9sb8vbgdsG6jtqxh5gSZLlwE3ARFUdqqq3gQlgXdt2SVXtqaoCdgwcS5I0AsOGQwF/n2Rfkk2ttqyq3mjLbwLL2vIK4PWBfadb7UT16VnqkqQRWTxk3x9U1YEk/w6YSPKTwY1VVUnqzA/vV7Vg2gSw6JKPnu0/J0nnraHOHKrqQHs+CHybmXsGb7VLQrTng639ALBqYPeVrXai+spZ6rONY2tVjVXV2KKLLh1m6JKkUzBnOCS5OMlvHV0G1gI/BnYCR2ccjQOPt+WdwIY2a2kNcLhdftoFrE2ytN2IXgvsatveTbKmzVLaMHAsSdIIDHNZaRnw7Ta7dDHwP6vqe0n2Ao8l2Qi8Btze+p8AbgGmgCPAXQBVdSjJA8De1nd/VR1qy/cAXwMuBJ5sD0nSiMwZDlX1CvDxWeo/A26cpV7Avcc51jZg2yz1SeDqIcYrSToH/Ia0JKljOEiSOoaDJKljOEiSOoaDJKljOEiSOoaDJKljOEiSOoaDJKljOEiSOoaDJKljOEiSOoaDJKljOEiSOoaDJKljOEiSOoaDJKljOEiSOoaDJKljOEiSOoaDJKljOEiSOoaDJKkzdDgkWZTk2STfaetXJHk6yVSSv01yQat/uK1Pte2XDxzjvlb/aZKbBurrWm0qyeYz9/IkSafiZM4cPg+8PLD+ReBLVfW7wNvAxlbfCLzd6l9qfSS5CrgD+D1gHfDXLXAWAV8BbgauAu5svZKkERkqHJKsBG4FvtrWA3wK+GZr2Q7c1pbXt3Xa9htb/3rg0ar6RVX9MzAFXN8eU1X1SlX9Eni09UqSRmTYM4cvA38G/Ftb/23gnap6r61PAyva8grgdYC2/XDr///1D+xzvHonyaYkk0km3z9yeMihS5JO1pzhkOQzwMGq2ncOxnNCVbW1qsaqamzRRZeOejiS9Gtr8RA9nwQ+m+QW4CPAJcBfAUuSLG5nByuBA63/ALAKmE6yGLgU+NlA/ajBfY5XlySNwJxnDlV1X1WtrKrLmbmh/P2q+g/AU8Aft7Zx4PG2vLOt07Z/v6qq1e9os5muAFYDzwB7gdVt9tMF7W/sPCOvTpJ0SoY5czie/wo8muQvgGeBR1r9EeBvkkwBh5j5z56qejHJY8BLwHvAvVX1PkCSzwG7gEXAtqp68TTGJUk6TZn5UL/wfHj56lo+/uUT9rz64K3naDSSNP8l2VdVY8P0+g1pSVLHcJAkdQwHSVLHcJAkdQwHSVLHcJAkdQwHSVLHcJAkdQwHSVLHcJAkdQwHSVLHcJAkdQwHSVLHcJAkdQwHSVLHcJAkdQwHSVLHcJAkdQwHSVLHcJAkdQwHSVLHcJAkdeYMhyQfSfJMkh8leTHJn7f6FUmeTjKV5G+TXNDqH27rU2375QPHuq/Vf5rkpoH6ulabSrL5zL9MSdLJGO
bM4RfAp6rq48A1wLoka4AvAl+qqt8F3gY2tv6NwNut/qXWR5KrgDuA3wPWAX+dZFGSRcBXgJuBq4A7W68kaUTmDIea8fO2+qH2KOBTwDdbfTtwW1te39Zp229MklZ/tKp+UVX/DEwB17fHVFW9UlW/BB5tvZKkERnqnkP7hP8ccBCYAP4JeKeq3mst08CKtrwCeB2gbT8M/PZg/QP7HK8+2zg2JZlMMvn+kcPDDF2SdAqGCoeqer+qrgFWMvNJ/8qzOqrjj2NrVY1V1diiiy4dxRAk6bxwUrOVquod4Cng94ElSRa3TSuBA235ALAKoG2/FPjZYP0D+xyvLkkakWFmK300yZK2fCHwaeBlZkLij1vbOPB4W97Z1mnbv19V1ep3tNlMVwCrgWeAvcDqNvvpAmZuWu88Ey9OknRqFs/dwnJge5tV9BvAY1X1nSQvAY8m+QvgWeCR1v8I8DdJpoBDzPxnT1W9mOQx4CXgPeDeqnofIMnngF3AImBbVb14xl6hJOmkZeZD/cLz4eWra/n4l0/Y8+qDt56j0UjS/JdkX1WNDdPrN6QlSR3DQZLUMRwkSR3DQZLUMRwkSR3DQZLUMRwkSR3DQZLUMRwkSR3DQZLUMRwkSR3DQZLUMRwkSR3DQZLUMRwkSR3DQZLUMRwkSR3DQZLUMRwkSR3DQZLUMRwkSR3DQZLUmTMckqxK8lSSl5K8mOTzrX5Zkokk+9vz0lZPkoeSTCV5Psm1A8cab/37k4wP1K9L8kLb56EkORsvVpI0nGHOHN4D/ktVXQWsAe5NchWwGdhdVauB3W0d4GZgdXtsAh6GmTABtgA3ANcDW44GSuu5e2C/daf/0iRJp2rOcKiqN6rqh235X4GXgRXAemB7a9sO3NaW1wM7asYeYEmS5cBNwERVHaqqt4EJYF3bdklV7amqAnYMHEuSNAIndc8hyeXAJ4CngWVV9Ubb9CawrC2vAF4f2G261U5Un56lLkkakaHDIclvAt8CvlBV7w5ua5/46wyPbbYxbEoymWTy/SOHz/afk6Tz1lDhkORDzATD16vq71r5rXZJiPZ8sNUPAKsGdl/Zaieqr5yl3qmqrVU1VlVjiy66dJihS5JOwTCzlQI8ArxcVX85sGkncHTG0Tjw+EB9Q5u1tAY43C4/7QLWJlnabkSvBXa1be8mWdP+1oaBY0mSRmDxED2fBP4EeCHJc63234AHgceSbAReA25v254AbgGmgCPAXQBVdSjJA8De1nd/VR1qy/cAXwMuBJ5sD0nSiMwZDlX1A+B43zu4cZb+Au49zrG2AdtmqU8CV881FknSueE3pCVJHcNBktQxHCRJHcNBktQxHCRJHcNBktQxHCRJHcNBktQxHCRJHcNBktQxHCRJHcNBktQxHCRJHcNBktQxHCRJHcNBktQxHCRJHcNBktQxHCRJHcNBktQxHCRJHcNBktSZMxySbEtyMMmPB2qXJZlIsr89L231JHkoyVSS55NcO7DPeOvfn2R8oH5dkhfaPg8lyZl+kZKkkzPMmcPXgHUfqG0GdlfVamB3Wwe4GVjdHpuAh2EmTIAtwA3A9cCWo4HSeu4e2O+Df0uSdI7NGQ5V9Q/AoQ+U1wPb2/J24LaB+o6asQdYkmQ5cBMwUVWHquptYAJY17ZdUlV7qqqAHQPHkiSNyKnec1hWVW+05TeBZW15BfD6QN90q52oPj1LXZI0Qqd9Q7p94q8zMJY5JdmUZDLJ5PtHDp+LPylJ56VTDYe32iUh2vPBVj8ArBroW9lqJ6qvnKU+q6raWlVjVTW26KJLT3HokqS5nGo47ASOzjgaBx4fqG9os5bWAIfb5addwNokS9uN6LXArrbt3SRr2iylDQPHkiSNyOK5GpJ8A/gj4HeSTDMz6+hB4LEkG4HXgNtb+xPALcAUcAS4C6CqDiV5ANjb+u6vqqM3ue9hZkbUhcCT7SFJGqE5w6Gq7jzOphtn6S3g3uMcZxuwbZb6JHD1XOOQJJ07fkNaktQxHCRJHcNBktQxHCRJHcNBktQxHCRJHcNBktQxHCRJHcNBktQxHCRJHcNBktQxHCRJHcNBktQxHCRJHcNBktQxHCRJHcNBktQxHCRJHcNBktQxHCRJHcNBktQxHCRJnXkTDknWJflpkqkkm0c9Hkk6ny0e9QAAkiwCvgJ8GpgG9ibZWVUvnc5xL9/83aH6Xn3w1tP5M5L0a2e+nDlcD0xV1StV9UvgUWD9iMckSeeteXHmAKwAXh9YnwZuOFd/fNgzjDPNMxZJ89V8CYehJNkEbGqrv3jti5/58SjHc7ryxTN2qN8B/uWMHW3h8n04xvfiGN+LYz42bON8CYcDwKqB9ZWt9iuqaiuwFSDJZFWNnZvhzW++FzN8H47xvTjG9+KYJJPD9s6Xew57gdVJrkhyAXAHsHPEY5Kk89a8OHOoqveSfA7YBSwCtlXViyMeliSdt+ZFOABU1RPAEyexy9azNZYFyPdihu/DMb4Xx/heHDP0e5GqOpsDkSQtQPPlnoMkaR5ZcOHgz2zMSLItycEkC3o675mQZFWSp5K8lOTFJJ8f9ZhGJclHkjyT5EftvfjzUY9p1JIsSvJsku+MeiyjlOTVJC8keW6YWUsL6rJS+5mNf2TgZzaAO0/3ZzYWoiR/CPwc2FFVV496PKOUZDmwvKp+mOS3gH3Abefpv4sAF1fVz5N8CPgB8Pmq2jPioY1Mkv8MjAGXVNVnRj2eUUnyKjBWVUN952OhnTn4MxtNVf0DcGjU45gPquqNqvphW/5X4GVmvnV/3qkZP2+rH2qPhfMJ8AxLshK4FfjqqMey0Cy0cJjtZzbOy/8ENLsklwOfAJ4e7UhGp11GeQ44CExU1Xn7XgBfBv4M+LdRD2QeKODvk+xrvzZxQgstHKTjSvKbwLeAL1TVu6Mez6hU1ftVdQ0zvzRwfZLz8rJjks8AB6tq36jHMk/8QVVdC9wM3NsuTR/XQguHoX5mQ+efdn39W8DXq+rvRj2e+aCq3gGeAtaNeiwj8kngs+1a+6PAp5L8j9EOaXSq6kB7Pgh8m5nL9Me10MLBn9lQp92EfQR4uar+ctTjGaUkH02ypC1fyMzkjZ+MdlSjUVX3VdXKqrqcmf8rvl9V/3HEwxqJJBe3yRokuRhYC5xwpuOCCoeqeg84+jMbLwOPna8/s5HkG8D/Bj6WZDrJxlGPaYQ+CfwJM58Mn2uPW0Y9qBFZDjyV5HlmPkxNVNV5PYVTACwDfpDkR8AzwHer6nsn2mFBTWWVJJ0bC+rMQZJ0bhgOkqSO4SBJ6hgOkqSO4SBJ6hgOkqSO4SBJ6hgOkqTO/wN44dpe2QtQQgAAAABJRU5ErkJggg==\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "_ = plt.hist(events_test.Predicted, bins=200)\n", "plt.xlim(0,5)\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "** *\n", "\n", "## 5. Comparison to model without item information\n", "\n", "A natural benchmark to compare this model is to is a Poisson factorization model without any item side information - here I'll do the comparison with a _Hierarchical Poisson factorization_ model with the same metrics as above:" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "**********************************\n", "Hierarchical Poisson Factorization\n", "**********************************\n", "\n", "Number of users: 65913\n", "Number of items: 39578\n", "Latent factors to use: 70\n", "\n", "Initializing parameters...\n", "Allocating Phi matrix...\n", "Initializing optimization procedure...\n", "Iteration 10 | train llk: -4635584 | train rmse: 2.8502\n", "Iteration 20 | train llk: -4548912 | train rmse: 2.8397\n", "Iteration 30 | train llk: -4512693 | train rmse: 2.8336\n", "Iteration 40 | train llk: -4492286 | train rmse: 2.8297\n", "Iteration 50 | train llk: -4476969 | train rmse: 2.8287\n", "Iteration 60 | train llk: -4464443 | train rmse: 2.8282\n", "Iteration 70 | train llk: -4454397 | train rmse: 2.8282\n", "Iteration 80 | train llk: -4448200 | train rmse: 2.8280\n", "Iteration 90 | train llk: -4442528 | train rmse: 2.8275\n", "Iteration 100 | train llk: -4437068 | train rmse: 2.8272\n", "\n", "\n", "Optimization finished\n", "Final log-likelihood: -4437068\n", "Final RMSE: 2.8272\n", "Minutes taken (optimization part): 1.5\n", "\n", "CPU times: user 12min 22s, sys: 2.19 s, total: 12min 24s\n", "Wall time: 1min 34s\n" ] } ], "source": [ "%%time\n", "from hpfrec import HPF\n", "\n", "recommender_no_sideinfo = HPF(k=70)\n", "recommender_no_sideinfo.fit(events_train.copy())" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Average prediction for combinations in test set: 0.023392139\n", "Average prediction for random combinations: 0.0063642794\n" ] } ], "source": [ "events_test_comp = events_test.copy()\n", "events_test_comp['Predicted'] = recommender_no_sideinfo.predict(user=events_test_comp.UserId, item=events_test_comp.ItemId)\n", "events_test_comp['PredictedRandom'] = recommender_no_sideinfo.predict(user=events_test_comp.UserId,\n", " item=events_test_comp.RandomItem)\n", "print(\"Average prediction for combinations in test set: \", events_test_comp.Predicted.mean())\n", "print(\"Average prediction for random combinations: \", events_test_comp.PredictedRandom.mean())" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.6910112931686316" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "was_clicked = np.r_[np.ones(events_test_comp.shape[0]), np.zeros(events_test_comp.shape[0])]\n", "score_model = np.r_[events_test_comp.Predicted.values, events_test_comp.PredictedRandom.values]\n", "roc_auc_score(was_clicked, score_model)" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.1007423756772694" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.corrcoef(events_test_comp.Count, events_test_comp.Predicted)[0,1]" ] }, { 
"cell_type": "markdown", "metadata": {}, "source": [ "As can be seen, adding the side information and widening the catalog to include more items using only their text descriptions (no clicks) results in an improvemnet over all 3 metrics, especially correlation with number of clicks.\n", "\n", "More important than that however, is its ability to make recommendations from a far wider catalog of items, which in practice can make a much larger difference in recommendation quality than improvement in typicall offline metrics.\n", "** *\n", "\n", "## 6. Making recommendations\n", "\n", "The package provides a simple API for making predictions and Top-N recommended lists. These Top-N lists can be made among all items, or across some user-provided subset only, and you can choose to discard items with which the user had already interacted in the training set.\n", "\n", "Here I will:\n", "* Pick a random user with a reasonably long event history.\n", "* See which items would the model recommend to them among those which he has not yet clicked.\n", "* Compare it with the recommended list from the model without item side information.\n", "\n", "Unfortunately, since all the data is anonymized, it's not possible to make a qualitative evaluation of the results by looking at the recommended lists as it is in other datasets." ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1362222" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "users_many_events = events_train.groupby('UserId')['ItemId'].agg(lambda x: len(tuple(x)))\n", "users_many_events = np.array(users_many_events.index[users_many_events > 20])\n", "\n", "np.random.seed(1)\n", "chosen_user = np.random.choice(users_many_events)\n", "chosen_user" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 44 ms, sys: 0 ns, total: 44 ms\n", "Wall time: 52 ms\n" ] }, { "data": { "text/plain": [ "array([ 9877, 119736, 312728, 241555, 257040, 325310, 320130, 445351,\n", " 409804, 384302, 219512, 38965, 234255, 303828, 37029, 309778,\n", " 248455, 190000, 290999, 213834])" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "recommender_all_items.topN(chosen_user, n=20)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*(These numbers represent the IDs of the items being recommended as they appeared in the `events_train` data frame)*" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 8 ms, sys: 0 ns, total: 8 ms\n", "Wall time: 1.65 ms\n" ] }, { "data": { "text/plain": [ "array([119736, 441852, 372188, 344723, 116624, 439963, 345279, 4001,\n", " 183511, 33912, 354585, 456056, 29940, 272324, 89323, 186702,\n", " 190000, 227790, 92361, 78729])" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "recommender_clicked_items_only.topN(chosen_user, n=20)" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 4 ms, sys: 0 ns, total: 4 ms\n", "Wall time: 1.48 ms\n" ] }, { "data": { "text/plain": [ "array([ 9877, 241555, 325310, 38965, 283115, 272455, 37115, 412622,\n", " 252319, 314789, 108486, 265571, 20740, 212917, 210087, 198784,\n", " 381941, 82377, 
178274, 122219])" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "recommender_no_sideinfo.topN(chosen_user, n=20)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "** *\n", "\n", "## 7. References\n", "* Gopalan, Prem K., Laurent Charlin, and David Blei. \"Content-based recommendations with poisson factorization.\" Advances in Neural Information Processing Systems. 2014." ] } ], "metadata": { "kernelspec": { "display_name": "Python3 (mkl)", "language": "python", "name": "myenv" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.5" } }, "nbformat": 4, "nbformat_minor": 2 }