{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Splitting biased test data for recommender system\n", "** *\n", "*Note: if you are visualizing this notebook directly from GitHub, some mathematical symbols might display incorrectly or not display at all. This same notebook can be rendered from nbviewer by following [this link.](http://nbviewer.jupyter.org/github/david-cortes/datascienceprojects/blob/master/optimization/dataset_splitting.ipynb)*\n", "\n", "This project consists of splitting a dataset of feedback of users from products in order to later develop and evaluate recommendation algorithms. The final task (not developed in this example) consists of recommending an option within a product to users, and there is a dataset of feedback from different users about different products, like this (the option with the highest score within a product is recommended to the user):" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
UserIdProductIdProductCategoryProductOptionUserFeedbackScore
0154100.54
1154210.78
2154300.21
\n", "
" ], "text/plain": [ " UserId ProductId ProductCategory ProductOption UserFeedback Score\n", "0 1 5 4 1 0 0.54\n", "1 1 5 4 2 1 0.78\n", "2 1 5 4 3 0 0.21" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "\n", "pd.DataFrame([(1,5,4,1,0,.54),(1,5,4,2,1,.78),(1,5,4,3,0,.21)],\n", " columns=['UserId','ProductId','ProductCategory','ProductOption','UserFeedback','Score'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In order to evaluate how a recommendation algorithm would work for new users and new products with this data, it’s necessary to split it into a train set, with which the parameters of the formulas are to be calculated, and a test set, with which such formula is to be tested.\n", "\n", "However, splitting it is no easy task, because there is a huge sampling bias and the data consists of pretty much always the same subset of users giving feedback about the same subset of products, thus a random split would not suffice. Additionally, there are product categories, and the sample is also biased towards some of those, but both sets need to contain products from all categories.\n", "\n", "Thus, it’s necessary to find some way of splitting the dataset in such a way that the train and test sets have totally different users and products, with each containing products of all categories, while minimizing the observations that are discarded in order to come up with non-intersecting sets – this while still having reasonably-sized splits for each, with the training set being larger. Such a split can be obtained through mixed-integer programming.\n", "\n", "Mixed-integer programs can be modeled with a variety of frameworks (e.g. or-tools, PuLP, pyomo, cvxpy, etc.) and solved with a variety of solvers. Here I'll model the problem using Google's or-tools (available through pip for python 3 as 'py3-ortools') and solve it interfacing it to coin-or's CBC (available as part of [coin-or's optimization suite](https://projects.coin-or.org/CoinBinary)) - both of which are free - aided with igraph (unofficially available as an easy-to-install [python wheel](http://www.lfd.uci.edu/~gohlke/pythonlibs/#python-igraph)). Commercial solvers such as Gurobi, however, usually present better performance and are able to solve larger problems. This example is purposefully small in order not to make the solution too difficult or too slow for CBC.\n", "\n", "The data that created this situation is not public, but it’s possible to simulate similar data with random statistical distributions - real data would most likely result in larger (more observations, user and products) but easier splitting problems though (different connectivity between users-products-categories).\n", "** *\n", "## Sections\n", "[1. Simulating the data](#p1)\n", "\n", "[2. Modeling the problem - Non-intersecting sets](#p2)\n", "* [2.1 Mathematical formulation](#p21)\n", "* [2.2 Modeling the problem in or-tools](#p22)\n", "* [2.3 Examining the results](#p23)\n", "\n", "[3. Alternative model - Soft-assignment, cliques and larger dataset](#p3)\n", "* [3.1 Mathematical formulation](#p31)\n", "* [3.2 Simulating a larger problem](#p32)\n", "* [3.3 Obtaining the maximal cliques](#p33)\n", "* [3.4 Modeling the problem in or-tools](#p34)\n", "* [3.5 Examining the results](#p35)\n", "\n", "** *\n", "\n", "## 1. 
Simulating the data\n", "\n", "Here I’ll produce a small sample of data that resembles this situation, in the following way:\n", "\n", "* There is a number of users to sample, and a number of products to sample, with twice as many products as users.\n", "* Each product belongs to one of 3 categories, with unequal probabilities for each.\n", "* There is a sampling bias towards each user and product, with some being more or less likely to be observed in the feedback data. These biases are simulated by assigning a random number to users and products drawn from a beta distribution (with the products having a less spread-out distribution).\n", "* A number of observations are collected of users giving feedback for products. Each one consists of selecting a user and a product independently, with probability proportional to their sampling bias." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import numpy as np, pandas as pd, igraph, matplotlib.pyplot as plt\n", "%matplotlib inline\n", "\n", "nusers=50\n", "nprods=100\n", "ncategs=3\n", "users=[u for u in range(nusers)]\n", "prods=[p+nusers for p in range(nprods)] # assigning them an ID for plotting\n", "np.random.seed(1)\n", "prod_cat=np.random.choice([0,1,2],p=[.5,.3,.2],size=nprods)\n", "user_prob=np.random.beta(a=3,b=3,size=nusers)\n", "item_prob=np.random.beta(a=4,b=3,size=nprods)\n", "\n", "user_prob=user_prob/np.sum(user_prob)\n", "item_prob=item_prob/np.sum(item_prob)\n", "data=list()\n", "for i in range(200):\n", " data.append((np.random.choice(users,p=user_prob),np.random.choice(prods,p=item_prob)))\n", " \n", "data=pd.DataFrame(data,columns=['User','Product']).drop_duplicates()\n", "# edge list of pairs of users with at least one product in common\n", "temp_left=data.copy().rename(columns={'User':'User1'})\n", "temp_right=data.copy().rename(columns={'User':'User2'})\n", "uel=pd.merge(temp_left,temp_right,on='Product')\n", "uel=uel.loc[uel.User1<uel.User2]" ] },
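 { "cell_type": "markdown", "metadata": {}, "source": [ "As a quick sanity check of the simulated bias, the spread of per-user and per-product observation counts can be inspected - a minimal sketch (the `describe()` summaries below are illustrative and not used by the split itself):" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# the skewed spread of these counts reflects the simulated sampling bias\n", "print(data.groupby('User').size().describe())\n", "print(data.groupby('Product').size().describe())" ] },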
"\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "g=igraph.Graph.Bipartite([False]*nusers+[True]*nprods,[(i.User,i.Product) for i in data.itertuples()])\n", "g.vs['color']=[3]*nusers+list(prod_cat)\n", "g.delete_vertices([i for i in g.vs if g.degree(i)==0])\n", "igraph.plot(g, vertex_color=[['green','green3','darkgreen','orange'][v['color']] for v in g.vs], bbox=(0,0,500,300),vertex_size=10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now plotting graphs of observed users with products in common – splitting them into balanced sets is not as easy as finding connected components or clustering the graph:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "g=igraph.Graph(directed=False)\n", "g.add_vertices(users)\n", "g.add_edges([(i.User1,i.User2) for i in uel.itertuples()])\n", "igraph.plot(g,bbox=(0, 0, 200, 200),vertex_size=8)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The cases in this random sample didn’t encompass all users or all products, and since there is no data for them, I’ll remove them and reindex their IDs:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# reassigning IDs of users and products\n", "user_old_to_new=dict()\n", "user_new_to_old=dict()\n", "prod_old_to_new=dict()\n", "prod_new_to_old=dict()\n", "\n", "users_sampled=list(set(list(data.User)))\n", "products_sampled=list(set(list(data.Product)))\n", "users_sampled.sort()\n", 
"products_sampled.sort()\n", "\n", "cnt=0\n", "for u in users_sampled:\n", " user_old_to_new[u]=cnt\n", " user_new_to_old[cnt]=u\n", " cnt+=1\n", "cnt=0\n", "for p in products_sampled:\n", " prod_old_to_new[p]=cnt\n", " prod_new_to_old[cnt]=p\n", " cnt+=1\n", " \n", "# remapping original test data\n", "data['User']=data.User.map(lambda u: int(user_old_to_new[u]))\n", "data['Product']=data.Product.map(lambda p: int(prod_old_to_new[p]))\n", "prod_cat=[prod_cat[p-nusers] for p in products_sampled]\n", "\n", "nusers=len(users_sampled)\n", "nprods=len(products_sampled)\n", "users=[user_old_to_new[u] for u in users_sampled]\n", "prods=[prod_old_to_new[p] for p in products_sampled]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Calculating some useful numbers and sets for later:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# tests per product\n", "cases_per_prod=data.groupby('Product').size().to_dict()\n", "obs_per_prod=[cases_per_prod[p] for p in range(nprods)]\n", "\n", "# tests per user\n", "cases_per_user=data.groupby('User').size().to_dict()\n", "obs_per_user=[cases_per_user[u] for u in range(nusers)]\n", "\n", "# dct which users tested which products\n", "prods_per_user=dict()\n", "for user,df in data.groupby('User'):\n", " prods_per_user[user]=set(list(df.Product))\n", "\n", "# dct which products were tested by each user\n", "users_per_prod=dict()\n", "for product,df in data.groupby('Product'):\n", " users_per_prod[product]=set(list(df.User))\n", " \n", "# split sets: train=0,test=1\n", "sets=[0,1]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## 2. Modeling the problem - Non-intersecting sets\n", "\n", "\n", "### 2.1 Mathematical Formulation\n", "\n", "\n", "The splitting problem can be expressed into finding binary (i.e. 0/1 values) variables that satisfy the following mathematical problem:\n", "
('u' denotes users, 'p' denotes products and 's' denotes the set (train-test))
\n", "\n", "$$ max \\: \\sum_{u,p,s} Observation_{u,p,s} $$\n", "
(maximize the number of observations assigned to either set)
\n", "$$ $$\n", "$$ s.t. $$\n", "$$ $$\n", "$$ \\left\\vert Observations \\,from\\, user\\, u \\right\\vert \\times User_{u,s} \\ge \\sum_p Observation_{u,p,s} \\:\\forall\\, u,s $$
(A user belongs to a set if at least one of their observations is assigned to it)
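For example, if user $u$ has 3 observations, the constraint for set $s$ reads $3 \times User_{u,s} \ge \sum_p Observation_{u,p,s}$: as soon as any of those observations is assigned to $s$, the indicator $User_{u,s}$ is forced to 1.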
\n", "$$ $$\n", "$$ \\left\\vert Observations\\, from \\,product\\, p \\right\\vert \\times Product_{p,s} \\ge \\sum_u Observation_{u,p,s} \\:\\forall\\, p,s $$
(A product belongs to a set if at least one of its observations is assigned to it)
\n", "$$ $$\n", "$$ \\sum_s User_{u,s} \\le 1 \\:\\forall\\,u $$
(each user belongs to at most 1 set)
\n", "$$ $$\n", "$$ \\sum_s Product_{p,s} \\le 1 \\:\\forall\\,p $$
(each product belongs to at most 1 set)
\n", "$$ $$\n", "$$ \\sum_{p \\, \\in \\, category} \\sum_{u} Observation_{u,p,s} \\ge \\left\\vert Minimum\\, Category\\, Sample \\right\\vert \\:\\forall\\, \n", "s,categories$$
(There is a minimum number of observations on products from each category in each set)
\n", "$$ $$\n", "$$ \\sum_{u,s} User_{u,s} \\ge \\left\\vert Minimum \\, User \\, Sample \\, \\right\\vert \\:\\forall \\, u,s $$
(There is a minimum number of users in each set)
\n", "$$ $$\n", "$$ \\sum_{u,p} Observation_{u,p,s} \\ge \\left\\vert\\ Minimum \\, Set \\, Size \\right\\vert \\:\\forall\\, s $$
(There is a minimum number of observations in each set)
\n", "$$ $$\n", "$$ \\sum_{s=train} Observation_{u,p,s} - 1.5 \\sum_{s=test} Observation_{u,p,s} \\ge 0 $$
(The training set contains at least one and a half times as many observations as the test set)
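Equivalently, at least 60% of all assigned observations go to the training set, since $x_{train} \ge 1.5 \, x_{test}$ implies $x_{train}/(x_{train}+x_{test}) \ge 1.5/2.5 = 0.6$.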
\n", "$$ $$\n", "$$ Observation_{u,p,s}, User_{u,s}, Product_{p,s} \\in \\{ 0,1 \\} $$
(All decision variables are binary)
\n", "\n", "\n", "### 2.2 Modeling the problem in or-tools\n", "\n", "Creating the necessary decision variables" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from ortools.linear_solver import pywraplp\n", "\n", "# starting solver\n", "solver = pywraplp.Solver('split',pywraplp.Solver.CBC_MIXED_INTEGER_PROGRAMMING)\n", "\n", "# variables indicating to which set does each user belong\n", "user=[[solver.BoolVar(\"User\"+str(u)+\"Split\"+str(s)) for s in sets] for u in range(nusers)]\n", "\n", "# variables indicating to which set does each product belong\n", "product=[[solver.BoolVar(\"Product\"+str(p)+\"Split\"+str(s)) for s in sets] for p in range(nprods)]\n", "\n", "# variables indicating to which split does each observation belong\n", "obs=[[[None for s in sets] for p in prods] for u in users]\n", "for i in data.itertuples():\n", " for s in sets:\n", " varname=\"Group\"+str(s)+\"User\"+str(int(i.User))+\"Product\"+str(int(i.Product))\n", " obs[int(i.User)][int(i.Product)][s]=solver.BoolVar(varname)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now the optimization objective and constraints:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ " >" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# maximize assigned cases\n", "solver.Maximize(solver.Sum(obs[i.User][i.Product][s] for i in data.itertuples() for s in sets))\n", "\n", "# users assgined to a set\n", "for u in range(nusers):\n", " for s in sets:\n", " solver.Add(obs_per_user[u]*user[u][s] >= solver.Sum(obs[u][p][s] for p in prods_per_user[u]))\n", " \n", "# products assigned to a set\n", "for p in range(nprods):\n", " for s in sets:\n", " solver.Add(obs_per_prod[p]*product[p][s] >= solver.Sum(obs[u][p][s] for u in users_per_prod[p]))\n", "\n", "# no same user in different sets\n", "for u in range(nusers):\n", " solver.Add(user[u][0] + user[u][1] <= 1)\n", "\n", "# no same product in different sets\n", "for p in range(nprods):\n", " solver.Add(product[p][0] + product[p][1] <= 1)\n", "\n", "# minimum size for each set\n", "u_train=10\n", "u_test=10\n", "c_train=50\n", "c_test=50\n", "p_cat=5\n", "solver.Add(solver.Sum(user[u][0] for u in range(nusers))>=u_train)\n", "solver.Add(solver.Sum(user[u][1] for u in range(nusers))>=u_test)\n", "solver.Add(solver.Sum(obs[i.User][i.Product][0] for i in data.itertuples())>=c_train)\n", "solver.Add(solver.Sum(obs[i.User][i.Product][1] for i in data.itertuples())>=c_test)\n", "\n", "for c in range(ncategs):\n", " solver.Add(solver.Sum(obs[i.User][i.Product][0] for i in data.itertuples() if (prod_cat[i.Product]==c))>=p_cat)\n", " solver.Add(solver.Sum(obs[i.User][i.Product][1] for i in data.itertuples() if (prod_cat[i.Product]==c))>=p_cat)\n", "\n", "# balanced split\n", "solver.Add(solver.Sum(obs[i.User][i.Product][0]-1.5*obs[i.User][i.Product][1] for i in data.itertuples())>=0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now letting CBC solve it - a result code of zero means an optimal solution was found, whereas 2 means it failed" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# solve\n", "solver.Solve()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "### 2.3 Examining the 
results\n", "\n", "Extracting the results (not the most efficient way of doing it, but the small number of variables allows for it):" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import re\n", "\n", "user_arr=dict()\n", "prod_arr=dict()\n", "r1=\"^User([0-9]+)Split([0-9])\"\n", "r2=\"^Product([0-9]+)Split([0-9])\"\n", "\n", "\n", "for u in user:\n", " s1=u[0].solution_value()>.5\n", " s2=u[1].solution_value()>.5\n", " u=int(re.match(r1,u[0].name()).groups()[0])\n", " if s1 and s2:\n", " user_arr[u]=-1\n", " elif s1:\n", " user_arr[u]=0\n", " elif s2:\n", " user_arr[u]=1\n", " \n", "for p in product:\n", " s1=p[0].solution_value()>.5\n", " s2=p[1].solution_value()>.5\n", " u=int(re.match(r2,p[0].name()).groups()[0])\n", " if s1 and s2:\n", " prod_arr[u]=-1\n", " elif s1:\n", " prod_arr[u]=0\n", " elif s2:\n", " prod_arr[u]=1\n", " \n", "user_arr=pd.DataFrame([(k,v) for k,v in user_arr.items()],columns=['User','Group'])\n", "prod_arr=pd.DataFrame([(k,v) for k,v in prod_arr.items()],columns=['Product','Group'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now examining the split that CBC found:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Total observations in train set: 129\n", "Total observations in test set: 50\n", "Proportion of cases assigned to either set: 89.95%\n", "\n", "Users in train set: 29\n", "Users in test set: 15\n", "Products in train set: 58\n", "Products in test set: 29\n" ] } ], "source": [ "users_train=set(list(user_arr.User[user_arr.Group==0]))\n", "users_test=set(list(user_arr.User[user_arr.Group==1]))\n", "prods_train=set(list(prod_arr.Product[prod_arr.Group==0]))\n", "prods_test=set(list(prod_arr.Product[prod_arr.Group==1]))\n", "\n", "cases_train=len([i for i in data.itertuples() if (i.User in users_train) and (i.Product in prods_train)])\n", "cases_test=len([i for i in data.itertuples() if (i.User in users_test) and (i.Product in prods_test)])\n", "print('Total observations in train set:',cases_train)\n", "print('Total observations in test set:',cases_test)\n", "print('Proportion of cases assigned to either set:','{:.2%}'.format((cases_train+cases_test)/data.shape[0]))\n", "print()\n", "print('Users in train set:',len(users_train))\n", "print('Users in test set:',len(users_test))\n", "print('Products in train set:',len(prods_train))\n", "print('Products in test set:',len(prods_test))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## 3. Alternative model - Soft-assignment, cliques and larger dataset\n", "\n", "The model used before does the job, but it scales poorly with problem size and might not be the most efficient way of obtaining such a split.\n", "\n", "If we lower the standards, the problem can also be modeled with a soft assignment of users and products to sets – that is, allowing them to be in either set, but minimizing the number of repeated cases (users and products that belong to both sets). As such, it's also possible to (efficently) define the maximum and minimum proportion of the data that one of the sets should be assigned (e.g. 
test set being between 20% and 40% of the data).\n", "\n", "In addition, it’s also possible to use information from a (mathematical) graph of users connected by products (each user is a node and there is an edge between two users whenever there is feedback for a product in common), and of products connected by users, by finding cliques of such graphs (subsets of nodes that are all connected to each other).\n", "\n", "If every observation is to be included in one of the sets, then whenever two users have a product in common, those users should belong to the same set, and whenever two products have a user in common, those products should belong to the same set. The same logic can be extended to larger numbers of users/products connected by products/users – these are obtained by the cliques of such graphs.\n", "\n", "If there existed such a split that would include all observations while assigning totally different users and products to each set (such a split didn’t exist in the simulation before), then the problem would be simple: if a user/product belongs in a clique, then all users/products in the clique belong to the same set, and no user/product in the clique belongs to the other set. Such a model would work by adding an extra variable indicating to which set the clique belongs, and constraints like this: \n", "\n", "$$ \sum_{u \in clique} User_{u,s=train} = \left\vert Clique \, Cardinality \right\vert \times W_{clique} \: \forall\, user\, clique $$\n", "$$ \sum_{u \in clique} User_{u,s=test} = \left\vert Clique \, Cardinality \right\vert \times (1-W_{clique}) \: \forall\, user\, clique $$\n", "$$ W_{clique} \in \{0,1\} \:\forall\, user\,clique $$\n", "$$ $$\n", "
And similarly for product cliques:
\n", "$$ $$\n", "$$ \\sum_{p \\in clique} Product_{p,s=train} = \\left\\vert Clique \\, Cardinality \\right\\vert \\times W_{clique} \\: \\forall\\, product\\, clique $$\n", "$$ \\sum_{p \\in clique} Product_{p,s=test} = \\left\\vert Clique \\, Cardinality \\right\\vert \\times (1-W_{clique}) \\: \\forall\\, product\\, clique $$\n", "$$ W_{clique} \\in \\{0,1\\} \\:\\forall\\, product\\,clique $$\n", "\n", "In this case, however, since users and products can be in both sets, these two inequalities will likely not hold. If we know beforehand to which set does a clique belong, then it would be possible to change one of those equalities to an inequality (if w=1 means the clique is in the training set, then $\\le$ for the first one and $\\ge$ for the second one). \n", "\n", "Since finding all cliques of a graph is a very algorithmically complex operation, and adding too many constraints can slowdown the linear solver, it’s possible to work with this by finding only the maximal cliques of a graph (for which there are more efficient algorithms), and as the training data is supposed to be larger, we can assume (albeit perhaps incorrectly) that the largest cliques would end up in the training data. Note that the solution might not be exactly the same as without the cliques constraints – it might even make the solver fail as it would discard some otherwise valid solutions – but it can speed up things enormously." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "### 3.1 Mathematical formulation\n", "\n", "Alltogether, such a problem formulation would now look like this:\n", "\n", "$$ min \\: \\sum_s(\\sum_u User_{u,s} + \\sum_p Product_{p,s}) $$\n", "$$ $$\n", "$$ s.t. $$\n", "$$ $$\n", "
Users and products belong to at least one set
\n", "$$ \\sum_s User_{u,s} \\ge 1 \\:\\forall\\, u $$\n", "$$ \\sum_s Product_{p,s} \\ge 1 \\:\\forall\\, p $$\n", "$$ $$\n", "
If one of their observations is assigned to a set, they have to be in that set
\n", "$$ \\left\\vert Observations\\,from\\,user\\,u \\right\\vert \\times User_{u,s=train} \\ge \\sum_p Observation_{u,p} \\:\\forall\\, u $$\n", "$$ \\left\\vert Observations\\,from\\,user\\,u \\right\\vert \\times User_{u,s=test} \\ge \\left\\vert Observations\\,from\\,user\\,u \\right\\vert - \\sum_p Observation_{u,p} \\:\\forall\\, u $$\n", "$$ $$\n", "$$ \\left\\vert Observations\\,from\\,product\\,p \\right\\vert \\times Product_{p,s=train} \\ge \\sum_u Observation_{u,p} \\:\\forall\\, p $$\n", "$$ \\left\\vert Observations\\,from\\,product\\,p \\right\\vert \\times Product_{p,s=test} \\ge \\left\\vert Observations\\,from\\,product\\,p \\right\\vert - \\sum_u Observation_{u,p} \\:\\forall\\, p $$\n", "\n", "$$ $$\n", "
Sets have to include a minimum number of users and of products per category, and be balanced in size
\n", "$$ \\sum_{p \\in category,s} Product_{p,s} \\ge \\left\\vert Minimum\\, Category\\, Sample \\right\\vert \\:\\forall\\, \n", "s,categories $$\n", "$$ \\sum_u User_{u,s} \\ge \\left\\vert Minimum\\, User\\, Sample \\right\\vert \\:\\forall\\, \n", "s $$\n", "$$ \\sum_{u,p} Observation_{u,p} - MinPercTest\\times TotalObs \\ge 0 $$\n", "$$ \\sum_{u,p} Observation_{u,p} - (1-MaxTestPerc)\\times TotalObs \\le 0$$\n", "$$ $$\n", "$$ $$\n", "
Adding the clique constraints
\n", "$$ \\sum_{u \\in clique} User_{u,s=train} = \\left\\vert Clique \\, Cardinality \\right\\vert \\times W_{clique} \\: \\forall\\, maximal \\,user\\, clique $$\n", "$$ \\sum_{u \\in clique} User_{u,s=test} \\ge \\left\\vert Clique \\, Cardinality \\right\\vert \\times (1-W_{clique}) \\: \\forall\\,maximal\\, user\\, clique $$\n", "\n", "$$ $$\n", "$$ \\sum_{p \\in clique} Product_{p,s=train} = \\left\\vert Clique \\, Cardinality \\right\\vert \\times W_{clique} \\: \\forall\\, maximal \\, product\\, clique $$\n", "$$ \\sum_{p \\in clique} Product_{p,s=test} \\ge \\left\\vert Clique \\, Cardinality \\right\\vert \\times (1-W_{clique}) \\: \\forall\\, maximal \\, product\\, clique $$\n", "\n", "$$ $$\n", "
All decision variables are binary
\n", "$$ Observation_{u,p}, User_{u,s}, Product_{p,s}, W_{clique} \\in \\{0,1\\} \\:\\forall\\, u,p,s,maximal\\,user\\,and\\,product\\,clique $$\n", "\n", "_Note that now there is only one binary variable per observation, rather than one per observation belonging to each set, as this time they all belong to either set. If an observation variable equals 1, then that observation belongs to the test set, if it euqals zero, it belongs to the training set._" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "### 3.2 Simulating a larger dataset\n", "\n", "Generating a bigger dataset - same way as before, but this time with:\n", "* 100 users\n", "* 200 products\n", "* 5 product categories\n", "* 400 samples" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false }, "outputs": [], "source": [ "nusers=100\n", "nprods=200\n", "ncategs=5\n", "users=[u for u in range(nusers)]\n", "prods=[p for p in range(nprods)]\n", "np.random.seed(1)\n", "prod_cat=np.random.choice([0,1,2,3,4],p=[.4,.2,.2,.1,.1],size=nprods)\n", "user_prob=np.random.beta(a=3,b=3,size=nusers)\n", "item_prob=np.random.beta(a=4,b=3,size=nprods)\n", "\n", "user_prob=user_prob/np.sum(user_prob)\n", "item_prob=item_prob/np.sum(item_prob)\n", "data=list()\n", "for i in range(400):\n", " data.append((np.random.choice(users,p=user_prob),np.random.choice(prods,p=item_prob)))\n", " \n", "data=pd.DataFrame(data,columns=['User','Product']).drop_duplicates()\n", "# reassigning IDs of users and products\n", "user_old_to_new=dict()\n", "user_new_to_old=dict()\n", "prod_old_to_new=dict()\n", "prod_new_to_old=dict()\n", "\n", "users_sampled=list(set(list(data.User)))\n", "products_sampled=list(set(list(data.Product)))\n", "users_sampled.sort()\n", "products_sampled.sort()\n", "\n", "cnt=0\n", "for u in users_sampled:\n", " user_old_to_new[u]=cnt\n", " user_new_to_old[cnt]=u\n", " cnt+=1\n", "cnt=0\n", "for p in products_sampled:\n", " prod_old_to_new[p]=cnt\n", " prod_new_to_old[cnt]=p\n", " cnt+=1\n", " \n", "# remapping original test data\n", "data['User']=data.User.map(lambda u: int(user_old_to_new[u]))\n", "data['Product']=data.Product.map(lambda p: int(prod_old_to_new[p]))\n", "\n", "nusers=len(users_sampled)\n", "nprods=len(products_sampled)\n", "prod_cat=[prod_cat[p] for p in products_sampled]\n", "users=[user_old_to_new[u] for u in users_sampled]\n", "prods=[prod_old_to_new[p] for p in products_sampled]" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# tests per product\n", "cases_per_prod=data.groupby('Product').size().to_dict()\n", "obs_per_prod=[cases_per_prod[p] for p in range(nprods)]\n", "\n", "# tests per user\n", "cases_per_user=data.groupby('User').size().to_dict()\n", "obs_per_user=[cases_per_user[u] for u in range(nusers)]\n", "\n", "# dct which users tested which products\n", "prods_per_user=dict()\n", "for user,df in data.groupby('User'):\n", " prods_per_user[user]=set(list(df.Product))\n", "\n", "# dct which products were tested by each user\n", "users_per_prod=dict()\n", "for product,df in data.groupby('Product'):\n", " users_per_prod[product]=set(list(df.User))\n", " \n", "# split sets: train=0,test=1\n", "sets=[0,1]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "### 3.3 Obtaining the maximal cliques\n", "\n", "Maximal cliques can be obtained efficiently with either igraph or or-tools – here I’ll use igraph:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": true }, 
"outputs": [], "source": [ "# constructing graph of users who tested the same products\n", "g_users=igraph.Graph()\n", "g_users.add_vertices(nusers)\n", "\n", "left_side=data.copy()\n", "right_side=data.copy()\n", "left_side.rename(columns={\"User\":\"User1\"},inplace=True)\n", "right_side.rename(columns={\"User\":\"User2\"},inplace=True)\n", "edges=pd.merge(left_side,right_side,on=\"Product\")\n", "edges=edges.loc[edges.User1\n", "### 3.4 Modeling the problem in or-tools\n", "\n", "Creating the decision variables" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# starting solver\n", "solver = pywraplp.Solver('split',pywraplp.Solver.CBC_MIXED_INTEGER_PROGRAMMING)\n", "\n", "# variables indicating to which set does each user belong\n", "user=[[solver.BoolVar(\"User\"+str(u)+\"Split\"+str(s)) for s in sets] for u in range(nusers)]\n", "\n", "# variables indicating to which set does each product belong\n", "product=[[solver.BoolVar(\"Product\"+str(p)+\"Split\"+str(s)) for s in sets] for p in range(nprods)]\n", "\n", "# variables indicating to which split does each test case belong\n", "obs=[[None for p in prods] for u in users]\n", "for i in data.itertuples():\n", " varname=\"User\"+str(int(i.User))+\"Product\"+str(int(i.Product))\n", " obs[int(i.User)][int(i.Product)]=solver.BoolVar(varname)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now modeling the problem:" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import math\n", "\n", "# minimize repeated users and products\n", "solver.Minimize(solver.Sum(user[u][s] for u in range(nusers) for s in sets)+solver.Sum(product[p][s] for p in range(nprods) for s in sets))\n", "\n", "\n", "# each user and product belongs to at least one set\n", "solver.Add(solver.Sum(user[u][0]+user[u][1] for u in range(nusers))>=1)\n", "solver.Add(solver.Sum(product[p][0]+product[p][1] for p in range(nprods))>=1)\n", "\n", "# users assgined to a set\n", "for u in range(nusers):\n", " solver.Add(obs_per_user[u]*user[u][0] >= solver.Sum(obs[u][p] for p in prods_per_user[u]))\n", " solver.Add(obs_per_user[u]*user[u][1] >= obs_per_user[u]-solver.Sum(obs[u][p] for p in prods_per_user[u]))\n", " \n", "# products assigned to a set\n", "for p in range(nprods):\n", " solver.Add(obs_per_prod[p]*product[p][0] >= solver.Sum(obs[u][p] for u in users_per_prod[p]))\n", " solver.Add(obs_per_prod[p]*product[p][1] >= obs_per_prod[p]-solver.Sum(obs[u][p] for u in users_per_prod[p]))\n", " \n", "# balanced split\n", "min_perc_test=0.2\n", "max_perc_test=0.4\n", "solver.Add(solver.Sum(obs[i.User][i.Product] for i in data.itertuples())-math.floor((1-max_perc_test)*data.shape[0])<=0)\n", "solver.Add(solver.Sum(obs[i.User][i.Product] for i in data.itertuples())-math.ceil(min_perc_test*data.shape[0])>=0)\n", "\n", "# minimum size for each set\n", "min_users_train=25\n", "min_users_test=25\n", "solver.Add(solver.Sum(user[u][0] for u in range(len(users)))>=min_users_train)\n", "solver.Add(solver.Sum(user[u][1] for u in range(len(users)))>=min_users_test)\n", "\n", "min_prods_per_cat=5\n", "for c in range(ncategs):\n", " solver.Add(solver.Sum(product[p][0] for p in range(nprods) if prod_cat[p]==c)>=min_prods_per_cat)\n", " solver.Add(solver.Sum(product[p][1] for p in range(nprods) if prod_cat[p]==c)>=min_prods_per_cat)\n", "\n", "# adding cliques of users\n", "w_cliques_users=[solver.BoolVar(\"W_UserClique\"+str(cl)) for cl in 
range(len(c_users))]\n", "for cl in range(len(c_users)):\n", " C=len(c_users[cl])\n", " solver.Add(solver.Sum(user[u][0] for u in c_users[cl])==w_cliques_users[cl]*C)\n", " solver.Add(solver.Sum(user[u][1] for u in c_users[cl])>=(1-w_cliques_users[cl])*C)\n", " \n", "# adding cliques of products\n", "w_cliques_products=[solver.BoolVar(\"W_ProductClique\"+str(cl)) for cl in range(len(c_products))]\n", "for cl in range(len(c_products)):\n", " C=len(c_products[cl])\n", " solver.Add(solver.Sum(product[p][0] for p in c_products[cl])==w_cliques_products[cl]*C)\n", " solver.Add(solver.Sum(product[p][1] for p in c_products[cl])>=(1-w_cliques_products[cl])*C)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Solving:" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "solver.Solve()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "<a id=\"p35\"></a>\n", "### 3.5 Examining the results" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Proportion of observations in training data: 79.69%\n", "Proportion of observations in test data: 20.31%\n", "\n", "Proportion of users belonging exclusively to train data 65.93%\n", "Proportion of users belonging exclusively to test data 14.29%\n", "Proportion of products belonging exclusively to train data 75.61%\n", "Proportion of products belonging exclusively to test data 23.78%\n" ] } ], "source": [ "# extract results\n", "import re\n", "res_arr=dict()\n", "r0=\"^User([0-9]+)Product([0-9]+)$\"\n", "\n", "# recover each observation's assignment from the variable names and solution values\n", "for u in obs:\n", " for p in u:\n", " if not (p is None):\n", " us,pr=re.match(r0,p.name()).groups()\n", " us=int(us)\n", " pr=int(pr)\n", " res_arr[(us,pr)]=p.solution_value()\n", " \n", "res_arr=pd.DataFrame([(k[0],k[1],v) for k,v in res_arr.items()],columns=['User','Product','Group'])\n", "\n", "users_train=set(list(res_arr.loc[res_arr.Group==0]['User']))\n", "users_test=set(list(res_arr.loc[res_arr.Group==1]['User']))\n", "prods_train=set(list(res_arr.loc[res_arr.Group==0]['Product']))\n", "prods_test=set(list(res_arr.loc[res_arr.Group==1]['Product']))\n", "\n", "users_unique_train=users_train.difference(users_test)\n", "users_unique_test=users_test.difference(users_train)\n", "prods_unique_train=prods_train.difference(prods_test)\n", "prods_unique_test=prods_test.difference(prods_train)\n", "cases_unique_train=data.loc[(data.User.map(lambda x: x in users_unique_train))&(data.Product.map(lambda x: x in prods_unique_train))]\n", "cases_unique_test=data.loc[(data.User.map(lambda x: x in users_unique_test))&(data.Product.map(lambda x: x in prods_unique_test))]\n", "\n", "\n", "print('Proportion of observations in training data:','{:.2%}'.format(res_arr.loc[res_arr.Group==0].shape[0]/data.shape[0]))\n", "print('Proportion of observations in test data:','{:.2%}'.format(res_arr.loc[res_arr.Group==1].shape[0]/data.shape[0]))\n", "print()\n", "print('Proportion of users belonging exclusively to train data','{:.2%}'.format(len(users_unique_train)/nusers))\n", "print('Proportion of users belonging exclusively to test data','{:.2%}'.format(len(users_unique_test)/nusers))\n", "print('Proportion of products belonging exclusively to train data','{:.2%}'.format(len(prods_unique_train)/nprods))\n", "print('Proportion of products belonging exclusively to test data','{:.2%}'.format(len(prods_unique_test)/nprods))" ] },
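 { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, the observations whose user and product both ended up exclusively in one set can be materialized as the actual splits - a minimal sketch, naming them `df_train` and `df_test` here:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# final splits: only observations with no user/product overlap across sets\n", "df_train=cases_unique_train.copy()\n", "df_test=cases_unique_test.copy()\n", "print('Observations in final train set:',df_train.shape[0])\n", "print('Observations in final test set:',df_test.shape[0])" ] },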
{ "cell_type": "markdown", "metadata": {}, "source": [ "The split seems good - now the data is ready for the next phase: building and evaluating recommendation algorithms!" ] } ], "metadata": { "kernelspec": { "display_name": "Python [Root]", "language": "python", "name": "Python [Root]" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.2" } }, "nbformat": 4, "nbformat_minor": 0 }