{ "metadata": { "name": "" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Developing Recommendation Systems for Users on Yelp-like Networks" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Collaborative Filtering systems\n", "\n", "In this project, we create a recommendation system for **restaurants** using [collaborative filtering](http://en.wikipedia.org/wiki/Collaborative_filtering) (CF). A recommendation system involves two kinds of entities: users and items. Users express explicit or implicit preferences towards certain items, so CF relies on users' past behavior.\n", "\n", "There are two primary approaches to CF: neighborhood models and latent factor models. The former compute relationships between items or between users. The latter posit hidden factors through which users and items are mapped into the same space. For example, if we are rating movies, items may be transformed into genre factors, and users into their preferences for particular genres.\n", "\n", "Factor models generally lead to more accurate recommenders. One reason is the sparsity of the item-user matrix: most users rate barely one or two items, and latent factor models are more expressive while fitting fewer parameters. However, neighborhood models remain more prevalent, as they are intuitive to users (if you liked this, you will like that) and easy to update online (a new preference can be incorporated very quickly).\n", "\n", "Most recommenders today combine neighborhood CF with model-based CF and SVD-based matrix factorization approaches.\n", "\n", "For an example of a simple beer recommender, see [here](http://nbviewer.ipython.org/20a18d52c539b87de2af). This project is inspired by that one, but we go after food instead and dig deeper into the problem of making recommendations." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### User and Item based approaches\n", "\n", "The original approaches to neighborhood-based CF used user-user models, meaning that rating estimates are made from the recorded ratings of like-minded users. However, since most users tend to rate very few items, this is usually a losing proposition for explicit-rating-based recommenders. Thus, most neighborhood-based systems today, such as Amazon's, rely on item-item approaches. In these methods, a rating is estimated from the ratings the same user gave to \"similar\" or \"nearby\" items: we have a K-Nearest-Neighbors algorithm, in effect." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Outline of this Project\n", "\n", "The outline of this project is as follows:\n", "\n", "1. Create a database of item-item similarities. Use this to implement a neighborhood-based CF recommender that can answer simple questions like \"give me more restaurants like this one\". This part of the project assumes that the calculated similarities make good \"global recommendations\".\n", "\n", "2. In the second part, we go one step further and attempt to predict the rating that a user will give an item they have not seen before. This requires that we find the restaurants that *this* user would rate as similar (not just those which are globally similar).\n", "\n", "3. In the third part, we implement a factor-based CF recommender using a Bayesian model. 
While quite a bit more complex, this allows us to pool information both about similar users and about similar restaurants.\n", "\n", "4. We will scale up our system by creating a recommender along the lines of parts 1 and 2 that works on the entire data set. We will use the map-reduce paradigm to split the computation over multiple machines." ] }, { "cell_type": "code", "collapsed": false, "input": [ "%matplotlib inline\n", "from collections import defaultdict\n", "import json\n", "\n", "import numpy as np\n", "import scipy as sp\n", "import matplotlib.pyplot as plt\n", "import pandas as pd\n", "\n", "from matplotlib import rcParams\n", "import matplotlib.cm as cm\n", "import matplotlib as mpl\n", "\n", "#colorbrewer2 Dark2 qualitative color table\n", "dark2_colors = [(0.10588235294117647, 0.6196078431372549, 0.4666666666666667),\n", " (0.8509803921568627, 0.37254901960784315, 0.00784313725490196),\n", " (0.4588235294117647, 0.4392156862745098, 0.7019607843137254),\n", " (0.9058823529411765, 0.1607843137254902, 0.5411764705882353),\n", " (0.4, 0.6509803921568628, 0.11764705882352941),\n", " (0.9019607843137255, 0.6705882352941176, 0.00784313725490196),\n", " (0.6509803921568628, 0.4627450980392157, 0.11372549019607843)]\n", "\n", "rcParams['figure.figsize'] = (10, 6)\n", "rcParams['figure.dpi'] = 150\n", "rcParams['axes.color_cycle'] = dark2_colors\n", "rcParams['lines.linewidth'] = 2\n", "rcParams['axes.facecolor'] = 'white'\n", "rcParams['font.size'] = 14\n", "rcParams['patch.edgecolor'] = 'white'\n", "rcParams['patch.facecolor'] = dark2_colors[0]\n", "rcParams['font.family'] = 'StixGeneral'\n", "\n", "\n", "def remove_border(axes=None, top=False, right=False, left=True, bottom=True):\n", " \"\"\"\n", " Minimize chartjunk by stripping out unnecessary plot borders and axis ticks\n", " \n", " The top/right/left/bottom keywords toggle whether the corresponding plot border is drawn\n", " \"\"\"\n", " ax = axes or plt.gca()\n", " ax.spines['top'].set_visible(top)\n", " ax.spines['right'].set_visible(right)\n", " ax.spines['left'].set_visible(left)\n", " ax.spines['bottom'].set_visible(bottom)\n", " \n", " #turn off all ticks\n", " ax.yaxis.set_ticks_position('none')\n", " ax.xaxis.set_ticks_position('none')\n", " \n", " #now re-enable visibles\n", " if top:\n", " ax.xaxis.tick_top()\n", " if bottom:\n", " ax.xaxis.tick_bottom()\n", " if left:\n", " ax.yaxis.tick_left()\n", " if right:\n", " ax.yaxis.tick_right()\n", " \n", "pd.set_option('display.width', 500)\n", "pd.set_option('display.max_columns', 100)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 111 }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Description of the data set\n", "\n", "The data set has been extracted from the Yelp Phoenix restaurants dataset. It is available [here](https://dl.dropboxusercontent.com/u/75194/bigdf.csv)." ] }, { "cell_type": "code", "collapsed": false, "input": [ "fulldf=pd.read_csv(\"bigdf.csv\")\n", "fulldf.head(2)" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
| | user_id | business_id | date | review_id | stars | usefulvotes_review | user_name | categories | biz_name | latitude | longitude | business_avg | business_review_count | user_avg | user_review_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | rLtl8ZkDX5vH5nAx9C3q5Q | 9yKzy9PApeiPPOUJEtnvkg | 2011-01-26 00:00:00 | fWKvX83p0-ka4JS3dc6E5A | 5 | 5 | Jason | [Breakfast & Brunch, Restaurants] | Morning Glory Cafe | 33.390792 | -112.012504 | 3.87156 | 109 | 3.796954 | 197 |
| 1 | SBbftLzfYYKItOMFwOTIJg | 9yKzy9PApeiPPOUJEtnvkg | 2008-05-04 00:00:00 | DASdFe-g0BgfN9J2tanStg | 5 | 1 | Jennifer | [Breakfast & Brunch, Restaurants] | Morning Glory Cafe | 33.390792 | -112.012504 | 3.87156 | 109 | 3.473684 | 57 |
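Before scaling this computation up, it is worth seeing the similarity calculation in isolation. The helper below is a minimal sketch, not part of the original assignment code: the name `restaurant_similarity` is ours, and it assumes the `fulldf` frame loaded above with the `user_id`, `business_id`, `stars`, and `user_avg` columns shown in the preview. It finds the reviewers two restaurants share, centers each rating on the reviewer's own average, and takes the Pearson correlation of the centered ratings. The MapReduce job that follows computes this same kind of similarity for every pair of co-rated restaurants.

```python
import numpy as np
from scipy.stats import pearsonr

def restaurant_similarity(df, biz1, biz2):
    "Pearson similarity of two restaurants over their common reviewers (illustrative helper)."
    # one row per reviewer for each restaurant, indexed by reviewer
    r1 = df[df.business_id == biz1].drop_duplicates('user_id').set_index('user_id')
    r2 = df[df.business_id == biz2].drop_duplicates('user_id').set_index('user_id')
    common = r1.index.intersection(r2.index)
    if len(common) < 2:   # pearsonr needs at least two common reviewers
        return 0.0, len(common)
    # center each rating on the reviewer's own average to remove per-user bias
    d1 = (r1.loc[common, 'stars'] - r1.loc[common, 'user_avg']).values
    d2 = (r2.loc[common, 'stars'] - r2.loc[common, 'user_avg']).values
    rho = pearsonr(d1, d2)[0]
    return (0.0 if np.isnan(rho) else rho), len(common)

# e.g. rho, n_common = restaurant_similarity(fulldf, biz_id_1, biz_id_2)
```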
"from mrjob.job import MRJob\n",
"from itertools import combinations, permutations\n",
"from operator import itemgetter\n",
"from scipy.stats.stats import pearsonr\n",
"from numpy import isnan\n",
"\n",
"class RestaurantSimilarities(MRJob):\n",
"\n",
"\tdef steps(self):\n",
"\t\t"the steps in the map-reduce process"\n",
"\t\tthesteps = [\n",
"\t\t\tself.mr(mapper=self.line_mapper, reducer=self.users_items_collector),\n",
"\t\t\tself.mr(mapper=self.pair_items_mapper, reducer=self.calc_sim_collector)\n",
"\t\t]\n",
"\t\treturn thesteps\n",
" \n",
"\tdef line_mapper(self,_,line):\n",
" \n",
"\t\tuser_id,business_id,stars,business_avg,user_avg=line.split(',')\n",
"\t\tyield user_id, (business_id,stars,business_avg,user_avg)\n",
"\n",
"\tdef users_items_collector(self, user_id, values):\n",
"\n",
"\t\t# feed values generator into a list comprehension\n",
"\t\tvalues = [val for val in values]\n",
"\t\t# yield new values list along with user_id\n",
"\t\tyield user_id, values\n",
"\n",
"\tdef pair_items_mapper(self, user_id, values):\n",
"\t\n",
"\t\t# initialize keys list\n",
"\t\tkeys = []\n",
"\t\t# feed in vals from generator\n",
"\t\tvals = [v for v in values]\n",
"\t\t# sort vals list\n",
"\t\tvals = sorted(vals, key=itemgetter(0))\n",
"\t\t\n",
"\t\t# pull out keys\n",
"\t\t# NB: we don't really need this step, but it makes for more readable code\n",
"\t\tfor v in vals:\n",
"\t\t\tkeys.append(v[0])\n",
"\t\t\n",
"\t\t# loop through combination pairs, return keys and values\n",
"\t\t# NB: I'm actually not sure what's more readable here - we could just do combinations() on vals itself,\n",
"\t\t# and then chop it up by sub-indexing in the yield statement. I discussed this with Brian Feeny (who helped a lot\n",
"\t\t# on this assignment!) and I ended up implementing in the way he described, mostly for readability's sake.\n",
"\t\tfor k, v in zip(combinations(keys,2), combinations(vals,2)):\n",
"\t\t\tyield k, v\n",
"\n",
"\n",
"\tdef calc_sim_collector(self, key, values):\n",
"\t\n",
"\t\t(rest1, rest2), common_ratings = key, values\n",
"\t\t# this is our n_common counter\n",
"\t\tn_common = 0\n",
"\t\t# lists to catch ratings difference scores for both restaurants\n",
"\t\tr1 = []\n",
"\t\tr2 = []\n",
"\t\t\n",
"\t\t# loop through values, pull out ratings, compute differences, append to lists\n",
"\t\tfor valset in values:\n",
"\t\t\n",
"\t\t\tn_common += 1\n",
"\t\t\n",
"\t\t\trest1_stars = float(valset[0][1])\n",
"\t\t\trest1_user_avg = float(valset[0][3])\n",
"\t\t\trest2_stars = float(valset[1][1])\n",
"\t\t\trest2_user_avg = float(valset[1][3])\n",
"\n",
"\t\t\tdiff1=rest1_stars-rest1_user_avg\n",
"\t\t\tdiff2=rest2_stars-rest2_user_avg\n",
"\n",
"\t\t\tr1.append(diff1)\n",
"\t\t\tr2.append(diff2)\n",
" \n",
" # compute pearson's\n",
"\t\trho = pearsonr(r1, r2)[0]\n",
"\t\t# check for NaN, if so return 0\n",
"\t\trho = rho if (not isnan(rho)) else 0\n",
"\n",
"\t\t# yield rest_id tuple as key, rho/ncom tuple as val\n",
"\t\tyield (rest1, rest2), (rho, n_common)\n",
"\t\t\n",
"#Below MUST be there for things to work\n",
"if __name__ == '__main__':\n",
"\tRestaurantSimilarities.run()\n",
"