{ "metadata": { "name": "" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "HW4: Do we really need Chocolate Recommendations?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "###Before You Start\n", "\n", "This is a **long** homework. Please start early. It uses many different (and sometimes complex) concepts, so you may find yourself doing a fair amount of reading; please give yourself plenty of time.\n", "\n", "Also, please see this [link](http://nbviewer.ipython.org/urls/raw.github.com/cs109/content/master/InstructionsForAmazonEMR.ipynb) on getting an Amazon Web Services account soon, so that you don't delay its creation. This class gives you $100 in credits, which you will use for this homework, possibly your project, and any other projects you might like.\n", "\n", "Finally, please go to the labs. The one on 18th October (today) will cover Gibbs Sampling and Bayesian Normal distributions. The one on the 25th will cover Map-Reduce. Both will help with the homework." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "###Collaborative Filtering systems\n", "\n", "In this homework, you will create a recommendation system for **restaurants** using [collaborative filtering](http://en.wikipedia.org/wiki/Collaborative_filtering) (CF). The general structure of a recommendation system is that there are users and there are items. Users express explicit or implicit preferences towards certain items; CF thus relies on users' past behavior.\n", "\n", "There are two primary approaches to CF: neighborhood models and latent factor models. The former compute the relationships between items or between users. The latter posit hidden factors through which users and items are transformed into the same space. 
For example, if you are rating movies, we may transform items into genre factors and users into their preferences for particular genres.\n", "\n", "Factor models generally lead to more accurate recommenders. One reason for this is the sparsity of the item-user matrix: most users rate barely one or two items. Latent factor models are more expressive and fit fewer parameters. However, neighborhood models remain more prevalent, as they are intuitive to users (\"if you liked this, you will like that\") and work well online (a new preference can be incorporated very quickly).\n", "\n", "Most recommenders today combine neighborhood CF with model-based CF and SVD-based matrix factorization approaches.\n", "\n", "To see an example of a simple beer recommender, go [here](http://nbviewer.ipython.org/20a18d52c539b87de2af). This homework is inspired by that one, but we go after food instead and dig deeper into the problem of making recommendations." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### User and Item based approaches\n", "\n", "Original approaches to neighborhood-based CF used user-user models, in which rating estimates are made from the recorded ratings of like-minded users. However, since most users tend to rate very few items, this is usually a losing proposition for explicit-rating-based recommenders. Thus, most neighborhood-based systems these days, such as Amazon's, rely on item-item approaches. In these methods, a rating is estimated from other ratings made by the same user on \"similar\" or \"nearby\" items: we have a K-Nearest-Neighbors algorithm, in effect." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "###Outline of this Homework\n", "\n", "The outline of this homework is as follows:\n", "\n", "1. Create a database of item-item similarities. Use this to implement a neighborhood-based CF recommender that can answer simple questions like \"give me more restaurants like this one\". 
This part of the homework assumes that the calculated similarities make good \"global recommendations\".\n", "\n", "2. In the second part, we go one step further and attempt to predict the rating that a user will give an item they have not seen before. This requires that we find the restaurants that *this* user would rate as similar (not just those which are globally similar).\n", "\n", "3. In the third part, we implement a factor-based CF recommender using a Bayesian model. While quite a bit more complex, this allows us to pool information both about similar users and about similar restaurants.\n", "\n", "4. Finally, we scale up our system, building a recommender along the lines of parts 1 and 2 that works on the entire data set. We will use the map-reduce paradigm to split the computation over multiple machines." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You will start simply, by working on a subset of the restaurant data before generalizing to the entire data set in Problem 4. The complete data set has 150,000 reviews, but we shall start with just about 7,000. You will create this smaller set by taking, from the larger data set, all the users who rated more than 60 restaurants and all the businesses with more than 150 reviews. This is not a random subset: we use it because it is computationally tractable and a bit less sparse than the entire data set." 
] }, { "cell_type": "code", "collapsed": false, "input": [ "%matplotlib inline\n", "from collections import defaultdict\n", "import json\n", "\n", "import numpy as np\n", "import scipy as sp\n", "import matplotlib.pyplot as plt\n", "import pandas as pd\n", "\n", "from matplotlib import rcParams\n", "import matplotlib.cm as cm\n", "import matplotlib as mpl\n", "\n", "#colorbrewer2 Dark2 qualitative color table\n", "dark2_colors = [(0.10588235294117647, 0.6196078431372549, 0.4666666666666667),\n", " (0.8509803921568627, 0.37254901960784315, 0.00784313725490196),\n", " (0.4588235294117647, 0.4392156862745098, 0.7019607843137254),\n", " (0.9058823529411765, 0.1607843137254902, 0.5411764705882353),\n", " (0.4, 0.6509803921568628, 0.11764705882352941),\n", " (0.9019607843137255, 0.6705882352941176, 0.00784313725490196),\n", " (0.6509803921568628, 0.4627450980392157, 0.11372549019607843)]\n", "\n", "rcParams['figure.figsize'] = (10, 6)\n", "rcParams['figure.dpi'] = 150\n", "rcParams['axes.color_cycle'] = dark2_colors\n", "rcParams['lines.linewidth'] = 2\n", "rcParams['axes.facecolor'] = 'white'\n", "rcParams['font.size'] = 14\n", "rcParams['patch.edgecolor'] = 'white'\n", "rcParams['patch.facecolor'] = dark2_colors[0]\n", "rcParams['font.family'] = 'StixGeneral'\n", "\n", "\n", "def remove_border(axes=None, top=False, right=False, left=True, bottom=True):\n", " \"\"\"\n", " Minimize chartjunk by stripping out unnecessary plot borders and axis ticks\n", " \n", " The top/right/left/bottom keywords toggle whether the corresponding plot border is drawn\n", " \"\"\"\n", " ax = axes or plt.gca()\n", " ax.spines['top'].set_visible(top)\n", " ax.spines['right'].set_visible(right)\n", " ax.spines['left'].set_visible(left)\n", " ax.spines['bottom'].set_visible(bottom)\n", " \n", " #turn off all ticks\n", " ax.yaxis.set_ticks_position('none')\n", " ax.xaxis.set_ticks_position('none')\n", " \n", " #now re-enable visible ticks\n", " if top:\n", " ax.xaxis.tick_top()\n", " if bottom:\n", " 
ax.xaxis.tick_bottom()\n", " if left:\n", " ax.yaxis.tick_left()\n", " if right:\n", " ax.yaxis.tick_right()\n", " \n", "pd.set_option('display.width', 500)\n", "pd.set_option('display.max_columns', 100)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 1 }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Description of the data set\n", "\n", "The data set has been extracted from the Yelp Phoenix restaurants dataset. It is available [here](https://dl.dropboxusercontent.com/u/75194/bigdf.csv)." ] }, { "cell_type": "code", "collapsed": false, "input": [ "fulldf=pd.read_csv(\"bigdf.csv\")\n", "fulldf.head(2)" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
 | user_id | business_id | date | review_id | stars | usefulvotes_review | user_name | categories | biz_name | latitude | longitude | business_avg | business_review_count | user_avg | user_review_count |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | rLtl8ZkDX5vH5nAx9C3q5Q | 9yKzy9PApeiPPOUJEtnvkg | 2011-01-26 00:00:00 | fWKvX83p0-ka4JS3dc6E5A | 5 | 5 | Jason | [Breakfast & Brunch, Restaurants] | Morning Glory Cafe | 33.390792 | -112.012504 | 3.87156 | 109 | 3.796954 | 197 |
1 | SBbftLzfYYKItOMFwOTIJg | 9yKzy9PApeiPPOUJEtnvkg | 2008-05-04 00:00:00 | DASdFe-g0BgfN9J2tanStg | 5 | 1 | Jennifer | [Breakfast & Brunch, Restaurants] | Morning Glory Cafe | 33.390792 | -112.012504 | 3.87156 | 109 | 3.473684 | 57 |
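The smaller data set described above can be carved directly out of `fulldf`. Here is a minimal sketch, assuming the column names shown in the `head(2)` output and the thresholds quoted in the text (more than 60 reviews per user, more than 150 per business); the helper name `make_small_subset` is ours, not part of the assignment:

```python
import pandas as pd

def make_small_subset(df):
    # keep rows whose user has rated more than 60 restaurants and whose
    # business has more than 150 reviews (both counts are per-row columns)
    mask = (df.user_review_count > 60) & (df.business_review_count > 150)
    return df[mask]
```

Usage would then be along the lines of `smalldf = make_small_subset(fulldf)`.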
import numpy as np\n",
"\n",
"from mrjob.job import MRJob\n",
"from itertools import combinations, permutations\n",
"\n",
"from scipy.stats import pearsonr\n",
"\n",
"\n",
"class RestaurantSimilarities(MRJob):\n",
"\n",
" def steps(self):\n",
" "the steps in the map-reduce process"\n",
" thesteps = [\n",
" self.mr(mapper=self.line_mapper, reducer=self.users_items_collector),\n",
" self.mr(mapper=self.pair_items_mapper, reducer=self.calc_sim_collector)\n",
" ]\n",
" return thesteps\n",
"\n",
" def line_mapper(self,_,line):\n",
" "this is the complete implementation"\n",
" user_id,business_id,stars,business_avg,user_avg=line.split(',')\n",
" yield user_id, (business_id,stars,business_avg,user_avg)\n",
"\n",
"\n",
" def users_items_collector(self, user_id, values):\n",
" """\n",
" #iterate over the list of tuples yielded in the previous mapper\n",
" #and append them to an array of rating information\n",
" """\n",
" pass\n",
"\n",
"\n",
" def pair_items_mapper(self, user_id, values):\n",
" """\n",
" ignoring the user_id key, take all combinations of business pairs\n",
" and yield as key the pair id, and as value the pair rating information\n",
" """\n",
"        pass #your code here\n",
"\n",
" def calc_sim_collector(self, key, values):\n",
" """\n",
" Pick up the information from the previous yield as shown. Compute\n",
" the pearson correlation and yield the final information as in the\n",
" last line here.\n",
" """\n",
" (rest1, rest2), common_ratings = key, values\n",
"        #your code here\n",
" yield (rest1, rest2), (rho, n_common)\n",
"\n",
"\n",
"#Below MUST be there for things to work\n",
"if __name__ == '__main__':\n",
" RestaurantSimilarities.run()\n",
"
import numpy as np\n",
"\n",
"from mrjob.job import MRJob\n",
"from itertools import combinations\n",
"\n",
"from scipy.stats import pearsonr\n",
"\n",
"class RestaurantSimilarities(MRJob):\n",
"\n",
" def steps(self):\n",
" thesteps = [\n",
" self.mr(mapper=self.line_mapper, reducer=self.users_items_collector),\n",
" self.mr(mapper=self.pair_items_mapper, reducer=self.calc_sim_collector)\n",
" ]\n",
" return thesteps\n",
"\n",
" def line_mapper(self,_,line):\n",
" user_id,business_id,stars,business_avg,user_avg=line.split(',')\n",
" yield user_id, (business_id,stars,business_avg,user_avg)\n",
"\n",
"    def users_items_collector(self, user_id, values):\n",
"        # gather every (business_id, (stars, user_avg)) rating for this user\n",
"        ratings=[]\n",
"        for business_id,stars,business_avg,user_avg in values:\n",
"            ratings.append((business_id,(stars, user_avg)))\n",
"        yield user_id, ratings\n",
"\n",
"    def pair_items_mapper(self, user_id, values):\n",
"        # for each pair of businesses this user rated, emit the pair id in a\n",
"        # consistent order so both orderings reduce to the same key\n",
"        ratings = values\n",
"        for biz1tuple, biz2tuple in combinations(ratings, 2):\n",
"            biz1, biz1r=biz1tuple\n",
"            biz2, biz2r=biz2tuple\n",
"            if biz1 <= biz2 :\n",
"                yield (biz1, biz2), (biz1r, biz2r)\n",
"            else:\n",
"                yield (biz2, biz1), (biz2r, biz1r)\n",
"\n",
"    def calc_sim_collector(self, key, values):\n",
"        # similarity of a restaurant pair: the pearson correlation of the\n",
"        # user-average-centered ratings over the users common to both\n",
"        (rest1, rest2), common_ratings = key, values\n",
"        diff1=[]\n",
"        diff2=[]\n",
"        n_common=0\n",
"\n",
"        for rt1, rt2 in common_ratings:\n",
"            # offset of each rating from that user's average rating\n",
"            diff1.append(float(rt1[0])-float(rt1[1]))\n",
"            diff2.append(float(rt2[0])-float(rt2[1]))\n",
"            n_common=n_common+1\n",
"        if n_common==0:\n",
"            rho=0.\n",
"        else:\n",
"            rho=pearsonr(diff1, diff2)[0]\n",
"            # pearsonr returns nan for constant or too-short inputs\n",
"            if np.isnan(rho):\n",
"                rho=0.\n",
"        yield (rest1, rest2), (rho, n_common)\n",
"\n",
"\n",
"#Below MUST be there for things to work!\n",
"if __name__ == '__main__':\n",
" RestaurantSimilarities.run()\n",
"
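To make the two-step job above concrete, here is a small in-memory sketch of what `calc_sim_collector` computes for a single restaurant pair: each rating is reduced to its offset from that user's average, and the similarity is the Pearson correlation of the two offset vectors. The numbers are toy values, and `np.corrcoef` stands in for `pearsonr`:

```python
import numpy as np

# toy (stars, user_avg) pairs for three users who rated both restaurants,
# in the same shape that pair_items_mapper yields
common_ratings = [
    ((5.0, 3.8), (4.0, 3.8)),
    ((2.0, 3.5), (1.0, 3.5)),
    ((4.0, 4.2), (5.0, 4.2)),
]
# center each rating by the user's average, as in calc_sim_collector
diff1 = [stars - avg for (stars, avg), _ in common_ratings]
diff2 = [stars - avg for _, (stars, avg) in common_ratings]
rho = np.corrcoef(diff1, diff2)[0, 1]  # pearson correlation of the offsets
n_common = len(common_ratings)
```

Both restaurants are rated above or below each user's norm together here, so the correlation comes out positive.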
import numpy as np\n",
"\n",
"from mrjob.job import MRJob\n",
"from itertools import combinations\n",
"\n",
"from scipy.stats import pearsonr\n",
"\n",
"class RestaurantSimilarities(MRJob):\n",
"\n",
" def steps(self):\n",
" thesteps = [\n",
" self.mr(mapper=self.line_mapper, reducer=self.users_items_collector),\n",
" self.mr(mapper=self.pair_items_mapper, reducer=self.calc_sim_collector),\n",
" self.mr(mapper=self.ranking_mapper, reducer=self.top_similar_collector)\n",
" ]\n",
" return thesteps\n",
"\n",
" def line_mapper(self,_,line):\n",
" user_id,business_id,stars,business_avg,user_avg=line.split(',')\n",
" yield user_id, (business_id,stars,business_avg,user_avg)\n",
"\n",
" def users_items_collector(self, user_id, values):\n",
" ratings=[]\n",
" for business_id,stars,business_avg,user_avg in values:\n",
" ratings.append((business_id,(stars, user_avg)))\n",
" yield user_id, ratings\n",
"\n",
" def pair_items_mapper(self, user_id, values):\n",
" ratings = values\n",
" for biz1tuple, biz2tuple in combinations(ratings, 2):\n",
" biz1, biz1r=biz1tuple\n",
" biz2, biz2r=biz2tuple\n",
" if biz1 <= biz2 :\n",
" yield (biz1, biz2), (biz1r, biz2r)\n",
" else:\n",
" yield (biz2, biz1), (biz2r, biz1r)\n",
"\n",
" def calc_sim_collector(self, key, values):\n",
" (rest1, rest2), common_ratings = key, values\n",
" diff1=[]\n",
" diff2=[]\n",
" n_common=0\n",
"\n",
"\n",
" for rt1, rt2 in common_ratings:\n",
" diff1.append(float(rt1[0])-float(rt1[1]))\n",
" diff2.append(float(rt2[0])-float(rt2[1]))\n",
" n_common=n_common+1\n",
" if n_common==0:\n",
" rho=0.\n",
" else:\n",
" rho=pearsonr(diff1, diff2)[0]\n",
" if np.isnan(rho):\n",
" rho=0.\n",
" yield (rest1, rest2), (rho, n_common)\n",
"\n",
"    def ranking_mapper(self, restaurants, values):\n",
"        # re-key each similarity record by its first restaurant\n",
"        sim, n_common = values\n",
"        rest1, rest2 = restaurants\n",
"        if int(n_common) > 0:\n",
"            yield (rest1), (sim, rest2, n_common)\n",
"\n",
"    def top_similar_collector(self, key, values):\n",
"        # emit each restaurant's neighbors, most similar first\n",
"        rest1 = key\n",
"        for sim, rest2, n_common in sorted(values, reverse=True):\n",
"            yield None, (rest1, rest2, sim, n_common)\n",
"\n",
"#Below MUST be there for things to work!\n",
"if __name__ == '__main__':\n",
" RestaurantSimilarities.run()\n",
"
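The extra third step only re-keys and sorts. The same ranking can be sketched in plain Python over a handful of pair similarities; the restaurant names and numbers below are made up for illustration:

```python
# (rho, n_common) keyed by an ordered restaurant pair, as emitted by
# calc_sim_collector; names are hypothetical
pair_sims = {
    ("cafe_a", "cafe_b"): (0.9, 12),
    ("cafe_a", "cafe_c"): (0.4, 30),
    ("cafe_b", "cafe_c"): (-0.2, 8),
}

def top_similar(pair_sims, restaurant, k=2):
    # mirror ranking_mapper + top_similar_collector: collect every pair whose
    # first member is the restaurant, then sort by similarity, highest first
    rows = []
    for (r1, r2), (rho, n_common) in pair_sims.items():
        if n_common > 0 and r1 == restaurant:
            rows.append((rho, r2, n_common))
    return sorted(rows, reverse=True)[:k]
```

As in `ranking_mapper`, only pairs keyed on the restaurant's own id are picked up, which works because `pair_items_mapper` always orders each pair consistently.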