{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# K Nearest Neighbor Worksheet" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We are going to look at the following ratings:\n", "\n", "|Customer | Taylor Swift | Miranda Lambert | Carrie Underwood | Jhené Aiko | Tinashe | Kelela | Randy Wood |\n", "|:-----------|:------:|:------:|:---------:|:------:|:--------:|:------:|:--------:|\n", "| Sarah | 5 | | 5 | 2 | | 2 | |\n", "| Miquel | 2 | | | 4 | 5 | 5 | 1 |\n", "| Tyler | 4 | 4 | 5 | 2 | 1 | 2 | 1 |\n", "| Ann | 2 | 3 | | 5 | | 5 | 1 |\n", "| Jessica | 2 | 1 | | 5 | | | 1 |\n", "| Franklin | 5 | 5 | 5 | | 2 | 3 | 2 |\n", "| Jenny | 4 | 4 | 4 | 1 | 1 | 1 | 5 |\n", "\n", "Spend a minute and look this table over. What do you think will be our highest recommendation for Sarah?\n", "\n", "We will implement this data in Python as follows:" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": false }, "outputs": [], "source": [ "ratings = {'Sarah': {'Taylor Swift': 5, 'Carrie Underwood': 5, 'Jhené Aiko': 2, 'Kelela': 2},\n", " 'Miguel': {'Taylor Swift': 2, 'Jhené Aiko': 4, 'Kelela': 5, 'Tinashe': 5, 'Randy Wood': 1},\n", " 'Tyler': {'Taylor Swift': 4, 'Miranda Lambert': 4, 'Carrie Underwood': 5, 'Jhené Aiko': 2, 'Kelela': 2, 'Tinashe': 1, 'Randy Wood': 1},\n", " 'Ann': {'Taylor Swift': 2, 'Miranda Lambert': 3, 'Jhené Aiko': 5, 'Kelela': 5, 'Randy Wood': 1},\n", " 'Jessica': {'Taylor Swift': 2, 'Miranda Lambert': 1, 'Jhené Aiko': 5, 'Randy Wood': 1},\n", " 'Franklin': {'Taylor Swift': 4, 'Miranda Lambert': 5, 'Carrie Underwood': 5, 'Jhené Aiko': 2, 'Kelela': 3, 'Tinashe': 2, 'Randy Wood': 2},\n", " 'Jenny': {'Taylor Swift': 4, 'Miranda Lambert': 4, 'Carrie Underwood': 4, 'Jhené Aiko': 1, 'Kelela': 1, 'Tinashe': 1, 'Randy Wood': 5}\n", " }" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here is the code we used in lab 1:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def manhattan(rating1, rating2):\n", " \"\"\"Computes the Manhattan distance. Both rating1 and rating2 are dictionaries\n", " of the form {'The Strokes': 3.0, 'Slightly Stoopid': 2.5}\"\"\"\n", " distance = 0\n", " commonRatings = False \n", " for key in rating1:\n", " if key in rating2:\n", " distance += abs(rating1[key] - rating2[key])\n", " commonRatings = True\n", " if commonRatings:\n", " return distance\n", " else:\n", " return -1 #Indicates no ratings in common\n", " \n", "def computeNearestNeighbor(username, users):\n", " \"\"\"creates a sorted list of users based on their distance to username\"\"\"\n", " distances = []\n", " for user in users:\n", " if user != username:\n", " distance = manhattan(users[user], users[username])\n", " distances.append((distance, user))\n", " # sort based on distance -- closest first\n", " distances.sort()\n", " return distances\n", "\n", "def recommend(username, users):\n", " \"\"\"Give list of recommendations\"\"\"\n", " # first find nearest neighbor\n", " nearest = computeNearestNeighbor(username, users)[0][1]\n", "\n", " recommendations = []\n", " # now find bands neighbor rated that user didn't\n", " neighborRatings = users[nearest]\n", " userRatings = users[username]\n", " for artist in neighborRatings:\n", " if not artist in userRatings:\n", " recommendations.append((artist, neighborRatings[artist]))\n", " # using the fn sorted for variety - sort is more efficient\n", " return sorted(recommendations, key=lambda artistTuple: artistTuple[1], reverse = True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### checkpoint 1\n", "What does this system recommend for Sarah?" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So the highest rating for Sarah is the artist you discovered above. Let's examine a different method.\n", "\n", "## What is the difference?\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here is a long code block defining the recommender class:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import codecs\n", "from math import sqrt\n", "\n", "class recommender:\n", "\n", " def __init__(self, data, k=1, metric='pearson', n=5):\n", " \"\"\" initialize recommender\n", " currently, if data is dictionary the recommender is initialized\n", " to it.\n", " For all other data types of data, no initialization occurs\n", " k is the k value for k nearest neighbor\n", " metric is which distance formula to use\n", " n is the maximum number of recommendations to make\"\"\"\n", " self.k = k\n", " self.n = n\n", " self.username2id = {}\n", " self.userid2name = {}\n", " self.productid2name = {}\n", " # for some reason I want to save the name of the metric\n", " self.metric = metric\n", " if self.metric == 'pearson':\n", " self.fn = self.pearson\n", " #\n", " # if data is dictionary set recommender data to it\n", " #\n", " if type(data).__name__ == 'dict':\n", " self.data = data\n", "\n", " def convertProductID2name(self, id):\n", " \"\"\"Given product id number return product name\"\"\"\n", " if id in self.productid2name:\n", " return self.productid2name[id]\n", " else:\n", " return id\n", "\n", "\n", " def userRatings(self, id, n):\n", " \"\"\"Return n top ratings for user with id\"\"\"\n", " print (\"Ratings for \" + self.userid2name[id])\n", " ratings = self.data[id]\n", " print(len(ratings))\n", " ratings = list(ratings.items())\n", " ratings = [(self.convertProductID2name(k), v)\n", " for (k, v) in ratings]\n", " # finally sort and return\n", " ratings.sort(key=lambda artistTuple: artistTuple[1],\n", " reverse = True)\n", " ratings = ratings[:n]\n", " for rating in ratings:\n", " print(\"%s\\t%i\" % (rating[0], rating[1]))\n", " \n", "\n", " \n", "\n", " def loadBookDB(self, path=''):\n", " \"\"\"loads the BX book dataset. Path is where the BX files are\n", " located\"\"\"\n", " self.data = {}\n", " i = 0\n", " #\n", " # First load book ratings into self.data\n", " #\n", " f = codecs.open(path + \"BX-Book-Ratings.csv\", 'r', 'utf8')\n", " for line in f:\n", " i += 1\n", " #separate line into fields\n", " fields = line.split(';')\n", " user = fields[0].strip('\"')\n", " book = fields[1].strip('\"')\n", " rating = int(fields[2].strip().strip('\"'))\n", " if user in self.data:\n", " currentRatings = self.data[user]\n", " else:\n", " currentRatings = {}\n", " currentRatings[book] = rating\n", " self.data[user] = currentRatings\n", " f.close()\n", " #\n", " # Now load books into self.productid2name\n", " # Books contains isbn, title, and author among other fields\n", " #\n", " f = codecs.open(path + \"BX-Books.csv\", 'r', 'utf8')\n", " for line in f:\n", " i += 1\n", " #separate line into fields\n", " fields = line.split(';')\n", " isbn = fields[0].strip('\"')\n", " title = fields[1].strip('\"')\n", " author = fields[2].strip().strip('\"')\n", " title = title + ' by ' + author\n", " self.productid2name[isbn] = title\n", " f.close()\n", " #\n", " # Now load user info into both self.userid2name and\n", " # self.username2id\n", " #\n", " f = codecs.open(path + \"BX-Users.csv\", 'r', 'utf8')\n", " for line in f:\n", " i += 1\n", " #print(line)\n", " #separate line into fields\n", " fields = line.split(';')\n", " userid = fields[0].strip('\"')\n", " location = fields[1].strip('\"')\n", " if len(fields) > 3:\n", " age = fields[2].strip().strip('\"')\n", " else:\n", " age = 'NULL'\n", " if age != 'NULL':\n", " value = location + ' (age: ' + age + ')'\n", " else:\n", " value = location\n", " self.userid2name[userid] = value\n", " self.username2id[location] = userid\n", " f.close()\n", " print(i)\n", " \n", " \n", " def pearson(self, rating1, rating2):\n", " sum_xy = 0\n", " sum_x = 0\n", " sum_y = 0\n", " sum_x2 = 0\n", " sum_y2 = 0\n", " n = 0\n", " for key in rating1:\n", " if key in rating2:\n", " n += 1\n", " x = rating1[key]\n", " y = rating2[key]\n", " sum_xy += x * y\n", " sum_x += x\n", " sum_y += y\n", " sum_x2 += pow(x, 2)\n", " sum_y2 += pow(y, 2)\n", " if n == 0:\n", " return 0\n", " # now compute denominator\n", " denominator = (sqrt(sum_x2 - pow(sum_x, 2) / n)\n", " * sqrt(sum_y2 - pow(sum_y, 2) / n))\n", " if denominator == 0:\n", " return 0\n", " else:\n", " return (sum_xy - (sum_x * sum_y) / n) / denominator\n", "\n", "\n", " def computeNearestNeighbor(self, username):\n", " \"\"\"creates a sorted list of users based on their distance to\n", " username\"\"\"\n", " distances = []\n", " for instance in self.data:\n", " if instance != username:\n", " distance = self.fn(self.data[username],\n", " self.data[instance])\n", " distances.append((instance, distance))\n", " # sort based on distance -- closest first\n", " distances.sort(key=lambda artistTuple: artistTuple[1],\n", " reverse=True)\n", " return distances\n", "\n", " def recommend(self, user):\n", " \"\"\"Give list of recommendations\"\"\"\n", " recommendations = {}\n", " # first get list of users ordered by nearness\n", " nearest = self.computeNearestNeighbor(user)\n", " #\n", " # now get the ratings for the user\n", " #\n", " userRatings = self.data[user]\n", " #\n", " # determine the total distance\n", " totalDistance = 0.0\n", " for i in range(self.k):\n", " totalDistance += nearest[i][1]\n", " # now iterate through the k nearest neighbors\n", " # accumulating their ratings\n", " for i in range(self.k):\n", " # compute slice of pie \n", " weight = nearest[i][1] / totalDistance\n", " # get the name of the person\n", " name = nearest[i][0]\n", " # get the ratings for this person\n", " neighborRatings = self.data[name]\n", " # get the name of the person\n", " # now find bands neighbor rated that user didn't\n", " for artist in neighborRatings:\n", " if not artist in userRatings:\n", " if artist not in recommendations:\n", " recommendations[artist] = (neighborRatings[artist]\n", " * weight)\n", " else:\n", " recommendations[artist] = (recommendations[artist]\n", " + neighborRatings[artist]\n", " * weight)\n", " # now make list from dictionary\n", " recommendations = list(recommendations.items())\n", " recommendations = [(self.convertProductID2name(k), v)\n", " for (k, v) in recommendations]\n", " # finally sort and return\n", " recommendations.sort(key=lambda artistTuple: artistTuple[1],\n", " reverse = True)\n", " # Return the first n items\n", " return recommendations[:self.n]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will look at that in more detail in a bit. First, let's make a simple recommendation. First I am going to make an instance of this recommender:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": true }, "outputs": [], "source": [ "r = recommender(ratings)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's see what we should recommend to Ann:" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[('Tinashe', 2.7284445256913887), ('Carrie Underwood', -0.1773234608330154)]" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "r.recommend('Ann')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Okay, it looks like Ann will really hate Carrie Underwood. \n", "\n", "#### checkpoint 2\n", "\n", "Who is Sarah's nearest neighbor?" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "#### checkpoint 3\n", "\n", "What artist does this system recommend for Sarah?\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### puzzler 1\n", "Is this the same or different than the recommendation made at Checkpoint 1 above?\n", "\n", "If it is different, why is it different? (Look at the code for the recommender class). Need a hint? See the hint section below.\n", "\n", "#### puzzler 2\n", "Do you see a problem with only using the nearest neighbor? What is that problem?\n", "\n", "### the k-nearest neighbor approach\n", "Instead of looking at a single nearest neighbor, we are going to look at the nearest *k* neighbors. For example when *k* equals 3, we will look at the three nearest neighbors.\n", "\n", "The above code already allows for this. Can you create an instance of a recommender that uses the three nearest neighbors?" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": true }, "outputs": [], "source": [ " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### checkpoint 4\n", "\n", "Great. Now who are Sarah's three nearest neighbors\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### checkpoint 5\n", "\n", "What artists do we recommend for Sarah?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### xp \n", "\n", "##### 40xp for showing me all five checkpoints above.\n", "##### extra XP for demoing (should be done on your own or with a partner - no larger groups)\n", "\n", "1. implement Manhattan and Euclidean distance, compare results. 30xp\n", "2. Using the book crossing dataset, add your own ratings and see what the system recommends to you. 30-40xp\n", "\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.4.3" } }, "nbformat": 4, "nbformat_minor": 0 }