{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Applications of Computational Communication\n", "***\n", "***\n", "# Introduction to Recommender Systems\n", "***\n", "***\n", "\n", "Wang Chengjun (王成军)\n", "\n", "wangchengjun@nju.edu.cn\n", "\n", "Computational Communication (计算传播网) http://computational-communication.com" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Programming Collective Intelligence\n", "\n", "> Collective intelligence means combining the behavior, preferences, or ideas of a group of people to create new ideas. It usually relies on clever algorithms (Netflix, Google) or on users who contribute content (Wikipedia).\n", "\n", "Programming collective intelligence emphasizes the former: writing computer programs and building intelligent algorithms that collect and analyze user data to discover new information and even knowledge.\n", "\n", "- Netflix\n", "- Google\n", "- Wikipedia\n", "\n", "Toby Segaran, 2007, Programming Collective Intelligence. O'Reilly. \n", "\n", "https://github.com/computational-class/programming-collective-intelligence-code/blob/master/chapter2/recommendations.py" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Recommender Systems\n", "\n", "- Currently the most common form of intelligent product on the internet.\n", "- The transition from the information age to the attention age:\n", "    - Information overload\n", "    - Scarcity of attention\n", "- The basic task of a recommender system is to connect users with items, helping users quickly find useful information and thereby addressing information overload.\n", "    - For long-tail distributions, it identifies personalized needs and optimizes resource allocation.\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Types of Recommender Systems\n", "- Social recommendation\n", "    - Let friends recommend items.\n", "- Content-based filtering\n", "    - Recommend new items based on the content of items the user has already consumed, for example recommending new films based on the directors and actors of films the user has watched.\n", "- Collaborative filtering\n", "    - Find users whose past interests match those of a given user, then recommend items to that user based on the similarity between these users or between the items they consumed." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Collaborative Filtering Algorithms\n", "\n", "- Neighborhood-based methods\n", "    - User-based filtering\n", "    - Item-based filtering\n", "- Latent factor models\n", "- Random walks on graphs" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Comparing UserCF and ItemCF\n", "\n", "- UserCF is the older approach: it was used in 1992 in Tapestry, a personalized email recommender, and in 1994 in GroupLens for personalized news recommendation, and was later adopted by Digg.\n", "    - Recommends items liked by users who share the individual's interests (group hot spots, social).\n", "    - Reflects how popular items are within the user's small group.\n", "- ItemCF is relatively newer and is used by the e-commerce site Amazon and the DVD rental site Netflix.\n", "    - Recommends items similar to those the user liked before (historical interests, personalized).\n", "    - Reflects the continuity of the user's own interests.\n", "- News updates quickly and the number of items is huge, so similarities change fast and an item-similarity table is hard to maintain; for movies, music, and books it is feasible.\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Evaluating Recommender Systems\n", "- User satisfaction\n", "- Prediction accuracy\n", "\n", "    $r_{ui}$ is the user's actual rating, $\\hat{r}_{ui}$ is the rating predicted by the recommender, and $T$ is the set of test (user, item) pairs, so $|T|$ is the number of measurements.\n", "\n", "    - Root mean squared error (RMSE)\n", "    \n", "        $RMSE = \\sqrt{\\frac{\\sum_{u, i \\in T} (r_{ui} - \\hat{r}_{ui})^2}{|T|}}$\n", "    \n", "    - Mean absolute error (MAE)\n", "    \n", "        $MAE = \\frac{\\sum_{u, i \\in T} \\left| r_{ui} - \\hat{r}_{ui} \\right|}{|T|}$" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "ExecuteTime": { "end_time": "2019-06-15T03:37:50.871325Z", "start_time": "2019-06-15T03:37:50.858051Z" }, "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "# A dictionary of movie critics and their ratings of a small\n", "# set of movies\n", "critics={'Lisa Rose': {'Lady in the Water': 2.5, 'Snakes on a Plane': 3.5,\n", " 'Just My Luck': 3.0, 'Superman Returns': 3.5, 'You, Me and Dupree': 2.5,\n", " 'The Night Listener': 3.0},\n", " 'Gene Seymour': {'Lady in the Water': 3.0, 'Snakes on a Plane': 3.5,\n", " 'Just My Luck': 1.5, 'Superman Returns': 5.0, 'The Night Listener': 3.0,\n", " 'You, Me and Dupree': 3.5},\n", " 'Michael Phillips': {'Lady in the Water': 2.5, 'Snakes on a Plane': 3.0,\n", " 'Superman Returns': 3.5, 'The Night Listener': 4.0},\n", " 'Claudia Puig': {'Snakes on a Plane': 3.5, 'Just My Luck': 3.0,\n", " 'The Night Listener': 4.5, 'Superman Returns': 4.0,\n", " 'You, Me and Dupree': 2.5},\n", " 'Mick LaSalle': {'Lady in the Water': 3.0, 'Snakes on a Plane': 4.0,\n", " 'Just My Luck': 2.0, 'Superman Returns': 3.0, 'The Night Listener': 3.0,\n", " 'You, Me and Dupree': 2.0},\n", " 'Jack Matthews': {'Lady in the Water': 3.0, 'Snakes on a Plane': 4.0,\n", " 'The Night Listener': 3.0, 'Superman 
Returns': 5.0, 'You, Me and Dupree': 3.5},\n", " 'Toby': {'Snakes on a Plane':4.5,'You, Me and Dupree':1.0,'Superman Returns':4.0}}" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "ExecuteTime": { "end_time": "2019-06-15T03:37:53.188222Z", "start_time": "2019-06-15T03:37:53.176776Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "2.5" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "critics['Lisa Rose']['Lady in the Water']" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "ExecuteTime": { "end_time": "2019-06-15T03:38:02.652358Z", "start_time": "2019-06-15T03:38:02.648001Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "4.5" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "critics['Toby']['Snakes on a Plane']" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "ExecuteTime": { "end_time": "2019-06-15T03:38:06.772052Z", "start_time": "2019-06-15T03:38:06.767392Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "{'Snakes on a Plane': 4.5, 'Superman Returns': 4.0, 'You, Me and Dupree': 1.0}" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "critics['Toby']" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "***\n", "# 1. 
User-based filtering\n", "***\n", "## 1.0 Finding similar users" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "ExecuteTime": { "end_time": "2019-06-15T03:38:20.782918Z", "start_time": "2019-06-15T03:38:20.679782Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "3.1622776601683795" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Euclidean distance\n", "import numpy as np\n", "np.sqrt(np.power(5-4, 2) + np.power(4-1, 2))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- This formula calculates the distance, which will be smaller for people who are more similar. \n", "- However, you need a function that gives higher values for people who are similar. \n", "- This can be done by adding 1 to the distance (so you don’t get a division-by-zero error) and inverting it:" ] }, { "cell_type": "code", "execution_count": 49, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "0.2402530733520421" ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "1.0 /(1 + np.sqrt(np.power(5-4, 2) + np.power(4-1, 2)) )" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "ExecuteTime": { "end_time": "2019-06-15T03:39:43.671528Z", "start_time": "2019-06-15T03:39:43.659485Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "# Returns a distance-based similarity score for person1 and person2\n", "def sim_distance(prefs,person1,person2):\n", "    # Get the list of shared_items\n", "    si={}\n", "    for item in prefs[person1]:\n", "        if item in prefs[person2]:\n", "            si[item]=1\n", "    # if they have no ratings in common, return 0\n", "    if len(si)==0: return 0\n", "    # Add up the squares of all the differences\n", "    sum_of_squares=np.sum([np.power(prefs[person1][item]-prefs[person2][item],2)\n", "                           for item in prefs[person1] if item in 
prefs[person2]])\n", "    return 1/(1+np.sqrt(sum_of_squares) )" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "ExecuteTime": { "end_time": "2019-06-15T03:39:50.817110Z", "start_time": "2019-06-15T03:39:50.812396Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "0.3483314773547883" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sim_distance(critics, 'Lisa Rose','Toby')" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Pearson correlation coefficient" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "ExecuteTime": { "end_time": "2019-06-15T03:39:57.893913Z", "start_time": "2019-06-15T03:39:57.860142Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "# Returns the Pearson correlation coefficient for p1 and p2\n", "def sim_pearson(prefs,p1,p2):\n", "    # Get the list of mutually rated items\n", "    si={}\n", "    for item in prefs[p1]:\n", "        if item in prefs[p2]: si[item]=1\n", "    # Find the number of elements\n", "    n=len(si)\n", "    # if there are no ratings in common, return 0\n", "    if n==0: return 0\n", "    # Add up all the preferences\n", "    sum1=np.sum([prefs[p1][it] for it in si])\n", "    sum2=np.sum([prefs[p2][it] for it in si])\n", "    # Sum up the squares\n", "    sum1Sq=np.sum([np.power(prefs[p1][it],2) for it in si])\n", "    sum2Sq=np.sum([np.power(prefs[p2][it],2) for it in si])\n", "    # Sum up the products\n", "    pSum=np.sum([prefs[p1][it]*prefs[p2][it] for it in si])\n", "    # Calculate Pearson score\n", "    num=pSum-(sum1*sum2/n)\n", "    den=np.sqrt((sum1Sq-np.power(sum1,2)/n)*(sum2Sq-np.power(sum2,2)/n))\n", "    if den==0: return 0\n", "    return num/den" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "ExecuteTime": { "end_time": "2019-06-15T03:40:05.801716Z", "start_time": "2019-06-15T03:40:05.797214Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "0.9912407071619299" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sim_pearson(critics, 'Lisa Rose','Toby')" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "ExecuteTime": { "end_time": "2019-06-15T03:40:31.416479Z", "start_time": "2019-06-15T03:40:31.409446Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "# Returns the best matches for person from the prefs dictionary.\n", "# Number of results and similarity function are optional params.\n", "def topMatches(prefs,person,n=5,similarity=sim_pearson):\n", "    scores=[(similarity(prefs,person,other),other)\n", "            for other in prefs if other!=person]\n", "    # Sort the list so the highest scores appear at the top \n", "    scores.sort( )\n", "    scores.reverse( )\n", "    return scores[0:n]" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "ExecuteTime": { "end_time": "2019-06-15T03:40:37.864357Z", "start_time": "2019-06-15T03:40:37.859476Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "[(0.9912407071619299, 'Lisa Rose'),\n", " (0.9244734516419049, 'Mick LaSalle'),\n", " (0.8934051474415647, 'Claudia Puig')]" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "topMatches(critics,'Toby',n=3) # topN" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## 1.1 Recommending Items" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "\n", "\n", "- The five users similar to Toby (Rose, Seymour, Puig, LaSalle, Matthews) and their similarities (0.99, 0.38, 0.89, 0.92, 0.66, respectively)\n", "- The three movies these five users have seen (Night, Lady, Luck) and their ratings\n", "    - For example, Rose rated Night 3.0\n", "- S.xNight is the product of user similarity and movie rating\n", "    - For example, Toby's similarity to Rose (0.99) * Rose's rating of Night (3.0) = 2.97\n", "- Summing these products gives a total score for each movie\n", "    - For example, the total for Night is 12.89 = 2.97+1.14+4.02+2.77+1.99\n", "- The movie total must then be normalized by the sum of user similarities\n", "    - 
For example, the predicted rating for Night is 3.35 = 12.89/3.84\n" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "ExecuteTime": { "end_time": "2019-06-15T03:49:21.867787Z", "start_time": "2019-06-15T03:49:21.838827Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "# Gets recommendations for a person by using a weighted average\n", "# of every other user's rankings\n", "def getRecommendations(prefs,person,similarity=sim_pearson):\n", "    totals={}\n", "    simSums={}\n", "    for other in prefs:\n", "        # don't compare me to myself\n", "        if other==person: continue\n", "        sim=similarity(prefs,person,other)\n", "        # ignore scores of zero or lower\n", "        if sim<=0: continue\n", "        for item in prefs[other]: \n", "            # only score movies I haven't seen yet\n", "            if item not in prefs[person]:# or prefs[person][item]==0:\n", "                # Similarity * Score\n", "                totals.setdefault(item,0)\n", "                totals[item]+=prefs[other][item]*sim\n", "                # Sum of similarities\n", "                simSums.setdefault(item,0)\n", "                simSums[item]+=sim\n", "    # Create the normalized list\n", "    rankings=[(total/simSums[item],item) for item,total in totals.items()]\n", "    # Return the sorted list\n", "    rankings.sort()\n", "    rankings.reverse()\n", "    return rankings" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "ExecuteTime": { "end_time": "2019-06-15T03:49:26.596347Z", "start_time": "2019-06-15T03:49:26.590788Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "[(3.3477895267131013, 'The Night Listener'),\n", " (2.8325499182641614, 'Lady in the Water'),\n", " (2.530980703765565, 'Just My Luck')]" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Now you can find out what movies I should watch next:\n", "getRecommendations(critics,'Toby')" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "ExecuteTime": { "end_time": "2019-06-15T03:50:05.387449Z", "start_time": "2019-06-15T03:50:05.382039Z" }, "slideshow": 
{ "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "[(3.457128694491423, 'The Night Listener'),\n", " (2.778584003814923, 'Lady in the Water'),\n", " (2.422482042361917, 'Just My Luck')]" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# You’ll find that the results are only affected very slightly by the choice of similarity metric.\n", "getRecommendations(critics,'Toby',similarity=sim_distance)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "***\n", "# 2. Item-based filtering\n", "***\n", "\n", "Now you know how to find similar people and recommend products for a given person.\n", "\n", "### But what if you want to see which products are similar to each other? \n", "This is actually the same method we used earlier to determine similarity between people." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "#### Swapping users and items in the preference dictionary" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "ExecuteTime": { "end_time": "2019-06-15T03:50:31.798138Z", "start_time": "2019-06-15T03:50:31.789728Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "# you just need to swap the people and the items. 
\n", "def transformPrefs(prefs):\n", "    result={}\n", "    for person in prefs:\n", "        for item in prefs[person]:\n", "            result.setdefault(item,{})\n", "            # Flip item and person\n", "            result[item][person]=prefs[person][item]\n", "    return result\n", "\n", "movies = transformPrefs(critics)\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "#### Computing item similarity" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "ExecuteTime": { "end_time": "2019-06-15T03:50:47.138142Z", "start_time": "2019-06-15T03:50:47.132763Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "[(0.6579516949597695, 'You, Me and Dupree'),\n", " (0.4879500364742689, 'Lady in the Water'),\n", " (0.11180339887498941, 'Snakes on a Plane'),\n", " (-0.1798471947990544, 'The Night Listener'),\n", " (-0.42289003161103106, 'Just My Luck')]" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "topMatches(movies,'Superman Returns')" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "#### Recommending users for an item" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "ExecuteTime": { "end_time": "2019-06-15T03:51:37.311313Z", "start_time": "2019-06-15T03:51:37.291502Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "[(0.3090169943749474, 'Snakes on a Plane'),\n", " (0.252650308587072, 'The Night Listener'),\n", " (0.2402530733520421, 'Lady in the Water'),\n", " (0.20799159651347807, 'Just My Luck'),\n", " (0.1918253663634734, 'You, Me and Dupree')]" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def calculateSimilarItems(prefs,n=10):\n", "    # Create a dictionary of items showing which other items they\n", "    # are most similar to.\n", "    result={}\n", "    # Invert the preference matrix to be item-centric\n", "    
itemPrefs=transformPrefs(prefs)\n", "    c=0\n", "    for item in itemPrefs:\n", "        # Status updates for large datasets\n", "        c+=1\n", "        if c%100==0: \n", "            print(\"%d / %d\" % (c,len(itemPrefs)))\n", "        # Find the most similar items to this one\n", "        scores=topMatches(itemPrefs,item,n=n,similarity=sim_distance)\n", "        result[item]=scores\n", "    return result\n", "\n", "itemsim=calculateSimilarItems(critics) \n", "itemsim['Superman Returns']" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "\n", "\n", "- Toby has seen three movies (Snakes, Superman, Dupree) and rated them 4.5, 4.0, and 1.0, respectively\n", "- Table 2-3 gives the similarity between these three movies and three other movies\n", "    - For example, the similarity between Superman and Night is 0.103\n", "- R.xNight is the product of Toby's rating of a movie he has seen and that movie's similarity to Night\n", "    - For example, 0.818 = 4.5*0.182\n", " \n", " \n", "- Toby's weighted score for Night is therefore 0.818+0.412+0.148 = 1.378\n", "    - The similarities to Night sum to 0.182+0.103+0.148 = 0.433\n", "    - Toby's final predicted rating for Night is therefore 1.378/0.433 = 3.183\n" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "ExecuteTime": { "end_time": "2019-06-15T03:59:45.629649Z", "start_time": "2019-06-15T03:59:45.598310Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "[(3.1667425234070894, 'The Night Listener'),\n", " (2.9366294028444346, 'Just My Luck'),\n", " (2.868767392626467, 'Lady in the Water')]" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def getRecommendedItems(prefs,itemMatch,user):\n", "    userRatings=prefs[user]\n", "    scores={}\n", "    totalSim={}\n", "    # Loop over items rated by this user\n", "    for (item,rating) in userRatings.items( ):\n", "        # Loop over items similar to this one\n", "        for (similarity,item2) in itemMatch[item]:\n", "            # Ignore if this user has already rated this item\n", "            if item2 in userRatings: continue\n", "            # Weighted sum of rating times similarity\n", "            scores.setdefault(item2,0)\n", "            scores[item2]+=similarity*rating\n", "            # Sum of all the similarities\n", "            
totalSim.setdefault(item2,0)\n", "            totalSim[item2]+=similarity\n", "    # Divide each total score by total weighting to get an average\n", "    rankings=[(score/totalSim[item],item) for item,score in scores.items( )]\n", "    # Return the rankings from highest to lowest\n", "    rankings.sort( )\n", "    rankings.reverse( )\n", "    return rankings\n", "\n", "getRecommendedItems(critics,itemsim,'Toby')" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "ExecuteTime": { "end_time": "2019-06-15T03:59:58.539274Z", "start_time": "2019-06-15T03:59:58.534208Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "[(4.0, 'Michael Phillips'), (3.0, 'Jack Matthews')]" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "getRecommendations(movies,'Just My Luck')" ] }, { "cell_type": "code", "execution_count": 84, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "[(3.1637361366111816, 'Michael Phillips')]" ] }, "execution_count": 84, "metadata": {}, "output_type": "execute_result" } ], "source": [ "getRecommendations(movies, 'You, Me and Dupree')" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "\n", "\n", "# A Network Representation of Item-based Collaborative Filtering" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Graph-based Models\n", "\n", "User behavior can be represented as a bipartite graph, so graph-based algorithms can be applied to recommender systems.\n", "\n", "" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "ExecuteTime": { "end_time": "2019-06-15T06:26:46.988001Z", "start_time": "2019-06-15T06:26:46.974911Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "# https://github.com/ParticleWave/RecommendationSystemStudy/blob/d1960056b96cfaad62afbfe39225ff680240d37e/PersonalRank.py\n", "import os\n", "import random\n", "\n", "class Graph:\n", "    def __init__(self):\n", "        self.G = dict()\n", "    \n", "    def addEdge(self, p, q):\n", "        
if p not in self.G: self.G[p] = dict()\n", "        if q not in self.G: self.G[q] = dict()\n", "        self.G[p][q] = 1\n", "        self.G[q][p] = 1\n", "\n", "    def getGraphMatrix(self):\n", "        return self.G" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "ExecuteTime": { "end_time": "2019-06-15T06:26:57.557206Z", "start_time": "2019-06-15T06:26:57.546827Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "dict_keys(['A', 'd', 'b', 'C', 'a', 'c', 'B'])\n" ] } ], "source": [ "graph = Graph()\n", "graph.addEdge('A', 'a')\n", "graph.addEdge('A', 'c')\n", "graph.addEdge('B', 'a')\n", "graph.addEdge('B', 'b')\n", "graph.addEdge('B', 'c')\n", "graph.addEdge('B', 'd')\n", "graph.addEdge('C', 'c')\n", "graph.addEdge('C', 'd')\n", "G = graph.getGraphMatrix()\n", "print(G.keys())" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "ExecuteTime": { "end_time": "2019-06-15T06:27:06.285484Z", "start_time": "2019-06-15T06:27:06.280821Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "{'A': {'a': 1, 'c': 1},\n", " 'B': {'a': 1, 'b': 1, 'c': 1, 'd': 1},\n", " 'C': {'c': 1, 'd': 1},\n", " 'a': {'A': 1, 'B': 1},\n", " 'b': {'B': 1},\n", " 'c': {'A': 1, 'B': 1, 'C': 1},\n", " 'd': {'B': 1, 'C': 1}}" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "G" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "ExecuteTime": { "end_time": "2019-06-15T06:32:02.195675Z", "start_time": "2019-06-15T06:32:02.189744Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "A a 1\n", "A c 1\n", "d C 1\n", "d B 1\n", "b B 1\n", "C d 1\n", "C c 1\n", "a B 1\n", "a A 1\n", "c C 1\n", "c B 1\n", "c A 1\n", "B a 1\n", "B d 1\n", "B c 1\n", "B b 1\n" ] } ], "source": [ "for i, ri in G.items():\n", "    for j, wij in ri.items():\n", "        print(i, j, wij)" ] }, { "cell_type": "code", "execution_count": null, "metadata": 
{}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 27, "metadata": { "ExecuteTime": { "end_time": "2019-06-15T06:34:47.321257Z", "start_time": "2019-06-15T06:34:47.301108Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "def PersonalRank(G, alpha, root, max_step):\n", "    # G is the bipartite graph of users' ratings on items\n", "    # alpha is the probability of continuing the random walk;\n", "    # 1 - alpha is the probability of restarting from root\n", "    # root is the user under study\n", "    # max_step is the number of iterations\n", "    rank = dict()\n", "    rank = {x:0.0 for x in G.keys()}\n", "    rank[root] = 1.0\n", "    for k in range(max_step):\n", "        tmp = {x:0.0 for x in G.keys()}\n", "        for i,ri in G.items():\n", "            for j,wij in ri.items():\n", "                if j not in tmp: tmp[j] = 0.0 #\n", "                tmp[j] += alpha * rank[i] / (len(ri)*1.0)\n", "                if j == root: tmp[j] += 1.0 - alpha\n", "        rank = tmp\n", "        print(k, rank)\n", "    return rank" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "ExecuteTime": { "end_time": "2019-06-15T06:35:06.747759Z", "start_time": "2019-06-15T06:35:06.740862Z" }, "scrolled": true, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0 {'c': 0.4, 'd': 0.0, 'b': 0.0, 'C': 0.0, 'a': 0.4, 'A': 0.3999999999999999, 'B': 0.0}\n", "1 {'c': 0.15999999999999998, 'd': 0.0, 'b': 0.0, 'C': 0.10666666666666669, 'a': 0.15999999999999998, 'A': 0.6666666666666666, 'B': 0.2666666666666667}\n", "2 {'c': 0.3626666666666667, 'd': 0.09600000000000003, 'b': 0.053333333333333344, 'C': 0.04266666666666666, 'a': 0.32, 'A': 0.5066666666666666, 'B': 0.10666666666666665}\n", "3 {'c': 0.24106666666666665, 'd': 0.03839999999999999, 'b': 0.02133333333333333, 'C': 0.13511111111111113, 'a': 0.22399999999999998, 'A': 0.624711111111111, 'B': 0.3057777777777778}\n", "4 {'c': 0.36508444444444443, 'd': 0.11520000000000002, 'b': 0.06115555555555557, 'C': 0.07964444444444443, 'a': 0.31104, 'A': 0.5538844444444444, 'B': 0.1863111111111111}\n", 
"5 {'c': 0.29067377777777775, 'd': 0.06911999999999999, 'b': 0.03726222222222222, 'C': 0.14343585185185187, 'a': 0.258816, 'A': 0.6217718518518518, 'B': 0.31677629629629633}\n", "6 {'c': 0.3694383407407408, 'd': 0.12072960000000002, 'b': 0.06335525925925926, 'C': 0.1051610074074074, 'a': 0.312064, 'A': 0.5810394074074073, 'B': 0.2384971851851852}\n", "7 {'c': 0.322179602962963, 'd': 0.08976384000000001, 'b': 0.047699437037037044, 'C': 0.14680873086419757, 'a': 0.2801152, 'A': 0.6233424908641975, 'B': 0.322318538271605}\n", "8 {'c': 0.37252419634567907, 'd': 0.12318720000000004, 'b': 0.06446370765432101, 'C': 0.12182009679012348, 'a': 0.313800704, 'A': 0.5979606407901235, 'B': 0.27202572641975314}\n", "9 {'c': 0.34231744031604944, 'd': 0.10313318400000002, 'b': 0.05440514528395063, 'C': 0.14861466569218112, 'a': 0.2935894016, 'A': 0.624860067292181, 'B': 0.3257059134156379}\n", "10 {'c': 0.37453107587687245, 'd': 0.12458704896000004, 'b': 0.06514118268312759, 'C': 0.13253792435094652, 'a': 0.3150852096, 'A': 0.6087204113909463, 'B': 0.29349780121810704}\n", "11 {'c': 0.35520289454037857, 'd': 0.11171472998400003, 'b': 0.05869956024362141, 'C': 0.14970977315116601, 'a': 0.3021877248, 'A': 0.6259090374071659, 'B': 0.3278568031376681}\n", "12 {'c': 0.3758188848508664, 'd': 0.12545526988800004, 'b': 0.06557136062753362, 'C': 0.1394066638710343, 'a': 0.31593497559039996, 'A': 0.6155958617974342, 'B': 0.3072414019859314}\n", "13 {'c': 0.3634492906645737, 'd': 0.1172109459456, 'b': 0.061448280397186285, 'C': 0.1504004772487644, 'a': 0.30768662511615996, 'A': 0.6265923595297243, 'B': 0.3292315559869513}\n", "14 {'c': 0.37664344590878573, 'd': 0.126006502096896, 'b': 0.06584631119739026, 'C': 0.14380418922212634, 'a': 0.31648325500928, 'A': 0.6199944608903503, 'B': 0.3160374635863394}\n", "15 {'c': 0.36872695276225853, 'd': 0.12072916840611841, 'b': 0.06320749271726787, 'C': 0.1508408530811013, 'a': 0.311205277073408, 'A': 0.6270315542460547, 'B': 0.3301112040427255}\n", "16 
{'c': 0.3771712037394075, 'd': 0.12635858204098563, 'b': 0.0660222408085451, 'C': 0.14661885476571632, 'a': 0.316834862506967, 'A': 0.6228092982326321, 'B': 0.3216669597688938}\n", "17 {'c': 0.37210465315311814, 'd': 0.1229809338600653, 'b': 0.06433339195377877, 'C': 0.15112242048023627, 'a': 0.3134571112468316, 'A': 0.6273129326666287, 'B': 0.3306741581298592}\n", "18 {'c': 0.3775089728847178, 'd': 0.12658379981806633, 'b': 0.06613483162597183, 'C': 0.1484202810515243, 'a': 0.3170600046926233, 'A': 0.6246107520062307, 'B': 0.32526983911327995}\n", "19 {'c': 0.37426638104575805, 'd': 0.1244220802432657, 'b': 0.06505396782265599, 'C': 0.15130257936315128, 'a': 0.3148982686251483, 'A': 0.627493061312974, 'B': 0.3310344465409781}\n" ] }, { "data": { "text/plain": [ "{'A': 0.627493061312974,\n", " 'B': 0.3310344465409781,\n", " 'C': 0.15130257936315128,\n", " 'a': 0.3148982686251483,\n", " 'b': 0.06505396782265599,\n", " 'c': 0.37426638104575805,\n", " 'd': 0.1244220802432657}" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "PersonalRank(G, 0.8, 'A', 20)\n", "# print(PersonalRank(G, 0.8, 'B', 20))\n", "# print(PersonalRank(G, 0.8, 'C', 20))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "***\n", "# 3. 
MovieLens Recommender\n", "***\n", "MovieLens is a real-world dataset of movie ratings, developed by the GroupLens research group at the University of Minnesota.\n", "\n", "### Downloading the data\n", "http://grouplens.org/datasets/movielens/1m/\n", "\n", "> These files contain 1,000,209 anonymous ratings of approximately 3,900 movies \n", "made by 6,040 MovieLens users who joined MovieLens in 2000.\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "# Data format\n", "All ratings are contained in the file \"ratings.dat\" and are in the following format:\n", "\n", "UserID::MovieID::Rating::Timestamp\n", "\n", "1::1193::5::978300760\n", "\n", "1::661::3::978302109\n", "\n", "1::914::3::978301968\n" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "ExecuteTime": { "end_time": "2019-06-15T06:38:51.401130Z", "start_time": "2019-06-15T06:38:51.387300Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "def loadMovieLens(path='/Users/datalab/bigdata/cjc/ml-1m/'):\n", "    # Get movie titles\n", "    movies={}\n", "    for line in open(path+'movies.dat', encoding = 'iso-8859-15'):\n", "        (id,title)=line.split('::')[0:2]\n", "        movies[id]=title\n", "    \n", "    # Load data\n", "    prefs={}\n", "    for line in open(path+'ratings.dat'):\n", "        (user,movieid,rating,ts)=line.split('::')\n", "        prefs.setdefault(user,{})\n", "        prefs[user][movies[movieid]]=float(rating)\n", "    return prefs" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "ExecuteTime": { "end_time": "2019-06-15T06:38:57.759710Z", "start_time": "2019-06-15T06:38:56.267827Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "{'Alice in Wonderland (1951)': 1.0,\n", " 'Army of Darkness (1993)': 3.0,\n", " 'Bad Boys (1995)': 5.0,\n", " 'Benji (1974)': 1.0,\n", " 'Brady Bunch Movie, The (1995)': 1.0,\n", " 'Braveheart (1995)': 5.0,\n", " 'Buffalo 66 (1998)': 1.0,\n", " 'Chambermaid on the Titanic, The (1998)': 1.0,\n", " 'Cowboy Way, The (1994)': 1.0,\n", " 'Cyrano de Bergerac (1990)': 
4.0,\n", " 'Dear Diary (Caro Diario) (1994)': 1.0,\n", " 'Die Hard (1988)': 3.0,\n", " 'Diebinnen (1995)': 1.0,\n", " 'Dr. No (1962)': 1.0,\n", " 'Escape from the Planet of the Apes (1971)': 1.0,\n", " 'Fast, Cheap & Out of Control (1997)': 1.0,\n", " 'Faster Pussycat! Kill! Kill! (1965)': 1.0,\n", " 'From Russia with Love (1963)': 1.0,\n", " 'Fugitive, The (1993)': 5.0,\n", " 'Get Shorty (1995)': 1.0,\n", " 'Gladiator (2000)': 5.0,\n", " 'Goldfinger (1964)': 5.0,\n", " 'Good, The Bad and The Ugly, The (1966)': 4.0,\n", " 'Hunt for Red October, The (1990)': 5.0,\n", " 'Hurricane, The (1999)': 5.0,\n", " 'Indiana Jones and the Last Crusade (1989)': 4.0,\n", " 'Jaws (1975)': 5.0,\n", " 'Jurassic Park (1993)': 5.0,\n", " 'King Kong (1933)': 1.0,\n", " 'King of New York (1990)': 1.0,\n", " 'Last of the Mohicans, The (1992)': 1.0,\n", " 'Lethal Weapon (1987)': 5.0,\n", " 'Longest Day, The (1962)': 1.0,\n", " 'Man with the Golden Gun, The (1974)': 5.0,\n", " 'Mask of Zorro, The (1998)': 5.0,\n", " 'Matrix, The (1999)': 5.0,\n", " \"On Her Majesty's Secret Service (1969)\": 1.0,\n", " 'Out of Sight (1998)': 1.0,\n", " 'Palookaville (1996)': 1.0,\n", " 'Planet of the Apes (1968)': 1.0,\n", " 'Pope of Greenwich Village, The (1984)': 1.0,\n", " 'Princess Bride, The (1987)': 3.0,\n", " 'Raiders of the Lost Ark (1981)': 4.0,\n", " 'Rock, The (1996)': 5.0,\n", " 'Rocky (1976)': 5.0,\n", " 'Saving Private Ryan (1998)': 4.0,\n", " 'Shanghai Noon (2000)': 1.0,\n", " 'Speed (1994)': 1.0,\n", " 'Star Wars: Episode IV - A New Hope (1977)': 5.0,\n", " 'Star Wars: Episode V - The Empire Strikes Back (1980)': 5.0,\n", " 'Taking of Pelham One Two Three, The (1974)': 1.0,\n", " 'Terminator 2: Judgment Day (1991)': 5.0,\n", " 'Terminator, The (1984)': 4.0,\n", " 'Thelma & Louise (1991)': 1.0,\n", " 'True Romance (1993)': 1.0,\n", " 'U-571 (2000)': 5.0,\n", " 'Untouchables, The (1987)': 5.0,\n", " 'Westworld (1973)': 1.0,\n", " 'X-Men (2000)': 4.0}" ] }, "execution_count": 30, "metadata": 
{}, "output_type": "execute_result" } ], "source": [ "prefs=loadMovieLens()\n", "prefs['87']" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### user-based filtering" ] }, { "cell_type": "code", "execution_count": 41, "metadata": { "ExecuteTime": { "end_time": "2018-05-05T03:13:11.081668Z", "start_time": "2018-05-05T03:13:09.528867Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "[(5.0, 'Time of the Gypsies (Dom za vesanje) (1989)'),\n", " (5.0, 'Tigrero: A Film That Was Never Made (1994)'),\n", " (5.0, 'Schlafes Bruder (Brother of Sleep) (1995)'),\n", " (5.0, 'Return with Honor (1998)'),\n", " (5.0, 'Lured (1947)'),\n", " (5.0, 'Identification of a Woman (Identificazione di una donna) (1982)'),\n", " (5.0, 'I Am Cuba (Soy Cuba/Ya Kuba) (1964)'),\n", " (5.0, 'Hour of the Pig, The (1993)'),\n", " (5.0, 'Gay Deceivers, The (1969)'),\n", " (5.0, 'Gate of Heavenly Peace, The (1995)'),\n", " (5.0, 'Foreign Student (1994)'),\n", " (5.0, 'Dingo (1992)'),\n", " (5.0, 'Dangerous Game (1993)'),\n", " (5.0, 'Callejón de los milagros, El (1995)'),\n", " (5.0, 'Bittersweet Motel (2000)'),\n", " (4.820460101722989, 'Apple, The (Sib) (1998)'),\n", " (4.738956184936386, 'Lamerica (1994)'),\n", " (4.681816541467396, 'Bells, The (1926)'),\n", " (4.664958072522234, 'Hurricane Streets (1998)'),\n", " (4.650741840804562, 'Sanjuro (1962)'),\n", " (4.649974172600346, 'On the Ropes (1999)'),\n", " (4.636825408739504, 'Shawshank Redemption, The (1994)'),\n", " (4.627888709544556, 'For All Mankind (1989)'),\n", " (4.582048349280509, 'Midaq Alley (Callejón de los milagros, El) (1995)'),\n", " (4.579778646871153, \"Schindler's List (1993)\"),\n", " (4.57519994103739,\n", " 'Seven Samurai (The Magnificent Seven) (Shichinin no samurai) (1954)'),\n", " (4.574904988403455, 'Godfather, The (1972)'),\n", " (4.5746840191882345, \"Ed's Next Move (1996)\"),\n", " (4.558519037147828, 'Hanging Garden, The 
(1997)'),\n", " (4.527760042775589, 'Close Shave, A (1995)')]" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "getRecommendations(prefs,'87')[0:30]" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Item-based filtering" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "start_time": "2018-05-05T03:14:13.060Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "100 / 3706\n", "200 / 3706\n", "300 / 3706\n", "400 / 3706\n", "500 / 3706\n", "600 / 3706\n", "700 / 3706\n", "800 / 3706\n", "900 / 3706\n", "1000 / 3706\n" ] } ], "source": [ "itemsim=calculateSimilarItems(prefs,n=50)" ] }, { "cell_type": "code", "execution_count": 95, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "[(5.0, 'Uninvited Guest, An (2000)'),\n", " (5.0, 'Two Much (1996)'),\n", " (5.0, 'Two Family House (2000)'),\n", " (5.0, 'Trial by Jury (1994)'),\n", " (5.0, 'Tom & Viv (1994)'),\n", " (5.0, 'This Is My Father (1998)'),\n", " (5.0, 'Something to Sing About (1937)'),\n", " (5.0, 'Slappy and the Stinkers (1998)'),\n", " (5.0, 'Running Free (2000)'),\n", " (5.0, 'Roula (1995)'),\n", " (5.0, 'Prom Night IV: Deliver Us From Evil (1992)'),\n", " (5.0, 'Project Moon Base (1953)'),\n", " (5.0, 'Price Above Rubies, A (1998)'),\n", " (5.0, 'Open Season (1996)'),\n", " (5.0, 'Only Angels Have Wings (1939)'),\n", " (5.0, 'Onegin (1999)'),\n", " (5.0, 'Once Upon a Time... 
When We Were Colored (1995)'),\n", " (5.0, 'Office Killer (1997)'),\n", " (5.0, 'N\\xe9nette et Boni (1996)'),\n", " (5.0, 'No Looking Back (1998)'),\n", " (5.0, 'Never Met Picasso (1996)'),\n", " (5.0, 'Music From Another Room (1998)'),\n", " (5.0, \"Mummy's Tomb, The (1942)\"),\n", " (5.0, 'Modern Affair, A (1995)'),\n", " (5.0, 'Machine, The (1994)'),\n", " (5.0, 'Lured (1947)'),\n", " (5.0, 'Low Life, The (1994)'),\n", " (5.0, 'Lodger, The (1926)'),\n", " (5.0, 'Loaded (1994)'),\n", " (5.0, 'Line King: Al Hirschfeld, The (1996)')]" ] }, "execution_count": 95, "metadata": {}, "output_type": "execute_result" } ], "source": [ "getRecommendedItems(prefs,itemsim,'87')[0:30]" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Building a Recommendation System with Turi Create" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "In this notebook we will import Turi Create and use it to\n", "\n", "- train two models that can be used for recommending new courses to users \n", "- compare the performance of the two models\n" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "ExecuteTime": { "end_time": "2019-06-15T06:40:29.380065Z", "start_time": "2019-06-15T06:40:27.941331Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "%matplotlib inline\n", "import turicreate as tc\n", "import matplotlib.pyplot as plt\n" ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "ExecuteTime": { "end_time": "2019-06-15T06:40:42.258487Z", "start_time": "2019-06-15T06:40:42.238423Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/html": [ "
<div><table><tr><th>item_id</th><th>rating</th><th>user_id</th></tr><tr><td>a</td><td>1</td><td>0</td></tr><tr><td>b</td><td>3</td><td>0</td></tr><tr><td>c</td><td>2</td><td>0</td></tr><tr><td>a</td><td>5</td><td>1</td></tr><tr><td>b</td><td>4</td><td>1</td></tr><tr><td>b</td><td>1</td><td>2</td></tr><tr><td>c</td><td>4</td><td>2</td></tr><tr><td>d</td><td>3</td><td>2</td></tr></table>[8 rows x 3 columns]</div>
" ], "text/plain": [ "Columns:\n", "\titem_id\tstr\n", "\trating\tint\n", "\tuser_id\tstr\n", "\n", "Rows: 8\n", "\n", "Data:\n", "+---------+--------+---------+\n", "| item_id | rating | user_id |\n", "+---------+--------+---------+\n", "| a | 1 | 0 |\n", "| b | 3 | 0 |\n", "| c | 2 | 0 |\n", "| a | 5 | 1 |\n", "| b | 4 | 1 |\n", "| b | 1 | 2 |\n", "| c | 4 | 2 |\n", "| d | 3 | 2 |\n", "+---------+--------+---------+\n", "[8 rows x 3 columns]" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sf = tc.SFrame({'user_id': [\"0\", \"0\", \"0\", \"1\", \"1\", \"2\", \"2\", \"2\"],\n", " 'item_id': [\"a\", \"b\", \"c\", \"a\", \"b\", \"b\", \"c\", \"d\"],\n", " 'rating': [1, 3, 2, 5, 4, 1, 4, 3]})\n", "sf" ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "ExecuteTime": { "end_time": "2019-06-15T06:40:49.395062Z", "start_time": "2019-06-15T06:40:48.313559Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/html": [ "
Preparing data set.
" ], "text/plain": [ "Preparing data set." ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
    Data has 8 observations with 3 users and 4 items.
" ], "text/plain": [ " Data has 8 observations with 3 users and 4 items." ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
    Data prepared in: 0.003196s
" ], "text/plain": [ " Data prepared in: 0.003196s" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Training ranking_factorization_recommender for recommendations.
" ], "text/plain": [ "Training ranking_factorization_recommender for recommendations." ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
+--------------------------------+--------------------------------------------------+----------+
" ], "text/plain": [ "+--------------------------------+--------------------------------------------------+----------+" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| Parameter                      | Description                                      | Value    |
" ], "text/plain": [ "| Parameter | Description | Value |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
+--------------------------------+--------------------------------------------------+----------+
" ], "text/plain": [ "+--------------------------------+--------------------------------------------------+----------+" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| num_factors                    | Factor Dimension                                 | 32       |
" ], "text/plain": [ "| num_factors | Factor Dimension | 32 |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| regularization                 | L2 Regularization on Factors                     | 1e-09    |
" ], "text/plain": [ "| regularization | L2 Regularization on Factors | 1e-09 |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| solver                         | Solver used for training                         | sgd      |
" ], "text/plain": [ "| solver | Solver used for training | sgd |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| linear_regularization          | L2 Regularization on Linear Coefficients         | 1e-09    |
" ], "text/plain": [ "| linear_regularization | L2 Regularization on Linear Coefficients | 1e-09 |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| ranking_regularization         | Rank-based Regularization Weight                 | 0.25     |
" ], "text/plain": [ "| ranking_regularization | Rank-based Regularization Weight | 0.25 |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| max_iterations                 | Maximum Number of Iterations                     | 25       |
" ], "text/plain": [ "| max_iterations | Maximum Number of Iterations | 25 |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
+--------------------------------+--------------------------------------------------+----------+
" ], "text/plain": [ "+--------------------------------+--------------------------------------------------+----------+" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
  Optimizing model using SGD; tuning step size.
" ], "text/plain": [ " Optimizing model using SGD; tuning step size." ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
  Using 8 / 8 points for tuning the step size.
" ], "text/plain": [ " Using 8 / 8 points for tuning the step size." ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
+---------+-------------------+------------------------------------------+
" ], "text/plain": [ "+---------+-------------------+------------------------------------------+" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| Attempt | Initial Step Size | Estimated Objective Value                |
" ], "text/plain": [ "| Attempt | Initial Step Size | Estimated Objective Value |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
+---------+-------------------+------------------------------------------+
" ], "text/plain": [ "+---------+-------------------+------------------------------------------+" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| 0       | 25                | Not Viable                               |
" ], "text/plain": [ "| 0 | 25 | Not Viable |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| 1       | 6.25              | Not Viable                               |
" ], "text/plain": [ "| 1 | 6.25 | Not Viable |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| 2       | 1.5625            | Not Viable                               |
" ], "text/plain": [ "| 2 | 1.5625 | Not Viable |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| 3       | 0.390625          | 2.79772                                  |
" ], "text/plain": [ "| 3 | 0.390625 | 2.79772 |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| 4       | 0.195312          | 2.74043                                  |
" ], "text/plain": [ "| 4 | 0.195312 | 2.74043 |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| 5       | 0.0976562         | 2.79932                                  |
" ], "text/plain": [ "| 5 | 0.0976562 | 2.79932 |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| 6       | 0.0488281         | 3.00656                                  |
" ], "text/plain": [ "| 6 | 0.0488281 | 3.00656 |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| 7       | 0.0244141         | 3.28278                                  |
" ], "text/plain": [ "| 7 | 0.0244141 | 3.28278 |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
+---------+-------------------+------------------------------------------+
" ], "text/plain": [ "+---------+-------------------+------------------------------------------+" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| Final   | 0.195312          | 2.74043                                  |
" ], "text/plain": [ "| Final | 0.195312 | 2.74043 |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
+---------+-------------------+------------------------------------------+
" ], "text/plain": [ "+---------+-------------------+------------------------------------------+" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Starting Optimization.
" ], "text/plain": [ "Starting Optimization." ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
+---------+--------------+-------------------+-----------------------+-------------+
" ], "text/plain": [ "+---------+--------------+-------------------+-----------------------+-------------+" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| Iter.   | Elapsed Time | Approx. Objective | Approx. Training RMSE | Step Size   |
" ], "text/plain": [ "| Iter. | Elapsed Time | Approx. Objective | Approx. Training RMSE | Step Size |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
+---------+--------------+-------------------+-----------------------+-------------+
" ], "text/plain": [ "+---------+--------------+-------------------+-----------------------+-------------+" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| Initial | 141us        | 3.89999           | 1.3637                |             |
" ], "text/plain": [ "| Initial | 141us | 3.89999 | 1.3637 | |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
+---------+--------------+-------------------+-----------------------+-------------+
" ], "text/plain": [ "+---------+--------------+-------------------+-----------------------+-------------+" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| 1       | 762us        | 4.11274           | 1.60028               | 0.195312    |
" ], "text/plain": [ "| 1 | 762us | 4.11274 | 1.60028 | 0.195312 |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| 2       | 1.298ms      | 3.00481           | 1.39367               | 0.116134    |
" ], "text/plain": [ "| 2 | 1.298ms | 3.00481 | 1.39367 | 0.116134 |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| 3       | 1.73ms       | 2.6827            | 1.23558               | 0.0856819   |
" ], "text/plain": [ "| 3 | 1.73ms | 2.6827 | 1.23558 | 0.0856819 |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| 4       | 2.257ms      | 2.49202           | 1.17326               | 0.0580668   |
" ], "text/plain": [ "| 4 | 2.257ms | 2.49202 | 1.17326 | 0.0580668 |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| 5       | 2.757ms      | 2.4385            | 1.16425               | 0.0491185   |
" ], "text/plain": [ "| 5 | 2.757ms | 2.4385 | 1.16425 | 0.0491185 |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| 10      | 5.116ms      | 2.28964           | 1.08345               | 0.029206    |
" ], "text/plain": [ "| 10 | 5.116ms | 2.28964 | 1.08345 | 0.029206 |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| 25      | 12.067ms     | 2.20318           | 1.06674               | 0.0146899   |
" ], "text/plain": [ "| 25 | 12.067ms | 2.20318 | 1.06674 | 0.0146899 |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
<div><table><tr><th>user_id</th><th>item_id</th><th>score</th><th>rank</th></tr><tr><td>0</td><td>d</td><td>1.2718666195869446</td><td>1</td></tr><tr><td>1</td><td>c</td><td>4.015937566757202</td><td>1</td></tr><tr><td>1</td><td>d</td><td>3.1318363547325134</td><td>2</td></tr><tr><td>2</td><td>a</td><td>2.478732317686081</td><td>1</td></tr></table>[4 rows x 4 columns]</div>
" ], "text/plain": [ "Columns:\n", "\tuser_id\tstr\n", "\titem_id\tstr\n", "\tscore\tfloat\n", "\trank\tint\n", "\n", "Rows: 4\n", "\n", "Data:\n", "+---------+---------+--------------------+------+\n", "| user_id | item_id | score | rank |\n", "+---------+---------+--------------------+------+\n", "| 0 | d | 1.2718666195869446 | 1 |\n", "| 1 | c | 4.015937566757202 | 1 |\n", "| 1 | d | 3.1318363547325134 | 2 |\n", "| 2 | a | 2.478732317686081 | 1 |\n", "+---------+---------+--------------------+------+\n", "[4 rows x 4 columns]" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" }, { "data": { "text/html": [ "
+---------+--------------+-------------------+-----------------------+-------------+
" ], "text/plain": [ "+---------+--------------+-------------------+-----------------------+-------------+" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Optimization Complete: Maximum number of passes through the data reached.
" ], "text/plain": [ "Optimization Complete: Maximum number of passes through the data reached." ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Computing final objective value and training RMSE.
" ], "text/plain": [ "Computing final objective value and training RMSE." ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
       Final objective value: 2.79567
" ], "text/plain": [ " Final objective value: 2.79567" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
       Final training RMSE: 1.03992
" ], "text/plain": [ " Final training RMSE: 1.03992" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "m = tc.recommender.create(sf, target='rating')\n", "recs = m.recommend()\n", "recs" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## The CourseTalk dataset: loading and first look\n", "\n", "We load the CourseTalk ratings, where each record holds a user, a course, and the rating that the user gave the course." ] }, { "cell_type": "code", "execution_count": 34, "metadata": { "ExecuteTime": { "end_time": "2019-06-15T06:43:24.922224Z", "start_time": "2019-06-15T06:43:24.866432Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/html": [ "
Materializing SFrame
" ], "text/plain": [ "Materializing SFrame" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ " " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# CourseTalk ratings: one user|course|rating record per line, no header row\n", "train_file = '../data/ratings.dat'\n", "sf = tc.SFrame.read_csv(train_file, header=False, \n", "                        delimiter='|', verbose=False)\n", "sf = sf.rename({'X1':'user_id', 'X2':'course_id', 'X3':'rating'})\n", "sf.show()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "In order to evaluate the performance of our model, we randomly split the observations in our data set into two partitions: we will use `train_set` when creating our model and `test_set` for evaluating its performance." ] }, { "cell_type": "code", "execution_count": 35, "metadata": { "ExecuteTime": { "end_time": "2019-06-15T06:44:10.686617Z", "start_time": "2019-06-15T06:44:10.673786Z" } }, "outputs": [ { "data": { "text/html": [ "
<div><table><tr><th>user_id</th><th>course_id</th><th>rating</th></tr><tr><td>1</td><td>1</td><td>5.0</td></tr><tr><td>2</td><td>1</td><td>5.0</td></tr><tr><td>3</td><td>1</td><td>5.0</td></tr><tr><td>4</td><td>1</td><td>5.0</td></tr><tr><td>5</td><td>1</td><td>5.0</td></tr><tr><td>6</td><td>1</td><td>5.0</td></tr><tr><td>7</td><td>1</td><td>5.0</td></tr><tr><td>8</td><td>1</td><td>5.0</td></tr><tr><td>9</td><td>1</td><td>5.0</td></tr><tr><td>10</td><td>1</td><td>5.0</td></tr></table>[2773 rows x 3 columns]<br/>Note: Only the head of the SFrame is printed.<br/>You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.</div>
" ], "text/plain": [ "Columns:\n", "\tuser_id\tint\n", "\tcourse_id\tint\n", "\trating\tfloat\n", "\n", "Rows: 2773\n", "\n", "Data:\n", "+---------+-----------+--------+\n", "| user_id | course_id | rating |\n", "+---------+-----------+--------+\n", "| 1 | 1 | 5.0 |\n", "| 2 | 1 | 5.0 |\n", "| 3 | 1 | 5.0 |\n", "| 4 | 1 | 5.0 |\n", "| 5 | 1 | 5.0 |\n", "| 6 | 1 | 5.0 |\n", "| 7 | 1 | 5.0 |\n", "| 8 | 1 | 5.0 |\n", "| 9 | 1 | 5.0 |\n", "| 10 | 1 | 5.0 |\n", "+---------+-----------+--------+\n", "[2773 rows x 3 columns]\n", "Note: Only the head of the SFrame is printed.\n", "You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns." ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sf" ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "ExecuteTime": { "end_time": "2019-06-15T06:44:23.226685Z", "start_time": "2019-06-15T06:44:23.223365Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "train_set, test_set = sf.random_split(0.8, seed=1)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "# Popularity model" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Create a model that makes recommendations using item popularity. When no target column is provided, the popularity is determined by the number of observations involving each item. When a target is provided, popularity is computed using the item’s mean target value. When the target column contains ratings, for example, the model computes the mean rating for each item and uses this to rank items for recommendations.\n", "\n", "One typically wants to initially create a simple recommendation system that can be used as a baseline and to verify that the rest of the pipeline works as expected. The `recommender` package has several models available for this purpose. 
For example, we can create a model that recommends courses based on their overall popularity across all users.\n" ] }, { "cell_type": "code", "execution_count": 37, "metadata": { "ExecuteTime": { "end_time": "2019-06-15T06:44:57.355433Z", "start_time": "2019-06-15T06:44:57.334939Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/html": [ "
Preparing data set.
" ], "text/plain": [ "Preparing data set." ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
    Data has 2202 observations with 1651 users and 201 items.
" ], "text/plain": [ " Data has 2202 observations with 1651 users and 201 items." ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
    Data prepared in: 0.006547s
" ], "text/plain": [ " Data prepared in: 0.006547s" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
2202 observations to process; with 201 unique items.
" ], "text/plain": [ "2202 observations to process; with 201 unique items." ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "popularity_model = tc.popularity_recommender.create(train_set, 'user_id', 'course_id', target = 'rating')" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "# Item similarity Model" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "* [Collaborative filtering](http://en.wikipedia.org/wiki/Collaborative_filtering) methods make predictions for a given user based on the patterns of other users' activities. One common technique is to compare items based on their [Jaccard](http://en.wikipedia.org/wiki/Jaccard_index) similarity. For two items, this measurement is a ratio: the number of users who interacted with both items, over the number of users who interacted with either item.\n", "* An alternative, slightly more complicated similarity measurement is [Cosine Similarity](http://en.wikipedia.org/wiki/Cosine_similarity), which can also take the rating values into account. \n", "\n", "If your data is implicit, i.e., you only observe interactions between users and items, without a rating, then use ItemSimilarityRecommender with Jaccard similarity. \n", "\n", "If your data is explicit, i.e., the observations include an actual rating given by the user, then you have a wide array of options. ItemSimilarityRecommender with cosine or Pearson similarity can incorporate ratings. In addition, FactorizationRecommender and RankingFactorizationRecommender both support rating prediction. 
\n", "\n", "#### Now data contains three columns: 'user_id', 'item_id', and 'rating'.\n", "\n", "    itemsim_cosine_model = tc.item_similarity_recommender.create(data, \n", "                                   target='rating', \n", "                                   similarity_type='cosine')\n", "    \n", "    factorization_machine_model = tc.factorization_recommender.create(data, \n", "                                   target='rating')\n", "\n", "\n", "In the following code block, we compute all the item-item similarities and create an object that can be used for recommendations." ] }, { "cell_type": "code", "execution_count": 38, "metadata": { "ExecuteTime": { "end_time": "2019-06-15T06:45:29.852827Z", "start_time": "2019-06-15T06:45:29.832716Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/html": [ "
Preparing data set.
" ], "text/plain": [ "Preparing data set." ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
    Data has 2202 observations with 1651 users and 201 items.
" ], "text/plain": [ " Data has 2202 observations with 1651 users and 201 items." ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
    Data prepared in: 0.006935s
" ], "text/plain": [ " Data prepared in: 0.006935s" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Training model from provided data.
" ], "text/plain": [ "Training model from provided data." ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Gathering per-item and per-user statistics.
" ], "text/plain": [ "Gathering per-item and per-user statistics." ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
+--------------------------------+------------+
" ], "text/plain": [ "+--------------------------------+------------+" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| Elapsed Time (Item Statistics) | % Complete |
" ], "text/plain": [ "| Elapsed Time (Item Statistics) | % Complete |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
+--------------------------------+------------+
" ], "text/plain": [ "+--------------------------------+------------+" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| 2.388ms                        | 60.5       |
" ], "text/plain": [ "| 2.388ms | 60.5 |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| 2.607ms                        | 100        |
" ], "text/plain": [ "| 2.607ms | 100 |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
+--------------------------------+------------+
" ], "text/plain": [ "+--------------------------------+------------+" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Setting up lookup tables.
" ], "text/plain": [ "Setting up lookup tables." ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Processing data in one pass using dense lookup tables.
" ], "text/plain": [ "Processing data in one pass using dense lookup tables." ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
+-------------------------------------+------------------+-----------------+
" ], "text/plain": [ "+-------------------------------------+------------------+-----------------+" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| Elapsed Time (Constructing Lookups) | Total % Complete | Items Processed |
" ], "text/plain": [ "| Elapsed Time (Constructing Lookups) | Total % Complete | Items Processed |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
+-------------------------------------+------------------+-----------------+
" ], "text/plain": [ "+-------------------------------------+------------------+-----------------+" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| 3.247ms                             | 0                | 0               |
" ], "text/plain": [ "| 3.247ms | 0 | 0 |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| 6.225ms                             | 100              | 201             |
" ], "text/plain": [ "| 6.225ms | 100 | 201 |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
+-------------------------------------+------------------+-----------------+
" ], "text/plain": [ "+-------------------------------------+------------------+-----------------+" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Finalizing lookup tables.
" ], "text/plain": [ "Finalizing lookup tables." ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Generating candidate set for working with new users.
" ], "text/plain": [ "Generating candidate set for working with new users." ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Finished training in 0.008353s
" ], "text/plain": [ "Finished training in 0.008353s" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "item_sim_model = tc.item_similarity_recommender.create(\n", "    train_set, 'user_id', 'course_id', target = 'rating', \n", "    similarity_type='cosine')" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "# Factorization Recommender Model\n", "Create a FactorizationRecommender that learns latent factors for each user and item and uses them to make rating predictions. This covers both standard matrix factorization and factorization machine models (the latter when side data is available for users and/or items). [link](https://dato.com/products/create/docs/generated/graphlab.recommender.factorization_recommender.create.html#graphlab.recommender.factorization_recommender.create)" ] }, { "cell_type": "code", "execution_count": 39, "metadata": { "ExecuteTime": { "end_time": "2019-06-15T06:45:56.902159Z", "start_time": "2019-06-15T06:45:54.601338Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/html": [ "
Preparing data set.
" ], "text/plain": [ "Preparing data set." ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
    Data has 2202 observations with 1651 users and 201 items.
" ], "text/plain": [ " Data has 2202 observations with 1651 users and 201 items." ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
    Data prepared in: 0.006938s
" ], "text/plain": [ " Data prepared in: 0.006938s" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Training factorization_recommender for recommendations.
" ], "text/plain": [ "Training factorization_recommender for recommendations." ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
+--------------------------------+--------------------------------------------------+----------+
" ], "text/plain": [ "+--------------------------------+--------------------------------------------------+----------+" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| Parameter                      | Description                                      | Value    |
" ], "text/plain": [ "| Parameter | Description | Value |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
+--------------------------------+--------------------------------------------------+----------+
" ], "text/plain": [ "+--------------------------------+--------------------------------------------------+----------+" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| num_factors                    | Factor Dimension                                 | 8        |
" ], "text/plain": [ "| num_factors | Factor Dimension | 8 |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| regularization                 | L2 Regularization on Factors                     | 1e-08    |
" ], "text/plain": [ "| regularization | L2 Regularization on Factors | 1e-08 |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| solver                         | Solver used for training                         | sgd      |
" ], "text/plain": [ "| solver | Solver used for training | sgd |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| linear_regularization          | L2 Regularization on Linear Coefficients         | 1e-10    |
" ], "text/plain": [ "| linear_regularization | L2 Regularization on Linear Coefficients | 1e-10 |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| max_iterations                 | Maximum Number of Iterations                     | 50       |
" ], "text/plain": [ "| max_iterations | Maximum Number of Iterations | 50 |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
+--------------------------------+--------------------------------------------------+----------+
" ], "text/plain": [ "+--------------------------------+--------------------------------------------------+----------+" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
  Optimizing model using SGD; tuning step size.
" ], "text/plain": [ " Optimizing model using SGD; tuning step size." ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
  Using 2202 / 2202 points for tuning the step size.
" ], "text/plain": [ " Using 2202 / 2202 points for tuning the step size." ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
+---------+-------------------+------------------------------------------+
" ], "text/plain": [ "+---------+-------------------+------------------------------------------+" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| Attempt | Initial Step Size | Estimated Objective Value                |
" ], "text/plain": [ "| Attempt | Initial Step Size | Estimated Objective Value |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
+---------+-------------------+------------------------------------------+
" ], "text/plain": [ "+---------+-------------------+------------------------------------------+" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| 0       | 25                | Not Viable                               |
" ], "text/plain": [ "| 0 | 25 | Not Viable |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| 1       | 6.25              | Not Viable                               |
" ], "text/plain": [ "| 1 | 6.25 | Not Viable |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| 2       | 1.5625            | Not Viable                               |
" ], "text/plain": [ "| 2 | 1.5625 | Not Viable |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| 3       | 0.390625          | 0.131214                                 |
" ], "text/plain": [ "| 3 | 0.390625 | 0.131214 |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| 4       | 0.195312          | 0.168634                                 |
" ], "text/plain": [ "| 4 | 0.195312 | 0.168634 |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| 5       | 0.0976562         | 0.237488                                 |
" ], "text/plain": [ "| 5 | 0.0976562 | 0.237488 |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| 6       | 0.0488281         | 0.338842                                 |
" ], "text/plain": [ "| 6 | 0.0488281 | 0.338842 |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
+---------+-------------------+------------------------------------------+
" ], "text/plain": [ "+---------+-------------------+------------------------------------------+" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| Final   | 0.390625          | 0.131214                                 |
" ], "text/plain": [ "| Final | 0.390625 | 0.131214 |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
+---------+-------------------+------------------------------------------+
" ], "text/plain": [ "+---------+-------------------+------------------------------------------+" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Starting Optimization.
" ], "text/plain": [ "Starting Optimization." ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
+---------+--------------+-------------------+-----------------------+-------------+
" ], "text/plain": [ "+---------+--------------+-------------------+-----------------------+-------------+" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| Iter.   | Elapsed Time | Approx. Objective | Approx. Training RMSE | Step Size   |
" ], "text/plain": [ "| Iter. | Elapsed Time | Approx. Objective | Approx. Training RMSE | Step Size |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
+---------+--------------+-------------------+-----------------------+-------------+
" ], "text/plain": [ "+---------+--------------+-------------------+-----------------------+-------------+" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| Initial | 61us         | 0.891401          | 0.94414               |             |
" ], "text/plain": [ "| Initial | 61us | 0.891401 | 0.94414 | |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
+---------+--------------+-------------------+-----------------------+-------------+
" ], "text/plain": [ "+---------+--------------+-------------------+-----------------------+-------------+" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| 1       | 31.222ms     | 0.904366          | 0.950979              | 0.390625    |
" ], "text/plain": [ "| 1 | 31.222ms | 0.904366 | 0.950979 | 0.390625 |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| 2       | 63.351ms     | 0.493281          | 0.702338              | 0.232267    |
" ], "text/plain": [ "| 2 | 63.351ms | 0.493281 | 0.702338 | 0.232267 |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| 3       | 100.4ms      | 0.273931          | 0.523383              | 0.171364    |
" ], "text/plain": [ "| 3 | 100.4ms | 0.273931 | 0.523383 | 0.171364 |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| 4       | 143.322ms    | 0.20113           | 0.448475              | 0.116134    |
" ], "text/plain": [ "| 4 | 143.322ms | 0.20113 | 0.448475 | 0.116134 |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| 5       | 174.795ms    | 0.150903          | 0.388462              | 0.098237    |
" ], "text/plain": [ "| 5 | 174.795ms | 0.150903 | 0.388462 | 0.098237 |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| 10      | 318.538ms    | 0.0530298         | 0.230279              | 0.0584121   |
" ], "text/plain": [ "| 10 | 318.538ms | 0.0530298 | 0.230279 | 0.0584121 |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
| 50      | 1.50s        | 0.00183493        | 0.04281               | 0.0174693   |
" ], "text/plain": [ "| 50 | 1.50s | 0.00183493 | 0.04281 | 0.0174693 |" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
+---------+--------------+-------------------+-----------------------+-------------+
" ], "text/plain": [ "+---------+--------------+-------------------+-----------------------+-------------+" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Optimization Complete: Maximum number of passes through the data reached.
" ], "text/plain": [ "Optimization Complete: Maximum number of passes through the data reached." ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
Computing final objective value and training RMSE.
" ], "text/plain": [ "Computing final objective value and training RMSE." ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
       Final objective value: 0.00162513
" ], "text/plain": [ " Final objective value: 0.00162513" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
       Final training RMSE: 0.0402852
" ], "text/plain": [ " Final training RMSE: 0.0402852" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "factorization_machine_model = tc.recommender.factorization_recommender.create(\n", " train_set, 'user_id', 'course_id', \n", " target='rating')" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "# Model Evaluation" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "It's straightforward to use GraphLab to compare models on a small subset of users in the `test_set`. The [precision-recall](http://en.wikipedia.org/wiki/Precision_and_recall) plot that is computed shows the benefits of using the similarity-based model instead of the baseline `popularity_model`: better curves tend toward the upper-right hand corner of the plot. \n", "\n", "The following command finds the top-ranked items for all users in the first 500 rows of `test_set`. The observations in `train_set` are not included in the predicted items." 
] }, { "cell_type": "code", "execution_count": 40, "metadata": { "ExecuteTime": { "end_time": "2019-06-15T06:46:49.448455Z", "start_time": "2019-06-15T06:46:45.968582Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "compare_models: using 246 users to estimate model performance\n", "PROGRESS: Evaluate model M0\n", "\n", "Precision and recall summary statistics by cutoff\n", "+--------+-----------------------+-----------------------+\n", "| cutoff | mean_precision | mean_recall |\n", "+--------+-----------------------+-----------------------+\n", "| 1 | 0.0 | 0.0 |\n", "| 2 | 0.0 | 0.0 |\n", "| 3 | 0.0 | 0.0 |\n", "| 4 | 0.0 | 0.0 |\n", "| 5 | 0.0 | 0.0 |\n", "| 6 | 0.0006775067750677507 | 0.0020325203252032527 |\n", "| 7 | 0.0005807200929152149 | 0.0020325203252032527 |\n", "| 8 | 0.0010162601626016261 | 0.006097560975609755 |\n", "| 9 | 0.0009033423667570006 | 0.006097560975609755 |\n", "| 10 | 0.0008130081300813011 | 0.006097560975609755 |\n", "+--------+-----------------------+-----------------------+\n", "[10 rows x 3 columns]\n", "\n", "\n", "Overall RMSE: 0.7309854474588884\n", "\n", "Per User RMSE (best)\n", "+---------+------+-------+\n", "| user_id | rmse | count |\n", "+---------+------+-------+\n", "| 1621 | 0.0 | 1 |\n", "+---------+------+-------+\n", "[1 rows x 3 columns]\n", "\n", "\n", "Per User RMSE (worst)\n", "+---------+-------------------+-------+\n", "| user_id | rmse | count |\n", "+---------+-------------------+-------+\n", "| 2009 | 3.006811989100817 | 1 |\n", "+---------+-------------------+-------+\n", "[1 rows x 3 columns]\n", "\n", "\n", "Per Item RMSE (best)\n", "+-----------+------+-------+\n", "| course_id | rmse | count |\n", "+-----------+------+-------+\n", "| 8 | 0.0 | 1 |\n", "+-----------+------+-------+\n", "[1 rows x 3 columns]\n", "\n", "\n", "Per Item RMSE (worst)\n", "+-----------+--------------------+-------+\n", "| course_id | rmse | count |\n", 
"+-----------+--------------------+-------+\n", "| 144 | 3.0954212400029215 | 3 |\n", "+-----------+--------------------+-------+\n", "[1 rows x 3 columns]\n", "\n", "PROGRESS: Evaluate model M1\n", "\n", "Precision and recall summary statistics by cutoff\n", "+--------+----------------------+----------------------+\n", "| cutoff | mean_precision | mean_recall |\n", "+--------+----------------------+----------------------+\n", "| 1 | 0.004065040650406504 | 0.004065040650406504 |\n", "| 2 | 0.006097560975609755 | 0.010162601626016263 |\n", "| 3 | 0.005420054200542005 | 0.011178861788617888 |\n", "| 4 | 0.007113821138211382 | 0.018292682926829285 |\n", "| 5 | 0.008130081300813007 | 0.026422764227642268 |\n", "| 6 | 0.007452574525745257 | 0.03048780487804879 |\n", "| 7 | 0.00696864111498258 | 0.03455284552845529 |\n", "| 8 | 0.007113821138211383 | 0.0426829268292683 |\n", "| 9 | 0.006775067750677503 | 0.04471544715447155 |\n", "| 10 | 0.006097560975609756 | 0.04471544715447156 |\n", "+--------+----------------------+----------------------+\n", "[10 rows x 3 columns]\n", "\n", "\n", "Overall RMSE: 4.586134473539705\n", "\n", "Per User RMSE (best)\n", "+---------+------+-------+\n", "| user_id | rmse | count |\n", "+---------+------+-------+\n", "| 1300 | 0.5 | 1 |\n", "+---------+------+-------+\n", "[1 rows x 3 columns]\n", "\n", "\n", "Per User RMSE (worst)\n", "+---------+------+-------+\n", "| user_id | rmse | count |\n", "+---------+------+-------+\n", "| 1704 | 5.0 | 1 |\n", "+---------+------+-------+\n", "[1 rows x 3 columns]\n", "\n", "\n", "Per Item RMSE (best)\n", "+-----------+------+-------+\n", "| course_id | rmse | count |\n", "+-----------+------+-------+\n", "| 200 | 0.5 | 1 |\n", "+-----------+------+-------+\n", "[1 rows x 3 columns]\n", "\n", "\n", "Per Item RMSE (worst)\n", "+-----------+------+-------+\n", "| course_id | rmse | count |\n", "+-----------+------+-------+\n", "| 43 | 5.0 | 1 |\n", "+-----------+------+-------+\n", "[1 rows x 3 
columns]\n", "\n", "PROGRESS: Evaluate model M2\n", "\n", "Precision and recall summary statistics by cutoff\n", "+--------+-----------------------+-----------------------+\n", "| cutoff | mean_precision | mean_recall |\n", "+--------+-----------------------+-----------------------+\n", "| 1 | 0.0 | 0.0 |\n", "| 2 | 0.0020325203252032522 | 0.0010162601626016261 |\n", "| 3 | 0.002710027100271004 | 0.005081300813008129 |\n", "| 4 | 0.003048780487804878 | 0.009146341463414632 |\n", "| 5 | 0.0024390243902439033 | 0.009146341463414632 |\n", "| 6 | 0.0020325203252032527 | 0.009146341463414632 |\n", "| 7 | 0.001742160278745646 | 0.009146341463414632 |\n", "| 8 | 0.001524390243902439 | 0.009146341463414632 |\n", "| 9 | 0.0013550135501355005 | 0.009146341463414632 |\n", "| 10 | 0.0016260162601626018 | 0.013211382113821135 |\n", "+--------+-----------------------+-----------------------+\n", "[10 rows x 3 columns]\n", "\n", "\n", "Overall RMSE: 0.8346975060578289\n", "\n", "Per User RMSE (best)\n", "+---------+-----------------------+-------+\n", "| user_id | rmse | count |\n", "+---------+-----------------------+-------+\n", "| 1156 | 0.0025112800279867287 | 1 |\n", "+---------+-----------------------+-------+\n", "[1 rows x 3 columns]\n", "\n", "\n", "Per User RMSE (worst)\n", "+---------+--------------------+-------+\n", "| user_id | rmse | count |\n", "+---------+--------------------+-------+\n", "| 1646 | 3.3541036468019474 | 1 |\n", "+---------+--------------------+-------+\n", "[1 rows x 3 columns]\n", "\n", "\n", "Per Item RMSE (best)\n", "+-----------+---------------------+-------+\n", "| course_id | rmse | count |\n", "+-----------+---------------------+-------+\n", "| 129 | 0.00681198910081271 | 1 |\n", "+-----------+---------------------+-------+\n", "[1 rows x 3 columns]\n", "\n", "\n", "Per Item RMSE (worst)\n", "+-----------+--------------------+-------+\n", "| course_id | rmse | count |\n", "+-----------+--------------------+-------+\n", "| 144 | 
3.5404041983447554 | 3 |\n", "+-----------+--------------------+-------+\n", "[1 rows x 3 columns]\n", "\n" ] } ], "source": [ "result = tc.recommender.util.compare_models(\n", " test_set, [popularity_model, item_sim_model, factorization_machine_model],\n", " user_sample=.5, skip_set=train_set)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Now let's ask the item similarity model for course recommendations for several users. We first create `users`, an SArray holding 100 unique user ids taken from the observations in `sf`." ] }, { "cell_type": "code", "execution_count": 41, "metadata": { "ExecuteTime": { "end_time": "2019-06-15T06:49:39.479200Z", "start_time": "2019-06-15T06:49:39.464634Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "dtype: int\n", "Rows: 100\n", "[232, 363, 431, 738, 1860, 732, 187, 1368, 1753, 764, 926, 1180, 1323, 1742, 1685, 1876, 1232, 614, 1573, 786, 1158, 1072, 863, 695, 454, 1211, 1404, 1242, 696, 444, 349, 1883, 499, 354, 573, 1531, 71, 1889, 312, 578, 1556, 1572, 79, 416, 848, 1805, 550, 1644, 1017, 521, 566, 703, 196, 1401, 533, 1054, 398, 1077, 341, 982, 516, 1473, 1982, 1670, 529, 1544, 1920, 1414, 223, 465, 376, 1231, 193, 202, 1128, 1853, 1907, 1331, 966, 810, 1895, 1704, 267, 1615, 350, 979, 69, 138, 1645, 837, 1841, 1801, 853, 803, 405, 1172, 112, 1653, 937, 1429]" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "K = 10\n", "users = tc.SArray(sf['user_id'].unique().head(100))\n", "users" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Next we use the `recommend()` function to query the model we created for recommendations. The returned object has four columns: `user_id`, `course_id`, the `score` that the algorithm gave this user for this course, and the course's `rank` (an integer from 1 to K).
To see this we can grab the top few rows of `recs`:" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "ExecuteTime": { "end_time": "2019-06-14T16:46:10.176376Z", "start_time": "2019-06-14T16:46:10.152361Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/html": [ "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
user_idcourse_idscorerank
232930.219596922397613531
2321800.21274745464324952
2321880.201870501041412353
2321080.178093016147613534
232550.17535716295242315
2321680.167150199413299566
2321330.166733860969543467
2321860.161530077457427988
2321640.16010880470275889
2321870.1578584313392639210
\n", "[10 rows x 4 columns]
\n", "
" ], "text/plain": [ "Columns:\n", "\tuser_id\tint\n", "\tcourse_id\tint\n", "\tscore\tfloat\n", "\trank\tint\n", "\n", "Rows: 10\n", "\n", "Data:\n", "+---------+-----------+---------------------+------+\n", "| user_id | course_id | score | rank |\n", "+---------+-----------+---------------------+------+\n", "| 232 | 93 | 0.21959692239761353 | 1 |\n", "| 232 | 180 | 0.2127474546432495 | 2 |\n", "| 232 | 188 | 0.20187050104141235 | 3 |\n", "| 232 | 108 | 0.17809301614761353 | 4 |\n", "| 232 | 55 | 0.1753571629524231 | 5 |\n", "| 232 | 168 | 0.16715019941329956 | 6 |\n", "| 232 | 133 | 0.16673386096954346 | 7 |\n", "| 232 | 186 | 0.16153007745742798 | 8 |\n", "| 232 | 164 | 0.1601088047027588 | 9 |\n", "| 232 | 187 | 0.15785843133926392 | 10 |\n", "+---------+-----------+---------------------+------+\n", "[10 rows x 4 columns]" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "recs = item_sim_model.recommend(users=users, k=K)\n", "recs.head()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "To learn what songs these ids pertain to, we can merge in metadata about each song." ] }, { "cell_type": "code", "execution_count": 42, "metadata": { "ExecuteTime": { "end_time": "2019-06-15T06:51:05.644919Z", "start_time": "2019-06-15T06:51:05.291900Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/html": [ "
Materializing SFrame
" ], "text/plain": [ "Materializing SFrame" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ " " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "ename": "RuntimeError", "evalue": "Column name course_id does not exist.", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mRuntimeError\u001b[0m Traceback (most recent call last)", "\u001b[0;32m~/Applications/anaconda/lib/python3.5/site-packages/turicreate/data_structures/sframe.py\u001b[0m in \u001b[0;36mjoin\u001b[0;34m(self, right, on, how)\u001b[0m\n\u001b[1;32m 4351\u001b[0m \u001b[0;32mwith\u001b[0m \u001b[0mcython_context\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 4352\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mSFrame\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0m_proxy\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m__proxy__\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mjoin\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mright\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m__proxy__\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mhow\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mjoin_keys\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 4353\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32mturicreate/_cython/cy_sframe.pyx\u001b[0m in \u001b[0;36mturicreate._cython.cy_sframe.UnitySFrameProxy.join\u001b[0;34m()\u001b[0m\n", "\u001b[0;32mturicreate/_cython/cy_sframe.pyx\u001b[0m in \u001b[0;36mturicreate._cython.cy_sframe.UnitySFrameProxy.join\u001b[0;34m()\u001b[0m\n", "\u001b[0;31mRuntimeError\u001b[0m: Column name course_id does not exist.", "\nDuring handling of the above exception, another exception occurred:\n", "\u001b[0;31mRuntimeError\u001b[0m Traceback (most recent call last)", "\u001b[0;32m\u001b[0m in 
\u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 6\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 7\u001b[0m \u001b[0mcourses\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mcourses\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'course_id'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m'title'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m'provider'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 8\u001b[0;31m \u001b[0mresults\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mrecs\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mjoin\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mcourses\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mon\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m'course_id'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mhow\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m'inner'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 9\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 10\u001b[0m \u001b[0;31m#Populate observed user-course data with course info\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m~/Applications/anaconda/lib/python3.5/site-packages/turicreate/data_structures/sframe.py\u001b[0m in \u001b[0;36mjoin\u001b[0;34m(self, right, on, how)\u001b[0m\n\u001b[1;32m 4350\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 4351\u001b[0m \u001b[0;32mwith\u001b[0m \u001b[0mcython_context\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 4352\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mSFrame\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0m_proxy\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m__proxy__\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mjoin\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mright\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m__proxy__\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mhow\u001b[0m\u001b[0;34m,\u001b[0m 
\u001b[0mjoin_keys\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 4353\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 4354\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0mfilter_by\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mvalues\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mcolumn_name\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mexclude\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mFalse\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m~/Applications/anaconda/lib/python3.5/site-packages/turicreate/_cython/context.py\u001b[0m in \u001b[0;36m__exit__\u001b[0;34m(self, exc_type, exc_value, traceback)\u001b[0m\n\u001b[1;32m 47\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mshow_cython_trace\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 48\u001b[0m \u001b[0;31m# To hide cython trace, we re-raise from here\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 49\u001b[0;31m \u001b[0;32mraise\u001b[0m \u001b[0mexc_type\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mexc_value\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 50\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 51\u001b[0m \u001b[0;31m# To show the full trace, we do nothing and let exception propagate\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;31mRuntimeError\u001b[0m: Column name course_id does not exist." 
] } ], "source": [ "# Get the meta data of the courses\n", "courses = tc.SFrame.read_csv('../data/cursos.dat', header=False, delimiter='|', verbose=False)\n", "courses =courses.rename({'X1':'course_id', 'X2':'title', 'X3':'avg_rating', \n", " 'X4':'workload', 'X5':'university', 'X6':'difficulty', 'X7':'provider'})\n", "courses.show()\n", "\n", "courses = courses[['course_id', 'title', 'provider']]\n", "results = recs.join(courses, on='course_id', how='inner')\n", "\n", "#Populate observed user-course data with course info\n", "userset = frozenset(users)\n", "ix = sf['user_id'].apply(lambda x: x in userset, int) \n", "user_data = sf[ix]\n", "user_data = user_data.join(courses, on='course_id')[['user_id', 'title', 'provider']]" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "ExecuteTime": { "end_time": "2019-06-14T16:47:43.829359Z", "start_time": "2019-06-14T16:47:43.778230Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "User: 1\n", "We were told that the user liked these courses: \n", "+-------------------------------+----------+\n", "| title | provider |\n", "+-------------------------------+----------+\n", "| An Introduction to Interac... | coursera |\n", "+-------------------------------+----------+\n", "[1 rows x 2 columns]\n", "\n", "We recommend these other courses:\n", "+-------+----------+\n", "| title | provider |\n", "+-------+----------+\n", "+-------+----------+\n", "[0 rows x 2 columns]\n", "\n", "\n", "User: 2\n", "We were told that the user liked these courses: \n", "+-------------------------------+----------+\n", "| title | provider |\n", "+-------------------------------+----------+\n", "| An Introduction to Interac... 
| coursera |\n", "+-------------------------------+----------+\n", "[1 rows x 2 columns]\n", "\n", "We recommend these other courses:\n", "+-------+----------+\n", "| title | provider |\n", "+-------+----------+\n", "+-------+----------+\n", "[0 rows x 2 columns]\n", "\n", "\n", "User: 3\n", "We were told that the user liked these courses: \n", "+-------------------------------+----------+\n", "| title | provider |\n", "+-------------------------------+----------+\n", "| An Introduction to Interac... | coursera |\n", "+-------------------------------+----------+\n", "[1 rows x 2 columns]\n", "\n", "We recommend these other courses:\n", "+-------+----------+\n", "| title | provider |\n", "+-------+----------+\n", "+-------+----------+\n", "[0 rows x 2 columns]\n", "\n", "\n", "User: 4\n", "We were told that the user liked these courses: \n", "+-------------------------------+----------+\n", "| title | provider |\n", "+-------------------------------+----------+\n", "| A Beginner's Guide to ... 
| coursera |\n", "| Gamification | coursera |\n", "+-------------------------------+----------+\n", "[2 rows x 2 columns]\n", "\n", "We recommend these other courses:\n", "+-------+----------+\n", "| title | provider |\n", "+-------+----------+\n", "+-------+----------+\n", "[0 rows x 2 columns]\n", "\n", "\n", "User: 5\n", "We were told that the user liked these courses: \n", "+-------------------------------+----------+\n", "| title | provider |\n", "+-------------------------------+----------+\n", "| Web Intelligence and Big Data | coursera |\n", "+-------------------------------+----------+\n", "[1 rows x 2 columns]\n", "\n", "We recommend these other courses:\n", "+-------+----------+\n", "| title | provider |\n", "+-------+----------+\n", "+-------+----------+\n", "[0 rows x 2 columns]\n", "\n", "\n" ] } ], "source": [ "# Print out some recommendations \n", "for i in range(5):\n", " user = list(users)[i]\n", " print(\"User: \" + str(i + 1))\n", " user_obs = user_data[user_data['user_id'] == user].head(K)\n", " del user_obs['user_id']\n", " # user_id is an int column, so compare with the int id (str(user) matches no rows)\n", " user_recs = results[results['user_id'] == user][['title', 'provider']]\n", "\n", " print(\"We were told that the user liked these courses: \")\n", " print (user_obs.head(K))\n", "\n", " print (\"We recommend these other courses:\")\n", " print (user_recs.head(K))\n", "\n", " print (\"\")" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Readings\n", "- (Looking for more details about the modules and functions? Check out the API docs.)\n", "- Toby Segaran, 2007, Programming Collective Intelligence. O'Reilly. 
Chapter 2 Making Recommendations\n", " - programming-collective-intelligence-code/blob/master/chapter2/recommendations.py\n", "- Xiang Liang (项亮), 2012, Recommender System Practice (推荐系统实践). Posts & Telecom Press (人民邮电出版社)." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "celltoolbar": "Slideshow", "kernelspec": { "display_name": "Python [conda env:anaconda]", "language": "python", "name": "conda-env-anaconda-py" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.4" }, "latex_envs": { "LaTeX_envs_menu_present": true, "autoclose": false, "autocomplete": true, "bibliofile": "biblio.bib", "cite_by": "apalike", "current_citInitial": 1, "eqLabelWithNumbers": true, "eqNumInitial": 0, "hotkeys": { "equation": "Ctrl-E", "itemize": "Ctrl-I" }, "labels_anchors": false, "latex_user_defs": false, "report_style_numbering": false, "user_envs_cfg": false }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": false, "sideBar": false, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": { "height": "48px", "left": "1382.98px", "top": "61.5313px", "width": "164px" }, "toc_section_display": false, "toc_window_display": true } }, "nbformat": 4, "nbformat_minor": 1 }