{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Populating the interactive namespace from numpy and matplotlib\n"
]
}
],
"source": [
"%pylab inline\n",
"import pandas as pd\n",
"\n",
"import numpy as np\n",
"from __future__ import division\n",
"import itertools\n",
"\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"\n",
"import logging\n",
"logger = logging.getLogger()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"9 Recommendation Systems\n",
"=============\n",
"\n",
"two broad groups:\n",
"\n",
"1. Content-based systems \n",
"   focus on the properties of items.\n",
" \n",
"2. Collaborative filtering systems \n",
" focus on the relationship between users and items."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 9.1 A Model for Recommendation Systems\n",
"\n",
"\n",
"#### The Utility Matrix\n",
"records the preferences given by users for certain items."
]
},
{
"cell_type": "code",
"execution_count": 66,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
" HP1 HP2 HP3 TW SW1 SW2 SW3\n",
"A 4 NaN NaN 5 1 NaN NaN\n",
"B 5 5 4 NaN NaN NaN NaN\n",
"C NaN NaN NaN 2 4 5 NaN\n",
"D NaN 3 NaN NaN NaN NaN 3"
]
},
"execution_count": 66,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Example 9.1\n",
"M = pd.DataFrame(index=['A', 'B', 'C', 'D'], columns=['HP1', 'HP2', 'HP3', 'TW', 'SW1', 'SW2', 'SW3'])\n",
"\n",
"M.loc['A', ['HP1', 'TW', 'SW1']] = [4, 5, 1]\n",
"M.iloc[1, 0:3] = [5, 5, 4]\n",
"M.iloc[2, 3:-1] = [2, 4, 5]\n",
"M.iloc[3, [1, -1]] = [3, 3]\n",
"\n",
"M_9_1 = M\n",
"M_9_1"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In practice, the matrix would be even **sparser**, with the typical user rating only a tiny fraction of all available items.\n",
"\n",
"the **goal** of a recommendation system is to **predict the blanks** in the utility matrix. \n",
"+ slight differences in many applications: \n",
"    - no need to predict every blank entry; discovering some promising entries in each row is enough. \n",
"    - no need to find all items with the highest expected ratings; a large subset of those suffices.\n",
" \n",
"\n",
"#### The Long Tail\n",
"physical institutions | online institutions\n",
"---- | -----\n",
"provide only the most popular items | provide the entire range of items\n",
"\n",
"\n",
"the long tail forces online institutions to recommend items to individual users:\n",
"\n",
"1. It is not possible to present all available items to the user.\n",
"\n",
"2. Neither can we expect users to have heard of each of the items they might like.\n",
"\n",
"\n",
"#### Applications of Recommendation Systems\n",
"1. Product Recommendations\n",
"\n",
"2. Movie Recommendations\n",
"\n",
"3. News Articles\n",
"\n",
"\n",
"#### Populating the Utility Matrix\n",
"how to discover the value users place on items:\n",
"\n",
"1. We can ask users to rate items. \n",
"   cons: users are generally unwilling to do so, so the sample is biased toward the small fraction of people willing to rate.\n",
" \n",
"2. We can make inferences from users' behavior. \n",
" eg: items purchased/viewed/rated."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 9.2 Content-Based Recommendations\n",
"\n",
"\n",
"#### 9.2.1 Item Profiles\n",
"a record representing important characteristics of items.\n",
"\n",
"\n",
"##### Discovering Features\n",
"1. for Documents \n",
"   idea: identify the words that characterize the topic of a document. \n",
"   namely, we expect a set of words to express the subjects or main ideas of the document. \n",
"   1. eliminate stop words. \n",
"   2. compute the TF.IDF score for each remaining word in the document. \n",
"   3. take as the features of a document the $n$ words with the highest TF.IDF scores. \n",
"   \n",
"   to measure the similarity of two documents, the distance measures we could use are: \n",
"   1. Jaccard distance \n",
"   2. cosine distance \n",
"      the cosine distance of vectors is not affected by components in which both vectors are 0.\n",
"   \n",
"2. for Images \n",
"   invite users to tag the items. \n",
"   cons: users are unwilling to do so $\\implies$ there are not enough tags (bias).\n",
" \n",
"\n",
"##### Generalizing Feature Vectors\n",
"1. discrete features $\\to$ boolean values.\n",
"\n",
"2. numerical features $\\to$ normalization.\n",
"\n",
"\n",
"#### 9.2.5 User Profiles\n",
"create vectors with the same components as item profiles to describe the user's preferences. \n",
"\n",
"They can be derived from the utility matrix and the item profiles:\n",
"\n",
"1. normalize the utility matrix ($[-1,1]$, suitable for cosine distance).\n",
"\n",
"2. user-profile value = (normalized) utility values $\\times$ corresponding item vectors, averaged."
]
},
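The document-feature steps in 9.2.1 above (eliminate stop words, score remaining words with TF.IDF, keep the top $n$) can be sketched directly. The toy documents, the stop-word list, and the `tf_idf` helper below are made up for illustration:

```python
import math
from collections import Counter

# hypothetical toy corpus
docs = ['the cat sat on the mat',
        'the dog chased the cat',
        'a dog and a fox ran']
stop_words = {'the', 'on', 'a', 'and'}

# 1. eliminate stop words
tokens = [[w for w in d.split() if w not in stop_words] for d in docs]

# 2. TF.IDF score for each remaining word of one document
def tf_idf(doc_tokens, corpus_tokens):
    tf = Counter(doc_tokens)
    max_tf = max(tf.values())
    scores = {}
    for w, f in tf.items():
        df = sum(1 for t in corpus_tokens if w in t)  # document frequency
        scores[w] = (f / max_tf) * math.log(len(corpus_tokens) / df, 2)
    return scores

# 3. the n highest-scoring words become the document's features
n = 2
features = []
for t in tokens:
    scores = tf_idf(t, tokens)
    features.append(sorted(scores, key=scores.get, reverse=True)[:n])

print(features)
```

For document 0, 'sat' and 'mat' appear in no other document, so they out-score 'cat', which also occurs in document 1.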
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
" F0 F1 F2 F3\n",
"U 3 4 5 0\n",
"V 6 2 3 5"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# example 9.4\n",
"\n",
"users_name = ['U', 'V']\n",
"items_name = ['F{}'.format(x) for x in range(4)]\n",
"features_name = ['Julia Roberts', 'others']\n",
"\n",
"# utility matrix\n",
"M_uti = pd.DataFrame([\n",
" [3, 4, 5, 0],\n",
" [6, 2, 3, 5] \n",
" ],\n",
" index=users_name,\n",
" columns=items_name\n",
" )\n",
"\n",
"M_uti"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
" Julia Roberts others\n",
"F0 1 0\n",
"F1 1 0\n",
"F2 1 0\n",
"F3 1 0"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# item profile\n",
"M_item = pd.DataFrame(index=items_name, columns=features_name)\n",
"\n",
"M_item.loc[:, features_name[0]] = 1\n",
"M_item = M_item.fillna(value=0)\n",
"\n",
"M_item"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
" F0 F1 F2 F3\n",
"U 0 1 2 -3\n",
"V 2 -2 -1 1"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"M_uti.apply(lambda x: x - np.mean(x), axis=1)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
" Julia Roberts others\n",
"U 3 0\n",
"V 4 0"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"M_user = M_uti.fillna(value=0).dot(M_item) / 4 #average = sum/len\n",
"M_user"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 9.2.6 Recommending Items to Users Based on Content\n",
"\n",
"1. to estimate: \n",
"   $$M_{utility}[user, item] = cosineDistance(M_{user}, M_{item})$$ \n",
"   \n",
"   the more similar the profiles, the higher the probability of recommending the item.\n",
"\n",
"2. classification algorithms: \n",
"   learn a Recommend-or-Not classifier (machine learning): \n",
"   one classifier per user $\\to$ takes too long to construct. \n",
"   used only for relatively small problem sizes."
]
},
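Applying the estimate above to the Example 9.4 profiles: the user profile and each item profile live in the same feature space ([Julia Roberts, others]), so their cosine similarity can be computed directly. A minimal sketch; the two vectors are the values derived in the cells above.

```python
import numpy as np

def cosine_sim(u, v):
    # cosine of the angle between two profile vectors
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

user_profile = np.array([3.0, 0.0])  # user U from Example 9.4
item_profile = np.array([1.0, 0.0])  # a movie starring Julia Roberts

print(cosine_sim(user_profile, item_profile))  # -> 1.0, strongly recommend
```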
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
" A B C\n",
"Processor Speed 3.06 2.68 2.92\n",
"Disk Size 500.00 320.00 640.00\n",
"Main-Memory Size 6.00 4.00 6.00"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# exercise 9.2.1\n",
"\n",
"raw_data = [\n",
" [3.06, 2.68, 2.92],\n",
" [500, 320, 640],\n",
" [6, 4, 6]\n",
" ]\n",
"\n",
"M_item = pd.DataFrame(raw_data, index=['Processor Speed', 'Disk Size', 'Main-Memory Size'], columns=['A', 'B', 'C'])\n",
"\n",
"# items: A, B, C; features: Processor Speed, Disk Size, ...\n",
"M_item"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
" A B C\n",
"Processor Speed 1.060046 0.928406 1.011547\n",
"Disk Size 1.027397 0.657534 1.315068\n",
"Main-Memory Size 1.125000 0.750000 1.125000"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# exercise 9.2.1 \n",
"\n",
"# (d)\n",
"M_item.apply(lambda x: x / np.mean(x), axis=1)"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
" A B C\n",
"Processor Speed 0.173333 -0.206667 0.033333\n",
"Disk Size 13.333333 -166.666667 153.333333\n",
"Main-Memory Size 0.666667 -1.333333 0.666667"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# exercise 9.2.2 \n",
"\n",
"# (a)\n",
"M_item.apply(lambda x: x - np.mean(x), axis=1)"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
" A B C\n",
"user 4 2 5"
]
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# exercise 9.2.3\n",
"\n",
"M_uti = pd.DataFrame([[4, 2, 5]], index=['user'], columns=['A', 'B', 'C'])\n",
"M_uti"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
" A B C\n",
"user 0.333333 -1.666667 1.333333"
]
},
"execution_count": 31,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# (a)\n",
"M_uti_nor = M_uti.apply(lambda x: x - np.mean(x), axis=1)\n",
"M_uti_nor"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
" user\n",
"Processor Speed 0.148889\n",
"Disk Size 162.222222\n",
"Main-Memory Size 1.111111"
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# (b)\n",
"M_user = M_item.dot(M_uti_nor.T) / 3\n",
"M_user"
]
},
{
"cell_type": "code",
"execution_count": 79,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"utility matrix: \n",
" A B C\n",
"userA 0.333333 -1.666667 1.333333\n",
"userB -1.000000 NaN 1.000000\n",
"\n",
"item profile: \n",
" A B C\n",
"Processor Speed 3.06 2.68 2.92\n",
"Disk Size 500.00 320.00 640.00\n",
"Main-Memory Size 6.00 4.00 6.00\n",
"\n"
]
},
{
"data": {
"text/plain": [
" userA userB\n",
"Processor Speed 0.148889 -0.07\n",
"Disk Size 162.222222 70.00\n",
"Main-Memory Size 1.111111 0.00"
]
},
"execution_count": 79,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"logger.setLevel('WARN')\n",
"\n",
"def create_user_profile(utility_matrix, item_profile):\n",
" \"\"\"Create user profile by combining utility matrix with item profile in 9.2.5 .\"\"\"\n",
" \n",
" assert np.array_equal(utility_matrix.columns, item_profile.columns), \\\n",
"\"utility matrix must have the same column names as the item profile.\"\n",
" \n",
" logger.info('utility_matrix: \\n{}\\n'.format(utility_matrix)) \n",
" M_uti_notnull = np.ones(utility_matrix.shape)\n",
" M_uti_notnull[utility_matrix.isnull().values] = 0\n",
"logger.info('utility_matrix_notnull: \\n{}\\n'.format(M_uti_notnull))\n",
" \n",
" \n",
"logger.info('item_profile: \\n{}\\n'.format(item_profile)) \n",
"M_item_notnull = np.ones(item_profile.shape)\n",
"M_item_notnull[item_profile.isnull().values] = 0\n",
"logger.info('item_profile_notnull: \\n{}\\n'.format(M_item_notnull))\n",
" \n",
" utility_matrix = utility_matrix.fillna(value=0)\n",
" item_profile = item_profile.fillna(value=0)\n",
" M_user = item_profile.dot(utility_matrix.T).values / np.dot(M_item_notnull, M_uti_notnull.T)\n",
" M_user[np.isinf(M_user)] = np.nan # solve: divide zero\n",
" logger.info('M_user: \\n{}\\n'.format(M_user))\n",
" \n",
" return pd.DataFrame(M_user, index=item_profile.index, columns=utility_matrix.index)\n",
"\n",
"\n",
"M_uti = pd.DataFrame([[4, 2, 5], [1, np.nan, 3]], index=['userA', 'userB'], columns=['A', 'B', 'C'])\n",
"M_uti_nor = M_uti.apply(lambda x: x - np.mean(x), axis=1)\n",
"print('utility matrix: \\n{}\\n'.format(M_uti_nor))\n",
"print('item profile: \\n{}\\n'.format(M_item))\n",
"\n",
"create_user_profile(M_uti_nor, M_item)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 9.3 Collaborative Filtering\n",
"identifying similar users and recommending what similar users like.\n",
"\n",
"\n",
"#### refine data \n",
"1. rounding the data \n",
"   eg: treat ratings of 3, 4 and 5 as \"1\", and lower ratings as unrated (blank).\n",
"   \n",
"2. normalizing ratings \n",
"   subtract from each rating the average rating of that user."
]
},
{
"cell_type": "code",
"execution_count": 68,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
" HP1 HP2 HP3 TW SW1 SW2 SW3\n",
"A 4 NaN NaN 5 1 NaN NaN\n",
"B 5 5 4 NaN NaN NaN NaN\n",
"C NaN NaN NaN 2 4 5 NaN\n",
"D NaN 3 NaN NaN NaN NaN 3"
]
},
"execution_count": 68,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Fig 9.4\n",
"M_9_1"
]
},
{
"cell_type": "code",
"execution_count": 70,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
" HP1 HP2 HP3 TW SW1 SW2 SW3\n",
"A 1 NaN NaN 1 NaN NaN NaN\n",
"B 1 1 1 NaN NaN NaN NaN\n",
"C NaN NaN NaN NaN 1 1 NaN\n",
"D NaN 1 NaN NaN NaN NaN 1"
]
},
"execution_count": 70,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# rounding the data\n",
"M_round = M_9_1.copy()\n",
"M_round[M_9_1 <= 2] = np.nan\n",
"M_round[M_9_1 > 2] = 1\n",
"\n",
"M_round"
]
},
{
"cell_type": "code",
"execution_count": 75,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
" HP1 HP2 HP3 TW SW1 SW2 SW3\n",
"A 0.666667 NaN NaN 1.666667 -2.333333 NaN NaN\n",
"B 0.333333 0.333333 -0.666667 NaN NaN NaN NaN\n",
"C NaN NaN NaN -1.666667 0.333333 1.333333 NaN\n",
"D NaN 0.000000 NaN NaN NaN NaN 0"
]
},
"execution_count": 75,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# normalizing ratings\n",
"M_norm = M_9_1.apply(lambda x: x - np.mean(x), axis=1)\n",
"M_norm"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 9.3.2 The Duality of Similarity\n",
"1. We can use information about similar users to recommend items directly, whereas even if we find pairs of similar items, it takes an **additional step** to recommend items to users. \n",
"   + find the $n$ most similar users $\\to$ estimate $M[U,I]$ for user $U$. \n",
"     normalize the utility matrix first, then add $U$'s own average back: \n",
"     \\begin{align}\n",
"     M[U,I] &= Ave(M[U,:]) + Ave_{V \\in S}\\big(M[V,I] - Ave(M[V,:])\\big)\n",
"     \\end{align}\n",
"     where $S$ is the set of the $n$ similar users who have rated $I$.\n",
"     \n",
"   + find the $m$ items most similar to $I$ $\\to$ estimate $M[U,I]$ as the average of $U$'s ratings on those items: \n",
"     $$M[U,I] = Ave_{J \\in T}(M[U,J])$$\n",
"     where $T$ is the set of the $m$ similar items that $U$ has rated.\n",
"     \n",
"   + in order to recommend items to user $U$, we need to estimate all or most of the entries in $M[U,:]$. \n",
"     **tradeoff**: \n",
"     1. user-user: find similar users once, then directly get predicted values for all candidate items. \n",
"        item-item: find similar items, then **estimate each item one by one (the additional step)** to fill $M[U,:]$.\n",
"     \n",
"     2. item-item similarity often provides more **reliable** information, due to the simplicity of items (genre). \n",
"     \n",
"   + **precompute** preferred items for each user. \n",
"     the utility matrix evolves slowly $\\implies$ compute it infrequently and assume it remains fixed between recomputations.\n",
"\n",
"2. Items tend to be classifiable in simple terms (eg: genre), whereas individuals are complex."
]
},
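The user-user estimate above can be sketched as a small function: predict $M[U,I]$ as $U$'s own average plus the average deviation of the similar users' ratings of $I$ from their own averages. The ratings below are hypothetical.

```python
import numpy as np

def predict_user_user(u_ratings, sim_ratings_for_item, sim_user_averages):
    # u_ratings: the nonblank entries of M[U,:]
    # sim_ratings_for_item: the n similar users' ratings of item I
    # sim_user_averages: each similar user's own average rating
    deviations = (np.array(sim_ratings_for_item, dtype=float)
                  - np.array(sim_user_averages, dtype=float))
    return np.mean(u_ratings) + np.mean(deviations)

# U averages 3.5; both similar users rated I one point above their own
# averages, so the prediction is 3.5 + 1 = 4.5
print(predict_user_user([3, 4], [5, 4], [4, 3]))  # -> 4.5
```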
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 9.3.3 Clustering Users and Items\n",
"a hierarchical approach is preferred: \n",
"\n",
"1. leave many clusters unmerged at first.\n",
"\n",
"2. cluster items, and average the corresponding values in the utility matrix.\n",
"\n",
"3. cluster users, and average as well.\n",
"\n",
"4. repeat several times if we like.\n",
"\n",
"\n",
"Predict $M[U,I]$: \n",
"\n",
"1. $U \\in C$, and $I \\in D$. \n",
"\n",
"2. predict: \n",
"\n",
"   \\begin{equation}\n",
"   M[U,I] = \\begin{cases}\n",
"   M_{revised}[C,D] & \\quad \\text{if it exists,} \\\\\n",
"   \\text{estimate using similar users/items} & \\quad \\text{otherwise.}\n",
"   \\end{cases}\n",
"   \\end{equation}\n",
"\n"
]
},
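A sketch of the clustered prediction above: with cluster assignments in hand (made up here for illustration), $M_{revised}[C,D]$ is just the average of the nonblank entries in the corresponding block of the utility matrix.

```python
import numpy as np

M = np.array([[4.0, 5.0, np.nan, 1.0],
              [5.0, np.nan, 4.0, 2.0],
              [np.nan, 1.0, 2.0, 5.0]])
user_cluster = [0, 0, 1]     # users 0 and 1 merged; user 2 alone
item_cluster = [0, 0, 1, 1]  # items 0,1 merged; items 2,3 merged

def revised_entry(M, users, items):
    # average of the nonblank entries in the (users x items) block
    return np.nanmean(M[np.ix_(users, items)])

# predict M[U=0, I=2]: U is in user-cluster 0, I is in item-cluster 1
C = [u for u, c in enumerate(user_cluster) if c == 0]
D = [i for i, c in enumerate(item_cluster) if c == 1]
print(revised_entry(M, C, D))  # (1 + 4 + 2) / 3
```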
{
"cell_type": "code",
"execution_count": 108,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
" a b c d e f g h\n",
"A 4 5 NaN 5 1 NaN 3 2\n",
"B NaN 3 4 3 1 2 1 NaN\n",
"C 2 NaN 1 3 NaN 4 5 3"
]
},
"execution_count": 108,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Fig 9.8\n",
"\n",
"raw_data = [\n",
" [4, 5, np.nan, 5, 1, np.nan, 3, 2],\n",
" [np.nan, 3, 4, 3, 1, 2, 1, np.nan],\n",
" [2, np.nan, 1, 3, np.nan, 4, 5, 3]\n",
"]\n",
"\n",
"import string\n",
"M_uti = pd.DataFrame(raw_data, index=list(string.uppercase[:3]), columns=list(string.lowercase[:8]))\n",
"M_uti"
]
},
{
"cell_type": "code",
"execution_count": 140,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"('A', 'B') jaccard: 0.5\n",
"('A', 'C') jaccard: 0.5\n",
"('B', 'C') jaccard: 0.5\n"
]
}
],
"source": [
"logger.setLevel('WARN')\n",
"# exercise 9.3.1\n",
"from scipy.spatial.distance import jaccard, cosine\n",
"from itertools import combinations\n",
"\n",
"\n",
"def calc_distance_among_matrix(M, func_dis):\n",
" for c in list(combinations(M.index, 2)):\n",
" logger.info('c: {}'.format(c))\n",
" u, v = M.loc[c[0]], M.loc[c[1]]\n",
" logger.info('\\n u:{},\\n v:{}\\n'.format(u.values,v.values))\n",
" print('{} {}: {}'.format(c, func_dis.__name__, func_dis(u,v)))\n",
"\n",
"# (a)\n",
"calc_distance_among_matrix(M_uti.notnull(), jaccard) "
]
},
{
"cell_type": "code",
"execution_count": 141,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"('A', 'B') cosine: 0.398959235991\n",
"('A', 'C') cosine: 0.385081306188\n",
"('B', 'C') cosine: 0.486129880223\n"
]
}
],
"source": [
"# (b)\n",
"calc_distance_among_matrix(M_uti.fillna(value=0), cosine)"
]
},
{
"cell_type": "code",
"execution_count": 142,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"('A', 'B') jaccard: 0.714285714286\n",
"('A', 'C') jaccard: 0.75\n",
"('B', 'C') jaccard: 0.875\n"
]
}
],
"source": [
"# (c)\n",
"M_tmp = M_uti.copy()\n",
"M_tmp[M_uti < 3] = 0\n",
"M_tmp[M_uti >= 3] = 1 \n",
"calc_distance_among_matrix(M_tmp, jaccard)"
]
},
{
"cell_type": "code",
"execution_count": 144,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"('A', 'B') cosine: 0.42264973081\n",
"('A', 'C') cosine: 0.5\n",
"('B', 'C') cosine: 0.711324865405\n"
]
}
],
"source": [
"# (d)\n",
"calc_distance_among_matrix(M_tmp.fillna(value=0), cosine)"
]
},
{
"cell_type": "code",
"execution_count": 146,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
" a b c d e f g \\\n",
"A 0.666667 1.666667 NaN 1.666667 -2.333333 NaN -0.333333 \n",
"B NaN 0.666667 1.666667 0.666667 -1.333333 -0.333333 -1.333333 \n",
"C -1.000000 NaN -2.000000 0.000000 NaN 1.000000 2.000000 \n",
"\n",
" h \n",
"A -1.333333 \n",
"B NaN \n",
"C 0.000000 "
]
},
"execution_count": 146,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# (e)\n",
"M_uti_nor = M_uti.apply(lambda x: x - np.mean(x), axis=1)\n",
"M_uti_nor"
]
},
{
"cell_type": "code",
"execution_count": 148,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"('A', 'B') cosine: 0.415693452532\n",
"('A', 'C') cosine: 1.11547005384\n",
"('B', 'C') cosine: 1.73957399695\n"
]
}
],
"source": [
"# (f)\n",
"calc_distance_among_matrix(M_uti_nor.fillna(value=0), cosine)"
]
},
{
"cell_type": "code",
"execution_count": 149,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# exercise 9.3.2\n",
"#todo"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 9.4 Dimensionality Reduction\n",
"\n",
"UV-decomposition: $$M \\approx U \\times V$$\n",
"\n",
"measure: RMSE (root-mean-square error)\n",
"\n",
"\n",
"#### Building a Complete UV-Decomposition Algorithm\n",
"1. Preprocessing of the matrix $M$. \n",
"   **normalization**: \n",
"   1. subtract the average rating of user $i$, then the average rating of item $j$. \n",
"   2. subtract in the other order: item first, then user. \n",
"   3. subtract half the item average and half the user average.\n",
"\n",
"2. Initializing $U$ and $V$. \n",
"   choice: give the elements of $UV$ the average of the nonblank elements of $M$ \n",
"   $\\implies$ each element of $U$ and $V$ should be $\\sqrt{a/d}$, \n",
"   where $a$ is the average nonblank element of $M$, and $d$ is the length of the short sides of $U$ and $V$. \n",
"   \n",
"   to improve the chance that the local minimum we reach is the global minimum: \n",
"   1. vary the initial values of $U$ and $V$: \n",
"      perturb the value $\\sqrt{a/d}$ randomly. \n",
"   2. vary the way we seek the optimum.\n",
"\n",
"3. Performing the Optimization. \n",
"   pick an optimization path: \n",
"   e.g. choose a permutation of the elements and follow that order in every round. \n",
"   \n",
"   Gradient Descent $\\to$ stochastic gradient descent.\n",
"   \n",
"4. Converging to a Minimum. \n",
"   track the amount of improvement in the RMSE obtained in each round.\n",
"   \n",
"   stopping conditions: \n",
"   1. stop when the improvement in one round falls below a threshold. \n",
"   2. stop when the maximum improvement for any element during a round is below a threshold. \n",
"   \n",
"\n",
"##### Avoiding Overfitting\n",
"solutions:\n",
"\n",
"1. move the value of a component only a fraction of the way from its current value toward its optimized value.\n",
"\n",
"2. stop before the process has converged.\n",
"\n",
"3. take several different $UV$-decompositions, and average their predictions."
]
},
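The steps above can be sketched end to end. This is a minimal version, not the element-at-a-time optimization described in the text: it initializes $U$ and $V$ by randomly perturbing $\sqrt{a/d}$, then takes small full-gradient steps on the squared error over the nonblank entries only; the learning rate and iteration count are arbitrary choices, and the toy matrix is made up.

```python
import numpy as np

def uv_decompose(M, d=2, lr=0.01, n_iter=5000, seed=0):
    mask = ~np.isnan(M)
    a = np.nanmean(M)                 # average nonblank element of M
    rng = np.random.RandomState(seed)
    init = np.sqrt(a / d)
    U = init + 0.1 * rng.randn(M.shape[0], d)  # perturb sqrt(a/d) randomly
    V = init + 0.1 * rng.randn(d, M.shape[1])
    for _ in range(n_iter):
        E = np.where(mask, M - U.dot(V), 0.0)  # error on nonblank entries only
        U = U + lr * E.dot(V.T)                # gradient step for U
        V = V + lr * U.T.dot(E)                # gradient step for V
    E = np.where(mask, M - U.dot(V), 0.0)
    rmse = np.sqrt((E ** 2).sum() / mask.sum())
    return U, V, rmse

M = np.array([[5.0, 2.0, 4.0],
              [3.0, np.nan, 2.0],
              [4.0, 1.0, np.nan]])
U, V, rmse = uv_decompose(M)
print(rmse)  # RMSE over the seven nonblank entries; U.dot(V) fills the blanks
```

The blank entries of `M` are then predicted by the corresponding entries of `U.dot(V)`.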
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# exercise 9.4.6\n",
"\n",
"#todo"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 9.5 The Netflix Challenge\n",
"\n",
"some facts:\n",
"\n",
"1. CineMatch was not a very good algorithm.\n",
"\n",
"2. The UV-decomposition algorithm gave a 7\\% improvement over CineMatch when coupled with normalization and a few other tricks.\n",
"\n",
"3. Combining different algorithms is a preferred strategy.\n",
"\n",
"4. Genre and other information from IMDB turned out not to be useful.\n",
"\n",
"5. Time of rating turned out to be useful: ratings trend upward or downward over time.\n",
"\n",
"todo: read the papers introduced in the chapter."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.10"
}
},
"nbformat": 4,
"nbformat_minor": 0
}