],
"text/plain": [
" userID placeID rating food_rating service_rating\n",
"0 U1077 135085 2 2 2\n",
"1 U1077 135038 2 2 1\n",
"2 U1077 132825 2 2 2\n",
"3 U1077 135060 1 2 2\n",
"4 U1068 135104 1 1 2\n",
"5 U1068 132740 0 0 0\n",
"6 U1068 132663 1 1 1\n",
"7 U1068 132732 0 0 0\n",
"8 U1068 132630 1 1 1\n",
"9 U1067 132584 2 2 2"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.head(10)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
    "Users give restaurants ratings based on food, service and overall quality. Possible ratings are 0, 1 and 2. To distinguish zero ratings from a lack of rating, I replace zero ratings with very small values. I'll use only the overall rating in the analysis."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"data['rating'] = data['rating'].apply(lambda x: 0.000001 if x == 0 else x)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"#Sparse matrix.\n",
"ratings = data.pivot_table(index='userID', columns='placeID', values='rating')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Memory-based."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
    "### User-based. Pearson correlation, neighborhood-based.\n",
    "\n",
    "The algorithm uses Pearson correlation:\n",
    "\n",
    "![Pearson correlation](http://i.imgur.com/MhSERyR.png)\n",
    "\n",
    "where: r - ratings; x, y - users; Ixy - the set of items rated by both user x and user y.\n",
    "\n",
    "First, similarities between users are calculated using Pearson correlation; then the users most similar to the target user are identified. Recommendations are generated based on their ratings."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
    "def pearson(user1, user2, df):\n",
    "    '''\n",
    "    Calculates the similarity between two users. Takes user ids and a dataframe as inputs.\n",
    "    '''\n",
    "\n",
    "    df_short = df[df[user1].notnull() & df[user2].notnull()]\n",
    "\n",
    "    if len(df_short) == 0:\n",
    "        return 0\n",
    "\n",
    "    rat1 = df_short[user1].tolist()\n",
    "    rat2 = df_short[user2].tolist()\n",
    "    #Precompute the means instead of recomputing them inside every sum.\n",
    "    mean1 = sum(rat1) / len(rat1)\n",
    "    mean2 = sum(rat2) / len(rat2)\n",
    "\n",
    "    numerator = sum((rat1[i] - mean1) * (rat2[i] - mean2) for i in range(len(df_short)))\n",
    "    denominator1 = sum((rat1[i] - mean1) ** 2 for i in range(len(df_short)))\n",
    "    denominator2 = sum((rat2[i] - mean2) ** 2 for i in range(len(df_short)))\n",
    "\n",
    "    if denominator1 * denominator2 == 0:\n",
    "        return 0\n",
    "    return numerator / ((denominator1 * denominator2) ** 0.5)"
]
},
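As a sanity check on the formula (not part of the original notebook), the same coefficient can be computed on two hypothetical rating vectors with plain NumPy; `np.corrcoef` should agree with the hand-rolled version:

```python
import numpy as np

# Hypothetical ratings two users gave to the same four places (illustrative numbers).
rat1 = np.array([2.0, 1.0, 2.0, 0.0])
rat2 = np.array([1.0, 1.0, 2.0, 0.0])

# Hand-rolled Pearson correlation, mirroring the notebook's formula.
num = np.sum((rat1 - rat1.mean()) * (rat2 - rat2.mean()))
den = np.sqrt(np.sum((rat1 - rat1.mean()) ** 2) * np.sum((rat2 - rat2.mean()) ** 2))
pearson_manual = num / den

# NumPy's correlation-matrix routine computes the same coefficient.
pearson_np = np.corrcoef(rat1, rat2)[0, 1]
print(pearson_manual, pearson_np)
```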
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.2970444347126886"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#Dataframe is transposed, for easier processing.\n",
"pearson('U1103', 'U1028', ratings.transpose())"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def get_neighbours(user_id, df):\n",
" '''\n",
    "    Creates a sorted list of the users who are most similar to the specified user. Calculates the similarity between\n",
    "    the current user and all other users and sorts by similarity.\n",
" '''\n",
" distances = [(user, pearson(user_id, user, df)) for user in df.columns if user != user_id]\n",
"\n",
" distances.sort(key=lambda x: x[1], reverse=True)\n",
"\n",
" distances = [i for i in distances if i[1] > 0]\n",
" return distances"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[('U1068', 0.79056917787109249), ('U1028', 0.2970444347126886)]"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"get_neighbours('U1103', ratings.transpose())"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"def recommend(user, df, n_users=2, n_recommendations=2):\n",
" '''\n",
    "    Generates recommendations for the user. Takes a userID and a dataframe as input. Gets the neighbours and computes a\n",
    "    weighted score for each place they rated. Returns a sorted list of places and their scores.\n",
" '''\n",
" \n",
" recommendations = {}\n",
" nearest = get_neighbours(user, df)\n",
"\n",
" n_users = n_users if n_users <= len(nearest) else len(nearest)\n",
"\n",
" user_ratings = df[df[user].notnull()][user]\n",
"\n",
" place_ratings = []\n",
"\n",
" for i in range(n_users):\n",
" neighbour_ratings = df[df[nearest[i][0]].notnull()][nearest[i][0]]\n",
" for place in neighbour_ratings.index:\n",
" if place not in user_ratings.index:\n",
" place_ratings.append([place,neighbour_ratings[place],nearest[i][1]])\n",
" \n",
" recommendations = get_ratings(place_ratings)\n",
" return recommendations[:n_recommendations]\n",
"\n",
"def get_ratings(place_ratings):\n",
" \n",
" '''\n",
    "    Creates a dataframe from a list of lists. Calculates the weighted rating for each place.\n",
" '''\n",
"\n",
" ratings_df = pd.DataFrame(place_ratings, columns=['placeID', 'rating', 'weight'])\n",
"\n",
" ratings_df['total_weight'] = ratings_df['weight'].groupby(ratings_df['placeID']).transform('sum')\n",
" recommendations = []\n",
"\n",
" for i in ratings_df.placeID.unique():\n",
" place_ratings = 0\n",
" df_short = ratings_df.loc[ratings_df.placeID == i]\n",
    "        for j, row in df_short.iterrows():\n",
    "            place_ratings += row['rating'] * row['weight'] / row['total_weight']\n",
" recommendations.append((i, place_ratings))\n",
"\n",
" recommendations = [i for i in recommendations if i[1] >= 1]\n",
" \n",
" recommendations.sort(key=lambda x: x[1], reverse=True)\n",
" return recommendations"
]
},
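The weighted score in `get_ratings` is each neighbour's rating weighted by that neighbour's similarity, normalised by the total similarity for the place. A minimal sketch with made-up numbers:

```python
# Two hypothetical neighbours rated the same place; similarities act as weights.
ratings_and_weights = [(2.0, 0.79), (1.0, 0.30)]  # (rating, similarity)

total_weight = sum(w for _, w in ratings_and_weights)
score = sum(r * w / total_weight for r, w in ratings_and_weights)
print(round(score, 3))  # closer to 2.0 because the more similar neighbour rated 2
```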
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[(132564, 2.0), (132613, 1.3934554478666792), (132717, 1.0000005000000001)]"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"recommend('U1068', ratings.transpose(),5,5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
    "### Item-based. Slope-one recommendation.\n",
    "\n",
    "The idea behind the slope-one algorithm is simple: calculate the average difference in ratings for each pair of items and use this difference as the prediction. For example, if users generally rate item A higher than item B by 1 point, then to predict the target user's rating for item A we take their rating of item B and add 1 point. Usually there are more than two items, so a weighted average is used. The algorithm's main advantages are simplicity and speed."
]
},
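To make the weighted-average step concrete, here is a minimal slope-one calculation with made-up deviations and frequencies (not values from the dataset):

```python
# Across users who rated both items, item A averages 0.5 points above item B
# (4 co-ratings) and 1.0 point above item C (2 co-ratings).
dev_ab, freq_ab = 0.5, 4
dev_ac, freq_ac = 1.0, 2

# The target user rated B with 1 and C with 2; predict A as the
# frequency-weighted average of (user's rating + deviation).
rating_b, rating_c = 1.0, 2.0
pred_a = ((rating_b + dev_ab) * freq_ab + (rating_c + dev_ac) * freq_ac) / (freq_ab + freq_ac)
print(pred_a)  # (1.5*4 + 3.0*2) / 6 = 2.0
```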
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"def get_dev_fr(data):\n",
" \n",
    "    '''\n",
    "    Calculates the average rating difference and the frequency (number of co-ratings) for each pair of places.\n",
    "    Both values are calculated over the users who rated both places.\n",
    "    '''\n",
" \n",
" data_dev = pd.DataFrame(index=data.columns,columns=data.columns)\n",
" data_fr = pd.DataFrame(index=data.columns,columns=data.columns)\n",
" for i in data_dev.columns:\n",
" for j in data_dev.columns:\n",
" df_loc = data[data[i].notnull() & data[j].notnull()]\n",
" if len(df_loc) != 0:\n",
"\n",
" data_dev.loc[i,j] = (sum(df_loc[i]) - sum(df_loc[j]))/len(df_loc) if i != j else 0\n",
"\n",
" data_fr.loc[i,j] = len(df_loc) if i != j else 0\n",
" return data_dev, data_fr"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"data_dev, data_fr = get_dev_fr(ratings)"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def slopeone(user, data):\n",
" \n",
    "    '''\n",
    "    Generates recommended ratings for each place the user didn't rate by adding weighted rating differences between places.\n",
    "    '''\n",
    "    #Places the user didn't rate. The comparison x != x finds NaN values.\n",
" recommendation = [i for i in data.columns if data.loc[user,i] != data.loc[user,i]]\n",
" recommendation_dictionary = {}\n",
" \n",
" for j in recommendation:\n",
" score = 0\n",
" denominator = 0\n",
" for i in data.columns.drop(recommendation):\n",
"\n",
" if data_dev.loc[j,i] == data_dev.loc[j,i] and data_fr.loc[j,i] == data_fr.loc[j,i]:\n",
" score += (data.loc[user,i] + data_dev.loc[j,i]) * data_fr.loc[j,i]\n",
" denominator += data_fr.loc[j,i]\n",
" if denominator == 0:\n",
" recommendation_dictionary[j] = 0\n",
" else:\n",
" score = score/denominator\n",
" recommendation_dictionary[j] = score\n",
" \n",
" recommendation_dictionary = {k:round(v,2) for k, v in recommendation_dictionary.items()}\n",
" return sorted(recommendation_dictionary.items(), key=lambda x: x[1], reverse=True)[:5]"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[(132564, 2.0), (132668, 1.5), (132608, 1.24), (132665, 1.0), (132715, 1.0)]"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"slopeone('U1103', ratings)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
    "### Cosine similarity.\n",
    "\n",
    "Another similarity metric is cosine similarity.\n",
    "\n",
    "![similarity](http://i.imgur.com/N2FXjoh.png)\n",
    "\n",
    "Predictions are generated in a similar way to the previous methods - as a weighted rating of other users/items. It is also a good idea to remove user bias: some users tend to give consistently low or high ratings across all places, so I'll take users' average ratings into consideration.\n",
"\n",
"![rating prediction](http://i.imgur.com/fJqezOS.png)"
]
},
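The formula above reduces to a dot product divided by the product of the vector norms. A quick sketch on two hypothetical rating vectors:

```python
import numpy as np

# Two hypothetical user rating vectors over the same three places.
u = np.array([2.0, 1.0, 0.0])
v = np.array([1.0, 1.0, 0.0])

# Cosine similarity: dot product over the product of the norms.
cos = u.dot(v) / (np.linalg.norm(u) * np.linalg.norm(v))
print(round(cos, 4))
```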
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [],
"source": [
"ratings_filled = data.pivot_table(index='userID', columns='placeID', values='rating', fill_value=0)\n",
"ratings_filled = ratings_filled.astype(float).values"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def similarity(ratings, matrix_type='user', epsilon=1e-9):\n",
" if matrix_type == 'user':\n",
" sim = ratings.dot(ratings.T) + epsilon\n",
" elif matrix_type == 'place':\n",
" sim = ratings.T.dot(ratings) + epsilon\n",
" norms = np.array([np.sqrt(np.diagonal(sim))])\n",
" return (sim / norms / norms.T)"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [],
"source": [
"user_similarity = similarity(ratings_filled, matrix_type='user')\n",
"item_similarity = similarity(ratings_filled, matrix_type='place')"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def predict(ratings, similarity, matrix_type='user'):\n",
" \n",
" '''\n",
" Predict places based on similarity. \n",
" '''\n",
" \n",
" if matrix_type == 'user':\n",
    "        #Bias as the sum of non-zero values divided by the number of non-zero values.\n",
" user_bias = np.true_divide(ratings.sum(axis=1),(ratings!=0).sum(axis=1))\n",
" ratings = (ratings - user_bias[:, np.newaxis]).copy()\n",
" pred = similarity.dot(ratings) / np.array([np.abs(similarity).sum(axis=1)]).T\n",
" pred += user_bias[:, np.newaxis]\n",
" \n",
" elif matrix_type == 'place':\n",
" item_bias = np.true_divide(ratings.sum(axis=0),(ratings!=0).sum(axis=0))\n",
" ratings = (ratings - item_bias[np.newaxis, :]).copy()\n",
" pred = ratings.dot(similarity) / np.array([np.abs(similarity).sum(axis=1)])\n",
" pred += item_bias[np.newaxis, :]\n",
" \n",
" return pred\n",
"\n",
"def recommend_cosine(rating, matrix, user):\n",
" \n",
" '''\n",
" If user has rated a place, replace predicted rating with 0. Return top-5 predictions.\n",
" '''\n",
" \n",
" predictions = [[0 if rating[j][i] > 0 else matrix[j][i] for i in range(len(matrix[j]))] for j in range(len(matrix))]\n",
" recommendations = pd.DataFrame(index=ratings.index,columns=ratings.columns,data=np.round(predictions,4)).transpose()\n",
" return recommendations[user].sort_values(ascending=False)[:5]"
]
},
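The bias-removal step in `predict` averages only over observed (non-zero) ratings. A tiny sketch of that trick on a made-up matrix:

```python
import numpy as np

# Hypothetical ratings where zero means "not rated".
toy_ratings = np.array([
    [2., 0., 1.],
    [0., 2., 2.],
])

# Per-user bias: the average over rated (non-zero) entries only,
# mirroring the np.true_divide computation in predict().
user_bias = np.true_divide(toy_ratings.sum(axis=1), (toy_ratings != 0).sum(axis=1))
print(user_bias)  # [1.5 2. ]
```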
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [],
"source": [
"user_pred = predict(ratings_filled, user_similarity, matrix_type='user')\n",
"item_pred = predict(ratings_filled, item_similarity, matrix_type='place')"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"placeID\n",
"132660 0.7804\n",
"132955 0.6337\n",
"135034 0.6262\n",
"134986 0.6229\n",
"132922 0.4647\n",
"Name: U1103, dtype: float64"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"recommend_cosine(ratings_filled, item_pred, 'U1103')"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"placeID\n",
"132660 0.3974\n",
"132740 0.3504\n",
"132608 0.3306\n",
"132594 0.2953\n",
"132609 0.1946\n",
"Name: U1103, dtype: float64"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"recommend_cosine(ratings_filled, user_pred, 'U1103')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Model-based. ALS.\n",
"\n",
    "Model-based collaborative filtering strives to find latent features in the data. Matrix factorization is a commonly used method. It means finding two matrices whose product yields the ratings matrix: both the ratings we already have and the predicted ones. One matrix is for users (P), the other for items (Q), and the ratings matrix (R) is their product. One dimension of P and Q is the number of users/items respectively; the other is the number of latent features.\n",
"\n",
"There are several algorithms with which matrix factorization could be done. Alternating least squares is one of them.\n",
"\n",
"$$\\underset{Q* , P*}{min}\\sum_{(u,i)\\epsilon K }(r_{ui}-Q_i^TP_u)^2+\\lambda(\\left \\| Q_i \\right \\|^2 + \\left \\| P_u \\right \\|^2)$$\n",
"\n",
    "The algorithm optimizes the difference between the original ratings and the ratings produced by multiplying the two matrices. The second part of the formula is regularization. The idea of ALS is to fix one of the matrices (P or Q), optimize for the other, then at the next step fix the second matrix and optimize for the first.\n",
"\n",
"\n",
"$${p}_{i} = A^{-1}_{i} V_{i} \\ with\\ A_{i} = Q_{I_i} Q_{I_i}^{T} + \\lambda n_{p_i} E \\ and\\ V_{i} = Q_{I_i} R^{T}(i,I_{i})$$\n",
"\n",
"$${q}_{j} = A^{-1}_{j} V_{j} \\ with\\ A_{j} = P_{I_j} P_{I_j}^{T} + \\lambda n_{q_j} E \\ and\\ V_{j} = P_{I_j} R^{T}(I_{j},j)$$\n",
"\n",
    "P. S. This part is heavily based on [this article](http://online.cambridgecoding.com/notebooks/mhaller/predicting-user-preferences-in-python-using-alternating-least-squares)."
]
},
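A minimal sketch of one alternating step, with made-up factors rather than the notebook's data: fixing Q, each user vector p_u is the solution of the regularized linear system from the formulas above.

```python
import numpy as np

rng = np.random.default_rng(0)
k, lmbda = 2, 0.1

Q = rng.random((k, 4))                 # fixed item factor matrix (k x n_items)
r_u = np.array([2.0, 0.0, 1.0, 2.0])   # one user's ratings, 0 = unrated
rated = r_u != 0                        # mask of observed ratings

Qi = Q[:, rated]                                   # factors of the rated items only
A = Qi @ Qi.T + lmbda * rated.sum() * np.eye(k)    # regularized normal matrix A_i
V = Qi @ r_u[rated]                                # right-hand side V_i
p_u = np.linalg.solve(A, V)                        # updated user vector p_i = A_i^{-1} V_i
print(p_u)
```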
{
"cell_type": "markdown",
"metadata": {},
"source": [
    "At first I need to split the data into train and test sets to check how the error changes with each step. Train and test should have the same dimensions as the original data, so here is a function for it. For each user who has rated at least one place, I hold out up to three of their ratings - this is the test set. The train set is all other values."
]
},
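The hold-out scheme described above can be sketched on a toy matrix (made-up numbers, not the dataset): the train and test matrices keep the original shape, never overlap, and together reconstruct the original.

```python
import numpy as np

# A tiny hypothetical ratings matrix (rows = users, columns = places, 0 = no rating).
toy_ratings = np.array([
    [2., 1., 0., 2., 1.],
    [0., 2., 2., 1., 2.],
])

rng = np.random.default_rng(1)
toy_test = np.zeros(toy_ratings.shape)
toy_train = toy_ratings.copy()
for user in range(toy_ratings.shape[0]):
    rated = toy_ratings[user].nonzero()[0]
    held_out = rng.choice(rated, size=3, replace=False)  # hold out three rated entries
    toy_train[user, held_out] = 0.
    toy_test[user, held_out] = toy_ratings[user, held_out]

print(toy_train.shape, toy_test.shape)
```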
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [],
"source": [
    "def train_test_split(ratings):\n",
    "    test = np.zeros(ratings.shape)\n",
    "    train = ratings.copy()\n",
    "    non_zero = [i for i in range(ratings.shape[0]) if ratings[i].sum() > 0]\n",
    "    for user in non_zero:\n",
    "        rated = ratings[user, :].nonzero()[0]\n",
    "        #Hold out up to three ratings per user; min() keeps np.random.choice from failing for users with fewer than three ratings.\n",
    "        test_ratings = np.random.choice(rated,\n",
    "                                        size=min(3, len(rated)),\n",
    "                                        replace=False)\n",
    "        train[user, test_ratings] = 0.\n",
    "        test[user, test_ratings] = ratings[user, test_ratings]\n",
    "\n",
    "    return train, test"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [],
"source": [
"R, T = train_test_split(ratings_filled)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now I need index matrix, where value 1 means that a certain user has rated a particular item."
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"I = R.copy()\n",
"I[I > 0] = 1\n",
"I[I == 0] = 0\n",
"\n",
"I2 = T.copy()\n",
"I2[I2 > 0] = 1\n",
"I2[I2 == 0] = 0"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def rmse(I,R,Q,P):\n",
" return np.sqrt(np.sum((I * (R - np.dot(P.T,Q)))**2)/len(R[R > 0]))"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [],
"source": [
"def als(R=R, T=T, lmbda=0.1, k=40, n_epochs=30, I=I, I2=I2):\n",
" \n",
    "    '''\n",
    "    Function for ALS. Takes matrices and parameters as inputs.\n",
    "    lmbda - regularization strength;\n",
    "    k - dimensionality of the latent feature space;\n",
    "    n_epochs - number of epochs for training.\n",
    "    '''\n",
    "    #Number of users and items.\n",
    "    m, n = R.shape\n",
    "    P = 1.5 * np.random.rand(k,m) # Latent user feature matrix.\n",
    "    Q = 1.5 * np.random.rand(k,n) # Latent place feature matrix.\n",
    "    Q[0,:] = R[R != 0].mean() # Global average of the non-zero ratings for the initial step.\n",
    "    E = np.eye(k) # (k x k)-dimensional identity matrix.\n",
" \n",
" train_errors = []\n",
" test_errors = []\n",
"\n",
"\n",
" for epoch in range(n_epochs):\n",
" # Fix Q and estimate P\n",
" for i, Ii in enumerate(I):\n",
" nui = np.count_nonzero(Ii)\n",
" if (nui == 0): nui = 1\n",
"\n",
" a = np.dot(np.diag(Ii), Q.T)\n",
" Ai = np.dot(Q, a) + lmbda * nui * E\n",
" v = np.dot(np.diag(Ii), R[i].T)\n",
" Vi = np.dot(Q, v)\n",
" P[:,i] = np.linalg.solve(Ai,Vi)\n",
"\n",
" # Fix P and estimate Q\n",
" for j, Ij in enumerate(I.T):\n",
" nmj = np.count_nonzero(Ij)\n",
" if (nmj == 0): nmj = 1\n",
"\n",
" a = np.dot(np.diag(Ij), P.T)\n",
" Aj = np.dot(P, a) + lmbda * nmj * E\n",
" v = np.dot(np.diag(Ij), R[:,j])\n",
" Vj = np.dot(P, v)\n",
" Q[:,j] = np.linalg.solve(Aj,Vj)\n",
"\n",
" train_rmse = rmse(I,R,Q,P)\n",
" test_rmse = rmse(I2,T,Q,P)\n",
" train_errors.append(train_rmse)\n",
" test_errors.append(test_rmse)\n",
"\n",
" print(f'[Epoch {epoch+1}/{n_epochs}] train error: {train_rmse:6.6}, test error: {test_rmse:6.6}')\n",
" \n",
    "        if len(train_errors) > 1 and test_errors[-1] > test_errors[-2]:\n",
    "            break\n",
" print('Test error stopped improving, algorithm stopped')\n",
" \n",
" R = pd.DataFrame(R)\n",
" R.columns = ratings.columns\n",
" R.index = ratings.index\n",
" \n",
" R_pred = pd.DataFrame(np.dot(P.T,Q))\n",
" R_pred.columns = ratings.columns\n",
" R_pred.index = ratings.index\n",
" \n",
" return pd.DataFrame(R), R_pred"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[Epoch 1/30] train error: 0.720442, test error: 1.13765\n",
"[Epoch 2/30] train error: 0.309338, test error: 0.961706\n",
"[Epoch 3/30] train error: 0.249321, test error: 0.948285\n",
"[Epoch 4/30] train error: 0.224138, test error: 0.944137\n",
"[Epoch 5/30] train error: 0.211072, test error: 0.94301\n",
"[Epoch 6/30] train error: 0.203513, test error: 0.943203\n",
"Test error stopped improving, algorithm stopped\n"
]
}
],
"source": [
"R, R_pred = als()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
    "So this is it. Now let's compare a user's actual and predicted ratings."
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
],
"text/plain": [
" Actual Predicted\n",
"placeID \n",
"132584 1.0 0.995225\n",
"132594 1.0 0.923987\n",
"132733 1.0 0.994598\n",
"132740 1.0 0.874841\n",
"135104 2.0 1.554813"
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"user_ratings = R.transpose()['U1123'][R.transpose()['U1123'].sort_values(ascending=False) >=1]\n",
"predictions = pd.DataFrame(user_ratings)\n",
"predictions.columns = ['Actual']\n",
"predictions['Predicted'] = R_pred.loc['U1123',user_ratings.index]\n",
"predictions"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
    "The difference between the actual and predicted ratings is indeed quite small. And now let's see the recommendations."
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"placeID\n",
"135034 1.497545\n",
"132958 1.487171\n",
"132723 1.453056\n",
"135075 1.435531\n",
"132755 1.422608\n",
"Name: U1123, dtype: float64"
]
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"R_pred.loc['U1123',set(R_pred.transpose().index)-set(user_ratings.index)].sort_values(ascending=False)[:5]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Conclusions\n",
"\n",
    "Collaborative filtering encompasses many methods, each with its own advantages and disadvantages, so the method should be chosen depending on the situation. Collaborative filtering can also be combined with other approaches to building recommendation systems - such as content-based approaches."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python [Root]",
"language": "python",
"name": "Python [Root]"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.1"
}
},
"nbformat": 4,
"nbformat_minor": 2
}