{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"\n",
"# Latent Factor Recommender System\n",
"\n",
"\n",
"![image.png](images/author.png)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"The Netflix Challenge\n",
"\n",
"**Training data:** 100 million ratings from 480,000 users on 17,770 movies, covering 6 years of data (2000-2005)\n",
"\n",
"**Test data:** the last few ratings of each user (2.8 million in total)\n",
"\n",
"**Competition:** 2,700+ teams, $1 million prize for a 10% improvement over Netflix’s own system"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"- Evaluation criterion: Root Mean Square Error (RMSE)\n",
"\n",
"\\begin{equation}\n",
" RMSE = \\sqrt{\\frac{1}{|R|} \\sum_{(i, x)\\in R}(\\hat{r}_{xi} - r_{xi})^2}\n",
"\\end{equation}\n",
"- Netflix’s system RMSE: 0.9514"
]
},
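{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"As a small illustration of the criterion, the sketch below computes RMSE as the square root of the mean squared error over the rated pairs. The helper name `rmse` and the four ratings are made up for this example; they are not Netflix data.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"def rmse(predicted, actual):\n",
"    # Root mean square error over the rated (user, movie) pairs\n",
"    predicted = np.asarray(predicted, dtype=float)\n",
"    actual = np.asarray(actual, dtype=float)\n",
"    return np.sqrt(np.mean((predicted - actual) ** 2))\n",
"\n",
"# Hypothetical ratings for illustration only\n",
"actual = [4, 3, 5, 1]\n",
"predicted = [3.5, 3.0, 4.0, 2.0]\n",
"print(rmse(predicted, actual))\n",
"```\n",
"\n",
"The errors here are 0.5, 0, 1 and 1, so the mean squared error is 0.5625 and the RMSE is 0.75."
]
},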
{
"cell_type": "markdown",
"metadata": {
"ExecuteTime": {
"end_time": "2020-06-13T11:10:28.931508Z",
"start_time": "2020-06-13T11:10:28.922374Z"
},
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"![image.png](images/latent.png)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"![image.png](images/latent2.png)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"![image.png](images/latent3.png)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"![image.png](images/latent4.png)"
]
},
{
"cell_type": "markdown",
"metadata": {
"ExecuteTime": {
"end_time": "2020-06-13T11:10:28.931508Z",
"start_time": "2020-06-13T11:10:28.922374Z"
},
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"![image.png](images/latent5.png)\n",
"\n",
"For an $m \\times n$ matrix: $U$ ($m \\times m$), $\\Sigma$ ($m \\times n$), $V^T$ ($n \\times n$)"
]
},
{
"cell_type": "code",
"execution_count": 74,
"metadata": {
"ExecuteTime": {
"end_time": "2020-06-21T05:20:41.371788Z",
"start_time": "2020-06-21T05:20:41.366179Z"
},
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[[1 2 3]\n",
" [4 5 6]\n",
" [7 8 9]]\n",
"[[-8.08154958e+00 -9.64331175e+00 -1.12050739e+01]\n",
" [-8.29792976e-01 -8.08611173e-02 6.68070742e-01]\n",
" [-1.36140716e-16 2.72281431e-16 -1.36140716e-16]]\n"
]
}
],
"source": [
"import numpy as np\n",
"\n",
"A = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])\n",
"print(A)\n",
"# Singular-value decomposition\n",
"U, s, VT = np.linalg.svd(A)\n",
"# A is square (3 x 3), so np.diag(s) is already the full n x n Sigma matrix\n",
"Sigma = np.diag(s)\n",
"# scale the rows of V^T by the singular values: PT = Sigma V^T\n",
"PT = Sigma.dot(VT)\n",
"# the full reconstruction of A would be U.dot(PT)\n",
"#B = U.dot(Sigma.dot(VT))\n",
"print(PT)"
]
},
{
"cell_type": "markdown",
"metadata": {
"ExecuteTime": {
"end_time": "2020-06-21T05:24:08.464482Z",
"start_time": "2020-06-21T05:24:08.462334Z"
},
"slideshow": {
"slide_type": "subslide"
}
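},
"source": [
"$\\Sigma$ should have the same shape as the matrix A, but `np.linalg.svd()` returns the singular values only as a 1-D array of length min(m, n), without the zero padding. It therefore has to be expanded into a matrix before it can be multiplied with $U$ and $V^T$.\n",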
"\n",
"![image.png](images/latent6.png)"
]
},
{
"cell_type": "code",
"execution_count": 91,
"metadata": {
"ExecuteTime": {
"end_time": "2020-06-21T05:38:31.694326Z",
"start_time": "2020-06-21T05:38:31.685324Z"
},
"code_folding": [],
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"A = \n",
" [[1 2]\n",
" [3 4]\n",
" [5 6]] \n",
"\n",
"U = \n",
" [[-0.2298477 0.88346102 0.40824829]\n",
" [-0.52474482 0.24078249 -0.81649658]\n",
" [-0.81964194 -0.40189603 0.40824829]] \n",
"\n",
"Sigma = \n",
" [[9.52551809 0. ]\n",
" [0. 0.51430058]\n",
" [0. 0. ]] \n",
"\n",
"VT = \n",
" [[-0.61962948 -0.78489445]\n",
" [-0.78489445 0.61962948]] \n",
"\n",
"PT = \n",
" [[-5.90229186 -7.47652631]\n",
" [-0.40367167 0.3186758 ]\n",
" [ 0. 0. ]] \n",
"\n",
"B = \n",
" [[1. 2.]\n",
" [3. 4.]\n",
" [5. 6.]] \n",
"\n"
]
}
],
"source": [
"# Singular-value decomposition\n",
"A = np.array([[1, 2], [3, 4], [5, 6]])\n",
"U, s, VT = np.linalg.svd(A)\n",
"# create an m x n Sigma matrix of zeros (A is tall here: m = 3, n = 2)\n",
"Sigma = np.zeros((A.shape[0], A.shape[1]))\n",
"# populate the top n x n block of Sigma with the diagonal matrix of singular values\n",
"Sigma[:A.shape[1], :A.shape[1]] = np.diag(s)\n",
"# reconstruct matrix: B = U Sigma V^T\n",
"PT = Sigma.dot(VT)\n",
"B = U.dot(PT)\n",
"\n",
"print('A = \\n', A, '\\n')\n",
"print('U = \\n', U, '\\n')\n",
"print('Sigma = \\n', Sigma, '\\n')\n",
"print('VT = \\n', VT, '\\n')\n",
"print('PT = \\n', PT, '\\n')\n",
"print('B = \\n', B, '\\n') "
]
},
{
"cell_type": "code",
"execution_count": 107,
"metadata": {
"ExecuteTime": {
"end_time": "2020-06-21T06:12:38.801639Z",
"start_time": "2020-06-21T06:12:38.790584Z"
},
"code_folding": [
0
],
"scrolled": true,
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"A = \n",
" [[1 2 3]\n",
" [4 5 6]] \n",
"\n",
"U = \n",
" [[-0.3863177 -0.92236578]\n",
" [-0.92236578 0.3863177 ]] \n",
"\n",
"Sigma = \n",
" [[9.508032 0. 0. ]\n",
" [0. 0.77286964 0. ]\n",
" [0. 0. 0. ]] \n",
"\n",
"VT = \n",
" [[-0.42866713 -0.56630692 -0.7039467 ]\n",
" [ 0.80596391 0.11238241 -0.58119908]\n",
" [ 0.40824829 -0.81649658 0.40824829]] \n",
"\n",
"PT = \n",
" [[-4.07578082 -5.38446431 -6.69314779]\n",
" [ 0.62290503 0.08685696 -0.44919112]] \n",
"\n",
"B = \n",
" [[1. 2. 3.]\n",
" [4. 5. 6.]] \n",
"\n"
]
}
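],
"source": [
"# Singular-value decomposition\n",
"A = np.array([[1, 2, 3],\n",
"              [4, 5, 6]])\n",
"U, S, VT = np.linalg.svd(A)\n",
"# create an n x n Sigma matrix of zeros (A is wide here: m = 2, n = 3)\n",
"Sigma = np.zeros((A.shape[1], A.shape[1]))\n",
"# svd returns only min(m, n) singular values; pad with a zero so the diagonal has n entries\n",
"# (a single append is enough only because n - m = 1 for this A)\n",
"if A.shape[1] > S.shape[0]:\n",
"    S = np.append(S, 0)\n",
"Sigma[:A.shape[1], :A.shape[1]] = np.diag(S)\n",
"\n",
"# Sigma V^T is n x n; keep its first m rows so it can be multiplied by the m x m matrix U\n",
"PT = Sigma.dot(VT)\n",
"PT = PT[0:A.shape[0]]\n",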
"B = U.dot(PT)\n",
"print('A = \\n', A, '\\n')\n",
"print('U = \\n', U, '\\n')\n",
"print('Sigma = \\n', Sigma, '\\n')\n",
"print('VT = \\n', VT, '\\n')\n",
"print('PT = \\n', PT, '\\n')\n",
"print('B = \\n', B, '\\n')"
]
},
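{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"The two cells above build $\\Sigma$ by hand, once for a tall matrix and once for a wide one. A shape-independent version might look like the sketch below. `full_sigma` is a hypothetical helper name introduced here, not a NumPy function, and the two test matrices are the same toy examples as above.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"def full_sigma(s, shape):\n",
"    # Place the singular values on the diagonal of an m x n zero matrix\n",
"    Sigma = np.zeros(shape)\n",
"    k = len(s)  # k = min(m, n)\n",
"    Sigma[:k, :k] = np.diag(s)\n",
"    return Sigma\n",
"\n",
"for A in (np.array([[1, 2], [3, 4], [5, 6]]),   # tall, 3 x 2\n",
"          np.array([[1, 2, 3], [4, 5, 6]])):    # wide, 2 x 3\n",
"    U, s, VT = np.linalg.svd(A)                 # full_matrices=True by default\n",
"    Sigma = full_sigma(s, A.shape)\n",
"    B = U @ Sigma @ VT                          # (m x m)(m x n)(n x n)\n",
"    print(A.shape, Sigma.shape, np.allclose(A, B))\n",
"```\n",
"\n",
"Because $\\Sigma$ is built with the same shape as A, no padding of the singular values or slicing of the product is needed in either case."
]
},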
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"![image.png](images/latent7.png)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"![image.png](images/latent8.png)"
]
},
{
"cell_type": "markdown",
"metadata": {
"ExecuteTime": {
"end_time": "2020-06-13T11:10:28.931508Z",
"start_time": "2020-06-13T11:10:28.922374Z"
},
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"SVD gives the minimum reconstruction error (Sum of Squared Errors, **SSE**)\n",
"\n",
"SSE and RMSE are monotonically related: $RMSE=\\frac{1}{c}\\sqrt{SSE}$, where $c$ is a constant\n",
" \n",
"Great news: SVD therefore also minimizes RMSE"
]
},
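{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"A quick numerical check of this relationship, as a sketch on a small random matrix that is assumed to be fully observed (no missing entries). Keeping only the top `k` singular values leaves an SSE equal to the sum of the squared discarded singular values, and the RMSE follows with $c$ equal to the square root of the number of entries. The matrix size, the seed and `k` are arbitrary choices for illustration.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"rng = np.random.default_rng(0)\n",
"A = rng.normal(size=(6, 4))          # toy data, fully observed\n",
"\n",
"U, s, VT = np.linalg.svd(A, full_matrices=False)\n",
"k = 2\n",
"A_k = U[:, :k] @ np.diag(s[:k]) @ VT[:k, :]   # rank-k approximation\n",
"\n",
"sse = np.sum((A - A_k) ** 2)\n",
"discarded = np.sum(s[k:] ** 2)       # squared singular values that were dropped\n",
"rmse = np.sqrt(sse / A.size)         # c = sqrt(number of entries)\n",
"print(np.isclose(sse, discarded), rmse)\n",
"```\n",
"\n",
"Note that this guarantee concerns every entry of the matrix, so it applies to a rating matrix only after its missing entries have been filled in."
]
},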
{
"cell_type": "code",
"execution_count": 226,
"metadata": {
"ExecuteTime": {
"end_time": "2020-06-21T07:14:01.788580Z",
"start_time": "2020-06-21T07:13:59.250457Z"
},
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [],
"source": [
"# https://beckernick.github.io/matrix-factorization-recommender/\n",
"import pandas as pd\n",
"import numpy as np\n",
"\n",
"ratings_list = [i.strip().split(\"::\") for i in open('/Users/datalab/bigdata/cjc/ml-1m/ratings.dat', 'r').readlines()]\n",
"users_list = [i.strip().split(\"::\") for i in open('/Users/datalab/bigdata/cjc/ml-1m/users.dat', 'r').readlines()]\n",
"movies_list = [i.strip().split(\"::\") for i in open('/Users/datalab/bigdata/cjc/ml-1m/movies.dat', 'r', encoding = 'iso-8859-15').readlines()]\n",
"\n",
"ratings_df = pd.DataFrame(ratings_list, columns = ['UserID', 'MovieID', 'Rating', 'Timestamp'], dtype = int)\n",
"movies_df = pd.DataFrame(movies_list, columns = ['MovieID', 'Title', 'Genres'])\n",
"\n",
"movies_df['MovieID'] = movies_df['MovieID'].astype('int64')\n",
"ratings_df['UserID'] = ratings_df['UserID'].astype('int64')\n",
"ratings_df['MovieID'] = ratings_df['MovieID'].astype('int64')"
]
},
{
"cell_type": "code",
"execution_count": 219,
"metadata": {
"ExecuteTime": {
"end_time": "2020-06-21T07:12:09.215096Z",
"start_time": "2020-06-21T07:12:09.207845Z"
},
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"data": {
"text/plain": [
" MovieID Title Genres\n",
"0 1 Toy Story (1995) Animation|Children's|Comedy\n",
"1 2 Jumanji (1995) Adventure|Children's|Fantasy\n",
"2 3 Grumpier Old Men (1995) Comedy|Romance\n",
"3 4 Waiting to Exhale (1995) Comedy|Drama\n",
"4 5 Father of the Bride Part II (1995) Comedy"
]
},
"execution_count": 219,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"movies_df.head()\n"
]
},
{
"cell_type": "code",
"execution_count": 241,
"metadata": {
"ExecuteTime": {
"end_time": "2020-06-21T07:16:26.039680Z",
"start_time": "2020-06-21T07:16:26.031571Z"
},
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"data": {
"text/plain": [
" UserID MovieID Rating Timestamp\n",
"0 1 1193 5 978300760\n",
"1 1 661 3 978302109\n",
"2 1 914 3 978301968\n",
"3 1 3408 4 978300275\n",
"4 1 2355 5 978824291"
]
},
"execution_count": 241,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ratings_df.head()"
]
},
{
"cell_type": "code",
"execution_count": 227,
"metadata": {
"ExecuteTime": {
"end_time": "2020-06-21T07:14:09.715349Z",
"start_time": "2020-06-21T07:14:05.949062Z"
},
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"MovieID 1 2 3 4 5 6 7 8 9 10 ... 3943 3944 3945 \\\n",
"UserID ... \n",
"1 5 0 0 0 0 0 0 0 0 0 ... 0 0 0 \n",
"2 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 \n",
"3 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 \n",
"4 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 \n",
"5 0 0 0 0 0 2 0 0 0 0 ... 0 0 0 \n",
"\n",
"MovieID 3946 3947 3948 3949 3950 3951 3952 \n",
"UserID \n",
"1 0 0 0 0 0 0 0 \n",
"2 0 0 0 0 0 0 0 \n",
"3 0 0 0 0 0 0 0 \n",
"4 0 0 0 0 0 0 0 \n",
"5 0 0 0 0 0 0 0 \n",
"\n",
"[5 rows x 3706 columns]"
]
},
"execution_count": 227,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Note: missing ratings are filled with 0\n",
"R_df = ratings_df.pivot(index = 'UserID', columns ='MovieID', values = 'Rating').fillna(0)\n",
"R_df.head()"
]
},
{
"cell_type": "code",
"execution_count": 228,
"metadata": {
"ExecuteTime": {
"end_time": "2020-06-21T07:14:25.908888Z",
"start_time": "2020-06-21T07:14:24.982459Z"
},
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [],
"source": [
"R = R_df.to_numpy(dtype=np.int16)\n",
"user_ratings_mean = np.mean(R, axis = 1)\n",
"R_demeaned = R - user_ratings_mean.reshape(-1, 1)"
]
},
{
"cell_type": "code",
"execution_count": 229,
"metadata": {
"ExecuteTime": {
"end_time": "2020-06-21T07:14:28.928544Z",
"start_time": "2020-06-21T07:14:26.842984Z"
},
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [],
"source": [
"from scipy.sparse.linalg import svds\n",
"U, sigma, Vt = svds(R_demeaned, k = 50)"
]
},
{
"cell_type": "code",
"execution_count": 230,
"metadata": {
"ExecuteTime": {
"end_time": "2020-06-21T07:14:30.367595Z",
"start_time": "2020-06-21T07:14:30.142945Z"
},
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [],
"source": [
"sigma = np.diag(sigma)\n",
"\n",
"all_user_predicted_ratings = U.dot( sigma.dot(Vt)) + user_ratings_mean.reshape(-1, 1)\n",
"preds_df = pd.DataFrame(all_user_predicted_ratings, columns = R_df.columns)\n"
]
},
{
"cell_type": "code",
"execution_count": 239,
"metadata": {
"ExecuteTime": {
"end_time": "2020-06-21T07:15:53.058613Z",
"start_time": "2020-06-21T07:15:53.031321Z"
},
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"MovieID 1 2 3 4 5 6 7 \\\n",
"0 4.288861 0.143055 -0.195080 -0.018843 0.012232 -0.176604 -0.074120 \n",
"1 0.744716 0.169659 0.335418 0.000758 0.022475 1.353050 0.051426 \n",
"2 1.818824 0.456136 0.090978 -0.043037 -0.025694 -0.158617 -0.131778 \n",
"3 0.408057 -0.072960 0.039642 0.089363 0.041950 0.237753 -0.049426 \n",
"4 1.574272 0.021239 -0.051300 0.246884 -0.032406 1.552281 -0.199630 \n",
"... ... ... ... ... ... ... ... \n",
"6035 2.392388 0.233964 0.413676 0.443726 -0.083641 2.192294 1.168936 \n",
"6036 2.070760 0.139294 -0.012666 -0.176990 0.261243 1.074234 0.083999 \n",
"6037 0.619089 -0.161769 0.106738 0.007048 -0.074701 -0.079953 0.100220 \n",
"6038 1.503605 -0.036208 -0.161268 -0.083401 -0.081617 -0.143517 0.106668 \n",
"6039 1.996248 -0.185987 -0.156478 0.104143 -0.030001 0.105521 -0.168477 \n",
"\n",
"MovieID 8 9 10 ... 3943 3944 3945 \\\n",
"0 0.141358 -0.059553 -0.195950 ... 0.027807 0.001640 0.026395 \n",
"1 0.071258 0.161601 1.567246 ... -0.056502 -0.013733 -0.010580 \n",
"2 0.098977 0.030551 0.735470 ... 0.040481 -0.005301 0.012832 \n",
"3 0.009467 0.045469 -0.111370 ... 0.008571 -0.005425 -0.008500 \n",
"4 -0.014920 -0.060498 0.450512 ... 0.110151 0.046010 0.006934 \n",
"... ... ... ... ... ... ... ... \n",
"6035 0.145237 -0.046551 0.560895 ... 0.188493 -0.004439 -0.042271 \n",
"6036 0.013814 -0.030179 -0.084956 ... -0.161548 0.001184 -0.029223 \n",
"6037 -0.034013 0.007671 0.001280 ... -0.053546 0.005835 0.007551 \n",
"6038 -0.054404 -0.008826 0.205801 ... -0.006104 0.008933 0.007595 \n",
"6039 -0.058174 0.122714 -0.119716 ... 0.238088 -0.047046 -0.043259 \n",
"\n",
"MovieID 3946 3947 3948 3949 3950 3951 3952 \n",
"0 -0.022024 -0.085415 0.403529 0.105579 0.031912 0.050450 0.088910 \n",
"1 0.062576 -0.016248 0.155790 -0.418737 -0.101102 -0.054098 -0.140188 \n",
"2 0.029349 0.020866 0.121532 0.076205 0.012345 0.015148 -0.109956 \n",
"3 -0.003417 -0.083982 0.094512 0.057557 -0.026050 0.014841 -0.034224 \n",
"4 -0.015940 -0.050080 -0.052539 0.507189 0.033830 0.125706 0.199244 \n",
"... ... ... ... ... ... ... ... \n",
"6035 -0.090101 0.276312 0.133806 0.732374 0.271234 0.244983 0.734771 \n",
"6036 -0.047087 0.099036 -0.192653 -0.091265 0.050798 -0.113427 0.033283 \n",
"6037 -0.024082 -0.010739 -0.008863 -0.099774 -0.013369 -0.030354 -0.114936 \n",
"6038 -0.037800 0.050743 0.024052 -0.172466 -0.010904 -0.038647 -0.168359 \n",
"6039 0.038256 0.055693 0.149593 0.587989 -0.006641 0.127067 0.285001 \n",
"\n",
"[6040 rows x 3706 columns]"
]
},
"execution_count": 239,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"preds_df\n",
"# each row is a user\n",
"# each column is a movie"
]
},
{
"cell_type": "code",
"execution_count": 246,
"metadata": {
"ExecuteTime": {
"end_time": "2020-06-21T07:18:51.751294Z",
"start_time": "2020-06-21T07:18:51.744349Z"
},
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [],
"source": [
"def recommend_movies(preds_df, user_row_number, movies_df, ratings_df, num_recommendations=5):\n",
"    # Get and sort the user's predictions\n",
"    sorted_user_predictions = preds_df.iloc[user_row_number].sort_values(ascending=False)\n",
"\n",
"    # Get the user's data and merge in the movie information.\n",
"    # Row 0 of preds_df corresponds to UserID 1 (IDs are assumed to be 1-based and contiguous)\n",
"    userID = user_row_number + 1\n",
"    user_data = ratings_df[ratings_df.UserID == userID]\n",
"    user_full = (\n",
"        user_data.merge(movies_df, how='left', on='MovieID')\n",
"        .sort_values(['Rating'], ascending=False)\n",
"    )\n",
"\n",
"    print('UserID {0} has already rated {1} movies.'.format(userID, user_full.shape[0]))\n",
"    print('Recommending the highest {0} predicted ratings movies not already rated.'.format(num_recommendations))\n",
"\n",
"    # Recommend the highest predicted rating movies that the user hasn't rated yet.\n",
"    potential_movie_df = movies_df[~movies_df['MovieID'].isin(user_full['MovieID'])]\n",
"    predicted_movie_df = pd.DataFrame(sorted_user_predictions).reset_index()\n",
"    predicted_movie_df['MovieID'] = predicted_movie_df['MovieID'].astype('int64')\n",
"    recommendations = (\n",
"        potential_movie_df.merge(predicted_movie_df, how='left', on='MovieID')\n",
"        .rename(columns={user_row_number: 'Predictions'})\n",
"        .sort_values('Predictions', ascending=False)\n",
"        .iloc[:num_recommendations, :-1]  # drop the trailing Predictions column\n",
"    )\n",
"\n",
"    return user_full, recommendations"
]
},
{
"cell_type": "code",
"execution_count": 247,
"metadata": {
"ExecuteTime": {
"end_time": "2020-06-21T07:18:53.887987Z",
"start_time": "2020-06-21T07:18:53.871109Z"
},
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"UserID 1 has already rated 53 movies.\n",
"Recommending the highest 10 predicted ratings movies not already rated.\n"
]
}
],
"source": [
"already_rated, predictions = recommend_movies(preds_df, 0, movies_df, ratings_df, 10)"
]
},
{
"cell_type": "code",
"execution_count": 238,
"metadata": {
"ExecuteTime": {
"end_time": "2020-06-21T07:14:56.866443Z",
"start_time": "2020-06-21T07:14:56.857045Z"
},
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"data": {
"text/plain": [
" UserID MovieID Rating Timestamp Title \\\n",
"0 1 1193 5 978300760 One Flew Over the Cuckoo's Nest (1975) \n",
"46 1 1029 5 978302205 Dumbo (1941) \n",
"40 1 1 5 978824268 Toy Story (1995) \n",
"\n",
" Genres \n",
"0 Drama \n",
"46 Animation|Children's|Musical \n",
"40 Animation|Children's|Comedy "
]
},
"execution_count": 238,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"already_rated[:3]\n"
]
},
{
"cell_type": "code",
"execution_count": 237,
"metadata": {
"ExecuteTime": {
"end_time": "2020-06-21T07:14:45.543230Z",
"start_time": "2020-06-21T07:14:45.535808Z"
},
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"data": {
"text/plain": [
" MovieID Title \\\n",
"311 318 Shawshank Redemption, The (1994) \n",
"32 34 Babe (1995) \n",
"356 364 Lion King, The (1994) \n",
"1975 2081 Little Mermaid, The (1989) \n",
"1235 1282 Fantasia (1940) \n",
"1974 2080 Lady and the Tramp (1955) \n",
"1972 2078 Jungle Book, The (1967) \n",
"1990 2096 Sleeping Beauty (1959) \n",
"1981 2087 Peter Pan (1953) \n",
"348 356 Forrest Gump (1994) \n",
"\n",
" Genres \n",
"311 Drama \n",
"32 Children's|Comedy|Drama \n",
"356 Animation|Children's|Musical \n",
"1975 Animation|Children's|Comedy|Musical|Romance \n",
"1235 Animation|Children's|Musical \n",
"1974 Animation|Children's|Comedy|Musical|Romance \n",
"1972 Animation|Children's|Comedy|Musical \n",
"1990 Animation|Children's|Musical \n",
"1981 Animation|Children's|Fantasy|Musical \n",
"348 Comedy|Romance|War "
]
},
"execution_count": 237,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"predictions"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
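},
"source": [
"Comparing three matrix factorization methods\n",
"- Eigenvalue decomposition\n",
"    - Only applies to square matrices\n",
"- Singular value decomposition (SVD)\n",
"    - Requires filling in the missing entries of the sparse matrix\n",
"    - High computational cost: $O(mn^2)$\n",
"- Gradient descent\n",
"    - Widely used!"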
]
},
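{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"A minimal sketch of the gradient-descent alternative, assuming a tiny made-up rating matrix in which 0 marks a missing rating, plain full-batch updates, and arbitrary illustrative values for the rank `k`, learning rate `lr` and regularisation weight `reg`. Unlike the SVD route, only the observed ratings contribute to the error, so nothing has to be filled in.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"# Hypothetical 4 users x 5 movies rating matrix; 0 marks a missing rating\n",
"R = np.array([[5, 3, 0, 1, 0],\n",
"              [4, 0, 0, 1, 2],\n",
"              [1, 1, 0, 5, 4],\n",
"              [0, 1, 5, 4, 0]], dtype=float)\n",
"mask = R > 0\n",
"\n",
"rng = np.random.default_rng(0)\n",
"k, lr, reg = 2, 0.005, 0.02                       # illustrative hyperparameters\n",
"P = rng.normal(scale=0.1, size=(R.shape[0], k))   # user factors\n",
"Q = rng.normal(scale=0.1, size=(R.shape[1], k))   # movie factors\n",
"\n",
"def observed_rmse(P, Q):\n",
"    # RMSE computed on the observed ratings only\n",
"    err = (R - P @ Q.T)[mask]\n",
"    return np.sqrt(np.mean(err ** 2))\n",
"\n",
"rmse_before = observed_rmse(P, Q)\n",
"for _ in range(5000):\n",
"    E = mask * (R - P @ Q.T)              # errors, zeroed where no rating exists\n",
"    P_grad = -2 * E @ Q + 2 * reg * P     # gradient of the regularised SSE w.r.t. P\n",
"    Q_grad = -2 * E.T @ P + 2 * reg * Q   # gradient of the regularised SSE w.r.t. Q\n",
"    P -= lr * P_grad\n",
"    Q -= lr * Q_grad\n",
"rmse_after = observed_rmse(P, Q)\n",
"print(rmse_before, rmse_after)\n",
"```\n",
"\n",
"The RMSE on the observed entries should end up well below its starting value, and `P @ Q.T` then also supplies predictions for the entries that were missing."
]
},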
{
"cell_type": "markdown",
"metadata": {
"ExecuteTime": {
"end_time": "2020-06-13T11:10:28.931508Z",
"start_time": "2020-06-13T11:10:28.922374Z"
},
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"![image.png](images/latent9.png)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"## Including bias\n",
"\n",
"![image.png](images/latent10.png)\n",
"\n",
"\\begin{equation}\n",
" \\hat{r}_{xi}= u + b_x + b_i + q_i p_x^{T}\n",
"\\end{equation}\n",
"\n",
"- $u$ is the global bias, measured by the overall mean rating\n",
"- $b_x$ is the bias for user x, measured by how far the mean rating given by user x deviates from the overall mean\n",
"- $b_i$ is the bias for movie i, measured by how far the mean rating of movie i deviates from the overall mean\n",
"- $q_i p_{x}^{T}$ is the user-movie interaction"
]
},
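{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"To make the bias terms concrete, the sketch below computes the baseline part of the prediction, $u + b_x + b_i$, on a made-up 3 x 3 rating matrix. It assumes that each bias is taken as the deviation of a user's or a movie's mean rating from the global mean, and it leaves out the interaction term, which would come from the learned factors. In practice the biases are often learned jointly with the factors instead of being computed from means.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"# Hypothetical ratings (users x movies); np.nan marks a missing rating\n",
"R = np.array([[5.0, 4.0, np.nan],\n",
"              [3.0, np.nan, 2.0],\n",
"              [np.nan, 5.0, 3.0]])\n",
"\n",
"mu = np.nanmean(R)                    # global mean (u in the formula)\n",
"b_x = np.nanmean(R, axis=1) - mu      # user deviations from the global mean\n",
"b_i = np.nanmean(R, axis=0) - mu      # movie deviations from the global mean\n",
"\n",
"# Baseline prediction for every (user, movie) pair, without the interaction term\n",
"baseline = mu + b_x[:, None] + b_i[None, :]\n",
"print(baseline[0, 2])\n",
"```\n",
"\n",
"For the first user and the third movie this gives 22/6 + (4.5 - 22/6) + (2.5 - 22/6), which is about 3.33: a generous rater meets a movie that is rated below average."
]
},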
{
"cell_type": "markdown",
"metadata": {
"ExecuteTime": {
"end_time": "2020-06-13T11:10:28.931508Z",
"start_time": "2020-06-13T11:10:28.922374Z"
},
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"![image.png](images/latent11.png)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"![image.png](images/latent12.png)"
]
},
{
"cell_type": "markdown",
"metadata": {
"ExecuteTime": {
"end_time": "2020-06-13T11:10:28.931508Z",
"start_time": "2020-06-13T11:10:28.922374Z"
},
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"![image.png](images/latent13.png)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
""
]
},
{
"cell_type": "markdown",
"metadata": {
"ExecuteTime": {
"end_time": "2020-06-13T11:10:28.931508Z",
"start_time": "2020-06-13T11:10:28.922374Z"
},
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"## Further reading:\n",
"Y. Koren, Collaborative filtering with temporal dynamics, KDD ’09\n",
"- http://www2.research.att.com/~volinsky/netflix/bpc.html\n",
"- http://www.the-ensemble.com/\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"![](images/recsys14.png)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"![image.png](images/end.png)"
]
}
],
"metadata": {
"celltoolbar": "Slideshow",
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.6"
},
"toc": {
"base_numbering": 1,
"nav_menu": {},
"number_sections": false,
"sideBar": true,
"skip_h1_title": false,
"title_cell": "Table of Contents",
"title_sidebar": "Contents",
"toc_cell": false,
"toc_position": {},
"toc_section_display": true,
"toc_window_display": false
}
},
"nbformat": 4,
"nbformat_minor": 4
}