{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "\n", "# Latent Factor Recommender System\n", "\n", "\n", "![image.png](images/author.png)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "The Netflix Challenge\n", "\n", "**Training data** 100 million ratings, 480,000 users, 17,770 movies. 6 years of data: 2000-2005\n", "\n", "**Test data** Last few ratings of each user (2.8 million)\n", "\n", "**Competition** 2,700+ teams, $1 million prize for a 10% improvement over Netflix's own system" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "- Evaluation criterion: Root Mean Square Error (RMSE) \n", "\n", "\\begin{equation}\n", " RMSE = \\sqrt{\\frac{1}{|R|} \\sum_{(i, x)\\in R}(\\hat{r}_{xi} - r_{xi})^2}\n", "\\end{equation}\n", "- $R$ is the set of observed ratings; $r_{xi}$ is the rating of user $x$ for movie $i$ and $\\hat{r}_{xi}$ is its prediction\n", "- Netflix’s system RMSE: 0.9514" ] }, { "cell_type": "markdown", "metadata": { "ExecuteTime": { "end_time": "2020-06-13T11:10:28.931508Z", "start_time": "2020-06-13T11:10:28.922374Z" }, "slideshow": { "slide_type": "subslide" } }, "source": [ "![image.png](images/latent.png)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "![image.png](images/latent2.png)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "![image.png](images/latent3.png)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "![image.png](images/latent4.png)" ] }, { "cell_type": "markdown", "metadata": { "ExecuteTime": { "end_time": "2020-06-13T11:10:28.931508Z", "start_time": "2020-06-13T11:10:28.922374Z" }, "slideshow": { "slide_type": "subslide" } }, "source": [ "![image.png](images/latent5.png)\n", "\n", "U (m x m) , $\\Sigma$(m x n), $V^T$ (n x n)" ] }, { "cell_type": "code", "execution_count": 74, "metadata": { "ExecuteTime": { "end_time": 
"2020-06-21T05:20:41.371788Z", "start_time": "2020-06-21T05:20:41.366179Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[1 2 3]\n", " [4 5 6]\n", " [7 8 9]]\n", "[[-8.08154958e+00 -9.64331175e+00 -1.12050739e+01]\n", " [-8.29792976e-01 -8.08611173e-02 6.68070742e-01]\n", " [-1.36140716e-16 2.72281431e-16 -1.36140716e-16]]\n" ] } ], "source": [ "import numpy as np\n", "\n", "A = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])\n", "print(A)\n", "# Singular-value decomposition; s is a 1-D array of singular values\n", "U, s, VT = np.linalg.svd(A)\n", "# create n x n Sigma matrix (A is square here)\n", "Sigma = np.diag(s)\n", "# P^T = Sigma V^T, so that A = U P^T\n", "PT = Sigma.dot(VT)\n", "# B = U.dot(Sigma.dot(VT))  # would reconstruct A\n", "print(PT)" ] }, { "cell_type": "markdown", "metadata": { "ExecuteTime": { "end_time": "2020-06-21T05:24:08.464482Z", "start_time": "2020-06-21T05:24:08.462334Z" }, "slideshow": { "slide_type": "subslide" } }, "source": [ "$\\Sigma$ should have the same shape as A, but `linalg.svd()` returns only a 1-D array holding the min(m, n) singular values, without the zero padding. It must therefore be converted into an m x n matrix before it can be used in the product.\n", "\n", "![image.png](images/latent6.png)" ] }, { "cell_type": "code", "execution_count": 91, "metadata": { "ExecuteTime": { "end_time": "2020-06-21T05:38:31.694326Z", "start_time": "2020-06-21T05:38:31.685324Z" }, "code_folding": [], "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "A = \n", " [[1 2]\n", " [3 4]\n", " [5 6]] \n", "\n", "U = \n", " [[-0.2298477 0.88346102 0.40824829]\n", " [-0.52474482 0.24078249 -0.81649658]\n", " [-0.81964194 -0.40189603 0.40824829]] \n", "\n", "Sigma = \n", " [[9.52551809 0. ]\n", " [0. 0.51430058]\n", " [0. 0. ]] \n", "\n", "VT = \n", " [[-0.61962948 -0.78489445]\n", " [-0.78489445 0.61962948]] \n", "\n", "PT = \n", " [[-5.90229186 -7.47652631]\n", " [-0.40367167 0.3186758 ]\n", " [ 0. 0. ]] \n", "\n", "B = \n", " [[1. 2.]\n", " [3. 4.]\n", " [5. 
6.]] \n", "\n" ] } ], "source": [ "# Singular-value decomposition \n", "A = np.array([[1, 2], [3, 4], [5, 6]])\n", "U, s, VT = np.linalg.svd(A)\n", "# create an m x n Sigma matrix of zeros (same shape as A)\n", "Sigma = np.zeros((A.shape[0], A.shape[1]))\n", "# fill the top n x n block with the singular values on the diagonal\n", "Sigma[:A.shape[1], :A.shape[1]] = np.diag(s)\n", "# P^T = Sigma V^T; reconstruct A as U P^T\n", "PT = Sigma.dot(VT)\n", "B = U.dot(PT)\n", "\n", "print('A = \\n', A, '\\n')\n", "print('U = \\n', U, '\\n')\n", "print('Sigma = \\n', Sigma, '\\n')\n", "print('VT = \\n', VT, '\\n')\n", "print('PT = \\n', PT, '\\n')\n", "print('B = \\n', B, '\\n') " ] }, { "cell_type": "code", "execution_count": 107, "metadata": { "ExecuteTime": { "end_time": "2020-06-21T06:12:38.801639Z", "start_time": "2020-06-21T06:12:38.790584Z" }, "code_folding": [ 0 ], "scrolled": true, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "A = \n", " [[1 2 3]\n", " [4 5 6]] \n", "\n", "U = \n", " [[-0.3863177 -0.92236578]\n", " [-0.92236578 0.3863177 ]] \n", "\n", "Sigma = \n", " [[9.508032 0. 0. ]\n", " [0. 0.77286964 0. ]\n", " [0. 0. 0. ]] \n", "\n", "VT = \n", " [[-0.42866713 -0.56630692 -0.7039467 ]\n", " [ 0.80596391 0.11238241 -0.58119908]\n", " [ 0.40824829 -0.81649658 0.40824829]] \n", "\n", "PT = \n", " [[-4.07578082 -5.38446431 -6.69314779]\n", " [ 0.62290503 0.08685696 -0.44919112]] \n", "\n", "B = \n", " [[1. 2. 3.]\n", " [4. 5. 
6.]] \n", "\n" ] } ], "source": [ "# Singular-value decomposition\n", "A = np.array([[1, 2, 3], \n", " [4, 5, 6]])\n", "U,S,VT = np.linalg.svd(A)\n", "# create n x n Sigma matrix\n", "Sigma = np.zeros((A.shape[1], A.shape[1]))\n", "# populate Sigma with n x n diagonal matrix\n", "if A.shape[1] > S.shape[0]:\n", " S = np.append(S, 0)\n", "Sigma[:A.shape[1], :A.shape[1]] = np.diag(S)\n", "\n", "PT= Sigma.dot(VT)\n", "PT = PT[0:A.shape[0]]\n", "B = U.dot(PT)\n", "print('A = \\n', A, '\\n')\n", "print('U = \\n', U, '\\n')\n", "print('Sigma = \\n', Sigma, '\\n')\n", "print('VT = \\n', VT, '\\n')\n", "print('PT = \\n', PT, '\\n')\n", "print('B = \\n', B, '\\n')" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "![image.png](images/latent7.png)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "![image.png](images/latent8.png)" ] }, { "cell_type": "markdown", "metadata": { "ExecuteTime": { "end_time": "2020-06-13T11:10:28.931508Z", "start_time": "2020-06-13T11:10:28.922374Z" }, "slideshow": { "slide_type": "subslide" } }, "source": [ "SVD gives minimum reconstruction error (Sum of Squared Errors, **SSE**)\n", "\n", "SSE and RMSE are monotonically related:$RMSE=\\frac{1}{c}\\sqrt{SSE}$\n", " \n", "Great news: SVD is minimizing RMSE" ] }, { "cell_type": "code", "execution_count": 226, "metadata": { "ExecuteTime": { "end_time": "2020-06-21T07:14:01.788580Z", "start_time": "2020-06-21T07:13:59.250457Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "# https://beckernick.github.io/matrix-factorization-recommender/\n", "import pandas as pd\n", "import numpy as np\n", "\n", "ratings_list = [i.strip().split(\"::\") for i in open('/Users/datalab/bigdata/cjc/ml-1m/ratings.dat', 'r').readlines()]\n", "users_list = [i.strip().split(\"::\") for i in open('/Users/datalab/bigdata/cjc/ml-1m/users.dat', 'r').readlines()]\n", "movies_list = 
[i.strip().split(\"::\") for i in open('/Users/datalab/bigdata/cjc/ml-1m/movies.dat', 'r', encoding = 'iso-8859-15').readlines()]\n", "\n", "ratings_df = pd.DataFrame(ratings_list, columns = ['UserID', 'MovieID', 'Rating', 'Timestamp'], dtype = int)\n", "movies_df = pd.DataFrame(movies_list, columns = ['MovieID', 'Title', 'Genres'])\n", "\n", "movies_df['MovieID'] = movies_df['MovieID'].astype('int64')\n", "ratings_df['UserID'] = ratings_df['UserID'].astype('int64')\n", "ratings_df['MovieID'] = ratings_df['MovieID'].astype('int64')" ] }, { "cell_type": "code", "execution_count": 219, "metadata": { "ExecuteTime": { "end_time": "2020-06-21T07:12:09.215096Z", "start_time": "2020-06-21T07:12:09.207845Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
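<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th></th><th>MovieID</th><th>Title</th><th>Genres</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>1</td><td>Toy Story (1995)</td><td>Animation|Children's|Comedy</td></tr>\n",
"    <tr><th>1</th><td>2</td><td>Jumanji (1995)</td><td>Adventure|Children's|Fantasy</td></tr>\n",
"    <tr><th>2</th><td>3</td><td>Grumpier Old Men (1995)</td><td>Comedy|Romance</td></tr>\n",
"    <tr><th>3</th><td>4</td><td>Waiting to Exhale (1995)</td><td>Comedy|Drama</td></tr>\n",
"    <tr><th>4</th><td>5</td><td>Father of the Bride Part II (1995)</td><td>Comedy</td></tr>\n",
"  </tbody>\n",
"</table>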
\n", "
" ], "text/plain": [ " MovieID Title Genres\n", "0 1 Toy Story (1995) Animation|Children's|Comedy\n", "1 2 Jumanji (1995) Adventure|Children's|Fantasy\n", "2 3 Grumpier Old Men (1995) Comedy|Romance\n", "3 4 Waiting to Exhale (1995) Comedy|Drama\n", "4 5 Father of the Bride Part II (1995) Comedy" ] }, "execution_count": 219, "metadata": {}, "output_type": "execute_result" } ], "source": [ "movies_df.head()\n" ] }, { "cell_type": "code", "execution_count": 241, "metadata": { "ExecuteTime": { "end_time": "2020-06-21T07:16:26.039680Z", "start_time": "2020-06-21T07:16:26.031571Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
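<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th></th><th>UserID</th><th>MovieID</th><th>Rating</th><th>Timestamp</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>1</td><td>1193</td><td>5</td><td>978300760</td></tr>\n",
"    <tr><th>1</th><td>1</td><td>661</td><td>3</td><td>978302109</td></tr>\n",
"    <tr><th>2</th><td>1</td><td>914</td><td>3</td><td>978301968</td></tr>\n",
"    <tr><th>3</th><td>1</td><td>3408</td><td>4</td><td>978300275</td></tr>\n",
"    <tr><th>4</th><td>1</td><td>2355</td><td>5</td><td>978824291</td></tr>\n",
"  </tbody>\n",
"</table>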
\n", "
" ], "text/plain": [ " UserID MovieID Rating Timestamp\n", "0 1 1193 5 978300760\n", "1 1 661 3 978302109\n", "2 1 914 3 978301968\n", "3 1 3408 4 978300275\n", "4 1 2355 5 978824291" ] }, "execution_count": 241, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ratings_df.head()" ] }, { "cell_type": "code", "execution_count": 227, "metadata": { "ExecuteTime": { "end_time": "2020-06-21T07:14:09.715349Z", "start_time": "2020-06-21T07:14:05.949062Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
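<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th>MovieID</th><th>1</th><th>2</th><th>3</th><th>4</th><th>5</th><th>6</th><th>7</th><th>8</th><th>9</th><th>10</th><th>...</th><th>3943</th><th>3944</th><th>3945</th><th>3946</th><th>3947</th><th>3948</th><th>3949</th><th>3950</th><th>3951</th><th>3952</th></tr>\n",
"    <tr><th>UserID</th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>1</th><td>5</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>...</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td></tr>\n",
"    <tr><th>2</th><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>...</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td></tr>\n",
"    <tr><th>3</th><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>...</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td></tr>\n",
"    <tr><th>4</th><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>...</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td></tr>\n",
"    <tr><th>5</th><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>2</td><td>0</td><td>0</td><td>0</td><td>0</td><td>...</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"<p>5 rows × 3706 columns</p>\n",
"</div>" ], "text/plain": [ "MovieID 1 2 3 4 5 6 7 8 9 10 ... 3943 3944 3945 \\\n", "UserID ... \n", "1 5 0 0 0 0 0 0 0 0 0 ... 0 0 0 \n", "2 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 \n", "3 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 \n", "4 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 \n", "5 0 0 0 0 0 2 0 0 0 0 ... 0 0 0 \n", "\n", "MovieID 3946 3947 3948 3949 3950 3951 3952 \n", "UserID \n", "1 0 0 0 0 0 0 0 \n", "2 0 0 0 0 0 0 0 \n", "3 0 0 0 0 0 0 0 \n", "4 0 0 0 0 0 0 0 \n", "5 0 0 0 0 0 0 0 \n", "\n", "[5 rows x 3706 columns]" ] }, "execution_count": 227, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Note: missing ratings are filled with 0, so unrated movies are treated as a rating of 0\n", "R_df = ratings_df.pivot(index = 'UserID', columns ='MovieID', values = 'Rating').fillna(0)\n", "R_df.head()" ] }, { "cell_type": "code", "execution_count": 228, "metadata": { "ExecuteTime": { "end_time": "2020-06-21T07:14:25.908888Z", "start_time": "2020-06-21T07:14:24.982459Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "R = R_df.to_numpy(dtype=np.int16)\n", "# per-user mean over all movies (the zero-filled entries are included)\n", "user_ratings_mean = np.mean(R, axis = 1)\n", "# subtract each user's mean from that user's row\n", "R_demeaned = R - user_ratings_mean.reshape(-1, 1)" ] }, { "cell_type": "code", "execution_count": 229, "metadata": { "ExecuteTime": { "end_time": "2020-06-21T07:14:28.928544Z", "start_time": "2020-06-21T07:14:26.842984Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "from scipy.sparse.linalg import svds\n", "# truncated SVD: keep only the 50 largest singular values and vectors\n", "U, sigma, Vt = svds(R_demeaned, k = 50)" ] }, { "cell_type": "code", "execution_count": 230, "metadata": { "ExecuteTime": { "end_time": "2020-06-21T07:14:30.367595Z", "start_time": "2020-06-21T07:14:30.142945Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "sigma = np.diag(sigma)\n", "\n", "# rank-50 reconstruction, with the user means added back\n", "all_user_predicted_ratings = U.dot( sigma.dot(Vt)) + user_ratings_mean.reshape(-1, 1)\n", "preds_df = pd.DataFrame(all_user_predicted_ratings, columns = R_df.columns)\n" ] }, { "cell_type": "code", "execution_count": 239, "metadata": { "ExecuteTime": { "end_time": 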
"2020-06-21T07:15:53.058613Z", "start_time": "2020-06-21T07:15:53.031321Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " 
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
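<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th>MovieID</th><th>1</th><th>2</th><th>3</th><th>4</th><th>5</th><th>6</th><th>7</th><th>8</th><th>9</th><th>10</th><th>...</th><th>3943</th><th>3944</th><th>3945</th><th>3946</th><th>3947</th><th>3948</th><th>3949</th><th>3950</th><th>3951</th><th>3952</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>4.288861</td><td>0.143055</td><td>-0.195080</td><td>-0.018843</td><td>0.012232</td><td>-0.176604</td><td>-0.074120</td><td>0.141358</td><td>-0.059553</td><td>-0.195950</td><td>...</td><td>0.027807</td><td>0.001640</td><td>0.026395</td><td>-0.022024</td><td>-0.085415</td><td>0.403529</td><td>0.105579</td><td>0.031912</td><td>0.050450</td><td>0.088910</td></tr>\n",
"    <tr><th>1</th><td>0.744716</td><td>0.169659</td><td>0.335418</td><td>0.000758</td><td>0.022475</td><td>1.353050</td><td>0.051426</td><td>0.071258</td><td>0.161601</td><td>1.567246</td><td>...</td><td>-0.056502</td><td>-0.013733</td><td>-0.010580</td><td>0.062576</td><td>-0.016248</td><td>0.155790</td><td>-0.418737</td><td>-0.101102</td><td>-0.054098</td><td>-0.140188</td></tr>\n",
"    <tr><th>2</th><td>1.818824</td><td>0.456136</td><td>0.090978</td><td>-0.043037</td><td>-0.025694</td><td>-0.158617</td><td>-0.131778</td><td>0.098977</td><td>0.030551</td><td>0.735470</td><td>...</td><td>0.040481</td><td>-0.005301</td><td>0.012832</td><td>0.029349</td><td>0.020866</td><td>0.121532</td><td>0.076205</td><td>0.012345</td><td>0.015148</td><td>-0.109956</td></tr>\n",
"    <tr><th>3</th><td>0.408057</td><td>-0.072960</td><td>0.039642</td><td>0.089363</td><td>0.041950</td><td>0.237753</td><td>-0.049426</td><td>0.009467</td><td>0.045469</td><td>-0.111370</td><td>...</td><td>0.008571</td><td>-0.005425</td><td>-0.008500</td><td>-0.003417</td><td>-0.083982</td><td>0.094512</td><td>0.057557</td><td>-0.026050</td><td>0.014841</td><td>-0.034224</td></tr>\n",
"    <tr><th>4</th><td>1.574272</td><td>0.021239</td><td>-0.051300</td><td>0.246884</td><td>-0.032406</td><td>1.552281</td><td>-0.199630</td><td>-0.014920</td><td>-0.060498</td><td>0.450512</td><td>...</td><td>0.110151</td><td>0.046010</td><td>0.006934</td><td>-0.015940</td><td>-0.050080</td><td>-0.052539</td><td>0.507189</td><td>0.033830</td><td>0.125706</td><td>0.199244</td></tr>\n",
"    <tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>\n",
"    <tr><th>6035</th><td>2.392388</td><td>0.233964</td><td>0.413676</td><td>0.443726</td><td>-0.083641</td><td>2.192294</td><td>1.168936</td><td>0.145237</td><td>-0.046551</td><td>0.560895</td><td>...</td><td>0.188493</td><td>-0.004439</td><td>-0.042271</td><td>-0.090101</td><td>0.276312</td><td>0.133806</td><td>0.732374</td><td>0.271234</td><td>0.244983</td><td>0.734771</td></tr>\n",
"    <tr><th>6036</th><td>2.070760</td><td>0.139294</td><td>-0.012666</td><td>-0.176990</td><td>0.261243</td><td>1.074234</td><td>0.083999</td><td>0.013814</td><td>-0.030179</td><td>-0.084956</td><td>...</td><td>-0.161548</td><td>0.001184</td><td>-0.029223</td><td>-0.047087</td><td>0.099036</td><td>-0.192653</td><td>-0.091265</td><td>0.050798</td><td>-0.113427</td><td>0.033283</td></tr>\n",
"    <tr><th>6037</th><td>0.619089</td><td>-0.161769</td><td>0.106738</td><td>0.007048</td><td>-0.074701</td><td>-0.079953</td><td>0.100220</td><td>-0.034013</td><td>0.007671</td><td>0.001280</td><td>...</td><td>-0.053546</td><td>0.005835</td><td>0.007551</td><td>-0.024082</td><td>-0.010739</td><td>-0.008863</td><td>-0.099774</td><td>-0.013369</td><td>-0.030354</td><td>-0.114936</td></tr>\n",
"    <tr><th>6038</th><td>1.503605</td><td>-0.036208</td><td>-0.161268</td><td>-0.083401</td><td>-0.081617</td><td>-0.143517</td><td>0.106668</td><td>-0.054404</td><td>-0.008826</td><td>0.205801</td><td>...</td><td>-0.006104</td><td>0.008933</td><td>0.007595</td><td>-0.037800</td><td>0.050743</td><td>0.024052</td><td>-0.172466</td><td>-0.010904</td><td>-0.038647</td><td>-0.168359</td></tr>\n",
"    <tr><th>6039</th><td>1.996248</td><td>-0.185987</td><td>-0.156478</td><td>0.104143</td><td>-0.030001</td><td>0.105521</td><td>-0.168477</td><td>-0.058174</td><td>0.122714</td><td>-0.119716</td><td>...</td><td>0.238088</td><td>-0.047046</td><td>-0.043259</td><td>0.038256</td><td>0.055693</td><td>0.149593</td><td>0.587989</td><td>-0.006641</td><td>0.127067</td><td>0.285001</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"<p>6040 rows × 3706 columns</p>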
\n", "
" ], "text/plain": [ "MovieID 1 2 3 4 5 6 7 \\\n", "0 4.288861 0.143055 -0.195080 -0.018843 0.012232 -0.176604 -0.074120 \n", "1 0.744716 0.169659 0.335418 0.000758 0.022475 1.353050 0.051426 \n", "2 1.818824 0.456136 0.090978 -0.043037 -0.025694 -0.158617 -0.131778 \n", "3 0.408057 -0.072960 0.039642 0.089363 0.041950 0.237753 -0.049426 \n", "4 1.574272 0.021239 -0.051300 0.246884 -0.032406 1.552281 -0.199630 \n", "... ... ... ... ... ... ... ... \n", "6035 2.392388 0.233964 0.413676 0.443726 -0.083641 2.192294 1.168936 \n", "6036 2.070760 0.139294 -0.012666 -0.176990 0.261243 1.074234 0.083999 \n", "6037 0.619089 -0.161769 0.106738 0.007048 -0.074701 -0.079953 0.100220 \n", "6038 1.503605 -0.036208 -0.161268 -0.083401 -0.081617 -0.143517 0.106668 \n", "6039 1.996248 -0.185987 -0.156478 0.104143 -0.030001 0.105521 -0.168477 \n", "\n", "MovieID 8 9 10 ... 3943 3944 3945 \\\n", "0 0.141358 -0.059553 -0.195950 ... 0.027807 0.001640 0.026395 \n", "1 0.071258 0.161601 1.567246 ... -0.056502 -0.013733 -0.010580 \n", "2 0.098977 0.030551 0.735470 ... 0.040481 -0.005301 0.012832 \n", "3 0.009467 0.045469 -0.111370 ... 0.008571 -0.005425 -0.008500 \n", "4 -0.014920 -0.060498 0.450512 ... 0.110151 0.046010 0.006934 \n", "... ... ... ... ... ... ... ... \n", "6035 0.145237 -0.046551 0.560895 ... 0.188493 -0.004439 -0.042271 \n", "6036 0.013814 -0.030179 -0.084956 ... -0.161548 0.001184 -0.029223 \n", "6037 -0.034013 0.007671 0.001280 ... -0.053546 0.005835 0.007551 \n", "6038 -0.054404 -0.008826 0.205801 ... -0.006104 0.008933 0.007595 \n", "6039 -0.058174 0.122714 -0.119716 ... 
0.238088 -0.047046 -0.043259 \n", "\n", "MovieID 3946 3947 3948 3949 3950 3951 3952 \n", "0 -0.022024 -0.085415 0.403529 0.105579 0.031912 0.050450 0.088910 \n", "1 0.062576 -0.016248 0.155790 -0.418737 -0.101102 -0.054098 -0.140188 \n", "2 0.029349 0.020866 0.121532 0.076205 0.012345 0.015148 -0.109956 \n", "3 -0.003417 -0.083982 0.094512 0.057557 -0.026050 0.014841 -0.034224 \n", "4 -0.015940 -0.050080 -0.052539 0.507189 0.033830 0.125706 0.199244 \n", "... ... ... ... ... ... ... ... \n", "6035 -0.090101 0.276312 0.133806 0.732374 0.271234 0.244983 0.734771 \n", "6036 -0.047087 0.099036 -0.192653 -0.091265 0.050798 -0.113427 0.033283 \n", "6037 -0.024082 -0.010739 -0.008863 -0.099774 -0.013369 -0.030354 -0.114936 \n", "6038 -0.037800 0.050743 0.024052 -0.172466 -0.010904 -0.038647 -0.168359 \n", "6039 0.038256 0.055693 0.149593 0.587989 -0.006641 0.127067 0.285001 \n", "\n", "[6040 rows x 3706 columns]" ] }, "execution_count": 239, "metadata": {}, "output_type": "execute_result" } ], "source": [ "preds_df\n", "# each row is a user\n", "# each column is a movie" ] }, { "cell_type": "code", "execution_count": 246, "metadata": { "ExecuteTime": { "end_time": "2020-06-21T07:18:51.751294Z", "start_time": "2020-06-21T07:18:51.744349Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "def recommend_movies(preds_df, user_row_number, movies_df, ratings_df, num_recommendations=5):\n", "    # Get and sort the user's predictions\n", "    sorted_user_predictions = preds_df.iloc[user_row_number].sort_values(ascending=False)\n", "    \n", "    # Get the user's data and merge in the movie information.\n", "    userID = user_row_number + 1\n", "    user_data = ratings_df[ratings_df.UserID == userID]\n", "    user_full = (user_data.merge(movies_df, how = 'left', left_on = 'MovieID', right_on = 'MovieID').\n", "                 sort_values(['Rating'], ascending=False)\n", "                )\n", "\n", "    print('UserID {0} has already rated {1} movies.'.format(userID, user_full.shape[0]))\n",
"    print('Recommending the highest {0} predicted ratings movies not already rated.'.format(num_recommendations))\n", "    \n", "    # Recommend the highest predicted rating movies that the user hasn't seen yet.\n", "    potential_movie_df= movies_df[~movies_df['MovieID'].isin(user_full['MovieID'])]\n", "    predicted_movie_df = pd.DataFrame(sorted_user_predictions).reset_index()\n", "    predicted_movie_df['MovieID'] = predicted_movie_df['MovieID'].astype('int64')\n", "    recommendations = (\n", "        potential_movie_df.merge(predicted_movie_df, how = 'left', on = 'MovieID').\n", "        rename(columns = {user_row_number: 'Predictions'}).\n", "        sort_values('Predictions', ascending = False).\n", "        iloc[:num_recommendations, :-1]\n", "    )\n", "\n", "    return user_full, recommendations " ] }, { "cell_type": "code", "execution_count": 247, "metadata": { "ExecuteTime": { "end_time": "2020-06-21T07:18:53.887987Z", "start_time": "2020-06-21T07:18:53.871109Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "UserID 1 has already rated 53 movies.\n", "Recommending the highest 10 predicted ratings movies not already rated.\n" ] } ], "source": [ "already_rated, predictions = recommend_movies(preds_df, 0, movies_df, ratings_df, 10)" ] }, { "cell_type": "code", "execution_count": 238, "metadata": { "ExecuteTime": { "end_time": "2020-06-21T07:14:56.866443Z", "start_time": "2020-06-21T07:14:56.857045Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
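<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th></th><th>UserID</th><th>MovieID</th><th>Rating</th><th>Timestamp</th><th>Title</th><th>Genres</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>1</td><td>1193</td><td>5</td><td>978300760</td><td>One Flew Over the Cuckoo's Nest (1975)</td><td>Drama</td></tr>\n",
"    <tr><th>46</th><td>1</td><td>1029</td><td>5</td><td>978302205</td><td>Dumbo (1941)</td><td>Animation|Children's|Musical</td></tr>\n",
"    <tr><th>40</th><td>1</td><td>1</td><td>5</td><td>978824268</td><td>Toy Story (1995)</td><td>Animation|Children's|Comedy</td></tr>\n",
"  </tbody>\n",
"</table>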
\n", "
" ], "text/plain": [ " UserID MovieID Rating Timestamp Title \\\n", "0 1 1193 5 978300760 One Flew Over the Cuckoo's Nest (1975) \n", "46 1 1029 5 978302205 Dumbo (1941) \n", "40 1 1 5 978824268 Toy Story (1995) \n", "\n", " Genres \n", "0 Drama \n", "46 Animation|Children's|Musical \n", "40 Animation|Children's|Comedy " ] }, "execution_count": 238, "metadata": {}, "output_type": "execute_result" } ], "source": [ "already_rated[:3]\n" ] }, { "cell_type": "code", "execution_count": 237, "metadata": { "ExecuteTime": { "end_time": "2020-06-21T07:14:45.543230Z", "start_time": "2020-06-21T07:14:45.535808Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th></th><th>MovieID</th><th>Title</th><th>Genres</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>311</th><td>318</td><td>Shawshank Redemption, The (1994)</td><td>Drama</td></tr>\n",
"    <tr><th>32</th><td>34</td><td>Babe (1995)</td><td>Children's|Comedy|Drama</td></tr>\n",
"    <tr><th>356</th><td>364</td><td>Lion King, The (1994)</td><td>Animation|Children's|Musical</td></tr>\n",
"    <tr><th>1975</th><td>2081</td><td>Little Mermaid, The (1989)</td><td>Animation|Children's|Comedy|Musical|Romance</td></tr>\n",
"    <tr><th>1235</th><td>1282</td><td>Fantasia (1940)</td><td>Animation|Children's|Musical</td></tr>\n",
"    <tr><th>1974</th><td>2080</td><td>Lady and the Tramp (1955)</td><td>Animation|Children's|Comedy|Musical|Romance</td></tr>\n",
"    <tr><th>1972</th><td>2078</td><td>Jungle Book, The (1967)</td><td>Animation|Children's|Comedy|Musical</td></tr>\n",
"    <tr><th>1990</th><td>2096</td><td>Sleeping Beauty (1959)</td><td>Animation|Children's|Musical</td></tr>\n",
"    <tr><th>1981</th><td>2087</td><td>Peter Pan (1953)</td><td>Animation|Children's|Fantasy|Musical</td></tr>\n",
"    <tr><th>348</th><td>356</td><td>Forrest Gump (1994)</td><td>Comedy|Romance|War</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>" ], "text/plain": [ " MovieID Title \\\n", "311 318 Shawshank Redemption, The (1994) \n", "32 34 Babe (1995) \n", "356 364 Lion King, The (1994) \n", "1975 2081 Little Mermaid, The (1989) \n", "1235 1282 Fantasia (1940) \n", "1974 2080 Lady and the Tramp (1955) \n", "1972 2078 Jungle Book, The (1967) \n", "1990 2096 Sleeping Beauty (1959) \n", "1981 2087 Peter Pan (1953) \n", "348 356 Forrest Gump (1994) \n", "\n", " Genres \n", "311 Drama \n", "32 Children's|Comedy|Drama \n", "356 Animation|Children's|Musical \n", "1975 Animation|Children's|Comedy|Musical|Romance \n", "1235 Animation|Children's|Musical \n", "1974 Animation|Children's|Comedy|Musical|Romance \n", "1972 Animation|Children's|Comedy|Musical \n", "1990 Animation|Children's|Musical \n", "1981 Animation|Children's|Fantasy|Musical \n", "348 Comedy|Romance|War " ] }, "execution_count": 237, "metadata": {}, "output_type": "execute_result" } ], "source": [ "predictions" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Comparing three approaches to matrix factorization\n", "- Eigenvalue decomposition\n", "    - Only applies to square matrices\n", "- Singular value decomposition\n", "    - Requires filling in the missing entries of the sparse matrix\n", "    - High computational cost: $O(mn^2)$\n", "- Gradient descent\n", "    - Widely used!"
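, "\n", "
The gradient-descent option can be sketched as follows. This is a minimal illustration under stated assumptions, not the method used elsewhere in this notebook: it fits user factors P and movie factors Q by stochastic gradient descent on the observed ratings only, so no missing entries have to be filled in. The names factorize_sgd and rmse, the toy ratings, the hyperparameters, and the L2 regularization term are all hypothetical choices made for this example.

```python
import numpy as np


def factorize_sgd(ratings, n_factors=2, lr=0.02, reg=0.02, n_epochs=300, seed=0):
    # ratings: list of (user_index, movie_index, rating) triples, observed entries only
    rng = np.random.default_rng(seed)
    n_users = max(u for u, _, _ in ratings) + 1
    n_movies = max(i for _, i, _ in ratings) + 1
    P = rng.normal(scale=0.1, size=(n_users, n_factors))   # user factors
    Q = rng.normal(scale=0.1, size=(n_movies, n_factors))  # movie factors
    for _ in range(n_epochs):
        # visit the observed ratings in a fresh random order each epoch
        for idx in rng.permutation(len(ratings)):
            u, i, r = ratings[idx]
            err = r - P[u] @ Q[i]  # error on this single observed rating
            p_old = P[u].copy()    # keep the old user vector for the movie update
            P[u] += lr * (err * Q[i] - reg * P[u])
            Q[i] += lr * (err * p_old - reg * Q[i])
    return P, Q


def rmse(ratings, P, Q):
    # root mean square error over the given (user, movie, rating) triples
    errors = [r - P[u] @ Q[i] for u, i, r in ratings]
    return float(np.sqrt(np.mean(np.square(errors))))


# hypothetical toy data: 3 users, 3 movies, 6 observed ratings
toy = [(0, 0, 5), (0, 1, 3), (1, 0, 4), (1, 2, 1), (2, 1, 2), (2, 2, 5)]
P0, Q0 = factorize_sgd(toy, n_epochs=0)  # random initialization only
P, Q = factorize_sgd(toy)                # same initialization, then trained
print(round(rmse(toy, P0, Q0), 3), round(rmse(toy, P, Q), 3))
```

Because only observed (user, movie, rating) triples enter the loss, unrated movies are never treated as zeros, unlike the zero-filled matrix passed to svds earlier in this notebook.
"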
] }, { "cell_type": "markdown", "metadata": { "ExecuteTime": { "end_time": "2020-06-13T11:10:28.931508Z", "start_time": "2020-06-13T11:10:28.922374Z" }, "slideshow": { "slide_type": "subslide" } }, "source": [ "![image.png](images/latent9.png)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Including bias\n", "\n", "![image.png](images/latent10.png)\n", "\n", "\\begin{equation}\n", " \\hat{r}_{xi}= u + b_x + b_i + q_i p_x^{T}\n", "\\end{equation}\n", "\n", "- $u$ is the global bias, i.e. the overall mean rating\n", "- $b_x$ is the bias of user $x$: how far the mean rating given by user $x$ deviates from $u$\n", "- $b_i$ is the bias of movie $i$: how far the mean rating of movie $i$ deviates from $u$\n", "- $q_i p_{x}^{T}$ is the user-movie interaction" ] }, { "cell_type": "markdown", "metadata": { "ExecuteTime": { "end_time": "2020-06-13T11:10:28.931508Z", "start_time": "2020-06-13T11:10:28.922374Z" }, "slideshow": { "slide_type": "subslide" } }, "source": [ "![image.png](images/latent11.png)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "![image.png](images/latent12.png)" ] }, { "cell_type": "markdown", "metadata": { "ExecuteTime": { "end_time": "2020-06-13T11:10:28.931508Z", "start_time": "2020-06-13T11:10:28.922374Z" }, "slideshow": { "slide_type": "subslide" } }, "source": [ "![image.png](images/latent13.png)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "" ] }, { "cell_type": "markdown", "metadata": { "ExecuteTime": { "end_time": "2020-06-13T11:10:28.931508Z", "start_time": "2020-06-13T11:10:28.922374Z" }, "slideshow": { "slide_type": "subslide" } }, "source": [ "## Further reading:\n", "Y. 
Koren, Collaborative filtering with temporal dynamics, KDD ’09\n", "- http://www2.research.att.com/~volinsky/netflix/bpc.html\n", "- http://www.the-ensemble.com/\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "![](images/recsys14.png)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "![image.png](images/end.png)" ] } ], "metadata": { "celltoolbar": "Slideshow", "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": false, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 4 }