{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# **Ch. 7 Recommendation Engines and Model Evaluation**\n", "- Download the materials for the book *머신러닝과 통계* (Statistics for Machine Learning): [**(download)**](http://acornpub.co.kr/book/statistics-machine-learning)\n", "- The analysis uses the MovieLens dataset" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## **1 Loading the Data**\n", "- Load the **movie CSV data**\n", "- Merge it into a single table, then convert it into a **user/movie pivot table**\n", "- Convert it to a **NumPy matrix** to simplify the computations" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " userId movieId rating timestamp\n", "0 1 31 2.5 1260759144\n", "1 1 1029 3.0 1260759179\n", " movieId title genres\n", "0 1 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy\n", "1 2 Jumanji (1995) Adventure|Children|Fantasy\n" ] } ], "source": [ "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "\n", "ratings = pd.read_csv(\"data/ml-latest-small/ratings.csv\")\n", "print(ratings.head(2))\n", "movies = pd.read_csv(\"data/ml-latest-small/movies.csv\")\n", "print(movies.head(2))" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " userId movieId rating title\n", "0 1 31 2.5 Dangerous Minds (1995)\n", "1 1 1029 3.0 Dumbo (1941)\n", "2 1 1061 3.0 Sleepers (1996)" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Merge movie titles and ratings into a single DataFrame\n", "ratings = pd.merge(ratings[['userId', 'movieId', 'rating']],\n", "                   movies[['movieId', 'title']],\n", "                   how='left', on='movieId')\n", "ratings.head(3)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "movieId 1 2 3 4 5 6 7 8 \\\n", "userId \n", "1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", "2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", "3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", "\n", "movieId 9 10 ... 161084 161155 161594 161830 161918 \\\n", "userId ... \n", "1 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 \n", "2 0.0 4.0 ... 0.0 0.0 0.0 0.0 0.0 \n", "3 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 \n", "\n", "movieId 161944 162376 162542 162672 163949 \n", "userId \n", "1 0.0 0.0 0.0 0.0 0.0 \n", "2 0.0 0.0 0.0 0.0 0.0 \n", "3 0.0 0.0 0.0 0.0 0.0 \n", "\n", "[3 rows x 9066 columns]" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Build a pivot table with users as rows and movie IDs as columns\n", "rp = ratings.pivot_table(columns=['movieId'],\n", "                         index=['userId'], values='rating')\n", "rp = rp.fillna(0)  # unrated movies become 0\n", "rp.head(3)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[0., 0., 0., ..., 0., 0., 0.],\n", " [0., 0., 0., ..., 0., 0., 0.],\n", " [0., 0., 0., ..., 0., 0., 0.],\n", " ...,\n", " [0., 0., 0., ..., 0., 0., 0.],\n", " [4., 0., 0., ..., 0., 0., 0.],\n", " [5., 0., 0., ..., 0., 0., 0.]])" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Convert to a NumPy array to simplify the computations\n", "rp_mat = rp.values  # replaces the deprecated as_matrix()\n", "rp_mat" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## **2 Content-Based Collaborative Filtering** (Cosine Similarity)\n", "### **01 User-Based Table** (Cosine Similarity Between Users)\n", "- Measures the similarity between rows of the **NumPy matrix**\n", "- The same approach applies to the **content-based filtering** method that follows\n", "- Computing the **cosine similarity** over the whole **pivot table** in a Python loop takes a long time\n", "- `linear_kernel` from **sklearn.metrics.pairwise** is faster, but it returns raw dot products, which equal cosine similarity only when the rows are L2-normalized" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Cosine similarity between vectors A and B: 0.822\n" ] } ], "source": [ "# The cosine of the angle between them is about 0.822.\n", "from 
scipy.spatial.distance import cosine\n", "a = np.asarray([2, 1, 0, 2, 0, 1, 1, 1])\n", "b = np.asarray([2, 1, 1, 1, 1, 0, 1, 1])\n", "# scipy's cosine() is a distance, so similarity = 1 - distance\n", "print(\"Cosine similarity between vectors A and B: {:.3f}\".format(1 - cosine(a, b)))" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "userId 1 2 3 4 5 6 7 \\\n", "userId \n", "1 0.000000 0.000000 0.000000 0.074482 0.016818 0.000000 0.083884 \n", "2 0.000000 0.000000 0.124295 0.118821 0.103646 0.000000 0.212985 \n", "3 0.000000 0.124295 0.000000 0.081640 0.151531 0.060691 0.154714 \n", "4 0.074482 0.118821 0.081640 0.000000 0.130649 0.079648 0.319745 \n", "5 0.016818 0.103646 0.151531 0.130649 0.000000 0.063796 0.095888 \n", "\n", "userId 8 9 10 ... 662 663 664 \\\n", "userId ... \n", "1 0.000000 0.012843 0.000000 ... 0.000000 0.000000 0.014474 \n", "2 0.113190 0.113333 0.043213 ... 0.477306 0.063202 0.077745 \n", "3 0.249781 0.134475 0.114672 ... 0.161205 0.064198 0.176134 \n", "4 0.191013 0.030417 0.137186 ... 0.114319 0.047228 0.136579 \n", "5 0.165712 0.086616 0.032370 ... 
0.191029 0.021142 0.146173 \n", "\n", "userId 665 666 667 668 669 670 671 \n", "userId \n", "1 0.043719 0.000000 0.000000 0.000000 0.062917 0.000000 0.017466 \n", "2 0.164162 0.466281 0.425462 0.084646 0.024140 0.170595 0.113175 \n", "3 0.158357 0.177098 0.124562 0.124911 0.080984 0.136606 0.170193 \n", "4 0.254030 0.121905 0.088735 0.068483 0.104309 0.054512 0.211609 \n", "5 0.224245 0.139721 0.058252 0.042926 0.038358 0.062642 0.225086 \n", "\n", "[5 rows x 671 columns]\n", "CPU times: user 1min 34s, sys: 16 ms, total: 1min 34s\n", "Wall time: 1min 34s\n" ] } ], "source": [ "%%time\n", "# User-by-user cosine similarity matrix\n", "# (the diagonal stays 0 so that a user is never matched with themselves)\n", "m, n = rp.shape\n", "mat_users = np.zeros((m, m))\n", "for i in range(m):\n", "    for j in range(m):\n", "        if i != j:\n", "            mat_users[i, j] = 1 - cosine(rp_mat[i, :], rp_mat[j, :])\n", "\n", "pd_users = pd.DataFrame(mat_users, index=rp.index, columns=rp.index)\n", "print(pd_users.head(5))" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Users similar to user: 17\n", " score\n", "userId \n", "596 0.379128\n", "23 0.374641\n", "355 0.329605\n", "430 0.328872\n", "608 0.319770\n", "509 0.319313\n", "105 0.309477\n", "457 0.308201\n", "15 0.307179\n", "461 0.299035\n" ] } ], "source": [ "# Find the n users most similar to a given user\n", "def topn_simusers(uid=16, n=5):\n", "    users = pd_users.loc[uid, :].sort_values(ascending=False)\n", "    topn_users = users.iloc[:n]\n", "    topn_users = topn_users.rename('score')\n", "    print(\"Users similar to user:\", uid)\n", "    return pd.DataFrame(topn_users)\n", "\n", "# Print the IDs of the 10 users most similar to user 17\n", "print(topn_simusers(uid=17, n=10))" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Top 10 movie ratings of user: 596\n", " userId movieId rating title\n", "89645 596 4262 5.0 Scarface (1983)\n", "89732 596 6874 5.0 Kill Bill: Vol. 
1 (2003)\n", "89353 596 194 5.0 Smoke (1995)\n", "89546 596 2329 5.0 American History X (1998)\n", "89453 596 1193 5.0 One Flew Over the Cuckoo's Nest (1975)\n", "89751 596 8132 5.0 Gladiator (1992)\n", "89579 596 2858 5.0 American Beauty (1999)\n", "89365 596 296 5.0 Pulp Fiction (1994)\n", "89587 596 2959 5.0 Fight Club (1999)\n", "89368 596 318 5.0 Shawshank Redemption, The (1994)\n" ] } ], "source": [ "# Return the movies a user rated highest\n", "def topn_movieratings(uid=355, n_ratings=10):\n", "    uid_ratings = ratings.loc[ratings['userId'] == uid]\n", "    uid_ratings = uid_ratings.sort_values(by='rating', ascending=False)\n", "    print(\"Top {} movie ratings of user: {}\".format(n_ratings, uid))\n", "    return uid_ratings.iloc[:n_ratings]\n", "\n", "# The 10 movies user 596 rated highest\n", "print(topn_movieratings(uid=596, n_ratings=10))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### **02 Item-Based Table** (Cosine Similarity Between Movies)\n", "- Measures the similarity between columns (movies) of the **NumPy matrix**\n", "- The same approach applies to the **content-based filtering** method that follows\n", "- Computing the **cosine similarity** over the whole **pivot table** in a Python loop takes a long time\n", "- `linear_kernel` from **sklearn.metrics.pairwise** is faster, but it returns raw dot products rather than cosine similarities" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[ 145. , 0. , 0. , ..., 16. , 0. , 9. ],\n", " [ 0. , 985. , 101.5 , ..., 16. , 119. , 152. ],\n", " [ 0. , 101.5 , 677. , ..., 44.5 , 79. , 189.5 ],\n", " ...,\n", " [ 16. , 16. , 44.5 , ..., 446. , 20. , 77. ],\n", " [ 0. , 119. , 79. , ..., 20. , 494. , 217.5 ],\n", " [ 9. , 152. , 189.5 , ..., 77. 
, 217.5 , 1831.25]])" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# linear_kernel returns the plain dot product between rows.\n", "# For L2-normalized rows (e.g. TF-IDF vectors) that equals the cosine similarity;\n", "# rp_mat is not normalized and its rows are users, so despite the variable name\n", "# this is a 671 x 671 user-by-user matrix of raw dot products.\n", "from sklearn.metrics.pairwise import linear_kernel\n", "mat_movies = linear_kernel(rp_mat, rp_mat)\n", "mat_movies" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(671, 671)" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Pairwise Euclidean distances (the cdist default) between user rows\n", "from scipy.spatial.distance import cdist\n", "mat_movies = cdist(rp_mat, rp_mat)\n", "mat_movies.shape" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(671, 671)" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Pairwise Manhattan (L1) distances between user rows\n", "from sklearn.metrics import pairwise_distances\n", "mat_movies = pairwise_distances(rp_mat, metric='manhattan')\n", "mat_movies.shape" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "# %%time\n", "# # Cosine similarity between every pair of movies (columns of rp_mat)\n", "# mat_movies = np.zeros((n, n))\n", "# for i in range(n):\n", "#     for j in range(n):\n", "#         if i != j: mat_movies[i,j] = (1-cosine(rp_mat[:,i], rp_mat[:,j]))\n", "#         else: mat_movies[i,j] = 0.\n", "\n", "# # Takes roughly 56min 5s\n", "# print(mat_movies.shape)\n", "# pd_movies = pd.DataFrame(mat_movies, index=rp.columns, columns=rp.columns)\n", "# pd_movies.to_csv('data/pd_movies.csv', sep=',')" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Movies similar to movie id: 589, Terminator 2: Judgment Day (1991), are\n", " 589 movieId title\n", "0 0.702256 480 Jurassic Park (1993)\n", "1 0.636392 1240 Terminator, The (1984)\n", "2 0.633428 110 Braveheart (1995)\n", "3 0.619415 356 Forrest Gump (1994)\n", "4 0.614814 377 Speed (1994)\n", "5 0.605887 380 True Lies (1994)\n", "6 
0.604555 457 Fugitive, The (1993)\n", "7 0.591071 593 Silence of the Lambs, The (1991)\n", "8 0.579325 367 Mask, The (1994)\n", "9 0.577299 1036 Die Hard (1988)\n", "10 0.576275 592 Batman (1989)\n", "11 0.568341 296 Pulp Fiction (1994)\n", "12 0.564779 1196 Star Wars: Episode V - The Empire Strikes Back...\n", "13 0.562415 260 Star Wars: Episode IV - A New Hope (1977)\n", "14 0.553626 47 Seven (a.k.a. Se7en) (1995)\n" ] } ], "source": [ "# Load the precomputed movie-by-movie cosine similarity matrix\n", "pd_movies = pd.read_csv(\"data/pd_movies.csv\", index_col='movieId')\n", "\n", "# Find the n movies most similar to a given movie\n", "def topn_simovies(mid=588, n=15):\n", "    mid_ratings = pd_movies.loc[mid, :].sort_values(ascending=False)\n", "    topn_movies = pd.DataFrame(mid_ratings.iloc[:n])\n", "    # Column labels read from CSV are strings, so cast them back to int for the merge\n", "    topn_movies['index1'] = topn_movies.index\n", "    topn_movies['index1'] = topn_movies['index1'].astype('int64')\n", "    topn_movies = pd.merge(topn_movies, movies[['movieId', 'title']],\n", "                           how='left', left_on='index1', right_on='movieId')\n", "    print(\"Movies similar to movie id: {}, {}, are\".format(\n", "        mid,\n", "        movies['title'][movies['movieId'] == mid].to_string(index=False)))\n", "    del topn_movies['index1']\n", "    return topn_movies\n", "\n", "# Print the 15 movies most similar to movie 589\n", "print(topn_simovies(mid=589, n=15))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## **3 Collaborative Filtering with ALS** (Alternating Least Squares)\n", "### **01 Building the Sparse Matrices**\n", "- The computation uses a **binary indicator matrix** of the ratings (1 where a rating exists, 0 otherwise)\n", "- A second, complementary matrix masks already-rated movies so that the predictions can be sorted easily" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Shape of Original Sparse Matrix (671, 9066)\n" ] }, { "data": { "text/plain": [ "array([[0., 0., 0., ..., 0., 0., 0.],\n", " [0., 0., 0., ..., 0., 0., 0.],\n", " [0., 0., 0., ..., 0., 0., 0.],\n", " ...,\n", " [0., 0., 0., ..., 0., 0., 0.],\n", " [4., 0., 0., ..., 0., 0., 0.],\n", " [5., 0., 0., ..., 0., 0., 0.]])" ] }, "execution_count": 18, "metadata": {}, 
"output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "\n", "# ratings = pd.read_csv(\"data/ml-latest-small/ratings.csv\")\n", "# movies = pd.read_csv(\"data/ml-latest-small/movies.csv\")\n", "# rp = ratings.pivot_table(columns=['movieId'], index=['userId'], values='rating')\n", "# rp = rp.fillna(0)\n", "A = rp.values\n", "print(\"\\nShape of Original Sparse Matrix\", A.shape)\n", "A" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[0., 0., 0., ..., 0., 0., 0.],\n", " [0., 0., 0., ..., 0., 0., 0.],\n", " [0., 0., 0., ..., 0., 0., 0.],\n", " ...,\n", " [0., 0., 0., ..., 0., 0., 0.],\n", " [1., 0., 0., ..., 0., 0., 0.],\n", " [1., 0., 0., ..., 0., 0., 0.]])" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Indicator matrix: 1 where a rating exists, 0 otherwise\n", "# Note: the strict > also drops ratings of exactly 0.5, the lowest MovieLens rating;\n", "# A > 0 would keep them\n", "W = (A > 0.5).astype(np.float64)\n", "W" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[0., 1., 1., ..., 1., 1., 1.],\n", " [1., 0., 1., ..., 1., 1., 1.],\n", " [1., 1., 0., ..., 1., 1., 1.],\n", " ...,\n", " [1., 1., 1., ..., 1., 1., 1.],\n", " [0., 1., 1., ..., 1., 1., 1.],\n", " [0., 1., 1., ..., 1., 1., 1.]])" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Complement of W: 1 where the user has not rated the movie\n", "# Multiplying the predicted ratings by W_pred zeroes out movies the user already rated,\n", "# so sorting in descending order surfaces only unseen movies\n", "W_pred = (A < 0.5).astype(np.float64)\n", "np.fill_diagonal(W_pred, val=0)\n", "W_pred" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### **02 Building the Prediction Model by Matrix Approximation**\n", "- The **binary indicator matrix** W restricts the RMSE to entries that actually have a rating\n", "- W_pred **masks** the rated movies so that the predictions can be **sorted** directly" ] }, { "cell_type": 
"code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "  0 iterations completed, RMSE: 3.2415\n", " 10 iterations completed, RMSE: 1.7181\n", " 20 iterations completed, RMSE: 1.7081\n", " 30 iterations completed, RMSE: 1.7042\n", " 40 iterations completed, RMSE: 1.7020\n", " 50 iterations completed, RMSE: 1.7007\n", " 60 iterations completed, RMSE: 1.6997\n", "Final model RMSE on the movie ratings: 1.6990904715763828\n" ] } ], "source": [ "# Parameters\n", "m, n = A.shape\n", "n_iterations = 70  # number of ALS iterations\n", "n_factors = 100    # number of latent factors\n", "lmbda = 0.1        # regularization strength (lambda), not a learning rate\n", "\n", "# Random initial user factors X (m x k) and item factors Y (k x n)\n", "X = 5 * np.random.rand(m, n_factors)\n", "Y = 5 * np.random.rand(n_factors, n)\n", "\n", "# RMSE over the rated entries only (W masks out the unrated ones)\n", "def get_error(A, X, Y, W):\n", "    return np.sqrt(np.sum((W * (A - np.dot(X, Y)))**2) / np.sum(W))\n", "\n", "errors = []\n", "for itr in range(n_iterations):\n", "    # Fix Y and solve the regularized least squares problem for X, then the reverse\n", "    X = np.linalg.solve(np.dot(Y, Y.T) + lmbda*np.eye(n_factors), np.dot(Y, A.T)).T\n", "    Y = np.linalg.solve(np.dot(X.T, X) + lmbda*np.eye(n_factors), np.dot(X.T, A))\n", "    if itr % 10 == 0:\n", "        print(\"{:3} iterations completed, RMSE: {:.4f}\".format(itr, get_error(A, X, Y, W)))\n", "    errors.append(get_error(A, X, Y, W))\n", "\n", "# Build the final predicted ratings matrix\n", "A_hat = np.dot(X, Y)\n", "print(\"Final model RMSE on the movie ratings:\", get_error(A, X, Y, W))" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "%matplotlib inline\n", "from matplotlib import rc\n", "from matplotlib import pyplot as plt\n", "rc('font', family=['NanumGothic','Malgun Gothic'])\n", "\n", "plt.plot(errors)\n", "plt.ylim([0, 3.5])\n", "plt.xlabel(\"Number of Iterations\")\n", "plt.ylabel(\"RMSE\")\n", "plt.title(\"Number of Iterations vs. RMSE\")\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Top 10 movies predicted for the user: 355 based on collaborative filtering\n", "\n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ " pred_ratings movieId title\n", "0 2.389161 1923 There's Something About Mary (1998)\n", "1 2.323406 1213 Goodfellas (1990)\n", "2 2.179170 5010 Black Hawk Down (2001)\n", "3 2.165468 1197 Princess Bride, The (1987)\n", "4 1.937502 8798 Collateral (2004)\n", "5 1.934248 2987 Who Framed Roger Rabbit? (1988)\n", "6 1.890346 8622 Fahrenheit 9/11 (2004)\n", "7 1.868384 5903 Equilibrium (2002)\n", "8 1.864697 8957 Saw (2004)\n", "9 1.858035 4370 A.I. Artificial Intelligence (2001)" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Mask the predicted ratings learned above and return the top predictions for one user\n", "def print_recommovies(uid=315, n_movies=15, pred_mat=A_hat, wpred_mat=W_pred):\n", "    # Zero out movies the user has already rated\n", "    pred_recos = pred_mat * wpred_mat\n", "    pd_predrecos = pd.DataFrame(pred_recos, index=rp.index, columns=rp.columns)\n", "    pred_ratings = pd_predrecos.loc[uid, :].sort_values(ascending=False)\n", "    pred_topratings = pred_ratings.iloc[:n_movies]\n", "    pred_topratings = pred_topratings.rename('pred_ratings')\n", "    pred_topratings = pd.DataFrame(pred_topratings)\n", "    pred_topratings['index1'] = pred_topratings.index\n", "    pred_topratings['index1'] = pred_topratings['index1'].astype('int64')\n", "    pred_topratings = pd.merge(pred_topratings, movies[['movieId', 'title']],\n", "                               how='left', left_on='index1', right_on='movieId')\n", "    del pred_topratings['index1']\n", "    print(\"\\nTop\",n_movies,\"movies predicted for the user:\",uid,\" based on collaborative filtering\\n\")\n", "    return pred_topratings\n", "\n", "# Show the 10 movies with the highest predicted ratings for user 355\n", "# pred_mat is the predicted ratings matrix, wpred_mat the mask of unrated movies\n", "print_recommovies(uid=355, n_movies=10, pred_mat=A_hat, wpred_mat=W_pred)" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Top 5 movies predicted for the user: 11 based on collaborative filtering\n", "\n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ " pred_ratings movieId title\n", "0 1.098535 2959 Fight Club (1999)\n", "1 0.979711 608 Fargo (1996)\n", "2 0.787313 99114 Django Unchained (2012)\n", "3 0.720538 68157 Inglourious Basterds (2009)\n", "4 0.713494 80463 Social Network, The (2010)" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Show the 5 movies with the highest predicted ratings for user 11\n", "print_recommovies(uid=11, n_movies=5, pred_mat=A_hat, wpred_mat=W_pred)" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "ALS matrix factorization grid search:\n", "\n", "iters:  20 factors:  30 lambda: 0.001 RMSE: 2.3199\n", "iters:  20 factors:  50 lambda: 0.001 RMSE: 2.1036\n", "iters:  20 factors:  70 lambda: 0.001 RMSE: 1.9282\n", "iters:  20 factors: 100 lambda: 0.001 RMSE: 1.7077\n", "iters:  50 factors: 100 lambda: 0.001 RMSE: 1.6998\n", "iters: 100 factors: 100 lambda: 0.001 RMSE: 1.6975\n", "iters: 100 factors: 100 lambda: 0.1 RMSE: 1.6975\n", "iters: 200 factors: 100 lambda: 0.001 RMSE: 1.6959\n", "iters: 200 factors: 100 lambda: 0.1 RMSE: 1.6957\n", "CPU times: user 20min 31s, sys: 2min 46s, total: 23min 17s\n", "Wall time: 6min 4s\n" ] } ], "source": [ "%%time\n", "# Grid Search on Collaborative Filtering\n", "def get_error(A, X, Y, W):\n", "    return np.sqrt(np.sum((W * (A - np.dot(X, Y)))**2) / np.sum(W))\n", "\n", "init_error = float(\"inf\")  # best (lowest) RMSE seen so far\n", "niters = [20, 50, 100, 200]\n", "factors = [30, 50, 70, 100]\n", "lambdas = [0.001, 0.01, 0.05, 0.1]\n", "print(\"ALS matrix factorization grid search:\\n\")\n", "\n", "for niter in niters:\n", "    for facts in factors:\n", "        for lmbd in lambdas:\n", "            X = 5 * np.random.rand(m, facts)\n", "            Y = 5 * np.random.rand(facts, n)\n", "            for itr in range(niter):\n", "                X = np.linalg.solve(np.dot(Y, Y.T)+lmbd*np.eye(facts), np.dot(Y, A.T)).T\n", "                Y = np.linalg.solve(np.dot(X.T, X)+lmbd*np.eye(facts), np.dot(X.T, A))\n", "            error = get_error(A, X, Y, W)\n", "            # Report a combination only when it improves on the best RMSE so far\n", "            if 
error < init_error:\n", "                print(\"iters: {:3} factors: {:3} lambda: {} RMSE: {:.4f}\".format(\n", "                    niter, facts, lmbd, error))\n", "                init_error = error" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## **4 Using GridSearchCV with a Pipeline**\n", "The last code cell above searches for the lowest error with deeply nested for loops.\n", "- Pure-Python loops make this search slow\n", "- `make_pipeline` and `GridSearchCV` could address this, but my understanding of them is still incomplete\n", "- To do: work through chapter 6 of the sklearn book and the hands-on-ml book and write up notes\n", "- Still unclear: how to configure the pipeline and how to connect its input and output data" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 1min 31s, sys: 23.4 ms, total: 1min 31s\n", "Wall time: 1min 31s\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/home/markbaum/Python/python/lib/python3.6/site-packages/sklearn/model_selection/_search.py:841: DeprecationWarning: The default of the `iid` parameter will change from True to False in version 0.22 and will be removed in 0.24. 
This will change numeric results when test-set sizes are unequal.\n", " DeprecationWarning)\n" ] } ], "source": [ "# Pipeline example 1: StandardScaler + SVC, tuned with GridSearchCV\n", "from sklearn.svm import SVC\n", "from sklearn.pipeline import Pipeline\n", "from sklearn.datasets import load_digits\n", "from sklearn.preprocessing import StandardScaler\n", "from sklearn.model_selection import GridSearchCV\n", "\n", "digits = load_digits()\n", "X, y = digits.data, digits.target\n", "pipe_svc = Pipeline([('scl', StandardScaler()), ('clf', SVC(random_state=1))])\n", "param_range = [0.0001, 0.001, 0.01] #, 0.1, 1.0, 10.0, 100.0, 1000.0]\n", "# Parameters are addressed as <step name>__<parameter name>\n", "param_grid = [\n", "    {'clf__C': param_range, 'clf__kernel': ['linear']},\n", "    {'clf__C': param_range, 'clf__gamma': param_range, 'clf__kernel': ['rbf']}]\n", "gs = GridSearchCV(estimator=pipe_svc, param_grid=param_grid,\n", "                  scoring='accuracy', cv=10, n_jobs=1)\n", "%time gs = gs.fit(X, y)" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "GridSearchCV(cv=5, error_score='raise-deprecating',\n", " estimator=Pipeline(memory=None,\n", " steps=[('standardscaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('logisticregression', LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,\n", " intercept_scaling=1, max_iter=100, multi_class='warn',\n", " n_jobs=None, penalty='l2', random_state=None, solver='liblinear',\n", " tol=0.0001, verbose=0, warm_start=False))]),\n", " fit_params=None, iid='warn', n_jobs=None,\n", " param_grid={'logisticregression__C': [0.01, 0.1, 1, 10, 100]},\n", " pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',\n", " scoring=None, verbose=0)" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Pipeline example 2: make_pipeline names each step after its lowercased class name\n", "from sklearn.linear_model import LogisticRegression\n", "from sklearn.datasets import load_breast_cancer\n", "from sklearn.model_selection import train_test_split\n", "from 
sklearn.pipeline import make_pipeline\n", "\n", "cancer = load_breast_cancer()\n", "\n", "pipe = make_pipeline(StandardScaler(), LogisticRegression(solver='liblinear'))\n", "param_grid = {'logisticregression__C': [0.01, 0.1, 1, 10, 100]}\n", "# Hold out a test set; the grid search cross-validates on the training part only\n", "X_train, X_test, y_train, y_test = train_test_split(\n", "    cancer.data, cancer.target, random_state=4)\n", "grid = GridSearchCV(pipe, param_grid, cv=5)\n", "grid.fit(X_train, y_train)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.3" } }, "nbformat": 4, "nbformat_minor": 2 }