{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Recommending movies (based on user demographics and movie info)\n", "** *\n", "*Note: if you are visualizing this notebook directly from GitHub, some mathematical symbols might display incorrectly or not display at all. This same notebook can be rendered from nbviewer by following [this link.](http://nbviewer.ipython.org/github/david-cortes/datascienceprojects/blob/master/machine_learning/recommender_system_w_coldstart.ipynb)*\n", "\n", "This project consists of recommending movies to users based on their demographic information (in this case: age, gender, occupation and geographical region) and on movie information (in this case: year of production, genres, and user tags), using the [MovieLens 1M](https://grouplens.org/datasets/movielens/1m/) dataset, which contains 1,000,209 ratings of 3,900 movies from 6,040 users, along with the users’ demographic information and some basic movie information; enhancing the movie info with the [MovieLens 20M](https://grouplens.org/datasets/movielens/20m/) dataset, which contains more detailed movie information in the form of _tag genomes_ as described in [Vig, J., Sen, S., & Riedl, J. (2012). The tag genome: Encoding community knowledge to support novel interaction. ACM Transactions on Interactive Intelligent Systems (TiiS), 2(3), 13.](http://dl.acm.org/citation.cfm?id=2362395).\n", "\n", "The formula used is an implementation of what’s described in [Park, S. T., & Chu, W. (2009, October). Pairwise preference regression for cold-start recommendation. In Proceedings of the third ACM conference on Recommender systems (pp. 21-28). 
ACM.](http://dl.acm.org/citation.cfm?id=1639720), with some slight modifications. The general idea is to fit a regression on the differences between the ratings that the same user gave to two movies, using the outer products of the user and movie attribute vectors as features.\n", "\n", "In comparison to recommendations based on past user ratings, these kinds of models are able to provide quality recommendations to new users (for whom there is demographic information available), and are able to recommend both old and new movies (as long as there is information about them).\n", "\n", "The idea implemented here differs from what the paper above describes in that:\n", "* More movie information is added through the use of the _tag genome_ info.\n", "* No rating bots are used to enhance movie features.\n", "* For computational reasons, the results will only be evaluated with rating averages of the movies that would be recommended for a hold-out user set." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Recommendation formula\n", "\n", "This model ranks movies for a user by a linear combination of the entries of the outer product of movie and user features. The weights are chosen to minimize the squared differences between the observed rating difference of each pair of movies rated by the same user and the predicted difference, plus a regularization term. This is given by the following formula:\n", "\n", "$$ \\min_w \\sum_{u \\in users} \\bigg( \\frac{1}{\\left\\vert\\ M_u \\right\\vert} \\sum_{i \\in M_u} \\sum_{j \\in M_u} \\Big( (R_{ui} - R_{uj}) - \\big( w^\\mathsf{T}(z_i \\otimes x_u) - w^\\mathsf{T}(z_j \\otimes x_u) \\big) \\Big)^2 \\bigg) + \\lambda \\lVert w \\rVert^2_2 $$\n", "\n", "Where $R_{ui}$ is the rating given by user $u$ to movie $i$, $x_u$ are the user features, $z_i$ are the movie features, $M_u$ is the set of movies rated by user $u$, $\\lambda$ is the regularization parameter, and $w$ are the coefficients assigned to each combination of a user feature and a movie feature. 
Note that the optimization problem is convex with respect to $w$.\n", "\n", "Recommendations are then produced for each user by calculating, for each movie $j$, the score $w^\\mathsf{T}(z_j \\otimes x_u)$ and taking the movies with the highest scores for that user.\n", "\n", "Intuitively, the parameters that minimize such a loss function are essentially the same as those that minimize a regression on the centered ratings of each user (the two objectives coincide exactly only when the predicted scores are also centered within each user and the per-user weights are adjusted accordingly), that is:\n", "\n", "$$ \\min_w \\sum_{u \\in users} \\frac{1}{\\left\\vert\\ M_u \\right\\vert} \\sum_{i \\in M_u} \\bigg( (R_{ui} - \\overline{R_u}) - w^\\mathsf{T}(z_i \\otimes x_u) \\bigg)^2 + \\lambda \\lVert w \\rVert^2_2 $$\n", "\n", "This is far easier and faster to work with, and can be easily computed with standard libraries. I found the formula without the per-user weights $\\frac{1}{\\left\\vert\\ M_u \\right\\vert}$ to be slightly more accurate after optimizing the coefficients for half an hour." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Sections\n", "\n", "[1. Processing the movie data](#p1)\n", "\n", "[2. Processing the user data](#p2)\n", "\n", "[3. Loading ratings and generating test set](#p3)\n", "\n", "[4. Fitting the model with Spark](#p4)\n", "\n", "[5. Evaluating the model and checking some recommendations](#p5)\n", "** *\n", "\n", "\n", "## 1. Processing the movie data\n", "\n", "\n", "The movie data needs some processing in order to put it in the right format.\n", "As explained before, the movie data can be enhanced with the tag genome information, which contains 1128 tags for each movie on a relative scale from 0 to 1 (with values closer to 1 indicating that the movie has more of that tag). 
Although these are far too many tags to use with such a small ratings dataset, many tags are related to each other, so the features can be reduced effectively using principal components.\n", "\n", "A quick look at the data:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/david/anaconda2/lib/python2.7/site-packages/pandas/core/indexing.py:179: SettingWithCopyWarning: \n", "A value is trying to be set on a copy of a slice from a DataFrame\n", "\n", "See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy\n", " self._setitem_with_indexer(indexer, value)\n" ] }, { "data": { "text/html": [ "
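<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>movieId</th>\n", " <th>title</th>\n", " <th>genres</th>\n", " <th>Year</th>\n", " </tr>\n", " </thead>\n",
" <tbody>\n",
" <tr>\n", " <th>0</th>\n", " <td>1</td>\n", " <td>Toy Story (1995)</td>\n", " <td>Adventure|Animation|Children|Comedy|Fantasy</td>\n", " <td>1995</td>\n", " </tr>\n",
" <tr>\n", " <th>1</th>\n", " <td>2</td>\n", " <td>Jumanji (1995)</td>\n", " <td>Adventure|Children|Fantasy</td>\n", " <td>1995</td>\n", " </tr>\n",
" <tr>\n", " <th>2</th>\n", " <td>3</td>\n", " <td>Grumpier Old Men (1995)</td>\n", " <td>Comedy|Romance</td>\n", " <td>1995</td>\n", " </tr>\n",
" <tr>\n", " <th>3</th>\n", " <td>4</td>\n", " <td>Waiting to Exhale (1995)</td>\n", " <td>Comedy|Drama|Romance</td>\n", " <td>1995</td>\n", " </tr>\n",
" <tr>\n", " <th>4</th>\n", " <td>5</td>\n", " <td>Father of the Bride Part II (1995)</td>\n", " <td>Comedy</td>\n", " <td>1995</td>\n", " </tr>\n",
" </tbody>\n", "</table>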
\n", "
" ], "text/plain": [ " movieId title \\\n", "0 1 Toy Story (1995) \n", "1 2 Jumanji (1995) \n", "2 3 Grumpier Old Men (1995) \n", "3 4 Waiting to Exhale (1995) \n", "4 5 Father of the Bride Part II (1995) \n", "\n", " genres Year \n", "0 Adventure|Animation|Children|Comedy|Fantasy 1995 \n", "1 Adventure|Children|Fantasy 1995 \n", "2 Comedy|Romance 1995 \n", "3 Comedy|Drama|Romance 1995 \n", "4 Comedy 1995 " ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd, numpy as np, re\n", "from collections import defaultdict\n", "\n", "movies=pd.read_csv('/home/david/movielens/ml-latest/ml-latest/movies.csv')\n", "movies_humanreadable=movies.copy()\n", "\n", "# Extract the 4-digit release year from titles ending in '(YYYY)'\n", "year_pattern=re.compile(r\"\\s\\((\\d{4})\\)$\")\n", "movies['hasYear']=movies.title.map(lambda x: bool(year_pattern.search(x.strip())))\n", "movies['Year']='unknown'\n", "# Assign through a single .loc call to avoid chained assignment (SettingWithCopyWarning)\n", "movies.loc[movies.hasYear,'Year']=movies.title.loc[movies.hasYear].map(lambda x: year_pattern.search(x.strip()).group(1))\n", "del movies['hasYear']\n", "movies.head()" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "{'(no genres listed)',\n", " 'Action',\n", " 'Adventure',\n", " 'Animation',\n", " 'Children',\n", " 'Comedy',\n", " 'Crime',\n", " 'Documentary',\n", " 'Drama',\n", " 'Fantasy',\n", " 'Film-Noir',\n", " 'Horror',\n", " 'IMAX',\n", " 'Musical',\n", " 'Mystery',\n", " 'Romance',\n", " 'Sci-Fi',\n", " 'Thriller',\n", " 'War',\n", " 'Western'}" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Split the pipe-separated genre string into a set, then one-hot encode each genre\n", "movies['genres']=movies.genres.map(lambda x: set(x.split('|')))\n", "present_genres=set()\n", "for movie in movies.itertuples():\n", "    present_genres=present_genres.union(movie.genres)\n", "for genre in present_genres:\n", "    movies['genre'+genre]=movies.genres.map(lambda x: 1.0*(genre in x))\n", "present_genres" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Processing the tag genome info:" 
] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAABIQAAAHVCAYAAACAOCDDAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAIABJREFUeJzs3Xl0nNdh3/3fnQ0YYAb7QhDcwJ2UqIWEKFmWbdCKIsqR\noyiRHXlR3ijRq9i13rR1k9pJ66RtGh+7js/7+m0cq2wsK3ZS026cOqpMS05swVotUtTGTRT3BQQJ\nYsdgMPvtHzMYDldA4uB5Bpjv5xyceZY7g99APj7Wz/e511hrBQAAAAAAgPLhcTsAAAAAAAAAnEUh\nBAAAAAAAUGYohAAAAAAAAMoMhRAAAAAAAECZoRACAAAAAAAoMxRCAAAAAAAAZYZCCAAAAAAAoMxQ\nCAEAAAAAAJQZCiEAAAAAAIAy43PrFzc1NdklS5a49euLanx8XNXV1W7HeEfI7AwyO4PMziCzM8js\nDDI7g8zOILMzyOwMMjuDzHPbzp07+621zVONc60QWrJkiV555RW3fn1RdXd3q6ury+0Y7wiZnUFm\nZ5DZGWR2BpmdQWZnkNkZZHYGmZ1BZmeQeW4zxhybzjgeGQMAAAAAACgzFEIAAAAAAABlhkIIAAAA\nAACgzFAIAQAAAAAAlBkKIQAAAAAAgDJDIQQAAAAAAFBmKIQAAAAAAADKDIUQAAAAAABAmaEQAgAA\nAAAAKDMUQgAAAAAAAGWGQggAAAAAAKDMUAgBAAAAAACUGQohAAAAAACAMkMhBAAAAAAAUGamLISM\nMY8ZY/qMMbsvc98YY/5/Y8xBY8ybxpj1xY8JAAAAAACAYpnODKHHJW2+wv27JK3I/Tws6RtXHwsA\nAAAAAAAzxTfVAGvts8aYJVcYco+kb1trraRfGGPqjDFt1treImUEAAAAAAB4R6y1slaykjK544y1\n0gXnVpLNSNUVXvm85bOyzpSF0DS0SzpRcH4yd41CCAAAAACAIrPWKpWxSmeskumM0pnseSptlcpk\ncq+Xup893t2fUuatM/lx2bEZJdP2orHZa5mCz8+epzPZMiWdsUpbq0zmguNc2TJ5/dxYXWKsVabg\n+kWfa62i0Zj8L/5UaWtlJ+9nsu+bHGtznzP5+k794NO3asPi+uL/AytRxtqp/0q5GUJPWmuvvcS9\nJyV9yVr7fO78p5I+Z6195RJjH1b2sTK1trZu2Lp161WFLxWRSEShUMjtGO8ImZ1BZmeQ2RlkdgaZ\nnUFmZ5DZGWR2Bpmd4XTmjM0WFOmMlMpIKWvzx2mrbAlScD9tbXZcRrnrVuMTcfkCFfn3pzLZ8fn3\n58ee//mT7z/3u7LjJ4+zhYrOlSg2+55M7scNRpLXI3lMdv0Zjyn8MfKY7JjC6yY/1lww/sKxpmDs\nxZ+RSaUU8Psv8bkF71f2usl9dv5cl7tmpNxnSNLN87yqq5z9M4Q2bdq001rbOdW4YswQ6pG0sOB8\nQe7aRay1WyRtkaTOzk7b1dVVhF/vvu7ubs2270JmZ5DZGWR2BpmdQWZnkNkZZHYGmZ1B5uJIZ6wS\nqYziqXTuNaNEOqNEKvvz9is7tbb9WiXSBfcvGHPhtcLzeDJ93tgL35cdm1YynZ05kypKs2IkJS66\n6vca+b0e+TxGAZ9Hfu/kj8kfV3iNQl6PAr7sOL/XI7/PI3/u2Oc18nqMfJ7sfa/XyO/x5K4Z+XKf\n7/Nmz70eT8Fx9jOyr7l7ufe9+cbruqlzw3nv9Xk8573n3L1z557J5sQFpfif59muGIXQE5IeMcZs\nlXSzpBHWDwIAAACA0jP5qFEsmVYsmS1mzntNphVPZRS7xOulxsdSacVz5/nXfDmTuaicSU+ngHn5\n5Sm
HGCNV+DwKeD0K+LzZ49x5hX/yukehSl/+OODzqCI31u81uRLGkytrzCULG7/Xo4AvW4pMHl84\n7pXtL+v9t71XgVyBM3lvcvZJKZo47tUNC+vcjgGXTVkIGWO+K6lLUpMx5qSkP5XklyRr7aOStkn6\nkKSDkqKSHpypsAAAAAAwl1ibXaNlIplWLJnWRCKtiWT2J1ZwPJHI3U9mi5fJIubc68UlzsBQVL4d\nz1xU7FzNpBifx6jS71WlP1eu5F6z5x7VVQXOlS8XlDPZwsZbUM4UvObG7NuzWzetvyFf3hSOK/wc\nn6d0CpcjQY+awxVuxwDesensMvaxKe5bSZ8pWiIAAAAAKAHp3EyaiWRaZ6MZHTgzli9n8gVOMq2J\nROYdFTqT75k8n9asmQv4PEYVPo8q/d78ayD3Wun3qC7oly/hUfu8unx5U/g6+b6KgvdX5N9/8bWK\nXDEz0zswec/s081LG2f0dwDIKsYjYwAAAADgCmutYsmMxhMpTSTSGk+kFE2kFY2nFU2kNJFMazx3\nHE2kcz+pC14LjuPnSptEKnP+L3v22SnzeIxUFfCp0u9VMOBR0O9VMFey1FcHNH/yPODN3wsGsveD\nBe+pLLiXP8+Nq5xmMZNdc+XGd/unBTDHUQgBAAAAmHHWZh+LisRTOj2e0e6ekVxZM1nkpDWRSGk8\nV9BMHk8kcmNyY88VONkx0WRa09g4OS/g86gq4FV1wKdgwKvqQLZoaQ1XKhjwqirgPVfoFBQ0Rw8d\n0I3XXXOu4LlMoVPqa8cAwCQKIQAAAAAXsdYqnspoPJ7SeDxb5IwnUtnX3E8kni44zpY159/Pvnc8\n997znox67vkr/v7JYqYqX9Jkz5tCFdnjCp+q/LnXgvuXOq6uyJY/VX7vu37kqTt+VF3XzX9X7wWA\nUkQhBAAAAMwR+Vk4sZRGY9lCZiyWVCRWUOScV9qcK2sKr00eT3dL7gqfR6EKn6pz5Uyowqe6qoAW\n1FepuiJbyEzer67w6cThA9pw/bXnZulUeFXl96mqIlvgVPq8rm5vDQDlgEIIAAAAKAGJVCZf4IzF\nUhorLHTiqfy1wvNILKXegQnZl3+WHz+dBYr9XpMtZwK+8wqb1nBl7tibL29C+ddLXAtkSxz/O5x1\n0x07oq5r5r3bPxUAoAgohAAAAICrFEumNRpLanQilXtNZmfoxKYodAoKoPiFCxhfQsDnUbjCp3Cl\nT6FKn8IVfjUFjToWNOSu+7PXK7OlTU3uPHTeDB2vKnxeB/4qAIBSRiEEAACAshdPpTU6kVJvJKPX\njg9pNJbKlTpJjUxcXPRM3pu8ftFuVBfwGCmUK2zCucKmKRRQR1N1vsDJFzoFhU/NBeeXKnKyO0nd\nMFN/GgDAHEUhBAAAgFkvlc5oNJbScDRxycLmXLlTeO9cuXPe7JznX7zo8/1eo9qgXzWVfoWDftVU\n+tReH8xfqwn6cq/Ze5OvkwVPVcDLzlMAgJJCIQQAAICSEUumNRzNzsoZjiY0PJHUSDSp4YlE7lry\nktfGYqkrfq7Pkyt0Cgqb+bXBi4qcniMHdfOG61RT6Vdtwb0Kn4dCBwAwp1AIAQAAoKgyGauxeEpn\noxntOjmi4YlEvsgZnSx6CoqdkYlkfsyV1tHxeozqgn7VVvlVF/SrJVypFS1h1Qb9qstdq63yF8za\nOXdc6Z9eodMdP6quVS3F/HMAAFCSKIQAAABwWcl0RkO5AmdwPKHhaEKD48nctezxcDShwcmSJ5qd\ntZPf6OrZ5y/6zKDfq7pccVNX5deSpirVBeuy16r8qgsG8vfyZU9VQNU8dgUAQNFQCAEAAJSJeCqd\nL3aGogkN5YqdofGEhqK548Lz8YTG4pd/FCvo96q+yq/66oDqqwJqrwuqviqQL3J6jx3SLeuvOzd7\nJ/fIVqWfHa4AAHAbhRAAAMAslM7Y3AydhPoj2dfB8Xh+9s5Q7l7hz
J7xRPqyn1cd8OaLnfrqgJY0\nVWePqwJqqM7O0GmozpY9DblxUxU73enj6lrbWuyvDgAAioBCCAAAoASkMzZf4gzkCp6B8Xj+eN+R\nmL6x/6Xc9WzBk38s6wLhSl+uvAmoKRTQipZQruw5N5snW/z41VAVUG2V/5LbmQMAgLmLQggAAGAG\nFBY8/ZF4bgZPtuwZGI8XHCfyj3DZyxQ8dVV+VZqMFgWlZc0h3dQRUFN1dsZOQ6hCjdUBNYYC+Zk7\nfq/H2S8LAABmHQohAACAaUqkMhoYj6t/LFvynI3Es69jcfVHEuofy54PTKPgaawOqLG6QsubQ2rs\nCKgxV/A05gqehlzB01AVkM/rUXd3t7q63uPsFwYAAHMWhRAAAChr8VT6vDKnsOA5G4kXXM/unnUp\n1QGvmsIVagpVaGlztW7qaFBTrtxpqA7kC57G6grVV/nlYwYPAABwGYUQAACYcxKpjM5G4uobjWXL\nnnzJk/05eHJC/+mVbp2NxDUWu/QuWuEKX67kCWhla1i3LssWPk3hQPY1VKHm3HlVgP9JBQAAZhf+\n1wsAAJg1oomU+kbj6huLq28sdt7x2bF47jymoeilZ/KEK31qDlfIL2lNW43eF8qVO7nZPU258+Zw\nBVujAwCAOY1CCAAAuMpaq9FYSmcvKHjOOx6L6+xoXGPxi2fz+DxGzeEKtYQrtKixSp1L6tUSrlRL\nTXYGT3M4W/g0Vp/bJj27Hs96p78qAABAyaAQAgAAM8Jaq7F4SmdGYjo9GtPpkWyx0zeaey0ofuKp\nzEXvr/R7ssVOuEKr54X1/hXN+eKnpSZ7vSVcofqqgDwe48I3BAAAmL0ohAAAwDuWSmfX6Dk9EtMr\np1M6+sIRnR6N60yu+Dkzmi2Boon0Re8NV/pyZU6l1i+qzx+31FTkCp/scbjCJ2MoegAAAGYChRAA\nADjPWCyZK3biOj0ay5c8hcf9kbgyhVuqv75Xfq9RS7hS82ortaatRl2rWjSvtkKtNZWaV5O93hKu\nVDDA2jwAAABuoxACAKBMWGs1HE3q1MiETg3H1Dsyod6R2LlHukazx+OXmNVTU+nTvNpKtdZUavW8\nsObVVKq1Nlv0nDywW7/ywdvUwKNbAAAAswaFEAAAc0QknlLv8IROjcTyr6eGJ7LFz3BMp0YmFEue\nv1aPz2PUEq5Qa22lVrVm1+mZlyt6WnOzeubVXHlWT/eZfWoKVcz01wMAAEARUQgBADALxFNpnR6J\n6dTwuZJnsvjpzRU/o7Hzd+AyRmoJV6itNqjVbWFtWt2i+XVBza+tVFvutSlUwaweAACAMkQhBACA\ny6y1OhuJq2doQtt7U3r72UMFxU/20a7+SOKi9zVUB9RWW6kF9VXa2NGgttqg5tdVan5dUG25x7v8\nXo8L3wgAAACljkIIAIAZls5YnRmNqWd4Qj1DEzo5FFXP8IRODmXPe4Ynzt92/Y23FKrwaX5dpdpq\ng7q2vUZttdmSZ35dMF/4VPpZnBkAAADvDoUQAABXKZnOqHc4ppPD0fNKnsnip3c4ptR5W3JJTaGA\n2uuCWtNWo19a26r2uqAW1Ad16uAe3XPH+1RT6Xfp2wAAAKAcUAgBADCFeCqdm9mTm9UzHM2f9wxP\n6Mxo7Lwt2I2RWsOVaq8Pav2ierVfF1R7fVAL6qvUXhdUe13wsos0d5/ZRxkEAACAGUchBAAoe5Nr\n+JwYjOr4YFQnBid0PH8c1enRmGxB4eP1GLXVVqq9LqhblzVly57cDJ/2+qDaaoMK+Fi7BwAAAKWL\nQggAUBaiiVS+6DlRUPYcH4zqxFD0ou3YW2sqtKihSu9Z1qhFDVVaWF+lhQ1Vaq8PqjVcIR+LNQMA\nAGAWoxACAMwJ6YzV6dGYjg9kC54Tg1Ht2BfT1/a+oBODE+qPxM8bXx3wamFDlTqaqvWBlc1a2FCV\nLX4aqrSgPsiCzQAAAJjTKIQAA
LNGMp3RyaEJHR0Y19H+cR0biOroQPb15FBUyfS557o8RmqoNFo5\n36vbV7doUWPVudKnPqiG6oCMMS5+GwAAAMA9FEIAgJIST6V1YnBCxwbGdXQgmn892j+unuEJpQtW\nb64OeLW4sVpr2sK685p5uRk+QS1qqNL8uqBeeO5ZdXXd4uK3AQAAAEoThRAAwHGxZFonBqP5wudI\nwWyfU8MT5+3YFa7waUlTta5bUKtfvX6+ljRVa0ljlRY3VqspxCwfAAAA4N2gEAIAzIh0xqpnaEKH\n+yM6fHZch/sjOtI/rqP9UZ0amThv167aoF9Lmqq1YXG9fn39gnzh09FUrfoqP6UPAAAAUGQUQgCA\nqzI0ntDh/nEdPhvR4f5xHcmVP0cHokqkzu3cFa70aWlzSDctqdeSpgXqaKrW4sbsbJ+6qoCL3wAA\nAAAoPxRCAIApxVNp9UQyemr36exMn7Pj+RJoKJrMj/N5jBY1VmlpU0ibVrWoo6laS5tDWtpcrUYW\ncQYAAABKxrQKIWPMZklfk+SV9NfW2i9dcL9e0mOSlkmKSfoda+3uImcFAMywwfGEDvZF8j+Tj3ud\nHIpm1/V5fqckqTlcoaVN1dp8bZuWNVfni5+F9UH5vB53vwQAAACAKU1ZCBljvJK+LukOSScl7TDG\nPGGt3Vsw7I8lvW6tvdcYszo3/vaZCAwAuDrWWp0ZjetA35gO9kV0IFf+HOqLaGA8kR8X9HvVkVvM\n+ddubFfs7HHd/f5OdTRVK1zpd/EbAAAAALha05khtFHSQWvtYUkyxmyVdI+kwkJoraQvSZK19i1j\nzBJjTKu19kyxAwMApiedsTo5FD2v9DnQF9HhvojG4qn8uNqgX8tbQrpjbauWt4TyP/Nrg/J4zj3i\n1d19StctqHPjqwAAAAAoMmMLt3m51ABj7pO02Vr7UO78AUk3W2sfKRjzRUlBa+2/NsZslPRibszO\nCz7rYUkPS1Jra+uGrVu3FvXLuCUSiSgUCrkd4x0hszPI7Ixyz5zKWJ2JWp2KZM79jFudHs8oeW5N\nZ9VVGLVVG80PebI/1dnXmoCmtbZPuf+dnUJmZ5DZGWR2BpmdQWZnkNkZZJ7bNm3atNNa2znVuGIt\nKv0lSV8zxrwuaZek1ySlLxxkrd0iaYskdXZ22q6uriL9end1d3drtn0XMjuDzM4ol8yZjFXP8IT2\nnx7T/jNj2n96TG+fGdOhsxEl0+fK/QX1Qa2YH9Lm/GyfsJa3hFQbvLrHvMrl7+w2MjuDzM4gszPI\n7AwyO4PMziAzpOkVQj2SFhacL8hdy7PWjkp6UJJM9v9mPiLpcJEyAkBZsdbqbCSut09HcsXPqPaf\niejAmTFFE+e69va6oFbNC6trVYtWtoa0sjWspc3VqgqwgSQAAACAK5vOvzXskLTCGNOhbBF0v6SP\nFw4wxtRJilprE5IekvRsriQCAFzByERSB85kZ/y8fXpMb+Vm/RRu5d4UCmhla1gf7VyoVfPCWtka\n1srWEAs7AwAAAHjXpiyErLUpY8wjkp5Wdtv5x6y1e4wxn8rdf1TSGkl/Y4yxkvZI+t0ZzAwAs04m\nY3VsMKp9vaPa1zuqZ3fF9Mcv/VSnRmL5MaEKn1a2hrT52nla2RrWqtawVs4LqylU4WJyAAAAAHPR\ntJ4rsNZuk7TtgmuPFhy/JGllcaMBwOwUTaS0//SY9ubKn72nRrX/9JjGc497eT1G86qkm5Y3aPW8\nGq2al33cq70uOK3FnQEAAADgarHQBAC8S9ZanRmNZ0uf3M++U6M6MjCuyQ0cwxU+rWmr0Uc6F2pN\nW1hr22q1ojWkX7zwnLq6bnT3CwAAAAAoWxRCADAN6YzV4bMR7T41oj09o9p3elT7esc0OJ7Ij1lQ\nH9Tathp9+Pr5Wju/RmvbarSgnlk/AAAAAEoPhRAAXCCVzuhw/7h2nRzRrp4R7e4Z0d7e0fwOXwG
f\nR6taw/qlNS1a21ajNW01Wt1Wc9XbugMAAACAUyiEAJS1VDqjg2cj2nUyW/zsypU/sWRGkhT0e7V2\nfo0+2rlQ18yv0boFtVreHJLP63E5OQAAAAC8exRCAMpGMp3RgTORfPGzq2dE+3pHFU9ly5+qgFfX\nzK/RxzYu0rr2Wl3bXqtlzSF5PTzyBQAAAGBuoRACMCdZa3V8MKrXTwzrjRMjeuPksHb3jOTLn1CF\nT2vn1+iTtyzOlz8dTdWUPwAAAADKAoUQgDlhIBLXmydHsgXQyWG9cWJYQ9GkJKnS79G69lp98pbF\num5Brda112pJY7U8lD8AAAAAyhSFEIBZZyKR1p5Tk+XPiH7xdlRnn/pnSZLHSCtbw/rltfN0/cI6\nXb+wVqtaw6z5AwAAAAAFKIQAlDRrrU4MTmjn8UG9emxYrx4f0lunx5TOWElSe11Qi2s8eqhrha5f\nWKd17bWqruC/2gAAAADgSvi3JgAlJZZMa3fPiHYeG9LOY0N69fiw+iNxSdl1f25YWKdPf2BZdvbP\nglq11FSqu7tbXR9Y5nJyAAAAAJg9KIQAuOr0SEyvHh/KF0B7To0omc7O/lnSWKX3r2jS+sX12rC4\nXitbwyz6DAAAAABFQCEEwDGZjNX+M2PafmRQrxwb0qvHhtQzPCFJqvB5dP2COv3ObR3asKhe6xfX\nqylU4XJiAAAAAJibKIQAzJhkOqNdPSPacWRQ248MasfRQY3GUpKkeTWV2rCkXr97W4fWL67X2rYa\nBXws/AwAAAAATqAQAlA0E4m0XjsxpO25Aui148OaSKYlSUubq/WhdW3a2NGgjR0NWlBf5XJaAAAA\nAChfFEIA3rWRiaR2HhvU9iND2n5kQLt6suv/GCOtmVej37xpoTZ2NOimJQ1qDvP4FwAAAACUCgoh\nANMWTaS04+iQXjzUr5cODWh3z4gyVvJ7jda11+p3b1uqmzsatH5xvWqDfrfjAgAAAAAug0IIwGXF\nU2m9dnxYLx4a0EuH+vX6iWEl01Z+r9GNC+v1yAdX6JalDbpxYb2CAa/bcQEAAAAA00QhBCAvlc7o\n0HBae545qJcODWjH0UHFUxl5jPIzgG5d1qjOJfWqCvBfHwAAAAAwW/FvdEAZs9bq0NmInn27Xy8c\n7NfLRwYViack7dfqeWF9/OZFunVZkzZ2NPAIGAAAAADMIRRCQJkZjib0/MF+Pfd2v547cFanRmKS\npI6mat1zw3zVxs/od+9+nxpDLAINAAAAAHMVhRAwxyXTGb1+YljPvX1WPz/QrzdPDstaKVzp023L\nm/TIB5v1vhVNWtiQ3Qa+u3uAMggAAAAA5jgKIWAOOj4Q1c8PnNVzb5/VS4cGNBZPyWOkGxbW6V/e\nvkLvW9Gs6xfUyuf1uB0VAAAAAOACCiFgDoin0tp+ZFA/e6tPz7zVp6MDUUlSe11Qd18/X+9f0aRb\nlzexDhAAAAAAQBKFEDBr9Y3G9Mz+Pv3srT49f6Bf44m0Knwe3bqsUQ++t0PvW9GkjqZqGWPcjgoA\nAAAAKDEUQsAskclYvdkzkp8FtKtnRJI0v7ZSv3Zju25f06L3LG1SMOB1OSkAAAAAoNRRCAElLJZM\n67kD/frJntN6Zn+f+iMJeYy0flG9/vDOVbp9TYtWtYaZBQQAAAAAeEcohIASMxJN6mf7z+gne87o\n52+fVTSRVrjSp65VLbp9dYs+sLJZ9dUBt2MCAAAAAGYxCiGgBJweiemf9p7W03vO6BeHB5TKWLWE\nK/Tr69t15zXzdHNHowI+dgQDAAAAABQHhRDgkiP94/rx7l49veeM3jgxLEla2lSth963VHde06rr\nF9TJ4+FRMAAAAABA8VEIAQ46PhDVk7tO6ck3erW3d1SSdP2CWv3hnat05zWtWt4SdjkhAAAAAKAc\nUAgBM+zkUFTbdvXqyTd79ebJ7M5gNy6q0xfuXqu7rp2n+XV
BlxMCAAAAAMoNhRAwA3pHJvSjN3v1\n3ZcmdOipZyRJ1y2o1R9/aLU+tK5NC+qrXE4IAAAAAChnFEJAkYzGkvrxrl794NUebT8yKElaFPbo\n325epbvXzdeiRkogAAAAAEBpoBACrkIyndFzB87qB6/26J/3nlE8ldHSpmp99o6Vuvu6Nh3f84q6\nupa7HRMAAAAAgPNQCAHvkLVWu3tG9YNXT+p/v3FKA+MJ1Vf59Zs3LdSvr1+g6xfUypjs7mDHXc4K\nAAAAAMClUAgB03R6JKZ/eO2k/uHVHh3siyjg9ej2NS2698Z2da1qUcDncTsiAAAAAADTQiEEXEEy\nndHP3urT93acUPf+PmWs1Lm4Xn9+77W6e9181Vb53Y4IAAAAAMA7RiEEXMKR/nF9b8cJ/f3Ok+qP\nxNUSrtCnPrBMH+1cqCVN1W7HAwAAAADgqkyrEDLGbJb0NUleSX9trf3SBfdrJf2tpEW5z/wLa+23\nipwVmFGxZFo/3t2rrdtP6OUjg/J6jDatatH9Ny1U16pm+bw8EgYAAAAAmBumLISMMV5JX5d0h6ST\nknYYY56w1u4tGPYZSXuttR82xjRL2m+M+TtrbWJGUgNFdHwgqr99+Zi+/8oJDUeTWtxYpT+8c5Xu\n27BArTWVbscDAAAAAKDopjNDaKOkg9baw5JkjNkq6R5JhYWQlRQ22a2VQpIGJaWKnBUomkzG6ucH\nzuo7Lx3TM/v75DFGd17Tqk/evFi3LG2Ux2PcjggAAAAAwIwx1torDzDmPkmbrbUP5c4fkHSztfaR\ngjFhSU9IWi0pLOk3rbU/usRnPSzpYUlqbW3dsHXr1mJ9D1dFIhGFQiG3Y7wj5Zp5PGn13MmUfnYi\nqb6oVU3AqGuhT10LfWqoLP4jYeX6d3YamZ1BZmeQ2RlkdgaZnUFmZ5DZGWR2Bpnntk2bNu201nZO\nNa5Yi0rfKel1SR+UtEzSPxljnrPWjhYOstZukbRFkjo7O21XV1eRfr27uru7Ndu+S7llPtgX0Tef\nP6L/9dpJxZIZ3bSkXv/+PUu0+Zp5M7pdfLn9nd1CZmeQ2RlkdgaZnUFmZ5DZGWR2BpmdQWZI0yuE\neiQtLDhfkLtW6EFJX7LZ6UYHjTFHlJ0ttL0oKYF3wVqrl48M6r8/e1g/fatPFT6P7r2xXb/1niVa\nO7/G7XgAAAAAALhmOoXQDkkrjDEdyhZB90v6+AVjjku6XdJzxphWSaskHS5mUGC6UumMtu0+rf/+\n7GHt6hlRQ3VA/+qXVuiBWxarMVThdjwAAAAAAFw3ZSFkrU0ZYx6R9LSy284/Zq3dY4z5VO7+o5L+\nTNLjxphdkoykz1lr+2cwN3CRSDylrduP61svHFXP8ISWNlXri/eu06+vb1el3+t2PAAAAAAASsa0\n1hCy1m4NAj8IAAAgAElEQVSTtO2Ca48WHJ+S9MvFjQZMz0g0qcdeOKJvvXBEo7GUbu5o0H/81Wv0\nwdUt7BYGAAAAAMAlFGtRacBxA5G4vvn8EX37pWOKxFP65bWt+sym5bp+YZ3b0QAAAAAAKGkUQph1\n+kZj2vLsYf3dy8cVS6X1oXVtemTTcq1pY6FoAAAAAACmg0IIs8ZwLKP/8MQe/Y/tx5XOWN1z/Xz9\ni03Ltbwl5HY0AAAAAABmFQohlLyh8YQe/fkhfev5CWV0TL+xfoH+xaZlWtxY7XY0AAAAAABmJQoh\nlKyxWFLffP6I/vq5IxpPpHRLm1df+sT7KIIAAAAAALhKFEIoOcl0Rt956Zj+688OaCia1J3XtOqz\nd6xS71s7KYMAAAAAACgCCiGUDGutfrqvT1/ctk+H+8f13uWN+tzm1bpuQXbXsN63XA4IAAAAAMAc\nQSGEkrCvd1T/+Ud79cLBAS1trtZjv92pTataZIxxOxoAAAAAAHMOhRBcdXYsrq/+ZL++/8oJ1QT9\n+g8fXqtP3LJYfq/H7Wg
AAAAAAMxZFEJwRTpj9be/OKa/eHq/JpJpPfjeDv3+B1eotsrvdjQAAAAA\nAOY8CiE47o0Tw/p3P9yl3T2jum15k/7jPddoWXPI7VgAAAAAAJQNCiE4ZiSa1Fd+8pb+7uXjag5V\n6L9+7EbdfV0b6wQBAAAAAOAwCiHMOGuttu06rT99YrcGxxP67VuX6LN3rFS4ksfDAAAAAABwA4UQ\nZlTfWExf+OFuPb3njNa11+rxBzfq2vZat2MBAAAAAFDWKIQwI6y1+sGrPfqzJ/dqIpnW5+9arYdu\n65CP3cMAAAAAAHAdhRCKbiAS1+d+8Kb+eV+fOhfX68v3Xcei0QAAAAAAlBAKIRTV8wf69dnvv67h\naFJfuHutHrx1iTweFo0GAAAAAKCUUAihKBKpjL76T/u15dnDWtpUrccf3Ki182vcjgUAAAAAAC6B\nQghX7Wj/uH5/62t68+SIPrZxkf7k7rUKBrxuxwIAAAAAAJdBIYSr8tTuXv3B/3xTHiN94xPrdde6\nNrcjAQAAAACAKVAI4V1JpTP6yk/267/9/LCuX1inv/rEerXXBd2OBQAAAAAApoFCCO9YfySu3//u\na3rx0IA+cfMi/cmH16rCxyNiAAAAAADMFhRCeEfePDms3/vOTg2OJ/SV+67TRzoXuh0JAAAAAAC8\nQxRCmLandvfqX33vdTVWV+gHn75V17bXuh0JAAAAAAC8CxRCmJK1VluePawvPfWWblhYpy0PdKo5\nXOF2LAAAAAAA8C5RCOGKkumM/uQfd+u720/oV65r01c/cr0q/awXBAAAAADAbEYhhMuKxFP69N/u\n1HMH+vWZTcv0b+5YJY/HuB0LAAAAAABcJQohXNLgeEIPfmu7dp8a1X/5jev00ZtYPBoAAAAAgLmC\nQggX6R2Z0APf3K4Tg1H9t09u0C+tbXU7EgAAAAAAKCIKIZzn8NmIHvjmdo1OJPXt39mom5c2uh0J\nAAAAAAAUGYUQ8g6djej+Lb9QJmP13YdvYVt5AAAAAADmKAohSDpXBllrtfXhW7SiNex2JAAAAAAA\nMEM8bgeA+wrLoO/+35RBAAAAAADMdRRCZe5o/zhlEAAAAAAAZYZHxsrYmdGYPvnNl5XOWH2Px8QA\nAAAAACgbzBAqU+NJq9/65nYNjSf0+IM3UQYBAAAAAFBGmCFUhqKJlP7fnTEdH5O+9eBNum5BnduR\nAAAAAACAg5ghVGbSGatH/sdrOjSc0dfuv0HvXd7kdiQAAAAAAOCwaRVCxpjNxpj9xpiDxpjPX+L+\nHxpjXs/97DbGpI0xDcWPi6v1xW379LO3+vTA2oDuWtfmdhwAAAAAAOCCKQshY4xX0tcl3SVpraSP\nGWPWFo6x1n7FWnuDtfYGSX8k6efW2sGZCIx377vbj+ubzx/Rg+9dog8u8rsdBwAAAAAAuGQ6M4Q2\nSjporT1srU1I2irpniuM/5ik7xYjHIrnF4cH9IUf7tYHVjbr331ojdtxAAAAAACAi4y19soDjLlP\n0mZr7UO58wck3WytfeQSY6sknZS0/FIzhIwxD0t6WJJaW1s3bN269eq/QQmIRCIKhUJux7isoVhG\nf/rihKr9Rl+4Jagqvyn5zJdCZmeQ2RlkdgaZnUFmZ5DZGWR2BpmdQWZnkNkZszGzWzZt2rTTWts5\n1bhi7zL2YUkvXO5xMWvtFklbJKmzs9N2dXUV+de7o7u7W6X6XRKpjO7f8pJSSug7v/deLW/Jbi9f\nypkvh8zOILMzyOwMMjuDzM4gszPI7AwyO4PMziCzM2Zj5lI3nUKoR9LCgvMFuWuXcr94XKykfHHb\nPr16fFh/+fEb82UQAAAAAAAob9NZQ2iHpBXGmA5jTEDZ0ueJCwcZY2olfUDSPxY3It6tp3b36vEX\nj+p33tuhu6+b73YcAAAAAABQIqacIWStTRljHpH0tCSvpMestXuMMZ/K3X80N/ReST+x1
o7PWFpM\nW+/IhD73g11a116rz9+12u04AAAAAACghExrDSFr7TZJ2y649ugF549LerxYwfDuZTJWn/3eG0qk\nMvra/Tco4JvORDAAAAAAAFAuir2oNErAlucO66XDA/ryb6zT0mZWYQcAAAAAAOdj6sgc8/aZMX31\nJ/t15zWt+mjnwqnfAAAAAAAAyg6F0BySzlj9279/U6EKn/783nUyxrgdCQAAAAAAlCAeGZtDvvXC\nEb1+Ylhfu/8GNYUq3I4DAAAAAABKFDOE5ohjA+P6i5/s1+2rW/Sr17PFPAAAAAAAuDwKoTnAWqt/\n/8Pd8ns8+s/3XsujYgAAAAAA4IoohOaAp3af1nMH+vXZX16pttqg23EAAAAAAECJoxCa5SYSaf3Z\nk3u1el5YD9yy2O04AAAAAABgFqAQmuW+/sxBnRqJ6T/dc618Xv5xAgAAAACAqdEgzGJH+8e15dnD\nuvfGdm3saHA7DgAAAAAAmCUohGaxLz/1lnxeoz+6a7XbUQAAAAAAwCxCITRL7Tw2qB/vPq3fe/8y\ntdRUuh0HAAAAAADMIhRCs5C1Vl/c9paawxV66H0dbscBAAAAAACzDIXQLPT0ntPaeWxIn71jpaor\nfG7HAQAAAAAAswyF0CyTSmf05af2a0VLSB/ZsMDtOAAAAAAAYBaiEJpl/vebp3Skf1x/cOcqtpkH\nAAAAAADvCo3CLJLJWH39mUNa1RrWHWta3Y4DAAAAAABmKQqhWeTpPad1sC+iz3xwuTwe43YcAAAA\nAAAwS1EIzRLWWv3lMwfV0VStX1nX5nYcAAAAAAAwi1EIzRLdb5/VnlOj+nTXMnmZHQQAAAAAAK4C\nhdAs8Y3uQ2qvC+reG9vdjgIAAAAAAGY5CqFZYF/vqLYfGdT/deti+dlZDAAAAAAAXCXahVngb148\nqkq/Rx/tXOh2FAAAAAAAMAdQCJW44WhCP3y9R/fe2K66qoDbcQAAAAAAwBxAIVTivv/KCcWSGf3W\ne5a4HQUAAAAAAMwRFEIlLJ2x+vZLx7Sxo0Fr2mrcjgMAAAAAAOYICqES9sxbfTo5NKHfvnWJ21EA\nAAAAAMAcQiFUwrbuOKHmcIXuWNvqdhQAAAAAADCHUAiVqL6xmJ7Z36ffWL+AreYBAAAAAEBR0TSU\nqP/1ao/SGauPdC5wOwoAAAAAAJhjKIRKkLVW33/lhDoX12tZc8jtOAAAAAAAYI6hECpBr58Y1qGz\n48wOAgAAAAAAM4JCqAQ9+WavAl6P7lrX5nYUAAAAAAAwB1EIlZhMxupHb/bq/SubVVPpdzsOAAAA\nAACYgyiESszO40M6PRrTh69ndhAAAAAAAJgZFEIl5kdv9qrC59Hta1rdjgIAAAAAAOYoCqESkslY\nbdvVq02rWhSq8LkdBwAAAAAAzFEUQiXkjZPD6huLa/O189yOAgAAAAAA5rBpFULGmM3GmP3GmIPG\nmM9fZkyXMeZ1Y8weY8zPixuzPPzsrT55jNS1qtntKAAAAAAAYA6b8rkkY4xX0tcl3SHppKQdxpgn\nrLV7C8bUSforSZuttceNMS0zFXgu++m+PnUublBdVcDtKAAAAAAAYA6bzgyhjZIOWmsPW2sTkrZK\nuueCMR+X9A/W2uOSZK3tK27Mua93ZEJ7e0f1wTV0aQAAAAAAYGYZa+2VBxhzn7Izfx7KnT8g6WZr\n7SMFY/4/SX5J10gKS/qatfbbl/ishyU9LEmtra0btm7dWqzv4apIJKJQKHRVn/HM8aT+Zm9Cf35b\nUO2hmV/aqRiZnUZmZ5DZGWR2BpmdQWZnkNkZZHYGmZ1BZmeQ2RmzMbNbNm3atNNa2znVuGJtZeWT\ntEHS7ZKCkl4yxvzCWvt24SBr7RZJWySps7PTdnV1FenXu6u7u1tX+12+8/gOLWwY08d/ZZOMMcUJ\ndgXFyOw0MjuDzM4gszPI7AwyO4PMziCzM8jsDDI7g
8zOmI2ZS910pqL0SFpYcL4gd63QSUlPW2vH\nrbX9kp6VdH1xIs59yXRGLx0e0AdWNjtSBgEAAAAAgPI2nUJoh6QVxpgOY0xA0v2SnrhgzD9Kus0Y\n4zPGVEm6WdK+4kadu944MaxoIq33LmtyOwoAAAAAACgDUz4yZq1NGWMekfS0JK+kx6y1e4wxn8rd\nf9Rau88Y85SkNyVlJP21tXb3TAafS148NCBjpFuWNrodBQAAAAAAlIFprSFkrd0madsF1x694Pwr\nkr5SvGjl48VD/VrbVqP6arabBwAAAAAAM2/mt7PCFcWSab16bFi3LmN2EAAAAAAAcAaFkMt2HhtS\nIp3RrawfBAAAAAAAHEIh5LIXD/XL5zG6qaPB7SgAAAAAAKBMUAi5bMeRIV3TXqtQxbSWcwIAAAAA\nALhqFEIuSqQyeuPksDoX17sdBQAAAAAAlBEKIRft7R1VPJXRBgohAAAAAADgIAohF+08NiRJFEIA\nAAAAAMBRFEIuevXYkNrrgmqtqXQ7CgAAAAAAKCMUQi6x1uqVY4PMDgIAAAAAAI6jEHLJqZGYzozG\n1bmEQggAAAAAADiLQsglk+sHrV9EIQQAAAAAAJxFIeSS3T0jCvg8WjUv7HYUAAAAAABQZiiEXLL3\n1KhWtYbl9/KPAAAAAAAAOIs2wgXWWu3tHdXathq3owAAAAAAgDJEIeSCM6NxDY4ndE07hRAAAAAA\nAHAehZAL9vaOSBIzhAAAAAAAgCsohFyw99SoJGk1hRAAAAAAAHABhZAL9vaOakljlUIVPrejAAAA\nAACAMkQh5IK9p0a1dj6zgwAAAAAAgDsohBwWiad0dCDK+kEAAAAAAMA1FEIO2386u37QGgohAAAA\nAADgEgohh719JiJJWtkadjkJAAAAAAAoVxRCDjvYF1Gl36P2uqDbUQAAAAAAQJmiEHLYgb6IlreE\n5PEYt6MAAAAAAIAyRSHksINnxrSihcfFAAAAAACAeyiEHDQWS+rUSEzLW0JuRwEAAAAAAGWMQshB\nh86OS5JWUAgBAAAAAAAXUQg56MCZMUlihhAAAAAAAHAVhZCDDvZFFPB6tKihyu0oAAAAAACgjFEI\nOehgX0RLm6vl8/JnBwAAAAAA7qGZcNDklvMAAAAAAABuohBySCyZ1omhKIUQAAAAAABwHYWQQ44N\nRGWt1NFU7XYUAAAAAABQ5iiEHHJ0ILvl/JJGCiEAAAAAAOAuCiGHHO3PFULMEAIAAAAAAC6jEHLI\n0YGoGqoDqg363Y4CAAAAAADKHIWQQ472j2txY5XbMQAAAAAAACiEnHJ0YFwdrB8EAAAAAABKwLQK\nIWPMZmPMfmPMQWPM5y9xv8sYM2KMeT338yfFjzp7xZJp9Y7EWD8IAAAAAACUBN9UA4wxXklfl3SH\npJOSdhhjnrDW7r1g6HPW2rtnIOOsd2wgKokFpQEAAAAAQGmYzgyhjZIOWmsPW2sTkrZKumdmY80t\n57acZw0hAAAAAADgPmOtvfIAY+6TtNla+1Du/AFJN1trHykY0yXpH5SdQdQj6Q+stXsu8VkPS3pY\nklpbWzds3bq1SF/DXZFIRKFQ6LL3tx1J6Pv7k/qr26tU5TcOJru8qTKXIjI7g8zOILMzyOwMMjuD\nzM4gszPI7AwyO4PMzpiNmd2yadOmndbazqnGTfnI2DS9KmmRtTZijPmQpB9KWnHhIGvtFklbJKmz\ns9N2dXUV6de7q7u7W1f6Lk8PvqnG6jP60B2bnAs1hakylyIyO4PMziCzM8jsDDI7g8zOILMzyOwM\nMjuDzM6YjZlL3XQeGeuRtLDgfEHuWp61dtRaG8kdb5PkN8Y0FS3lLHe0P8qW8wAAAAAAoGRMpxDa\nIWmFMabDGBOQdL+kJwoHGGPmGWNM7nhj7nMHih12tjo+GNVitpwHAAAAAAAlYspHxqy1KWPMI5Ke\nluSV9Ji1do8x5
lO5+49Kuk/Sp40xKUkTku63Uy1OVCbSGavTozG11wXdjgIAAAAAACBpmmsI5R4D\n23bBtUcLjv9S0l8WN9rccGY0pnTGaj6FEAAAAAAAKBHTeWQMV+HU8IQkqb2eQggAAAAAAJQGCqEZ\n1jNZCNVVupwEAAAAAAAgi0Johk0WQjwyBgAAAAAASgWF0AzrGZpQfZVfVYFpLdcEAAAAAAAw4yiE\nZtip4QnWDwIAAAAAACWFQmiG9QxPaH4thRAAAAAAACgdFEIzyFqrnqEJ1g8CAAAAAAAlhUJoBo1O\npDSeSGsBj4wBAAAAAIASQiE0g04ORyWxwxgAAAAAACgtFEIz6NRwTJLUTiEEAAAAAABKCIXQDOoZ\nYoYQAAAAAAAoPRRCM+jUSEwBn0dNoYDbUQAAAAAAAPIohGZQ70hMbbWVMsa4HQUAAAAAACCPQmgG\n9Y3G1BKucDsGAAAAAADAeSiEZtDZSFwt4Uq3YwAAAAAAAJyHQmgGnR2Lq5kZQgAAAAAAoMRQCM2Q\nWDKtsViKQggAAAAAAJQcCqEZcnYsLkkUQgAAAAAAoORQCM2QvrGYJAohAAAAAABQeiiEZsjkDCF2\nGQMAAAAAAKWGQmiG8MgYAAAAAAAoVRRCM+TsWFweIzVWUwgBAAAAAIDSQiE0Q/rHE6qvCsjrMW5H\nAQAAAAAAOA+F0AwZjCRUXx1wOwYAAAAAAMBFKIRmyOB4Qg0UQgAAAAAAoARRCM2QwWhCjRRCAAAA\nAACgBFEIzRBmCAEAAAAAgFJFITQD0hmrIWYIAQAAAACAEkUhNAOGowlZKxaVBgAAAAAAJYlCaAYM\njickiUfGAAAAAABASaIQmgGThVBjdYXLSQAAAAAAAC5GITQDmCEEAAAAAABKGYXQDBigEAIAAAAA\nACWMQmgGjEwkJUl1VX6XkwAAAAAAAFyMQmgGDEcTCvq9qvR73Y4CAAAAAABwEQqhGTAcTTI7CAAA\nAAAAlCwKoRkwPJFUbZBCCAAAAAAAlCYKoRkwwgwhAAAAAABQwqZVCBljNhtj9htjDhpjPn+FcTcZ\nY1LGmPuKF3H2GZ5IqC7IDmMAAAAAAKA0TVkIGWO8kr4u6S5JayV9zBiz9jLjvizpJ8UOOduwhhAA\nAAAAAChl05khtFHSQWvtYWttQtJWSfdcYtz/I+kHkvqKmG/WsdZm1xCiEAIAAAAAACVqOoVQu6QT\nBecnc9fyjDHtku6V9I3iRZudYsmMEqkMj4wBAAAAAICSZay1Vx6QXQ9os7X2odz5A5JuttY+UjDm\nf0r6qrX2F8aYxyU9aa39+0t81sOSHpak1tbWDVu3bi3aF3FTJBJRKBSSJA3GMvps94QevCagDyws\n3VlChZlnCzI7g8zOILMzyOwMMjuDzM4gszPI7AwyO4PMzpiNmd2yadOmndbazikHWmuv+CPpPZKe\nLjj/I0l/dMGYI5KO5n4iyj429mtX+twNGzbYueKZZ57JH+/pGbGLP/ek/fGuU+4FmobCzLMFmZ1B\nZmeQ2RlkdgaZnUFmZ5DZGWR2BpmdQWZnzMbMbpH0ip2i67HWyjeNcmmHpBXGmA5JPZLul/TxC0ql\njsnjghlCP5zGZ885wxMJSVItj4wBAAAAAIASNWUhZK1NGWMekfS0JK+kx6y1e4wxn8rdf3SGM84q\nI9GkJLHLGAAAAAAAKFnTmSEka+02SdsuuHbJIsha+9tXH2v2Gp6gEAIAAAAAAKVtOruM4R0Ynpwh\nxCNjAAAAAACgRFEIFdnwREIBn0eVfv60AAAAAACgNNFaFNlINKm6oF/GGLejAAAAAAAAXBKFUJEN\nR5OsHwQAAAAAAEoahVCRDU8kWD8IAAAAAACUNAqhIhuOJlXLDCEAAAAAAFDCKISKbGQiu4YQAAAA\nAABAqaIQKjLWEAIAAAAAAKWOQqiIYsm0JpJp1VWxhhAAAAAAAChdFEJFNDqRlCT
V8sgYAAAAAAAo\nYRRCRTScK4TqmSEEAAAAAABKGIVQEY3kCqGaoM/lJAAAAAAAAJdHIVREkVhKkhSqoBACAAAAAACl\ni0KoiCJxCiEAAAAAAFD6KISKaHyyEKqkEAIAAAAAAKWLQqiIJmcIVf+f9u4+xrLzrg/49+dZ7/pl\nEwIJ3aZ2gm0RAhGkeVmFUJLUboDatGBSQDUqgapEFlVTkSJUGSFFIP5pqrZqq1IsN0nVliYrStNi\n0ZTwUhYqRYDj4Dh2HAfnhcQmsUNKSWabzOvTP+5ZezyZOzvruZz7HM/nI4323nOPsl99Pdk79zfP\nc44VQgAAAEDHDIQW6PGB0HEDIQAAAKBfBkILtPqlzVxxfCUrl9SyowAAAADMZSC0QOfWN20XAwAA\nALpnILRAq2tb7jAGAAAAdM9AaIFWv7RhIAQAAAB0z0Bogc6tbeXKEyvLjgEAAACwLwOhBVpd27RC\nCAAAAOiegdACGQgBAAAAU2AgtEDn1txlDAAAAOifgdACWSEEAAAATIGB0IJsbG1nbXPbQAgAAADo\nnoHQgpxb20wSW8YAAACA7hkILcgXvjQbCFkhBAAAAPTOQGhBzq0PA6HLDIQAAACAvhkILYgtYwAA\nAMBUGAgtyBNbxlaWnAQAAABgfwZCC3JubStJcvLEpUtOAgAAALA/A6EFeWLLmBVCAAAAQN8MhBbk\nC8NA6BlWCAEAAACdMxBaECuEAAAAgKkwEFqQ1bXNnDh2SY6tqBQAAADom+nFgqyubeYZl7nlPAAA\nANC/Aw2EqurGqnqwqh6qqtv2eP3mqrq3qu6pqvdV1asWH7Vv59Y2c+UJAyEAAACgfxecYFTVSpKf\nS/LtSR5OcldV3dla+9CO034zyZ2ttVZVL07yi0m+/s8jcK9Wv7SZK48bCAEAAAD9O8gKoVckeai1\n9rHW2nqSM0lu3nlCa221tdaGp1cmaTliVtc2c9KWMQAAAGACDjIQuirJp3Y8f3g49iRV9bqq+nCS\n/5Hk7y0m3nScW9/MSVvGAAAAgAmoJxb2zDmh6vuS3Nhae8Pw/PVJvrm19sY5578myZtba9+2x2u3\nJrk1SU6dOvXyM2fOHDJ+H1ZXV/Oz778k1zzzkvz9l1y27DgHsrq6mpMnTy47xkWReRwyj0Pmccg8\nDpnHIfM4ZB6HzOOQeRwyj2OKmZflhhtuuLu1dvpC5x1kScsjSZ634/nVw7E9tdZ+p6quq6rntNb+\nZNdrdyS5I0lOnz7drr/++gP89f07e/Zsti7ZzHXPP5Xrr/+mZcc5kLNnz2Zq/cs8DpnHIfM4ZB6H\nzOOQeRwyj0Pmccg8DpnHMcXMvTvIlrG7krygqq6tquNJbkly584Tquprq6qGxy9LciLJ5xYdtmfn\n1jZz8sTKsmMAAAAAXNAFVwi11jar6o1J3pNkJcnbW2v3V9WPDq/fnuR7k/xQVW0k+WKSv90utBft\naWS7tXxxY8tt5wEAAIBJONAEo7X27iTv3nXs9h2P35LkLYuNNh1rW7M/rzhuhRAAAADQv4NsGeMC\nNrZnf544ZiAEAAAA9M9AaAE2tma7404cUycAAADQPxOMBTi/QuiyS60QAgAAAPpnILQAT2wZUycA\nAADQPxOMBVgftoxZIQQAAABMgYHQAlghBAAAAEyJCcYCPH5R6UvVCQAAAPTPBGMB1t12HgAAAJgQ\nA6EFeOIuY+oEAAAA+meCsQCPbxmzQggAAACYAAOhBXj8otJWCAEAAAATYIKxAE9sGbNCCAAAAOif\ngdACrD++ZUydAAAAQP9MMBZgYzupSo6vqBMAAADonwnGAqxvzVYHVdWyowAAAABckIHQAmxsN3cY\nAwAAACbDQGgBNraTy9xhDAAAAJgIU4wF2NiyQggAAACYDgOhBdjYdocxAAAAYDpMMRZgfTu57FIr\nhAAAAIBpMBBagNmWMVUCAAAA02CKsQAbVgg
BAAAAE2IgtACuIQQAAABMiSnGAmxstZxw23kAAABg\nIkwxFmBjO7nMbecBAACAiTAQWoD17VghBAAAAEyGKcYCzO4yZoUQAAAAMA0GQguwuZ0cd1FpAAAA\nYCJMMRZgYzs5vqJKAAAAYBpMMQ5pa7ulJbnUQAgAAACYCFOMQ1rf3E5iyxgAAAAwHaYYh2QgBAAA\nAEyNKcYhrW1tJTEQAgAAAKbDFOOQzq8QOuEaQgAAAMBEmGIcki1jAAAAwNSYYhzSxlZL4i5jAAAA\nwHSYYhySFUIAAADA1JhiHNK6i0oDAAAAE2OKcUhr51cI2TIGAAAATMSBphhVdWNVPVhVD1XVbXu8\n/neq6t6q+mBVvbeq/vLio/bJljEAAABgai44xaiqlSQ/l+SmJC9K8gNV9aJdp308yV9trX1Tkp9N\ncseig/Zq3QohAAAAYGIOMsV4RZKHWmsfa62tJzmT5OadJ7TW3tta+9Ph6e8muXqxMft1/i5jVggB\nAAAAU1Gttf1PqPq+JDe21t4wPH99km9urb1xzvk/keTrz5+/67Vbk9yaJKdOnXr5mTNnDhl/+d77\nx5u54961/JNXX56/eOV0hkKrq6s5efLksmNcFJnHIfM4ZB6HzOOQeRwyj0Pmccg8DpnHIfM4pph5\nWZbqCBcAAAt5SURBVG644Ya7W2unL3TesUX+pVV1Q5IfSfKqvV5vrd2RYTvZ6dOn2/XXX7/Iv34p\nHr3rk8m9H8yrv/VbctWzLl92nAM7e/Zspta/zOOQeRwyj0Pmccg8DpnHIfM4ZB6HzOOQeRxTzNy7\ngwyEHknyvB3Prx6OPUlVvTjJW5Pc1Fr73GLi9c81hAAAAICpOcgU464kL6iqa6vqeJJbkty584Sq\nen6SdyV5fWvtI4uP2a81dxkDAAAAJuaCK4Raa5tV9cYk70mykuTtrbX7q+pHh9dvT/LmJM9O8m+r\nKkk2D7Jf7elgfcsKIQAAAGBaDnQNodbau5O8e9ex23c8fkOSL7uI9FGwsekuYwAAAMC0mGIc0vrW\nVi6pZOWSWnYUAAAAgAMxEDqk9c3tWBwEAAAATIlRxiGtb27nUi0CAAAAE2KUcUjrW9s5ZrsYAAAA\nMCEGQoe0trmdY+ZBAAAAwIQYCB3SxlZzDSEAAABgUowyDml9c8s1hAAAAIBJMco4pNldxuwZAwAA\nAKbDQOiQZheVXnYKAAAAgIMzyjgkt50HAAAApsYo45DWt1pWbBkDAAAAJsRA6JCsEAIAAACm5tiy\nA0zd1506mc0/++KyYwAAAAAcmLUth/Svbnlpvv+Fx5cdAwAAAODADIQAAAAAjhgDIQAAAIAjxkAI\nAAAA4IgxEAIAAAA4YgyEAAAAAI4YAyEAAACAI8ZACAAAAOCIMRACAAAAOGIMhAAAAACOGAMhAAAA\ngCPGQAgAAADgiDEQAgAAADhiDIQAAAAAjhgDIQAAAIAjxkAIAAAA4IgxEAIAAAA4YgyEAAAAAI4Y\nAyEAAACAI6Zaa8v5i6s+m+SPlvKXL95zkvzJskNcJJnHIfM4ZB6HzOOQeRwyj0Pmccg8DpnHIfM4\nZH56+5rW2ldf6KSlDYSeTqrqfa2108vOcTFkHofM45B5HDKPQ+ZxyDwOmcch8zhkHofM45CZxJYx\nAAAAgCPHQAgAAADgiDEQWow7lh3gKZB5HDKPQ+ZxyDwOmcch8zhkHofM45B5HDKPQ2ZcQwgAAADg\nqLFCCAAAAOCIMRACAAAAOGIMhA6hqm6sqger6qGqum3Zeeapqk9U1Qer6p6qet9w7Kuq6ter6g+H\nP7+yg5xvr6rHquq+Hcfm5qyqnxy6f7Cq/npHmX+6qh4Z+r6nqr6zl8xV9byq+q2q+lBV3V9VPzYc\n77bnfTL33PNlVfX7VfWBIfPPDMd77nle5m573pFjpar+oKp+ZXjebc/7ZO6654t9H+k4c+89P6uq\nfqmqPlx
VD1TVt0yg570yd9tzVb1wR657qurzVfWmnnveJ3O3PQ8Z/tHwfnJfVb1zeJ/ptud9Mvfe\n848Nee+vqjcNx3rvea/M3fVcC/psUlUvr9n70UNV9a+rqnrIXFXXVNUXd3R+e0eZv3/4/tiuqtO7\nzl96z08rrTVfT+EryUqSjya5LsnxJB9I8qJl55qT9RNJnrPr2D9Nctvw+LYkb+kg52uSvCzJfRfK\nmeRFQ+cnklw7/LdY6STzTyf5iT3OXXrmJM9N8rLh8TOSfGTI1W3P+2TuuedKcnJ4fGmS30vyys57\nnpe52553ZPnxJO9I8ivD82573idz1z3nIt5HOs/ce8//IckbhsfHkzxrAj3vlbnrnnfkWUnymSRf\n03vPczJ323OSq5J8PMnlw/NfTPJ3e+55n8w99/yNSe5LckWSY0l+I8nXdt7zvMzd9ZwFfTZJ8vuZ\n/UxVSf5nkps6yXzNzvN2/e8sO/M3JHlhkrNJTh/k+2HMzE+nLyuEnrpXJHmotfax1tp6kjNJbl5y\npotxc2Y/xGX483uWmCVJ0lr7nST/Z9fheTlvTnKmtbbWWvt4kocy+28yqjmZ51l65tbap1tr7x8e\nfyHJA5n9ANRtz/tknqeHzK21tjo8vXT4aum753mZ51l65iSpqquT/I0kb92Vrcuek7mZ5+ki8xxd\n93yRlp65qr4isx+K35YkrbX11tr/Tcc975N5nqVn3uW1ST7aWvujdNzzLjszz9NL5mNJLq+qY5l9\n+P/j9N/zXpnn6SHzNyT5vdba/2utbSb57SR/K333PC/zPEvLvIjPJlX13CTPbK39bmutJfmP+XP8\n3HWRmffUQ+bW2gOttQf3OL2Lnp9ODISeuquSfGrH84ez/4fUZWpJfqOq7q6qW4djp1prnx4efybJ\nqeVEu6B5OXvv/x9W1b3DEsjzS0m7ylxV1yR5aWYrQSbR867MScc912xL0D1JHkvy66217nuekznp\nuOck/zLJP06yveNY1z1n78xJ3z1fzPtIz5mTfnu+Nslnk/z7mm0nfGtVXZm+e56XOem3551uSfLO\n4XHPPe+0M3PSac+ttUeS/LMkn0zy6SR/1lr7tXTc8z6Zk057zmylzaur6tlVdUWS70zyvHTcc+Zn\nTvrteaeL7faq4fHu42Pa73PftcN2sd+uqlcPx3rIPE/PPU+SgdDR8KrW2kuS3JTkH1TVa3a+OExR\n91sJ0IWp5Ezy85ltJXxJZj9Q/PPlxvlyVXUyyX9N8qbW2ud3vtZrz3tk7rrn1trW8P+7qzP7zcU3\n7nq9u57nZO6256r6m0kea63dPe+c3nreJ3O3PQ+m+D6yV+aeez6W2ZL5n2+tvTTJucyW9j+uw57n\nZe655yRJVR1P8t1J/svu1zrsOcmembvtefgwf3NmQ8O/lOTKqvrBnef01vM+mbvtubX2QJK3JPm1\nJL+a5J4kW7vO6arnfTJ32/M8vXV7ELsyfzrJ84f3yh9P8o6qeubSwrEUBkJP3SN5YpqdzD5APbKk\nLPsafuOR1tpjSf5bZsssHx2W1p1fFvjY8hLua17ObvtvrT06fLDeTvLv8sSy1i4yV9WlmQ1W/nNr\n7V3D4a573itz7z2fN2yf+K0kN6bzns/bmbnznr81yXdX1Scy27b716rqF9J3z3tm7rzni30f6TZz\n5z0/nOThHSvzfimzYUvPPe+ZufOez7spyftba48Oz3vu+bwnZe68529L8vHW2mdbaxtJ3pXkr6Tv\nnvfM3HnPaa29rbX28tbaa5L8aWbXWuy55z0z997zDhfb7SPD493Hx7Rn5mHb1eeGx3dndj2er0sf\nmefpuedJMhB66u5K8oKqunb4jc0tSe5ccqYvU1VXVtUzzj9O8h2ZLdW8M8kPD6f9cJJfXk7CC5qX\n884kt1TViaq6NskLMruQ2NKd/wd38LrM+k46yDxcbf9tSR5orf2LHS912
/O8zJ33/NVV9azh8eVJ\nvj3Jh9N3z3tm7rnn1tpPttaubq1dk9m/wf+rtfaD6bjneZl77vkpvI90m7nnnltrn0nyqap64XDo\ntUk+lI57npe55553+IE8eetVtz3v8KTMnff8ySSvrKorhvfx12Z2DcCee94zc+c9p6r+wvDn8zO7\nFs870nfPe2buvecdLqrbYavW56vqlcP31Q9l/M9de2YefvZbGR5fN2T+WCeZ5+m552lqHVzZeqpf\nme15/Uhm09SfWnaeORmvy+xK7B9Icv/5nEmeneQ3k/xhZlf3/6oOsr4zs6WLG5n91vFH9suZ5KeG\n7h/Mkq4iPyfzf0rywST3ZvaP1nN7yZzkVZktE703syW69wzfx932vE/mnnt+cZI/GLLdl+TNw/Ge\ne56Xudued+W/Pk/csavbnvfJ3G3PeQrvIx1n7rbnIcNLkrxvyPffk3xlzz3vk7n3nq9M8rkkX7Hj\nWO8975W5955/JrNfhtw3ZD0xgZ73ytx7z/87s+HxB5K8diLfz3tl7q7nLOizSZLTw/fUR5P8myTV\nQ+Yk35vZe+Q9Sd6f5Ls6yvy64fFakkeTvKennp9OXzWUBwAAAMARYcsYAAAAwBFjIAQAAABwxBgI\nAQAAABwxBkIAAAAAR4yBEAAAAMARYyAEAAAAcMQYCAEAAAAcMf8f6Pfh6e7YOB0AAAAASUVORK5C\nYII=\n", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "import matplotlib.pyplot as plt\n", "from sklearn.decomposition import PCA\n", "%matplotlib inline\n", "\n", "tags=pd.read_csv('/home/david/movielens/ml-latest/ml-latest/genome-scores.csv')\n", "tags_wide=tags.pivot(index='movieId', columns='tagId', values='relevance')\n", "tags_wide=tags_wide.fillna(0)\n", "pca=PCA(svd_solver='full')\n", "pca.fit(tags_wide)\n", "\n", "plt.figure(figsize=(20,8))\n", "plt.plot(np.cumsum(pca.explained_variance_ratio_))\n", "plt.yticks(np.arange(0.2, 1.1, .1))\n", "plt.xticks(np.arange(0, 1128, 50))\n", "plt.grid()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "From this figure, it seems that 50 tags or so would be a good number to include." 
] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [], "source": [ "tags_pca=pd.DataFrame(pca.transform(tags_wide)[:,:50])\n", "tags_pca.columns=[\"pc\"+str(x) for x in tags_pca.columns.values]\n", "tags_pca['movieId']=tags_wide.index\n", "movies=pd.merge(movies,tags_pca,how='inner',on='movieId')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The year is converted into a discrete variable using the same criteria as in the original paper - the ratings were taken around the year 2000 so it makes sense to use these limits, in order to identify what were more recent movies at that time (which comprise the majority of the ratings)." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " genreMystery genreSci-Fi genreCrime genreDrama genreAnimation \\\n", "movieId \n", "1 0.0 0.0 0.0 0.0 1.0 \n", "2 0.0 0.0 0.0 0.0 0.0 \n", "3 0.0 0.0 0.0 0.0 0.0 \n", "4 0.0 0.0 0.0 1.0 0.0 \n", "5 0.0 0.0 0.0 0.0 0.0 \n", "\n", " genreIMAX genreAction genreComedy genreDocumentary genreWar \\\n", "movieId \n", "1 0.0 0.0 1.0 0.0 0.0 \n", "2 0.0 0.0 0.0 0.0 0.0 \n", "3 0.0 0.0 1.0 0.0 0.0 \n", "4 0.0 0.0 1.0 0.0 0.0 \n", "5 0.0 0.0 1.0 0.0 0.0 \n", "\n", " ... Year_1999 Year_40s Year_50s Year_60s Year_70s \\\n", "movieId ... \n", "1 ... 0 0 0 0 0 \n", "2 ... 0 0 0 0 0 \n", "3 ... 0 0 0 0 0 \n", "4 ... 0 0 0 0 0 \n", "5 ... 0 0 0 0 0 \n", "\n", " Year_80s Year_<1940 Year_>=2000 Year_low90s Year_unknown \n", "movieId \n", "1 0 0 0 0 0 \n", "2 0 0 0 0 0 \n", "3 0 0 0 0 0 \n", "4 0 0 0 0 0 \n", "5 0 0 0 0 0 \n", "\n", "[5 rows x 83 columns]" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "## these criteria for making year discrete were taken from the same paper describing the method\n", "def discretize_year(x):\n", " if x=='unknown':\n", " return x\n", " else:\n", " x=int(x)\n", " if x>=2000:\n", " return '>=2000'\n", " if x>=1995 and x<=1999:\n", " return str(x)\n", " if x>=1990 and x<=1994:\n", " return 'low90s'\n", " if x>=1980 and x<=1989:\n", " return '80s'\n", " if x>=1970 and x<=1979:\n", " return '70s'\n", " if x>=1960 and x<=1969:\n", " return '60s'\n", " if x>=1950 and x<=1959:\n", " return '50s'\n", " if x>=1940 and x<=1959:\n", " return '40s'\n", " if x<1940:\n", " return '<1940'\n", " else:\n", " return 'unknown'\n", "\n", "movies_features=movies.copy()\n", "del movies_features['title']\n", "del movies_features['genres']\n", "del movies_features['genre(no genres listed)']\n", "movies_features['Year']=movies_features.Year.map(lambda x: discretize_year(x))\n", "movies_features=pd.get_dummies(movies_features, columns=['Year'])\n", "movies_features.set_index('movieId',inplace=True)\n", 
"movies_features.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## 2. Processing the user data\n", "\n", "The dataset contains demographic info with zip codes. As there are way too many of them, I’ll try to guess the US region from these zipcodes. In order to do so, I’m using a [publicly available table](http://federalgovernmentzipcodes.us/) mapping zip codes to states, [another one](http://www.fonz.net/blog/archives/2008/04/06/csv-of-states-and-state-abbreviations/) mapping state names to their abbreviations, and finally classifying the states into regions according to [usual definitions](https://www.infoplease.com/us/states/sizing-states)." ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [], "source": [ "zipcode_abbs=pd.read_csv(\"/home/david/movielens/zips/states.csv\")\n", "zipcode_abbs_dct={z.State:z.Abbreviation for z in zipcode_abbs.itertuples()}\n", "us_regs_table=[\n", " ('New England', 'Connecticut, Maine, Massachusetts, New Hampshire, Rhode Island, Vermont'),\n", " ('Middle Atlantic', 'Delaware, Maryland, New Jersey, New York, Pennsylvania'),\n", " ('South', 'Alabama, Arkansas, Florida, Georgia, Kentucky, Louisiana, Mississippi, Missouri, North Carolina, South Carolina, Tennessee, Virginia, West Virginia'),\n", " ('Midwest', 'Illinois, Indiana, Iowa, Kansas, Michigan, Minnesota, Nebraska, North Dakota, Ohio, South Dakota, Wisconsin'),\n", " ('Southwest', 'Arizona, New Mexico, Oklahoma, Texas'),\n", " ('West', 'Alaska, California, Colorado, Hawaii, Idaho, Montana, Nevada, Oregon, Utah, Washington, Wyoming')\n", " ]\n", "us_regs_table=[(x[0],[i.strip() for i in x[1].split(\",\")]) for x in us_regs_table]\n", "us_regs_dct=dict()\n", "for r in us_regs_table:\n", " for s in r[1]:\n", " us_regs_dct[zipcode_abbs_dct[s]]=r[0]" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ 
"/home/david/anaconda2/lib/python2.7/site-packages/IPython/core/interactiveshell.py:2717: DtypeWarning: Columns (11) have mixed types. Specify dtype option on import or set low_memory=False.\n", " interactivity=interactivity, compiler=compiler, result=result)\n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ " Zipcode Region\n", "0 501 Middle Atlantic\n", "1 544 Middle Atlantic\n", "2 601 UsOther\n", "3 602 UsOther\n", "4 603 UsOther" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "zipcode_info=pd.read_csv(\"/home/david/movielens/free-zipcode-database.csv\")\n", "zipcode_info=zipcode_info.groupby('Zipcode').first().reset_index()\n", "zipcode_info['State'].loc[zipcode_info.Country!=\"US\"]='UnknownOrNonUS'\n", "zipcode_info['Region']=zipcode_info['State'].copy()\n", "zipcode_info['Region'].loc[zipcode_info.Country==\"US\"]=zipcode_info.Region.loc[zipcode_info.Country==\"US\"].map(lambda x: us_regs_dct[x] if x in us_regs_dct else 'UsOther')\n", "zipcode_info=zipcode_info[['Zipcode', 'Region']]\n", "zipcode_info.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A small look the the demographic data provided in the dataset:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " userId Gender Age Occupation Zipcode Region\n", "0 1 F 1 10 48067 Midwest\n", "1 2 M 56 16 70072 South\n", "2 3 M 25 15 55117 Midwest\n", "3 4 M 45 7 2460 New England\n", "4 5 M 25 20 55455 Midwest" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "users=pd.read_table(\"/home/david/movielens/ml-1m/ml-1m/users.dat\",sep='::',names=[\"userId\",\"Gender\",\"Age\",\"Occupation\",\"Zipcode\"], engine='python')\n", "users[\"Zipcode\"]=users.Zipcode.map(lambda x: np.int(re.sub(\"-.*\",\"\",x)))\n", "users=pd.merge(users,zipcode_info,on='Zipcode',how='left')\n", "users['Region']=users.Region.fillna('UnknownOrNonUS')\n", "users.head()" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "West 1652\n", "Midwest 1546\n", "South 887\n", "Middle Atlantic 872\n", "New England 507\n", "Southwest 462\n", "UnknownOrNonUS 73\n", "UsOther 41\n", "Name: Region, dtype: int64" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "users.Region.value_counts()" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Gender Age_1 Age_18 Age_25 Age_35 Age_45 Age_50 Age_56 \\\n", "userId \n", "1 0.0 1 0 0 0 0 0 0 \n", "2 1.0 0 0 0 0 0 0 1 \n", "3 1.0 0 0 1 0 0 0 0 \n", "4 1.0 0 0 0 0 1 0 0 \n", "5 1.0 0 0 1 0 0 0 0 \n", "\n", " Occupation_0 Occupation_1 ... Occupation_8 Occupation_9 \\\n", "userId ... \n", "1 0 0 ... 0 0 \n", "2 0 0 ... 0 0 \n", "3 0 0 ... 0 0 \n", "4 0 0 ... 0 0 \n", "5 0 0 ... 0 0 \n", "\n", " Region_Middle Atlantic Region_Midwest Region_New England \\\n", "userId \n", "1 0 1 0 \n", "2 0 0 0 \n", "3 0 1 0 \n", "4 0 0 1 \n", "5 0 1 0 \n", "\n", " Region_South Region_Southwest Region_UnknownOrNonUS Region_UsOther \\\n", "userId \n", "1 0 0 0 0 \n", "2 1 0 0 0 \n", "3 0 0 0 0 \n", "4 0 0 0 0 \n", "5 0 0 0 0 \n", "\n", " Region_West \n", "userId \n", "1 0 \n", "2 0 \n", "3 0 \n", "4 0 \n", "5 0 \n", "\n", "[5 rows x 37 columns]" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "users_features=users.copy()\n", "users_features['Gender']=users_features.Gender.map(lambda x: 1.0*(x=='M'))\n", "del users_features['Zipcode']\n", "users_features['Age']=users_features.Age.map(lambda x: str(x))\n", "users_features['Occupation']=users_features.Occupation.map(lambda x: str(x))\n", "users_features=pd.get_dummies(users_features, columns=['Age', 'Occupation', 'Region'])\n", "users_features.set_index('userId',inplace=True)\n", "users_features.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## 3. Loading ratings and generating test set\n", "\n", "A small look at the ratings provided:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " userId movieId Rating Timestamp\n", "0 1 1193 5 978300760\n", "1 1 661 3 978302109\n", "2 1 914 3 978301968\n", "3 1 3408 4 978300275\n", "4 1 2355 5 978824291" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ratings=pd.read_table(\"/home/david/movielens/ml-1m/ml-1m/ratings.dat\", sep=\"::\", names=[\"userId\",\"movieId\",\"Rating\",\"Timestamp\"], engine='python')\n", "movies_w_sideinfo=set(list(movies.movieId))\n", "ratings=ratings.loc[ratings.movieId.map(lambda x: x in movies_w_sideinfo)]\n", "ratings.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Generating a train and test set - for computational reasons I'll just take 100 random users as test with all the movies they rated:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "978017\n", "18866\n" ] } ], "source": [ "userids_present=list(set(list(ratings.userId)))\n", "np.random.seed(1)\n", "users_testset=set(list(np.random.choice(userids_present,replace=False,size=100)))\n", "\n", "ratings_train=ratings.loc[ratings.userId.map(lambda x: x not in users_testset)]\n", "ratings_test=ratings.loc[ratings.userId.map(lambda x: x in users_testset)]\n", "users_trainset=set(list(ratings.userId.loc[ratings.userId.map(lambda x: x not in users_testset)]))\n", "\n", "# now centering the ratings\n", "avg_rating_by_user=ratings_train.groupby('userId')['Rating'].mean().to_frame().rename(columns={'Rating':'AvgRating'})\n", "ratings_train=pd.merge(ratings_train, avg_rating_by_user, left_on='userId',right_index=True)\n", "ratings_train['RatingCentered']=ratings_train.Rating-ratings_train.AvgRating\n", "\n", "print(ratings_train.shape[0])\n", "print(ratings_test.shape[0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## 4. 
Fitting the model with Spark\n", "\n", "The data is very high-dimensional (each rating expands into 37 × 83 = 3,071 user-movie interaction features) and doesn't fit in a typical computer's RAM, so Spark comes in very handy for the computations, even when run locally. As it takes a long time to compute the coefficients, this will be done without any hyperparameter tuning.\n", "\n", "Starting Spark:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import findspark\n", "\n", "findspark.init(\"/home/david/Downloads/spark-2.1.1-bin-hadoop2.7/\")\n", "\n", "import pyspark\n", "sc = pyspark.SparkContext()\n", "from pyspark.sql import SQLContext\n", "sqlContext = SQLContext(sc)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now fitting the model:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from pyspark.mllib.regression import LabeledPoint\n", "from pyspark.ml.regression import LinearRegression\n", "from pyspark.ml.linalg import Vectors, VectorUDT\n", "from scipy.sparse import csc_matrix\n", "\n", "def generate_features(user,movie,users_features_bc,movies_features_bc):\n", "    user_feats=users_features_bc.value.loc[user].values\n", "    movie_feats=movies_features_bc.value.loc[movie].values\n", "    return csc_matrix(np.kron(user_feats,movie_feats).reshape(-1,1))\n", "\n", "users_features_bc=sc.broadcast(users_features)\n", "movies_features_bc=sc.broadcast(movies_features)\n", "\n", "trainset=sc.parallelize([(i.userId,i.movieId,i.RatingCentered) for i in ratings_train.itertuples()])\\\n", ".map(lambda x: LabeledPoint(x[2],generate_features(x[0],x[1],users_features_bc,movies_features_bc)))\\\n", ".map(lambda x: (float(x.label),x.features.asML())).toDF(['label','features'])\n", "# repartition returns a new DataFrame, so the result has to be assigned back\n", "trainset=trainset.repartition(50)\n", "\n", "recommender=LinearRegression(regParam=1e-4).fit(trainset)\n", "formula_coeffs=recommender.coefficients.toArray()" ] }, { "cell_type": "markdown", "metadata": {},
"source": [ "\n", "## 5. Evaluating the model and checking some recommendations\n", "\n", "Finally, evaluating what this system recommends to users. Due to the computational time it takes, the results won’t be evaluated with the metrics proposed in the paper at the beginning. I’ll just take average ratings for top-5 recommendations for each user in the test set and compare them to average ratings (the expected value for random recommendations) and to the maximum possible ratings from 5 movies each.\n", "\n", "This is not a really good measure, but it’s a good sense check to see if the recommendations are making sense and if the system is better than nothing.\n", "** *\n", "Getting scores for the test set:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/david/anaconda2/lib/python2.7/site-packages/ipykernel/__main__.py:7: SettingWithCopyWarning: \n", "A value is trying to be set on a copy of a slice from a DataFrame.\n", "Try using .loc[row_indexer,col_indexer] = value instead\n", "\n", "See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy\n" ] } ], "source": [ "def generate_features_series(user,movie):\n", " user_feats=users_features.loc[user].as_matrix()\n", " movie_feats=movies_features.loc[movie].as_matrix()\n", " return pd.Series(np.kron(user_feats,movie_feats).astype('float64'))\n", "\n", "X_test=ratings_test.apply(lambda x: generate_features_series(x['userId'],x['movieId']), axis=1)\n", "ratings_test['score']=X_test.dot(formula_coeffs)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Comparing the model to recommending the most popular movies too:" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": false }, "outputs": [], "source": [ 
"avg_ratings=ratings.groupby('movieId')['Rating'].mean().to_frame().rename(columns={\"Rating\":\"AvgRating\"})\n", "ratings_test=pd.merge(ratings_test,avg_ratings,left_on='movieId',right_index=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now comparing it to no model (random recommendation) and best possible recommendations (in terms of ratings):" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Averge movie rating: 3.68455396497\n", "Average rating for top-5 rated by each user: 4.96\n", "Average rating for bottom-5 rated by each user: 1.61\n", "Average rating for top-5 recommendations of best-rated movies: 4.416\n", "----------------------\n", "Average rating for top-5 recommendations from this model: 4.338\n", "Average rating for bottom-5 (non-)recommendations from this model: 2.554\n" ] } ], "source": [ "print 'Averge movie rating:',ratings_test.groupby('userId')['Rating'].mean().mean()\n", "print 'Average rating for top-5 rated by each user:',ratings_test.sort_values(['userId','Rating'],ascending=False).groupby('userId')['Rating'].head(5).mean()\n", "print 'Average rating for bottom-5 rated by each user:',ratings_test.sort_values(['userId','Rating'],ascending=True).groupby('userId')['Rating'].head(5).mean()\n", "print 'Average rating for top-5 recommendations of best-rated movies:',ratings_test.sort_values(['userId','AvgRating'],ascending=False).groupby('userId')['Rating'].head(5).mean()\n", "print '----------------------'\n", "print 'Average rating for top-5 recommendations from this model:',ratings_test.sort_values(['userId','score'],ascending=False).groupby('userId')['Rating'].head(5).mean()\n", "print 'Average rating for bottom-5 (non-)recommendations from this model:',ratings_test.sort_values(['userId','score'],ascending=True).groupby('userId')['Rating'].head(5).mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ 
"Examining some recommendations (3 per user):" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " 
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", 
" \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
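<table>
<thead>
<tr><th></th><th>userId</th><th>Recommended Movie</th><th>Rating by user</th><th>Gender</th><th>Age</th><th>Occupation</th><th>Region</th><th>Movie's Genres</th></tr>
</thead>
<tbody>
<tr><th>0</th><td>5938</td><td>Raiders of the Lost Ark (Indiana Jones and the...</td><td>5</td><td>M</td><td>25-34</td><td>academic/educator</td><td>South</td><td>Action|Adventure</td></tr>
<tr><th>1</th><td>5938</td><td>Usual Suspects, The (1995)</td><td>5</td><td>M</td><td>25-34</td><td>academic/educator</td><td>South</td><td>Crime|Mystery|Thriller</td></tr>
<tr><th>2</th><td>5938</td><td>North by Northwest (1959)</td><td>5</td><td>M</td><td>25-34</td><td>academic/educator</td><td>South</td><td>Action|Adventure|Mystery|Romance|Thriller</td></tr>
<tr><th>3</th><td>5798</td><td>Raiders of the Lost Ark (Indiana Jones and the...</td><td>4</td><td>M</td><td>35-44</td><td>other or not specified</td><td>West</td><td>Action|Adventure</td></tr>
<tr><th>4</th><td>5798</td><td>Saving Private Ryan (1998)</td><td>5</td><td>M</td><td>35-44</td><td>other or not specified</td><td>West</td><td>Action|Drama|War</td></tr>
<tr><th>5</th><td>5798</td><td>Godfather, The (1972)</td><td>5</td><td>M</td><td>35-44</td><td>other or not specified</td><td>West</td><td>Crime|Drama</td></tr>
<tr><th>6</th><td>5693</td><td>North by Northwest (1959)</td><td>4</td><td>F</td><td>25-34</td><td>college/grad student</td><td>West</td><td>Action|Adventure|Mystery|Romance|Thriller</td></tr>
<tr><th>7</th><td>5693</td><td>Schindler's List (1993)</td><td>3</td><td>F</td><td>25-34</td><td>college/grad student</td><td>West</td><td>Drama|War</td></tr>
<tr><th>8</th><td>5693</td><td>Notorious (1946)</td><td>3</td><td>F</td><td>25-34</td><td>college/grad student</td><td>West</td><td>Film-Noir|Romance|Thriller</td></tr>
<tr><th>9</th><td>5692</td><td>Shawshank Redemption, The (1994)</td><td>5</td><td>F</td><td>25-34</td><td>executive/managerial</td><td>South</td><td>Crime|Drama</td></tr>
<tr><th>10</th><td>5692</td><td>Schindler's List (1993)</td><td>5</td><td>F</td><td>25-34</td><td>executive/managerial</td><td>South</td><td>Drama|War</td></tr>
<tr><th>11</th><td>5692</td><td>Raiders of the Lost Ark (Indiana Jones and the...</td><td>4</td><td>F</td><td>25-34</td><td>executive/managerial</td><td>South</td><td>Action|Adventure</td></tr>
<tr><th>12</th><td>5582</td><td>Schindler's List (1993)</td><td>5</td><td>M</td><td>45-49</td><td>academic/educator</td><td>Midwest</td><td>Drama|War</td></tr>
<tr><th>13</th><td>5582</td><td>To Kill a Mockingbird (1962)</td><td>5</td><td>M</td><td>45-49</td><td>academic/educator</td><td>Midwest</td><td>Drama</td></tr>
<tr><th>14</th><td>5582</td><td>Third Man, The (1949)</td><td>5</td><td>M</td><td>45-49</td><td>academic/educator</td><td>Midwest</td><td>Film-Noir|Mystery|Thriller</td></tr>
<tr><th>15</th><td>5560</td><td>Sting, The (1973)</td><td>5</td><td>F</td><td>35-44</td><td>technician/engineer</td><td>West</td><td>Comedy|Crime</td></tr>
<tr><th>16</th><td>5560</td><td>Guess Who's Coming to Dinner (1967)</td><td>5</td><td>F</td><td>35-44</td><td>technician/engineer</td><td>West</td><td>Drama</td></tr>
<tr><th>17</th><td>5560</td><td>Sixth Sense, The (1999)</td><td>5</td><td>F</td><td>35-44</td><td>technician/engineer</td><td>West</td><td>Drama|Horror|Mystery</td></tr>
<tr><th>18</th><td>5392</td><td>Fargo (1996)</td><td>4</td><td>M</td><td>50-55</td><td>academic/educator</td><td>West</td><td>Comedy|Crime|Drama|Thriller</td></tr>
<tr><th>19</th><td>5392</td><td>Close Encounters of the Third Kind (1977)</td><td>5</td><td>M</td><td>50-55</td><td>academic/educator</td><td>West</td><td>Adventure|Drama|Sci-Fi</td></tr>
<tr><th>20</th><td>5392</td><td>American Beauty (1999)</td><td>4</td><td>M</td><td>50-55</td><td>academic/educator</td><td>West</td><td>Drama|Romance</td></tr>
<tr><th>21</th><td>5352</td><td>It Happened One Night (1934)</td><td>5</td><td>F</td><td>35-44</td><td>customer service</td><td>Midwest</td><td>Comedy|Romance</td></tr>
<tr><th>22</th><td>5352</td><td>Thin Man, The (1934)</td><td>4</td><td>F</td><td>35-44</td><td>customer service</td><td>Midwest</td><td>Comedy|Crime</td></tr>
<tr><th>23</th><td>5352</td><td>My Man Godfrey (1936)</td><td>5</td><td>F</td><td>35-44</td><td>customer service</td><td>Midwest</td><td>Comedy|Romance</td></tr>
<tr><th>24</th><td>5235</td><td>Wallace &amp; Gromit: A Close Shave (1995)</td><td>5</td><td>M</td><td>25-34</td><td>college/grad student</td><td>Middle Atlantic</td><td>Animation|Children|Comedy</td></tr>
<tr><th>25</th><td>5235</td><td>Monty Python and the Holy Grail (1975)</td><td>5</td><td>M</td><td>25-34</td><td>college/grad student</td><td>Middle Atlantic</td><td>Adventure|Comedy|Fantasy</td></tr>
<tr><th>26</th><td>5235</td><td>Dr. Strangelove or: How I Learned to Stop Worr...</td><td>5</td><td>M</td><td>25-34</td><td>college/grad student</td><td>Middle Atlantic</td><td>Comedy|War</td></tr>
<tr><th>27</th><td>5203</td><td>Forrest Gump (1994)</td><td>2</td><td>F</td><td>45-49</td><td>doctor/health care</td><td>New England</td><td>Comedy|Drama|Romance|War</td></tr>
<tr><th>28</th><td>5203</td><td>Shakespeare in Love (1998)</td><td>5</td><td>F</td><td>45-49</td><td>doctor/health care</td><td>New England</td><td>Comedy|Drama|Romance</td></tr>
<tr><th>29</th><td>5203</td><td>Braveheart (1995)</td><td>5</td><td>F</td><td>45-49</td><td>doctor/health care</td><td>New England</td><td>Action|Drama|War</td></tr>
<tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>
<tr><th>270</th><td>286</td><td>Raiders of the Lost Ark (Indiana Jones and the...</td><td>4</td><td>M</td><td>25-34</td><td>academic/educator</td><td>Midwest</td><td>Action|Adventure</td></tr>
<tr><th>271</th><td>286</td><td>Rear Window (1954)</td><td>4</td><td>M</td><td>25-34</td><td>academic/educator</td><td>Midwest</td><td>Mystery|Thriller</td></tr>
<tr><th>272</th><td>286</td><td>North by Northwest (1959)</td><td>5</td><td>M</td><td>25-34</td><td>academic/educator</td><td>Midwest</td><td>Action|Adventure|Mystery|Romance|Thriller</td></tr>
<tr><th>273</th><td>277</td><td>Princess Bride, The (1987)</td><td>4</td><td>F</td><td>35-44</td><td>academic/educator</td><td>West</td><td>Action|Adventure|Comedy|Fantasy|Romance</td></tr>
<tr><th>274</th><td>277</td><td>Sixth Sense, The (1999)</td><td>5</td><td>F</td><td>35-44</td><td>academic/educator</td><td>West</td><td>Drama|Horror|Mystery</td></tr>
<tr><th>275</th><td>277</td><td>Shakespeare in Love (1998)</td><td>4</td><td>F</td><td>35-44</td><td>academic/educator</td><td>West</td><td>Comedy|Drama|Romance</td></tr>
<tr><th>276</th><td>249</td><td>Shawshank Redemption, The (1994)</td><td>5</td><td>F</td><td>18-24</td><td>sales/marketing</td><td>Midwest</td><td>Crime|Drama</td></tr>
<tr><th>277</th><td>249</td><td>Princess Bride, The (1987)</td><td>5</td><td>F</td><td>18-24</td><td>sales/marketing</td><td>Midwest</td><td>Action|Adventure|Comedy|Fantasy|Romance</td></tr>
<tr><th>278</th><td>249</td><td>Usual Suspects, The (1995)</td><td>4</td><td>F</td><td>18-24</td><td>sales/marketing</td><td>Midwest</td><td>Crime|Mystery|Thriller</td></tr>
<tr><th>279</th><td>235</td><td>M (1931)</td><td>5</td><td>M</td><td>25-34</td><td>other or not specified</td><td>UnknownOrNonUS</td><td>Crime|Film-Noir|Thriller</td></tr>
<tr><th>280</th><td>235</td><td>Reservoir Dogs (1992)</td><td>5</td><td>M</td><td>25-34</td><td>other or not specified</td><td>UnknownOrNonUS</td><td>Crime|Mystery|Thriller</td></tr>
<tr><th>281</th><td>235</td><td>Princess Mononoke (Mononoke-hime) (1997)</td><td>4</td><td>M</td><td>25-34</td><td>other or not specified</td><td>UnknownOrNonUS</td><td>Action|Adventure|Animation|Drama|Fantasy</td></tr>
<tr><th>282</th><td>180</td><td>Shadow of a Doubt (1943)</td><td>4</td><td>M</td><td>45-49</td><td>programmer</td><td>New England</td><td>Crime|Drama|Thriller</td></tr>
<tr><th>283</th><td>180</td><td>Silence of the Lambs, The (1991)</td><td>4</td><td>M</td><td>45-49</td><td>programmer</td><td>New England</td><td>Crime|Horror|Thriller</td></tr>
<tr><th>284</th><td>180</td><td>One Flew Over the Cuckoo's Nest (1975)</td><td>5</td><td>M</td><td>45-49</td><td>programmer</td><td>New England</td><td>Drama</td></tr>
<tr><th>285</th><td>170</td><td>Dr. Strangelove or: How I Learned to Stop Worr...</td><td>5</td><td>M</td><td>25-34</td><td>lawyer</td><td>Middle Atlantic</td><td>Comedy|War</td></tr>
<tr><th>286</th><td>170</td><td>North by Northwest (1959)</td><td>5</td><td>M</td><td>25-34</td><td>lawyer</td><td>Middle Atlantic</td><td>Action|Adventure|Mystery|Romance|Thriller</td></tr>
<tr><th>287</th><td>170</td><td>Raiders of the Lost Ark (Indiana Jones and the...</td><td>5</td><td>M</td><td>25-34</td><td>lawyer</td><td>Middle Atlantic</td><td>Action|Adventure</td></tr>
<tr><th>288</th><td>148</td><td>Raiders of the Lost Ark (Indiana Jones and the...</td><td>5</td><td>M</td><td>50-55</td><td>technician/engineer</td><td>Midwest</td><td>Action|Adventure</td></tr>
<tr><th>289</th><td>148</td><td>Schindler's List (1993)</td><td>5</td><td>M</td><td>50-55</td><td>technician/engineer</td><td>Midwest</td><td>Drama|War</td></tr>
<tr><th>290</th><td>148</td><td>North by Northwest (1959)</td><td>5</td><td>M</td><td>50-55</td><td>technician/engineer</td><td>Midwest</td><td>Action|Adventure|Mystery|Romance|Thriller</td></tr>
<tr><th>291</th><td>126</td><td>Charade (1963)</td><td>5</td><td>M</td><td>18-24</td><td>homemaker</td><td>West</td><td>Comedy|Crime|Mystery|Romance|Thriller</td></tr>
<tr><th>292</th><td>126</td><td>Gladiator (2000)</td><td>4</td><td>M</td><td>18-24</td><td>homemaker</td><td>West</td><td>Action|Adventure|Drama</td></tr>
<tr><th>293</th><td>126</td><td>Die Hard (1988)</td><td>1</td><td>M</td><td>18-24</td><td>homemaker</td><td>West</td><td>Action|Crime|Thriller</td></tr>
<tr><th>294</th><td>80</td><td>Schindler's List (1993)</td><td>5</td><td>M</td><td>56+</td><td>academic/educator</td><td>Midwest</td><td>Drama|War</td></tr>
<tr><th>295</th><td>80</td><td>One Flew Over the Cuckoo's Nest (1975)</td><td>4</td><td>M</td><td>56+</td><td>academic/educator</td><td>Midwest</td><td>Drama</td></tr>
<tr><th>296</th><td>80</td><td>Green Mile, The (1999)</td><td>5</td><td>M</td><td>56+</td><td>academic/educator</td><td>Midwest</td><td>Crime|Drama</td></tr>
<tr><th>297</th><td>46</td><td>Evil Dead II (Dead by Dawn) (1987)</td><td>5</td><td>M</td><td>18-24</td><td>unemployed</td><td>Southwest</td><td>Action|Comedy|Fantasy|Horror</td></tr>
<tr><th>298</th><td>46</td><td>Rosemary's Baby (1968)</td><td>5</td><td>M</td><td>18-24</td><td>unemployed</td><td>Southwest</td><td>Drama|Horror|Thriller</td></tr>
<tr><th>299</th><td>46</td><td>Night on Earth (1991)</td><td>1</td><td>M</td><td>18-24</td><td>unemployed</td><td>Southwest</td><td>Comedy|Drama</td></tr>
</tbody>
</table>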
\n", "

300 rows × 8 columns

\n", "
" ], "text/plain": [ " userId Recommended Movie \\\n", "0 5938 Raiders of the Lost Ark (Indiana Jones and the... \n", "1 5938 Usual Suspects, The (1995) \n", "2 5938 North by Northwest (1959) \n", "3 5798 Raiders of the Lost Ark (Indiana Jones and the... \n", "4 5798 Saving Private Ryan (1998) \n", "5 5798 Godfather, The (1972) \n", "6 5693 North by Northwest (1959) \n", "7 5693 Schindler's List (1993) \n", "8 5693 Notorious (1946) \n", "9 5692 Shawshank Redemption, The (1994) \n", "10 5692 Schindler's List (1993) \n", "11 5692 Raiders of the Lost Ark (Indiana Jones and the... \n", "12 5582 Schindler's List (1993) \n", "13 5582 To Kill a Mockingbird (1962) \n", "14 5582 Third Man, The (1949) \n", "15 5560 Sting, The (1973) \n", "16 5560 Guess Who's Coming to Dinner (1967) \n", "17 5560 Sixth Sense, The (1999) \n", "18 5392 Fargo (1996) \n", "19 5392 Close Encounters of the Third Kind (1977) \n", "20 5392 American Beauty (1999) \n", "21 5352 It Happened One Night (1934) \n", "22 5352 Thin Man, The (1934) \n", "23 5352 My Man Godfrey (1936) \n", "24 5235 Wallace & Gromit: A Close Shave (1995) \n", "25 5235 Monty Python and the Holy Grail (1975) \n", "26 5235 Dr. Strangelove or: How I Learned to Stop Worr... \n", "27 5203 Forrest Gump (1994) \n", "28 5203 Shakespeare in Love (1998) \n", "29 5203 Braveheart (1995) \n", ".. ... ... \n", "270 286 Raiders of the Lost Ark (Indiana Jones and the... 
\n", "271 286 Rear Window (1954) \n", "272 286 North by Northwest (1959) \n", "273 277 Princess Bride, The (1987) \n", "274 277 Sixth Sense, The (1999) \n", "275 277 Shakespeare in Love (1998) \n", "276 249 Shawshank Redemption, The (1994) \n", "277 249 Princess Bride, The (1987) \n", "278 249 Usual Suspects, The (1995) \n", "279 235 M (1931) \n", "280 235 Reservoir Dogs (1992) \n", "281 235 Princess Mononoke (Mononoke-hime) (1997) \n", "282 180 Shadow of a Doubt (1943) \n", "283 180 Silence of the Lambs, The (1991) \n", "284 180 One Flew Over the Cuckoo's Nest (1975) \n", "285 170 Dr. Strangelove or: How I Learned to Stop Worr... \n", "286 170 North by Northwest (1959) \n", "287 170 Raiders of the Lost Ark (Indiana Jones and the... \n", "288 148 Raiders of the Lost Ark (Indiana Jones and the... \n", "289 148 Schindler's List (1993) \n", "290 148 North by Northwest (1959) \n", "291 126 Charade (1963) \n", "292 126 Gladiator (2000) \n", "293 126 Die Hard (1988) \n", "294 80 Schindler's List (1993) \n", "295 80 One Flew Over the Cuckoo's Nest (1975) \n", "296 80 Green Mile, The (1999) \n", "297 46 Evil Dead II (Dead by Dawn) (1987) \n", "298 46 Rosemary's Baby (1968) \n", "299 46 Night on Earth (1991) \n", "\n", " Rating by user Gender Age Occupation Region \\\n", "0 5 M 25-34 academic/educator South \n", "1 5 M 25-34 academic/educator South \n", "2 5 M 25-34 academic/educator South \n", "3 4 M 35-44 other or not specified West \n", "4 5 M 35-44 other or not specified West \n", "5 5 M 35-44 other or not specified West \n", "6 4 F 25-34 college/grad student West \n", "7 3 F 25-34 college/grad student West \n", "8 3 F 25-34 college/grad student West \n", "9 5 F 25-34 executive/managerial South \n", "10 5 F 25-34 executive/managerial South \n", "11 4 F 25-34 executive/managerial South \n", "12 5 M 45-49 academic/educator Midwest \n", "13 5 M 45-49 academic/educator Midwest \n", "14 5 M 45-49 academic/educator Midwest \n", "15 5 F 35-44 technician/engineer West \n", "16 
5 F 35-44 technician/engineer West \n", "17 5 F 35-44 technician/engineer West \n", "18 4 M 50-55 academic/educator West \n", "19 5 M 50-55 academic/educator West \n", "20 4 M 50-55 academic/educator West \n", "21 5 F 35-44 customer service Midwest \n", "22 4 F 35-44 customer service Midwest \n", "23 5 F 35-44 customer service Midwest \n", "24 5 M 25-34 college/grad student Middle Atlantic \n", "25 5 M 25-34 college/grad student Middle Atlantic \n", "26 5 M 25-34 college/grad student Middle Atlantic \n", "27 2 F 45-49 doctor/health care New England \n", "28 5 F 45-49 doctor/health care New England \n", "29 5 F 45-49 doctor/health care New England \n", ".. ... ... ... ... ... \n", "270 4 M 25-34 academic/educator Midwest \n", "271 4 M 25-34 academic/educator Midwest \n", "272 5 M 25-34 academic/educator Midwest \n", "273 4 F 35-44 academic/educator West \n", "274 5 F 35-44 academic/educator West \n", "275 4 F 35-44 academic/educator West \n", "276 5 F 18-24 sales/marketing Midwest \n", "277 5 F 18-24 sales/marketing Midwest \n", "278 4 F 18-24 sales/marketing Midwest \n", "279 5 M 25-34 other or not specified UnknownOrNonUS \n", "280 5 M 25-34 other or not specified UnknownOrNonUS \n", "281 4 M 25-34 other or not specified UnknownOrNonUS \n", "282 4 M 45-49 programmer New England \n", "283 4 M 45-49 programmer New England \n", "284 5 M 45-49 programmer New England \n", "285 5 M 25-34 lawyer Middle Atlantic \n", "286 5 M 25-34 lawyer Middle Atlantic \n", "287 5 M 25-34 lawyer Middle Atlantic \n", "288 5 M 50-55 technician/engineer Midwest \n", "289 5 M 50-55 technician/engineer Midwest \n", "290 5 M 50-55 technician/engineer Midwest \n", "291 5 M 18-24 homemaker West \n", "292 4 M 18-24 homemaker West \n", "293 1 M 18-24 homemaker West \n", "294 5 M 56+ academic/educator Midwest \n", "295 4 M 56+ academic/educator Midwest \n", "296 5 M 56+ academic/educator Midwest \n", "297 5 M 18-24 unemployed Southwest \n", "298 5 M 18-24 unemployed Southwest \n", "299 1 M 18-24 
unemployed Southwest \n", "\n", " Movie's Genres \n", "0 Action|Adventure \n", "1 Crime|Mystery|Thriller \n", "2 Action|Adventure|Mystery|Romance|Thriller \n", "3 Action|Adventure \n", "4 Action|Drama|War \n", "5 Crime|Drama \n", "6 Action|Adventure|Mystery|Romance|Thriller \n", "7 Drama|War \n", "8 Film-Noir|Romance|Thriller \n", "9 Crime|Drama \n", "10 Drama|War \n", "11 Action|Adventure \n", "12 Drama|War \n", "13 Drama \n", "14 Film-Noir|Mystery|Thriller \n", "15 Comedy|Crime \n", "16 Drama \n", "17 Drama|Horror|Mystery \n", "18 Comedy|Crime|Drama|Thriller \n", "19 Adventure|Drama|Sci-Fi \n", "20 Drama|Romance \n", "21 Comedy|Romance \n", "22 Comedy|Crime \n", "23 Comedy|Romance \n", "24 Animation|Children|Comedy \n", "25 Adventure|Comedy|Fantasy \n", "26 Comedy|War \n", "27 Comedy|Drama|Romance|War \n", "28 Comedy|Drama|Romance \n", "29 Action|Drama|War \n", ".. ... \n", "270 Action|Adventure \n", "271 Mystery|Thriller \n", "272 Action|Adventure|Mystery|Romance|Thriller \n", "273 Action|Adventure|Comedy|Fantasy|Romance \n", "274 Drama|Horror|Mystery \n", "275 Comedy|Drama|Romance \n", "276 Crime|Drama \n", "277 Action|Adventure|Comedy|Fantasy|Romance \n", "278 Crime|Mystery|Thriller \n", "279 Crime|Film-Noir|Thriller \n", "280 Crime|Mystery|Thriller \n", "281 Action|Adventure|Animation|Drama|Fantasy \n", "282 Crime|Drama|Thriller \n", "283 Crime|Horror|Thriller \n", "284 Drama \n", "285 Comedy|War \n", "286 Action|Adventure|Mystery|Romance|Thriller \n", "287 Action|Adventure \n", "288 Action|Adventure \n", "289 Drama|War \n", "290 Action|Adventure|Mystery|Romance|Thriller \n", "291 Comedy|Crime|Mystery|Romance|Thriller \n", "292 Action|Adventure|Drama \n", "293 Action|Crime|Thriller \n", "294 Drama|War \n", "295 Drama \n", "296 Crime|Drama \n", "297 Action|Comedy|Fantasy|Horror \n", "298 Drama|Horror|Thriller \n", "299 Comedy|Drama \n", "\n", "[300 rows x 8 columns]" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ 
"# Keep the three highest-scoring movies for each hold-out user\n", "top3=ratings_test.sort_values(['userId','score'],ascending=False).groupby('userId').head(3)\n", "top3=top3[['userId','movieId','Rating']]\n", "# Attach user demographics and human-readable movie info\n", "top3=pd.merge(top3,users,on='userId',how='left')\n", "top3=pd.merge(top3,movies_humanreadable,on='movieId',how='left')\n", "top3.rename(columns={'title':'Recommended Movie', 'genres':\"Movie's Genres\", 'Rating':'Rating by user'},inplace=True)\n", "# Decode the MovieLens 1M age-group codes\n", "age_mapping={\n", " 1: \"Under 18\",\n", " 18: \"18-24\",\n", " 25: \"25-34\",\n", " 35: \"35-44\",\n", " 45: \"45-49\",\n", " 50: \"50-55\",\n", " 56: \"56+\"\n", "}\n", "top3['Age']=top3.Age.map(lambda x: age_mapping[x])\n", "# Decode the MovieLens 1M occupation codes\n", "occupations_mapping={\n", " 0: \"other or not specified\",\n", " 1: \"academic/educator\",\n", " 2: \"artist\",\n", " 3: \"clerical/admin\",\n", " 4: \"college/grad student\",\n", " 5: \"customer service\",\n", " 6: \"doctor/health care\",\n", " 7: \"executive/managerial\",\n", " 8: \"farmer\",\n", " 9: \"homemaker\",\n", " 10: \"K-12 student\",\n", " 11: \"lawyer\",\n", " 12: \"programmer\",\n", " 13: \"retired\",\n", " 14: \"sales/marketing\",\n", " 15: \"scientist\",\n", " 16: \"self-employed\",\n", " 17: \"technician/engineer\",\n", " 18: \"tradesman/craftsman\",\n", " 19: \"unemployed\",\n", " 20: \"writer\"\n", "}\n", "top3['Occupation']=top3.Occupation.map(lambda x: occupations_mapping[x])\n", "del top3['Zipcode']\n", "del top3['movieId']\n", "top3[['userId','Recommended Movie','Rating by user', 'Gender','Age','Occupation','Region',\"Movie's Genres\"]]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that this was a complete-user hold-out test set (the model's parameters were not calculated using any information from these users and might not even have used any information from the movies that are being recommended), but the recommendations shown here are limited to the movies that each of these users has rated, so that we can see how they actually rated the recommended movies.\n", "\n", "As the top-3 recommendations
for each user seem to have been rated well by them, the recommendations are plausibly good. Their ratings might not be as high as those obtained by simply recommending the best-rated movies, but the recommendations are personalized and can include newer movies too (as long as they have tags).\n", "\n", "As for implementing such a model, since it generates the same recommendations for any user with the same combination of gender, age, occupation and location, the recommendations could be pre-computed for each of the $2 \\times 7 \\times 21 \\times 8=2352$ theoretical buckets (in practice, not all of them are plausible, since people of certain ages, such as under 18, are unlikely to fall into certain occupations)." ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python [Root]", "language": "python", "name": "Python [Root]" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.2" } }, "nbformat": 4, "nbformat_minor": 0 }