{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Combining recommended lists\n",
    "** *\n",
    "This IPython notebook consists in combining the Top-N recommended items from different recommender methodologies (here one list each coming from collaborative filtering, content-based, and most-popular) for a given user using interleaved ranking, in order to obtain a final recommended list.\n",
    "\n",
    "A simple approach to combine recommendations from different sources is to add or multiply the score that each item for a given user gets under each algorithm, but this might not end up changing the recommendations too much if the scores are dissimilar or if they come in the form of a ranking. Interleaved ranking – originally an algorithm for mixing search engine results – offers a method to force the final recommended list to be more “mixed” by making them contain elements from each list.\n",
    "\n",
    "There are different algorithms for making an interleaved ranked list – here I’ll use the simplest algorithm, also known as the soccer team selection, which intuitively is as follows: each recommended list gets to contribute items to the final list in a sequence, by trying to add their top-ranked item, but ignoring items that got already put in the final list by another recommended list.\n",
    "\n",
    "Here I’ll produce three different recommended lists of 20 items each using the [MovieLens 1M dataset](https://grouplens.org/datasets/movielens/1m/) for the user numbered $100$ (userId = 100) as follows:\n",
    "* Most-popular: each item’s score is the sum of the ratings they get from all users, thus favoring both highly rated and highly voted movies. This is a non-personalized list (i.e. it’s the same for all users).\n",
    "* Collaborative filtering: a low-rank matrix factorization of the ratings matrix using alternating least squares.\n",
    "* Content-based: regression of the (centered) ratings against the outer product of user and movie features – this is a more involved process and the details can be found [in this other IPython notebook](http://nbviewer.ipython.org/github/david-cortes/datascienceprojects/blob/master/machine_learning/recommender_system_w_coldstart.ipynb).\n",
    "\n",
    "** *\n",
    "## Sections\n",
    "\n",
    "[1. Loading the data](#p1)\n",
    "\n",
    "[2. Producing a Most-Popular recommended list](#p2)\n",
    "\n",
    "[3. Producing a Collaborative Filtering recommended list](#p3)\n",
    "\n",
    "[4. Producing a Content-Based recommended list](#p4)\n",
    "\n",
    "[5. Examining the recommendations](#p5)\n",
    "\n",
    "[6. Combining recommended lists](#p6)\n",
    "** *\n",
    "\n",
    "<a id=\"p1\"></a>\n",
    "## 1. Loading the data\n",
    "\n",
    "Initiallizing spark locally (will be used for most computations) and loading the necessary libraries"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import numpy as np, pandas as pd, re, findspark\n",
    "from collections import defaultdict\n",
    "from sklearn.decomposition import PCA\n",
    "from scipy.sparse import csc_matrix\n",
    "\n",
    "findspark.init(\"/home/david/Downloads/spark-2.1.1-bin-hadoop2.7/\")\n",
    "\n",
    "import pyspark\n",
    "sc = pyspark.SparkContext()\n",
    "from pyspark.sql import SQLContext\n",
    "sqlContext = SQLContext(sc)\n",
    "\n",
    "from pyspark.mllib.regression import (LabeledPoint, RidgeRegressionWithSGD)\n",
    "from pyspark.ml.regression import LinearRegression\n",
    "from pyspark.ml.recommendation import ALS"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Loading the MovieLens-1M ratings:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>userId</th>\n",
       "      <th>movieId</th>\n",
       "      <th>Rating</th>\n",
       "      <th>Timestamp</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>1193</td>\n",
       "      <td>5</td>\n",
       "      <td>978300760</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>661</td>\n",
       "      <td>3</td>\n",
       "      <td>978302109</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1</td>\n",
       "      <td>914</td>\n",
       "      <td>3</td>\n",
       "      <td>978301968</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1</td>\n",
       "      <td>3408</td>\n",
       "      <td>4</td>\n",
       "      <td>978300275</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>1</td>\n",
       "      <td>2355</td>\n",
       "      <td>5</td>\n",
       "      <td>978824291</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   userId  movieId  Rating  Timestamp\n",
       "0       1     1193       5  978300760\n",
       "1       1      661       3  978302109\n",
       "2       1      914       3  978301968\n",
       "3       1     3408       4  978300275\n",
       "4       1     2355       5  978824291"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ratings=pd.read_table(\"/home/david/movielens/ml-1m/ml-1m/ratings.dat\", sep=\"::\", names=[\"userId\",\"movieId\",\"Rating\",\"Timestamp\"], engine='python')\n",
    "ratings.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Loading the movie titles encoding - will be used later to examine recommended lists:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "movie_titles=pd.read_csv('/home/david/movielens/ml-1m/ml-1m/movies.dat', sep=\"::\", names=['movieId','MovieTitle','genres'],engine='python')\n",
    "movie_titles={i.movieId:i.MovieTitle for i in movie_titles.itertuples()}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a id=\"p2\"></a>\n",
    "## 2. Producing a Most-Popular recommended list\n",
    "\n",
    "Items are ranked by sum of their ratings:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>NumRatings</th>\n",
       "      <th>AvgRating</th>\n",
       "      <th>score</th>\n",
       "      <th>Title</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>movieId</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>2858</th>\n",
       "      <td>3428</td>\n",
       "      <td>4.317386</td>\n",
       "      <td>14800.0</td>\n",
       "      <td>American Beauty (1999)</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>260</th>\n",
       "      <td>2991</td>\n",
       "      <td>4.453694</td>\n",
       "      <td>13321.0</td>\n",
       "      <td>Star Wars: Episode IV - A New Hope (1977)</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1196</th>\n",
       "      <td>2990</td>\n",
       "      <td>4.292977</td>\n",
       "      <td>12836.0</td>\n",
       "      <td>Star Wars: Episode V - The Empire Strikes Back...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1210</th>\n",
       "      <td>2883</td>\n",
       "      <td>4.022893</td>\n",
       "      <td>11598.0</td>\n",
       "      <td>Star Wars: Episode VI - Return of the Jedi (1983)</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2028</th>\n",
       "      <td>2653</td>\n",
       "      <td>4.337354</td>\n",
       "      <td>11507.0</td>\n",
       "      <td>Saving Private Ryan (1998)</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "         NumRatings  AvgRating    score  \\\n",
       "movieId                                   \n",
       "2858           3428   4.317386  14800.0   \n",
       "260            2991   4.453694  13321.0   \n",
       "1196           2990   4.292977  12836.0   \n",
       "1210           2883   4.022893  11598.0   \n",
       "2028           2653   4.337354  11507.0   \n",
       "\n",
       "                                                     Title  \n",
       "movieId                                                     \n",
       "2858                                American Beauty (1999)  \n",
       "260              Star Wars: Episode IV - A New Hope (1977)  \n",
       "1196     Star Wars: Episode V - The Empire Strikes Back...  \n",
       "1210     Star Wars: Episode VI - Return of the Jedi (1983)  \n",
       "2028                            Saving Private Ryan (1998)  "
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "user=100\n",
    "movies_watched_by_user=set(list(ratings.movieId.loc[ratings.userId==user]))\n",
    "\n",
    "avg_ratings=ratings.groupby('movieId')['Rating'].mean().to_frame().rename(columns={'Rating':'AvgRating'})\n",
    "num_ratings=ratings.groupby('movieId')['Rating'].agg(lambda x: len(tuple(x))).to_frame().rename(columns={'Rating':'NumRatings'})\n",
    "pop_rec=num_ratings.join(avg_ratings)\n",
    "pop_rec.loc[~pop_rec.index.isin(movies_watched_by_user)]\n",
    "pop_rec['score']=pop_rec.NumRatings*pop_rec.AvgRating\n",
    "pop_rec=pop_rec.sort_values('score',ascending=False)\n",
    "pop20=list(pop_rec.index[:20])\n",
    "pop_rec['Title']=pop_rec.index.map(lambda x: movie_titles[x])\n",
    "pop_rec.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a id=\"p3\"></a>\n",
    "## 3. Producing a Collaborative Filtering recommended list\n",
    "\n",
    "Here I'm using ALS from PySpark to factorize the ratings matrix:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>userId</th>\n",
       "      <th>movieId</th>\n",
       "      <th>score_cf</th>\n",
       "      <th>Title</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1405</th>\n",
       "      <td>100</td>\n",
       "      <td>3382</td>\n",
       "      <td>4.950840</td>\n",
       "      <td>Song of Freedom (1936)</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3333</th>\n",
       "      <td>100</td>\n",
       "      <td>557</td>\n",
       "      <td>3.812159</td>\n",
       "      <td>Mamma Roma (1962)</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1244</th>\n",
       "      <td>100</td>\n",
       "      <td>989</td>\n",
       "      <td>3.618343</td>\n",
       "      <td>Schlafes Bruder (Brother of Sleep) (1995)</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>512</th>\n",
       "      <td>100</td>\n",
       "      <td>578</td>\n",
       "      <td>3.510315</td>\n",
       "      <td>Hour of the Pig, The (1993)</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2633</th>\n",
       "      <td>100</td>\n",
       "      <td>3233</td>\n",
       "      <td>3.498407</td>\n",
       "      <td>Smashing Time (1967)</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      userId  movieId  score_cf                                      Title\n",
       "1405     100     3382  4.950840                     Song of Freedom (1936)\n",
       "3333     100      557  3.812159                          Mamma Roma (1962)\n",
       "1244     100      989  3.618343  Schlafes Bruder (Brother of Sleep) (1995)\n",
       "512      100      578  3.510315                Hour of the Pig, The (1993)\n",
       "2633     100     3233  3.498407                       Smashing Time (1967)"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ratings_df=sqlContext.createDataFrame(ratings)\n",
    "\n",
    "cfmodel=ALS(rank=50, regParam=0.5, userCol=\"userId\", itemCol=\"movieId\", ratingCol=\"Rating\").fit(ratings_df)\n",
    "movies_available=set(list(ratings.movieId))\n",
    "movies_available=movies_available.difference(movies_watched_by_user)\n",
    "preds=pd.DataFrame([(user,m) for m in movies_available],columns=['userId','movieId'])\n",
    "preds_df=sqlContext.createDataFrame(preds)\n",
    "preds_scores=cfmodel.transform(preds_df).collect()\n",
    "preds_scores=pd.DataFrame(preds_scores, columns=['userId','movieId','score_cf'])\n",
    "preds_scores=preds_scores.sort_values('score_cf',ascending=False)\n",
    "cf20=list(preds_scores.movieId.iloc[:20])\n",
    "preds_scores['Title']=preds_scores.movieId.map(lambda x: movie_titles[x])\n",
    "preds_scores.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a id=\"p4\"></a>\n",
    "## 4. Producing a Content-Based recommended list\n",
    "\n",
    "The overall idea is to get user demographic info including their geographical region, which I get from their zip codes by using some free zip code databases, and movie information by taking the movie tags from the latest movielens releases, matching them by title to the movielens-1m ratings and adding the movie genres and release year as a discretized category.\n",
    "\n",
    "Then, a regression is performed on the centered rating against the outer product of the user and movie features - a more detailed and explained version can be found [here](http://nbviewer.ipython.org/github/david-cortes/datascienceprojects/blob/master/machine_learning/recommender_system_w_coldstart.ipynb)."
   ]
  },
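  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a small illustration of the feature construction (a sketch with toy feature vectors and coefficients invented for the example, not the fitted model): each user-movie pair is represented by the Kronecker (outer) product of the user's and the movie's feature vectors, so a linear model learns one coefficient per (user feature, movie feature) combination.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "user_feats = np.array([1.0, 0.0, 1.0])   # e.g. gender and two region dummies (made up)\n",
    "movie_feats = np.array([0.0, 1.0])       # e.g. two genre dummies (made up)\n",
    "pair_feats = np.kron(user_feats, movie_feats)  # length 3 * 2 = 6\n",
    "\n",
    "# hypothetical coefficients, one per (user feature, movie feature) pair\n",
    "coeffs = np.array([0.1, 0.5, -0.2, 0.3, 0.0, 0.4])\n",
    "score = pair_feats.dot(coeffs)  # predicted centered rating for this pair\n",
    "# equivalent bilinear form u^T W m, with W the coefficients reshaped to 3 x 2\n",
    "assert np.isclose(score, user_feats.dot(coeffs.reshape(3, 2)).dot(movie_feats))\n",
    "print(pair_feats, score)\n",
    "```"
   ]
  },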
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/home/david/anaconda2/lib/python2.7/site-packages/pandas/core/indexing.py:179: SettingWithCopyWarning: \n",
      "A value is trying to be set on a copy of a slice from a DataFrame\n",
      "\n",
      "See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy\n",
      "  self._setitem_with_indexer(indexer, value)\n",
      "/home/david/anaconda2/lib/python2.7/site-packages/IPython/core/interactiveshell.py:2717: DtypeWarning: Columns (11) have mixed types. Specify dtype option on import or set low_memory=False.\n",
      "  interactivity=interactivity, compiler=compiler, result=result)\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style>\n",
       "    .dataframe thead tr:only-child th {\n",
       "        text-align: right;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>userId</th>\n",
       "      <th>movieId</th>\n",
       "      <th>score_cf</th>\n",
       "      <th>Title</th>\n",
       "      <th>score_cb</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>3191</th>\n",
       "      <td>100</td>\n",
       "      <td>1262</td>\n",
       "      <td>3.136641</td>\n",
       "      <td>Great Escape, The (1963)</td>\n",
       "      <td>1.030581</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>835</th>\n",
       "      <td>100</td>\n",
       "      <td>3030</td>\n",
       "      <td>3.163622</td>\n",
       "      <td>Yojimbo (1961)</td>\n",
       "      <td>1.015274</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>398</th>\n",
       "      <td>100</td>\n",
       "      <td>908</td>\n",
       "      <td>3.137593</td>\n",
       "      <td>North by Northwest (1959)</td>\n",
       "      <td>1.012968</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>354</th>\n",
       "      <td>100</td>\n",
       "      <td>3435</td>\n",
       "      <td>3.167782</td>\n",
       "      <td>Double Indemnity (1944)</td>\n",
       "      <td>1.003191</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>473</th>\n",
       "      <td>100</td>\n",
       "      <td>3196</td>\n",
       "      <td>3.033045</td>\n",
       "      <td>Stalag 17 (1953)</td>\n",
       "      <td>0.998952</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      userId  movieId  score_cf                      Title  score_cb\n",
       "3191     100     1262  3.136641   Great Escape, The (1963)  1.030581\n",
       "835      100     3030  3.163622             Yojimbo (1961)  1.015274\n",
       "398      100      908  3.137593  North by Northwest (1959)  1.012968\n",
       "354      100     3435  3.167782    Double Indemnity (1944)  1.003191\n",
       "473      100     3196  3.033045           Stalag 17 (1953)  0.998952"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "movies=pd.read_csv('/home/david/movielens/ml-latest/ml-latest/movies.csv')\n",
    "movies_humanreadable=movies.copy()\n",
    "movies['hasYear']=movies.title.map(lambda x: bool(re.search(\"\\s\\((\\d{4})\\)$\",x.strip())))\n",
    "movies['Year']='unknown'\n",
    "movies['Year'].loc[movies.hasYear]=movies.title.loc[movies.hasYear].map(lambda x: re.search(\"\\s\\((\\d{4})\\)$\",x.strip()).group(1))\n",
    "del movies['hasYear']\n",
    "\n",
    "movies['genres']=movies.genres.map(lambda x: set(x.split('|')))\n",
    "present_genres=set()\n",
    "for movie in movies.itertuples():\n",
    "    present_genres=present_genres.union(movie.genres)\n",
    "for genre in present_genres:\n",
    "    movies['genre'+genre]=movies.genres.map(lambda x: 1.0*(genre in x))\n",
    "\n",
    "tags=pd.read_csv('/home/david/movielens/ml-latest/ml-latest/genome-scores.csv')\n",
    "tags_wide=tags.pivot(index='movieId', columns='tagId', values='relevance')\n",
    "tags_wide=tags_wide.fillna(0)\n",
    "pca=PCA(svd_solver='full')\n",
    "pca.fit(tags_wide)\n",
    "tags_pca=pd.DataFrame(pca.transform(tags_wide)[:,:50])\n",
    "tags_pca.columns=[\"pc\"+str(x) for x in tags_pca.columns.values]\n",
    "tags_pca['movieId']=tags_wide.index\n",
    "movies=pd.merge(movies,tags_pca,how='inner',on='movieId')\n",
    "\n",
    "def discretize_year(x):\n",
    "    if x=='unknown':\n",
    "        return x\n",
    "    else:\n",
    "        x=int(x)\n",
    "        if x>=2000:\n",
    "            return '>=2000'\n",
    "        if x>=1995 and x<=1999:\n",
    "            return str(x)\n",
    "        if x>=1990 and x<=1994:\n",
    "            return 'low90s'\n",
    "        if x>=1980 and x<=1989:\n",
    "            return '80s'\n",
    "        if x>=1970 and x<=1979:\n",
    "            return '70s'\n",
    "        if x>=1960 and x<=1969:\n",
    "            return '60s'\n",
    "        if x>=1950 and x<=1959:\n",
    "            return '50s'\n",
    "        if x>=1940 and x<=1959:\n",
    "            return '40s'\n",
    "        if x<1940:\n",
    "            return '<1940'\n",
    "        else:\n",
    "            return 'unknown'\n",
    "\n",
    "movies_features=movies.copy()\n",
    "del movies_features['title']\n",
    "del movies_features['genres']\n",
    "del movies_features['genre(no genres listed)']\n",
    "movies_features['Year']=movies_features.Year.map(lambda x: discretize_year(x))\n",
    "movies_features=pd.get_dummies(movies_features, columns=['Year'])\n",
    "movies_features.set_index('movieId',inplace=True)\n",
    "\n",
    "zipcode_abbs=pd.read_csv(\"/home/david/movielens/zips/states.csv\")\n",
    "zipcode_abbs_dct={z.State:z.Abbreviation for z in zipcode_abbs.itertuples()}\n",
    "us_regs_table=[\n",
    "    ('New England', 'Connecticut, Maine, Massachusetts, New Hampshire, Rhode Island, Vermont'),\n",
    "    ('Middle Atlantic', 'Delaware, Maryland, New Jersey, New York, Pennsylvania'),\n",
    "    ('South', 'Alabama, Arkansas, Florida, Georgia, Kentucky, Louisiana, Mississippi, Missouri, North Carolina, South Carolina, Tennessee, Virginia, West Virginia'),\n",
    "    ('Midwest', 'Illinois, Indiana, Iowa, Kansas, Michigan, Minnesota, Nebraska, North Dakota, Ohio, South Dakota, Wisconsin'),\n",
    "    ('Southwest', 'Arizona, New Mexico, Oklahoma, Texas'),\n",
    "    ('West', 'Alaska, California, Colorado, Hawaii, Idaho, Montana, Nevada, Oregon, Utah, Washington, Wyoming')\n",
    "    ]\n",
    "us_regs_table=[(x[0],[i.strip() for i in x[1].split(\",\")]) for x in us_regs_table]\n",
    "us_regs_dct=dict()\n",
    "for r in us_regs_table:\n",
    "    for s in r[1]:\n",
    "        us_regs_dct[zipcode_abbs_dct[s]]=r[0]\n",
    "\n",
    "zipcode_info=pd.read_csv(\"/home/david/movielens/free-zipcode-database.csv\")\n",
    "zipcode_info=zipcode_info.groupby('Zipcode').first().reset_index()\n",
    "zipcode_info['State'].loc[zipcode_info.Country!=\"US\"]='UnknownOrNonUS'\n",
    "zipcode_info['Region']=zipcode_info['State'].copy()\n",
    "zipcode_info['Region'].loc[zipcode_info.Country==\"US\"]=zipcode_info.Region.loc[zipcode_info.Country==\"US\"].map(lambda x: us_regs_dct[x] if x in us_regs_dct else 'UsOther')\n",
    "zipcode_info=zipcode_info[['Zipcode', 'Region']]\n",
    "\n",
    "users=pd.read_table(\"/home/david/movielens/ml-1m/ml-1m/users.dat\",sep='::',names=[\"userId\",\"Gender\",\"Age\",\"Occupation\",\"Zipcode\"], engine='python')\n",
    "users[\"Zipcode\"]=users.Zipcode.map(lambda x: np.int(re.sub(\"-.*\",\"\",x)))\n",
    "users=pd.merge(users,zipcode_info,on='Zipcode',how='left')\n",
    "users['Region']=users.Region.fillna('UnknownOrNonUS')\n",
    "\n",
    "users_features=users.copy()\n",
    "users_features['Gender']=users_features.Gender.map(lambda x: 1.0*(x=='M'))\n",
    "del users_features['Zipcode']\n",
    "users_features['Age']=users_features.Age.map(lambda x: str(x))\n",
    "users_features['Occupation']=users_features.Occupation.map(lambda x: str(x))\n",
    "users_features=pd.get_dummies(users_features, columns=['Age', 'Occupation', 'Region'])\n",
    "users_features.set_index('userId',inplace=True)\n",
    "\n",
    "movies_w_sideinfo=set(list(movies.movieId))\n",
    "ratings=ratings.loc[ratings.movieId.map(lambda x: x in movies_w_sideinfo)]\n",
    "avg_rating_by_user=ratings.groupby('userId')['Rating'].mean().to_frame().rename(columns={'Rating':'AvgRating'})\n",
    "ratings_train=pd.merge(ratings, avg_rating_by_user, left_on='userId',right_index=True)\n",
    "ratings_train['RatingCentered']=ratings_train.Rating-ratings_train.AvgRating\n",
    "\n",
    "def generate_features(user,movie,users_features_bc,movies_features_bc):\n",
    "    user_feats=users_features_bc.value.loc[user].as_matrix()\n",
    "    movie_feats=movies_features_bc.value.loc[movie].as_matrix()\n",
    "    return csc_matrix(np.kron(user_feats,movie_feats).reshape(-1,1))\n",
    "\n",
    "users_features_bc=sc.broadcast(users_features)\n",
    "movies_features_bc=sc.broadcast(movies_features)\n",
    "\n",
    "trainset=sc.parallelize([(i.userId,i.movieId,i.RatingCentered) for i in ratings_train.itertuples()])\\\n",
    ".map(lambda x: LabeledPoint(x[2],generate_features(x[0],x[1],users_features_bc,movies_features_bc)))\\\n",
    ".map(lambda x: (float(x.label),x.features.asML())).toDF(['label','features'])\n",
    "trainset.repartition(50)\n",
    "\n",
    "recommender=LinearRegression(regParam=1e-4).fit(trainset)\n",
    "formula_coeffs=recommender.coefficients.toArray()\n",
    "\n",
    "def generate_features_series(user,movie):\n",
    "    user_feats=users_features.loc[user].as_matrix()\n",
    "    movie_feats=movies_features.loc[movie].as_matrix()\n",
    "    return pd.Series(np.kron(user_feats,movie_feats).astype('float64'))\n",
    "\n",
    "preds_scores=preds_scores.loc[preds_scores.movieId.map(lambda x: x in movies_w_sideinfo)]\n",
    "X_predict=preds_scores.movieId.apply(lambda x: generate_features_series(user,x))\n",
    "preds_scores['score_cb']=X_predict.dot(formula_coeffs)\n",
    "preds_scores=preds_scores.sort_values('score_cb',ascending=False)\n",
    "cb20=list(preds_scores.movieId.iloc[:20])\n",
    "preds_scores.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a id=\"p5\"></a>\n",
    "## 5. Examining the recommendations\n",
    "\n",
    "Now taking a look at what these lists are actually recommend each - their recommendations are very different with little intersection, and as expected, collaborative filtering tends to favor less popular items for this user. First Most-Popular recommended list:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1) - American Beauty (1999) - Average Rating: 4.32 - Number of ratings: 3428\n",
      "2) - Star Wars: Episode IV - A New Hope (1977) - Average Rating: 4.45 - Number of ratings: 2991\n",
      "3) - Star Wars: Episode V - The Empire Strikes Back (1980) - Average Rating: 4.29 - Number of ratings: 2990\n",
      "4) - Star Wars: Episode VI - Return of the Jedi (1983) - Average Rating: 4.02 - Number of ratings: 2883\n",
      "5) - Saving Private Ryan (1998) - Average Rating: 4.34 - Number of ratings: 2653\n",
      "6) - Raiders of the Lost Ark (1981) - Average Rating: 4.48 - Number of ratings: 2514\n",
      "7) - Silence of the Lambs, The (1991) - Average Rating: 4.35 - Number of ratings: 2578\n",
      "8) - Matrix, The (1999) - Average Rating: 4.32 - Number of ratings: 2590\n",
      "9) - Sixth Sense, The (1999) - Average Rating: 4.41 - Number of ratings: 2459\n",
      "10) - Terminator 2: Judgment Day (1991) - Average Rating: 4.06 - Number of ratings: 2649\n",
      "11) - Fargo (1996) - Average Rating: 4.25 - Number of ratings: 2513\n",
      "12) - Schindler's List (1993) - Average Rating: 4.51 - Number of ratings: 2304\n",
      "13) - Braveheart (1995) - Average Rating: 4.23 - Number of ratings: 2443\n",
      "14) - Back to the Future (1985) - Average Rating: 3.99 - Number of ratings: 2583\n",
      "15) - Shawshank Redemption, The (1994) - Average Rating: 4.55 - Number of ratings: 2227\n",
      "16) - Godfather, The (1972) - Average Rating: 4.52 - Number of ratings: 2223\n",
      "17) - Jurassic Park (1993) - Average Rating: 3.76 - Number of ratings: 2672\n",
      "18) - Princess Bride, The (1987) - Average Rating: 4.3 - Number of ratings: 2318\n",
      "19) - Shakespeare in Love (1998) - Average Rating: 4.13 - Number of ratings: 2369\n",
      "20) - L.A. Confidential (1997) - Average Rating: 4.22 - Number of ratings: 2288\n"
     ]
    }
   ],
   "source": [
    "def print_reclist(reclist):\n",
    "    list_w_info=[str(m+1)+\") - \"+movie_titles[reclist[m]]+\\\n",
    "        \" - Average Rating: \"+str(np.round(avg_ratings.loc[reclist[m]].iloc[0],2))+\\\n",
    "        \" - Number of ratings: \"+str(num_ratings.loc[reclist[m]].iloc[0]) for m in range(len(reclist))]\n",
    "    print \"\\n\".join(list_w_info)\n",
    "    \n",
    "print_reclist(pop20)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Collaborative filtering recommended list:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1) - Song of Freedom (1936) - Average Rating: 5.0 - Number of ratings: 1\n",
      "2) - Mamma Roma (1962) - Average Rating: 4.5 - Number of ratings: 2\n",
      "3) - Schlafes Bruder (Brother of Sleep) (1995) - Average Rating: 5.0 - Number of ratings: 1\n",
      "4) - Hour of the Pig, The (1993) - Average Rating: 4.5 - Number of ratings: 2\n",
      "5) - Smashing Time (1967) - Average Rating: 5.0 - Number of ratings: 2\n",
      "6) - Gate of Heavenly Peace, The (1995) - Average Rating: 5.0 - Number of ratings: 3\n",
      "7) - Apple, The (Sib) (1998) - Average Rating: 4.67 - Number of ratings: 9\n",
      "8) - Ulysses (Ulisse) (1954) - Average Rating: 5.0 - Number of ratings: 1\n",
      "9) - Follow the Bitch (1998) - Average Rating: 5.0 - Number of ratings: 1\n",
      "10) - I Am Cuba (Soy Cuba/Ya Kuba) (1964) - Average Rating: 4.8 - Number of ratings: 5\n",
      "11) - One Little Indian (1973) - Average Rating: 5.0 - Number of ratings: 1\n",
      "12) - Lamerica (1994) - Average Rating: 4.75 - Number of ratings: 8\n",
      "13) - Foreign Student (1994) - Average Rating: 3.0 - Number of ratings: 2\n",
      "14) - Sanjuro (1962) - Average Rating: 4.61 - Number of ratings: 69\n",
      "15) - Lured (1947) - Average Rating: 5.0 - Number of ratings: 1\n",
      "16) - Bells, The (1926) - Average Rating: 4.5 - Number of ratings: 2\n",
      "17) - Bittersweet Motel (2000) - Average Rating: 5.0 - Number of ratings: 1\n",
      "18) - Seven Samurai (The Magnificent Seven) (Shichinin no samurai) (1954) - Average Rating: 4.56 - Number of ratings: 628\n",
      "19) - Jar, The (Khomreh) (1992) - Average Rating: 4.0 - Number of ratings: 1\n",
      "20) - For All Mankind (1989) - Average Rating: 4.44 - Number of ratings: 27\n"
     ]
    }
   ],
   "source": [
    "print_reclist(cf20)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Content-based recommended list:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1) - Great Escape, The (1963) - Average Rating: 4.38 - Number of ratings: 696\n",
      "2) - Yojimbo (1961) - Average Rating: 4.4 - Number of ratings: 215\n",
      "3) - North by Northwest (1959) - Average Rating: 4.38 - Number of ratings: 1315\n",
      "4) - Double Indemnity (1944) - Average Rating: 4.42 - Number of ratings: 551\n",
      "5) - Stalag 17 (1953) - Average Rating: 4.23 - Number of ratings: 394\n",
      "6) - It Happened One Night (1934) - Average Rating: 4.28 - Number of ratings: 374\n",
      "7) - Seven Samurai (The Magnificent Seven) (Shichinin no samurai) (1954) - Average Rating: 4.56 - Number of ratings: 628\n",
      "8) - Gladiator (2000) - Average Rating: 4.11 - Number of ratings: 1924\n",
      "9) - Casablanca (1942) - Average Rating: 4.41 - Number of ratings: 1669\n",
      "10) - Third Man, The (1949) - Average Rating: 4.45 - Number of ratings: 480\n",
      "11) - Maltese Falcon, The (1941) - Average Rating: 4.4 - Number of ratings: 1043\n",
      "12) - To Kill a Mockingbird (1962) - Average Rating: 4.43 - Number of ratings: 928\n",
      "13) - Treasure of the Sierra Madre, The (1948) - Average Rating: 4.29 - Number of ratings: 453\n",
      "14) - Everest (1998) - Average Rating: 4.01 - Number of ratings: 167\n",
      "15) - Wrong Trousers, The (1993) - Average Rating: 4.51 - Number of ratings: 882\n",
      "16) - In the Heat of the Night (1967) - Average Rating: 4.13 - Number of ratings: 348\n",
      "17) - Terminator 2: Judgment Day (1991) - Average Rating: 4.06 - Number of ratings: 2649\n",
      "18) - Modern Times (1936) - Average Rating: 4.24 - Number of ratings: 305\n",
      "19) - City Lights (1931) - Average Rating: 4.39 - Number of ratings: 271\n",
      "20) - Terminator, The (1984) - Average Rating: 4.15 - Number of ratings: 2098\n"
     ]
    }
   ],
   "source": [
    "print_reclist(cb20)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a id=\"p6\"></a>\n",
    "## 6. Combining recommended lists\n",
    "\n",
    "Finally, the three lists are combined through interleaved ranking, prioritized in this order: collaborative filtering (CF), content-based (CB), most-popular (MP):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1) - Song of Freedom (1936) - Average Rating: 5.0 - Number of ratings: 1\n",
      "2) - Great Escape, The (1963) - Average Rating: 4.38 - Number of ratings: 696\n",
      "3) - American Beauty (1999) - Average Rating: 4.32 - Number of ratings: 3428\n",
      "4) - Mamma Roma (1962) - Average Rating: 4.5 - Number of ratings: 2\n",
      "5) - Yojimbo (1961) - Average Rating: 4.4 - Number of ratings: 215\n",
      "6) - Star Wars: Episode IV - A New Hope (1977) - Average Rating: 4.45 - Number of ratings: 2991\n",
      "7) - Schlafes Bruder (Brother of Sleep) (1995) - Average Rating: 5.0 - Number of ratings: 1\n",
      "8) - North by Northwest (1959) - Average Rating: 4.38 - Number of ratings: 1315\n",
      "9) - Star Wars: Episode V - The Empire Strikes Back (1980) - Average Rating: 4.29 - Number of ratings: 2990\n",
      "10) - Hour of the Pig, The (1993) - Average Rating: 4.5 - Number of ratings: 2\n",
      "11) - Double Indemnity (1944) - Average Rating: 4.42 - Number of ratings: 551\n",
      "12) - Star Wars: Episode VI - Return of the Jedi (1983) - Average Rating: 4.02 - Number of ratings: 2883\n",
      "13) - Smashing Time (1967) - Average Rating: 5.0 - Number of ratings: 2\n",
      "14) - Stalag 17 (1953) - Average Rating: 4.23 - Number of ratings: 394\n",
      "15) - Saving Private Ryan (1998) - Average Rating: 4.34 - Number of ratings: 2653\n",
      "16) - Gate of Heavenly Peace, The (1995) - Average Rating: 5.0 - Number of ratings: 3\n",
      "17) - It Happened One Night (1934) - Average Rating: 4.28 - Number of ratings: 374\n",
      "18) - Raiders of the Lost Ark (1981) - Average Rating: 4.48 - Number of ratings: 2514\n",
      "19) - Apple, The (Sib) (1998) - Average Rating: 4.67 - Number of ratings: 9\n",
      "20) - Seven Samurai (The Magnificent Seven) (Shichinin no samurai) (1954) - Average Rating: 4.56 - Number of ratings: 628\n"
     ]
    }
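   ],
   "source": [
    "def interleaved_ranking(lst_of_lists, n):\n",
    "    final_list = list()\n",
    "    while len(final_list) < n:\n",
    "        added = False\n",
    "        for lst in lst_of_lists:\n",
    "            # Skip items that another list already contributed\n",
    "            remaining = [m for m in lst if m not in final_list]\n",
    "            if len(remaining) == 0:\n",
    "                continue\n",
    "            final_list.append(remaining[0])\n",
    "            added = True\n",
    "            if len(final_list) == n:\n",
    "                break\n",
    "        # Stop if every list is exhausted, to avoid looping forever\n",
    "        if not added:\n",
    "            break\n",
    "    return final_list\n",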
    "\n",
    "mixed_list=interleaved_ranking([cf20,cb20,pop20],20)\n",
    "print_reclist(mixed_list)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The list now seems a lot more diverse, which is probably a good thing. Such a list can be further diversified following other heuristics, as illustrated in [this other IPython notebook](http://nbviewer.ipython.org/github/david-cortes/datascienceprojects/blob/master/machine_learning/topic_diversification.ipynb), and the lists can be further rotated over time (e.g. starting from rank 10 instead of rank 1) to offer more items to the user.\n",
    "\n",
    "When there is a large degree of intersection between the items from each list, changing the order in which they are prioritized will change not only the relative orderings in the final list, but also the items that end up appearing. This is not the case here though, as there is little intersection between the lists:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1) - American Beauty (1999) - Average Rating: 4.32 - Number of ratings: 3428\n",
      "2) - Song of Freedom (1936) - Average Rating: 5.0 - Number of ratings: 1\n",
      "3) - Great Escape, The (1963) - Average Rating: 4.38 - Number of ratings: 696\n",
      "4) - Star Wars: Episode IV - A New Hope (1977) - Average Rating: 4.45 - Number of ratings: 2991\n",
      "5) - Mamma Roma (1962) - Average Rating: 4.5 - Number of ratings: 2\n",
      "6) - Yojimbo (1961) - Average Rating: 4.4 - Number of ratings: 215\n",
      "7) - Star Wars: Episode V - The Empire Strikes Back (1980) - Average Rating: 4.29 - Number of ratings: 2990\n",
      "8) - Schlafes Bruder (Brother of Sleep) (1995) - Average Rating: 5.0 - Number of ratings: 1\n",
      "9) - North by Northwest (1959) - Average Rating: 4.38 - Number of ratings: 1315\n",
      "10) - Star Wars: Episode VI - Return of the Jedi (1983) - Average Rating: 4.02 - Number of ratings: 2883\n",
      "11) - Hour of the Pig, The (1993) - Average Rating: 4.5 - Number of ratings: 2\n",
      "12) - Double Indemnity (1944) - Average Rating: 4.42 - Number of ratings: 551\n",
      "13) - Saving Private Ryan (1998) - Average Rating: 4.34 - Number of ratings: 2653\n",
      "14) - Smashing Time (1967) - Average Rating: 5.0 - Number of ratings: 2\n",
      "15) - Stalag 17 (1953) - Average Rating: 4.23 - Number of ratings: 394\n",
      "16) - Raiders of the Lost Ark (1981) - Average Rating: 4.48 - Number of ratings: 2514\n",
      "17) - Gate of Heavenly Peace, The (1995) - Average Rating: 5.0 - Number of ratings: 3\n",
      "18) - It Happened One Night (1934) - Average Rating: 4.28 - Number of ratings: 374\n",
      "19) - Silence of the Lambs, The (1991) - Average Rating: 4.35 - Number of ratings: 2578\n",
      "20) - Apple, The (Sib) (1998) - Average Rating: 4.67 - Number of ratings: 9\n"
     ]
    }
   ],
   "source": [
    "mixed_list=interleaved_ranking([pop20,cf20,cb20],20)\n",
    "print_reclist(mixed_list)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Interleaved ranking can also be used as a heuristic to move away from most-popular recommendations, by following the same algorithm as before but removing the items that came from most-popular (this can be extended by letting most-popular choose more than one item at once, among other heuristics). Here is a simple version that forces the list to contain less popular items (in this particular case it's the same as just excluding the most-popular list, as there is pretty much no intersection):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1) - Song of Freedom (1936) - Average Rating: 5.0 - Number of ratings: 1\n",
      "2) - Great Escape, The (1963) - Average Rating: 4.38 - Number of ratings: 696\n",
      "3) - Mamma Roma (1962) - Average Rating: 4.5 - Number of ratings: 2\n",
      "4) - Yojimbo (1961) - Average Rating: 4.4 - Number of ratings: 215\n",
      "5) - Schlafes Bruder (Brother of Sleep) (1995) - Average Rating: 5.0 - Number of ratings: 1\n",
      "6) - North by Northwest (1959) - Average Rating: 4.38 - Number of ratings: 1315\n",
      "7) - Hour of the Pig, The (1993) - Average Rating: 4.5 - Number of ratings: 2\n",
      "8) - Double Indemnity (1944) - Average Rating: 4.42 - Number of ratings: 551\n",
      "9) - Smashing Time (1967) - Average Rating: 5.0 - Number of ratings: 2\n",
      "10) - Stalag 17 (1953) - Average Rating: 4.23 - Number of ratings: 394\n"
     ]
    }
   ],
   "source": [
    "def interleaved_ranking_decreased_popularity(most_popular, lst_of_lists, n):\n",
    "    final_list = list()\n",
    "    excl_list = set()\n",
    "    while len(final_list) < n:\n",
    "        # Each round, exclude the top remaining most-popular item\n",
    "        most_popular = [m for m in most_popular if m not in excl_list]\n",
    "        if len(most_popular) > 0:\n",
    "            excl_list.add(most_popular[0])\n",
    "        added = False\n",
    "        for lst in lst_of_lists:\n",
    "            remaining = [m for m in lst if m not in excl_list]\n",
    "            if len(remaining) == 0:\n",
    "                continue\n",
    "            new = remaining[0]\n",
    "            final_list.append(new)\n",
    "            excl_list.add(new)\n",
    "            added = True\n",
    "            if len(final_list) == n:\n",
    "                break\n",
    "        # Stop once the candidate lists are exhausted\n",
    "        if not added:\n",
    "            break\n",
    "    return final_list\n",
    "\n",
    "mixed_list_dec_pop=interleaved_ranking_decreased_popularity(pop20,[cf20,cb20],10)\n",
    "print_reclist(mixed_list_dec_pop)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}