{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Collaborative filtering with side information\n", "\n", "** *\n", "This IPython notebook illustrates the usage of the [cmfrec](https://github.com/david-cortes/cmfrec) Python package for building recommender systems through different matrix factorization models with or without using information about user and item attributes – for more details see the references at the bottom.\n", "\n", "The example uses the [MovieLens-1M data](https://grouplens.org/datasets/movielens/1m/) which consists of ratings from users about movies + user demographic information, plus the [movie tag genome](https://grouplens.org/datasets/movielens/latest/). Note however that, for implicit-feedback datasets (e.g. item purchases), it's recommended to use different models than the ones shown here (see [documentation](http://cmfrec.readthedocs.io/en/latest/) for details about models in the package aimed at implicit-feedback data).\n", "\n", "**Small note: if the TOC here is not clickable or the math symbols don't show properly, try visualizing this same notebook from nbviewer following [this link](http://nbviewer.jupyter.org/github/david-cortes/cmfrec/blob/master/example/cmfrec_movielens_sideinfo.ipynb).**\n", "## Sections\n", "\n", "\n", "[1. Loading the data](#p1)\n", "\n", "[2. Fitting recommender models](#p2)\n", "\n", "[3. Examining top-N recommended lists](#p3)\n", "\n", "[4. Tuning model parameters](#p4)\n", "\n", "[5. Recommendations for new users](#p5)\n", "\n", "[6. Evaluating models](#p6)\n", "\n", "[7. Adding implicit features and dynamic regularization](#p7)\n", "\n", "[8. References](#p8)\n", "** *" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## 1. Loading the data\n", "\n", "This section uses pre-processed data from the MovieLens datasets joined with external zip codes databases. The script for processing and cleaning the data can be found in another notebook [here](http://nbviewer.jupyter.org/github/david-cortes/cmfrec/blob/master/example/load_data.ipynb)." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import numpy as np, pandas as pd, pickle\n", "\n", "ratings = pickle.load(open(\"ratings.p\", \"rb\"))\n", "item_sideinfo_pca = pickle.load(open(\"item_sideinfo_pca.p\", \"rb\"))\n", "user_side_info = pickle.load(open(\"user_side_info.p\", \"rb\"))\n", "movie_id_to_title = pickle.load(open(\"movie_id_to_title.p\", \"rb\"))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Ratings data" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
UserIdItemIdRating
0111935
116613
219143
3134084
4123555
\n", "
" ], "text/plain": [ " UserId ItemId Rating\n", "0 1 1193 5\n", "1 1 661 3\n", "2 1 914 3\n", "3 1 3408 4\n", "4 1 2355 5" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ratings.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Item attributes (reduced through PCA)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ItemIdpc1pc2pc3pc4pc5pc6pc7pc8pc9...pc41pc42pc43pc44pc45pc46pc47pc48pc49pc50
011.1931712.0856212.6341351.1560880.7216490.9954361.250474-0.7795321.616702...-0.317134-0.070338-0.0195530.1690510.201415-0.094831-0.250461-0.149919-0.031735-0.177708
12-1.3335331.7437961.3521610.795724-0.4841750.3806450.804462-0.5985270.917250...0.300060-0.2619560.0544570.0038630.304605-0.3157960.3602030.1527700.144790-0.096549
23-1.363395-0.0171070.530395-0.3162020.4694300.1646300.0190830.159188-0.232969...0.215020-0.060682-0.2808520.0010870.084960-0.257190-0.136963-0.1139140.128352-0.203658
34-1.237840-0.9937310.809815-0.303009-0.088991-0.049621-0.179544-0.771278-0.400499...0.0662070.056054-0.2230270.4001570.2923000.260936-0.307608-0.2241410.4889550.439189
45-1.611499-0.2518991.126443-0.1357020.4033400.1872890.108451-0.275341-0.261142...0.109560-0.086042-0.2363270.4615890.013350-0.192557-0.234025-0.369643-0.041060-0.074656
\n", "

5 rows × 51 columns

\n", "
" ], "text/plain": [ " ItemId pc1 pc2 pc3 pc4 pc5 pc6 \\\n", "0 1 1.193171 2.085621 2.634135 1.156088 0.721649 0.995436 \n", "1 2 -1.333533 1.743796 1.352161 0.795724 -0.484175 0.380645 \n", "2 3 -1.363395 -0.017107 0.530395 -0.316202 0.469430 0.164630 \n", "3 4 -1.237840 -0.993731 0.809815 -0.303009 -0.088991 -0.049621 \n", "4 5 -1.611499 -0.251899 1.126443 -0.135702 0.403340 0.187289 \n", "\n", " pc7 pc8 pc9 ... pc41 pc42 pc43 pc44 \\\n", "0 1.250474 -0.779532 1.616702 ... -0.317134 -0.070338 -0.019553 0.169051 \n", "1 0.804462 -0.598527 0.917250 ... 0.300060 -0.261956 0.054457 0.003863 \n", "2 0.019083 0.159188 -0.232969 ... 0.215020 -0.060682 -0.280852 0.001087 \n", "3 -0.179544 -0.771278 -0.400499 ... 0.066207 0.056054 -0.223027 0.400157 \n", "4 0.108451 -0.275341 -0.261142 ... 0.109560 -0.086042 -0.236327 0.461589 \n", "\n", " pc45 pc46 pc47 pc48 pc49 pc50 \n", "0 0.201415 -0.094831 -0.250461 -0.149919 -0.031735 -0.177708 \n", "1 0.304605 -0.315796 0.360203 0.152770 0.144790 -0.096549 \n", "2 0.084960 -0.257190 -0.136963 -0.113914 0.128352 -0.203658 \n", "3 0.292300 0.260936 -0.307608 -0.224141 0.488955 0.439189 \n", "4 0.013350 -0.192557 -0.234025 -0.369643 -0.041060 -0.074656 \n", "\n", "[5 rows x 51 columns]" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "item_sideinfo_pca.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### User attributes (one-hot encoded)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
UserIdGender_FGender_MAge_1Age_18Age_25Age_35Age_45Age_50Age_56...Occupation_unemployedOccupation_writerRegion_Middle AtlanticRegion_MidwestRegion_New EnglandRegion_SouthRegion_SouthwestRegion_UnknownOrNonUSRegion_UsOtherRegion_West
01TrueFalseTrueFalseFalseFalseFalseFalseFalse...FalseFalseFalseTrueFalseFalseFalseFalseFalseFalse
12FalseTrueFalseFalseFalseFalseFalseFalseTrue...FalseFalseFalseFalseFalseTrueFalseFalseFalseFalse
23FalseTrueFalseFalseTrueFalseFalseFalseFalse...FalseFalseFalseTrueFalseFalseFalseFalseFalseFalse
34FalseTrueFalseFalseFalseFalseTrueFalseFalse...FalseFalseFalseFalseTrueFalseFalseFalseFalseFalse
45FalseTrueFalseFalseTrueFalseFalseFalseFalse...FalseTrueFalseTrueFalseFalseFalseFalseFalseFalse
\n", "

5 rows × 39 columns

\n", "
" ], "text/plain": [ " UserId Gender_F Gender_M Age_1 Age_18 Age_25 Age_35 Age_45 Age_50 \\\n", "0 1 True False True False False False False False \n", "1 2 False True False False False False False False \n", "2 3 False True False False True False False False \n", "3 4 False True False False False False True False \n", "4 5 False True False False True False False False \n", "\n", " Age_56 ... Occupation_unemployed Occupation_writer \\\n", "0 False ... False False \n", "1 True ... False False \n", "2 False ... False False \n", "3 False ... False False \n", "4 False ... False True \n", "\n", " Region_Middle Atlantic Region_Midwest Region_New England Region_South \\\n", "0 False True False False \n", "1 False False False True \n", "2 False True False False \n", "3 False False True False \n", "4 False True False False \n", "\n", " Region_Southwest Region_UnknownOrNonUS Region_UsOther Region_West \n", "0 False False False False \n", "1 False False False False \n", "2 False False False False \n", "3 False False False False \n", "4 False False False False \n", "\n", "[5 rows x 39 columns]" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "user_side_info.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## 2. Fitting recommender models\n", "\n", "This section fits different recommendation models and then compares the recommendations produced by them." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.1 Classic model\n", "\n", "Usual low-rank matrix factorization model with no user/item attributes:\n", "$$\n", "\\mathbf{X} \\approx \\mathbf{A} \\mathbf{B}^T + \\mu + \\mathbf{b}_A + \\mathbf{b}_B\n", "$$\n", "Where\n", "* $\\mathbf{X}$ is the ratings matrix, in which users are rows, items are columns, and the entries denote the ratings.\n", "* $\\mathbf{A}$ is the user-factors matrix.\n", "* $\\mathbf{B}$ is the item-factors matrix.\n", "* $\\mu$ is the average rating.\n", "* $\\mathbf{b}_A$ are user-specific biases (row vector).\n", "* $\\mathbf{b}_B$ are item-specific biases (column vector).\n", "\n", "(For more details see references at the bottom)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 6.75 s, sys: 1.56 s, total: 8.31 s\n", "Wall time: 592 ms\n" ] }, { "data": { "text/plain": [ "Collective matrix factorization model\n", "(explicit-feedback variant)\n" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "from cmfrec import CMF\n", "\n", "model_no_sideinfo = CMF(method=\"als\", k=40, lambda_=1e+1)\n", "model_no_sideinfo.fit(ratings)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.2 Collective model\n", "\n", "The collective matrix factorization model extends the earlier model by making the user and item factor matrices also be able to make low-rank approximate factorizations of the user and item attributes:\n", "$$\n", "\\mathbf{X} \\approx \\mathbf{A} \\mathbf{B}^T + \\mu + \\mathbf{b}_A + \\mathbf{b}_B\n", ",\\:\\:\\:\\:\n", "\\mathbf{U} \\approx \\mathbf{A} \\mathbf{C}^T + \\mathbf{\\mu}_U\n", ",\\:\\:\\:\\: \\mathbf{I} \\approx \\mathbf{B} \\mathbf{D}^T + \\mathbf{\\mu}_I\n", "$$\n", "\n", "Where\n", "* $\\mathbf{U}$ is the user attributes matrix, in which users are rows and attributes are columns.\n", "* $\\mathbf{I}$ is the item attributes matrix, in which items are rows and attributes are columns.\n", "* $\\mathbf{\\mu}_U$ 
are the column means for the user attributes (column vector).\n", "* $\\mathbf{\\mu}_I$ are the column means for the item attributes (column vector).\n", "* $\\mathbf{C}$ and $\\mathbf{D}$ are attribute-factor matrices (also model parameters).\n", "\n", "**In addition**, this package can also apply sigmoid transformations on the attribute columns which are binary. Note that this requires a different optimization approach which is slower than the ALS (alternating least-squares) method used here." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 11.2 s, sys: 13 s, total: 24.2 s\n", "Wall time: 1.5 s\n" ] }, { "data": { "text/plain": [ "Collective matrix factorization model\n", "(explicit-feedback variant)\n" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "model_with_sideinfo = CMF(method=\"als\", k=40, lambda_=1e+1, w_main=0.5, w_user=0.25, w_item=0.25)\n", "model_with_sideinfo.fit(X=ratings, U=user_side_info, I=item_sideinfo_pca)\n", "\n", "### for the sigmoid transformations:\n", "# model_with_sideinfo = CMF(method=\"lbfgs\", maxiter=0, k=40, lambda_=1e+1, w_main=0.5, w_user=0.25, w_item=0.25)\n", "# model_with_sideinfo.fit(X=ratings, U_bin=user_side_info, I=item_sideinfo_pca)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "_(Note that, since the side info has variables on a different scale, even though the weights sum up to 1, it's still not equivalent to the earlier model w.r.t. the regularization parameter - this type of model also requires more hyperparameter tuning.)_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.3 Content-based model\n", "\n", "This is a model in which the factorizing matrices are constrained to be linear combinations of the user and item attributes, thereby making the recommendations based entirely on side information, with no free parameters for specific users or items:\n", "$$\n", "\\mathbf{X} \\approx (\\mathbf{U} \\mathbf{C}) (\\mathbf{I} \\mathbf{D})^T + \\mu\n", "$$\n", "\n", "_(Note that the movie attributes are not available for all the movies with ratings)_" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 13min 8s, sys: 23min 31s, total: 36min 39s\n", "Wall time: 1min 57s\n" ] }, { "data": { "text/plain": [ "Content-based factorization model\n", "(explicit-feedback)\n" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "from cmfrec import ContentBased\n", "\n", "model_content_based = ContentBased(k=40, maxiter=0, user_bias=False, item_bias=False)\n", "model_content_based.fit(X=ratings.loc[lambda x: x[\"ItemId\"].isin(item_sideinfo_pca[\"ItemId\"])],\n", " U=user_side_info,\n", " I=item_sideinfo_pca.loc[lambda x: x[\"ItemId\"].isin(ratings[\"ItemId\"])])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.4 Non-personalized model\n", "\n", "This is an intercepts-only version of the classic model, which estimates one parameter per user and one parameter per item, and as such produces a simple ranking of the items based on those parameters. It is intended for comparison purposes and can be helpful for checking that the recommendations for different users show some variability (e.g. setting too-large regularization values will tend to make all personalized recommended lists similar to each other)."
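, "\n", "\n", "As a quick check of that variability, one could compare the top-N lists of a few users once a personalized model has been fitted. A rough sketch (it reuses the `topN` method demonstrated later in this notebook, with arbitrarily-chosen user IDs):\n", "\n", "```python\n", "# Rough sketch: how much do the personalized top-10 lists overlap across users?\n", "# If every pair shares close to 10 out of 10 items, the model is likely over-regularized.\n", "some_users = [1, 2, 3, 4, 5]\n", "top10 = {u: set(model_with_sideinfo.topN(user=u, n=10)) for u in some_users}\n", "for u1 in some_users:\n", "    for u2 in some_users:\n", "        if u1 < u2:\n", "            n_shared = len(top10[u1] & top10[u2])\n", "            print('Users %d and %d share %d of their top-10 items' % (u1, u2, n_shared))\n", "```"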
] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 304 ms, sys: 800 ms, total: 1.1 s\n", "Wall time: 105 ms\n" ] }, { "data": { "text/plain": [ "Most-Popular recommendation model\n", "(explicit-feedback variant)\n" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%time\n", "from cmfrec import MostPopular\n", "\n", "model_non_personalized = MostPopular(user_bias=True, implicit=False)\n", "model_non_personalized.fit(ratings)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## 3. Examining top-N recommended lists\n", "\n", "This section will examine what would each model recommend to the user with ID 948.\n", "\n", "This is the demographic information for the user:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
947
UserId948
Gender_MTrue
Age_56True
Occupation_programmerTrue
Region_MidwestTrue
\n", "
" ], "text/plain": [ " 947\n", "UserId 948\n", "Gender_M True\n", "Age_56 True\n", "Occupation_programmer True\n", "Region_Midwest True" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "user_side_info.loc[user_side_info[\"UserId\"] == 948].T.where(lambda x: x > 0).dropna()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These are the highest-rated movies from the user:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
UserIdItemIdRatingMovie
14672194837895Pawnbroker, The (1965)
14688994826655Earth Vs. the Flying Saucers (1956)
14687194826405Superman (1978)
14687294826415Superman II (1980)
14710594827615Iron Giant, The (1999)
14687594826445Dracula (1931)
14687894826485Frankenstein (1931)
1470979481019520,000 Leagues Under the Sea (1954)
14688194826575Rocky Horror Picture Show, The (1975)
14688494826605Thing From Another World, The (1951)
\n", "
" ], "text/plain": [ " UserId ItemId Rating Movie\n", "146721 948 3789 5 Pawnbroker, The (1965)\n", "146889 948 2665 5 Earth Vs. the Flying Saucers (1956)\n", "146871 948 2640 5 Superman (1978)\n", "146872 948 2641 5 Superman II (1980)\n", "147105 948 2761 5 Iron Giant, The (1999)\n", "146875 948 2644 5 Dracula (1931)\n", "146878 948 2648 5 Frankenstein (1931)\n", "147097 948 1019 5 20,000 Leagues Under the Sea (1954)\n", "146881 948 2657 5 Rocky Horror Picture Show, The (1975)\n", "146884 948 2660 5 Thing From Another World, The (1951)" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(\n", " ratings\n", " .loc[lambda x: x[\"UserId\"] == 948]\n", " .sort_values(\"Rating\", ascending=False)\n", " .assign(Movie=lambda x: x[\"ItemId\"].map(movie_id_to_title))\n", " .head(10)\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These are the lowest-rated movies from the user:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
UserIdItemIdRatingMovie
14723794812471Graduate, The (1967)
147173948701From Dusk Till Dawn (1996)
1467689487481Arrival, The (1996)
147135948451To Die For (1995)
1468129487801Independence Day (ID4) (1996)
1468139487881Nutty Professor, The (1996)
14681494832011Five Easy Pieces (1970)
1471189483561Forrest Gump (1994)
14682194830701Adventures of Buckaroo Bonzai Across the 8th D...
14682294816171L.A. Confidential (1997)
\n", "
" ], "text/plain": [ " UserId ItemId Rating \\\n", "147237 948 1247 1 \n", "147173 948 70 1 \n", "146768 948 748 1 \n", "147135 948 45 1 \n", "146812 948 780 1 \n", "146813 948 788 1 \n", "146814 948 3201 1 \n", "147118 948 356 1 \n", "146821 948 3070 1 \n", "146822 948 1617 1 \n", "\n", " Movie \n", "147237 Graduate, The (1967) \n", "147173 From Dusk Till Dawn (1996) \n", "146768 Arrival, The (1996) \n", "147135 To Die For (1995) \n", "146812 Independence Day (ID4) (1996) \n", "146813 Nutty Professor, The (1996) \n", "146814 Five Easy Pieces (1970) \n", "147118 Forrest Gump (1994) \n", "146821 Adventures of Buckaroo Bonzai Across the 8th D... \n", "146822 L.A. Confidential (1997) " ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(\n", " ratings\n", " .loc[lambda x: x[\"UserId\"] == 948]\n", " .sort_values(\"Rating\", ascending=True)\n", " .assign(Movie=lambda x: x[\"ItemId\"].map(movie_id_to_title))\n", " .head(10)\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now producing recommendations from each model:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "### Will exclude already-seen movies\n", "exclude = ratings[\"ItemId\"].loc[ratings[\"UserId\"] == 948]\n", "exclude_cb = exclude.loc[lambda x: x.isin(item_sideinfo_pca[\"ItemId\"])]\n", "\n", "### Recommended lists with those excluded\n", "recommended_non_personalized = model_non_personalized.topN(user=948, n=10, exclude=exclude)\n", "recommended_no_side_info = model_no_sideinfo.topN(user=948, n=10, exclude=exclude)\n", "recommended_with_side_info = model_with_sideinfo.topN(user=948, n=10, exclude=exclude)\n", "recommended_content_based = model_content_based.topN(user=948, n=10, exclude=exclude_cb)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([2019, 318, 2905, 745, 1148, 1212, 3435, 923, 720, 3307])" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "recommended_non_personalized" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A handy function to print top-N recommended lists with associated information:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Recommended from non-personalized model\n", "1) - Seven Samurai (The Magnificent Seven) (Shichinin no samurai) (1954) - Average Rating: 4.56 - Number of ratings: 628\n", "2) - Shawshank Redemption, The (1994) - Average Rating: 4.55 - Number of ratings: 2227\n", "3) - Sanjuro (1962) - Average Rating: 4.61 - Number of ratings: 69\n", "4) - Close Shave, A (1995) - Average Rating: 4.52 - Number of ratings: 657\n", "5) - Wrong Trousers, The (1993) - Average Rating: 4.51 - Number of ratings: 882\n", "6) - Third Man, The (1949) - Average Rating: 4.45 - Number of ratings: 480\n", "7) - Double Indemnity (1944) - Average Rating: 4.42 - Number of ratings: 551\n", "8) - Citizen Kane (1941) - Average Rating: 4.39 - Number of ratings: 1116\n", "9) - Wallace & Gromit: The Best of Aardman Animation (1996) - Average Rating: 4.43 - Number of ratings: 438\n", "10) - City Lights (1931) - Average Rating: 4.39 - Number of ratings: 271\n", "----------------\n", "Recommended from ratings-only model\n", "1) - Arsenic and Old Lace (1944) - Average Rating: 4.17 - Number of ratings: 672\n", "2) - Beauty and the Beast (1991) - Average Rating: 3.89 - Number of ratings: 1060\n", "3) - Nosferatu 
(Nosferatu, eine Symphonie des Grauens) (1922) - Average Rating: 3.99 - Number of ratings: 238\n", "4) - It's a Wonderful Life (1946) - Average Rating: 4.3 - Number of ratings: 729\n", "5) - Invasion of the Body Snatchers (1956) - Average Rating: 3.91 - Number of ratings: 628\n", "6) - Hurricane, The (1999) - Average Rating: 3.85 - Number of ratings: 509\n", "7) - Contender, The (2000) - Average Rating: 3.78 - Number of ratings: 388\n", "8) - Wolf Man, The (1941) - Average Rating: 3.76 - Number of ratings: 134\n", "9) - Apostle, The (1997) - Average Rating: 3.73 - Number of ratings: 471\n", "10) - Mummy, The (1932) - Average Rating: 3.54 - Number of ratings: 162\n", "----------------\n", "Recommended from attributes-only model\n", "1) - Shawshank Redemption, The (1994) - Average Rating: 4.55 - Number of ratings: 2227\n", "2) - Third Man, The (1949) - Average Rating: 4.45 - Number of ratings: 480\n", "3) - City Lights (1931) - Average Rating: 4.39 - Number of ratings: 271\n", "4) - Jean de Florette (1986) - Average Rating: 4.32 - Number of ratings: 216\n", "5) - It Happened One Night (1934) - Average Rating: 4.28 - Number of ratings: 374\n", "6) - Central Station (Central do Brasil) (1998) - Average Rating: 4.28 - Number of ratings: 215\n", "7) - Man Who Would Be King, The (1975) - Average Rating: 4.13 - Number of ratings: 310\n", "8) - Best Years of Our Lives, The (1946) - Average Rating: 4.12 - Number of ratings: 236\n", "9) - Double Indemnity (1944) - Average Rating: 4.42 - Number of ratings: 551\n", "10) - In the Heat of the Night (1967) - Average Rating: 4.13 - Number of ratings: 348\n", "----------------\n", "Recommended from hybrid model\n", "1) - It's a Wonderful Life (1946) - Average Rating: 4.3 - Number of ratings: 729\n", "2) - Nosferatu (Nosferatu, eine Symphonie des Grauens) (1922) - Average Rating: 3.99 - Number of ratings: 238\n", "3) - Beauty and the Beast (1991) - Average Rating: 3.89 - Number of ratings: 1060\n", "4) - Arsenic and Old Lace (1944) - Average Rating: 4.17 - Number of ratings: 672\n", "5) - Invasion of the Body Snatchers (1956) - Average Rating: 3.91 - Number of ratings: 628\n", "6) - Mr. 
Smith Goes to Washington (1939) - Average Rating: 4.24 - Number of ratings: 383\n", "7) - Life Is Beautiful (La Vita è bella) (1997) - Average Rating: 4.33 - Number of ratings: 1152\n", "8) - Gold Rush, The (1925) - Average Rating: 4.19 - Number of ratings: 275\n", "9) - Bride of Frankenstein (1935) - Average Rating: 3.91 - Number of ratings: 216\n", "10) - Seven Samurai (The Magnificent Seven) (Shichinin no samurai) (1954) - Average Rating: 4.56 - Number of ratings: 628\n" ] } ], "source": [ "from collections import defaultdict\n", "\n", "# aggregate statistics\n", "avg_movie_rating = defaultdict(lambda: 0)\n", "num_ratings_per_movie = defaultdict(lambda: 0)\n", "for i in ratings.groupby('ItemId')['Rating'].mean().to_frame().itertuples():\n", " avg_movie_rating[i.Index] = i.Rating\n", "for i in ratings.groupby('ItemId')['Rating'].agg(lambda x: len(tuple(x))).to_frame().itertuples():\n", " num_ratings_per_movie[i.Index] = i.Rating\n", "\n", "# function to print recommended lists more nicely\n", "def print_reclist(reclist):\n", " list_w_info = [str(m + 1) + \") - \" + movie_id_to_title[reclist[m]] +\\\n", " \" - Average Rating: \" + str(np.round(avg_movie_rating[reclist[m]], 2))+\\\n", " \" - Number of ratings: \" + str(num_ratings_per_movie[reclist[m]])\\\n", " for m in range(len(reclist))]\n", " print(\"\\n\".join(list_w_info))\n", " \n", "print(\"Recommended from non-personalized model\")\n", "print_reclist(recommended_non_personalized)\n", "print(\"----------------\")\n", "print(\"Recommended from ratings-only model\")\n", "print_reclist(recommended_no_side_info)\n", "print(\"----------------\")\n", "print(\"Recommended from attributes-only model\")\n", "print_reclist(recommended_content_based)\n", "print(\"----------------\")\n", "print(\"Recommended from hybrid model\")\n", "print_reclist(recommended_with_side_info)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(As can be seen, the personalized recommendations tend to recommend very old movies, which is what this user seems to rate highly, with no overlap with the non-personalized recommendations)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## 4. Tuning model parameters\n", "\n", "The models here offer many tuneable parameters which can be tweaked in order to alter the recommended lists in some way. For example, setting a low regularization to the item biases will tend to favor movies with a high average rating regardless of the number of ratings, while setting a high regularization for the factorizing matrices will tend to produce the same recommendations for all users." 
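, "\n", "\n", "The cells below pass `lambda_` as a list with six entries so as to regularize each matrix differently. As a rough guide (the order shown here is taken from the cmfrec documentation, so it is worth double-checking against the installed version):\n", "\n", "```python\n", "# Assumed order of the six regularization values (per the cmfrec docs):\n", "# [user biases, item biases, A (user factors), B (item factors),\n", "#  C (user-attribute factors), D (item-attribute factors)]\n", "lambda_per_matrix = [1e+3, 1e+1, 1e+2, 1e+2, 1e+2, 1e+2]\n", "less_personalized = CMF(lambda_=lambda_per_matrix)  # fit as in the cells below\n", "```"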
] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1) - Seven Samurai (The Magnificent Seven) (Shichinin no samurai) (1954) - Average Rating: 4.56 - Number of ratings: 628\n", "2) - Shawshank Redemption, The (1994) - Average Rating: 4.55 - Number of ratings: 2227\n", "3) - Close Shave, A (1995) - Average Rating: 4.52 - Number of ratings: 657\n", "4) - Wrong Trousers, The (1993) - Average Rating: 4.51 - Number of ratings: 882\n", "5) - Sanjuro (1962) - Average Rating: 4.61 - Number of ratings: 69\n", "6) - Third Man, The (1949) - Average Rating: 4.45 - Number of ratings: 480\n", "7) - Double Indemnity (1944) - Average Rating: 4.42 - Number of ratings: 551\n", "8) - Wallace & Gromit: The Best of Aardman Animation (1996) - Average Rating: 4.43 - Number of ratings: 438\n", "9) - Citizen Kane (1941) - Average Rating: 4.39 - Number of ratings: 1116\n", "10) - City Lights (1931) - Average Rating: 4.39 - Number of ratings: 271\n" ] } ], "source": [ "### Less personalized (underfitted)\n", "reclist = \\\n", " CMF(lambda_=[1e+3, 1e+1, 1e+2, 1e+2, 1e+2, 1e+2])\\\n", " .fit(ratings)\\\n", " .topN(user=948, n=10, exclude=exclude)\n", "print_reclist(reclist)" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1) - Plan 9 from Outer Space (1958) - Average Rating: 2.63 - Number of ratings: 249\n", "2) - East-West (Est-ouest) (1999) - Average Rating: 3.77 - Number of ratings: 103\n", "3) - Rugrats Movie, The (1998) - Average Rating: 2.78 - Number of ratings: 141\n", "4) - Taste of Cherry (1997) - Average Rating: 3.53 - Number of ratings: 32\n", "5) - Julien Donkey-Boy (1999) - Average Rating: 3.33 - Number of ratings: 12\n", "6) - Original Kings of Comedy, The (2000) - Average Rating: 3.23 - Number of ratings: 147\n", "7) - Maya Lin: A Strong Clear Vision (1994) - Average Rating: 4.1 - Number of ratings: 59\n", "8) - Double Life of Veronique, The (La Double Vie de Véronique) (1991) - Average Rating: 3.94 - Number of ratings: 129\n", "9) - Crash (1996) - Average Rating: 2.76 - Number of ratings: 141\n", "10) - Faraway, So Close (In Weiter Ferne, So Nah!) (1993) - Average Rating: 3.71 - Number of ratings: 66\n" ] } ], "source": [ "### More personalized (overfitted)\n", "reclist = \\\n", " CMF(lambda_=[0., 1e+3, 1e-1, 1e-1, 1e-1, 1e-1])\\\n", " .fit(ratings)\\\n", " .topN(user=948, n=10, exclude=exclude)\n", "print_reclist(reclist)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The collective model can also have variations such as weighting each factorization differently, or setting components (factors) that are not to be shared between factorizations (not shown)." 
] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1) - Wrong Trousers, The (1993) - Average Rating: 4.51 - Number of ratings: 882\n", "2) - Willy Wonka and the Chocolate Factory (1971) - Average Rating: 3.86 - Number of ratings: 1313\n", "3) - Seven Samurai (The Magnificent Seven) (Shichinin no samurai) (1954) - Average Rating: 4.56 - Number of ratings: 628\n", "4) - It's a Wonderful Life (1946) - Average Rating: 4.3 - Number of ratings: 729\n", "5) - Third Man, The (1949) - Average Rating: 4.45 - Number of ratings: 480\n", "6) - Close Shave, A (1995) - Average Rating: 4.52 - Number of ratings: 657\n", "7) - Grand Day Out, A (1992) - Average Rating: 4.36 - Number of ratings: 473\n", "8) - Citizen Kane (1941) - Average Rating: 4.39 - Number of ratings: 1116\n", "9) - Singin' in the Rain (1952) - Average Rating: 4.28 - Number of ratings: 751\n", "10) - Rebecca (1940) - Average Rating: 4.2 - Number of ratings: 386\n" ] } ], "source": [ "### More oriented towards content-based than towards collaborative-filtering\n", "reclist = \\\n", " CMF(k=40, w_main=0.5, w_item=3., w_user=5., lambda_=1e+1)\\\n", " .fit(ratings, U=user_side_info, I=item_sideinfo_pca)\\\n", " .topN(user=948, n=10, exclude=exclude)\n", "print_reclist(reclist)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## 5. Recommendations for new users\n", "\n", "Models can also be used to make recommendations for new users based on ratings and/or side information.\n", "\n", "_(Be aware that, due to the nature of computer floating point aithmetic, there might be some slight discrepancies between the results from `topN` and `topN_warm`)_" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1) - It's a Wonderful Life (1946) - Average Rating: 4.3 - Number of ratings: 729\n", "2) - Nosferatu (Nosferatu, eine Symphonie des Grauens) (1922) - Average Rating: 3.99 - Number of ratings: 238\n", "3) - Beauty and the Beast (1991) - Average Rating: 3.89 - Number of ratings: 1060\n", "4) - Arsenic and Old Lace (1944) - Average Rating: 4.17 - Number of ratings: 672\n", "5) - Invasion of the Body Snatchers (1956) - Average Rating: 3.91 - Number of ratings: 628\n", "6) - Mr. 
Smith Goes to Washington (1939) - Average Rating: 4.24 - Number of ratings: 383\n", "7) - Life Is Beautiful (La Vita è bella) (1997) - Average Rating: 4.33 - Number of ratings: 1152\n", "8) - Gold Rush, The (1925) - Average Rating: 4.19 - Number of ratings: 275\n", "9) - Bride of Frankenstein (1935) - Average Rating: 3.91 - Number of ratings: 216\n", "10) - Seven Samurai (The Magnificent Seven) (Shichinin no samurai) (1954) - Average Rating: 4.56 - Number of ratings: 628\n" ] } ], "source": [ "print_reclist(model_with_sideinfo.topN_warm(X_col=ratings[\"ItemId\"].loc[ratings[\"UserId\"] == 948],\n", " X_val=ratings[\"Rating\"].loc[ratings[\"UserId\"] == 948],\n", " exclude=exclude))" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1) - It's a Wonderful Life (1946) - Average Rating: 4.3 - Number of ratings: 729\n", "2) - Nosferatu (Nosferatu, eine Symphonie des Grauens) (1922) - Average Rating: 3.99 - Number of ratings: 238\n", "3) - Beauty and the Beast (1991) - Average Rating: 3.89 - Number of ratings: 1060\n", "4) - Arsenic and Old Lace (1944) - Average Rating: 4.17 - Number of ratings: 672\n", "5) - Invasion of the Body Snatchers (1956) - Average Rating: 3.91 - Number of ratings: 628\n", "6) - Mr. Smith Goes to Washington (1939) - Average Rating: 4.24 - Number of ratings: 383\n", "7) - Life Is Beautiful (La Vita è bella) (1997) - Average Rating: 4.33 - Number of ratings: 1152\n", "8) - Gold Rush, The (1925) - Average Rating: 4.19 - Number of ratings: 275\n", "9) - Bride of Frankenstein (1935) - Average Rating: 3.91 - Number of ratings: 216\n", "10) - Seven Samurai (The Magnificent Seven) (Shichinin no samurai) (1954) - Average Rating: 4.56 - Number of ratings: 628\n" ] } ], "source": [ "print_reclist(model_with_sideinfo.topN_warm(X_col=ratings[\"ItemId\"].loc[ratings[\"UserId\"] == 948],\n", " X_val=ratings[\"Rating\"].loc[ratings[\"UserId\"] == 948],\n", " U=user_side_info.loc[lambda x: x[\"UserId\"] == 948],\n", " exclude=exclude))" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1) - Shawshank Redemption, The (1994) - Average Rating: 4.55 - Number of ratings: 2227\n", "2) - Seven Samurai (The Magnificent Seven) (Shichinin no samurai) (1954) - Average Rating: 4.56 - Number of ratings: 628\n", "3) - Wrong Trousers, The (1993) - Average Rating: 4.51 - Number of ratings: 882\n", "4) - Close Shave, A (1995) - Average Rating: 4.52 - Number of ratings: 657\n", "5) - Sanjuro (1962) - Average Rating: 4.61 - Number of ratings: 69\n", "6) - Wallace & Gromit: The Best of Aardman Animation (1996) - Average Rating: 4.43 - Number of ratings: 438\n", "7) - Double Indemnity (1944) - Average Rating: 4.42 - Number of ratings: 551\n", "8) - Third Man, The (1949) - Average Rating: 4.45 - Number of ratings: 480\n", "9) - Life Is Beautiful (La Vita è bella) (1997) - Average Rating: 4.33 - Number of ratings: 1152\n", "10) - Grand Day Out, A (1992) - Average Rating: 4.36 - Number of ratings: 473\n" ] } ], "source": [ "print_reclist(\n", " model_with_sideinfo.topN_cold(\n", " U=user_side_info.loc[lambda x: x[\"UserId\"] == 948].drop(\"UserId\", axis=1),\n", " exclude=exclude\n", " )\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This last one is very similar to the non-personalized recommended list - that is, the user side information had very little leverage in the model, at least for that user - in this regard, 
the content-based model tends to be better at cold-start recommendations:" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1) - Shawshank Redemption, The (1994) - Average Rating: 4.55 - Number of ratings: 2227\n", "2) - Third Man, The (1949) - Average Rating: 4.45 - Number of ratings: 480\n", "3) - City Lights (1931) - Average Rating: 4.39 - Number of ratings: 271\n", "4) - Jean de Florette (1986) - Average Rating: 4.32 - Number of ratings: 216\n", "5) - It Happened One Night (1934) - Average Rating: 4.28 - Number of ratings: 374\n", "6) - Central Station (Central do Brasil) (1998) - Average Rating: 4.28 - Number of ratings: 215\n", "7) - Man Who Would Be King, The (1975) - Average Rating: 4.13 - Number of ratings: 310\n", "8) - Best Years of Our Lives, The (1946) - Average Rating: 4.12 - Number of ratings: 236\n", "9) - Double Indemnity (1944) - Average Rating: 4.42 - Number of ratings: 551\n", "10) - In the Heat of the Night (1967) - Average Rating: 4.13 - Number of ratings: 348\n" ] } ], "source": [ "print_reclist(\n", " model_content_based.topN_cold(\n", " U=user_side_info.loc[lambda x: x[\"UserId\"] == 948].drop(\"UserId\", axis=1),\n", " exclude=exclude_cb\n", " )\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "_(For this use-case, would also be better to add item biases to the content-based model though)_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## 6. Evaluating models\n", "\n", "This section shows usage of the `predict` family of functions for getting the predicted rating for a given user and item, in order to calculate evaluation metrics such as RMSE and tune model parameters.\n", "\n", "**Note that, while widely used in earlier literature, RMSE might not provide a good overview of the ranking of items (which is what matters for recommendations), and it's recommended to also evaluate other metrics such as NDCG@K, P@K, correlations, etc.**\n", "\n", "**Also be aware that there is a different class `CMF_implicit` which might perform better at implicit-feedback metrics such as P@K.**\n", "\n", "When making recommendations, there's quite a difference between making predictions based on ratings data or based on side information alone. In this regard, one can classify prediction types into 4 types:\n", "1. Predictions for users and items which were both in the training data.\n", "2. Predictions for users which were in the training data and items which were not in the training data.\n", "3. Predictions for users which were not in the training data and items which were in the training data.\n", "4. Predictions for users and items, of which neither were in the training data.\n", "\n", "(One could sub-divide further according to users/items which were present in the training data with only ratings or with only side information, but this notebook will not go into that level of detail)\n", "\n", "The classic model is only able to make predictions for the first case, while the collective model can leverage the side information in order to make predictions for (2) and (3). In theory, it could also do (4), but this is not recommended and the API does not provide such functionality.\n", "\n", "The content-based model, on the other hand, is an ideal approach for case (4). 
The package also provides a different model (the \"offsets\" model - see references at the bottom) aimed at improving cases (2) and (3) when there is side information about only user or only about items at the expense of case (1), but such models are not shown in this notebook." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "** *\n", "Producing a training and test set split of the ratings and side information:" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of ratings in training data: 512972\n", "Number of ratings in test data type (1): 128221\n", "Number of ratings in test data type (2): 154507\n", "Number of ratings in test data type (3): 139009\n", "Number of ratings in test data type (4): 36774\n" ] } ], "source": [ "from sklearn.model_selection import train_test_split\n", "\n", "users_train, users_test = train_test_split(ratings[\"UserId\"].unique(), test_size=0.2, random_state=1)\n", "items_train, items_test = train_test_split(ratings[\"ItemId\"].unique(), test_size=0.2, random_state=2)\n", "\n", "ratings_train, ratings_test1 = train_test_split(ratings.loc[ratings[\"UserId\"].isin(users_train) &\n", " ratings[\"ItemId\"].isin(items_train)],\n", " test_size=0.2, random_state=123)\n", "users_train = ratings_train[\"UserId\"].unique()\n", "items_train = ratings_train[\"ItemId\"].unique()\n", "ratings_test1 = ratings_test1.loc[ratings_test1[\"UserId\"].isin(users_train) &\n", " ratings_test1[\"ItemId\"].isin(items_train)]\n", "\n", "user_attr_train = user_side_info.loc[lambda x: x[\"UserId\"].isin(users_train)]\n", "item_attr_train = item_sideinfo_pca.loc[lambda x: x[\"ItemId\"].isin(items_train)]\n", "\n", "ratings_test2 = ratings.loc[ratings[\"UserId\"].isin(users_train) &\n", " ~ratings[\"ItemId\"].isin(items_train) &\n", " ratings[\"ItemId\"].isin(item_sideinfo_pca[\"ItemId\"])]\n", "ratings_test3 = ratings.loc[~ratings[\"UserId\"].isin(users_train) &\n", " ratings[\"ItemId\"].isin(items_train) &\n", " ratings[\"UserId\"].isin(user_side_info[\"UserId\"]) &\n", " ratings[\"ItemId\"].isin(item_sideinfo_pca[\"ItemId\"])]\n", "ratings_test4 = ratings.loc[~ratings[\"UserId\"].isin(users_train) &\n", " ~ratings[\"ItemId\"].isin(items_train) &\n", " ratings[\"UserId\"].isin(user_side_info[\"UserId\"]) &\n", " ratings[\"ItemId\"].isin(item_sideinfo_pca[\"ItemId\"])]\n", "\n", "\n", "print(\"Number of ratings in training data: %d\" % ratings_train.shape[0])\n", "print(\"Number of ratings in test data type (1): %d\" % ratings_test1.shape[0])\n", "print(\"Number of ratings in test data type (2): %d\" % ratings_test2.shape[0])\n", "print(\"Number of ratings in test data type (3): %d\" % ratings_test3.shape[0])\n", "print(\"Number of ratings in test data type (4): %d\" % ratings_test4.shape[0])" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "### Handy usage of Pandas indexing\n", "user_attr_test = user_side_info.set_index(\"UserId\")\n", "item_attr_test = item_sideinfo_pca.set_index(\"ItemId\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Re-fitting earlier models to the training subset of the earlier data:" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "m_classic = CMF(k=40)\\\n", " .fit(ratings_train)\n", "m_collective = CMF(k=40, w_main=0.5, w_user=0.5, w_item=0.5)\\\n", " .fit(X=ratings_train,\n", " U=user_attr_train,\n", " I=item_attr_train)\n", "m_contentbased = 
ContentBased(k=40, user_bias=False, item_bias=False)\\\n", " .fit(X=ratings_train.loc[ratings_train[\"UserId\"].isin(user_attr_train[\"UserId\"]) &\n", " ratings_train[\"ItemId\"].isin(item_attr_train[\"ItemId\"])],\n", " U=user_attr_train,\n", " I=item_attr_train)\n", "m_mostpopular = MostPopular(user_bias=True)\\\n", " .fit(X=ratings_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "RMSE for users and items which were both in the training data:" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "RMSE type 1 non-personalized model: 0.911 [rho: 0.580]\n", "RMSE type 1 ratings-only model: 0.896 [rho: 0.603]\n", "RMSE type 1 hybrid model: 0.861 [rho: 0.640]\n", "RMSE type 1 content-based model: 0.975 [rho: 0.487]\n" ] } ], "source": [ "from sklearn.metrics import mean_squared_error\n", "\n", "pred_nonpersonalized = m_mostpopular.predict(ratings_test1[\"UserId\"], ratings_test1[\"ItemId\"])\n", "print(\"RMSE type 1 non-personalized model: %.3f [rho: %.3f]\" %\n", " (np.sqrt(mean_squared_error(ratings_test1[\"Rating\"],\n", " pred_nonpersonalized,\n", " squared=True)),\n", " np.corrcoef(ratings_test1[\"Rating\"], pred_nonpersonalized)[0,1]))\n", "\n", "pred_ratingsonly = m_classic.predict(ratings_test1[\"UserId\"], ratings_test1[\"ItemId\"])\n", "print(\"RMSE type 1 ratings-only model: %.3f [rho: %.3f]\" %\n", " (np.sqrt(mean_squared_error(ratings_test1[\"Rating\"],\n", " pred_ratingsonly,\n", " squared=True)),\n", " np.corrcoef(ratings_test1[\"Rating\"], pred_ratingsonly)[0,1]))\n", "\n", "pred_hybrid = m_collective.predict(ratings_test1[\"UserId\"], ratings_test1[\"ItemId\"])\n", "print(\"RMSE type 1 hybrid model: %.3f [rho: %.3f]\" %\n", " (np.sqrt(mean_squared_error(ratings_test1[\"Rating\"],\n", " pred_hybrid,\n", " squared=True)),\n", " np.corrcoef(ratings_test1[\"Rating\"], pred_hybrid)[0,1]))\n", "\n", "test_cb = ratings_test1.loc[ratings_test1[\"UserId\"].isin(user_attr_train[\"UserId\"]) &\n", " ratings_test1[\"ItemId\"].isin(item_attr_train[\"ItemId\"])]\n", "pred_contentbased = m_contentbased.predict(test_cb[\"UserId\"], test_cb[\"ItemId\"])\n", "print(\"RMSE type 1 content-based model: %.3f [rho: %.3f]\" %\n", " (np.sqrt(mean_squared_error(test_cb[\"Rating\"],\n", " pred_contentbased,\n", " squared=True)),\n", " np.corrcoef(test_cb[\"Rating\"], pred_contentbased)[0,1]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "RMSE for users which were in the training data but items which were not:" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "RMSE type 2 hybrid model: 1.025 [rho: 0.424]\n", "RMSE type 2 content-based model: 0.977 [rho: 0.486]\n" ] } ], "source": [ "pred_hybrid = m_collective.predict_new(ratings_test2[\"UserId\"],\n", " item_attr_test.loc[ratings_test2[\"ItemId\"]])\n", "print(\"RMSE type 2 hybrid model: %.3f [rho: %.3f]\" %\n", " (np.sqrt(mean_squared_error(ratings_test2[\"Rating\"],\n", " pred_hybrid,\n", " squared=True)),\n", " np.corrcoef(ratings_test2[\"Rating\"], pred_hybrid)[0,1]))\n", "\n", "pred_contentbased = m_contentbased.predict_new(user_attr_test.loc[ratings_test2[\"UserId\"]],\n", " item_attr_test.loc[ratings_test2[\"ItemId\"]])\n", "print(\"RMSE type 2 content-based model: %.3f [rho: %.3f]\" %\n", " (np.sqrt(mean_squared_error(ratings_test2[\"Rating\"],\n", " pred_contentbased,\n", " squared=True)),\n", " np.corrcoef(ratings_test2[\"Rating\"], 
pred_contentbased)[0,1]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "RMSE for items which were in the training data but users which were not:" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "RMSE type 3 hybrid model: 0.988 [rho: 0.470]\n", "RMSE type 3 content-based model: 0.981 [rho: 0.468]\n" ] } ], "source": [ "pred_hybrid = m_collective.predict_cold_multiple(item=ratings_test3[\"ItemId\"],\n", " U=user_attr_test.loc[ratings_test3[\"UserId\"]])\n", "print(\"RMSE type 3 hybrid model: %.3f [rho: %.3f]\" %\n", " (np.sqrt(mean_squared_error(ratings_test3[\"Rating\"],\n", " pred_hybrid,\n", " squared=True)),\n", " np.corrcoef(ratings_test3[\"Rating\"], pred_hybrid)[0,1]))\n", "\n", "pred_contentbased = m_contentbased.predict_new(user_attr_test.loc[ratings_test3[\"UserId\"]],\n", " item_attr_test.loc[ratings_test3[\"ItemId\"]])\n", "print(\"RMSE type 3 content-based model: %.3f [rho: %.3f]\" %\n", " (np.sqrt(mean_squared_error(ratings_test3[\"Rating\"],\n", " pred_contentbased,\n", " squared=True)),\n", " np.corrcoef(ratings_test3[\"Rating\"], pred_contentbased)[0,1]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "RMSE for users and items which were not in the training data:" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "RMSE type 4 content-based model: 0.986 [rho: 0.464]\n" ] } ], "source": [ "pred_contentbased = m_contentbased.predict_new(user_attr_test.loc[ratings_test4[\"UserId\"]],\n", " item_attr_test.loc[ratings_test4[\"ItemId\"]])\n", "print(\"RMSE type 4 content-based model: %.3f [rho: %.3f]\" %\n", " (np.sqrt(mean_squared_error(ratings_test4[\"Rating\"],\n", " pred_contentbased,\n", " squared=True)),\n", " np.corrcoef(ratings_test4[\"Rating\"], pred_contentbased)[0,1]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## 7. Adding implicit features and dynamic regularization\n", "\n", "In addition to external side information about the users and items, one can also generate features from the same $\\mathbf{X}$ data by considering which movies a user rated and which ones didn't - these are taken as binary features, with the zeros being counted towards the loss/objective function.\n", "\n", "The package offers an easy option for automatically generating these features on-the-fly, which can then be used in addition to the external features. 
The full model now becomes:\n", "$$\n", "\\mathbf{X} \\approx \\mathbf{A} \\mathbf{B}^T + \\mu + \\mathbf{b}_A + \\mathbf{b}_B\n", "$$\n", "$$\n", "\\mathbf{I}_x \\approx \\mathbf{A} \\mathbf{B}_i^T, \\:\\: \\mathbf{I}_x^T \\approx \\mathbf{B} \\mathbf{A}_i^T\n", "$$\n", "$$\n", "\\mathbf{U} \\approx \\mathbf{A} \\mathbf{C}^T + \\mathbf{\\mu}_U\n", ",\\:\\:\\:\\: \\mathbf{I} \\approx \\mathbf{B} \\mathbf{D}^T + \\mathbf{\\mu}_I\n", "$$\n", "\n", "Where:\n", "* $\\mathbf{I}_x$ is a binary matrix having a 1 at position ${i,j}$ if $x_{ij}$ is not missing, and a zero otherwise.\n", "* $\\mathbf{A}_i$ and $\\mathbf{B}_i$ are the implicit feature matrices.\n", "\n", "While in the earlier models, every user/item had the same regularization applied on its factors, it's also possible to make this regularization adjust itself according to the number of ratings for each user movie, which tends to produce better models at the expense of more hyperparameter tuning.\n", "\n", "As well, the package offers an ALS-Cholesky solver, which is slower but tends to give better end results. This section will now use the implicit features and the Cholesky solver, and compare the new models to the previous ones." ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "RMSE type 1 ratings-only model: 0.896 [rho: 0.603]\n", "RMSE type 1 ratings + implicit + dyn + Chol: 0.853 [rho: 0.646]\n", "RMSE type 1 hybrid model: 0.861 [rho: 0.640]\n", "RMSE type 1 hybrid + implicit + dyn + Chol: 0.846 [rho: 0.654]\n" ] } ], "source": [ "m_implicit = CMF(k=40, add_implicit_features=True,\n", " lambda_=0.05, scale_lam=True,\n", " w_main=0.7, w_implicit=1., use_cg=False)\\\n", " .fit(X=ratings_train)\n", "m_implicit_plus_collective = \\\n", " CMF(k=40, add_implicit_features=True, use_cg=False,\n", " lambda_=0.03, scale_lam=True,\n", " w_main=0.5, w_user=0.3, w_item=0.3, w_implicit=1.)\\\n", " .fit(X=ratings_train,\n", " U=user_attr_train,\n", " I=item_attr_train)\n", "\n", "pred_ratingsonly = m_classic.predict(ratings_test1[\"UserId\"], ratings_test1[\"ItemId\"])\n", "print(\"RMSE type 1 ratings-only model: %.3f [rho: %.3f]\" %\n", " (np.sqrt(mean_squared_error(ratings_test1[\"Rating\"],\n", " pred_ratingsonly,\n", " squared=True)),\n", " np.corrcoef(ratings_test1[\"Rating\"], pred_ratingsonly)[0,1]))\n", "\n", "pred_implicit = m_implicit.predict(ratings_test1[\"UserId\"], ratings_test1[\"ItemId\"])\n", "print(\"RMSE type 1 ratings + implicit + dyn + Chol: %.3f [rho: %.3f]\" %\n", " (np.sqrt(mean_squared_error(ratings_test1[\"Rating\"],\n", " pred_implicit,\n", " squared=True)),\n", " np.corrcoef(ratings_test1[\"Rating\"], pred_implicit)[0,1]))\n", "\n", "pred_hybrid = m_collective.predict(ratings_test1[\"UserId\"], ratings_test1[\"ItemId\"])\n", "print(\"RMSE type 1 hybrid model: %.3f [rho: %.3f]\" %\n", " (np.sqrt(mean_squared_error(ratings_test1[\"Rating\"],\n", " pred_hybrid,\n", " squared=True)),\n", " np.corrcoef(ratings_test1[\"Rating\"], pred_hybrid)[0,1]))\n", "\n", "\n", "pred_implicit_plus_collective = m_implicit_plus_collective.\\\n", " predict(ratings_test1[\"UserId\"], ratings_test1[\"ItemId\"])\n", "print(\"RMSE type 1 hybrid + implicit + dyn + Chol: %.3f [rho: %.3f]\" %\n", " (np.sqrt(mean_squared_error(ratings_test1[\"Rating\"],\n", " pred_implicit_plus_collective,\n", " squared=True)),\n", " np.corrcoef(ratings_test1[\"Rating\"], pred_implicit_plus_collective)[0,1]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "But 
note that, while the dynamic regularization and Cholesky method usually lead to improvements in general, the newly-added implicit features oftentimes result in worse cold-start predictions:" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "RMSE type 2 hybrid model: 1.025 [rho: 0.424]\n", "RMSE type 2 hybrid model + implicit + dyn + Chol: 1.004 [rho: 0.480] (might get worse)\n", "RMSE type 2 content-based model: 0.977 [rho: 0.486]\n" ] } ], "source": [ "pred_hybrid = m_collective.predict_new(ratings_test2[\"UserId\"],\n", " item_attr_test.loc[ratings_test2[\"ItemId\"]])\n", "print(\"RMSE type 2 hybrid model: %.3f [rho: %.3f]\" %\n", " (np.sqrt(mean_squared_error(ratings_test2[\"Rating\"],\n", " pred_hybrid,\n", " squared=True)),\n", " np.corrcoef(ratings_test2[\"Rating\"], pred_hybrid)[0,1]))\n", "\n", "pred_implicit_plus_collective = \\\n", " m_implicit_plus_collective\\\n", " .predict_new(ratings_test2[\"UserId\"],\n", " item_attr_test.loc[ratings_test2[\"ItemId\"]])\n", "print(\"RMSE type 2 hybrid model + implicit + dyn + Chol: %.3f [rho: %.3f] (might get worse)\" %\n", " (np.sqrt(mean_squared_error(ratings_test2[\"Rating\"],\n", " pred_implicit_plus_collective,\n", " squared=True)),\n", " np.corrcoef(ratings_test2[\"Rating\"], pred_implicit_plus_collective)[0,1]))\n", "\n", "pred_contentbased = m_contentbased.predict_new(user_attr_test.loc[ratings_test2[\"UserId\"]],\n", " item_attr_test.loc[ratings_test2[\"ItemId\"]])\n", "print(\"RMSE type 2 content-based model: %.3f [rho: %.3f]\" %\n", " (np.sqrt(mean_squared_error(ratings_test2[\"Rating\"],\n", " pred_contentbased,\n", " squared=True)),\n", " np.corrcoef(ratings_test2[\"Rating\"], pred_contentbased)[0,1]))" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "RMSE type 3 hybrid model: 0.988 [rho: 0.470]\n", "RMSE type 3 hybrid model + implicit + dyn + Chol: 1.013 [rho: 0.458] (got worse)\n", "RMSE type 3 content-based model: 0.981 [rho: 0.468]\n" ] } ], "source": [ "pred_hybrid = m_collective.predict_cold_multiple(item=ratings_test3[\"ItemId\"],\n", " U=user_attr_test.loc[ratings_test3[\"UserId\"]])\n", "print(\"RMSE type 3 hybrid model: %.3f [rho: %.3f]\" %\n", " (np.sqrt(mean_squared_error(ratings_test3[\"Rating\"],\n", " pred_hybrid,\n", " squared=True)),\n", " np.corrcoef(ratings_test3[\"Rating\"], pred_hybrid)[0,1]))\n", "\n", "\n", "pred_implicit_plus_collective = \\\n", " m_implicit_plus_collective\\\n", " .predict_cold_multiple(item=ratings_test3[\"ItemId\"],\n", " U=user_attr_test.loc[ratings_test3[\"UserId\"]])\n", "print(\"RMSE type 3 hybrid model + implicit + dyn + Chol: %.3f [rho: %.3f] (got worse)\" %\n", " (np.sqrt(mean_squared_error(ratings_test3[\"Rating\"],\n", " pred_implicit_plus_collective,\n", " squared=True)),\n", " np.corrcoef(ratings_test3[\"Rating\"], pred_implicit_plus_collective)[0,1]))\n", "\n", "pred_contentbased = m_contentbased.predict_new(user_attr_test.loc[ratings_test3[\"UserId\"]],\n", " item_attr_test.loc[ratings_test3[\"ItemId\"]])\n", "print(\"RMSE type 3 content-based model: %.3f [rho: %.3f]\" %\n", " (np.sqrt(mean_squared_error(ratings_test3[\"Rating\"],\n", " pred_contentbased,\n", " squared=True)),\n", " np.corrcoef(ratings_test3[\"Rating\"], pred_contentbased)[0,1]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## 8. References\n", "\n", "* Cortes, David. 
\"Cold-start recommendations in Collective Matrix Factorization.\" arXiv preprint arXiv:1809.00366 (2018).\n", "* Singh, Ajit P., and Geoffrey J. Gordon. \"Relational learning via collective matrix factorization.\" Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2008.\n", "* Takacs, Gabor, Istvan Pilaszy, and Domonkos Tikk. \"Applications of the conjugate gradient method for implicit feedback collaborative filtering.\" Proceedings of the fifth ACM conference on Recommender systems. 2011.\n", "* Rendle, Steffen, Li Zhang, and Yehuda Koren. \"On the difficulty of evaluating baselines: A study on recommender systems.\" arXiv preprint arXiv:1905.01395 (2019).\n", "* Zhou, Yunhong, et al. \"Large-scale parallel collaborative filtering for the netflix prize.\" International conference on algorithmic applications in management. Springer, Berlin, Heidelberg, 2008." ] } ], "metadata": { "kernelspec": { "display_name": "base", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.0" } }, "nbformat": 4, "nbformat_minor": 2 }