{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Collaborative filtering"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "hide_input": true
   },
   "outputs": [],
   "source": [
    "from fastai.gen_doc.nbdoc import *"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This package contains all the necessary functions to quickly train a model for a collaborative filtering task. Let's start by importing all we'll need."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "from fastai import *\n",
    "from fastai.collab import * "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Overview"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Collaborative filtering is when you're tasked to predict how much a user is going to like a certain item. The fastai library contains a [`CollabFilteringDataset`](/collab.html#CollabFilteringDataset) class that will help you create datasets suitable for training, and a function `get_colab_learner` to build a simple model directly from a ratings table. Let's first see how we can get started before devling in the documentation.\n",
    "\n",
    "For our example, we'll use a small subset of the [MovieLens](https://grouplens.org/datasets/movielens/) dataset. In there, we have to predict the rating a user gave a given movie (from 0 to 5). It comes in the form of a csv file where each line is the rating of a movie by a given person."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>userId</th>\n",
       "      <th>movieId</th>\n",
       "      <th>rating</th>\n",
       "      <th>timestamp</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>73</td>\n",
       "      <td>1097</td>\n",
       "      <td>4.0</td>\n",
       "      <td>1255504951</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>561</td>\n",
       "      <td>924</td>\n",
       "      <td>3.5</td>\n",
       "      <td>1172695223</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>157</td>\n",
       "      <td>260</td>\n",
       "      <td>3.5</td>\n",
       "      <td>1291598691</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>358</td>\n",
       "      <td>1210</td>\n",
       "      <td>5.0</td>\n",
       "      <td>957481884</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>130</td>\n",
       "      <td>316</td>\n",
       "      <td>2.0</td>\n",
       "      <td>1138999234</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   userId  movieId  rating   timestamp\n",
       "0      73     1097     4.0  1255504951\n",
       "1     561      924     3.5  1172695223\n",
       "2     157      260     3.5  1291598691\n",
       "3     358     1210     5.0   957481884\n",
       "4     130      316     2.0  1138999234"
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "path = untar_data(URLs.ML_SAMPLE)\n",
    "ratings = pd.read_csv(path/'ratings.csv')\n",
    "ratings.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We'll first turn the `userId` and `movieId` columns in category codes, so that we can replace them with their codes when it's time to feed them to an `Embedding` layer. This step would be even more important if our csv had names of users, or names of items in it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "series2cat(ratings, 'userId','movieId')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now that this step is done, we can directly create a [`Learner`](/basic_train.html#Learner) object:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "learn = get_collab_learner(ratings, n_factors=50, pct_val=0.2, min_score=0., max_score=5.)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And the immediately begin training"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "VBox(children=(HBox(children=(IntProgress(value=0, max=5), HTML(value='0.00% [0/5 00:00<00:00]'))), HTML(value…"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Total time: 00:04\n",
      "epoch  train loss  valid loss\n",
      "1      2.368736    1.849535    (00:00)\n",
      "2      1.080932    0.691473    (00:00)\n",
      "3      0.740156    0.669135    (00:00)\n",
      "4      0.629487    0.658641    (00:00)\n",
      "5      0.599293    0.654870    (00:00)\n",
      "\n"
     ]
    }
   ],
   "source": [
    "learn.fit_one_cycle(5, 5e-3, wd=0.1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "hide_input": true
   },
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "<h2 id=\"CollabFilteringDataset\"><code>class</code> <code>CollabFilteringDataset</code><a href=\"https://github.com/fastai/fastai/blob/master/fastai/collab.py#L10\" class=\"source_link\">[source]</a></h2>\n",
       "\n",
       "> <code>CollabFilteringDataset</code>(`user`:`Series`, `item`:`Series`, `ratings`:`ndarray`) :: [`DatasetBase`](/basic_data.html#DatasetBase)"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "show_doc(CollabFilteringDataset, doc_string=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This is the basic class to buil a [`Dataset`](https://pytorch.org/docs/stable/data.html#torch.utils.data.Dataset) suitable for colaborative filtering. `user` and `item` should be categorical series that will be replaced with their codes internally and have the corresponding `ratings`. One of the factory methods will prepare the data in this format."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "hide_input": true
   },
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "<h4 id=\"CollabFilteringDataset.from_df\"><code>from_df</code><a href=\"https://github.com/fastai/fastai/blob/master/fastai/collab.py#L34\" class=\"source_link\">[source]</a></h4>\n",
       "\n",
       "> <code>from_df</code>(`rating_df`:`DataFrame`, `pct_val`:`float`=`0.2`, `user_name`:`Optional`\\[`str`\\]=`None`, `item_name`:`Optional`\\[`str`\\]=`None`, `rating_name`:`Optional`\\[`str`\\]=`None`) → `Tuple`\\[`ColabFilteringDataset`, `ColabFilteringDataset`\\]"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "show_doc(CollabFilteringDataset.from_df, doc_string=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Takes a `rating_df` and splits it randomly for train and test following `pct_val` (unless it's None). `user_name`, `item_name` and `rating_name` give the names of the corresponding columns (defaults to the first, the second and the third column)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "hide_input": true
   },
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "<h4 id=\"CollabFilteringDataset.from_csv\"><code>from_csv</code><a href=\"https://github.com/fastai/fastai/blob/master/fastai/collab.py#L50\" class=\"source_link\">[source]</a></h4>\n",
       "\n",
       "> <code>from_csv</code>(`csv_name`:`str`, `kwargs`) → `Tuple`\\[`ColabFilteringDataset`, `ColabFilteringDataset`\\]"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "show_doc(CollabFilteringDataset.from_csv, doc_string=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Opens the file in `csv_name` as a `DataFrame` and feeds it to `show_doc.from_df` with the `kwargs`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Model and [`Learner`](/basic_train.html#Learner)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "hide_input": true
   },
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "<h3 id=\"EmbeddingDotBias\"><code>class</code> <code>EmbeddingDotBias</code><a href=\"https://github.com/fastai/fastai/blob/master/fastai/collab.py#L56\" class=\"source_link\">[source]</a></h3>\n",
       "\n",
       "> <code>EmbeddingDotBias</code>(`n_factors`:`int`, `n_users`:`int`, `n_items`:`int`, `min_score`:`float`=`None`, `max_score`:`float`=`None`) :: [`Module`](https://pytorch.org/docs/stable/nn.html#torch.nn.Module)"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "show_doc(EmbeddingDotBias, doc_string=False, title_level=3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Creates a simple model with `Embedding` weights and biases for `n_users` and `n_items`, with `n_factors` latent factors. Takes the dot product of the embeddings and adds the bias, then feed the result to a sigmoid rescaled to go from `min_score` to `max_score`. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "hide_input": true
   },
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "<h4 id=\"get_collab_learner\"><code>get_collab_learner</code><a href=\"https://github.com/fastai/fastai/blob/master/fastai/collab.py#L71\" class=\"source_link\">[source]</a></h4>\n",
       "\n",
       "> <code>get_collab_learner</code>(`ratings`:`DataFrame`, `n_factors`:`int`, `pct_val`:`float`=`0.2`, `user_name`:`Optional`\\[`str`\\]=`None`, `item_name`:`Optional`\\[`str`\\]=`None`, `rating_name`:`Optional`\\[`str`\\]=`None`, `test`:`DataFrame`=`None`, `metrics`=`None`, `min_score`:`float`=`None`, `max_score`:`float`=`None`, `kwargs`) → [`Learner`](/basic_train.html#Learner)"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "show_doc(get_collab_learner, doc_string=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Creates a [`Learner`](/basic_train.html#Learner) object built from the data in `ratings`, `pct_val`, `user_name`, `item_name`, `rating_name` to [`CollabFilteringDataset`](/collab.html#CollabFilteringDataset). Optionally, creates another [`CollabFilteringDataset`](/collab.html#CollabFilteringDataset) for `test`. `kwargs` are fed to [`DataBunch.create`](/basic_data.html#DataBunch.create) with these datasets. The model is given by [`EmbeddingDotBias`](/collab.html#EmbeddingDotBias) with `n_factors`, `min_score` and `max_score` (the numbers of users and items will be inferred from the data)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Undocumented Methods - Methods moved below this line will intentionally be hidden"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "hide_input": true
   },
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "<h4 id=\"EmbeddingDotBias.forward\"><code>forward</code><a href=\"https://github.com/fastai/fastai/blob/master/fastai/collab.py#L65\" class=\"source_link\">[source]</a></h4>\n",
       "\n",
       "> <code>forward</code>(`users`:`LongTensor`, `items`:`LongTensor`) → `Tensor`\n",
       "\n",
       "Defines the computation performed at every call. Should be overridden by all subclasses.\n",
       "\n",
       ".. note::\n",
       "    Although the recipe for forward pass needs to be defined within\n",
       "    this function, one should call the :class:`Module` instance afterwards\n",
       "    instead of this since the former takes care of running the\n",
       "    registered hooks while the latter silently ignores them. "
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "show_doc(EmbeddingDotBias.forward)"
   ]
  }
 ],
 "metadata": {
  "jekyll": {
   "keywords": "fastai",
   "summary": "Application to collaborative filtering",
   "title": "collab"
  },
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}