{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Collaborative filtering"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "hide_input": true
   },
   "outputs": [],
   "source": [
    "from fastai.gen_doc.nbdoc import *"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This package contains all the necessary functions to quickly train a model for a collaborative filtering task. Let's start by importing all we'll need."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from fastai import *\n",
    "from fastai.collab import * "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Overview"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Collaborative filtering is when you're tasked to predict how much a user is going to like a certain item. The fastai library contains a [`CollabFilteringDataset`](/collab.html#CollabFilteringDataset) class that will help you create datasets suitable for training, and a function `get_colab_learner` to build a simple model directly from a ratings table. Let's first see how we can get started before devling in the documentation.\n",
    "\n",
    "For our example, we'll use a small subset of the [MovieLens](https://grouplens.org/datasets/movielens/) dataset. In there, we have to predict the rating a user gave a given movie (from 0 to 5). It comes in the form of a csv file where each line is the rating of a movie by a given person."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>userId</th>\n",
       "      <th>movieId</th>\n",
       "      <th>rating</th>\n",
       "      <th>timestamp</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>73</td>\n",
       "      <td>1097</td>\n",
       "      <td>4.0</td>\n",
       "      <td>1255504951</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>561</td>\n",
       "      <td>924</td>\n",
       "      <td>3.5</td>\n",
       "      <td>1172695223</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>157</td>\n",
       "      <td>260</td>\n",
       "      <td>3.5</td>\n",
       "      <td>1291598691</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>358</td>\n",
       "      <td>1210</td>\n",
       "      <td>5.0</td>\n",
       "      <td>957481884</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>130</td>\n",
       "      <td>316</td>\n",
       "      <td>2.0</td>\n",
       "      <td>1138999234</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   userId  movieId  rating   timestamp\n",
       "0      73     1097     4.0  1255504951\n",
       "1     561      924     3.5  1172695223\n",
       "2     157      260     3.5  1291598691\n",
       "3     358     1210     5.0   957481884\n",
       "4     130      316     2.0  1138999234"
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "path = untar_data(URLs.ML_SAMPLE)\n",
    "ratings = pd.read_csv(path/'ratings.csv')\n",
    "ratings.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We'll first turn the `userId` and `movieId` columns in category codes, so that we can replace them with their codes when it's time to feed them to an `Embedding` layer. This step would be even more important if our csv had names of users, or names of items in it. To do it, we wimply have to call a [`CollabDataBunch`](/collab.html#CollabDataBunch) factory method."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data = CollabDataBunch.from_df(ratings)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now that this step is done, we can directly create a [`Learner`](/basic_train.html#Learner) object:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "learn = collab_learner(data, n_factors=50, y_range=(0.,5.))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And then immediately begin training"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Total time: 00:02\n",
      "epoch  train_loss  valid_loss\n",
      "1      2.361941    1.874407    (00:00)\n",
      "2      1.093075    0.657915    (00:00)\n",
      "3      0.741212    0.631365    (00:00)\n",
      "4      0.630556    0.618452    (00:00)\n",
      "5      0.585503    0.616357    (00:00)\n",
      "\n"
     ]
    }
   ],
   "source": [
    "learn.fit_one_cycle(5, 5e-3, wd=0.1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "hide_input": true
   },
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "<h2 id=\"CollabDataBunch\"><code>class</code> <code>CollabDataBunch</code><a href=\"https://github.com/fastai/fastai/blob/master/fastai/collab.py#L40\" class=\"source_link\">[source]</a></h2>\n",
       "\n",
       "> <code>CollabDataBunch</code>(`train_dl`:[`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader), `valid_dl`:[`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader), `test_dl`:`Optional`\\[[`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader)\\]=`None`, `device`:[`device`](https://pytorch.org/docs/stable/tensor_attributes.html#torch-device)=`None`, `tfms`:`Optional`\\[`Collection`\\[`Callable`\\]\\]=`None`, `path`:`PathOrStr`=`'.'`, `collate_fn`:`Callable`=`'data_collate'`) :: [`DataBunch`](/basic_data.html#DataBunch)"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "show_doc(CollabDataBunch, doc_string=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This is the basic class to buil a [`DataBunch`](/basic_data.html#DataBunch) suitable for colaborative filtering."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "hide_input": true
   },
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "<h4 id=\"CollabDataBunch.from_df\"><code>from_df</code><a href=\"https://github.com/fastai/fastai/blob/master/fastai/collab.py#L41\" class=\"source_link\">[source]</a></h4>\n",
       "\n",
       "> <code>from_df</code>(`ratings`:`DataFrame`, `pct_val`:`float`=`0.2`, `user_name`:`Optional`\\[`str`\\]=`None`, `item_name`:`Optional`\\[`str`\\]=`None`, `rating_name`:`Optional`\\[`str`\\]=`None`, `test`:`DataFrame`=`None`, `seed`=`None`, `kwargs`)"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "show_doc(CollabDataBunch.from_df, doc_string=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Takes a `ratings` dataframe and splits it randomly for train and test following `pct_val` (unless it's None). `user_name`, `item_name` and `rating_name` give the names of the corresponding columns (defaults to the first, the second and the third column). Optionally a `test` dataframe can be passed an a `seed` for the separation between training and validation set. The `kwargs` will be passed to [`DataBunch.create`](/basic_data.html#DataBunch.create)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Model and [`Learner`](/basic_train.html#Learner)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "hide_input": true
   },
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "<h3 id=\"EmbeddingDotBias\"><code>class</code> <code>EmbeddingDotBias</code><a href=\"https://github.com/fastai/fastai/blob/master/fastai/collab.py#L25\" class=\"source_link\">[source]</a></h3>\n",
       "\n",
       "> <code>EmbeddingDotBias</code>(`n_factors`:`int`, `n_users`:`int`, `n_items`:`int`, `y_range`:`Point`=`None`) :: [`Module`](https://pytorch.org/docs/stable/nn.html#torch.nn.Module)"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "show_doc(EmbeddingDotBias, doc_string=False, title_level=3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Creates a simple model with `Embedding` weights and biases for `n_users` and `n_items`, with `n_factors` latent factors. Takes the dot product of the embeddings and adds the bias, then if `y_range` is specified, feed the result to a sigmoid rescaled to go from `y_range[0]` to `y_range[1]`. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "hide_input": true
   },
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "<h4 id=\"collab_learner\"><code>collab_learner</code><a href=\"https://github.com/fastai/fastai/blob/master/fastai/collab.py#L76\" class=\"source_link\">[source]</a></h4>\n",
       "\n",
       "> <code>collab_learner</code>(`data`, `n_factors`:`int`=`None`, `use_nn`:`bool`=`False`, `metrics`=`None`, `emb_szs`:`Dict`\\[`str`, `int`\\]=`None`, `wd`:`float`=`0.01`, `kwargs`) → [`Learner`](/basic_train.html#Learner)"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "show_doc(collab_learner, doc_string=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Creates a [`Learner`](/basic_train.html#Learner) object built from the data in `ratings`, `pct_val`, `user_name`, `item_name`, `rating_name` to [`CollabFilteringDataset`](/collab.html#CollabFilteringDataset). Optionally, creates another [`CollabFilteringDataset`](/collab.html#CollabFilteringDataset) for `test`. `kwargs` are fed to [`DataBunch.create`](/basic_data.html#DataBunch.create) with these datasets. The model is given by [`EmbeddingDotBias`](/collab.html#EmbeddingDotBias) with `n_factors` if `use_nn` is `False`, and is a neural net with `emb_szs` otherwise. In both cases the numbers of users and items will be inferred from the data, `y_range` is the range of the output (optional) and you can pass [`metrics`](/metrics.html#metrics)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Links with the Data Block API"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "hide_input": true
   },
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "<h3 id=\"CollabLine\"><code>class</code> <code>CollabLine</code><a href=\"https://github.com/fastai/fastai/blob/master/fastai/collab.py#L11\" class=\"source_link\">[source]</a></h3>\n",
       "\n",
       "> <code>CollabLine</code>(`cats`, `conts`, `classes`, `names`) :: [`TabularLine`](/tabular.data.html#TabularLine)"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "show_doc(CollabLine, doc_string=False, title_level=3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Subclass of [`TabularLine`](/tabular.data.html#TabularLine) for collaborative filtering."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "hide_input": true
   },
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "<h3 id=\"CollabList\"><code>class</code> <code>CollabList</code><a href=\"https://github.com/fastai/fastai/blob/master/fastai/collab.py#L16\" class=\"source_link\">[source]</a></h3>\n",
       "\n",
       "> <code>CollabList</code>(`items`:`Iterator`, `cat_names`:`OptStrList`=`None`, `cont_names`:`OptStrList`=`None`, `procs`=`None`, `kwargs`) → `TabularList` :: [`TabularList`](/tabular.data.html#TabularList)"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "show_doc(CollabList, title_level=3, doc_string=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Subclass of [`TabularList`](/tabular.data.html#TabularList) for collaborative filtering."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Undocumented Methods - Methods moved below this line will intentionally be hidden"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "hide_input": true
   },
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "<h4 id=\"EmbeddingDotBias.forward\"><code>forward</code><a href=\"https://github.com/fastai/fastai/blob/master/fastai/collab.py#L34\" class=\"source_link\">[source]</a></h4>\n",
       "\n",
       "> <code>forward</code>(`users`:`LongTensor`, `items`:`LongTensor`) → `Tensor`\n",
       "\n",
       "Defines the computation performed at every call. Should be overridden by all subclasses.\n",
       "\n",
       ".. note::\n",
       "    Although the recipe for forward pass needs to be defined within\n",
       "    this function, one should call the :class:`Module` instance afterwards\n",
       "    instead of this since the former takes care of running the\n",
       "    registered hooks while the latter silently ignores them. "
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "show_doc(EmbeddingDotBias.forward)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## New Methods - Please document or move to the undocumented section"
   ]
  }
 ],
 "metadata": {
  "jekyll": {
   "keywords": "fastai",
   "summary": "Application to collaborative filtering",
   "title": "collab"
  },
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}