{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Collaborative filtering"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hide_input": true
},
"outputs": [],
"source": [
"from fastai.gen_doc.nbdoc import *"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This package contains all the necessary functions to quickly train a model for a collaborative filtering task. Let's start by importing all we'll need."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from fastai.collab import * "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Overview"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Collaborative filtering is when you're tasked to predict how much a user is going to like a certain item. The fastai library contains a [`CollabFilteringDataset`](/collab.html#CollabFilteringDataset) class that will help you create datasets suitable for training, and a function `get_colab_learner` to build a simple model directly from a ratings table. Let's first see how we can get started before delving into the documentation.\n",
"\n",
"For this example, we'll use a small subset of the [MovieLens](https://grouplens.org/datasets/movielens/) dataset to predict the rating a user would give a particular movie (from 0 to 5). The dataset comes in the form of a csv file where each line is a rating of a movie by a given person."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" userId | \n",
" movieId | \n",
" rating | \n",
" timestamp | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 73 | \n",
" 1097 | \n",
" 4.0 | \n",
" 1255504951 | \n",
"
\n",
" \n",
" 1 | \n",
" 561 | \n",
" 924 | \n",
" 3.5 | \n",
" 1172695223 | \n",
"
\n",
" \n",
" 2 | \n",
" 157 | \n",
" 260 | \n",
" 3.5 | \n",
" 1291598691 | \n",
"
\n",
" \n",
" 3 | \n",
" 358 | \n",
" 1210 | \n",
" 5.0 | \n",
" 957481884 | \n",
"
\n",
" \n",
" 4 | \n",
" 130 | \n",
" 316 | \n",
" 2.0 | \n",
" 1138999234 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" userId movieId rating timestamp\n",
"0 73 1097 4.0 1255504951\n",
"1 561 924 3.5 1172695223\n",
"2 157 260 3.5 1291598691\n",
"3 358 1210 5.0 957481884\n",
"4 130 316 2.0 1138999234"
]
},
"execution_count": null,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"path = untar_data(URLs.ML_SAMPLE)\n",
"ratings = pd.read_csv(path/'ratings.csv')\n",
"ratings.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We'll first turn the `userId` and `movieId` columns in category codes, so that we can replace them with their codes when it's time to feed them to an `Embedding` layer. This step would be even more important if our csv had names of users, or names of items in it. To do it, we simply have to call a [`CollabDataBunch`](/collab.html#CollabDataBunch) factory method."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"data = CollabDataBunch.from_df(ratings)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that this step is done, we can directly create a [`Learner`](/basic_train.html#Learner) object:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"learn = collab_learner(data, n_factors=50, y_range=(0.,5.))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And then immediately begin training"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"Total time: 00:09 \n",
" \n",
" epoch | \n",
" train_loss | \n",
" valid_loss | \n",
"
\n",
" \n",
" 1 | \n",
" 2.427430 | \n",
" 1.999472 | \n",
"
\n",
" \n",
" 2 | \n",
" 1.116335 | \n",
" 0.663345 | \n",
"
\n",
" \n",
" 3 | \n",
" 0.736155 | \n",
" 0.636640 | \n",
"
\n",
" \n",
" 4 | \n",
" 0.612827 | \n",
" 0.626773 | \n",
"
\n",
" \n",
" 5 | \n",
" 0.565003 | \n",
" 0.626336 | \n",
"
\n",
"
\n"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"learn.fit_one_cycle(5, 5e-3, wd=0.1)"
]
},
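{
"cell_type": "markdown",
"metadata": {},
"source": [
"Once training is done, a quick sanity check is to look at the model's predictions on the validation set. This is just a sketch using the generic [`Learner.get_preds`](/basic_train.html#Learner.get_preds) method; the exact numbers will vary from run to run."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Predictions and targets for the validation set (order follows the validation dataloader).\n",
"preds, targets = learn.get_preds()\n",
"preds[:5], targets[:5]"
]
},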
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hide_input": true
},
"outputs": [
{
"data": {
"text/markdown": [
"class
CollabDataBunch
[source]
\n",
"\n",
"> CollabDataBunch
(**`train_dl`**:[`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader), **`valid_dl`**:[`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader), **`fix_dl`**:[`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader)=***`None`***, **`test_dl`**:`Optional`\\[[`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader)\\]=***`None`***, **`device`**:[`device`](https://pytorch.org/docs/stable/tensor_attributes.html#torch-device)=***`None`***, **`dl_tfms`**:`Optional`\\[`Collection`\\[`Callable`\\]\\]=***`None`***, **`path`**:`PathOrStr`=***`'.'`***, **`collate_fn`**:`Callable`=***`'data_collate'`***, **`no_check`**:`bool`=***`False`***) :: [`DataBunch`](/basic_data.html#DataBunch)\n",
"\n",
"Base [`DataBunch`](/basic_data.html#DataBunch) for collaborative filtering. \n",
"\n"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"show_doc(CollabDataBunch)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The init function shouldn't be called directly (as it's the one of a basic [`DataBunch`](/basic_data.html#DataBunch)), instead, you'll want to use the following factory method."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hide_input": true
},
"outputs": [
{
"data": {
"text/markdown": [
"\n",
"\n",
"> from_df
(**`ratings`**:`DataFrame`, **`valid_pct`**:`float`=***`0.2`***, **`user_name`**:`Optional`\\[`str`\\]=***`None`***, **`item_name`**:`Optional`\\[`str`\\]=***`None`***, **`rating_name`**:`Optional`\\[`str`\\]=***`None`***, **`test`**:`DataFrame`=***`None`***, **`seed`**:`int`=***`None`***, **`path`**:`PathOrStr`=***`'.'`***, **`bs`**:`int`=***`64`***, **`val_bs`**:`int`=***`None`***, **`num_workers`**:`int`=***`4`***, **`dl_tfms`**:`Optional`\\[`Collection`\\[`Callable`\\]\\]=***`None`***, **`device`**:[`device`](https://pytorch.org/docs/stable/tensor_attributes.html#torch-device)=***`None`***, **`collate_fn`**:`Callable`=***`'data_collate'`***, **`no_check`**:`bool`=***`False`***) → `CollabDataBunch`\n",
"\n",
"Create a [`DataBunch`](/basic_data.html#DataBunch) suitable for collaborative filtering from `ratings`. \n",
"\n"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"show_doc(CollabDataBunch.from_df)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Take a `ratings` dataframe and splits it randomly for train and test following `pct_val` (unless it's None). `user_name`, `item_name` and `rating_name` give the names of the corresponding columns (defaults to the first, the second and the third column). Optionally a `test` dataframe can be passed an a `seed` for the separation between training and validation set. The `kwargs` will be passed to [`DataBunch.create`](/basic_data.html#DataBunch.create)."
]
},
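{
"cell_type": "markdown",
"metadata": {},
"source": [
"For example, with the ratings table loaded above, a more explicit call might look like the sketch below. The column names match the csv shown earlier (and are optional here since they are the first three columns); the split fraction, seed and batch size are arbitrary choices."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# A sketch of a fully explicit call: reproducible split, named columns, batch size.\n",
"data = CollabDataBunch.from_df(ratings, valid_pct=0.1, seed=42,\n",
"                               user_name='userId', item_name='movieId',\n",
"                               rating_name='rating', bs=64)"
]
},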
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Model and [`Learner`](/basic_train.html#Learner)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hide_input": true
},
"outputs": [
{
"data": {
"text/markdown": [
"class
CollabLearner
[source]
\n",
"\n",
"> CollabLearner
(**`data`**:[`DataBunch`](/basic_data.html#DataBunch), **`model`**:[`Module`](https://pytorch.org/docs/stable/nn.html#torch.nn.Module), **`opt_func`**:`Callable`=***`'Adam'`***, **`loss_func`**:`Callable`=***`None`***, **`metrics`**:`Collection`\\[`Callable`\\]=***`None`***, **`true_wd`**:`bool`=***`True`***, **`bn_wd`**:`bool`=***`True`***, **`wd`**:`Floats`=***`0.01`***, **`train_bn`**:`bool`=***`True`***, **`path`**:`str`=***`None`***, **`model_dir`**:`str`=***`'models'`***, **`callback_fns`**:`Collection`\\[`Callable`\\]=***`None`***, **`callbacks`**:`Collection`\\[[`Callback`](/callback.html#Callback)\\]=***``***, **`layer_groups`**:`ModuleList`=***`None`***, **`add_time`**:`bool`=***`True`***) :: [`Learner`](/basic_train.html#Learner)\n",
"\n",
"[`Learner`](/basic_train.html#Learner) suitable for collaborative filtering. \n",
"\n"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"show_doc(CollabLearner, title_level=3)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This is a subclass of [`Learner`](/basic_train.html#Learner) that just introduces helper functions to analyze results, the initialization is the same as a regular [`Learner`](/basic_train.html#Learner)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hide_input": true
},
"outputs": [
{
"data": {
"text/markdown": [
"\n",
"\n",
"> bias
(**`arr`**:`Collection`\\[`T_co`\\], **`is_item`**:`bool`=***`True`***)\n",
"\n",
"Bias for item or user (based on `is_item`) for all in `arr`. (Set model to `cpu` and no grad.) \n",
"\n"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"show_doc(CollabLearner.bias)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hide_input": true
},
"outputs": [
{
"data": {
"text/markdown": [
"\n",
"\n",
"> get_idx
(**`arr`**:`Collection`\\[`T_co`\\], **`is_item`**:`bool`=***`True`***)\n",
"\n",
"Fetch item or user (based on `is_item`) for all in `arr`. (Set model to `cpu` and no grad.) \n",
"\n"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"show_doc(CollabLearner.get_idx)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hide_input": true
},
"outputs": [
{
"data": {
"text/markdown": [
"\n",
"\n",
"> weight
(**`arr`**:`Collection`\\[`T_co`\\], **`is_item`**:`bool`=***`True`***)\n",
"\n",
"Bias for item or user (based on `is_item`) for all in `arr`. (Set model to `cpu` and no grad.) \n",
"\n"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"show_doc(CollabLearner.weight)"
]
},
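{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sketch of how these helpers can be used, we can look at what the model learned about the ten most-rated movies in our small sample. We pass raw `movieId` values since that's the item column this `DataBunch` was built from; on a full dataset you would typically map the ids to titles first."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Ten most-rated movies in the sample, then their learned bias and embedding weights.\n",
"top_movies = ratings['movieId'].value_counts().index.values[:10]\n",
"movie_bias = learn.bias(top_movies, is_item=True)    # one bias value per movie\n",
"movie_w = learn.weight(top_movies, is_item=True)     # one vector of n_factors per movie\n",
"movie_bias, movie_w.shape"
]
},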
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hide_input": true
},
"outputs": [
{
"data": {
"text/markdown": [
"class
EmbeddingDotBias
[source]
\n",
"\n",
"> EmbeddingDotBias
(**`n_factors`**:`int`, **`n_users`**:`int`, **`n_items`**:`int`, **`y_range`**:`Point`=***`None`***) :: [`Module`](https://pytorch.org/docs/stable/nn.html#torch.nn.Module)\n",
"\n",
"Base dot model for collaborative filtering. \n",
"\n"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"show_doc(EmbeddingDotBias, title_level=3)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Creates a simple model with `Embedding` weights and biases for `n_users` and `n_items`, with `n_factors` latent factors. Takes the dot product of the embeddings and adds the bias, then if `y_range` is specified, feed the result to a sigmoid rescaled to go from `y_range[0]` to `y_range[1]`. "
]
},
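{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a minimal sketch of that forward pass (with made-up sizes, outside of any [`Learner`](/basic_train.html#Learner)), the score for each (user, item) pair is the dot product of their embeddings plus the two biases, squashed into `y_range` when it is given:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import torch\n",
"\n",
"# Made-up sizes, for illustration only.\n",
"n_users, n_items, n_factors = 10, 20, 5\n",
"model = EmbeddingDotBias(n_factors, n_users, n_items, y_range=(0.,5.))\n",
"\n",
"users = torch.randint(0, n_users, (4,))  # a mini-batch of 4 user indices\n",
"items = torch.randint(0, n_items, (4,))  # ...and 4 item indices\n",
"model(users, items)                      # 4 predicted ratings in [0, 5]"
]
},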
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hide_input": true
},
"outputs": [
{
"data": {
"text/markdown": [
"class
EmbeddingNN
[source]
\n",
"\n",
"> EmbeddingNN
(**`emb_szs`**:`ListSizes`, **`layers`**:`Collection`\\[`int`\\]=***`None`***, **`ps`**:`Collection`\\[`float`\\]=***`None`***, **`emb_drop`**:`float`=***`0.0`***, **`y_range`**:`OptRange`=***`None`***, **`use_bn`**:`bool`=***`True`***, **`bn_final`**:`bool`=***`False`***) :: [`TabularModel`](/tabular.models.html#TabularModel)\n",
"\n",
"Subclass [`TabularModel`](/tabular.models.html#TabularModel) to create a NN suitable for collaborative filtering. \n",
"\n"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"show_doc(EmbeddingNN, title_level=3)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`emb_szs` will overwrite the default and `kwargs` are passed to [`TabularModel`](/tabular.models.html#TabularModel)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hide_input": true
},
"outputs": [
{
"data": {
"text/markdown": [
"\n",
"\n",
"> collab_learner
(**`data`**, **`n_factors`**:`int`=***`None`***, **`use_nn`**:`bool`=***`False`***, **`emb_szs`**:`Dict`\\[`str`, `int`\\]=***`None`***, **`layers`**:`Collection`\\[`int`\\]=***`None`***, **`ps`**:`Collection`\\[`float`\\]=***`None`***, **`emb_drop`**:`float`=***`0.0`***, **`y_range`**:`OptRange`=***`None`***, **`use_bn`**:`bool`=***`True`***, **`bn_final`**:`bool`=***`False`***, **\\*\\*`learn_kwargs`**) → [`Learner`](/basic_train.html#Learner)\n",
"\n",
"Create a Learner for collaborative filtering on `data`. \n",
"\n"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"show_doc(collab_learner)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"More specifically, binds [`data`](/tabular.data.html#tabular.data) with a model that is either an [`EmbeddingDotBias`](/collab.html#EmbeddingDotBias) with `n_factors` if `use_nn=False` or a [`EmbeddingNN`](/collab.html#EmbeddingNN) with `emb_szs` otherwise. In both cases the numbers of users and items will be inferred from the data, `y_range` can be specified in the `kwargs` and you can pass [`metrics`](/metrics.html#metrics) or `wd` to the [`Learner`](/basic_train.html#Learner) constructor."
]
},
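{
"cell_type": "markdown",
"metadata": {},
"source": [
"For instance, here is a sketch of the neural-net variant: `use_nn=True` builds an [`EmbeddingNN`](/collab.html#EmbeddingNN) instead of an [`EmbeddingDotBias`](/collab.html#EmbeddingDotBias). We assume `emb_szs` is keyed by the categorical column names; the embedding and layer sizes below are arbitrary examples."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# A sketch of an EmbeddingNN-based learner (arbitrary embedding/layer sizes).\n",
"learn_nn = collab_learner(data, use_nn=True,\n",
"                          emb_szs={'userId':40, 'movieId':40},\n",
"                          layers=[256, 128], y_range=(0.,5.))\n",
"learn_nn.model"
]
},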
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Links with the Data Block API"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hide_input": true
},
"outputs": [
{
"data": {
"text/markdown": [
"\n",
"\n",
"> CollabLine
(**`cats`**, **`conts`**, **`classes`**, **`names`**) :: [`TabularLine`](/tabular.data.html#TabularLine)\n",
"\n"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"show_doc(CollabLine, doc_string=False, title_level=3)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Subclass of [`TabularLine`](/tabular.data.html#TabularLine) for collaborative filtering."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hide_input": true
},
"outputs": [
{
"data": {
"text/markdown": [
"\n",
"\n",
"> CollabList
(**`items`**:`Iterator`\\[`T_co`\\], **`cat_names`**:`OptStrList`=***`None`***, **`cont_names`**:`OptStrList`=***`None`***, **`procs`**=***`None`***, **\\*\\*`kwargs`**) → `TabularList` :: [`TabularList`](/tabular.data.html#TabularList)\n",
"\n"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"show_doc(CollabList, title_level=3, doc_string=False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Subclass of [`TabularList`](/tabular.data.html#TabularList) for collaborative filtering."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Undocumented Methods - Methods moved below this line will intentionally be hidden"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"hide_input": true
},
"outputs": [
{
"data": {
"text/markdown": [
"\n",
"\n",
"> forward
(**`users`**:`LongTensor`, **`items`**:`LongTensor`) → `Tensor`\n",
"\n",
"Defines the computation performed at every call. Should be overridden by all subclasses.\n",
"\n",
".. note::\n",
" Although the recipe for forward pass needs to be defined within\n",
" this function, one should call the :class:`Module` instance afterwards\n",
" instead of this since the former takes care of running the\n",
" registered hooks while the latter silently ignores them. \n",
"\n"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"show_doc(EmbeddingDotBias.forward)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"data": {
"text/markdown": [
"\n",
"\n",
"> reconstruct
(**`t`**:`Tensor`)\n",
"\n",
"Reconstruct one of the underlying item for its data `t`. \n",
"\n"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"show_doc(CollabList.reconstruct)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"data": {
"text/markdown": [
"\n",
"\n",
"> forward
(**`users`**:`LongTensor`, **`items`**:`LongTensor`) → `Tensor`\n",
"\n",
"Defines the computation performed at every call. Should be overridden by all subclasses.\n",
"\n",
".. note::\n",
" Although the recipe for forward pass needs to be defined within\n",
" this function, one should call the :class:`Module` instance afterwards\n",
" instead of this since the former takes care of running the\n",
" registered hooks while the latter silently ignores them. \n",
"\n"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"show_doc(EmbeddingNN.forward)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## New Methods - Please document or move to the undocumented section"
]
}
],
"metadata": {
"jekyll": {
"keywords": "fastai",
"summary": "Application to collaborative filtering",
"title": "collab"
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
}
},
"nbformat": 4,
"nbformat_minor": 2
}