{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Collaborative filtering" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [], "source": [ "from fastai.gen_doc.nbdoc import *" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This package contains all the necessary functions to quickly train a model for a collaborative filtering task. Let's start by importing everything we'll need." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from fastai.collab import * " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Overview" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Collaborative filtering is the task of predicting how much a user is going to like a certain item. The fastai library contains a [`CollabDataBunch`](/collab.html#CollabDataBunch) class that will help you create datasets suitable for training, and a function `collab_learner` to build a simple model directly from a ratings table. Let's first see how we can get started before delving into the documentation.\n", "\n", "For this example, we'll use a small subset of the [MovieLens](https://grouplens.org/datasets/movielens/) dataset to predict the rating a user would give a particular movie (from 0 to 5). The dataset comes in the form of a CSV file where each line is a rating of a movie by a given person." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n",
"<tr><th></th><th>userId</th><th>movieId</th><th>rating</th><th>timestamp</th></tr>\n",
"<tr><th>0</th><td>73</td><td>1097</td><td>4.0</td><td>1255504951</td></tr>\n",
"<tr><th>1</th><td>561</td><td>924</td><td>3.5</td><td>1172695223</td></tr>\n",
"<tr><th>2</th><td>157</td><td>260</td><td>3.5</td><td>1291598691</td></tr>\n",
"<tr><th>3</th><td>358</td><td>1210</td><td>5.0</td><td>957481884</td></tr>\n",
"<tr><th>4</th><td>130</td><td>316</td><td>2.0</td><td>1138999234</td></tr>\n",
"</table>
" ], "text/plain": [ " userId movieId rating timestamp\n", "0 73 1097 4.0 1255504951\n", "1 561 924 3.5 1172695223\n", "2 157 260 3.5 1291598691\n", "3 358 1210 5.0 957481884\n", "4 130 316 2.0 1138999234" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "path = untar_data(URLs.ML_SAMPLE)\n", "ratings = pd.read_csv(path/'ratings.csv')\n", "ratings.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We'll first turn the `userId` and `movieId` columns in category codes, so that we can replace them with their codes when it's time to feed them to an `Embedding` layer. This step would be even more important if our csv had names of users, or names of items in it. To do it, we simply have to call a [`CollabDataBunch`](/collab.html#CollabDataBunch) factory method." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data = CollabDataBunch.from_df(ratings)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that this step is done, we can directly create a [`Learner`](/basic_train.html#Learner) object:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "learn = collab_learner(data, n_factors=50, y_range=(0.,5.))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And then immediately begin training" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/html": [ "Total time: 00:09

<table>\n",
"<tr><th>epoch</th><th>train_loss</th><th>valid_loss</th></tr>\n",
"<tr><td>1</td><td>2.427430</td><td>1.999472</td></tr>\n",
"<tr><td>2</td><td>1.116335</td><td>0.663345</td></tr>\n",
"<tr><td>3</td><td>0.736155</td><td>0.636640</td></tr>\n",
"<tr><td>4</td><td>0.612827</td><td>0.626773</td></tr>\n",
"<tr><td>5</td><td>0.565003</td><td>0.626336</td></tr>\n",
"</table>
\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "learn.fit_one_cycle(5, 5e-3, wd=0.1)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

class CollabDataBunch[source]

\n", "\n", "> CollabDataBunch(**`train_dl`**:[`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader), **`valid_dl`**:[`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader), **`fix_dl`**:[`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader)=***`None`***, **`test_dl`**:`Optional`\\[[`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader)\\]=***`None`***, **`device`**:[`device`](https://pytorch.org/docs/stable/tensor_attributes.html#torch-device)=***`None`***, **`dl_tfms`**:`Optional`\\[`Collection`\\[`Callable`\\]\\]=***`None`***, **`path`**:`PathOrStr`=***`'.'`***, **`collate_fn`**:`Callable`=***`'data_collate'`***, **`no_check`**:`bool`=***`False`***) :: [`DataBunch`](/basic_data.html#DataBunch)\n", "\n", "Base [`DataBunch`](/basic_data.html#DataBunch) for collaborative filtering. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(CollabDataBunch)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The init function shouldn't be called directly, since it is simply the one inherited from the basic [`DataBunch`](/basic_data.html#DataBunch). Instead, you'll want to use the following factory method." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

from_df[source]

\n", "\n", "> from_df(**`ratings`**:`DataFrame`, **`pct_val`**:`float`=***`0.2`***, **`user_name`**:`Optional`\\[`str`\\]=***`None`***, **`item_name`**:`Optional`\\[`str`\\]=***`None`***, **`rating_name`**:`Optional`\\[`str`\\]=***`None`***, **`test`**:`DataFrame`=***`None`***, **`seed`**:`int`=***`None`***, **`path`**:`PathOrStr`=***`'.'`***, **`bs`**:`int`=***`64`***, **`val_bs`**:`int`=***`None`***, **`num_workers`**:`int`=***`4`***, **`dl_tfms`**:`Optional`\\[`Collection`\\[`Callable`\\]\\]=***`None`***, **`device`**:[`device`](https://pytorch.org/docs/stable/tensor_attributes.html#torch-device)=***`None`***, **`collate_fn`**:`Callable`=***`'data_collate'`***, **`no_check`**:`bool`=***`False`***) → `CollabDataBunch`\n", "\n", "Create a [`DataBunch`](/basic_data.html#DataBunch) suitable for collaborative filtering from `ratings`. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(CollabDataBunch.from_df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Takes a `ratings` dataframe and splits it randomly into training and validation sets following `pct_val` (unless it's `None`). `user_name`, `item_name` and `rating_name` give the names of the corresponding columns (defaulting to the first, second and third columns). Optionally, a `test` dataframe can be passed, as well as a `seed` for the split between the training and validation sets. The `kwargs` will be passed to [`DataBunch.create`](/basic_data.html#DataBunch.create)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Model and [`Learner`](/basic_train.html#Learner)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

class CollabLearner[source]

\n", "\n", "> CollabLearner(**`data`**:[`DataBunch`](/basic_data.html#DataBunch), **`model`**:[`Module`](https://pytorch.org/docs/stable/nn.html#torch.nn.Module), **`opt_func`**:`Callable`=***`'Adam'`***, **`loss_func`**:`Callable`=***`None`***, **`metrics`**:`Collection`\\[`Callable`\\]=***`None`***, **`true_wd`**:`bool`=***`True`***, **`bn_wd`**:`bool`=***`True`***, **`wd`**:`Floats`=***`0.01`***, **`train_bn`**:`bool`=***`True`***, **`path`**:`str`=***`None`***, **`model_dir`**:`str`=***`'models'`***, **`callback_fns`**:`Collection`\\[`Callable`\\]=***`None`***, **`callbacks`**:`Collection`\\[[`Callback`](/callback.html#Callback)\\]=***``***, **`layer_groups`**:`ModuleList`=***`None`***) :: [`Learner`](/basic_train.html#Learner)\n", "\n", "[`Learner`](/basic_train.html#Learner) suitable for collaborative filtering. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(CollabLearner, title_level=3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is a subclass of [`Learner`](/basic_train.html#Learner) that just introduces helper functions to analyze results; the initialization is the same as for a regular [`Learner`](/basic_train.html#Learner)." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

bias[source]

\n", "\n", "> bias(**`arr`**:`Collection`\\[`T_co`\\], **`is_item`**:`bool`=***`True`***)\n", "\n", "Bias for item or user (based on `is_item`) for all in `arr`. (Set model to `cpu` and no grad.) " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(CollabLearner.bias)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

get_idx[source]

\n", "\n", "> get_idx(**`arr`**:`Collection`\\[`T_co`\\], **`is_item`**:`bool`=***`True`***)\n", "\n", "Fetch item or user (based on `is_item`) for all in `arr`. (Set model to `cpu` and no grad.) " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(CollabLearner.get_idx)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

weight[source]

\n", "\n", "> weight(**`arr`**:`Collection`\\[`T_co`\\], **`is_item`**:`bool`=***`True`***)\n", "\n", "Weight for item or user (based on `is_item`) for all in `arr`. (Set model to `cpu` and no grad.) " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(CollabLearner.weight)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

class EmbeddingDotBias[source]

\n", "\n", "> EmbeddingDotBias(**`n_factors`**:`int`, **`n_users`**:`int`, **`n_items`**:`int`, **`y_range`**:`Point`=***`None`***) :: [`Module`](https://pytorch.org/docs/stable/nn.html#torch.nn.Module)\n", "\n", "Base dot model for collaborative filtering. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(EmbeddingDotBias, title_level=3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Creates a simple model with `Embedding` weights and biases for `n_users` and `n_items`, with `n_factors` latent factors. Takes the dot product of the user and item embeddings and adds the biases; then, if `y_range` is specified, feeds the result to a sigmoid rescaled to go from `y_range[0]` to `y_range[1]`. " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

class EmbeddingNN[source]

\n", "\n", "> EmbeddingNN(**`emb_szs`**:`ListSizes`, **`layers`**:`Collection`\\[`int`\\]=***`None`***, **`ps`**:`Collection`\\[`float`\\]=***`None`***, **`emb_drop`**:`float`=***`0.0`***, **`y_range`**:`OptRange`=***`None`***, **`use_bn`**:`bool`=***`True`***, **`bn_final`**:`bool`=***`False`***) :: [`TabularModel`](/tabular.models.html#TabularModel)\n", "\n", "Subclass [`TabularModel`](/tabular.models.html#TabularModel) to create a NN suitable for collaborative filtering. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(EmbeddingNN, title_level=3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`emb_szs` will overwrite the default and `kwargs` are passed to [`TabularModel`](/tabular.models.html#TabularModel)." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

collab_learner[source]

\n", "\n", "> collab_learner(**`data`**, **`n_factors`**:`int`=***`None`***, **`use_nn`**:`bool`=***`False`***, **`emb_szs`**:`Dict`\\[`str`, `int`\\]=***`None`***, **`layers`**:`Collection`\\[`int`\\]=***`None`***, **`ps`**:`Collection`\\[`float`\\]=***`None`***, **`emb_drop`**:`float`=***`0.0`***, **`y_range`**:`OptRange`=***`None`***, **`use_bn`**:`bool`=***`True`***, **`bn_final`**:`bool`=***`False`***, **\\*\\*`learn_kwargs`**) → [`Learner`](/basic_train.html#Learner)\n", "\n", "Create a Learner for collaborative filtering on `data`. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(collab_learner)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "More specifically, binds `data` with a model that is either an [`EmbeddingDotBias`](/collab.html#EmbeddingDotBias) with `n_factors` if `use_nn=False` or an [`EmbeddingNN`](/collab.html#EmbeddingNN) with `emb_szs` otherwise. In both cases the numbers of users and items are inferred from the data. `y_range` can be specified as an argument, and you can pass [`metrics`](/metrics.html#metrics) or `wd` through `learn_kwargs` to the [`Learner`](/basic_train.html#Learner) constructor." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Links with the Data Block API" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

class CollabLine[source]

\n", "\n", "> CollabLine(**`cats`**, **`conts`**, **`classes`**, **`names`**) :: [`TabularLine`](/tabular.data.html#TabularLine)" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(CollabLine, doc_string=False, title_level=3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Subclass of [`TabularLine`](/tabular.data.html#TabularLine) for collaborative filtering." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

class CollabList[source]

\n", "\n", "> CollabList(**`items`**:`Iterator`\\[`T_co`\\], **`cat_names`**:`OptStrList`=***`None`***, **`cont_names`**:`OptStrList`=***`None`***, **`procs`**=***`None`***, **\\*\\*`kwargs`**) → `TabularList` :: [`TabularList`](/tabular.data.html#TabularList)" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(CollabList, title_level=3, doc_string=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Subclass of [`TabularList`](/tabular.data.html#TabularList) for collaborative filtering." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Undocumented Methods - Methods moved below this line will intentionally be hidden" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

forward[source]

\n", "\n", "> forward(**`users`**:`LongTensor`, **`items`**:`LongTensor`) → `Tensor`\n", "\n", "Defines the computation performed at every call. Should be overridden by all subclasses.\n", "\n", ".. note::\n", " Although the recipe for forward pass needs to be defined within\n", " this function, one should call the :class:`Module` instance afterwards\n", " instead of this since the former takes care of running the\n", " registered hooks while the latter silently ignores them. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(EmbeddingDotBias.forward)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/markdown": [ "

reconstruct[source]

\n", "\n", "> reconstruct(**`t`**:`Tensor`)\n", "\n", "Reconstruct one of the underlying item for its data `t`. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(CollabList.reconstruct)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/markdown": [ "

forward[source]

\n", "\n", "> forward(**`users`**:`LongTensor`, **`items`**:`LongTensor`) → `Tensor`\n", "\n", "Defines the computation performed at every call. Should be overridden by all subclasses.\n", "\n", ".. note::\n", " Although the recipe for forward pass needs to be defined within\n", " this function, one should call the :class:`Module` instance afterwards\n", " instead of this since the former takes care of running the\n", " registered hooks while the latter silently ignores them. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(EmbeddingNN.forward)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## New Methods - Please document or move to the undocumented section" ] } ], "metadata": { "jekyll": { "keywords": "fastai", "summary": "Application to collaborative filtering", "title": "collab" }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" } }, "nbformat": 4, "nbformat_minor": 2 }