{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Collaborative filtering" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "hide_input": true }, "outputs": [], "source": [ "from fastai.gen_doc.nbdoc import *" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This package contains all the necessary functions to quickly train a model for a collaborative filtering task. Let's start by importing all we'll need." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "from fastai import *\n", "from fastai.collab import * " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Overview" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Collaborative filtering is when you're tasked to predict how much a user is going to like a certain item. The fastai library contains a [`CollabFilteringDataset`](/collab.html#CollabFilteringDataset) class that will help you create datasets suitable for training, and a function `get_colab_learner` to build a simple model directly from a ratings table. Let's first see how we can get started before devling in the documentation.\n", "\n", "For our example, we'll use a small subset of the [MovieLens](https://grouplens.org/datasets/movielens/) dataset. In there, we have to predict the rating a user gave a given movie (from 0 to 5). It comes in the form of a csv file where each line is the rating of a movie by a given person." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
userIdmovieIdratingtimestamp
07310974.01255504951
15619243.51172695223
21572603.51291598691
335812105.0957481884
41303162.01138999234
\n", "
" ], "text/plain": [ " userId movieId rating timestamp\n", "0 73 1097 4.0 1255504951\n", "1 561 924 3.5 1172695223\n", "2 157 260 3.5 1291598691\n", "3 358 1210 5.0 957481884\n", "4 130 316 2.0 1138999234" ] }, "execution_count": null, "metadata": {}, "output_type": "execute_result" } ], "source": [ "path = untar_data(URLs.ML_SAMPLE)\n", "ratings = pd.read_csv(path/'ratings.csv')\n", "ratings.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We'll first turn the `userId` and `movieId` columns in category codes, so that we can replace them with their codes when it's time to feed them to an `Embedding` layer. This step would be even more important if our csv had names of users, or names of items in it." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "series2cat(ratings, 'userId','movieId')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that this step is done, we can directly create a [`Learner`](/basic_train.html#Learner) object:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "learn = get_collab_learner(ratings, n_factors=50, pct_val=0.2, min_score=0., max_score=5.)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And the immediately begin training" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "", "version_major": 2, "version_minor": 0 }, "text/plain": [ "VBox(children=(HBox(children=(IntProgress(value=0, max=5), HTML(value='0.00% [0/5 00:00<00:00]'))), HTML(value…" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Total time: 00:04\n", "epoch train loss valid loss\n", "1 2.368736 1.849535 (00:00)\n", "2 1.080932 0.691473 (00:00)\n", "3 0.740156 0.669135 (00:00)\n", "4 0.629487 0.658641 (00:00)\n", "5 0.599293 0.654870 (00:00)\n", "\n" ] } ], "source": [ "learn.fit_one_cycle(5, 5e-3, wd=0.1)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

class CollabFilteringDataset[source]

\n", "\n", "> CollabFilteringDataset(`user`:`Series`, `item`:`Series`, `ratings`:`ndarray`) :: [`DatasetBase`](/basic_data.html#DatasetBase)" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(CollabFilteringDataset, doc_string=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is the basic class to buil a [`Dataset`](https://pytorch.org/docs/stable/data.html#torch.utils.data.Dataset) suitable for colaborative filtering. `user` and `item` should be categorical series that will be replaced with their codes internally and have the corresponding `ratings`. One of the factory methods will prepare the data in this format." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

from_df[source]

\n", "\n", "> from_df(`rating_df`:`DataFrame`, `pct_val`:`float`=`0.2`, `user_name`:`Optional`\\[`str`\\]=`None`, `item_name`:`Optional`\\[`str`\\]=`None`, `rating_name`:`Optional`\\[`str`\\]=`None`) → `Tuple`\\[`ColabFilteringDataset`, `ColabFilteringDataset`\\]" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(CollabFilteringDataset.from_df, doc_string=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Takes a `rating_df` and splits it randomly for train and test following `pct_val` (unless it's None). `user_name`, `item_name` and `rating_name` give the names of the corresponding columns (defaults to the first, the second and the third column)." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

from_csv[source]

\n", "\n", "> from_csv(`csv_name`:`str`, `kwargs`) → `Tuple`\\[`ColabFilteringDataset`, `ColabFilteringDataset`\\]" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(CollabFilteringDataset.from_csv, doc_string=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Opens the file in `csv_name` as a `DataFrame` and feeds it to `show_doc.from_df` with the `kwargs`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Model and [`Learner`](/basic_train.html#Learner)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

class EmbeddingDotBias[source]

\n", "\n", "> EmbeddingDotBias(`n_factors`:`int`, `n_users`:`int`, `n_items`:`int`, `min_score`:`float`=`None`, `max_score`:`float`=`None`) :: [`Module`](https://pytorch.org/docs/stable/nn.html#torch.nn.Module)" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(EmbeddingDotBias, doc_string=False, title_level=3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Creates a simple model with `Embedding` weights and biases for `n_users` and `n_items`, with `n_factors` latent factors. Takes the dot product of the embeddings and adds the bias, then feed the result to a sigmoid rescaled to go from `min_score` to `max_score`. " ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

get_collab_learner[source]

\n", "\n", "> get_collab_learner(`ratings`:`DataFrame`, `n_factors`:`int`, `pct_val`:`float`=`0.2`, `user_name`:`Optional`\\[`str`\\]=`None`, `item_name`:`Optional`\\[`str`\\]=`None`, `rating_name`:`Optional`\\[`str`\\]=`None`, `test`:`DataFrame`=`None`, `metrics`=`None`, `min_score`:`float`=`None`, `max_score`:`float`=`None`, `kwargs`) → [`Learner`](/basic_train.html#Learner)" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(get_collab_learner, doc_string=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Creates a [`Learner`](/basic_train.html#Learner) object built from the data in `ratings`, `pct_val`, `user_name`, `item_name`, `rating_name` to [`CollabFilteringDataset`](/collab.html#CollabFilteringDataset). Optionally, creates another [`CollabFilteringDataset`](/collab.html#CollabFilteringDataset) for `test`. `kwargs` are fed to [`DataBunch.create`](/basic_data.html#DataBunch.create) with these datasets. The model is given by [`EmbeddingDotBias`](/collab.html#EmbeddingDotBias) with `n_factors`, `min_score` and `max_score` (the numbers of users and items will be inferred from the data)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Undocumented Methods - Methods moved below this line will intentionally be hidden" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "hide_input": true }, "outputs": [ { "data": { "text/markdown": [ "

forward[source]

\n", "\n", "> forward(`users`:`LongTensor`, `items`:`LongTensor`) → `Tensor`\n", "\n", "Defines the computation performed at every call. Should be overridden by all subclasses.\n", "\n", ".. note::\n", " Although the recipe for forward pass needs to be defined within\n", " this function, one should call the :class:`Module` instance afterwards\n", " instead of this since the former takes care of running the\n", " registered hooks while the latter silently ignores them. " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "show_doc(EmbeddingDotBias.forward)" ] } ], "metadata": { "jekyll": { "keywords": "fastai", "summary": "Application to collaborative filtering", "title": "collab" }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.6" } }, "nbformat": 4, "nbformat_minor": 2 }