{ "nbformat": 4, "nbformat_minor": 0, "metadata": { "colab": { "name": "2021-06-27-analytics-zoo-ncf-goodreads.ipynb", "provenance": [], "collapsed_sections": [], "toc_visible": true, "authorship_tag": "ABX9TyNYHqJssm3LIgxxMIb7VwO7" }, "kernelspec": { "name": "python3", "display_name": "Python 3" }, "language_info": { "name": "python" }, "accelerator": "GPU" }, "cells": [ { "cell_type": "markdown", "metadata": { "id": "vWAyqrazbVkU" }, "source": [ "# Analytics Zoo Recommendation Part 2\n", "> Applying NCF on Goodreads using Analytics Zoo library\n", "\n", "- toc: true\n", "- badges: true\n", "- comments: true\n", "- categories: [Book, BigData, PySpark, AnalyticsZoo, NCF]\n", "- image:" ] }, { "cell_type": "markdown", "metadata": { "id": "5tJLYN9yYJxO" }, "source": [ "## Introduction" ] }, { "cell_type": "markdown", "metadata": { "id": "f4e-a8D8YSXJ" }, "source": [ "NCF Recommender with Explicit Feedback\n", "\n", "In this notebook we demonstrate how to build a neural network recommendation system, Neural Collaborative Filtering (NCF), with explicit feedback. We use the Recommender API in Analytics Zoo to build the model, and the BigDL optimizer to train it. \n", "\n", "An explicit-feedback system ([Recommendation systems: Principles, methods and evaluation](http://www.sciencedirect.com/science/article/pii/S1110866515000341)) normally prompts the user through the system interface to provide ratings for items in order to construct and improve the model. The accuracy of recommendation depends on the quantity of ratings provided by the user. \n", "\n", "NCF ([He et al., 2017](https://www.comp.nus.edu.sg/~xiangnan/papers/ncf.pdf)) leverages a multi-layer perceptron to learn the user–item interaction function. At the same time, NCF can express and generalize matrix factorization under its framework. The `include_mf` flag (Boolean) lets users build an NCF model with or without matrix factorization. 
\n", "\n", "Data: \n", "* Goodreads book ratings dataset \n", " \n", "References: \n", "* A Keras implementation of Movie Recommendation ([notebook](https://github.com/ririw/ririw.github.io/blob/master/assets/Recommending%20movies.ipynb)) from the [blog](http://blog.richardweiss.org/2016/09/25/movie-embeddings.html).\n", "* Neural Collaborative Filtering ([He et al., 2017](https://www.comp.nus.edu.sg/~xiangnan/papers/ncf.pdf))\n", "\n", "Python interface:\n", "\n", "```python\n", "ncf = NeuralCF(user_count, item_count, class_num, user_embed=20, item_embed=20, hidden_layers=(40, 20, 10), include_mf=True, mf_embed=20)\n", "```\n", "\n", "- `user_count`: The number of users. Positive int.\n", "- `item_count`: The number of items. Positive int.\n", "- `class_num`: The number of classes. Positive int.\n", "- `user_embed`: Units of user embedding. Positive int. Default is 20.\n", "- `item_embed`: Units of item embedding. Positive int. Default is 20.\n", "- `hidden_layers`: Units of hidden layers for MLP. Tuple of positive int. Default is (40, 20, 10).\n", "- `include_mf`: Whether to include Matrix Factorization. Boolean. Default is True.\n", "- `mf_embed`: Units of matrix factorization embedding. Positive int. Default is 20." 
] }, { "cell_type": "markdown", "metadata": { "id": "IPqTsm9pSckt" }, "source": [ "## Installation" ] }, { "cell_type": "markdown", "metadata": { "id": "u5UkOC-3SuwP" }, "source": [ "### Install Java 8" ] }, { "cell_type": "markdown", "metadata": { "id": "nY8_OcwpSyha" }, "source": [ "Run the following command in the Colab notebook to install JDK 1.8." ] }, { "cell_type": "code", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "8oH5g_58SScM", "outputId": "dfb99804-1975-4159-ef51-d23e021ec621" }, "source": [ "# Install jdk8\n", "!apt-get install openjdk-8-jdk-headless -qq > /dev/null\n", "# Set jdk environment path which enables you to run Pyspark in your Colab environment.\n", "import os\n", "os.environ[\"JAVA_HOME\"] = \"/usr/lib/jvm/java-8-openjdk-amd64\"\n", "!update-alternatives --set java /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java" ], "execution_count": 1, "outputs": [ { "output_type": "stream", "text": [ "update-alternatives: using /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java to provide /usr/bin/java (java) in manual mode\n" ], "name": "stdout" } ] }, { "cell_type": "markdown", "metadata": { "id": "zPqef37WS2r-" }, "source": [ "### Install Analytics Zoo from pip" ] }, { "cell_type": "markdown", "metadata": { "id": "Mo4najsbS5RX" }, "source": [ "Run the following command in your Colab notebook to install analytics-zoo via pip:" ] }, { "cell_type": "code", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "RtHKGTcSS1cD", "outputId": "00364579-40ef-4fb9-e770-70428dc04ddd" }, "source": [ "# Install latest release version of analytics-zoo \n", "# Installing analytics-zoo from pip will automatically install pyspark, bigdl, and their dependencies.\n", "!pip install analytics-zoo" ], "execution_count": 2, "outputs": [ { "output_type": "stream", "text": [ "Collecting analytics-zoo\n", "\u001b[?25l Downloading 
https://files.pythonhosted.org/packages/49/32/2fe969c682b8683fb3792f06107fe8ac56f165e307d327756c506aa76d31/analytics_zoo-0.10.0-py2.py3-none-manylinux1_x86_64.whl (158.9MB)\n", "\u001b[K |████████████████████████████████| 158.9MB 81kB/s \n", "\u001b[?25hCollecting conda-pack==0.3.1\n", " Downloading https://files.pythonhosted.org/packages/e9/e7/d942780c4281a665f34dbfffc1cd1517c5843fb478c133a1e1fa0df30cd6/conda_pack-0.3.1-py2.py3-none-any.whl\n", "Collecting bigdl==0.12.2\n", "\u001b[?25l Downloading https://files.pythonhosted.org/packages/1a/40/81fed203a633536dbb83454f1a102ebde86214f576572769290afb9427c9/BigDL-0.12.2-py2.py3-none-manylinux1_x86_64.whl (114.1MB)\n", "\u001b[K |████████████████████████████████| 114.1MB 2.5MB/s \n", "\u001b[?25hCollecting pyspark==2.4.3\n", "\u001b[?25l Downloading https://files.pythonhosted.org/packages/37/98/244399c0daa7894cdf387e7007d5e8b3710a79b67f3fd991c0b0b644822d/pyspark-2.4.3.tar.gz (215.6MB)\n", "\u001b[K |████████████████████████████████| 215.6MB 69kB/s \n", "\u001b[?25hRequirement already satisfied: setuptools in /usr/local/lib/python3.7/dist-packages (from conda-pack==0.3.1->analytics-zoo) (57.0.0)\n", "Requirement already satisfied: six>=1.10.0 in /usr/local/lib/python3.7/dist-packages (from bigdl==0.12.2->analytics-zoo) (1.15.0)\n", "Requirement already satisfied: numpy>=1.7 in /usr/local/lib/python3.7/dist-packages (from bigdl==0.12.2->analytics-zoo) (1.19.5)\n", "Collecting py4j==0.10.7\n", "\u001b[?25l Downloading https://files.pythonhosted.org/packages/e3/53/c737818eb9a7dc32a7cd4f1396e787bd94200c3997c72c1dbe028587bd76/py4j-0.10.7-py2.py3-none-any.whl (197kB)\n", "\u001b[K |████████████████████████████████| 204kB 53.7MB/s \n", "\u001b[?25hBuilding wheels for collected packages: pyspark\n", " Building wheel for pyspark (setup.py) ... 
\u001b[?25l\u001b[?25hdone\n", " Created wheel for pyspark: filename=pyspark-2.4.3-py2.py3-none-any.whl size=215964968 sha256=6d89c562d8dc3446296b8872b20b52d86bddd6342fd3dd59448c51a9d25a0513\n", " Stored in directory: /root/.cache/pip/wheels/8d/20/f0/b30e2024226dc112e256930dd2cd4f06d00ab053c86278dcf3\n", "Successfully built pyspark\n", "Installing collected packages: conda-pack, py4j, pyspark, bigdl, analytics-zoo\n", "Successfully installed analytics-zoo-0.10.0 bigdl-0.12.2 conda-pack-0.3.1 py4j-0.10.7 pyspark-2.4.3\n" ], "name": "stdout" } ] }, { "cell_type": "markdown", "metadata": { "id": "M8KiG7WhTAPa" }, "source": [ "### Initialize context" ] }, { "cell_type": "markdown", "metadata": { "id": "ZWWxCdrkTCqY" }, "source": [ "Call init_nncontext() that will create a SparkContext with optimized performance configurations." ] }, { "cell_type": "code", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "Wy6CnH-8S81r", "outputId": "06a6e670-b16f-487d-f2a4-857f6d157603" }, "source": [ "from zoo.common.nncontext import*\n", "\n", "sc = init_nncontext()" ], "execution_count": 3, "outputs": [ { "output_type": "stream", "text": [ "Prepending /usr/local/lib/python3.7/dist-packages/bigdl/share/conf/spark-bigdl.conf to sys.path\n", "Adding /usr/local/lib/python3.7/dist-packages/zoo/share/lib/analytics-zoo-bigdl_0.12.2-spark_2.4.3-0.10.0-jar-with-dependencies.jar to BIGDL_JARS\n", "Prepending /usr/local/lib/python3.7/dist-packages/zoo/share/conf/spark-analytics-zoo.conf to sys.path\n", "pyspark_submit_args is: --driver-class-path /usr/local/lib/python3.7/dist-packages/zoo/share/lib/analytics-zoo-bigdl_0.12.2-spark_2.4.3-0.10.0-jar-with-dependencies.jar:/usr/local/lib/python3.7/dist-packages/bigdl/share/lib/bigdl-0.12.2-jar-with-dependencies.jar pyspark-shell \n" ], "name": "stdout" } ] }, { "cell_type": "markdown", "metadata": { "id": "nLQymd-_TbKx" }, "source": [ "Analytics Zoo provides three Recommenders, including Wide and Deep (WND) model, Neural 
network-based Collaborative Filtering (NCF) model and Session Recommender model. These are easy-to-use Keras-style models that provide compile and fit methods for training. Alternatively, they can be fed into NNFrames or the BigDL Optimizer.\n", "\n", "WND and NCF recommenders can handle either explicit or implicit feedback, given corresponding features." ] }, { "cell_type": "markdown", "metadata": { "id": "_KTBMy52T5he" }, "source": [ "## Imports" ] }, { "cell_type": "code", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "nsib3Wg_T7E5", "outputId": "7ebed2eb-383e-4dda-b8c9-6a8b14fa9495" }, "source": [ "from zoo.pipeline.api.keras.layers import *\n", "from zoo.models.recommendation import UserItemFeature\n", "from zoo.models.recommendation import NeuralCF\n", "from zoo.common.nncontext import init_nncontext\n", "import matplotlib\n", "from sklearn import metrics\n", "from operator import itemgetter\n", "from bigdl.util.common import *\n", "\n", "import os\n", "import numpy as np\n", "from sklearn import preprocessing\n", "\n", "matplotlib.use('agg')\n", "import matplotlib.pyplot as plt\n", "%pylab inline" ], "execution_count": 4, "outputs": [ { "output_type": "stream", "text": [ "Populating the interactive namespace from numpy and matplotlib\n" ], "name": "stdout" } ] }, { "cell_type": "markdown", "metadata": { "id": "Cl6H99sBTgOt" }, "source": [ "## Download goodreads dataset" ] }, { "cell_type": "code", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "wkKAeEORTPZ_", "outputId": "17ec66dd-dda9-4934-e57f-99692399c955" }, "source": [ "!wget https://github.com/sparsh-ai/reco-data/raw/master/goodreads/ratings.csv" ], "execution_count": 5, "outputs": [ { "output_type": "stream", "text": [ "--2021-06-27 12:57:02-- https://github.com/sparsh-ai/reco-data/raw/master/goodreads/ratings.csv\n", "Resolving github.com (github.com)... 192.30.255.113\n", "Connecting to github.com (github.com)|192.30.255.113|:443... 
connected.\n", "HTTP request sent, awaiting response... 302 Found\n", "Location: https://raw.githubusercontent.com/sparsh-ai/reco-data/master/goodreads/ratings.csv [following]\n", "--2021-06-27 12:57:04-- https://raw.githubusercontent.com/sparsh-ai/reco-data/master/goodreads/ratings.csv\n", "Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...\n", "Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.\n", "HTTP request sent, awaiting response... 200 OK\n", "Length: 4976229 (4.7M) [text/plain]\n", "Saving to: ‘ratings.csv’\n", "\n", "ratings.csv 100%[===================>] 4.75M --.-KB/s in 0.1s \n", "\n", "2021-06-27 12:57:04 (41.9 MB/s) - ‘ratings.csv’ saved [4976229/4976229]\n", "\n" ], "name": "stdout" } ] }, { "cell_type": "markdown", "metadata": { "id": "cizJHBZhVV15" }, "source": [ "## Read the dataset" ] }, { "cell_type": "code", "metadata": { "id": "vT8aZM0BT2oi" }, "source": [ "def read_data_sets(data_dir):\n", " rating_files = os.path.join(data_dir,\"ratings.csv\")\n", " rating_list = [i.strip().split(\",\") for i in open(rating_files,\"r\").readlines()] \n", " goodreads_data = np.array(rating_list[1:]).astype(int)\n", " return goodreads_data \n", "\n", "def get_id_pairs(data_dir):\n", "\tgoodreads_data = read_data_sets(data_dir)\n", "\treturn goodreads_data[:, 0:2]\n", "\n", "def get_id_ratings(data_dir):\n", " goodreads_data = read_data_sets(data_dir)\n", " le_user = preprocessing.LabelEncoder()\n", " goodreads_data[:, 0] = le_user.fit_transform(goodreads_data[:, 0])\n", " le_item = preprocessing.LabelEncoder()\n", " goodreads_data[:, 1] = le_item.fit_transform(goodreads_data[:, 1])\n", " return goodreads_data[:, 0:3]" ], "execution_count": 6, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "7n6vQz-3VR4I" }, "source": [ "goodreads_data = get_id_ratings(\"/content\")" ], "execution_count": 7, "outputs": [] }, { "cell_type": 
"markdown", "metadata": { "id": "GCRIqd0hV2-5" }, "source": [ "## Understand the data" ] }, { "cell_type": "code", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "nY3Bbk1xV8wL", "outputId": "61215aa9-a1a2-4f01-9de1-798c0316afe0" }, "source": [ "min_user_id = np.min(goodreads_data[:,0])\n", "max_user_id = np.max(goodreads_data[:,0])\n", "min_book_id = np.min(goodreads_data[:,1])\n", "max_book_id = np.max(goodreads_data[:,1])\n", "rating_labels= np.unique(goodreads_data[:,2])\n", "\n", "print(goodreads_data.shape)\n", "print(min_user_id, max_user_id, min_book_id, max_book_id, rating_labels)" ], "execution_count": 8, "outputs": [ { "output_type": "stream", "text": [ "(454517, 3)\n", "0 4999 0 4999 [1 2 3 4 5]\n" ], "name": "stdout" } ] }, { "cell_type": "markdown", "metadata": { "id": "Vrs_dyQcVz8z" }, "source": [ "Each record has the format (userid, bookid, rating_score). After label encoding, both user IDs and book IDs range from 0 to 4999. Ratings are made on a 5-star scale (whole-star ratings only). The maximum user and book IDs are recorded for later use when building the model." ] }, { "cell_type": "markdown", "metadata": { "id": "O-Ks_RL2WMT7" }, "source": [ "## Transformation" ] }, { "cell_type": "markdown", "metadata": { "id": "cnUhH-ksWOxh" }, "source": [ "Transform the original data into an RDD of Samples. We use the BigDL optimizer to train the model, which requires data in the format RDD(Sample). A Sample is a BigDL data structure that can be constructed from two numpy arrays, the feature and the label. The API is Sample.from_ndarray(feature, label). In the cells that follow, ratings are shifted from 1-5 to 0-4 (x[2]-1) so that the labels are zero-based." 
] }, { "cell_type": "code", "metadata": { "id": "gmD0lS0eVnK7" }, "source": [ "def build_sample(user_id, item_id, rating):\n", " sample = Sample.from_ndarray(np.array([user_id, item_id]), np.array([rating]))\n", " return UserItemFeature(user_id, item_id, sample)" ], "execution_count": 9, "outputs": [] }, { "cell_type": "code", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "BcwsB2SsWVBZ", "outputId": "fc09f520-5afa-4e58-dc2a-bd7c2340ffcd" }, "source": [ "pairFeatureRdds = sc.parallelize(goodreads_data).map(lambda x: build_sample(x[0], x[1], x[2]-1))\n", "pairFeatureRdds.take(3)" ], "execution_count": 19, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "[,\n", " ,\n", " ]" ] }, "metadata": { "tags": [] }, "execution_count": 19 } ] }, { "cell_type": "markdown", "metadata": { "id": "wfqh_7FsWkrw" }, "source": [ "## Split" ] }, { "cell_type": "markdown", "metadata": { "id": "qw37Flz4Wm0F" }, "source": [ "Randomly split the data into training (80%) and validation (20%) sets." ] }, { "cell_type": "code", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "plRhKQ0BWdD5", "outputId": "d7d7cafb-82f8-46fd-9537-aac438ce823c" }, "source": [ "trainPairFeatureRdds, valPairFeatureRdds = pairFeatureRdds.randomSplit([0.8, 0.2], seed= 1)\n", "valPairFeatureRdds.cache()\n", "\n", "train_rdd= trainPairFeatureRdds.map(lambda pair_feature: pair_feature.sample)\n", "val_rdd= valPairFeatureRdds.map(lambda pair_feature: pair_feature.sample)\n", "val_rdd.persist()" ], "execution_count": 20, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "PythonRDD[37] at RDD at PythonRDD.scala:53" ] }, "metadata": { "tags": [] }, "execution_count": 20 } ] }, { "cell_type": "code", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "PCWetBlaWuyb", "outputId": "e1774d82-1356-4222-830e-5f182c495ccc" }, "source": [ "train_rdd.count()" ], "execution_count": 21, "outputs": [ { "output_type": 
"execute_result", "data": { "text/plain": [ "363662" ] }, "metadata": { "tags": [] }, "execution_count": 21 } ] }, { "cell_type": "code", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "czo63lHuW2DB", "outputId": "caf5f0d3-427d-4ffe-df95-9d930908e300" }, "source": [ "train_rdd.take(3)" ], "execution_count": 22, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "[Sample: features: [JTensor: storage: [510. 276.], shape: [2], float], labels: [JTensor: storage: [3.], shape: [1], float],\n", " Sample: features: [JTensor: storage: [2289. 2492.], shape: [2], float], labels: [JTensor: storage: [1.], shape: [1], float],\n", " Sample: features: [JTensor: storage: [3919. 1597.], shape: [2], float], labels: [JTensor: storage: [4.], shape: [1], float]]" ] }, "metadata": { "tags": [] }, "execution_count": 22 } ] }, { "cell_type": "markdown", "metadata": { "id": "KnONxSMsXARF" }, "source": [ "## Build model" ] }, { "cell_type": "markdown", "metadata": { "id": "IhpOgCvFXG2e" }, "source": [ "In Analytics Zoo, it is simple to build an NCF model by calling the NeuralCF API. You need to specify the user count, item count and class number according to your data, then add hidden layers as needed. You can also choose to include matrix factorization in the network. The model can be fed into a BigDL Optimizer or an Analytics Zoo NNClassifier. Please refer to the documentation for more details. In this example, we demonstrate how to use the BigDL optimizer." 
] }, { "cell_type": "code", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "4xMI7FZqW3zv", "outputId": "f3f248e2-c626-4a4a-ec98-3bd58c96436c" }, "source": [ "ncf = NeuralCF(user_count=max_user_id, \n", " item_count=max_book_id, \n", " class_num=5, \n", " hidden_layers=[20, 10], \n", " include_mf = False)" ], "execution_count": 23, "outputs": [ { "output_type": "stream", "text": [ "creating: createZooKerasInput\n", "creating: createZooKerasFlatten\n", "creating: createZooKerasSelect\n", "creating: createZooKerasFlatten\n", "creating: createZooKerasSelect\n", "creating: createZooKerasEmbedding\n", "creating: createZooKerasEmbedding\n", "creating: createZooKerasFlatten\n", "creating: createZooKerasFlatten\n", "creating: createZooKerasMerge\n", "creating: createZooKerasDense\n", "creating: createZooKerasDense\n", "creating: createZooKerasDense\n", "creating: createZooKerasModel\n", "creating: createZooNeuralCF\n" ], "name": "stdout" } ] }, { "cell_type": "markdown", "metadata": { "id": "BVfHXNF2XMLp" }, "source": [ "## Compile model" ] }, { "cell_type": "markdown", "metadata": { "id": "R_Xgwv_MXUUI" }, "source": [ "Compile the model with a specific optimizer, loss, and evaluation metrics. The optimizer tries to minimize the loss of the neural net with respect to its weights and biases over the training set. To create an Optimizer in BigDL, you need to specify at least these arguments: model (a neural network model), criterion (the loss function), training_rdd (the training dataset) and batch size. Please refer to the [ProgrammingGuide](https://bigdl-project.github.io/master/#ProgrammingGuide/optimization/) and [Optimizer](https://bigdl-project.github.io/master/#APIGuide/Optimizers/Optimizer/) for more details on creating efficient optimizers." 
] }, { "cell_type": "code", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "EpiU9RG_XKKM", "outputId": "d730c086-e91e-44f3-b0fc-81d66a32d509" }, "source": [ "ncf.compile(optimizer= \"adam\",\n", " loss= \"sparse_categorical_crossentropy\",\n", " metrics=['accuracy'])" ], "execution_count": 24, "outputs": [ { "output_type": "stream", "text": [ "creating: createAdam\n", "creating: createZooKerasSparseCategoricalCrossEntropy\n", "creating: createZooKerasSparseCategoricalAccuracy\n" ], "name": "stdout" } ] }, { "cell_type": "markdown", "metadata": { "id": "pOvfFZP3Xk0r" }, "source": [ "## Collect logs" ] }, { "cell_type": "markdown", "metadata": { "id": "ult1tu6QXnaX" }, "source": [ "You can use TensorBoard to view the training summaries." ] }, { "cell_type": "code", "metadata": { "id": "E44IwOlzXjYW" }, "source": [ "tmp_log_dir = create_tmp_path()\n", "ncf.set_tensorboard(tmp_log_dir, \"training_ncf\")" ], "execution_count": 25, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "IKipPFhsXqxS" }, "source": [ "## Train the model" ] }, { "cell_type": "code", "metadata": { "id": "4IHzRgUyXpu8" }, "source": [ "ncf.fit(train_rdd, \n", " nb_epoch= 10, \n", " batch_size= 5000,\n", " validation_data=val_rdd)" ], "execution_count": 26, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "YLj6L9JPXxri" }, "source": [ "## Prediction" ] }, { "cell_type": "markdown", "metadata": { "id": "uzDas0oIX11i" }, "source": [ "Zoo models make inferences on the given data with the model.predict(val_rdd) API, which returns an RDD of results. For this classifier, each result is a vector of class probabilities, and predict_class returns the predicted label." 
] }, { "cell_type": "code", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "gUkaK6TUXuDJ", "outputId": "16018ece-c981-4a17-ad43-014a1cae71e6" }, "source": [ "results = ncf.predict(val_rdd)\n", "results.take(5)" ], "execution_count": 27, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "[array([0.00071773, 0.00534406, 0.07192371, 0.46765593, 0.45435852],\n", " dtype=float32),\n", " array([0.14319365, 0.23656234, 0.28515524, 0.24342458, 0.09166414],\n", " dtype=float32),\n", " array([0.0084668 , 0.02982639, 0.17435056, 0.46424043, 0.32311577],\n", " dtype=float32),\n", " array([5.7515636e-04, 1.4479134e-02, 3.0612889e-01, 6.0694826e-01,\n", " 7.1868621e-02], dtype=float32),\n", " array([0.00603686, 0.01832466, 0.12356475, 0.32203293, 0.5300408 ],\n", " dtype=float32)]" ] }, "metadata": { "tags": [] }, "execution_count": 27 } ] }, { "cell_type": "code", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "aLhHUm7sZm1Y", "outputId": "b36d73bf-cf28-4197-9f5d-db752ddc7ace" }, "source": [ "results_class = ncf.predict_class(val_rdd)\n", "results_class.take(5)" ], "execution_count": 28, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "[4, 3, 4, 4, 5]" ] }, "metadata": { "tags": [] }, "execution_count": 28 } ] }, { "cell_type": "markdown", "metadata": { "id": "WqOfNL7gZrG6" }, "source": [ "In Analytics Zoo, Recommender provides 3 unique APIs to predict user-item pairs and make recommendations for users or items given candidates." 
] }, { "cell_type": "markdown", "metadata": { "id": "9ffsw5z7aBlO" }, "source": [ "### Predict for user item pairs" ] }, { "cell_type": "code", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "NXHMNrSsZnOD", "outputId": "df07c238-f264-4550-cfe8-2c09b9eb8073" }, "source": [ "userItemPairPrediction = ncf.predict_user_item_pair(valPairFeatureRdds)\n", "for result in userItemPairPrediction.take(5): print(result)" ], "execution_count": 29, "outputs": [ { "output_type": "stream", "text": [ "UserItemPrediction [user_id: 2765, item_id: 53, prediction: 4, probability: 0.4676559269428253]\n", "UserItemPrediction [user_id: 4573, item_id: 95, prediction: 3, probability: 0.2851552367210388]\n", "UserItemPrediction [user_id: 3907, item_id: 15, prediction: 4, probability: 0.4642404317855835]\n", "UserItemPrediction [user_id: 790, item_id: 156, prediction: 4, probability: 0.6069482564926147]\n", "UserItemPrediction [user_id: 3315, item_id: 1721, prediction: 5, probability: 0.5300408005714417]\n" ], "name": "stdout" } ] }, { "cell_type": "markdown", "metadata": { "id": "SCgDNVQaZ-mJ" }, "source": [ "### Recommend 3 items for each user given candidates in the feature RDDs" ] }, { "cell_type": "code", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "W2uMAeiOZ5RJ", "outputId": "e09e990f-e600-4541-ba87-77f455cea036" }, "source": [ "userRecs = ncf.recommend_for_user(valPairFeatureRdds, 3)\n", "for result in userRecs.take(5): print(result)" ], "execution_count": 30, "outputs": [ { "output_type": "stream", "text": [ "UserItemPrediction [user_id: 3586, item_id: 850, prediction: 5, probability: 0.49247753620147705]\n", "UserItemPrediction [user_id: 3586, item_id: 2354, prediction: 4, probability: 0.6351815462112427]\n", "UserItemPrediction [user_id: 3586, item_id: 2787, prediction: 4, probability: 0.6106586456298828]\n", "UserItemPrediction [user_id: 1084, item_id: 4588, prediction: 5, probability: 0.7689940333366394]\n", "UserItemPrediction 
[user_id: 1084, item_id: 554, prediction: 5, probability: 0.7594876289367676]\n" ], "name": "stdout" } ] }, { "cell_type": "markdown", "metadata": { "id": "kOHBbcDAaH8K" }, "source": [ "### Recommend 3 users for each item given candidates in the feature RDDs" ] }, { "cell_type": "code", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "CLp29k3kaHB-", "outputId": "faf205ed-dcbe-4e27-a988-fa83f56fc0ef" }, "source": [ "itemRecs = ncf.recommend_for_item(valPairFeatureRdds, 3)\n", "for result in itemRecs.take(5): print(result)" ], "execution_count": 31, "outputs": [ { "output_type": "stream", "text": [ "UserItemPrediction [user_id: 3111, item_id: 3558, prediction: 5, probability: 0.48693835735321045]\n", "UserItemPrediction [user_id: 2024, item_id: 3558, prediction: 5, probability: 0.42628324031829834]\n", "UserItemPrediction [user_id: 4909, item_id: 3558, prediction: 4, probability: 0.562437891960144]\n", "UserItemPrediction [user_id: 3023, item_id: 1084, prediction: 5, probability: 0.8389995694160461]\n", "UserItemPrediction [user_id: 4790, item_id: 1084, prediction: 5, probability: 0.6715853810310364]\n" ], "name": "stdout" } ] }, { "cell_type": "markdown", "metadata": { "id": "FRJoSBJGaO43" }, "source": [ "## Evaluation" ] }, { "cell_type": "markdown", "metadata": { "id": "t8AOXDp9aSwD" }, "source": [ "Plot the train and validation loss curves" ] }, { "cell_type": "code", "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 407 }, "id": "MbS3TGZbaKva", "outputId": "1f7a9581-9feb-4022-95a3-bf013ab06fb7" }, "source": [ "#retrieve train and validation summary object and read the loss data into ndarray's. 
\n", "train_loss = np.array(ncf.get_train_summary(\"Loss\"))\n", "val_loss = np.array(ncf.get_validation_summary(\"Loss\"))\n", "#plot the train and validation curves\n", "# each event data is a tuple in form of (iteration_count, value, timestamp)\n", "plt.figure(figsize = (12,6))\n", "plt.plot(train_loss[:,0],train_loss[:,1],label='train loss')\n", "plt.plot(val_loss[:,0],val_loss[:,1],label='val loss',color='green')\n", "plt.scatter(val_loss[:,0],val_loss[:,1],color='green')\n", "plt.legend();\n", "plt.xlim(0,train_loss.shape[0]+10)\n", "plt.grid(True)\n", "plt.title(\"loss\")" ], "execution_count": 32, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "Text(0.5, 1.0, 'loss')" ] }, "metadata": { "tags": [] }, "execution_count": 32 }, { "output_type": "display_data", "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "tags": [], "needs_background": "light" } } ] }, { "cell_type": "markdown", "metadata": { "id": "-Yh804RoaW8M" }, "source": [ "Plot accuracy" ] }, { "cell_type": "code", "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 390 }, "id": "F0JtfjEMaVLc", "outputId": "2c22ed0b-8858-4180-bbe8-3f2d230210fd" }, "source": [ "plt.figure(figsize = (12,6))\n", "top1 = np.array(ncf.get_validation_summary(\"Top1Accuracy\"))\n", "plt.plot(top1[:,0],top1[:,1],label='top1')\n", "plt.title(\"top1 accuracy\")\n", "plt.grid(True)\n", "plt.legend();" ], "execution_count": 33, "outputs": [ { "output_type": "display_data", "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "tags": [], "needs_background": "light" } } ] } ] }