{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "QU13Xg9n0bSM" }, "source": [ "# Retail Product Recommendations using word2vec\n", "> Building a system that automatically recommends a fixed number of products to shoppers on an e-commerce website based on their past purchase behavior.\n", "\n", "- toc: true\n", "- badges: true\n", "- comments: true\n", "- categories: [sequence, retail]\n", "- image: " ] }, { "cell_type": "markdown", "metadata": { "id": "VwgbuHwVtshy" }, "source": [ "A person involved in sports-related activities might have an online buying pattern similar to this:" ] }, { "cell_type": "markdown", "metadata": { "id": "NQLbbKfmtn_O" }, "source": [ "![image.png]()" ] }, { "cell_type": "markdown", "metadata": { "id": "2tYh4SrWtynJ" }, "source": [ "If we can represent each of these products as a vector, then we can find similar products by comparing their vectors. So, when a user is viewing a product online, we can recommend similar products to them by using the vector similarity score between the products."
] }, { "cell_type": "code", "execution_count": 1, "metadata": { "executionInfo": { "elapsed": 1427, "status": "ok", "timestamp": 1619252390068, "user": { "displayName": "sparsh agarwal", "photoUrl": "", "userId": "00322518567794762549" }, "user_tz": -330 }, "id": "W0wI5j74nc9W" }, "outputs": [], "source": [ "#hide\n", "import pandas as pd\n", "import numpy as np\n", "import random\n", "from tqdm import tqdm\n", "from gensim.models import Word2Vec \n", "import matplotlib.pyplot as plt\n", "%matplotlib inline\n", "\n", "import warnings;\n", "warnings.filterwarnings('ignore')" ] }, { "cell_type": "markdown", "metadata": { "id": "cDi5Gu8Ou7bb" }, "source": [ "## Data gathering and understanding" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 3274, "status": "ok", "timestamp": 1619252391927, "user": { "displayName": "sparsh agarwal", "photoUrl": "", "userId": "00322518567794762549" }, "user_tz": -330 }, "id": "TlW0pTzGniGo", "outputId": "0392f779-4c42-4e9e-b096-51ee6487b53d" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "--2021-04-24 08:19:50-- https://archive.ics.uci.edu/ml/machine-learning-databases/00352/Online%20Retail.xlsx\n", "Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252\n", "Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.\n", "HTTP request sent, awaiting response... 
200 OK\n", "Length: 23715344 (23M) [application/x-httpd-php]\n", "Saving to: ‘Online Retail.xlsx’\n", "\n", "Online Retail.xlsx 100%[===================>] 22.62M 22.7MB/s in 1.0s \n", "\n", "2021-04-24 08:19:51 (22.7 MB/s) - ‘Online Retail.xlsx’ saved [23715344/23715344]\n", "\n" ] } ], "source": [ "#hide-output\n", "!wget https://archive.ics.uci.edu/ml/machine-learning-databases/00352/Online%20Retail.xlsx" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 204 }, "executionInfo": { "elapsed": 45709, "status": "ok", "timestamp": 1619252434373, "user": { "displayName": "sparsh agarwal", "photoUrl": "", "userId": "00322518567794762549" }, "user_tz": -330 }, "id": "z_Kt8wtZnjRm", "outputId": "e30cdfd6-fd16-48ab-fa22-283fdb3d2578" }, "outputs": [ { "data": { "text/html": [ "
<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th></th><th>InvoiceNo</th><th>StockCode</th><th>Description</th><th>Quantity</th><th>InvoiceDate</th><th>UnitPrice</th><th>CustomerID</th><th>Country</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>536365</td><td>85123A</td><td>WHITE HANGING HEART T-LIGHT HOLDER</td><td>6</td><td>2010-12-01 08:26:00</td><td>2.55</td><td>17850.0</td><td>United Kingdom</td></tr>\n",
"    <tr><th>1</th><td>536365</td><td>71053</td><td>WHITE METAL LANTERN</td><td>6</td><td>2010-12-01 08:26:00</td><td>3.39</td><td>17850.0</td><td>United Kingdom</td></tr>\n",
"    <tr><th>2</th><td>536365</td><td>84406B</td><td>CREAM CUPID HEARTS COAT HANGER</td><td>8</td><td>2010-12-01 08:26:00</td><td>2.75</td><td>17850.0</td><td>United Kingdom</td></tr>\n",
"    <tr><th>3</th><td>536365</td><td>84029G</td><td>KNITTED UNION FLAG HOT WATER BOTTLE</td><td>6</td><td>2010-12-01 08:26:00</td><td>3.39</td><td>17850.0</td><td>United Kingdom</td></tr>\n",
"    <tr><th>4</th><td>536365</td><td>84029E</td><td>RED WOOLLY HOTTIE WHITE HEART.</td><td>6</td><td>2010-12-01 08:26:00</td><td>3.39</td><td>17850.0</td><td>United Kingdom</td></tr>\n",
"  </tbody>\n",
"</table>" ], "text/plain": [ "  InvoiceNo StockCode  ... CustomerID         Country\n", "0    536365    85123A  ...    17850.0  United Kingdom\n", "1    536365     71053  ...    17850.0  United Kingdom\n", "2    536365    84406B  ...    17850.0  United Kingdom\n", "3    536365    84029G  ...    17850.0  United Kingdom\n", "4    536365    84029E  ...    17850.0  United Kingdom\n", "\n", "[5 rows x 8 columns]" ] }, "execution_count": 3, "metadata": { "tags": [] }, "output_type": "execute_result" } ], "source": [ "df = pd.read_excel('Online Retail.xlsx')\n", "df.head()" ] },
{ "cell_type": "markdown", "metadata": { "id": "KJHexOy0oPHl" }, "source": [ "The fields in this dataset are described below:\n", "\n", "1. __InvoiceNo:__ Invoice number, a unique number assigned to each transaction.\n", "\n", "2. __StockCode:__ Product/item code, a unique code assigned to each distinct product.\n", "\n", "3. __Description:__ Product description.\n", "\n", "4. __Quantity:__ The quantity of each product per transaction.\n", "\n", "5. __InvoiceDate:__ Invoice date and time, i.e. the day and time when each transaction was generated.\n", "\n", "6. __UnitPrice:__ Price per unit of the product.\n", "\n", "7. __CustomerID:__ Customer number, a unique number assigned to each customer.\n", "\n", "8. __Country:__ Country associated with the customer."
] }, { "cell_type": "markdown", "metadata": { "id": "h8BwXsF5ox--" }, "source": [ "## Data Preprocessing" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 45703, "status": "ok", "timestamp": 1619252434375, "user": { "displayName": "sparsh agarwal", "photoUrl": "", "userId": "00322518567794762549" }, "user_tz": -330 }, "id": "F58y_bU6nypT", "outputId": "26f4b722-e131-46af-9dde-3cf388986882" }, "outputs": [ { "data": { "text/plain": [ "InvoiceNo 0\n", "StockCode 0\n", "Description 1454\n", "Quantity 0\n", "InvoiceDate 0\n", "UnitPrice 0\n", "CustomerID 135080\n", "Country 0\n", "dtype: int64" ] }, "execution_count": 4, "metadata": { "tags": [] }, "output_type": "execute_result" } ], "source": [ "# check for missing values\n", "df.isnull().sum()" ] }, { "cell_type": "markdown", "metadata": { "id": "JM04nhUAot7Y" }, "source": [ "Since we have sufficient data, we will drop all the rows with missing values." 
] }, { "cell_type": "code", "execution_count": 5, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 45696, "status": "ok", "timestamp": 1619252434376, "user": { "displayName": "sparsh agarwal", "photoUrl": "", "userId": "00322518567794762549" }, "user_tz": -330 }, "id": "uqEjaaTKorZ4", "outputId": "061df95f-dfbf-4fda-fd6c-f3f329df87ed" }, "outputs": [ { "data": { "text/plain": [ "InvoiceNo 0\n", "StockCode 0\n", "Description 0\n", "Quantity 0\n", "InvoiceDate 0\n", "UnitPrice 0\n", "CustomerID 0\n", "Country 0\n", "dtype: int64" ] }, "execution_count": 5, "metadata": { "tags": [] }, "output_type": "execute_result" } ], "source": [ "# remove missing values\n", "df.dropna(inplace=True)\n", "\n", "# again check missing values\n", "df.isnull().sum()" ] },
{ "cell_type": "code", "execution_count": 6, "metadata": { "executionInfo": { "elapsed": 46353, "status": "ok", "timestamp": 1619252435036, "user": { "displayName": "sparsh agarwal", "photoUrl": "", "userId": "00322518567794762549" }, "user_tz": -330 }, "id": "RUib7vkCpd1E" }, "outputs": [], "source": [ "# Convert the StockCode to string datatype\n", "df['StockCode'] = df['StockCode'].astype(str)" ] },
{ "cell_type": "code", "execution_count": 7, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 46347, "status": "ok", "timestamp": 1619252435036, "user": { "displayName": "sparsh agarwal", "photoUrl": "", "userId": "00322518567794762549" }, "user_tz": -330 }, "id": "wPI7YWkQo6wu", "outputId": "e751a6d9-6ba2-42a7-8a4f-1aaee616f2b4" }, "outputs": [ { "data": { "text/plain": [ "4372" ] }, "execution_count": 7, "metadata": { "tags": [] }, "output_type": "execute_result" } ], "source": [ "# Check out the number of unique customers in our dataset\n", "customers = df[\"CustomerID\"].unique().tolist()\n", "len(customers)" ] },
{ "cell_type": "markdown", "metadata": { "id": "ED5y4TDxpOfL" }, "source": [ "There are 4,372 customers in our dataset. For each of these customers we will extract their buying history, which gives us 4,372 purchase sequences." ] },
{ "cell_type": "markdown", "metadata": { "id": "Aogi1piGvEGl" }, "source": [ "## Data Preparation" ] },
{ "cell_type": "markdown", "metadata": { "id": "F_WFU-rTvJ_l" }, "source": [ "It is good practice to set aside a small part of the dataset for validation. We will therefore use the data of 90% of the customers to create the word2vec embeddings and hold out the remaining 10% for validation. Let's split the data." ] },
{ "cell_type": "code", "execution_count": 8, "metadata": { "executionInfo": { "elapsed": 46345, "status": "ok", "timestamp": 1619252435037, "user": { "displayName": "sparsh agarwal", "photoUrl": "", "userId": "00322518567794762549" }, "user_tz": -330 }, "id": "WxWaZx3zpPFW" }, "outputs": [], "source": [ "# shuffle customer IDs\n", "random.shuffle(customers)\n", "\n", "# take the first 90% of the shuffled customer IDs for training\n", "customers_train = customers[:round(0.9 * len(customers))]\n", "\n", "# split data into train and validation set\n", "train_df = df[df['CustomerID'].isin(customers_train)]\n", "validation_df = df[~df['CustomerID'].isin(customers_train)]" ] },
{ "cell_type": "markdown", "metadata": { "id": "ul94yT2cqJ38" }, "source": [ "Let's create sequences of purchases made by the customers in the dataset for both the training and validation sets."
] }, { "cell_type": "code", "execution_count": 9, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 51786, "status": "ok", "timestamp": 1619252440487, "user": { "displayName": "sparsh agarwal", "photoUrl": "", "userId": "00322518567794762549" }, "user_tz": -330 }, "id": "uhIwWEe-qGwK", "outputId": "6de33b11-1871-4fea-e60e-56e4cb03db7a" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "100%|██████████| 3935/3935 [00:05<00:00, 664.97it/s]\n" ] } ], "source": [ "# list to capture purchase history of the customers\n", "purchases_train = []\n", "\n", "# populate the list with the product codes\n", "for i in tqdm(customers_train):\n", "    temp = train_df[train_df[\"CustomerID\"] == i][\"StockCode\"].tolist()\n", "    purchases_train.append(temp)" ] },
{ "cell_type": "code", "execution_count": 10, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 52224, "status": "ok", "timestamp": 1619252440935, "user": { "displayName": "sparsh agarwal", "photoUrl": "", "userId": "00322518567794762549" }, "user_tz": -330 }, "id": "VGT9oyVeqhky", "outputId": "8198f6f0-9649-4214-ef55-20576f4a6e9f" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "100%|██████████| 437/437 [00:00<00:00, 1006.50it/s]\n" ] } ], "source": [ "# list to capture purchase history of the customers\n", "purchases_val = []\n", "\n", "# populate the list with the product codes\n", "for i in tqdm(validation_df['CustomerID'].unique()):\n", "    temp = validation_df[validation_df[\"CustomerID\"] == i][\"StockCode\"].tolist()\n", "    purchases_val.append(temp)" ] },
{ "cell_type": "markdown", "metadata": { "id": "AgDLwI_4q4Fm" }, "source": [ "## Build word2vec Embeddings for Products" ] },
{ "cell_type": "code", "execution_count": 11, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 106693, "status": "ok", "timestamp": 1619252495414, "user": { "displayName": "sparsh agarwal", "photoUrl": "", "userId": "00322518567794762549" }, "user_tz": -330 }, "id": "rr_tHmmuqu24", "outputId": "dcedddc7-d410-4d0c-bd3a-79f6022f42ef" }, "outputs": [ { "data": { "text/plain": [ "(3657318, 3696290)" ] }, "execution_count": 11, "metadata": { "tags": [] }, "output_type": "execute_result" } ], "source": [ "# train word2vec model (skip-gram with negative sampling)\n", "model = Word2Vec(window=10, sg=1, hs=0,\n", "                 negative=10,  # for negative sampling\n", "                 alpha=0.03, min_alpha=0.0007,\n", "                 seed=14)\n", "\n", "model.build_vocab(purchases_train, progress_per=200)\n", "\n", "model.train(purchases_train, total_examples=model.corpus_count,\n", "            epochs=10, report_delay=1)" ] },
{ "cell_type": "code", "execution_count": 12, "metadata": { "executionInfo": { "elapsed": 106690, "status": "ok", "timestamp": 1619252495414, "user": { "displayName": "sparsh agarwal", "photoUrl": "", "userId": "00322518567794762549" }, "user_tz": -330 }, "id": "_CwTzmNqq_lQ" }, "outputs": [], "source": [ "# save word2vec model\n", "model.save(\"word2vec_2.model\")" ] },
{ "cell_type": "markdown", "metadata": { "id": "HBbudJ7hrDw0" }, "source": [ "As we do not plan to train the model any further, we call `init_sims(replace=True)`, which L2-normalizes the vectors in place and discards the unnormalized copies, making the model more memory-efficient." ] },
{ "cell_type": "code", "execution_count": 13, "metadata": { "executionInfo": { "elapsed": 106689, "status": "ok", "timestamp": 1619252495415, "user": { "displayName": "sparsh agarwal", "photoUrl": "", "userId": "00322518567794762549" }, "user_tz": -330 }, "id": "UGYK8p1xrAJy" }, "outputs": [], "source": [ "model.init_sims(replace=True)" ] },
{ "cell_type": "code", "execution_count": 14, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 106680, "status": "ok", "timestamp": 1619252495416, "user": { "displayName": "sparsh agarwal", "photoUrl": "", "userId": "00322518567794762549" }, "user_tz": -330 }, "id": "M_AbUfMOrGsI", "outputId": "66d6d641-2dfb-4129-b27c-6eeafffd07a2" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Word2Vec(vocab=3153, size=100, alpha=0.03)\n" ] } ], "source": [ "print(model)" ] },
{ "cell_type": "markdown", "metadata": { "id": "HKFT2SECrMY1" }, "source": [ "Now we will extract the vectors of all the products in our vocabulary and store them in a single array for easy access." ] },
{ "cell_type": "code", "execution_count": 15, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 106664, "status": "ok", "timestamp": 1619252495416, "user": { "displayName": "sparsh agarwal", "photoUrl": "", "userId": "00322518567794762549" }, "user_tz": -330 }, "id": "odn6f3rorG7T", "outputId": "f8c06bf9-1ba4-48f9-e491-1fb2369fdadc" }, "outputs": [ { "data": { "text/plain": [ "(3153, 100)" ] }, "execution_count": 15, "metadata": { "tags": [] }, "output_type": "execute_result" } ], "source": [ "# extract all vectors (gensim 3.x API; in gensim 4.x use model.wv.vectors)\n", "X = model.wv[model.wv.vocab]\n", "\n", "X.shape" ] },
{ "cell_type": "markdown", "metadata": { "id": "VQr5YRjIrTC2" }, "source": [ "## Visualize word2vec Embeddings" ] },
{ "cell_type": "markdown", "metadata": { "id": "EwW0xbwkrVoB" }, "source": [ "It is always helpful to visualize the embeddings you have created. Here we have 100-dimensional embeddings, and we can't visualize even 4 dimensions, let alone 100. We will therefore reduce the product embeddings from 100 dimensions to 2 using UMAP, a dimensionality reduction algorithm."
] }, { "cell_type": "code", "execution_count": 16, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 109415, "status": "ok", "timestamp": 1619252498177, "user": { "displayName": "sparsh agarwal", "photoUrl": "", "userId": "00322518567794762549" }, "user_tz": -330 }, "id": "A7XmYq1Eragv", "outputId": "9dfed0d8-4a6c-4cd5-bf14-9bbe0e8248f2" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: umap-learn in /usr/local/lib/python3.7/dist-packages (0.5.1)\n", "Requirement already satisfied: numba>=0.49 in /usr/local/lib/python3.7/dist-packages (from umap-learn) (0.51.2)\n", "Requirement already satisfied: numpy>=1.17 in /usr/local/lib/python3.7/dist-packages (from umap-learn) (1.19.5)\n", "Requirement already satisfied: pynndescent>=0.5 in /usr/local/lib/python3.7/dist-packages (from umap-learn) (0.5.2)\n", "Requirement already satisfied: scipy>=1.0 in /usr/local/lib/python3.7/dist-packages (from umap-learn) (1.4.1)\n", "Requirement already satisfied: scikit-learn>=0.22 in /usr/local/lib/python3.7/dist-packages (from umap-learn) (0.22.2.post1)\n", "Requirement already satisfied: setuptools in /usr/local/lib/python3.7/dist-packages (from numba>=0.49->umap-learn) (56.0.0)\n", "Requirement already satisfied: llvmlite<0.35,>=0.34.0.dev0 in /usr/local/lib/python3.7/dist-packages (from numba>=0.49->umap-learn) (0.34.0)\n", "Requirement already satisfied: joblib>=0.11 in /usr/local/lib/python3.7/dist-packages (from pynndescent>=0.5->umap-learn) (1.0.1)\n" ] } ], "source": [ "#hide\n", "!pip install umap-learn" ] },
{ "cell_type": "code", "execution_count": 17, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 54 }, "executionInfo": { "elapsed": 150522, "status": "ok", "timestamp": 1619252539293, "user": { "displayName": "sparsh agarwal", "photoUrl": "", "userId": "00322518567794762549" }, "user_tz": -330 }, "id": "Y8wKkbRQrQAy", "outputId": "0b928341-1790-411d-fb91-a824f3c05264" }, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "" ] }, "metadata": { "needs_background": "light", "tags": [] }, "output_type": "display_data" } ], "source": [ "#collapse\n", "import umap\n", "\n", "cluster_embedding = umap.UMAP(n_neighbors=30, min_dist=0.0,\n", "                              n_components=2, random_state=42).fit_transform(X)\n", "\n", "plt.figure(figsize=(10,9))\n", "# no color array is passed, so a colormap would have no effect here\n", "plt.scatter(cluster_embedding[:, 0], cluster_embedding[:, 1], s=3);" ] },
{ "cell_type": "markdown", "metadata": { "id": "wNQ9hotsr1eC" }, "source": [ "Every dot in this plot is a product. As you can see, there are several tiny clusters of data points. These are groups of similar products." ] },
{ "cell_type": "markdown", "metadata": { "id": "5_KBjh2Jr-Qg" }, "source": [ "## Generate and validate recommendations" ] },
{ "cell_type": "markdown", "metadata": { "id": "he6azpQysC_M" }, "source": [ "We now have word2vec embeddings for every product in the model's vocabulary. The next step is to suggest similar products for a given product or product vector.\n", "\n", "Let's first create a dictionary that maps each product ID to its description."
] }, { "cell_type": "code", "execution_count": 18, "metadata": { "executionInfo": { "elapsed": 150520, "status": "ok", "timestamp": 1619252539294, "user": { "displayName": "sparsh agarwal", "photoUrl": "", "userId": "00322518567794762549" }, "user_tz": -330 }, "id": "sckVfOMHrYHp" }, "outputs": [], "source": [ "# copy the slice so that the in-place de-duplication below does not touch train_df\n", "products = train_df[[\"StockCode\", \"Description\"]].copy()\n", "\n", "# remove duplicates\n", "products.drop_duplicates(inplace=True, subset='StockCode', keep=\"last\")\n", "\n", "# create product-ID and product-description dictionary\n", "products_dict = products.groupby('StockCode')['Description'].apply(list).to_dict()" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 150514, "status": "ok", "timestamp": 1619252539296, "user": { "displayName": "sparsh agarwal", "photoUrl": "", "userId": "00322518567794762549" }, "user_tz": -330 }, "id": "Ldj0ew7UsOiw", "outputId": "e03b87e5-8f3c-466e-a85c-ec004746efdc" }, "outputs": [ { "data": { "text/plain": [ "['RED WOOLLY HOTTIE WHITE HEART.']" ] }, "execution_count": 19, "metadata": { "tags": [] }, "output_type": "execute_result" } ], "source": [ "# test the dictionary\n", "products_dict['84029E']" ] }, { "cell_type": "markdown", "metadata": { "id": "eneoWxrwsSjt" }, "source": [ "We have defined the function below. It takes a product's vector (`v`) as input and returns the top `n` most similar products (6 by default)."
] }, { "cell_type": "code", "execution_count": 20, "metadata": { "executionInfo": { "elapsed": 150514, "status": "ok", "timestamp": 1619252539298, "user": { "displayName": "sparsh agarwal", "photoUrl": "", "userId": "00322518567794762549" }, "user_tz": -330 }, "id": "TmgZr0c2sPuz" }, "outputs": [], "source": [ "#hide\n", "def similar_products(v, n = 6):\n", "    \n", "    # extract most similar products for the input vector;\n", "    # the first hit is skipped because it is the input product itself\n", "    # when v is a single product's own vector\n", "    ms = model.wv.similar_by_vector(v, topn= n+1)[1:]\n", "    \n", "    # extract name and similarity score of the similar products\n", "    new_ms = []\n", "    for j in ms:\n", "        pair = (products_dict[j[0]][0], j[1])\n", "        new_ms.append(pair)\n", "    \n", "    return new_ms" ] },
{ "cell_type": "markdown", "metadata": { "id": "H3ygmMQHsZRA" }, "source": [ "Let's try out our function by passing the vector of the product '90019A' ('SILVER M.O.P ORBIT BRACELET')." ] },
{ "cell_type": "code", "execution_count": 21, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 150506, "status": "ok", "timestamp": 1619252539299, "user": { "displayName": "sparsh agarwal", "photoUrl": "", "userId": "00322518567794762549" }, "user_tz": -330 }, "id": "pWnP50EVsWEj", "outputId": "bec2fe21-2259-4ae6-a30d-5cb4667d53d7" }, "outputs": [ { "data": { "text/plain": [ "[('SILVER M.O.P ORBIT DROP EARRINGS', 0.7879312634468079),\n", " ('AMBER DROP EARRINGS W LONG BEADS', 0.7682332992553711),\n", " ('JADE DROP EARRINGS W FILIGREE', 0.761816143989563),\n", " ('DROP DIAMANTE EARRINGS PURPLE', 0.7489826679229736),\n", " ('SILVER LARIAT BLACK STONE EARRINGS', 0.7389366626739502),\n", " ('WHITE VINT ART DECO CRYSTAL NECKLAC', 0.7352254390716553)]" ] }, "execution_count": 21, "metadata": { "tags": [] }, "output_type": "execute_result" } ], "source": [ "similar_products(model.wv['90019A'])" ] },
{ "cell_type": "markdown", "metadata": { "id": "aH-2nt_ishkM" }, "source": [ "Cool! The results are pretty relevant and match well with the input product. However, this output is based on the vector of a single product only. What if we want to recommend products to a user based on the multiple purchases they have made in the past?\n", "\n", "One simple solution is to take the average of the vectors of all the products the user has bought so far and use the resulting vector to find similar products. For that we will use the function below, which takes a list of product IDs and returns a 100-dimensional vector that is the mean of the vectors of the products in the input list. Products that are not in the model's vocabulary are skipped." ] },
{ "cell_type": "code", "execution_count": 22, "metadata": { "executionInfo": { "elapsed": 150504, "status": "ok", "timestamp": 1619252539299, "user": { "displayName": "sparsh agarwal", "photoUrl": "", "userId": "00322518567794762549" }, "user_tz": -330 }, "id": "MqtpgpqFsauG" }, "outputs": [], "source": [ "#collapse\n", "def aggregate_vectors(products):\n", "    product_vec = []\n", "    for i in products:\n", "        try:\n", "            product_vec.append(model.wv[i])\n", "        except KeyError:\n", "            # product was not seen during training; skip it\n", "            continue\n", "    \n", "    return np.mean(product_vec, axis=0)" ] },
{ "cell_type": "markdown", "metadata": { "id": "mXPqS4h0sojc" }, "source": [ "Recall that we have already created a separate list of purchase sequences for validation. Now let's make use of it."
] }, { "cell_type": "code", "execution_count": 23, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 151381, "status": "ok", "timestamp": 1619252540183, "user": { "displayName": "sparsh agarwal", "photoUrl": "", "userId": "00322518567794762549" }, "user_tz": -330 }, "id": "KfI_RrLZsn8W", "outputId": "929f549e-1621-474d-9c96-f4012a1aa5f9" }, "outputs": [ { "data": { "text/plain": [ "28" ] }, "execution_count": 23, "metadata": { "tags": [] }, "output_type": "execute_result" } ], "source": [ "#hide\n", "len(purchases_val[0])" ] }, { "cell_type": "markdown", "metadata": { "id": "nuaqWGa4ssZr" }, "source": [ "The length of the first list of products purchased by a user is 28. We will pass this product sequence from the validation set to the function `aggregate_vectors`." ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 151376, "status": "ok", "timestamp": 1619252540184, "user": { "displayName": "sparsh agarwal", "photoUrl": "", "userId": "00322518567794762549" }, "user_tz": -330 }, "id": "OdurKB3ysp6c", "outputId": "c2635b4c-584c-47a0-e436-f4175fc87dad" }, "outputs": [ { "data": { "text/plain": [ "(100,)" ] }, "execution_count": 24, "metadata": { "tags": [] }, "output_type": "execute_result" } ], "source": [ "#hide\n", "aggregate_vectors(purchases_val[0]).shape" ] }, { "cell_type": "markdown", "metadata": { "id": "gwaXOQSus9ya" }, "source": [ "The function has returned a 100-dimensional array, so it is working as expected. Now we can use this result to get the most similar products. Let's do it."
] }, { "cell_type": "code", "execution_count": 25, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 151371, "status": "ok", "timestamp": 1619252540186, "user": { "displayName": "sparsh agarwal", "photoUrl": "", "userId": "00322518567794762549" }, "user_tz": -330 }, "id": "PKTVsoc3s7ZK", "outputId": "ec1f2e74-74bc-4035-e8f6-b01b95b35f4d" }, "outputs": [ { "data": { "text/plain": [ "[('WHITE SPOT BLUE CERAMIC DRAWER KNOB', 0.6860978603363037),\n", " ('RED SPOT CERAMIC DRAWER KNOB', 0.6785424947738647),\n", " ('BLUE STRIPE CERAMIC DRAWER KNOB', 0.6783121824264526),\n", " ('BLUE SPOT CERAMIC DRAWER KNOB', 0.6738985776901245),\n", " ('CLEAR DRAWER KNOB ACRYLIC EDWARDIAN', 0.6731897592544556),\n", " ('RED STRIPE CERAMIC DRAWER KNOB', 0.6667704582214355)]" ] }, "execution_count": 25, "metadata": { "tags": [] }, "output_type": "execute_result" } ], "source": [ "similar_products(aggregate_vectors(purchases_val[0]))" ] }, { "cell_type": "markdown", "metadata": { "id": "MRsPrkMotDaZ" }, "source": [ "Our system has recommended 6 products based on the entire purchase history of a user. If you want product suggestions based on only the last few purchases, you can use the same set of functions.\n", "\n", "Below we give only the last 10 products purchased as input."
] }, { "cell_type": "code", "execution_count": 26, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 151364, "status": "ok", "timestamp": 1619252540187, "user": { "displayName": "sparsh agarwal", "photoUrl": "", "userId": "00322518567794762549" }, "user_tz": -330 }, "id": "OUqyW1-Ns_u2", "outputId": "c903bdf6-ac73-4a53-eaa6-8987a67cc8fb" }, "outputs": [ { "data": { "text/plain": [ "[('BLUE SPOT CERAMIC DRAWER KNOB', 0.7394766807556152),\n", " ('RED SPOT CERAMIC DRAWER KNOB', 0.7364704012870789),\n", " ('WHITE SPOT BLUE CERAMIC DRAWER KNOB', 0.7347637414932251),\n", " ('ASSORTED COLOUR BIRD ORNAMENT', 0.7345550060272217),\n", " ('RED STRIPE CERAMIC DRAWER KNOB', 0.7305896878242493),\n", " ('WHITE SPOT RED CERAMIC DRAWER KNOB', 0.6979628801345825)]" ] }, "execution_count": 26, "metadata": { "tags": [] }, "output_type": "execute_result" } ], "source": [ "similar_products(aggregate_vectors(purchases_val[0][-10:]))" ] }, { "cell_type": "markdown", "metadata": { "id": "59iEesnTzNHq" }, "source": [ "## References\n", "\n", "- [https://www.analyticsvidhya.com/blog/2019/07/how-to-build-recommendation-system-word2vec-python/](https://www.analyticsvidhya.com/blog/2019/07/how-to-build-recommendation-system-word2vec-python/)\n", "- [https://mccormickml.com/2018/06/15/applying-word2vec-to-recommenders-and-advertising/](https://mccormickml.com/2018/06/15/applying-word2vec-to-recommenders-and-advertising/)\n", "- [https://www.analyticsinsight.net/building-recommendation-system-using-item2vec/](https://www.analyticsinsight.net/building-recommendation-system-using-item2vec/)\n", "- [https://towardsdatascience.com/using-word2vec-for-music-recommendations-bb9649ac2484](https://towardsdatascience.com/using-word2vec-for-music-recommendations-bb9649ac2484)\n", "- [https://capablemachine.com/2020/06/23/word-embedding/](https://capablemachine.com/2020/06/23/word-embedding/)" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { 
"executionInfo": { "elapsed": 151363, "status": "ok", "timestamp": 1619252540188, "user": { "displayName": "sparsh agarwal", "photoUrl": "", "userId": "00322518567794762549" }, "user_tz": -330 }, "id": "Q--xnW3dtIZy" }, "outputs": [], "source": [] } ], "metadata": { "colab": { "collapsed_sections": [], "name": "2021-04-24-rec-medium-word2vec.ipynb", "provenance": [ { "file_id": "https://gist.github.com/sparsh-ai/ca06820d9124f03e90b33dbe98348639#file-rec-medium-word2vec-ipynb", "timestamp": 1619249955624 } ] }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" } }, "nbformat": 4, "nbformat_minor": 4 }