{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "J0v8-m3JYFQo" }, "source": [ "# Session-based recommendation using word2vec\n", "> How to build a session-based product recommender using a word2vec model on a retail dataset\n", "\n", "- toc: true\n", "- badges: true\n", "- comments: true\n", "- categories: [session, retail]\n", "- image: " ] }, { "cell_type": "markdown", "metadata": { "id": "XIY8wbf-VFyT" }, "source": [ "### Install/import libraries" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "5pz-XFyoj5is" }, "outputs": [], "source": [ "# !sudo apt-get install -y ray\n", "# !pip install ray\n", "# !pip install ray[default]\n", "# !pip install ray[tune]" ] }, { "cell_type": "code", "execution_count": 80, "metadata": { "id": "zV-QhI_1UkjA" }, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "import os\n", "import pickle\n", "from numpy.random import default_rng\n", "import collections\n", "import itertools\n", "from copy import deepcopy\n", "\n", "from gensim.models.word2vec import Word2Vec\n", "from gensim.models.callbacks import CallbackAny2Vec\n", "from ray import tune\n", "\n", "import argparse\n", "import ray\n", "import time\n", "\n", "MODEL_DIR = \"/content\"" ] }, { "cell_type": "code", "execution_count": 74, "metadata": { "id": "XDOxi0wabyQs" }, "outputs": [], "source": [ "plt.style.use(\"seaborn-white\")\n", "cldr_colors = ['#00b6b5', '#f7955b', '#6c8cc7', '#828282']\n", "cldr_green = '#a4d65d'\n", "color_palette = \"viridis\"\n", "\n", "rng = default_rng(123)\n", "\n", "ECOMM_PATH = \"/content\"\n", "ECOMM_FILENAME = \"OnlineRetail.csv\"\n", "\n", "%load_ext tensorboard" ] }, { "cell_type": "markdown", "metadata": { "id": "4UPgoIvlU9z2" }, "source": [ "### Load data" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "t5DGOggiTryP" }, "outputs": [], "source": [ "!wget -O data.zip 
https://github.com/sparsh-ai/reco-data/blob/master/onlineretail.zip?raw=true\n", "!unzip data.zip" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "id": "Hgc7vOvTV7y9" }, "outputs": [], "source": [ "def load_original_ecomm(pathname=ECOMM_PATH):\n", "    df = pd.read_csv(\n", "        os.path.join(pathname, ECOMM_FILENAME),\n", "        encoding=\"ISO-8859-1\",\n", "        parse_dates=[\"InvoiceDate\"],\n", "    )\n", "    return df" ] }, { "cell_type": "code", "execution_count": 46, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 289 }, "id": "TmsQtNH4dEwb", "outputId": "f6fb36b1-69ac-4d1a-f014-50a407304374" }, "outputs": [ { "data": { "text/html": [ "
<table>\n",
"  <thead>\n",
"    <tr><th></th><th>InvoiceNo</th><th>StockCode</th><th>Description</th><th>Quantity</th><th>InvoiceDate</th><th>UnitPrice</th><th>CustomerID</th><th>Country</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>536365</td><td>85123A</td><td>WHITE HANGING HEART T-LIGHT HOLDER</td><td>6</td><td>2010-12-01 08:26:00</td><td>2.55</td><td>17850.0</td><td>United Kingdom</td></tr>\n",
"    <tr><th>1</th><td>536365</td><td>71053</td><td>WHITE METAL LANTERN</td><td>6</td><td>2010-12-01 08:26:00</td><td>3.39</td><td>17850.0</td><td>United Kingdom</td></tr>\n",
"    <tr><th>2</th><td>536365</td><td>84406B</td><td>CREAM CUPID HEARTS COAT HANGER</td><td>8</td><td>2010-12-01 08:26:00</td><td>2.75</td><td>17850.0</td><td>United Kingdom</td></tr>\n",
"    <tr><th>3</th><td>536365</td><td>84029G</td><td>KNITTED UNION FLAG HOT WATER BOTTLE</td><td>6</td><td>2010-12-01 08:26:00</td><td>3.39</td><td>17850.0</td><td>United Kingdom</td></tr>\n",
"    <tr><th>4</th><td>536365</td><td>84029E</td><td>RED WOOLLY HOTTIE WHITE HEART.</td><td>6</td><td>2010-12-01 08:26:00</td><td>3.39</td><td>17850.0</td><td>United Kingdom</td></tr>\n",
"  </tbody>\n",
"</table>\n", "
" ], "text/plain": [ " InvoiceNo StockCode ... CustomerID Country\n", "0 536365 85123A ... 17850.0 United Kingdom\n", "1 536365 71053 ... 17850.0 United Kingdom\n", "2 536365 84406B ... 17850.0 United Kingdom\n", "3 536365 84029G ... 17850.0 United Kingdom\n", "4 536365 84029E ... 17850.0 United Kingdom\n", "\n", "[5 rows x 8 columns]" ] }, "execution_count": 46, "metadata": { "tags": [] }, "output_type": "execute_result" } ], "source": [ "df = load_original_ecomm()\n", "df.head()" ] }, { "cell_type": "code", "execution_count": 47, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "QnlEr3cngRbq", "outputId": "d50991db-b741-453a-f264-97ba5f51430f" }, "outputs": [ { "data": { "text/plain": [ "InvoiceNo 0\n", "StockCode 0\n", "Description 1454\n", "Quantity 0\n", "InvoiceDate 0\n", "UnitPrice 0\n", "CustomerID 135080\n", "Country 0\n", "dtype: int64" ] }, "execution_count": 47, "metadata": { "tags": [] }, "output_type": "execute_result" } ], "source": [ "df.isnull().sum()" ] }, { "cell_type": "markdown", "metadata": { "id": "XedYV2JkWigc" }, "source": [ "### Preprocess" ] }, { "cell_type": "markdown", "metadata": { "id": "474PEoLOgFI_" }, "source": [ "Some rows have missing values (1,454 missing descriptions and 135,080 missing customer IDs), so we filter those out. Since we want to define customer sessions, we then group by the CustomerID field and drop any customer with fewer than three purchased items." 
] }, { "cell_type": "code", "execution_count": 48, "metadata": { "id": "3KVsMP58gp5h" }, "outputs": [], "source": [ "def preprocess_ecomm(df, min_session_count=3):\n", "    # Drop rows with missing values; returns a new frame instead of mutating the caller's\n", "    df = df.dropna()\n", "    # Count purchased items (rows) per customer\n", "    item_counts = df.groupby([\"CustomerID\"]).count()[\"StockCode\"]\n", "    # Keep only customers with at least `min_session_count` purchased items\n", "    df = df[df[\"CustomerID\"].isin(item_counts[item_counts >= min_session_count].index)].reset_index(drop=True)\n", "\n", "    return df" ] }, { "cell_type": "code", "execution_count": 57, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 289 }, "id": "5138zkW_g1g0", "outputId": "18135a1a-f3df-4007-c987-c7a02bf27355" }, "outputs": [ { "data": { "text/html": [ "
<table>\n",
"  <thead>\n",
"    <tr><th></th><th>InvoiceNo</th><th>StockCode</th><th>Description</th><th>Quantity</th><th>InvoiceDate</th><th>UnitPrice</th><th>CustomerID</th><th>Country</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>536365</td><td>85123A</td><td>WHITE HANGING HEART T-LIGHT HOLDER</td><td>6</td><td>2010-12-01 08:26:00</td><td>2.55</td><td>17850.0</td><td>United Kingdom</td></tr>\n",
"    <tr><th>1</th><td>536365</td><td>71053</td><td>WHITE METAL LANTERN</td><td>6</td><td>2010-12-01 08:26:00</td><td>3.39</td><td>17850.0</td><td>United Kingdom</td></tr>\n",
"    <tr><th>2</th><td>536365</td><td>84406B</td><td>CREAM CUPID HEARTS COAT HANGER</td><td>8</td><td>2010-12-01 08:26:00</td><td>2.75</td><td>17850.0</td><td>United Kingdom</td></tr>\n",
"    <tr><th>3</th><td>536365</td><td>84029G</td><td>KNITTED UNION FLAG HOT WATER BOTTLE</td><td>6</td><td>2010-12-01 08:26:00</td><td>3.39</td><td>17850.0</td><td>United Kingdom</td></tr>\n",
"    <tr><th>4</th><td>536365</td><td>84029E</td><td>RED WOOLLY HOTTIE WHITE HEART.</td><td>6</td><td>2010-12-01 08:26:00</td><td>3.39</td><td>17850.0</td><td>United Kingdom</td></tr>\n",
"  </tbody>\n",
"</table>\n", "
" ], "text/plain": [ " InvoiceNo StockCode ... CustomerID Country\n", "0 536365 85123A ... 17850.0 United Kingdom\n", "1 536365 71053 ... 17850.0 United Kingdom\n", "2 536365 84406B ... 17850.0 United Kingdom\n", "3 536365 84029G ... 17850.0 United Kingdom\n", "4 536365 84029E ... 17850.0 United Kingdom\n", "\n", "[5 rows x 8 columns]" ] }, "execution_count": 57, "metadata": { "tags": [] }, "output_type": "execute_result" } ], "source": [ "df = preprocess_ecomm(df)\n", "df.head()" ] }, { "cell_type": "code", "execution_count": 58, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "0P1BcYYYhDqI", "outputId": "377b0962-8019-4c93-9812-932b7ff020cd" }, "outputs": [ { "data": { "text/plain": [ "4234" ] }, "execution_count": 58, "metadata": { "tags": [] }, "output_type": "execute_result" } ], "source": [ "# Number of unique customers after preprocessing\n", "df.CustomerID.nunique()" ] }, { "cell_type": "code", "execution_count": 59, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "wq63Y-D-hWKR", "outputId": "5e39d429-418b-4d08-d4f3-c529cfcf3252" }, "outputs": [ { "data": { "text/plain": [ "3684" ] }, "execution_count": 59, "metadata": { "tags": [] }, "output_type": "execute_result" } ], "source": [ "# Number of unique stock codes (products)\n", "df.StockCode.nunique()" ] }, { "cell_type": "markdown", "metadata": { "id": "lKpBJdaOiExk" }, "source": [ "### Product popularity\n", "Here we plot how often each product is purchased, that is, the number of rows (invoice lines) in which it appears, on a logarithmic scale." ] }, { "cell_type": "code", "execution_count": 60, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 441 }, "id": "TZLOGjaPiHPY", "outputId": "12ef2092-5394-41c3-b158-6b0f7b8809f0" }, "outputs": [ { "data": { "image/png": 
"\n", "text/plain": [ "
" ] }, "metadata": { "tags": [] }, "output_type": "display_data" } ], "source": [ "plt.style.use(\"seaborn-white\")\n", "\n", "# Number of purchase records per product (StockCode)\n", "product_counts = df.groupby(['StockCode']).count()['InvoiceNo'].values\n", "\n", "fig = plt.figure(figsize=(8,6))\n", "plt.yticks(fontsize=14)\n", "plt.xticks(fontsize=14)\n", "\n", "plt.semilogy(sorted(product_counts))\n", "plt.ylabel(\"Product counts\", fontsize=16);\n", "plt.xlabel(\"Product index\", fontsize=16);\n", "\n", "plt.tight_layout()" ] }, { "cell_type": "markdown", "metadata": { "id": "I4evk-MwiT-f" }, "source": [ "The left side of the figure corresponds to products that are not very popular (because they aren't purchased very often), while the far right side indicates that some products are extremely popular and have been purchased hundreds of times.\n", "\n", "### Customer session lengths\n", "We define a customer's \"session\" as all the products they purchased across their transactions, in the order in which they were purchased (ordered by InvoiceDate). We can then examine statistics regarding the length of these sessions. Below is a boxplot of all customer session lengths." ] }, { "cell_type": "code", "execution_count": 61, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 441 }, "id": "0nlaj3ieiILl", "outputId": "36a93a1c-2047-44f6-9880-164a37107b14" }, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "tags": [] }, "output_type": "display_data" } ], "source": [ "session_lengths = df.groupby(\"CustomerID\").count()['InvoiceNo'].values\n", "\n", "fig = plt.figure(figsize=(8,6))\n", "plt.xticks(fontsize=14)\n", "\n", "ax = sns.boxplot(x=session_lengths, color=cldr_colors[2])\n", "\n", "for patch in ax.artists:\n", " r, g, b, a = patch.get_facecolor()\n", " patch.set_facecolor((r, g, b, .7))\n", " \n", "plt.xlim(0,600)\n", "plt.xlabel(\"Session length (# of products purchased)\", fontsize=16);\n", "\n", "plt.tight_layout()\n", "plt.savefig(\"session_lengths.png\", transparent=True, dpi=150)" ] }, { "cell_type": "code", "execution_count": 62, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "a6aVEx4uiIIi", "outputId": "ecf53307-79cf-444e-9b90-e0e049bfee97" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Minimum session length: \t 3\n", "Maximum session length: \t 7983\n", "Mean session length: \t \t 96.03967879074162\n", "Median session length: \t \t 44.0\n", "Total number of purchases: \t 406632\n" ] } ], "source": [ "print(\"Minimum session length: \\t\", min(session_lengths))\n", "print(\"Maximum session length: \\t\", max(session_lengths))\n", "print(\"Mean session length: \\t \\t\", np.mean(session_lengths))\n", "print(\"Median session length: \\t \\t\", np.median(session_lengths))\n", "print(\"Total number of purchases: \\t\", np.sum(session_lengths))" ] }, { "cell_type": "markdown", "metadata": { "id": "pJ7qK-MCXBwx" }, "source": [ "### Sessionization" ] }, { "cell_type": "code", "execution_count": 56, "metadata": { "id": "V_tNkOldV7t6" }, "outputs": [], "source": [ "def construct_session_sequences(df, sessionID, itemID, save_filename):\n", " \"\"\"\n", " Given a dataset in pandas df format, construct a list of lists where each sublist\n", " represents the interactions relevant to a specific session, for each sessionID. 
\n", " These sublists are composed of a series of itemIDs (str) and are the core training \n", " data used in the Word2Vec algorithm. \n", " This is performed by first grouping over the SessionID column, then casting to list\n", " each group's series of values in the ItemID column. \n", " INPUTS\n", " ------------\n", " df: pandas dataframe\n", " sessionID: str column name in the df that represents individual sessions\n", " itemID: str column name in the df that represents the items within a session\n", " save_filename: str output filename \n", " \n", " Example:\n", " Given a df that looks like \n", " SessionID | ItemID \n", " ----------------------\n", " 1 | 111\n", " 1 | 123\n", " 1 | 345\n", " 2 | 045 \n", " 2 | 334\n", " 2 | 342\n", " 2 | 8970\n", " 2 | 345\n", " \n", " Return a list of lists like this: \n", " sessions = [\n", " ['111', '123', '345'],\n", " ['045', '334', '342', '8970', '345'],\n", " ]\n", " \"\"\"\n", " grp_by_session = df.groupby([sessionID])\n", "\n", " session_sequences = []\n", " for name, group in grp_by_session:\n", " session_sequences.append(list(group[itemID].values))\n", "\n", " pickle.dump(session_sequences, open(save_filename, \"wb\"))\n", " return session_sequences" ] }, { "cell_type": "code", "execution_count": 63, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 171 }, "id": "jiswHnavV7wt", "outputId": "a4078ea5-9b1c-4253-aa38-97ab6628ced3" }, "outputs": [ { "data": { "application/vnd.google.colaboratory.intrinsic+json": { "type": "string" }, "text/plain": [ "'85116 --> 22375 --> 71477 --> 22492 --> 22771 --> 22772 --> 22773 --> 22774 --> 22775 --> 22805 --> 22725 --> 22726 --> 22727 --> 22728 --> 22729 --> 22212 --> 85167B --> 21171 --> 22195 --> 84969 --> 84997C --> 84997B --> 84997D --> 22494 --> 22497 --> 85232D --> 21064 --> 21731 --> 84558A --> 20780 --> 20782 --> 84625A --> 84625C --> 85116 --> 20719 --> 22375 --> 22376 --> 20966 --> 22725 --> 22726 --> 22727 --> 22728 --> 22729 --> 22196 --> 84992 
--> 84991 --> 21976 --> 22417 --> 47559B --> 21154 --> 21041 --> 21035 --> 22423 --> 84969 --> 22134 --> 21832 --> 22422 --> 22497 --> 21731 --> 84558A --> 22376 --> 22374 --> 22371 --> 22375 --> 20665 --> 23076 --> 21791 --> 22550 --> 23177 --> 22432 --> 22774 --> 22195 --> 22196 --> 21975 --> 21041 --> 22423 --> 22699 --> 21731 --> 22492 --> 84559A --> 84559B --> 16008 --> 22821 --> 22497 --> 23084 --> 23162 --> 23171 --> 23172 --> 23170 --> 23173 --> 23174 --> 23175 --> 22371 --> 22375 --> 85178 --> 17021 --> 23146 --> 22196 --> 84558A --> 51014C --> 22727 --> 22725 --> 23308 --> 23297 --> 22375 --> 22374 --> 22376 --> 22371 --> 22372 --> 21578 --> 20719 --> 22727 --> 23146 --> 23147 --> 47559B --> 84992 --> 84991 --> 21975 --> 22423 --> 23175 --> 84558A --> 22992 --> 21791 --> 23316 --> 23480 --> 21265 --> 21636 --> 22372 --> 22375 --> 22371 --> 22374 --> 22252 --> 22945 --> 22423 --> 23173 --> 47580 --> 47567B --> 47559B --> 22698 --> 22697 --> 84558A --> 23084 --> 21731 --> 23177 --> 21791 --> 23508 --> 23506 --> 23503 --> 22992 --> 22561 --> 22492 --> 22621 --> 23146 --> 23421 --> 23422 --> 23420 --> 22699 --> 22725 --> 22728 --> 22726 --> 22727 --> 21976 --> 22417 --> 23308 --> 84991 --> 84992 --> 22196 --> 22195 --> 20719 --> 23162 --> 22131 --> 23497 --> 23552 --> 21064 --> 84625A --> 21731 --> 23084 --> 20719 --> 21265 --> 23271 --> 23506 --> 23508'" ] }, "execution_count": 63, "metadata": { "tags": [] }, "output_type": "execute_result" } ], "source": [ "filename = os.path.join(ECOMM_PATH, ECOMM_FILENAME.replace(\".csv\", \"_sessions.pkl\"))\n", "sessions = construct_session_sequences(df, \"CustomerID\", \"StockCode\", save_filename=filename)\n", "' --> '.join(sessions[0])" ] }, { "cell_type": "code", "execution_count": 75, "metadata": { "id": "7nvizM2Tl4_n" }, "outputs": [], "source": [ "def load_ecomm(filename=None):\n", " \"\"\"\n", " Checks to see if the processed Online Retail ecommerce session sequence file exists\n", " If True: loads and returns 
the session sequences\n", " If False: creates and returns the session sequences constructed from the original data file\n", " \"\"\"\n", " original_filename = os.path.join(ECOMM_PATH, ECOMM_FILENAME)\n", " # default to the processed file that sits next to the original CSV\n", " processed_filename = filename or original_filename.replace(\".csv\", \"_sessions.pkl\")\n", " if os.path.exists(processed_filename):\n", " return pickle.load(open(processed_filename,'rb'))\n", "\n", " # load_original_ecomm expects the directory, not the full file path\n", " df = load_original_ecomm(ECOMM_PATH)\n", " df = preprocess_ecomm(df)\n", " # save to the processed filename so the original CSV is not overwritten\n", " session_sequences = construct_session_sequences(df, \"CustomerID\", \"StockCode\",\n", " save_filename=processed_filename)\n", " return session_sequences" ] }, { "cell_type": "markdown", "metadata": { "id": "4IgJEMr7XHB2" }, "source": [ "### Splitting" ] }, { "cell_type": "code", "execution_count": 40, "metadata": { "id": "GpIVmG6QV7rG" }, "outputs": [], "source": [ "def train_test_split(session_sequences, test_size: int = 10000, rng=rng):\n", " \"\"\"\n", " Next Event Prediction (NEP) does not necessarily follow the traditional train/test split. \n", " Instead training is performed on the first n-1 items in a session sequence of n items. \n", " The test set is constructed of (n-1, n) \"query\" pairs where the n-1 item is used to generate \n", " recommendation predictions and it is checked whether the nth item is included in those recommendations. \n", " Example:\n", " Given a session sequence ['045', '334', '342', '8970', '128']\n", " Training is done on ['045', '334', '342', '8970']\n", " Testing (and validation) is done on ['8970', '128']\n", " \n", " Test and Validation sets are constructed to be disjoint. 
\n", " \"\"\"\n", "\n", " ## Construct training set\n", " # use the 1st through (n-1)th items from each session sequence to form the train set (drop last item)\n", " train = [sess[:-1] for sess in session_sequences]\n", "\n", " # test and validation are sampled without replacement, so together they need 2 * test_size sessions\n", " if 2 * test_size > len(train):\n", " print(\n", " f\"Test and validation sets combined cannot be larger than the train set. 
Train set contains {len(train)} sessions.\"\n", " )\n", " return\n", "\n", " ### Construct test and validation sets\n", " # sub-sample 2 * test_size sessions, and use ((n-1)th, nth) pairs of items from session_sequences to form the\n", " # disjoint validation and test sets\n", " test_validation = [sess[-2:] for sess in session_sequences]\n", " # draw indices from the seeded generator passed in as rng so the split is reproducible\n", " index = rng.choice(len(test_validation), test_size * 2, replace=False)\n", " test = np.array(test_validation)[index[:test_size]].tolist()\n", " validation = np.array(test_validation)[index[test_size:]].tolist()\n", "\n", " return train, test, validation" ] }, { "cell_type": "code", "execution_count": 42, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "dR2iVWuscfvt", "outputId": "218a61ca-53c1-4381-8af0-b00651e97fe5" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "4234\n", "4234 1000 1000\n" ] } ], "source": [ "print(len(sessions))\n", "train, test, valid = train_test_split(sessions, test_size=1000)\n", "print(len(train), len(valid), len(test))" ] }, { "cell_type": "markdown", "metadata": { "id": "nrfW72QYXL3Z" }, "source": [ "### Metrics" ] }, { "cell_type": "code", "execution_count": 64, "metadata": { "id": "9gZKiSz4fALx" }, "outputs": [], "source": [ "def recall_at_k(test, embeddings, k: int = 10) -> float:\n", " \"\"\"\n", " test must be a list of (query, ground truth) pairs\n", " embeddings must be the trained gensim word vectors (model.wv)\n", " \"\"\"\n", " ratk_score = 0\n", " for query_item, ground_truth in test:\n", " # get the k most similar items to the query item (computes cosine similarity)\n", " neighbors = embeddings.similar_by_vector(query_item, topn=k)\n", " # clean up the list\n", " recommendations = [item for item, score in neighbors]\n", " # check if ground truth is in the recommendations\n", " if ground_truth in recommendations:\n", " ratk_score += 1\n", " ratk_score /= len(test)\n", 
" return ratk_score\n", "\n", "\n", "def recall_at_k_baseline(test, comatrix, k: int = 
10) -> float:\n", " \"\"\"\n", " test must be a list of (query, ground truth) pairs\n", " comatrix must be the item -> co-occurring items mapping built by association_rules_baseline\n", " \"\"\"\n", " ratk_score = 0\n", " for query_item, ground_truth in test:\n", " # get the k items that most frequently co-occur with the query item\n", " try:\n", " co_occ = collections.Counter(comatrix[query_item])\n", " items_and_counts = co_occ.most_common(k)\n", " recommendations = [item for (item, counts) in items_and_counts]\n", " if ground_truth in recommendations: \n", " ratk_score += 1\n", " except KeyError:\n", " pass\n", " ratk_score /= len(test)\n", " return ratk_score\n", "\n", "\n", "def hitratio_at_k(test, embeddings, k: int = 10) -> float:\n", " \"\"\"\n", " Implemented EXACTLY as was done in the Hyperparameters Matter paper. \n", " In the paper this metric is described as \n", " • Hit ratio at K (HR@K). It is equal to 1 if the test item appears\n", " in the list of k predicted items and 0 otherwise [13]. \n", " \n", " But this is not what they implement: they instead divide by k. \n", " What they have actually implemented is more like Precision@k.\n", " However, Precision@k doesn't make a lot of sense in this context because\n", " there is only ONE possible correct answer in the list of generated \n", " recommendations. I don't think this is the best metric to use but \n", " I'll keep it here for posterity. 
\n", " test must be a list of (query, ground truth) pairs\n", " embeddings must be the trained gensim word vectors (model.wv)\n", " \"\"\"\n", " hratk_score = 0\n", " for query_item, ground_truth in test:\n", " # If the query item and next item are the same, prediction is automatically correct\n", " if query_item == ground_truth:\n", " hratk_score += 1 / k\n", " else:\n", " # get the k most similar items to the query item (computes cosine similarity)\n", " neighbors = embeddings.similar_by_vector(query_item, topn=k)\n", " # clean up the list\n", " recommendations = [item for item, score in neighbors]\n", " # check if ground truth is in the recommendations\n", " if ground_truth in recommendations:\n", " hratk_score += 1 / k\n", " hratk_score /= len(test)\n", " return hratk_score*1000\n", "\n", "\n", "def mrr_at_k(test, embeddings, k: int) -> float:\n", " \"\"\"\n", " Mean Reciprocal Rank. \n", " test must be a list of (query, ground truth) pairs\n", " embeddings must be the trained gensim word vectors (model.wv)\n", " \"\"\"\n", " mrratk_score = 0\n", " for query_item, ground_truth in test:\n", " # get the k most similar items to the query item (computes cosine similarity)\n", " neighbors = embeddings.similar_by_vector(query_item, topn=k)\n", " # clean up the list\n", " recommendations = [item for item, score in neighbors]\n", " # check if ground truth is in the recommendations\n", " if ground_truth in recommendations:\n", " # identify where the item is in the list\n", " rank_idx = (\n", " np.argwhere(np.array(recommendations) == ground_truth)[0][0] + 1\n", " )\n", " # score higher-ranked ground truth higher than lower-ranked ground truth\n", " mrratk_score += 1 / rank_idx\n", " mrratk_score /= len(test)\n", " return mrratk_score\n", "\n", "\n", "def mrr_at_k_baseline(test, comatrix, k: int = 10) -> float:\n", " \"\"\"\n", " Mean Reciprocal Rank. 
\n", " test must be a list of (query, ground truth) pairs\n", " comatrix must be the item -> co-occurring items mapping built by association_rules_baseline\n", " \"\"\"\n", " mrratk_score = 0\n", " for query_item, ground_truth in test:\n", " # get the k items that most frequently co-occur with the query item\n", " try:\n", " co_occ = collections.Counter(comatrix[query_item])\n", " items_and_counts = co_occ.most_common(k)\n", " recommendations = [item for (item, counts) in items_and_counts]\n", " if ground_truth in recommendations: \n", " rank_idx = (\n", " np.argwhere(np.array(recommendations) == ground_truth)[0][0] + 1\n", " )\n", " mrratk_score += 1 / rank_idx\n", " except KeyError:\n", " pass\n", " mrratk_score /= len(test)\n", " return mrratk_score" ] }, { "cell_type": "markdown", "metadata": { "id": "rxY8Dbfzshsv" }, "source": [ "### Baseline analysis" ] }, { "cell_type": "code", "execution_count": 77, "metadata": { "id": "ZPmpUqm_smzm" }, "outputs": [], "source": [ "def association_rules_baseline(train_sessions):\n", " \"\"\"\n", " Constructs a co-occurrence matrix that counts how frequently each item \n", " co-occurs with any other item in a given session. This matrix can \n", " then be used to generate a list of recommendations according to the most\n", " frequently co-occurring items for the item in question. 
\n", "\n", " These recommendations must be evaluated using the \"_baseline\" recall/mrr functions defined above\n", " \"\"\"\n", " comatrix = collections.defaultdict(list)\n", " for session in train_sessions:\n", " for (x, y) in itertools.permutations(session, 2):\n", " comatrix[x].append(y)\n", " return comatrix" ] }, { "cell_type": "code", "execution_count": 78, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "dCUFeOw4smwI", "outputId": "bb1aaf8c-b440-4e45-f69c-a6f473e193b6" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Recall@10: 0.143\n", "MRR@10: 0.06450158730158734\n" ] } ], "source": [ "# Construct a co-occurrence matrix containing how frequently \n", "# each item is found in the same session as any other item\n", "comatrix = association_rules_baseline(train)\n", "\n", "# Recommendations are generated as the top K most frequently co-occurring items\n", "# Compute metrics on these recommendations for each (query item, ground truth item)\n", "# pair in the test set\n", "recall_at_10 = recall_at_k_baseline(test, comatrix, k=10)\n", "mrr_at_10 = mrr_at_k_baseline(test, comatrix, k=10)\n", "\n", "print(\"Recall@10:\", recall_at_10)\n", "print(\"MRR@10:\", mrr_at_10)" ] }, { "cell_type": "markdown", "metadata": { "id": "sfCnGNBUXT-V" }, "source": [ "### Initializing Ray" ] }, { "cell_type": "code", "execution_count": 82, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "9uT3Woact705", "outputId": "36390597-5e7d-482b-ba7d-702bff389666" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2021-06-11 06:21:17,485\tINFO services.py:1274 -- View the Ray dashboard at http://127.0.0.1:8265\n" ] }, { "data": { "text/plain": [ "{'metrics_export_port': 42187,\n", " 'node_id': '36950d392a77ab525dbcfe5ffdb2a3629ba8446f2aed261213be1f1f',\n", " 'node_ip_address': '172.28.0.2',\n", " 'object_store_address': 
'/tmp/ray/session_2021-06-11_06-21-14_124261_62/sockets/plasma_store',\n", " 'raylet_ip_address': '172.28.0.2',\n", " 'raylet_socket_name': '/tmp/ray/session_2021-06-11_06-21-14_124261_62/sockets/raylet',\n", " 'redis_address': '172.28.0.2:6379',\n", " 'session_dir': '/tmp/ray/session_2021-06-11_06-21-14_124261_62',\n", " 'webui_url': '127.0.0.1:8265'}" ] }, "execution_count": 82, "metadata": { "tags": [] }, "output_type": "execute_result" } ], "source": [ "ray.init(num_cpus=4, ignore_reinit_error=True)" ] }, { "cell_type": "markdown", "metadata": { "id": "nEn7YBp94I3K" }, "source": [ "### Train word2vec with logging" ] }, { "cell_type": "code", "execution_count": 88, "metadata": { "id": "C_6624rRjSdW" }, "outputs": [], "source": [ "def train_w2v(train_data, params:dict, callbacks=None, model_name=None):\n", " if model_name: \n", " # Load a model for additional training. \n", " model = Word2Vec.load(model_name)\n", " else: \n", " # train model\n", " if callbacks:\n", " model = Word2Vec(callbacks=callbacks, **params)\n", " else:\n", " model = Word2Vec(**params)\n", " model.build_vocab(train_data)\n", "\n", " model.train(train_data, total_examples=model.corpus_count, epochs=model.epochs, compute_loss=True)\n", " vectors = model.wv\n", " return vectors\n", " \n", "\n", "def tune_w2v(config):\n", " ratk_logger = RecallAtKLogger(valid, k=config['k'], ray_tune=True)\n", "\n", " # remove keys from config that aren't hyperparameters of word2vec\n", " config.pop('dataset')\n", " config.pop('k')\n", " train_w2v(train, params=config, callbacks=[ratk_logger])\n", "\n", "\n", "class RecallAtKLogger(CallbackAny2Vec):\n", " '''Report Recall@K at each epoch'''\n", " def __init__(self, validation_set, k, ray_tune=False, save_model=False):\n", " self.epoch = 0\n", " self.recall_scores = []\n", " self.validation = validation_set\n", " self.k = k\n", " self.tune = ray_tune\n", " self.save = save_model\n", "\n", " def on_epoch_begin(self, model):\n", " if not self.tune:\n", " 
print(f'Epoch: {self.epoch}', end='\\t')\n", "\n", " def on_epoch_end(self, model):\n", " # method 1: deepcopy the model and reset the cached norms on the copy\n", " mod = deepcopy(model)\n", " mod.wv.norms = None # intended to force the norms to be recalculated for the current weights\n", " \n", " # Every 10 epochs, save the model \n", " if self.epoch%10 == 0 and self.save: \n", " # method 2: save and reload the model\n", " model.save(os.path.join(MODEL_DIR, f\"w2v_{self.epoch}.model\"))\n", " #mod = Word2Vec.load(f\"w2v_{self.epoch}.model\")\n", " \n", " ratk_score = recall_at_k(self.validation, mod.wv, self.k) \n", "\n", " if self.tune: \n", " tune.report(recall_at_k = ratk_score) \n", " else:\n", " self.recall_scores.append(ratk_score)\n", " print(f' Recall@{self.k}: {ratk_score}')\n", " self.epoch += 1\n", "\n", "\n", "class LossLogger(CallbackAny2Vec):\n", " '''Report training loss at each epoch'''\n", " def __init__(self):\n", " self.epoch = 0\n", " self.previous_loss = 0\n", " self.training_loss = []\n", "\n", " def on_epoch_end(self, model):\n", " # the loss reported by Word2Vec is cumulative and increases each epoch;\n", " # to get a value closer to loss per epoch, we subtract the previous cumulative loss\n", " cumulative_loss = model.get_latest_training_loss()\n", " loss = cumulative_loss - self.previous_loss\n", " self.previous_loss = cumulative_loss\n", " self.training_loss.append(loss)\n", " print(f' Loss: {loss}')\n", " self.epoch += 1" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "1TeCHqd30rcy" }, "outputs": [], "source": [ "expt_dir = '/content/big_HPO_no_distributed'" ] }, { "cell_type": "code", "execution_count": 87, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 466 }, "id": "xEJnbuOgsmLQ", "outputId": "29669c60-14ec-444c-831e-3c05cc101d3c" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Epoch: 0\t Recall@10: 0.181\n", " Loss: 731531.3125\n", "Epoch: 1\t Recall@10: 0.209\n", " 
Loss: 619873.5625\n", "Epoch: 2\t Recall@10: 0.213\n", " Loss: 
511931.75\n", "Epoch: 3\t Recall@10: 0.218\n", " Loss: 568087.625\n", "Epoch: 4\t Recall@10: 0.216\n", " Loss: 587301.25\n" ] }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "tags": [] }, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "0.235\n", "0.13457301587301593\n" ] } ], "source": [ "from ray.tune import Analysis\n", "\n", "use_saved_expt = False\n", "if use_saved_expt:\n", " analysis = Analysis(expt_dir, default_metric=\"recall_at_k\", default_mode=\"max\")\n", " w2v_params = analysis.get_best_config()\n", "else:\n", " w2v_params = {\n", " \"min_count\": 1,\n", " \"iter\": 5,\n", " \"workers\": 10,\n", " \"sg\": 1,\n", " }\n", "\n", "# Instantiate callback to measure Recall@K on the validation set after each epoch of training\n", "ratk_logger = RecallAtKLogger(valid, k=10, save_model=True)\n", "# Instantiate callback to compute Word2Vec's training loss on the training set after each epoch of training\n", "loss_logger = LossLogger()\n", "# Train Word2Vec model and retrieve trained embeddings\n", "embeddings = train_w2v(train, w2v_params, [ratk_logger, loss_logger])\n", "\n", "# Save results\n", "pickle.dump(ratk_logger.recall_scores, open(os.path.join(\"/content\", \"recall@k_per_epoch.pkl\"), \"wb\"))\n", "pickle.dump(loss_logger.training_loss, open(os.path.join(\"/content\", \"trainloss_per_epoch.pkl\"), \"wb\"))\n", "\n", "# Save trained embeddings\n", "embeddings.save(os.path.join(\"/content\", \"embeddings.wv\"))\n", "\n", "# Visualize metrics as a function of epoch (each curve is normalized by its maximum)\n", "plt.plot(np.array(ratk_logger.recall_scores)/np.max(ratk_logger.recall_scores))\n", "plt.plot(np.array(loss_logger.training_loss)/np.max(loss_logger.training_loss))\n", "plt.show()\n", "\n", "# Print results on the test set\n", "print(recall_at_k(test, embeddings, k=10))\n", "print(mrr_at_k(test, embeddings, k=10))" ] }, { "cell_type": "markdown", "metadata": { "id": "6F0OaTwC4NJK" }, "source": [ "### Tune word2vec with ray" ] }, { "cell_type": "code", "execution_count": 111, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 258 }, "id": 
"W-rvETTgsepP", "outputId": "93a8ae6c-cc79-45df-8b08-899c017f8135" }, 
"outputs": [ { "data": { "text/html": [ "== Status ==
Memory usage on this node: 2.6/12.7 GiB
Using AsyncHyperBand: num_stopped=308\n", "Bracket: Iter 40.000: None | Iter 10.000: 0.206
Resources requested: 4.0/4 CPUs, 0/0 GPUs, 0.0/7.36 GiB heap, 0.0/3.68 GiB objects
Current best trial: a0670_00440 with recall_at_k=0.244 and parameters={'size': 16, 'window': 1, 'ns_exponent': 0.7999999999999996, 'alpha': 0.1, 'negative': 19, 'iter': 10, 'min_count': 1, 'workers': 6, 'sg': 1}
Result logdir: /content/big_HPO_no_distributed
Number of trials: 512/25872 (16 PENDING, 4 RUNNING, 492 TERMINATED)

" ], "text/plain": [ "" ] }, "metadata": { "tags": [] }, "output_type": "display_data" }, { "name": "stderr", "output_type": "stream", "text": [ "2021-06-11 09:06:44,670\tERROR tune.py:545 -- Trials did not complete: [tune_w2v_a0670_00492, tune_w2v_a0670_00493, tune_w2v_a0670_00494, tune_w2v_a0670_00495, tune_w2v_a0670_00496, tune_w2v_a0670_00497, tune_w2v_a0670_00498, tune_w2v_a0670_00499, tune_w2v_a0670_00500, tune_w2v_a0670_00501, tune_w2v_a0670_00502, tune_w2v_a0670_00503, tune_w2v_a0670_00504, tune_w2v_a0670_00505, tune_w2v_a0670_00506, tune_w2v_a0670_00507, tune_w2v_a0670_00508, tune_w2v_a0670_00509, tune_w2v_a0670_00510, tune_w2v_a0670_00511]\n", "2021-06-11 09:06:44,678\tINFO tune.py:549 -- Total run time: 6750.68 seconds (6746.07 seconds for the tuning loop).\n", "2021-06-11 09:06:44,680\tWARNING tune.py:554 -- Experiment has been interrupted, but the most recent state was saved. You can continue running this experiment by passing `resume=True` to `tune.run()`\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Best hyperparameters found were: {'dataset': 'ecomm', 'k': 10, 'size': 16, 'window': 1, 'ns_exponent': 0.7999999999999996, 'alpha': 0.1, 'negative': 19, 'iter': 10, 'min_count': 1, 'workers': 6, 'sg': 1}\n" ] } ], "source": [ "from ray.tune import Analysis\n", "from ray.tune.schedulers import ASHAScheduler\n", "\n", "# Define the hyperparameter search space for Word2Vec algorithm\n", "search_space = {\n", " \"dataset\": \"ecomm\",\n", " \"k\": 10,\n", " \"size\": tune.grid_search(list(np.arange(10,106, 6))),\n", " \"window\": tune.grid_search(list(np.arange(1,22, 3))),\n", " \"ns_exponent\": tune.grid_search(list(np.arange(-1, 1.2, .2))),\n", " \"alpha\": tune.grid_search([0.001, 0.01, 0.1]),\n", " \"negative\": tune.grid_search(list(np.arange(1,22, 3))),\n", " \"iter\": 10,\n", " \"min_count\": 1,\n", " \"workers\": 6,\n", " \"sg\": 1,\n", "}\n", "\n", "use_asha = True\n", "smoke_test = False\n", "\n", "# The ASHA Scheduler will stop 
underperforming trials in a principled fashion\n", "asha_scheduler = ASHAScheduler(max_t=100, grace_period=10) if use_asha else None\n", "\n", "# Set the stopping criteria -- use the smoke-test arg to test the system \n", "stopping_criteria = {\"training_iteration\": 1 if smoke_test else 9999}\n", "\n", "# Perform hyperparameter sweep with Ray Tune\n", "analysis = tune.run(\n", " tune_w2v,\n", " name=expt_dir,\n", " local_dir=\"ray_results\",\n", " metric=\"recall_at_k\",\n", " mode=\"max\",\n", " scheduler=asha_scheduler,\n", " stop=stopping_criteria,\n", " num_samples=1,\n", " verbose=1,\n", " resources_per_trial={\n", " \"cpu\": 1,\n", " \"gpu\": 0\n", " },\n", " config=search_space,\n", ")\n", "print(\"Best hyperparameters found were: \", analysis.best_config)" ] }, { "cell_type": "markdown", "metadata": { "id": "jjdH1oSHrt2p" }, "source": [ "Ray Tune saves the results of each trial in the ray_results directory. Each time Ray Tune performs an HPO sweep, the results for that run are saved under a unique subdirectory. In this case, we named that subdirectory big_HPO_no_distributed. Ray Tune provides methods for interacting with these results, starting with the Analysis class that loads the results from each trial, including performance metrics as a function of training time and tons of metadata.\n", "\n", "These results are stored as JSON, but the Analysis class provides a nice wrapper for converting them into a pandas DataFrame." ] }, { "cell_type": "markdown", "metadata": { "id": "RexEQbOd3TNR" }, "source": [ "### Explore the results of the full hyperparameter sweep\n", "Next, we're going to look at how the Recall@10 score changes across the hyperparameter configurations that we tuned over. 
The search space above also varies the embedding size and the learning rate, but here we focus on three hyperparameters: the context window size (window), the negative sampling exponent (ns_exponent), and the number of negative samples (negative).\n", "\n", "We want to look at the Recall@10 scores for all of these configurations, but this is a 3-dimensional space and, as such, will be difficult to visualize. Instead, we'll \"collapse\" one dimension while examining the other two. To do this, we aggregate the Recall@10 scores (taking the mean) along the \"collapsed\" dimension." ] }, { "cell_type": "code", "execution_count": 112, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 949 }, "id": "O7lbr2-5sebq", "outputId": "a2704370-f4b4-464a-87ee-6b4b6f7ed347" }, "outputs": [ { "data": { "text/html": [ "
recall_at_ktime_this_iter_sdonetimesteps_totalepisodes_totaltraining_iterationexperiment_iddatetimestamptime_total_spidhostnamenode_iptime_since_restoretimesteps_since_restoreiterations_since_restoretrial_idconfig/alphaconfig/datasetconfig/iterconfig/kconfig/min_countconfig/negativeconfig/ns_exponentconfig/sgconfig/sizeconfig/windowconfig/workerslogdir
00.2375.543207FalseNaNNaN101faab4240d3746cbb4c51569db0690512021-06-11_08-49-15162340135571.839448456843b5d652723e6172.28.0.271.839448010a0670_004190.100ecomm10101190.6116.01.06big_HPO_no_distributed/tune_w2v_a0670_00419_41...
10.2034.436352FalseNaNNaN10fe00d91e28cf49beb45a0f47b900d2ae2021-06-11_07-54-42162339808275.372504222523b5d652723e6172.28.0.275.372504010a0670_001870.010ecomm10101190.6110.01.06big_HPO_no_distributed/tune_w2v_a0670_00187_18...
20.03610.416867FalseNaNNaN9965436df76d64eb1badb38eceda079982021-06-11_08-42-17162340093755.470819431083b5d652723e6172.28.0.255.47081909a0670_003930.001ecomm10101160.4116.01.06big_HPO_no_distributed/tune_w2v_a0670_00393_39...
30.2365.960799FalseNaNNaN98a3090a631b447bd9fb9ee249ef02a302021-06-11_08-52-44162340156454.629899472863b5d652723e6172.28.0.254.62989909a0670_004340.100ecomm10101130.8116.01.06big_HPO_no_distributed/tune_w2v_a0670_00434_43...
40.0411.409388TrueNaNNaN10cce93e60ce3a4ff4a0db54a8e47fa3d92021-06-11_08-06-00162339876038.842165278103b5d652723e6172.28.0.238.842165010a0670_002410.010ecomm1010110-1.0116.01.06big_HPO_no_distributed/tune_w2v_a0670_00241_24...
..........................................................................................
5030.0235.446685FalseNaNNaN39f3c574035ac47ac850519bf18e371842021-06-11_09-01-30162340209022.201976510823b5d652723e6172.28.0.222.20197603a0670_004710.001ecomm1010110-1.0122.01.06big_HPO_no_distributed/tune_w2v_a0670_00471_47...
5040.0152.176694TrueNaNNaN10e927f2c1c10e42908c4c5b26d8ba5b862021-06-11_08-17-32162339945241.176507330623b5d652723e6172.28.0.241.176507010a0670_002940.001ecomm101011-0.4116.01.06big_HPO_no_distributed/tune_w2v_a0670_00294_29...
5050.2111.697303FalseNaNNaN9de953f60dd084976b0a01b62918adea52021-06-11_08-33-35162340041519.104989397823b5d652723e6172.28.0.219.10498909a0670_003590.100ecomm1010110.2116.01.06big_HPO_no_distributed/tune_w2v_a0670_00359_35...
5060.0241.774843FalseNaNNaN4d60beb2097b64164acbcb80fa85800192021-06-11_07-14-48162339568813.84004034543b5d652723e6172.28.0.213.84004004a0670_000030.001ecomm101014-1.0110.01.06big_HPO_no_distributed/tune_w2v_a0670_00003_3_...
5070.1982.985146FalseNaNNaN83ad21430e1004d108a4683096833aeb22021-06-11_07-46-29162339758925.868437189243b5d652723e6172.28.0.225.86843708a0670_001540.010ecomm1010170.4110.01.06big_HPO_no_distributed/tune_w2v_a0670_00154_15...
\n", "

508 rows × 29 columns

\n", "
" ], "text/plain": [ " recall_at_k ... logdir\n", "0 0.237 ... big_HPO_no_distributed/tune_w2v_a0670_00419_41...\n", "1 0.203 ... big_HPO_no_distributed/tune_w2v_a0670_00187_18...\n", "2 0.036 ... big_HPO_no_distributed/tune_w2v_a0670_00393_39...\n", "3 0.236 ... big_HPO_no_distributed/tune_w2v_a0670_00434_43...\n", "4 0.041 ... big_HPO_no_distributed/tune_w2v_a0670_00241_24...\n", ".. ... ... ...\n", "503 0.023 ... big_HPO_no_distributed/tune_w2v_a0670_00471_47...\n", "504 0.015 ... big_HPO_no_distributed/tune_w2v_a0670_00294_29...\n", "505 0.211 ... big_HPO_no_distributed/tune_w2v_a0670_00359_35...\n", "506 0.024 ... big_HPO_no_distributed/tune_w2v_a0670_00003_3_...\n", "507 0.198 ... big_HPO_no_distributed/tune_w2v_a0670_00154_15...\n", "\n", "[508 rows x 29 columns]" ] }, "execution_count": 112, "metadata": { "tags": [] }, "output_type": "execute_result" } ], "source": [ "analysis = Analysis(\"big_HPO_no_distributed/\", \n", "                    default_metric=\"recall_at_k\",\n", "                    default_mode=\"max\")\n", "\n", "results = analysis.dataframe()\n", "results" ] }, { "cell_type": "markdown", "metadata": { "id": "FHKEwe9u3LDv" }, "source": [ "The Analysis object also has methods to quickly retrieve the best configuration found during the HPO sweep.\n", "\n" ] }, { "cell_type": "code", "execution_count": 113, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "WnKAoaER2vkp", "outputId": "1f50520a-3267-4d0c-fb99-d7a566fbc95b" }, "outputs": [ { "data": { "text/plain": [ "{'alpha': 0.1,\n", " 'dataset': 'ecomm',\n", " 'iter': 10,\n", " 'k': 10,\n", " 'min_count': 1,\n", " 'negative': 19,\n", " 'ns_exponent': 0.7999999999999996,\n", " 'sg': 1,\n", " 'size': 16,\n", " 'window': 1,\n", " 'workers': 6}" ] }, "execution_count": 113, "metadata": { "tags": [] }, "output_type": "execute_result" } ], "source": [ "best_config = analysis.get_best_config()\n", "best_config" ] }, { "cell_type": "markdown", "metadata": { "id": "tGk8zNkp3Pkr" }, "source": [ "While the results 
dataframe contains the final Recall@10 scores for each of the 508 trials, it's also nice to explore how those scores evolved as a function of training for any given trial. Again, the Analysis class delivers, providing the ability to access the full training results for any of the trials. Below we plot the Recall@10 score as a function of training epochs for the best configuration." ] }, { "cell_type": "code", "execution_count": 114, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 276 }, "id": "y8QF0YZY3GXZ", "outputId": "01f09f26-14ee-4003-d621-7135cf1f0861" }, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "tags": [] }, "output_type": "display_data" } ], "source": [ "best_path = analysis.get_best_logdir()\n", "dfs = analysis.fetch_trial_dataframes()\n", "\n", "plt.plot(dfs[best_path]['recall_at_k']);\n", "plt.xlabel(\"Epoch\")\n", "plt.ylabel(\"Recall@10\");" ] }, { "cell_type": "code", "execution_count": 116, "metadata": { "id": "z6sCIiAK3Rmt" }, "outputs": [], "source": [ "def aggregate_z(x_name, y_name):\n", "    \"\"\"Return x values, y values, and mean recall_at_k for each (x, y) hyperparameter pair.\"\"\"\n", "    grouped = results.groupby([f\"config/{x_name}\", f\"config/{y_name}\"])\n", "    x_values = []\n", "    y_values = []\n", "    mean_recall_values = []\n", "    \n", "    for name, grp in grouped:\n", "        x_values.append(name[0])\n", "        y_values.append(name[1])\n", "        mean_recall_values.append(grp['recall_at_k'].mean())\n", "    return x_values, y_values, mean_recall_values" ] }, { "cell_type": "code", "execution_count": 117, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 327 }, "id": "6MDZY3uQ3xDA", "outputId": "11d3cdd2-949c-4bc7-ee82-24ee63c5ba46" }, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "tags": [] }, "output_type": "display_data" } ], "source": [ "fig = plt.figure(figsize=(15,5))\n", "\n", "# Panel 1: negative sampling exponent (x) vs. number of negative samples (y)\n", "ax = fig.add_subplot(131)\n", "negative, ns_exp, recall = aggregate_z(\"negative\", \"ns_exponent\")\n", "cm = sns.scatterplot(x=ns_exp, y=negative, hue=recall, palette=color_palette, legend=None)\n", "ax.set_xlabel(\"Negative sampling exponent\", fontsize=16)\n", "ax.set_ylabel(\"Number of negative samples\", fontsize=16)\n", "plt.xticks(fontsize=12)\n", "plt.yticks(fontsize=12)\n", "ax.plot(0.75, 5, \n", "        marker='*', \n", "        color=cldr_colors[1],\n", "        markersize=10)\n", "ax.plot(best_config['ns_exponent'], \n", "        best_config['negative'], \n", "        marker=\"o\", \n", "        fillstyle='none', \n", "        color=cldr_colors[0],\n", "        markersize=15)\n", "\n", "# Panel 2: negative sampling exponent (x) vs. context window size (y)\n", "ax = fig.add_subplot(132)\n", "\n", "window, ns_exp, recall = aggregate_z(\"window\", \"ns_exponent\")\n", "cm = sns.scatterplot(x=ns_exp, y=window, hue=recall, palette=color_palette, legend=None)\n", "ax.set_xlabel(\"Negative sampling exponent\", fontsize=16)\n", "ax.set_ylabel(\"Context window size\", fontsize=16)\n", "plt.xticks(fontsize=12)\n", "plt.yticks(fontsize=12)\n", "ax.plot(0.75, 5, \n", "        marker='*', \n", "        color=cldr_colors[1],\n", "        markersize=10)\n", "ax.plot(best_config['ns_exponent'], \n", "        best_config['window'], \n", "        marker=\"o\", \n", "        fillstyle='none', \n", "        color=cldr_colors[0],\n", "        markersize=15)\n", "\n", "# Panel 3: context window size (x) vs. number of negative samples (y)\n", "ax = fig.add_subplot(133)\n", "window, negative, recall = aggregate_z(\"window\", \"negative\")\n", "cm = sns.scatterplot(x=window, y=negative, hue=recall, palette=color_palette, legend=None)\n", "ax.set_xlabel(\"Context window size\", fontsize=16)\n", "ax.set_ylabel(\"Number of negative samples\", fontsize=16)\n", "plt.xticks(fontsize=12)\n", "plt.yticks(fontsize=12)\n", "ax.plot(5, 5, \n", "        marker='*',\n", "        color=cldr_colors[1],\n", "        markersize=10)\n", "ax.plot(best_config['window'], \n", "        best_config['negative'], \n", "        marker=\"o\", \n", "        fillstyle='none', \n", " 
color=cldr_colors[0],\n", " markersize=15);\n", "\n", "plt.tight_layout()\n", "plt.savefig(\"hpsweep_results.png\", transparent=True, dpi=150)" ] }, { "cell_type": "markdown", "metadata": { "id": "-5KwIKIl36gA" }, "source": [ "And there we have it! Each panel shows the Recall@10 scores (where yellow is a high score and purple is a low score) associated with a unique configuration of hyperparameters. The best hyperparameter values for the Online Retail Data Set are denoted by the light blue circle. Word2vec’s default values are shown by the orange star. In all cases, the orange star is nowhere near the light blue circle, indicating that the default values are not optimal for this dataset." ] } ], "metadata": { "colab": { "authorship_tag": "ABX9TyNmIBT1S3jeboIbbIKzkpye", "collapsed_sections": [], "name": "2021-06-11-recostep-session-based-recommender-using-word2vec.ipynb", "provenance": [], "toc_visible": true }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" } }, "nbformat": 4, "nbformat_minor": 4 }