{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"name": "Untitled",
"provenance": [],
"collapsed_sections": [],
"toc_visible": true,
"authorship_tag": "ABX9TyNH5Y3J3nEfsbpgiRYvSzX8"
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
},
"language_info": {
"name": "python"
}
},
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "VGKjc2Bda-Qn"
},
"source": [
"# Session-based travel recommendations\n",
"> Experimenting with 8 ML models on a simple Trivago-inspired session-based travel dataset\n",
"\n",
"- toc: true\n",
"- badges: true\n",
"- comments: true\n",
"- categories: [Logistic, LightGBM, KNN, Session, Sequence, Trivago, Travel]\n",
"- image:"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "FN5eEPeRlt4r"
},
"source": [
"## Problem statement"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "ENvgsFF9l7UV"
},
"source": [
"### Context"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "RlHe9UDwl9G3"
},
"source": [
"Recommending hotels and other travel-related items is still a difficult task, as travel and tourism is a very complex domain. Planning a trip usually involves searching for a set or package of products that are interconnected (e.g., means of transportation, lodging, attractions), with rather limited availability, and where contextual aspects may have a major impact (e.g., time, location, social context). Users book hotels far less often than they, for example, listen to music tracks, and, given the financial commitment of booking a stay at a hotel, users usually exhibit strong price sensitivity and a greater need to be convinced by any given offer. Besides, travelers are often emotionally connected to the products and the experience they provide, so decision making is not based on rational and objective criteria alone. As such, providing the right information to visitors of a travel site, such as a hotel booking service, at the right time is challenging. Information about items such as hotels is often available as item metadata. In this domain, however, information about users and their goals and preferences is usually harder to obtain. Systems need to analyze session-based data of anonymous or first-time users to adapt the search results and anticipate the hotels the users may be interested in.\n",
"\n",
"Trivago is a global hotel search platform focused on reshaping the way travelers search for and compare hotels, while enabling hotel advertisers to grow their businesses by providing access to a broad audience of travelers via their websites and apps. Trivago provides aggregated information about the characteristics of each accommodation to help travelers make an informed decision and find their ideal place to stay. Once a choice is made, users are redirected to the selected booking site to complete the booking. Trivago has established 55 localized platforms in over 190 countries and provides access to over two million hotels, including alternative accommodations, with prices and availability from more than 400 booking sites and hotel chains.\n",
"\n",
"Users can narrow down their search by selecting filters and specifying the desired characteristics of their preferred accommodation, and they can interact with the different offers presented to them and consume the aggregated information for each listing along the way. It is in the interest of all participants (traveler, advertising booking site, and trivago) to suggest suitable accommodations that fit the needs of the traveler.\n",
"\n",
"> *We partnered with researchers from TU Wien, Politecnico di Milano, and Karlsruhe Institute of Technology to launch the RecSys Challenge 2019, the annual data science challenge of the ACM Recommender Systems conference. In this challenge, we invite participants to dig deep into our data and come up with creative ideas to detect the intent of our users and build a click-prediction model that can be used to update the recommendation of accommodations. To this end, we have released a data set of user interactions on our website.*"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "JKH-ASs5mFdI"
},
"source": [
"### Challenges\n",
"\n",
"Trivago faces a few challenges when it comes to recommending the best options for its visitors. It is therefore important to make effective use of the explicit and implicit user signals within a session (clicks, search refinement, filter usage) to detect the users' intent as quickly as possible and to update the recommendations, tailoring the result list to these needs.\n",
"\n",
"Due to the nature of our domain, we face specific challenges that make it difficult to build predictive models and recommendation systems that are tailored to the needs of our visitors. Here are a few examples of the problems that trivago data scientists have to address:\n",
"\n",
"- Users search for accommodations comparatively infrequently with sometimes long time intervals between their trips. Furthermore, user intent and preferences change over time and depend on the purpose of the trip (e.g. a business traveler who books accommodation for a weekend trip with her family).\n",
"- Booking accommodation is an expensive transaction. Visitors are price sensitive and careful when they make a decision. As the availability of the accommodations, the search criteria, and the actual pricing of the deals from the advertisers vary over time, the context of each search has to be taken into consideration.\n",
"- Information about the personal preferences of travelers is sparse. The service provided to our users is free of charge: users do not have to provide personal data or create an account in order to use the website.\n",
"\n",
"### Goal\n",
"\n",
"The task of the challenge is to use all the information about the behavioral, time-dependent patterns of the users and the content of the displayed accommodations to develop models that predict which accommodations a user is most likely to click on when presented with a list of potential options."
]
},
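{
"cell_type": "markdown",
"metadata": {},
"source": [
"Submissions to the challenge were ranked by the mean reciprocal rank (MRR) of the clicked accommodation's position in the submitted list. A minimal sketch of that metric (function names here are ours, not from the challenge code):\n",
"\n",
"```python\n",
"def reciprocal_rank(recommended, clicked):\n",
"    # 1/rank of the clicked item (1-based); 0 if it was not recommended\n",
"    return 1.0 / (recommended.index(clicked) + 1) if clicked in recommended else 0.0\n",
"\n",
"def mean_reciprocal_rank(submissions):\n",
"    # submissions: list of (recommended_item_list, clicked_item) pairs\n",
"    return sum(reciprocal_rank(r, c) for r, c in submissions) / len(submissions)\n",
"```"
]
},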
{
"cell_type": "markdown",
"metadata": {
"id": "rPJD7CrAmJC7"
},
"source": [
""
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "fp_zJZsWl_P5"
},
"source": [
"### Illustration"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "RDta8BgvlwWs"
},
"source": [
""
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "SohhKa_Jl0xO"
},
"source": [
"Schematic illustration of the dataset and how it is split between the train and test sets. Each icon in the schematic represents a different type of website interaction, such as clicking on item content (image, info, rating, deals), refining the search parameters via filtering, or triggering new searches for accommodations (items) or destinations. Gaps between consecutive interactions indicate the start of a new user session. The train set contains sessions before November 7, 2018, while the test set contains sessions after that date. The item_id of the final clickout (shown as the box with the question marks) has been withheld. Note that the question mark refers only to the accommodation identifier that needs to be predicted, not the action type, and that every event for which a prediction needs to be made is a clickout. For the evaluation of the leaderboard, the test set has been split into a confirmation set and a validation set on a per-user basis."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "0ZVdi9l5mRMV"
},
"source": [
"## Data description"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "lX3-ZRc5mTdt"
},
"source": [
"Trivago provided two sets of data: the user-item interactions and the item metadata. The interaction dataset consists of the sequential website interactions of users visiting the trivago website between November 1, 2018, and November 8, 2018. Each website interaction corresponds to a specific timestamp in the dataset, and multiple website interactions can appear in a user session. A session is defined as all interactions of a user on a specific trivago country platform with no gap of more than 60 minutes between consecutive interaction timestamps. If a user stops interacting with the website and returns after a couple of minutes to continue the search, the continued interactions are still counted as belonging to the same session. Because of this grouping of website interactions into sessions, the interactions are in the following referred to as session actions. For each session action, data about the context of the interaction is provided, e.g., the country platform on which the interaction took place or the list of items that were shown at the moment of a click and the prices of the accommodations."
]
},
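{
"cell_type": "markdown",
"metadata": {},
"source": [
"The 60-minute rule can be sketched with pandas: a new session starts whenever the gap to the user's previous interaction exceeds 3600 seconds (the toy data below is illustrative):\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"df = pd.DataFrame({\n",
"    'user_id': ['u1', 'u1', 'u1', 'u2'],\n",
"    'timestamp': [0, 120, 7300, 50]  # seconds; 7300 - 120 > 3600, so a new session\n",
"})\n",
"df = df.sort_values(['user_id', 'timestamp'])\n",
"# Gap to the previous interaction of the same user (NaN for the first one)\n",
"gap = df.groupby('user_id')['timestamp'].diff()\n",
"df['session_number'] = (gap > 3600).groupby(df['user_id']).cumsum() + 1\n",
"```"
]
},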
{
"cell_type": "markdown",
"metadata": {
"id": "vxHgnnLBmgO8"
},
"source": [
"### Description of Action Types and Reference Values for All Possible Session Interactions"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "RZOJX7pkmjie"
},
"source": [
""
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Sqp40_X7mmNf"
},
"source": [
"Metadata for each of the accommodations is provided in a separate file."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "lP1ApcRgmrzf"
},
"source": [
"### General Statistics of the RecSys Challenge 2019 Dataset"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "CcIq5gzamt38"
},
"source": [
""
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "qRvA-jW9m1uB"
},
"source": [
"### Session Actions Files"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "KRNvto-1m4AQ"
},
"source": [
"Each row in the session action files (train.csv and test.csv) corresponds to a particular user action in a given session. The schema of these files is shown in the table below. The split between the train and test sets was done at a particular split date: sessions that occurred before November 7, 2018, were put into the train set, while those that occurred after were put into the test set. The prediction targets are the items clicked out at the end of the sessions in the test set."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "xVxRCkxqm51p"
},
"source": [
""
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "bMAQ3a_3m8TP"
},
"source": [
"In addition to the user actions, the train.csv and test.csv files also contain information about the accommodations that were displayed to the user at the time a clickout was made. A displayed accommodation is referred to as being “impressed”, and all displayed accommodations are stored in the “impressions” column. Each entry in that column is a pipe-separated list of accommodations (items) in the order in which they were displayed on the website. If the user action was not a clickout, the impressions column is left empty."
]
},
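{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since impressions and prices are stored as pipe-separated strings, a clickout row can be expanded into one row per displayed accommodation, e.g. with pandas' multi-column explode (pandas 1.3+; the row below is illustrative):\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"row = pd.DataFrame({\n",
"    'session_id': ['s1'],\n",
"    'impressions': ['5014|5002|5010'],\n",
"    'prices': ['100|125|120']\n",
"})\n",
"# Split both pipe-separated columns into lists, then explode them together\n",
"expanded = row.assign(\n",
"    impressions=row['impressions'].str.split('|'),\n",
"    prices=row['prices'].str.split('|')\n",
").explode(['impressions', 'prices'])\n",
"expanded['position'] = range(1, len(expanded) + 1)\n",
"```"
]
},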
{
"cell_type": "markdown",
"metadata": {
"id": "l2Tn8P3LWLII"
},
"source": [
"## Setup"
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "69ynq4AbT8sc",
"outputId": "a86ad68c-12e8-447b-b5bf-042a9eee874e"
},
"source": [
"!pip install -q git+https://github.com/sparsh-ai/recochef.git"
],
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"text": [
" Installing build dependencies ... \u001b[?25l\u001b[?25hdone\n",
" Getting requirements to build wheel ... \u001b[?25l\u001b[?25hdone\n",
" Preparing wheel metadata ... \u001b[?25l\u001b[?25hdone\n",
" Building wheel for recochef (PEP 517) ... \u001b[?25l\u001b[?25hdone\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "SNLFGrR2jDPK"
},
"source": [
"%load_ext autoreload\n",
"%autoreload 2"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "yfD8l7n3VqUS"
},
"source": [
"import sys\n",
"import time\n",
"import math\n",
"import random\n",
"import numpy as np\n",
"import pandas as pd\n",
"from scipy import sparse\n",
"\n",
"from sklearn.linear_model import LogisticRegression\n",
"import lightgbm as lgb\n",
"\n",
"from recochef.datasets.trivago import Trivago\n",
"from recochef.datasets.synthetic import Session"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "u0CbvZnWWMo4"
},
"source": [
"## Data loading"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "4NQRh7CSivjy"
},
"source": [
"### Full load"
]
},
{
"cell_type": "code",
"metadata": {
"id": "tdC6SWFyWHg3"
},
"source": [
"trivago = Trivago()"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 309
},
"id": "gMSx11nvV0ua",
"outputId": "e369a779-3a89-482e-a3c9-12e4ed6af22d"
},
"source": [
"df_train = trivago.load_train()\n",
"df_train.head()"
],
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" USERID SESSIONID TIMESTAMP ... FILTERS IMPRESSIONS PRICES\n",
"0 00RL8Z82B2Z1 aff3928535f48 1541037460 ... None None None\n",
"1 00RL8Z82B2Z1 aff3928535f48 1541037522 ... None None None\n",
"2 00RL8Z82B2Z1 aff3928535f48 1541037522 ... None None None\n",
"3 00RL8Z82B2Z1 aff3928535f48 1541037532 ... None None None\n",
"4 00RL8Z82B2Z1 aff3928535f48 1541037532 ... None None None\n",
"\n",
"[5 rows x 12 columns]"
]
},
"metadata": {
"tags": []
},
"execution_count": 17
}
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 394
},
"id": "R6swP2ZlWOgO",
"outputId": "c09e3785-23ad-4add-e8b6-efb6bde76c5b"
},
"source": [
"df_test = trivago.load_test()\n",
"df_test.head()"
],
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" USERID ... PRICES\n",
"0 004A07DM0IDW ... None\n",
"1 004A07DM0IDW ... None\n",
"2 004A07DM0IDW ... 70|46|48|76|65|65|106|66|87|43|52|44|60|61|50|...\n",
"3 004A07DM0IDW ... 70|46|48|76|65|65|106|66|87|43|52|44|60|61|50|...\n",
"4 004A07DM0IDW ... 70|46|48|76|65|65|106|66|87|43|52|44|60|61|50|...\n",
"\n",
"[5 rows x 12 columns]"
]
},
"metadata": {
"tags": []
},
"execution_count": 18
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "RmdHIfMuixzw"
},
"source": [
"### Sample load"
]
},
{
"cell_type": "code",
"metadata": {
"id": "rS9gGutmizNT"
},
"source": [
"sample_session_data = Session(version='trivago')"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 359
},
"id": "EbnKHqJzizHq",
"outputId": "ce2c24c6-f754-472a-8420-0e3dda57afc1"
},
"source": [
"df_train = sample_session_data.train()\n",
"df_train"
],
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" user_id session_id ... impressions prices\n",
"0 64BL89 3579f89 ... NaN NaN\n",
"1 64BL89 3579f89 ... 5014|5002|5010 100|125|120\n",
"2 64BL89 3579f89 ... NaN NaN\n",
"3 64BL89 3579f89 ... NaN NaN\n",
"4 64BLF 4504h9 ... NaN NaN\n",
"5 64BLF 4504h9 ... 5001|5023|5040|5005 75|110|65|210\n",
"6 64BL89 5504hFL ... NaN NaN\n",
"7 64BL89 5504hFL ... 5010|5001|5023|5004|5002|5008 120|89|140|126|86|110\n",
"8 64BL89 5504hFL ... NaN NaN\n",
"9 64BL89 5504hFL ... 5010|5001|5023|5004|5002|5008 120|89|140|126|86|110\n",
"\n",
"[10 rows x 8 columns]"
]
},
"metadata": {
"tags": []
},
"execution_count": 8
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "JWSfYEIbnapv"
},
"source": [
"1. User ```64BL89``` viewed the image of item 5001.\n",
"2. User ```64BL89``` clicked on item 5002. Three items (5014, 5002, 5010) were shown, and the user clicked on the 2nd one, 5002, which was also the costliest of the three.\n",
"3. User ```64BL89``` viewed more info about item 5003.\n",
"4. User ```64BL89``` selected an unknown filter.\n",
"5. User ```64BLF``` viewed the image of item 5010.\n",
"6. User ```64BLF``` clicked item 5001, which was at the top of the recommended list.\n",
"7. User ```64BL89``` came back after some time, starting a new session, and began by applying an unknown filter.\n",
"8. User ```64BL89``` clicked on item 5004, which was at the 4th position.\n",
"9. User ```64BL89``` again viewed the image of item 5001.\n",
"10. User ```64BL89``` then clicked item 5001, which was at the 2nd position.\n"
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 204
},
"id": "lv7z9wLkizEL",
"outputId": "b1275cf8-0dc5-4a55-dd10-df1d6728503e"
},
"source": [
"df_test = sample_session_data.test()\n",
"df_test"
],
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" user_id session_id ... impressions prices\n",
"0 64BL89 3579f90 ... NaN NaN\n",
"1 64BL89 3579f90 ... 5002|5003|5010|5004|5001|5023 120|75|110|105|89|99\n",
"2 64BL91F2 3779f92 ... NaN NaN\n",
"3 64BL91F2 3779f92 ... 5001|5004|5010|5014 76|102|115|124\n",
"4 64BL91F2 3779f92 ... NaN NaN\n",
"\n",
"[5 rows x 8 columns]"
]
},
"metadata": {
"tags": []
},
"execution_count": 9
}
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 173
},
"id": "e0F3Fc57kxHs",
"outputId": "3c1dba2a-9862-4bc7-9e9b-5b0efe17f1ca"
},
"source": [
"df_items = sample_session_data.items()\n",
"df_items"
],
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" item_id properties\n",
"0 5001 Wifi|Croissant|TV\n",
"1 5002 Wifi|TV\n",
"2 5003 Croissant\n",
"3 5004 Shoe dryer"
]
},
"metadata": {
"tags": []
},
"execution_count": 10
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "5FlKnZO5qVA6"
},
"source": [
"## Utilities"
]
},
{
"cell_type": "code",
"metadata": {
"id": "g8BEGNFMqOZB"
},
"source": [
"def explode(df, col_expl):\n",
" \"\"\"Separate string in column col_expl and explode elements into multiple rows.\"\"\"\n",
"\n",
" s = df[col_expl].str.split('|', expand=True).stack()\n",
" i = s.index.get_level_values(0)\n",
" df2 = df.loc[i].copy()\n",
" df2[col_expl] = s.values\n",
"\n",
" return df2\n",
"\n",
"\n",
"def explode_mult(df_in, col_list):\n",
" \"\"\"Explode each column in col_list into multiple rows.\"\"\"\n",
"\n",
" df = df_in.copy()\n",
"\n",
" for col in col_list:\n",
" df.loc[:, col] = df.loc[:, col].str.split(\"|\")\n",
"\n",
" df_out = pd.DataFrame(\n",
" {col: np.repeat(df[col].to_numpy(),\n",
" df[col_list[0]].str.len())\n",
" for col in df.columns.drop(col_list)}\n",
" )\n",
"\n",
" for col in col_list:\n",
" df_out.loc[:, col] = np.concatenate(df.loc[:, col].to_numpy())\n",
"\n",
" return df_out\n",
"\n",
"\n",
"def group_concat(df, gr_cols, col_concat):\n",
" \"\"\"Concatenate multiple rows into one.\"\"\"\n",
"\n",
" df_out = (\n",
" df\n",
" .groupby(gr_cols)[col_concat]\n",
" .apply(lambda x: ' '.join(x))\n",
" .to_frame()\n",
" .reset_index()\n",
" )\n",
"\n",
" return df_out\n",
"\n",
"\n",
"def get_target_rows(df):\n",
" \"\"\"Restrict data frame to rows for which a prediction needs to be made.\"\"\"\n",
" \n",
" df_target = df[\n",
" (df.action_type == \"clickout item\") & \n",
" (df[\"reference\"].isna())\n",
" ]\n",
"\n",
" return df_target\n",
"\n",
"\n",
"def summarize_recs(df, rec_col):\n",
" \"\"\"Bring the data frame into submission format.\"\"\"\n",
"\n",
" df_rec = (\n",
" df\n",
" .sort_values(by=[\"user_id\", \"session_id\", \"timestamp\", \"step\", rec_col],\n",
" ascending=[True, True, True, True, False])\n",
" .groupby([\"user_id\", \"session_id\", \"timestamp\", \"step\"])[\"impressed_item\"]\n",
" .apply(lambda x: ' '.join(x))\n",
" .to_frame()\n",
" .reset_index()\n",
" .rename(columns={'impressed_item': 'item_recommendations'})\n",
" )\n",
"\n",
" return df_rec"
],
"execution_count": null,
"outputs": []
},
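{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a sanity check, ```explode``` turns one row holding a pipe-separated list into one row per element while duplicating the other columns; its core pattern in isolation (toy values):\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"toy = pd.DataFrame({'session_id': ['s1', 's2'],\n",
"                    'impressions': ['5014|5002', '5001']})\n",
"# Same pattern as explode(): split, stack, and realign on the original index\n",
"s = toy['impressions'].str.split('|', expand=True).stack()\n",
"exploded = toy.loc[s.index.get_level_values(0)].copy()\n",
"exploded['impressions'] = s.values\n",
"```"
]
},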
{
"cell_type": "code",
"metadata": {
"id": "38H0VRY0qOQi"
},
"source": [
"def print_time(s):\n",
" \"\"\"Print string s and current time.\"\"\"\n",
"\n",
" t = time.localtime()\n",
" current_time = time.strftime(\"%H:%M:%S\", t)\n",
" print(f\"{current_time} | {s}\")\n",
"\n",
"\n",
"def print_header(s):\n",
" \"\"\"Print a nice header for string s.\"\"\"\n",
"\n",
" print()\n",
" print(f\"##{'#'*len(s)}##\")\n",
" print(f\"# {s} #\")\n",
" print(f\"##{'#'*len(s)}##\")\n",
" print()\n",
"\n",
"\n",
"def validate_model_name(model_name):\n",
" \"\"\"Check if the inserted model name is valid.\"\"\"\n",
"\n",
" model_names = [\n",
" 'gbm_rank', 'logistic_regression',\n",
" 'nn_interaction', 'nn_item',\n",
" 'pop_abs', 'pop_user', \n",
" 'position', 'random'\n",
" ]\n",
"\n",
" try:\n",
" if model_name not in model_names: raise NameError\n",
" except NameError:\n",
" print(\"No such model. Please choose a valid one.\")\n",
" sys.exit(1)"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "3nwHa4Y2wTsT"
},
"source": [
"## Feature engineering"
]
},
{
"cell_type": "code",
"metadata": {
"id": "lhP4qzy2qOUs"
},
"source": [
"def build_features(df):\n",
" \"\"\"Build features for the lightGBM and logistic regression model.\"\"\"\n",
"\n",
" # Select columns that are of interest for this method\n",
" print_time(\"start\")\n",
" cols = ['user_id', 'session_id', 'timestamp', 'step',\n",
" 'action_type', 'reference', 'impressions', 'prices']\n",
" df_cols = df.loc[:, cols] \n",
"\n",
" # We are only interested in action types for which the reference is an item ID\n",
" print_time(\"filter interactions\")\n",
" item_interactions = [\n",
" 'clickout item', 'interaction item deals', 'interaction item image',\n",
" 'interaction item info', 'interaction item rating', 'search for item'\n",
" ]\n",
" df_actions = (\n",
" df_cols\n",
" .loc[df_cols.action_type.isin(item_interactions), :]\n",
" .copy()\n",
" .rename(columns={'reference': 'referenced_item'})\n",
" )\n",
"\n",
" print_time(\"cleaning\")\n",
" # Clean of instances that have no reference\n",
" idx_rm = (df_actions.action_type != \"clickout item\") & (df_actions.referenced_item.isna())\n",
" df_actions = df_actions[~idx_rm]\n",
"\n",
" # Get item ID of previous interaction of a user in a session\n",
" print_time(\"previous interactions\")\n",
" df_actions.loc[:, \"previous_item\"] = (\n",
" df_actions\n",
" .sort_values(by=[\"user_id\", \"session_id\", \"timestamp\", \"step\"],\n",
" ascending=[True, True, True, True])\n",
" .groupby([\"user_id\"])[\"referenced_item\"]\n",
" .shift(1)\n",
" )\n",
"\n",
" # Combine the impressions and item column, they both contain item IDs\n",
" # and we can expand the impression lists in the next step to get the total\n",
" # interaction count for an item\n",
" print_time(\"combining columns - impressions\")\n",
" df_actions.loc[:, \"interacted_item\"] = np.where(\n",
" df_actions.impressions.isna(),\n",
" df_actions.referenced_item,\n",
" df_actions.impressions\n",
" )\n",
" df_actions = df_actions.drop(columns=\"impressions\")\n",
"\n",
" # Price array expansion will get easier without NAs\n",
" print_time(\"combining columns - prices\")\n",
" df_actions.loc[:, \"prices\"] = np.where(\n",
" df_actions.prices.isna(),\n",
" \"\",\n",
" df_actions.prices\n",
" )\n",
"\n",
" # Convert pipe separated lists into columns\n",
" print_time(\"explode arrays\")\n",
" df_items = explode_mult(df_actions, [\"interacted_item\", \"prices\"]).copy()\n",
"\n",
" # Feature: Number of previous interactions with an item\n",
" print_time(\"interaction count\")\n",
" df_items.loc[:, \"interaction_count\"] = (\n",
" df_items\n",
" .groupby([\"user_id\", \"interacted_item\"])\n",
" .cumcount()\n",
" )\n",
"\n",
" # Reduce to impression level again \n",
" print_time(\"reduce to impressions\")\n",
" df_impressions = (\n",
" df_items[df_items.action_type == \"clickout item\"]\n",
" .copy()\n",
" .drop(columns=\"action_type\")\n",
" .rename(columns={\"interacted_item\": \"impressed_item\"})\n",
" )\n",
"\n",
" # Feature: Position of item in the original list.\n",
" # Items are in original order after the explode for each index\n",
" print_time(\"position feature\")\n",
" df_impressions.loc[:, \"position\"] = (\n",
" df_impressions\n",
" .groupby([\"user_id\", \"session_id\", \"timestamp\", \"step\"])\n",
" .cumcount()+1\n",
" )\n",
"\n",
" # Feature: Is the impressed item the last interacted item\n",
" print_time(\"last interacted item feature\")\n",
" df_impressions.loc[:, \"is_last_interacted\"] = (\n",
" df_impressions[\"previous_item\"] == df_impressions[\"impressed_item\"]\n",
" ).astype(int)\n",
"\n",
" print_time(\"change price datatype\")\n",
" df_impressions.loc[:, \"prices\"] = df_impressions.prices.astype(int)\n",
"\n",
" return_cols = [\n",
" \"user_id\",\n",
" \"session_id\",\n",
" \"timestamp\",\n",
" \"step\",\n",
" \"position\",\n",
" \"prices\",\n",
" \"interaction_count\",\n",
" \"is_last_interacted\",\n",
" \"referenced_item\",\n",
" \"impressed_item\",\n",
" ]\n",
"\n",
" df_return = df_impressions[return_cols]\n",
"\n",
" return df_return"
],
"execution_count": 64,
"outputs": []
},
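{
"cell_type": "markdown",
"metadata": {},
"source": [
"The two `cumcount` tricks above (interaction count and display position) are easy to verify on a toy frame. A minimal sketch with made-up IDs, not the actual dataset:\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"toy = pd.DataFrame({\n",
"    'user_id': ['u1'] * 4,\n",
"    'interacted_item': ['a', 'a', 'b', 'a'],\n",
"})\n",
"\n",
"# Number of interactions with the same item *before* the current row\n",
"toy['interaction_count'] = toy.groupby(['user_id', 'interacted_item']).cumcount()\n",
"print(toy['interaction_count'].tolist())  # [0, 1, 0, 2]\n",
"```"
]
},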
{
"cell_type": "markdown",
"metadata": {
"id": "HQ5CBMSrgrKo"
},
"source": [
"## Models"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "c_qWk0GNreBc"
},
"source": [
"### List of models\n",
" - gbm_rank: lightGBM model\n",
" - log_reg: Logistic regression\n",
" - nn_interaction: kNN w/ session co-occurrence\n",
" - nn_item: kNN w/ metadata similarity\n",
" - pop_abs: Popularity - total clicks\n",
" - pop_user: Popularity - distinct users\n",
" - position: Original display position\n",
" - random: Random order"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "DHHpskPBrgo0"
},
"source": [
"### random"
]
},
{
"cell_type": "code",
"metadata": {
"id": "i-aIjKUBrUp_"
},
"source": [
"#collapse-hide\n",
"class ModelRandom():\n",
" \"\"\"\n",
" Model class for the random ordering model.\n",
" Methods\n",
" fit(df): Not needed. Only added for consistency with other model classes\n",
" predict(df): Calculate recommendations for test data \n",
" \"\"\"\n",
" def fit(self, _):\n",
" pass\n",
"\n",
"\n",
" def predict(self, df):\n",
" \"\"\"Randomly sort the impressions list.\"\"\"\n",
"\n",
" # Target row, withheld item ID that needs to be predicted\n",
" print_time(\"target rows\")\n",
" df_target = get_target_rows(df.copy())\n",
"\n",
" # Summarize recommendations\n",
" print_time(\"summarize recommendations\")\n",
" random.seed(10121)\n",
" df_target.loc[:, \"item_recs_list\"] = (\n",
" df_target\n",
" .loc[:, \"impressions\"].str.split(\"|\")\n",
" .map(lambda x: sorted(x, key=lambda k: random.random()))\n",
" )\n",
"\n",
" df_target.loc[:, \"item_recommendations\"] = (\n",
" df_target[\"item_recs_list\"]\n",
" .map(lambda arr: ' '.join(arr))\n",
" )\n",
"\n",
" cols_rec = [\"user_id\", \"session_id\", \"timestamp\", \"step\", \"item_recommendations\"]\n",
" df_rec = df_target.loc[:, cols_rec]\n",
"\n",
" return df_rec"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "nu3Hxeb6rxX0",
"outputId": "191405a7-5ddf-45f5-c116-751cf3b84c6c"
},
"source": [
"model = ModelRandom()\n",
"model.fit(df_train)\n",
"df_recommendations = model.predict(df_test)"
],
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"text": [
"14:47:52 | target rows\n",
"14:47:52 | summarize recommendations\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 111
},
"id": "Uo82KIhgrzkN",
"outputId": "53e8e774-b55a-4fa5-a641-7ec2327fdb2b"
},
"source": [
"df_recommendations"
],
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" user_id | \n",
" session_id | \n",
" timestamp | \n",
" step | \n",
" item_recommendations | \n",
"
\n",
" \n",
" \n",
" \n",
" | 1 | \n",
" 64BL89 | \n",
" 3579f90 | \n",
" 6 | \n",
" 2 | \n",
" 5004 5010 5001 5002 5003 5023 | \n",
"
\n",
" \n",
" | 3 | \n",
" 64BL91F2 | \n",
" 3779f92 | \n",
" 10 | \n",
" 2 | \n",
" 5001 5010 5014 5004 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" user_id session_id timestamp step item_recommendations\n",
"1 64BL89 3579f90 6 2 5004 5010 5001 5002 5003 5023\n",
"3 64BL91F2 3779f92 10 2 5001 5010 5014 5004"
]
},
"metadata": {
"tags": []
},
"execution_count": 27
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "J09qiuF6sQcY"
},
"source": [
"### position"
]
},
{
"cell_type": "code",
"metadata": {
"id": "b_lcsrHMsI0M"
},
"source": [
"#collapse-hide\n",
"class ModelPosition():\n",
" \"\"\"\n",
" Model class for the model based on the original position in displayed list.\n",
" Methods\n",
" fit(df): Not needed. Only added for consistency with other model classes\n",
" predict(df): Calculate recommendations for test data \n",
" \"\"\"\n",
" def fit(self, _):\n",
" pass\n",
"\n",
"\n",
" def predict(self, df):\n",
" \"\"\"Return items in impressions list in original order.\"\"\"\n",
"\n",
" # Target row, withheld item ID that needs to be predicted\n",
" print_time(\"target rows\")\n",
" df_target = get_target_rows(df.copy())\n",
"\n",
" # Summarize recommendations\n",
" print_time(\"summarize recommendations\")\n",
" df_target[\"item_recommendations\"] = (\n",
" df_target\n",
" .apply(lambda x: x.impressions.replace(\"|\", \" \"), axis=1)\n",
" )\n",
"\n",
" cols_rec = [\"user_id\", \"session_id\", \"timestamp\", \"step\", \"item_recommendations\"]\n",
" df_rec = df_target.loc[:, cols_rec]\n",
"\n",
" return df_rec"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "-CvVzvKzslSy",
"outputId": "04074250-0ff3-4407-b7c7-f69393afc413"
},
"source": [
"model = ModelPosition()\n",
"model.fit(df_train)\n",
"df_recommendations = model.predict(df_test)"
],
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"text": [
"14:51:15 | target rows\n",
"14:51:15 | summarize recommendations\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 111
},
"id": "f8cNjbSwsrbS",
"outputId": "08fa97d5-db46-44ee-d4e6-b7f6252772ad"
},
"source": [
"df_recommendations"
],
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" user_id | \n",
" session_id | \n",
" timestamp | \n",
" step | \n",
" item_recommendations | \n",
"
\n",
" \n",
" \n",
" \n",
" | 1 | \n",
" 64BL89 | \n",
" 3579f90 | \n",
" 6 | \n",
" 2 | \n",
" 5002 5003 5010 5004 5001 5023 | \n",
"
\n",
" \n",
" | 3 | \n",
" 64BL91F2 | \n",
" 3779f92 | \n",
" 10 | \n",
" 2 | \n",
" 5001 5004 5010 5014 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" user_id session_id timestamp step item_recommendations\n",
"1 64BL89 3579f90 6 2 5002 5003 5010 5004 5001 5023\n",
"3 64BL91F2 3779f92 10 2 5001 5004 5010 5014"
]
},
"metadata": {
"tags": []
},
"execution_count": 30
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "ndeQGvQFtWHh"
},
"source": [
"### pop_abs"
]
},
{
"cell_type": "code",
"metadata": {
"id": "cxjT_bvBtHAK"
},
"source": [
"#collapse-hide\n",
"class ModelPopAbs():\n",
" \"\"\"\n",
" Model class for the popularity model based on total number of clicks.\n",
" Methods\n",
" fit(df): Fit the model on training data\n",
" predict(df): Calculate recommendations for test data \n",
" \"\"\"\n",
"\n",
" def fit(self, df):\n",
" \"\"\"Count the number of clicks for each item.\"\"\"\n",
"\n",
" # Select columns that are of interest for this method\n",
" print_time(\"start\")\n",
" cols = ['user_id', 'session_id', 'timestamp', 'step',\n",
" 'action_type', 'reference']\n",
" df_cols = df.loc[:, cols] \n",
"\n",
" # We only need to count clickouts per item\n",
" print_time(\"clicks per item\")\n",
" df_item_clicks = (\n",
" df_cols\n",
" .loc[df_cols[\"action_type\"] == \"clickout item\", :]\n",
" .groupby(\"reference\")\n",
" .size()\n",
" .reset_index(name=\"n_clicks\")\n",
" .rename(columns={\"reference\": \"item\"})\n",
" )\n",
"\n",
" self.df_pop = df_item_clicks\n",
"\n",
"\n",
" def predict(self, df):\n",
" \"\"\"Sort the impression list by number of clicks in the training phase.\"\"\"\n",
"\n",
" # Select columns that are of interest for this method\n",
" print_time(\"start\")\n",
" cols = ['user_id', 'session_id', 'timestamp', 'step',\n",
" 'action_type', 'reference', \"impressions\"]\n",
" df_cols = df.loc[:, cols] \n",
"\n",
" # Target row, withheld item ID that needs to be predicted\n",
" print_time(\"target rows\")\n",
" df_target = get_target_rows(df_cols)\n",
"\n",
" # Explode to impression level\n",
" print_time(\"explode impression array\")\n",
" df_impressions = (\n",
" explode(df_target, \"impressions\")\n",
" .rename(columns={\"impressions\": \"impressed_item\"})\n",
" )\n",
" df_impressions = (\n",
" df_impressions\n",
" .merge(\n",
" self.df_pop,\n",
" left_on=\"impressed_item\",\n",
" right_on=\"item\",\n",
" how=\"left\"\n",
" )\n",
" )\n",
"\n",
" # Summarize recommendations\n",
" print_time(\"summarize recommendations\")\n",
" df_rec = summarize_recs(df_impressions, \"n_clicks\")\n",
"\n",
" return df_rec"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "z1zRlpp7tda5",
"outputId": "a7fcdd29-988c-4341-fe58-bcf49394b950"
},
"source": [
"model = ModelPopAbs()\n",
"model.fit(df_train)\n",
"df_recommendations = model.predict(df_test)"
],
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"text": [
"14:57:30 | start\n",
"14:57:30 | clicks per item\n",
"14:57:30 | start\n",
"14:57:30 | target rows\n",
"14:57:30 | explode impression array\n",
"14:57:30 | summarize recommendations\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 111
},
"id": "WseXnTsSteDP",
"outputId": "211341f2-24e6-49bd-ff64-bef3a1006091"
},
"source": [
"df_recommendations"
],
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" user_id | \n",
" session_id | \n",
" timestamp | \n",
" step | \n",
" item_recommendations | \n",
"
\n",
" \n",
" \n",
" \n",
" | 0 | \n",
" 64BL89 | \n",
" 3579f90 | \n",
" 6 | \n",
" 2 | \n",
" 5001 5002 5004 5003 5010 5023 | \n",
"
\n",
" \n",
" | 1 | \n",
" 64BL91F2 | \n",
" 3779f92 | \n",
" 10 | \n",
" 2 | \n",
" 5001 5004 5010 5014 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" user_id session_id timestamp step item_recommendations\n",
"0 64BL89 3579f90 6 2 5001 5002 5004 5003 5010 5023\n",
"1 64BL91F2 3779f92 10 2 5001 5004 5010 5014"
]
},
"metadata": {
"tags": []
},
"execution_count": 34
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "MYCNr7aNtnw_"
},
"source": [
"### pop_user"
]
},
{
"cell_type": "code",
"metadata": {
"id": "O8uui6u-tnxA"
},
"source": [
"#collapse-hide\n",
"class ModelPopUsers():\n",
" \"\"\"\n",
" Model class for the popularity model based on distinct users.\n",
" Methods\n",
" fit(df): Fit the model on training data\n",
" predict(df): Calculate recommendations for test data \n",
" \"\"\"\n",
"\n",
" def fit(self, df):\n",
" \"\"\"Count the number of distinct users that click on an item.\"\"\"\n",
"\n",
" # Select columns that are of interest for this method\n",
" print_time(\"start\")\n",
" cols = ['user_id', 'session_id', 'timestamp', 'step',\n",
" 'action_type', 'reference']\n",
" df_cols = df.loc[:, cols] \n",
"\n",
" # We only need to count clickouts per item\n",
" print_time(\"clicks per item\")\n",
" df_item_clicks = (\n",
" df_cols\n",
" .loc[df_cols[\"action_type\"] == \"clickout item\", :]\n",
" .groupby(\"reference\")\n",
" .user_id\n",
" .nunique()\n",
" .reset_index(name=\"n_users\")\n",
" .rename(columns={\"reference\": \"item\"})\n",
" )\n",
"\n",
" self.df_pop = df_item_clicks\n",
"\n",
"\n",
" def predict(self, df):\n",
" \"\"\"Sort the impression list by number of distinct users in the training phase.\"\"\"\n",
"\n",
" # Select columns that are of interest for this method\n",
" print_time(\"start\")\n",
" cols = ['user_id', 'session_id', 'timestamp', 'step',\n",
" 'action_type', 'reference', \"impressions\"]\n",
" df_cols = df.loc[:, cols] \n",
"\n",
" # Target row, withheld item ID that needs to be predicted\n",
" print_time(\"target rows\")\n",
" df_target = get_target_rows(df_cols)\n",
"\n",
" # Explode to impression level\n",
" print_time(\"explode impression array\")\n",
" df_impressions = (\n",
" explode(df_target, \"impressions\")\n",
" .rename(columns={\"impressions\": \"impressed_item\"})\n",
" )\n",
" df_impressions = (\n",
" df_impressions\n",
" .merge(\n",
" self.df_pop,\n",
" left_on=\"impressed_item\",\n",
" right_on=\"item\",\n",
" how=\"left\"\n",
" )\n",
" )\n",
"\n",
" # Summarize recommendations\n",
" print_time(\"summarize recommendations\")\n",
" df_rec = summarize_recs(df_impressions, \"n_users\")\n",
"\n",
" return df_rec"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "pUPXq_tltnxB",
"outputId": "e80bddcb-1e55-44cc-dae2-53500f2ffc84"
},
"source": [
"model = ModelPopUsers()\n",
"model.fit(df_train)\n",
"df_recommendations = model.predict(df_test)"
],
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"text": [
"14:59:09 | start\n",
"14:59:09 | clicks per item\n",
"14:59:09 | start\n",
"14:59:09 | target rows\n",
"14:59:09 | explode impression array\n",
"14:59:09 | summarize recommendations\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 111
},
"id": "zwHJYLQNtnxB",
"outputId": "0882c7b8-1d3c-4f17-fcfb-c9d5c693a6be"
},
"source": [
"df_recommendations"
],
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" user_id | \n",
" session_id | \n",
" timestamp | \n",
" step | \n",
" item_recommendations | \n",
"
\n",
" \n",
" \n",
" \n",
" | 0 | \n",
" 64BL89 | \n",
" 3579f90 | \n",
" 6 | \n",
" 2 | \n",
" 5001 5002 5004 5003 5010 5023 | \n",
"
\n",
" \n",
" | 1 | \n",
" 64BL91F2 | \n",
" 3779f92 | \n",
" 10 | \n",
" 2 | \n",
" 5001 5004 5010 5014 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" user_id session_id timestamp step item_recommendations\n",
"0 64BL89 3579f90 6 2 5001 5002 5004 5003 5010 5023\n",
"1 64BL91F2 3779f92 10 2 5001 5004 5010 5014"
]
},
"metadata": {
"tags": []
},
"execution_count": 37
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "cfJsTtwutog5"
},
"source": [
"### nn_item"
]
},
{
"cell_type": "code",
"metadata": {
"id": "0RXX6i_gv8v_"
},
"source": [
"#collapse-hide\n",
"def calc_item_sims(df, item_col, reference_col):\n",
" \"\"\"Calculate similarity of items based on nearest neighbor algorithm.\n",
" The final data frame will have similarity scores for pairs of items.\n",
" :param df: Data frame of training data\n",
" :param item_col: Name of data frame column that contains the item ID\n",
" :param reference_col: Name of the reference column, depending on the model either\n",
" 1. session_id for the similarity based on session co-occurrences\n",
" 2. properties for the similarity based on item metadata\n",
" :return: Data frame with item pairs and similarity scores\n",
" \"\"\"\n",
"\n",
" # Create data frame with item and reference indices\n",
" print_time(\"item and reference indices\")\n",
" unique_items = df[item_col].unique()\n",
" unique_refs = df[reference_col].unique()\n",
"\n",
" d_items = {item_col: unique_items, 'item_idx': range(0, len(unique_items))}\n",
" d_refs = {reference_col: unique_refs, 'ref_idx': range(0, len(unique_refs))}\n",
"\n",
" df_items = pd.DataFrame(data=d_items)\n",
" df_refs = pd.DataFrame(data=d_refs)\n",
"\n",
" df = (\n",
" df\n",
" .merge(\n",
" df_items,\n",
" how=\"inner\",\n",
" on=item_col\n",
" )\n",
" .merge(\n",
" df_refs,\n",
" how=\"inner\",\n",
" on=reference_col\n",
" )\n",
" )\n",
"\n",
" df_idx = (\n",
" df\n",
" .loc[:, [\"item_idx\", \"ref_idx\"]]\n",
" .assign(data=lambda x: 1.)\n",
" .drop_duplicates()\n",
" )\n",
"\n",
" # Build item co-ooccurrence matrix\n",
" print_time(\"item co-occurrence matrix\")\n",
" mat_coo = sparse.coo_matrix((df_idx.data, (df_idx.item_idx, df_idx.ref_idx)))\n",
" mat_item_coo = mat_coo.T.dot(mat_coo)\n",
"\n",
" # Calculate Cosine similarities\n",
" print_time(\"Cosine similarity\")\n",
" inv_occ = np.sqrt(1 / mat_item_coo.diagonal())\n",
" cosine_sim = mat_item_coo.multiply(inv_occ)\n",
" cosine_sim = cosine_sim.T.multiply(inv_occ)\n",
"\n",
" # Create item similarity data frame\n",
" print_time(\"item similarity data frame\")\n",
" idx_ref, idx_item, sim = sparse.find(cosine_sim)\n",
" d_item_sim = {'idx_ref': idx_ref, 'idx_item': idx_item, 'similarity': sim}\n",
" df_item_sim = pd.DataFrame(data=d_item_sim)\n",
"\n",
" df_item_sim = (\n",
" df_item_sim\n",
" .merge(\n",
" df_items.assign(item_ref=df_items[item_col]),\n",
" how=\"inner\",\n",
" left_on=\"idx_ref\",\n",
" right_on=\"item_idx\"\n",
" )\n",
" .merge(\n",
" df_items.assign(item_sim=df_items[item_col]),\n",
" how=\"inner\",\n",
" left_on=\"idx_item\",\n",
" right_on=\"item_idx\"\n",
" )\n",
" .loc[:, [\"item_ref\", \"item_sim\", \"similarity\"]]\n",
" )\n",
"\n",
" return df_item_sim\n",
"\n",
"\n",
"def predict_nn(df, df_item_sim):\n",
" \"\"\"Calculate predictions based on the item similarity scores.\"\"\"\n",
"\n",
" # Select columns that are of interest for this function\n",
" print_time(\"start\")\n",
" cols = ['user_id', 'session_id', 'timestamp', 'step',\n",
" 'action_type', 'reference', 'impressions']\n",
" df_cols = df.loc[:, cols] \n",
"\n",
" # Get previous reference per user\n",
" print_time(\"previous reference\")\n",
" df_cols[\"previous_reference\"] = (\n",
" df_cols\n",
" .sort_values(by=[\"user_id\", \"session_id\", \"timestamp\"],\n",
" ascending=[True, True, True])\n",
" .groupby([\"user_id\"])[\"reference\"]\n",
" .shift(1)\n",
" )\n",
"\n",
" # Target row, withheld item ID that needs to be predicted\n",
" print_time(\"target rows\")\n",
" df_target = get_target_rows(df_cols)\n",
"\n",
" # Explode to impression level\n",
" print_time(\"explode impression array\")\n",
" df_impressions = explode(df_target, \"impressions\")\n",
"\n",
" df_item_sim[\"item_ref\"] = df_item_sim[\"item_ref\"].astype(str)\n",
" df_item_sim[\"item_sim\"] = df_item_sim[\"item_sim\"].astype(str)\n",
"\n",
" # Get similarities\n",
" print_time(\"get similarities\")\n",
" df_impressions = (\n",
" df_impressions\n",
" .merge(\n",
" df_item_sim,\n",
" how=\"left\",\n",
" left_on=[\"previous_reference\", \"impressions\"],\n",
" right_on=[\"item_ref\", \"item_sim\"]\n",
" )\n",
" .fillna(value={'similarity': 0})\n",
" .sort_values(by=[\"user_id\", \"timestamp\", \"step\", \"similarity\"],\n",
" ascending=[True, True, True, False])\n",
" )\n",
"\n",
" # Summarize recommendations\n",
" print_time(\"summarize recommendations\")\n",
" df_rec = group_concat(\n",
" df_impressions, [\"user_id\", \"session_id\", \"timestamp\", \"step\"], \n",
" \"impressions\"\n",
" )\n",
"\n",
" df_rec = (\n",
" df_rec\n",
" .rename(columns={'impressions': 'item_recommendations'})\n",
" .loc[:, [\"user_id\", \"session_id\", \"timestamp\", \"step\", \"item_recommendations\"]]\n",
" )\n",
"\n",
" return df_rec"
],
"execution_count": null,
"outputs": []
},
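{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see what `calc_item_sims` computes, here is the same cosine normalization on a tiny hand-made item/reference matrix (made-up values, illustration only):\n",
"\n",
"```python\n",
"import numpy as np\n",
"from scipy import sparse\n",
"\n",
"# 3 items x 2 references; a 1 means the item occurred with that reference\n",
"mat = sparse.coo_matrix(np.array([[1, 0], [1, 1], [0, 1]]))\n",
"\n",
"co = mat.dot(mat.T)                   # item-item co-occurrence counts\n",
"inv_occ = np.sqrt(1 / co.diagonal())  # 1 / sqrt(number of references per item)\n",
"cosine = co.multiply(inv_occ).T.multiply(inv_occ)\n",
"\n",
"print(np.round(cosine.toarray(), 2))\n",
"```\n",
"\n",
"The first two items share a reference, so their similarity is 1/sqrt(2), roughly 0.71; the first and third share no reference and get 0."
]
},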
{
"cell_type": "code",
"metadata": {
"id": "lpt9T1uftog6"
},
"source": [
"#collapse-hide\n",
"class ModelNNItem():\n",
" \"\"\"\n",
" Model class for the item metadata nearest neighbor model.\n",
" Methods\n",
" fit(df): Fit the model on training data\n",
" predict(df): Calculate recommendations for test data \n",
" \"\"\"\n",
"\n",
" def fit(self, df):\n",
" \"\"\"Calculate item similarity based on item metadata.\"\"\"\n",
"\n",
" # Explode property arrays\n",
" print_time(\"explode properties\")\n",
" df_properties = explode(df, \"properties\")\n",
"\n",
" df_item_sim = calc_item_sims(df_properties, \"item_id\", \"properties\")\n",
"\n",
" self.df_item_sim = df_item_sim\n",
"\n",
"\n",
" def predict(self, df):\n",
" \"\"\"Sort impression list by similarity.\"\"\"\n",
"\n",
" df_rec = predict_nn(df, self.df_item_sim)\n",
"\n",
" return df_rec"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "n2guldDTtog7",
"outputId": "13205c1d-fe07-483a-e30e-d3c5b119fb29"
},
"source": [
"model = ModelNNItem()\n",
"model.fit(df_items)\n",
"df_recommendations = model.predict(df_test)"
],
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"text": [
"15:07:50 | explode properties\n",
"15:07:50 | item and reference indices\n",
"15:07:50 | item co-occurrence matrix\n",
"15:07:50 | Cosine similarity\n",
"15:07:50 | item similarity data frame\n",
"15:07:50 | start\n",
"15:07:50 | previous reference\n",
"15:07:50 | target rows\n",
"15:07:50 | explode impression array\n",
"15:07:50 | get similarities\n",
"15:07:50 | summarize recommendations\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 111
},
"id": "fh5UB1X6tog8",
"outputId": "4896e452-1a9d-4292-b320-ffaeae7e54f6"
},
"source": [
"df_recommendations"
],
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" user_id | \n",
" session_id | \n",
" timestamp | \n",
" step | \n",
" item_recommendations | \n",
"
\n",
" \n",
" \n",
" \n",
" | 0 | \n",
" 64BL89 | \n",
" 3579f90 | \n",
" 6 | \n",
" 2 | \n",
" 5002 5003 5010 5004 5001 5023 | \n",
"
\n",
" \n",
" | 1 | \n",
" 64BL91F2 | \n",
" 3779f92 | \n",
" 10 | \n",
" 2 | \n",
" 5001 5004 5010 5014 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" user_id session_id timestamp step item_recommendations\n",
"0 64BL89 3579f90 6 2 5002 5003 5010 5004 5001 5023\n",
"1 64BL91F2 3779f92 10 2 5001 5004 5010 5014"
]
},
"metadata": {
"tags": []
},
"execution_count": 56
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "GfyyfSBktpIp"
},
"source": [
"### nn_interaction"
]
},
{
"cell_type": "code",
"metadata": {
"id": "FqfcAfjwtpIq"
},
"source": [
"#collapse-hide\n",
"class ModelNNInteraction():\n",
" \"\"\"\n",
" Model class for the session co-occurrence nearest neighbor model.\n",
" Methods\n",
" fit(df): Fit the model on training data\n",
" predict(df): Calculate recommendations for test data \n",
" \"\"\"\n",
"\n",
" def fit(self, df):\n",
" \"\"\"Calculate item similarity based on session co-occurrence.\"\"\"\n",
"\n",
" # Select columns that are of interest for this method\n",
" print_time(\"start\")\n",
" cols = ['user_id', 'session_id', 'timestamp', 'step',\n",
" 'action_type', 'reference']\n",
" df_cols = df.loc[:, cols] \n",
"\n",
" # We are only interested in action types, for wich the reference is an item ID\n",
" print_time(\"filter interactions\")\n",
" item_interactions = [\n",
" 'clickout item', 'interaction item deals', 'interaction item image',\n",
" 'interaction item info', 'interaction item rating', 'search for item'\n",
" ]\n",
" df_actions = (\n",
" df_cols\n",
" .loc[df_cols.action_type.isin(item_interactions), :]\n",
" .rename(columns={'reference': 'item'})\n",
" .drop(columns='action_type')\n",
" )\n",
"\n",
" df_item_sim = calc_item_sims(df_actions, \"item\", \"session_id\")\n",
"\n",
" self.df_item_sim = df_item_sim\n",
"\n",
"\n",
" def predict(self, df):\n",
" \"\"\"Sort impression list by similarity.\"\"\"\n",
"\n",
" df_rec = predict_nn(df, self.df_item_sim)\n",
"\n",
" return df_rec"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "uXUiInYetpIr",
"outputId": "ccd00752-1248-43bd-f7e1-8ae43155a669"
},
"source": [
"model = ModelNNInteraction()\n",
"model.fit(df_train)\n",
"df_recommendations = model.predict(df_test)"
],
"execution_count": null,
"outputs": [
{
"output_type": "stream",
"text": [
"15:09:13 | start\n",
"15:09:13 | filter interactions\n",
"15:09:13 | item and reference indices\n",
"15:09:13 | item co-occurrence matrix\n",
"15:09:13 | Cosine similarity\n",
"15:09:13 | item similarity data frame\n",
"15:09:13 | start\n",
"15:09:13 | previous reference\n",
"15:09:13 | target rows\n",
"15:09:13 | explode impression array\n",
"15:09:13 | get similarities\n",
"15:09:13 | summarize recommendations\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 111
},
"id": "FXJhGxTltpIs",
"outputId": "feaa355d-4e96-4bac-a8df-5c5b67fbd005"
},
"source": [
"df_recommendations"
],
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" user_id | \n",
" session_id | \n",
" timestamp | \n",
" step | \n",
" item_recommendations | \n",
"
\n",
" \n",
" \n",
" \n",
" | 0 | \n",
" 64BL89 | \n",
" 3579f90 | \n",
" 6 | \n",
" 2 | \n",
" 5002 5003 5010 5004 5001 5023 | \n",
"
\n",
" \n",
" | 1 | \n",
" 64BL91F2 | \n",
" 3779f92 | \n",
" 10 | \n",
" 2 | \n",
" 5001 5004 5010 5014 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" user_id session_id timestamp step item_recommendations\n",
"0 64BL89 3579f90 6 2 5002 5003 5010 5004 5001 5023\n",
"1 64BL91F2 3779f92 10 2 5001 5004 5010 5014"
]
},
"metadata": {
"tags": []
},
"execution_count": 59
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "JDHr8CGAtpSX"
},
"source": [
"### log_reg"
]
},
{
"cell_type": "code",
"metadata": {
"id": "_gieDTNbtpSY"
},
"source": [
"#collapse-hide\n",
"class ModelLogReg():\n",
" \"\"\"\n",
" Model class for the logistic regression model.\n",
" Methods\n",
" fit(df): Fit the model on training data\n",
" predict(df): Calculate recommendations for test data \n",
" \"\"\"\n",
"\n",
" def fit(self, df):\n",
" \"\"\"Train the logistic regression model.\"\"\"\n",
"\n",
" df_impressions = build_features(df)\n",
"\n",
" # Target column, item that was clicked\n",
" print_time(\"target column\")\n",
" df_impressions.loc[:, \"is_clicked\"] = (\n",
" df_impressions[\"referenced_item\"] == df_impressions[\"impressed_item\"]\n",
" ).astype(int)\n",
"\n",
" features = [\n",
" \"position\",\n",
" \"prices\",\n",
" \"interaction_count\",\n",
" \"is_last_interacted\",\n",
" ]\n",
"\n",
" X = df_impressions[features]\n",
" y = df_impressions.is_clicked\n",
"\n",
" # Training the actual model\n",
" print_time(\"training logistic regression model\")\n",
" self.logreg = LogisticRegression(solver=\"lbfgs\", max_iter=100, tol=1e-11, C=1e10).fit(X, y)\n",
"\n",
"\n",
" def predict(self, df):\n",
" \"\"\"Calculate click probability based on trained logistic regression model.\"\"\"\n",
"\n",
" df_impressions = build_features(df)\n",
"\n",
" # Target row, withheld item ID that needs to be predicted\n",
" df_impressions = df_impressions[df_impressions.referenced_item.isna()]\n",
"\n",
" features = [\n",
" \"position\",\n",
" \"prices\",\n",
" \"interaction_count\",\n",
" \"is_last_interacted\"\n",
" ]\n",
"\n",
" # Predict clickout probabilities for each impressed item\n",
" print_time(\"predict clickout item\")\n",
" df_impressions.loc[:, \"click_probability\"] = (\n",
" self\n",
" .logreg\n",
" .predict_proba(df_impressions[features])[:, 1]\n",
" )\n",
"\n",
" # Summarize recommendations\n",
" print_time(\"summarize recommendations\")\n",
" df_rec = summarize_recs(df_impressions, \"click_probability\")\n",
"\n",
" return df_rec"
],
"execution_count": 65,
"outputs": []
},
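{
"cell_type": "markdown",
"metadata": {},
"source": [
"`predict_proba(...)[:, 1]` selects the probability of the positive class (a click), which is the score the impressions are then sorted by. A toy example with made-up features, illustration only:\n",
"\n",
"```python\n",
"import numpy as np\n",
"from sklearn.linear_model import LogisticRegression\n",
"\n",
"X = np.array([[1, 100], [2, 120], [3, 90], [4, 200]])  # position, price\n",
"y = np.array([1, 0, 1, 0])                             # was the item clicked?\n",
"\n",
"clf = LogisticRegression().fit(X, y)\n",
"p_click = clf.predict_proba(X)[:, 1]  # column 0 is P(no click), column 1 is P(click)\n",
"print(p_click.shape)  # (4,)\n",
"```"
]
},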
{
"cell_type": "code",
"metadata": {
"id": "EG6cppXHtpSZ",
"colab": {
"base_uri": "https://localhost:8080/"
},
"outputId": "4a79dcd6-6c54-41a7-897a-f3b326f41671"
},
"source": [
"model = ModelLogReg()\n",
"model.fit(df_train)\n",
"df_recommendations = model.predict(df_test)"
],
"execution_count": 66,
"outputs": [
{
"output_type": "stream",
"text": [
"15:16:04 | start\n",
"15:16:04 | filter interactions\n",
"15:16:04 | cleaning\n",
"15:16:04 | previous interactions\n",
"15:16:04 | combining columns - impressions\n",
"15:16:04 | combining columns - prices\n",
"15:16:04 | explode arrays\n",
"15:16:04 | interaction count\n",
"15:16:04 | reduce to impressions\n",
"15:16:04 | position feature\n",
"15:16:04 | last interacted item feature\n",
"15:16:04 | change price datatype\n",
"15:16:04 | target column\n",
"15:16:04 | training logistic regression model\n",
"15:16:04 | start\n",
"15:16:04 | filter interactions\n",
"15:16:04 | cleaning\n",
"15:16:04 | previous interactions\n",
"15:16:04 | combining columns - impressions\n",
"15:16:04 | combining columns - prices\n",
"15:16:04 | explode arrays\n",
"15:16:04 | interaction count\n",
"15:16:04 | reduce to impressions\n",
"15:16:04 | position feature\n",
"15:16:04 | last interacted item feature\n",
"15:16:04 | change price datatype\n",
"15:16:04 | predict clickout item\n",
"15:16:04 | summarize recommendations\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "WYb8RtQ1tpSZ",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 111
},
"outputId": "b2b25ee0-e759-4d25-e519-fbee972b95d4"
},
"source": [
"df_recommendations"
],
"execution_count": 67,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" user_id | \n",
" session_id | \n",
" timestamp | \n",
" step | \n",
" item_recommendations | \n",
"
\n",
" \n",
" \n",
" \n",
" | 0 | \n",
" 64BL89 | \n",
" 3579f90 | \n",
" 6 | \n",
" 2 | \n",
" 5023 5002 5003 5010 5004 5001 | \n",
"
\n",
" \n",
" | 1 | \n",
" 64BL91F2 | \n",
" 3779f92 | \n",
" 10 | \n",
" 2 | \n",
" 5010 5001 5004 5014 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" user_id session_id timestamp step item_recommendations\n",
"0 64BL89 3579f90 6 2 5023 5002 5003 5010 5004 5001\n",
"1 64BL91F2 3779f92 10 2 5010 5001 5004 5014"
]
},
"metadata": {
"tags": []
},
"execution_count": 67
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "q9r_boM1tpf6"
},
"source": [
"### gbm_rank"
]
},
{
"cell_type": "code",
"metadata": {
"id": "QatREEBjtpf7"
},
"source": [
"#collapse-hide\n",
"class ModelGbmRank():\n",
" \"\"\"\n",
" Model class for the lightGBM model.\n",
" Methods\n",
" fit(df): Fit the model on training data\n",
" predict(df): Calculate recommendations for test data \n",
" \"\"\"\n",
"\n",
" def fit(self, df):\n",
" \"\"\"Train the lightGBM model.\"\"\"\n",
"\n",
" df_impressions = build_features(df)\n",
"\n",
" # Target column, item that was clicked\n",
" print_time(\"target column\")\n",
" df_impressions.loc[:, \"is_clicked\"] = (\n",
" df_impressions[\"referenced_item\"] == df_impressions[\"impressed_item\"]\n",
" ).astype(int)\n",
"\n",
" features = [\n",
" \"position\",\n",
" \"prices\",\n",
" \"interaction_count\",\n",
" \"is_last_interacted\",\n",
" ]\n",
"\n",
" # Bring to format suitable for lightGBM\n",
" print_time(\"lightGBM format\")\n",
" X = df_impressions[features]\n",
" y = df_impressions.is_clicked\n",
"\n",
" q = (\n",
" df_impressions\n",
" .groupby([\"user_id\", \"session_id\", \"timestamp\", \"step\"])\n",
" .size()\n",
" .reset_index(name=\"query_length\")\n",
" .query_length\n",
" )\n",
"\n",
" # Training the actual model\n",
" print_time(\"training lightGBM model\")\n",
" self.gbm = lgb.LGBMRanker()\n",
" self.gbm.fit(X, y, group=q, verbose=True)\n",
"\n",
"\n",
" def predict(self, df):\n",
" \"\"\"Calculate item ranking based on trained lightGBM model.\"\"\"\n",
"\n",
" df_impressions = build_features(df)\n",
"\n",
" # Target row, withheld item ID that needs to be predicted\n",
" df_impressions = df_impressions[df_impressions.referenced_item.isna()]\n",
"\n",
" features = [\n",
" \"position\",\n",
" \"prices\",\n",
" \"interaction_count\",\n",
" \"is_last_interacted\"\n",
" ]\n",
"\n",
" df_impressions.loc[:, \"click_propensity\"] = self.gbm.predict(df_impressions[features])\n",
"\n",
" # Summarize recommendations\n",
" print_time(\"summarize recommendations\")\n",
" df_rec = summarize_recs(df_impressions, \"click_propensity\")\n",
" \n",
" return df_rec"
],
"execution_count": 68,
"outputs": []
},
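{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `group` argument tells lightGBM which rows belong to the same query, i.e. the same clickout: it is simply the list of query sizes in row order. A minimal sketch with made-up sessions:\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"toy = pd.DataFrame({\n",
"    'session_id': ['s1', 's1', 's1', 's2', 's2'],\n",
"    'impressed_item': ['a', 'b', 'c', 'a', 'd'],\n",
"})\n",
"\n",
"q = toy.groupby('session_id').size()\n",
"print(q.tolist())  # [3, 2] -> the first 3 rows form one query, the next 2 another\n",
"```"
]
},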
{
"cell_type": "code",
"metadata": {
"id": "meVWs09Ptpf7",
"colab": {
"base_uri": "https://localhost:8080/"
},
"outputId": "b59757d7-90ae-41b8-e1df-43ab5f0ba0b1"
},
"source": [
"model = ModelGbmRank()\n",
"model.fit(df_train)\n",
"df_recommendations = model.predict(df_test)"
],
"execution_count": 69,
"outputs": [
{
"output_type": "stream",
"text": [
"15:16:53 | start\n",
"15:16:53 | filter interactions\n",
"15:16:53 | cleaning\n",
"15:16:53 | previous interactions\n",
"15:16:53 | combining columns - impressions\n",
"15:16:53 | combining columns - prices\n",
"15:16:53 | explode arrays\n",
"15:16:53 | interaction count\n",
"15:16:53 | reduce to impressions\n",
"15:16:53 | position feature\n",
"15:16:53 | last interacted item feature\n",
"15:16:53 | change price datatype\n",
"15:16:53 | target column\n",
"15:16:53 | lightGBM format\n",
"15:16:53 | training lightGBM model\n",
"15:16:53 | start\n",
"15:16:53 | filter interactions\n",
"15:16:53 | cleaning\n",
"15:16:53 | previous interactions\n",
"15:16:53 | combining columns - impressions\n",
"15:16:53 | combining columns - prices\n",
"15:16:53 | explode arrays\n",
"15:16:53 | interaction count\n",
"15:16:53 | reduce to impressions\n",
"15:16:53 | position feature\n",
"15:16:53 | last interacted item feature\n",
"15:16:53 | change price datatype\n",
"15:16:53 | summarize recommendations\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "RrtkaemEtpf7",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 111
},
"outputId": "69d8d504-a4c1-43ec-9aef-3a180bdabeb1"
},
"source": [
"df_recommendations"
],
"execution_count": 70,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" user_id | \n",
" session_id | \n",
" timestamp | \n",
" step | \n",
" item_recommendations | \n",
"
\n",
" \n",
" \n",
" \n",
" | 0 | \n",
" 64BL89 | \n",
" 3579f90 | \n",
" 6 | \n",
" 2 | \n",
" 5002 5003 5010 5004 5001 5023 | \n",
"
\n",
" \n",
" | 1 | \n",
" 64BL91F2 | \n",
" 3779f92 | \n",
" 10 | \n",
" 2 | \n",
" 5001 5004 5010 5014 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" user_id session_id timestamp step item_recommendations\n",
"0 64BL89 3579f90 6 2 5002 5003 5010 5004 5001 5023\n",
"1 64BL91F2 3779f92 10 2 5001 5004 5010 5014"
]
},
"metadata": {
"tags": []
},
"execution_count": 70
}
]
}
]
}