{ "cells": [ { "cell_type": "markdown", "id": "85aefede", "metadata": {}, "source": [ "# EDA of the X5 dataset for the RetailHero competition" ] }, { "cell_type": "markdown", "id": "a5f82872", "metadata": {}, "source": [ "The dataset was provided by X5 Retail Group for the RetailHero hackathon, held in winter 2019." ] }, { "cell_type": "markdown", "id": "1de7a7de", "metadata": {}, "source": [ "## Data loading" ] }, { "cell_type": "markdown", "id": "24dc9d37", "metadata": {}, "source": [ "Let's download the data. We will use the standard [fetch function](https://www.uplift-modeling.com/en/latest/api/datasets/fetch_x5.html#x5-retailhero-uplift-modeling-dataset)." ] }, { "cell_type": "code", "execution_count": 1, "id": "26f40fb1", "metadata": {}, "outputs": [], "source": [ "from sklift.datasets import fetch_x5\n", "import pandas as pd\n", "import numpy as np\n", "import plotly.express as px\n", "from sklearn.model_selection import train_test_split\n", "from sklift.metrics import uplift_at_k\n", "from sklift.models import SoloModel\n", "from catboost import CatBoostClassifier\n", "\n", "pd.set_option('display.float_format', lambda x: '%.3f' % x)\n", "\n", "dataset = fetch_x5()" ] }, { "cell_type": "markdown", "id": "d5a07d8e", "metadata": {}, "source": [ "The `dataset` object is an iterable, dictionary-like object.\n", "Let's take a brief look at what it contains." 
] }, { "cell_type": "code", "execution_count": 2, "id": "213d1a1b", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "dict_keys(['data', 'target', 'treatment', 'DESCR', 'feature_names', 'target_name', 'treatment_name'])" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dataset.keys()" ] }, { "cell_type": "code", "execution_count": 3, "id": "5faf7ad6", "metadata": {}, "outputs": [], "source": [ "# `data` holds three datasets: clients, train, and purchases\n", "datasets_data = dataset['data']\n", "\n", "# clients data\n", "clients = datasets_data['clients']\n", "\n", "# train data\n", "train = datasets_data['train']\n", "\n", "# purchases\n", "purchases = datasets_data['purchases']\n", "\n", "# Let's look at the other dataset objects\n", "\n", "# treatment flag:\n", "# 1 means there was an interaction with the customer; 0 means no interaction\n", "\n", "treatment = dataset['treatment']\n", "\n", "# target variable\n", "\n", "target = dataset['target']" ] }, { "cell_type": "markdown", "id": "0782dfc3", "metadata": {}, "source": [ "## EDA" ] }, { "cell_type": "markdown", "id": "b3941d8b", "metadata": {}, "source": [ "### Clients" ] }, { "cell_type": "markdown", "id": "3334fc5a", "metadata": {}, "source": [ "The clients dataset consists of general information about customers." ] }, { "cell_type": "code", "execution_count": 4, "id": "fab5aca3", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " client_id first_issue_date first_redeem_date age gender\n", "0 000012768d 2017-08-05 15:40:48 2018-01-04 19:30:07 45 U\n", "1 000036f903 2017-04-10 13:54:23 2017-04-23 12:37:56 72 F\n", "2 000048b7a6 2018-12-15 13:33:11 NaN 68 F\n", "3 000073194a 2017-05-23 12:56:14 2017-11-24 11:18:01 60 F\n", "4 00007c7133 2017-05-22 16:17:08 2018-12-31 17:17:33 67 U" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "clients.head()" ] }, { "cell_type": "code", "execution_count": 5, "id": "176b5003", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(400162, 5)" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "clients.shape" ] }, { "cell_type": "markdown", "id": "53f12089", "metadata": {}, "source": [ "Let's look at the variables in the clients dataset in more detail:" ] }, { "cell_type": "code", "execution_count": 6, "id": "bed12163", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "count 400162.000\n", "mean 46.488\n", "std 43.871\n", "min -7491.000\n", "25% 34.000\n", "50% 45.000\n", "75% 59.000\n", "max 1901.000\n", "Name: age, dtype: float64" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# age variable\n", "\n", "clients['age'].describe()" ] }, { "cell_type": "markdown", "id": "2efaaa80", "metadata": {}, "source": [ "The `age` column clearly contains outliers (a minimum of -7491 and a maximum of 1901). Let's simply remove them by keeping only the values strictly between the 1st and 99th percentiles." 
] }, { "cell_type": "code", "execution_count": 7, "id": "d839baa4", "metadata": {}, "outputs": [ { "data": { "image/png": "" }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Keep only ages strictly between the 1st and 99th percentiles\n", "q_upper = np.quantile(clients['age'], 0.99)\n", "q_lower = np.quantile(clients['age'], 0.01)\n", "\n", "clients_age_cleaned = clients[(clients['age'] < q_upper) & (clients['age'] > q_lower)]['age']\n", "\n", "fig = px.histogram(clients_age_cleaned)\n", "fig.show('png')" ] }, { "cell_type": "code", "execution_count": 8, "id": "c3f169b0", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "gender\n", "F 147649\n", "M 66807\n", "U 185706\n", "dtype: int64" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# gender variable\n", "clients.groupby('gender').size()" ] }, { "cell_type": "code", "execution_count": 9, "id": "1cb705d9", "metadata": {}, "outputs": [ { "data": { "image/png": "" }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# first_issue_date: number of clients per issue date\n", "\n", "clients['first_issue_date'] = pd.to_datetime(clients['first_issue_date'])\n", "clients['first_issue_date_d'] = clients['first_issue_date'].dt.date\n", "clients_issues_dates = (\n", "    clients.groupby('first_issue_date_d')\n", "    .size()\n", "    .reset_index(name='Number of clients')\n", ")\n", "\n", "fig = px.line(clients_issues_dates, x='first_issue_date_d', y='Number of clients')\n", "fig.show('png')" ] }, { "cell_type": "markdown", "id": "8ee096ef", "metadata": {}, "source": [ "### Purchases" ] }, { "cell_type": "markdown", "id": "440c0b7a", "metadata": {}, "source": [ "The purchases dataset contains information about customer transactions." ] }, { "cell_type": "code", "execution_count": 10, "id": "102d15b2", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " client_id transaction_id transaction_datetime regular_points_received \\\n", "0 000012768d 7e3e2e3984 2018-12-01 07:12:45 10.000 \n", "1 000012768d 7e3e2e3984 2018-12-01 07:12:45 10.000 \n", "2 000012768d 7e3e2e3984 2018-12-01 07:12:45 10.000 \n", "3 000012768d 7e3e2e3984 2018-12-01 07:12:45 10.000 \n", "4 000012768d 7e3e2e3984 2018-12-01 07:12:45 10.000 \n", "\n", " express_points_received regular_points_spent express_points_spent \\\n", "0 0.000 0.000 0.000 \n", "1 0.000 0.000 0.000 \n", "2 0.000 0.000 0.000 \n", "3 0.000 0.000 0.000 \n", "4 0.000 0.000 0.000 \n", "\n", " purchase_sum store_id product_id product_quantity trn_sum_from_iss \\\n", "0 1007.000 54a4a11a29 9a80204f78 2.000 80.000 \n", "1 1007.000 54a4a11a29 da89ebd374 1.000 65.000 \n", "2 1007.000 54a4a11a29 0a95e1151d 1.000 24.000 \n", "3 1007.000 54a4a11a29 4055b15e4a 2.000 50.000 \n", "4 1007.000 54a4a11a29 a685f1916b 1.000 22.000 \n", "\n", " trn_sum_from_red \n", "0 NaN \n", "1 NaN \n", "2 NaN \n", "3 NaN \n", "4 NaN " ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "purchases.head()" ] }, { "cell_type": "code", "execution_count": 11, "id": "968d7679", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " regular_points_received express_points_received regular_points_spent \\\n", "count 45786568.000 45786568.000 45786568.000 \n", "mean 8.050 0.061 -5.313 \n", "std 12.685 2.426 36.036 \n", "min 0.000 0.000 -5066.000 \n", "25% 1.400 0.000 0.000 \n", "50% 3.800 0.000 0.000 \n", "75% 10.300 0.000 0.000 \n", "max 2399.000 300.000 0.000 \n", "\n", " express_points_spent purchase_sum product_quantity trn_sum_from_iss \\\n", "count 45786568.000 45786568.000 45786568.000 45786568.000 \n", "mean -0.318 777.521 1.247 73.488 \n", "std 3.288 796.535 3.138 87.540 \n", "min -300.000 0.000 0.000 0.000 \n", "25% 0.000 286.000 1.000 30.000 \n", "50% 0.000 539.000 1.000 51.000 \n", "75% 0.000 976.000 1.000 90.000 \n", "max 0.000 35149.040 14941.000 35149.000 \n", "\n", " trn_sum_from_red \n", "count 3043356.000 \n", "mean 76.774 \n", "std 84.271 \n", "min 0.000 \n", "25% 31.000 \n", "50% 55.000 \n", "75% 95.000 \n", "max 8789.000 " ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "purchases.describe()" ] }, { "cell_type": "markdown", "id": "abef56e9", "metadata": {}, "source": [ "The dataset appears to have one row per client_id/transaction_id/product_id combination." ] }, { "cell_type": "markdown", "id": "0586e9e6", "metadata": {}, "source": [ "### train/treatment/target dataset" ] }, { "cell_type": "markdown", "id": "eac8db80", "metadata": {}, "source": [ "Treatment and target are available only for the train part of the dataset." ] }, { "cell_type": "code", "execution_count": 12, "id": "31d25d57", "metadata": {}, "outputs": [], "source": [ "train['treatment'] = treatment\n", "train['target'] = target" ] }, { "cell_type": "code", "execution_count": 13, "id": "4eded387", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " client_id first_issue_date first_redeem_date age gender \\\n", "0 000012768d 2017-08-05 15:40:48 2018-01-04 19:30:07 45 U \n", "1 000036f903 2017-04-10 13:54:23 2017-04-23 12:37:56 72 F \n", "2 00010925a5 2018-07-24 16:21:29 2018-09-14 16:12:49 83 U \n", "3 0001f552b0 2017-06-30 19:20:38 2018-08-28 12:59:45 33 F \n", "4 00020e7b18 2017-11-27 11:41:45 2018-01-10 17:50:05 73 U \n", "... ... ... ... ... ... \n", "200034 fffe0abb97 2017-11-27 08:56:54 2018-02-11 09:26:08 35 F \n", "200035 fffe0ed719 2017-09-15 08:53:24 2017-12-12 14:50:12 69 U \n", "200036 fffea1204c 2018-01-31 16:59:37 2018-03-12 17:02:27 73 F \n", "200037 fffeca6d22 2017-12-28 11:56:13 NaN 77 F \n", "200038 fffff6ce77 2017-08-03 20:25:12 2017-08-26 16:41:41 42 U \n", "\n", " first_issue_date_d treatment target \n", "0 2017-08-05 0 1 \n", "1 2017-04-10 1 1 \n", "2 2018-07-24 1 1 \n", "3 2017-06-30 1 1 \n", "4 2017-11-27 1 1 \n", "... ... ... ... \n", "200034 2017-11-27 0 0 \n", "200035 2017-09-15 0 1 \n", "200036 2018-01-31 0 1 \n", "200037 2017-12-28 1 0 \n", "200038 2017-08-03 0 1 \n", "\n", "[200039 rows x 8 columns]" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train_df = clients.merge(train,how = 'right', on = 'client_id')\n", "train_df" ] }, { "cell_type": "code", "execution_count": 14, "id": "ae4c4694", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "target 0 1\n", "treatment \n", "0 19.844 30.176\n", "1 18.167 31.813" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "100*(pd.crosstab(train_df['treatment'],train_df['target'])/train_df.shape[0])" ] }, { "cell_type": "markdown", "id": "10d4fb9c", "metadata": {}, "source": [ "As the table shows, the treatment and control groups are almost equal in size (about 50% of clients each), and the share of clients with target = 1 is similar in both groups: about 63.7% in the treatment group versus 60.3% in the control group." ] }, { "cell_type": "markdown", "id": "94ef6c5e", "metadata": {}, "source": [ "## Data processing" ] }, { "cell_type": "markdown", "id": "27c67b63", "metadata": {}, "source": [ "Let's do some data processing before uplift modeling." ] }, { "cell_type": "code", "execution_count": 15, "id": "aaacec5d", "metadata": {}, "outputs": [], "source": [ "# Extract features from client data: convert dates to Unix timestamps (seconds)\n", "\n", "train_df['first_issue_time'] = \\\n", " (pd.to_datetime(train_df['first_issue_date'])\n", " - pd.Timestamp('1970-01-01')) // pd.Timedelta('1s')\n", "\n", "train_df['first_redeem_time'] = \\\n", " (pd.to_datetime(train_df['first_redeem_date'])\n", " - pd.Timestamp('1970-01-01')) // pd.Timedelta('1s')\n", "\n", "# Delay in seconds between card issue and first redemption\n", "train_df['issue_redeem_delay'] = train_df['first_redeem_time'] \\\n", " - train_df['first_issue_time']\n", "\n", "\n", "# Count the number of unique transactions and products per client\n", "\n", "purchases_train = purchases.merge(train,how = 'right', on = 'client_id')\n", "purchases_features = purchases_train.groupby('client_id')[['transaction_id','product_id']].nunique().reset_index()\n" ] }, { "cell_type": "code", "execution_count": 16, "id": "cbf53815", "metadata": {}, "outputs": [], "source": [ "train_df_purchases = train_df.merge(purchases_features, how = 'left', on = 'client_id')" ] }, { "cell_type": "code", "execution_count": 17, "id": "425eba62", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " client_id first_issue_date first_redeem_date age gender \\\n", "0 000012768d 2017-08-05 15:40:48 2018-01-04 19:30:07 45 U \n", "1 000036f903 2017-04-10 13:54:23 2017-04-23 12:37:56 72 F \n", "2 00010925a5 2018-07-24 16:21:29 2018-09-14 16:12:49 83 U \n", "3 0001f552b0 2017-06-30 19:20:38 2018-08-28 12:59:45 33 F \n", "4 00020e7b18 2017-11-27 11:41:45 2018-01-10 17:50:05 73 U \n", "... ... ... ... ... ... \n", "200034 fffe0abb97 2017-11-27 08:56:54 2018-02-11 09:26:08 35 F \n", "200035 fffe0ed719 2017-09-15 08:53:24 2017-12-12 14:50:12 69 U \n", "200036 fffea1204c 2018-01-31 16:59:37 2018-03-12 17:02:27 73 F \n", "200037 fffeca6d22 2017-12-28 11:56:13 NaN 77 F \n", "200038 fffff6ce77 2017-08-03 20:25:12 2017-08-26 16:41:41 42 U \n", "\n", " first_issue_date_d treatment target first_issue_time \\\n", "0 2017-08-05 0 1 1501947648 \n", "1 2017-04-10 1 1 1491832463 \n", "2 2018-07-24 1 1 1532449289 \n", "3 2017-06-30 1 1 1498850438 \n", "4 2017-11-27 1 1 1511782905 \n", "... ... ... ... ... \n", "200034 2017-11-27 0 0 1511773014 \n", "200035 2017-09-15 0 1 1505465604 \n", "200036 2018-01-31 0 1 1517417977 \n", "200037 2017-12-28 1 0 1514462173 \n", "200038 2017-08-03 0 1 1501791912 \n", "\n", " first_redeem_time issue_redeem_delay transaction_id product_id \n", "0 1515094207.000 13146559.000 4 46 \n", "1 1492951076.000 1118613.000 32 96 \n", "2 1536941569.000 4492280.000 18 58 \n", "3 1535461185.000 36610747.000 15 79 \n", "4 1515606605.000 3823700.000 18 175 \n", "... ... ... ... ... 
\n", "200034 1518341168.000 6568154.000 9 36 \n", "200035 1513090212.000 7624608.000 30 89 \n", "200036 1520874147.000 3456170.000 17 45 \n", "200037 NaN NaN 16 34 \n", "200038 1503765701.000 1973789.000 32 137 \n", "\n", "[200039 rows x 13 columns]" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train_df_purchases" ] }, { "cell_type": "markdown", "id": "ee90ffac", "metadata": {}, "source": [ "### Train/validation split" ] }, { "cell_type": "code", "execution_count": 18, "id": "25945762", "metadata": {}, "outputs": [], "source": [ "# Selected columns; `transaction_id` and `product_id` hold per-client unique counts\n", "\n", "selected_features = ['age','gender','first_issue_time','first_redeem_time','issue_redeem_delay',\n", " 'transaction_id','product_id','treatment']\n", "y = train_df_purchases['target']\n", "\n" ] }, { "cell_type": "code", "execution_count": 19, "id": "1bc58728", "metadata": {}, "outputs": [], "source": [ "X_train, X_val, y_train, y_val = train_test_split(train_df_purchases[selected_features],y,\n", " test_size=0.3, random_state=123)" ] }, { "cell_type": "code", "execution_count": 20, "id": "1bc884a8", "metadata": {}, "outputs": [], "source": [ "# Separate the treatment flag from the feature matrices\n", "treatment_train = X_train['treatment']\n", "treatment_val = X_val['treatment']\n", "\n", "X_train = X_train.drop(columns = ['treatment'])\n", "X_val = X_val.drop(columns = ['treatment'])" ] }, { "cell_type": "markdown", "id": "858d89ee", "metadata": {}, "source": [ "## Model" ] }, { "cell_type": "markdown", "id": "5892989f", "metadata": {}, "source": [ "We will use a simple SoloModel. For more complex models, please see the [models guide](https://www.uplift-modeling.com/en/latest/user_guide/models/index.html)." 
] }, { "cell_type": "code", "execution_count": 21, "id": "07b48079", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Learning rate set to 0.08499\n", "0:\tlearn: 0.6704391\ttotal: 195ms\tremaining: 3m 15s\n", "100:\tlearn: 0.5483643\ttotal: 9.53s\tremaining: 1m 24s\n", "200:\tlearn: 0.5448690\ttotal: 17.6s\tremaining: 1m 10s\n", "300:\tlearn: 0.5421207\ttotal: 26.2s\tremaining: 1m\n", "400:\tlearn: 0.5397941\ttotal: 33.7s\tremaining: 50.3s\n", "500:\tlearn: 0.5375020\ttotal: 41.4s\tremaining: 41.3s\n", "600:\tlearn: 0.5353400\ttotal: 49.5s\tremaining: 32.9s\n", "700:\tlearn: 0.5332216\ttotal: 57.6s\tremaining: 24.6s\n", "800:\tlearn: 0.5312421\ttotal: 1m 5s\tremaining: 16.2s\n", "900:\tlearn: 0.5294681\ttotal: 1m 12s\tremaining: 8.02s\n", "999:\tlearn: 0.5277310\ttotal: 1m 21s\tremaining: 0us\n" ] } ], "source": [ "estimator = CatBoostClassifier(verbose=100, \n", " cat_features=['gender'],\n", " random_state=42,\n", " thread_count=1)\n", "\n", "sm = SoloModel(estimator)\n", "sm = sm.fit(X_train, y_train, treatment_train)" ] }, { "cell_type": "code", "execution_count": 22, "id": "9aff558a", "metadata": {}, "outputs": [], "source": [ "# Predict uplift on the validation set and compute uplift@k for the top 10% (k=0.1)\n", "uplift_sm = sm.predict(X_val)\n", "sm_score = uplift_at_k(y_true=y_val, uplift=uplift_sm, treatment=treatment_val, strategy='overall', k=0.1)" ] }, { "cell_type": "code", "execution_count": 23, "id": "70da39bd", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.09103971954830925" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sm_score" ] }, { "cell_type": "markdown", "id": "dcbcb3e7", "metadata": {}, "source": [ "Thank you for reading! For more tutorials on uplift modeling, please visit this [page](https://www.uplift-modeling.com/en/latest/tutorials.html)." 
] }, { "cell_type": "code", "execution_count": null, "id": "2ed7486d", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.8" } }, "nbformat": 4, "nbformat_minor": 5 }