{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "# The 2nd place solution of the [RetailHero Uplift Modelling contest](https://retailhero.ai/c/uplift_modeling/overview) \n", "\n", "### To read the notebook, use [nbviewer](https://nbviewer.jupyter.org/github/kirrlix1994/Retail_hero/blob/master/Retail_hero_contest_2nd_place_solution.ipynb)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Summary of the solution:\n", "###### Main points to get good results in this competition:\n", "- Use few features and pre-select them. My best submission used only 6!\n", "\n", "\n", "- Do NOT fit very complex estimators. I mainly used different gradient boosting implementations (CatBoost / XGBoost / LightGBM); the best results came from models with trees of depth 1-2.\n", "\n", "\n", "- Use the class transformation approach for uplift modelling.\n", "\n", "\n", "- Make local validation as accurate as possible and do not rely only on the public validation score. To test hypotheses I made N (N = 30, 50) train-test splits and used the mean and standard deviation of the test scores to make decisions." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Additional info:\n", " \n", " Package with different uplift model implementations: \n", " [uplift modelling package](https://github.com/maks-sh/scikit-uplift/)\n", "\n", " Tutorial about uplift and approaches (in Russian): \n", " [uplift modeling approaches article](https://habr.com/ru/company/ru_mts/blog/485980/)\n", " \n", " Additional code with my feature engineering, experiments and validation: \n", " [retail hero research](/research)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Code is written for Python 3.7; requirements:\n", "- numpy==1.16.1\n", "- pandas==0.24.1\n", "- scikit-learn==0.21.3\n", "- scikit-uplift==0.0.3\n", "- xgboost==0.90\n", "- matplotlib==3.1.1" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import gc\n", "import pandas as pd\n", "import numpy as np\n", "from datetime import datetime\n", "\n", "from xgboost import XGBClassifier\n", "from sklift.models import ClassTransformation\n", "\n", "from sklift.metrics import uplift_at_k\n", "import matplotlib.pyplot as plt\n", "\n", "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### LOAD INITIAL DATA" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": true }, "outputs": [], "source": [ "df_clients = pd.read_csv(\n", " 'data/clients.csv', \n", " index_col='client_id', \n", " parse_dates=['first_issue_date', 'first_redeem_date']\n", ")\n", "\n", "# client_id stays a regular column here: it is selected by name for the merge later on\n", "df_purchases = pd.read_csv(\n", " 'data/purchases.csv',\n", " parse_dates=['transaction_datetime']\n", ")\n", "\n", "df_train = pd.read_csv('data/uplift_train.csv', index_col='client_id')\n", "df_test = pd.read_csv('data/uplift_test.csv', index_col='client_id')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### MAKE FEATURES FOR TRAIN AND TEST" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "###### Final 
second place submission used only 6 features: " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "'first_redeem_date' - convert the date to an integer feature (days since 1970-01-01)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# clients without a redeem date get 2019-03-19\n", "df_clients['first_redeem_date'] =\\\n", " df_clients['first_redeem_date']\\\n", " .fillna(datetime(2019, 3, 19, 0, 0))\n", "\n", "df_clients.loc[:, 'first_redeem_date'] =\\\n", " ((df_clients['first_redeem_date'] - pd.Timestamp(\"1970-01-01\")) // pd.Timedelta('1d'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Aggregate df_purchases data" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# two-level aggregation: \n", "\n", "# 1. aggregate to (client, transaction), keeping one row per transaction \n", "# ('express_points_spent' and 'purchase_sum' are the same for all rows of one transaction)\n", "\n", "# 2. aggregate all transactions to the client level.\n", "\n", "\n", "df_purch = df_purchases\\\n", " .groupby(['client_id','transaction_id'])[['express_points_spent', 'purchase_sum']]\\\n", " .last()\\\n", " .groupby('client_id')\\\n", " .agg({\n", " 'express_points_spent': ['mean', 'sum'], \n", " 'purchase_sum': ['sum']\n", " })\n", " \n", "# set readable column names:\n", "df_purch.columns =\\\n", " ['express_spent_mean', 'express_points_spent_sum', 'purchase_sum__sum']" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# 'regular_points_received_sum_last_m': regular points received in transactions after 2019-02-18 00:00\n", "\n", "reg_points_last_m = df_purchases[\n", " df_purchases['transaction_datetime'] > '2019-02-18'\n", " ]\\\n", " .groupby(['client_id', 'transaction_id'])['regular_points_received']\\\n", " .last()\\\n", " .groupby('client_id')\\\n", " .sum()\n", " \n", "reg_points_last_m = pd.DataFrame({\n", " 'regular_points_received_sum_last_m': reg_points_last_m \n", "})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Make 
'transaction_datetime' from the purchases data frame an integer (days since 1970-01-01). " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Join the clients and purchases data frames and calculate the difference between 'first_redeem_date' and 'transaction_datetime' (date_diff)." ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": true }, "outputs": [], "source": [ "df_purchases.loc[:,'purch_day'] =\\\n", " ((df_purchases['transaction_datetime'] - pd.Timestamp(\"1970-01-01\")) // pd.Timedelta('1d'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Calculate features from purchases data for all transactions on or after the redeem date:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# 'after_redeem_sum': total purchase sum of transactions with date_diff <= 0\n", "\n", "df_purch_joined = pd.merge(\n", " df_purchases[\n", " ['client_id', 'purch_day', 'transaction_id', 'purchase_sum']\n", " ],\n", " df_clients\\\n", " .reset_index()[\n", " ['client_id', 'first_redeem_date']\n", " ], on='client_id', how='left')\n", "\n", "df_purch_joined = df_purch_joined\\\n", " .assign(date_diff=\\\n", " df_purch_joined['first_redeem_date'] - df_purch_joined['purch_day']\n", " )\n", "\n", "df_purch_agg = df_purch_joined[\n", " df_purch_joined['date_diff'] <= 0\n", " ]\\\n", " .groupby(\n", " ['client_id', 'transaction_id']\n", " )\\\n", " .last()\\\n", " .groupby('client_id')['purchase_sum']\\\n", " .sum()\n", " \n", "after_redeem_sum = pd.DataFrame(data={\n", " 'after_redeem_sum': df_purch_agg\n", "})\n", "\n", "del df_purch_joined, df_purch_agg\n", "gc.collect();" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "'purch_delta' - number of days between the first and the last transaction, counted inclusively " ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": true }, "outputs": [], "source": [ "df_purch_delta_agg = df_purchases\\\n", " .groupby('client_id')\\\n", " .agg({\n", " 'purch_day': ['max', 'min']\n", " })\n", "\n", "df_purch_delta = pd.DataFrame(\n", 
" data=df_purch_delta_agg['purch_day']['max'] -\\\n", " df_purch_delta_agg['purch_day']['min'] + 1,\n", " columns=['purch_delta']\n", ")\n", "\n", "del df_purch_delta_agg\n", "gc.collect();" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "------" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Concat all data together:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": true }, "outputs": [], "source": [ "df_feats = pd.concat([\n", " df_clients[['first_redeem_date']], \n", " df_purch, \n", " df_purch_delta, \n", " reg_points_last_m, \n", " after_redeem_sum\n", "], axis=1, sort=False)\n", "\n", "df_feats = df_feats\\\n", " .assign(\n", " avg_spent_perday=\\\n", " df_feats['purchase_sum__sum'] / df_feats['purch_delta'],\n", " after_redeem_sum_perday =\\\n", " df_feats['after_redeem_sum'] / df_feats['purch_delta']\n", " )\\\n", " .drop([\n", " 'purch_delta', 'purchase_sum__sum', 'after_redeem_sum'\n", " ], axis=1\n", " )" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(400162, 6)\n" ] }, { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>first_redeem_date</th>\n", " <th>express_spent_mean</th>\n", " <th>express_points_spent_sum</th>\n", " <th>regular_points_received_sum_last_m</th>\n", " <th>avg_spent_perday</th>\n", " <th>after_redeem_sum_perday</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>000012768d</th>\n", " <td>17535</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>10.0</td>\n", " <td>26.951923</td>\n", " <td>26.951923</td>\n", " </tr>\n", " <tr>\n", " <th>000036f903</th>\n", " <td>17279</td>\n", " <td>0.0</td>\n", " <td>0.0</td>\n", " <td>13.7</td>\n", " <td>89.136364</td>\n", " <td>89.136364</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " first_redeem_date express_spent_mean express_points_spent_sum \\\n", "000012768d 17535 0.0 0.0 \n", "000036f903 17279 0.0 0.0 \n", "\n", " regular_points_received_sum_last_m avg_spent_perday \\\n", "000012768d 10.0 26.951923 \n", "000036f903 13.7 89.136364 \n", "\n", " after_redeem_sum_perday \n", "000012768d 26.951923 \n", "000036f903 89.136364 " ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# train and test features\n", "print(df_feats.shape)\n", "df_feats.head(2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Make train and test data: join the features to the client ids (plus treatment and target for train)." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Train df shape: (200039, 8)\n", "Test df shape: (200123, 6)\n" ] } ], "source": [ "# train data:\n", "df_train_feats = df_train\\\n", " .join(\n", " df_feats, \n", " how='left'\n", " )\n", " \n", "df_test_feats = df_test\\\n", " .join(\n", " df_feats, \n", " how='left'\n", " )\n", " \n", "print(f'Train df shape: {df_train_feats.shape}')\n", "print(f'Test df shape: {df_test_feats.shape}')" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# save final train / test data \n", "df_train_feats.to_csv('data/retail_hero_final_model_train_data.csv')\n", "df_test_feats.to_csv('data/retail_hero_final_model_test_data.csv')" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "-----" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### SET MODEL\n", "Use the class transformation approach for uplift prediction, with XGBoost as the estimator." 
] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": true }, "outputs": [], "source": [ "xgb_est_params = {\n", " 'max_depth': 2,\n", " 'learning_rate': 0.2, \n", " 'n_estimators': 100,\n", " 'nthread': 40,\n", " 'n_gpus': 0,\n", " 'seed': 42\n", "}\n", "\n", "estimator = XGBClassifier(\n", " **xgb_est_params\n", ")\n", "\n", "uplift_model_cl_tr = ClassTransformation(\n", " estimator=estimator\n", ")" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "-----" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### FIT MODEL ON ALL TRAIN DATA" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "scrolled": true }, "outputs": [], "source": [ "uplift_model_cl_tr.fit(\n", " X=df_train_feats\\\n", " .drop(\n", " ['treatment_flg', 'target'], \n", " axis=1\n", " ),\n", " y=df_train_feats['target'],\n", " treatment=df_train_feats['treatment_flg']\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "--------" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Make predictions on whole train and test sets.\n", "\n", "#### Calculate uplift@30% metric on train data.\n", "\n", "###### **NOTE:** In this contest the uplift@30% metric is calculated as follows: \n", "All test data are sorted in descending order by 'predicted uplift'. Then the target share (conversion) is calculated in the top 30% of the treatment group and in the top 30% of the control group SEPARATELY. The metric is the difference between the two conversions.\n", "
\n", "\n", " Usually the uplift@30% metric is calculated differently: first the top 30% of ALL test data is selected, and only then are the conversions in the treatment and control groups calculated and subtracted." ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# predictions on train and test\n", "uplift_tr = uplift_model_cl_tr.predict(\n", " df_train_feats\\\n", " .drop(['treatment_flg', 'target'], axis=1),\n", ")\n", "\n", "uplift_ts = uplift_model_cl_tr.predict(\n", " df_test_feats\n", ")" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Uplift@30% on train data: 0.105\n" ] } ], "source": [ "# score on train:\n", "df_train_scores = df_train_feats[['treatment_flg', 'target']]\\\n", " .assign(uplift_score=uplift_tr)\n", " \n", "train_score = uplift_at_k(\n", " y_true=df_train_scores['target'],\n", " uplift=df_train_scores['uplift_score'],\n", " treatment=df_train_scores['treatment_flg']\n", ")\n", "\n", "print(f'Uplift@30% on train data: {train_score:.3f}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Make submission:" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Submit data shape: (200123, 1)\n", "\n", "client_id,uplift\n", "000048b7a6,0.03721845\n", "000073194a,0.019023776\n", "00007c7133,0.032182574\n", "00007f9014,0.016914487\n" ] } ], "source": [ "df_submit = df_test_feats\\\n", " .assign(uplift=uplift_ts)[['uplift']]\n", "\n", "print(f'Submit data shape: {df_submit.shape}\\n')\n", "df_submit.head(2)\n", "\n", "df_submit.to_csv('submissions/retail_hero_2nd_place_submit.csv')\n", "\n", "!head -5 submissions/retail_hero_2nd_place_submit.csv" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "-----" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "------" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### 
FEATURE IMPORTANCE" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How to calculate feature importance in uplift models is still an area of research. \n", "\n", "Here feature importance is just the vanilla gain of the XGBoost estimator used in the class transformation approach.\n" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>feature</th>\n", " <th>value</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>4</th>\n", " <td>first_redeem_date</td>\n", " <td>64.531791</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>express_points_spent_sum</td>\n", " <td>14.171003</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>express_spent_mean</td>\n", " <td>14.151961</td>\n", " </tr>\n", " <tr>\n", " <th>5</th>\n", " <td>regular_points_received_sum_last_m</td>\n", " <td>7.514503</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>avg_spent_perday</td>\n", " <td>5.645311</td>\n", " </tr>\n", " <tr>\n", " <th>0</th>\n", " <td>after_redeem_sum_perday</td>\n", " <td>4.776374</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " feature value\n", "4 first_redeem_date 64.531791\n", "2 express_points_spent_sum 14.171003\n", "3 express_spent_mean 14.151961\n", "5 regular_points_received_sum_last_m 7.514503 \n", "1 avg_spent_perday 5.645311 \n", "0 after_redeem_sum_perday 4.776374 " ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_feat_imp = pd.DataFrame([\n", " uplift_model_cl_tr\\\n", " .estimator\\\n", " .get_booster()\\\n", " .get_score(importance_type='gain')\n", " ]\n", ").T.reset_index()\n", "\n", "df_feat_imp.columns =\\\n", " [\"feature\", \"value\"]\n", " \n", "df_feat_imp\\\n", " .sort_values('value', ascending=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "------" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### PLOT UPLIFT PREDICTIONS FOR TRAIN / TEST DATA" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "plt.figure(figsize=(20, 8));\n", "\n", "plt.hist(\n", " df_train_scores['uplift_score'],\n", " bins=200, \n", " color='red', \n", ");\n", "\n", "plt.xlim(-0.2, 0.75);\n", "plt.title('Uplift predictions on train data', size=15);\n", "\n", "\n", "plt.figure(figsize=(20, 8));\n", "\n", "plt.hist(\n", " df_submit['uplift'],\n", " bins=200, \n", " color='blue', \n", ");\n", "\n", "plt.xlim(-0.2, 0.75);\n", "plt.title('Uplift predictions on test data', size=15);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Looks quite similar." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.3" } }, "nbformat": 4, "nbformat_minor": 2 }