{ "nbformat": 4, "nbformat_minor": 0, "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": false, "sideBar": true, "skip_h1_title": true, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": false }, "colab": { "name": "2021-08-18-Causal_inference_for_decision_making_in_growth_hacking_and_upselling_in_Python.ipynb", "provenance": [], "collapsed_sections": [] } }, "cells": [ { "cell_type": "markdown", "metadata": { "id": "iOQxSDQrumWb" }, "source": [ "# Causal inference for decision-making in growth hacking and upselling in Python\n", "\n", "> \"In this article we discuss differences between experimental and observational data and pitfalls in using the latter for data-driven decision-making.\"\n", "- toc: true\n", "- branch: master\n", "- badges: true\n", "- comments: true\n", "- categories: [python, dowhy, causal inference]" ] }, { "cell_type": "markdown", "metadata": { "id": "6Oo6ETETld4t" }, "source": [ "\n", "## Introduction\n", "\n", "Wow, growth hacking *and* upselling all in the same article? Also Python.\n", "\n", "Okay, let's start at the beginning. Imagine the following scenario: You're responsible for increasing the amount of money users spend on your e-commerce platform.\n", "\n", "You and your team come up with different measures you could implement to achieve your goal. 
Two of these measures could be:\n", "\n", "- Provide a discount on your best-selling items,\n", "- Implement a rewards program that incentivizes repeat purchases.\n", "\n", "Both of these measures are fairly complex with each incurring a certain, probably known, amount of cost and an unknown effect on your customers' spending behaviour.\n", "\n", "To decide which of these two possible measures is worth both the effort and incurred cost you need to estimate their effect on customer spend.\n", "\n", "A natural way of estimating this effect is computing the following:\n", "\n", "$\\textrm{avg}(\\textrm{spend} | \\textrm{treatment} = 1) - \\textrm{avg}(\\textrm{spend} | \\textrm{treatment} = 0) = \\textrm{ATE}$.\n", "\n", "Essentially you would compute the average spend of users who received the treatment (received a discount or signed up for rewards) and subtract from that the average spend of users who didn't receive the treatment.\n", "\n", "Without discussing the details of the underlying potential outcomes framework, the quantity estimated by the above expression is called the average treatment effect (ATE).\n", "\n", "## Let's estimate the average treatment effect and make a decision!\n", "\n", "So now we'll just analyze our e-commerce data of treated and untreated customers and compute the average treatment effect (ATE) for each proposed measure, right? Right?\n", "\n", "Before you rush ahead with your ATE computations - now is a good time to take a step back and contemplate how your data was generated in the first place (the data-generating process).\n", "\n", "## References and further material ...\n", "\n", "Before we continue: My example here is based on a tutorial by the authors of the excellent DoWhy library. 
You can find the original tutorial here:\n", "\n", "https://github.com/microsoft/dowhy/blob/master/docs/source/example_notebooks/dowhy_example_effect_of_memberrewards_program.ipynb\n", "\n", "And more on DoWhy here: https://microsoft.github.io/dowhy/" ] }, { "cell_type": "markdown", "metadata": { "id": "axpOMHkmSMxB" }, "source": [ "## Install and load libraries" ] }, { "cell_type": "code", "metadata": { "id": "dbEmWDDng9Xm" }, "source": [ "!pip install dowhy --quiet" ], "execution_count": 1, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "NtKCnld4-mAE" }, "source": [ "import random\n", "\n", "import pandas as pd\n", "import numpy as np\n", "\n", "np.random.seed(42)\n", "random.seed(42)" ], "execution_count": 2, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "zdJwuUU8it_G" }, "source": [ "## Randomized controlled trial / experimental data\n", "\n", "So where were we ... ah right! Where does our e-commerce data come from?\n", "\n", "Since we don't actually run an e-commerce operation here we will have to simulate our data (remember: these ideas are based on the above DoWhy tutorial).\n", "\n", "Imagine we observe the monthly spend of each of our 10,000 users over the course of a year. Each user's monthly spend follows a certain distribution (here, a Poisson distribution) and there are both high and low spenders with different mean spends.\n", "\n", "Over the course of the year, each user can sign up to our rewards program in any month and once they have signed up their spend goes up by 50% relative to what they would've spent without signing up.\n", "\n", "So far so mundane: Different customers show different spending behaviour and signing up to our rewards program increases their spend.\n", "\n", "Now the big question is: How are treatment assignment (rewards program signup) and outcome (spending behaviour) related? 
\n", "\n", "If treatment and outcome, interpreted as random variables, are independent of one another then according to the potential outcomes framework we can compute the ATE as easily as shown above:\n", "\n", "$\\textrm{ATE} = \\textrm{avg}(\\textrm{spend} | \\textrm{treatment} = 1) - \\textrm{avg}(\\textrm{spend} | \\textrm{treatment} = 0)$\n", "\n", "When are treatment and outcome independent? The gold standard for achieving their independence in a data set is the randomized controlled trial (RCT).\n", "\n", "In our scenario an RCT would mean randomly signing up our users to our rewards program, independent of their spending behaviour or any other characteristic.\n", "\n", "So we would go through our list of 10,000 users and flip a coin for each of them, sign those whose coin comes up heads up to our program in a random month of the year, and send them all on their merry way to continue buying stuff in our online shop.\n", "\n", "Let's put all of this into a bit of code that simulates the spending behaviour of our users according to our thought experiment:" ] }, { "cell_type": "code", "metadata": { "id": "mlw4vvkie8pz" }, "source": [ "# Creating some simulated data for our example\n", "num_users = 10000\n", "num_months = 12\n", "\n", "df = pd.DataFrame({\n", " 'user_id': np.repeat(np.arange(num_users), num_months),\n", " 'month': np.tile(np.arange(1, num_months+1), num_users), # months are from 1 to 12\n", " 'high_spender': np.repeat(np.random.randint(0, 2, size=num_users), num_months),\n", "})\n", "\n", "df['spend'] = None\n", "df.loc[df['high_spender'] == 0, 'spend'] = np.random.poisson(250, df.loc[df['high_spender'] == 0].shape[0])\n", "df.loc[df['high_spender'] == 1, 'spend'] = np.random.poisson(750, df.loc[df['high_spender'] == 1].shape[0])\n", "df[\"spend\"] = df[\"spend\"] - df[\"month\"] * 10\n", "\n", "signup_months = np.random.choice(\n", " np.arange(1, num_months),\n", " num_users\n", ") * np.random.randint(0, 2, size=num_users) # signup_months == 0 
means customer did not sign up\n", "\n", "df['signup_month'] = np.repeat(signup_months, num_months)\n", "\n", "# A customer is in the treatment group if and only if they signed up\n", "df[\"treatment\"] = df[\"signup_month\"] > 0\n", "\n", "# Simulating a simple treatment effect of 50%\n", "after_signup = (df[\"signup_month\"] < df[\"month\"]) & (df[\"treatment\"])\n", "df.loc[after_signup, \"spend\"] = df[after_signup][\"spend\"] * 1.5" ], "execution_count": 3, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "XrY6roZSZjCj" }, "source": [ "Let's look at user `0` and their treatment assignment as well as spend (since we're sampling random variables here you'll see something different from me):" ] }, { "cell_type": "code", "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 421 }, "id": "hlwbC0lLZiGw", "outputId": "240c8090-2994-46a0-945a-a4277b3fcf39" }, "source": [ "df.loc[df['user_id'] == 0]" ], "execution_count": 4, "outputs": [ { "output_type": "execute_result", "data": { "text/html": [ "
" ], "text/plain": [ " user_id month high_spender spend signup_month treatment\n", "0 0 1 0 235 0 False\n", "1 0 2 0 249 0 False\n", "2 0 3 0 240 0 False\n", "3 0 4 0 224 0 False\n", "4 0 5 0 184 0 False\n", "5 0 6 0 172 0 False\n", "6 0 7 0 182 0 False\n", "7 0 8 0 155 0 False\n", "8 0 9 0 120 0 False\n", "9 0 10 0 153 0 False\n", "10 0 11 0 148 0 False\n", "11 0 12 0 159 0 False" ] }, "metadata": { "tags": [] }, "execution_count": 4 } ] }, { "cell_type": "markdown", "metadata": { "id": "Q7SPSlRTe8p1" }, "source": [ "## Average treatment effect on post-signup spend for experimental data\n", "\n", "The effect we're interested in is the impact of rewards signup on spending behaviour - i.e. the effect on post-signup spend.\n", "\n", "Since customers can sign up any month of the year, we'll choose one month at random and compute the effect with respect to that one month.\n", "\n", "So let's create a new table from our time series where we collect post-signup spend for those customers that signed up in `month = 6` alongside the spend of customers who never signed up." ] }, { "cell_type": "code", "metadata": { "scrolled": true, "colab": { "base_uri": "https://localhost:8080/" }, "id": "_HxHA4GXe8p3", "outputId": "39dc76c8-dbcb-4193-a370-b81c1ad85dd2" }, "source": [ "month = 6\n", "\n", "post_signup_spend = (\n", " df[df.signup_month.isin([0, month])]\n", " .groupby([\"user_id\", \"signup_month\", \"treatment\"])\n", " .apply(\n", " lambda x: pd.Series(\n", " {\n", " \"post_spend\": x.loc[x.month > month, \"spend\"].mean(),\n", " }\n", " )\n", " )\n", " .reset_index()\n", ")\n", "print(post_signup_spend)" ], "execution_count": 5, "outputs": [ { "output_type": "stream", "text": [ " user_id signup_month treatment post_spend\n", "0 0 0 False 152.833333\n", "1 3 0 False 162.166667\n", "2 4 0 False 146.333333\n", "3 6 0 False 153.666667\n", "4 7 6 True 240.750000\n", "... ... ... ... 
...\n", "5451 9990 0 False 629.833333\n", "5452 9993 0 False 674.500000\n", "5453 9994 0 False 681.000000\n", "5454 9995 0 False 641.333333\n", "5455 9998 0 False 658.833333\n", "\n", "[5456 rows x 4 columns]\n" ], "name": "stdout" } ] }, { "cell_type": "markdown", "metadata": { "id": "meQD6PzFq-ML" }, "source": [ "To get the average treatment effect (ATE) of our rewards signup treatment we now compute the average post-signup spend of the customers who signed up and subtract from that the average spend of users who didn't sign up:" ] }, { "cell_type": "code", "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 142 }, "id": "mispkMh35rVu", "outputId": "ebe193cf-a539-4275-b75f-f0b0be235626" }, "source": [ "post_spend = post_signup_spend\\\n", " .groupby('treatment')\\\n", " .agg({'post_spend': 'mean'})\n", "\n", "post_spend" ], "execution_count": 6, "outputs": [ { "output_type": "execute_result", "data": { "text/html": [ "
" ], "text/plain": [ " post_spend\n", "treatment \n", "False 403.512239\n", "True 610.140371" ] }, "metadata": { "tags": [] }, "execution_count": 6 } ] }, { "cell_type": "markdown", "metadata": { "id": "wOJ2rZ25rvKE" }, "source": [ "So the ATE of rewards signup on post-signup spend is:" ] }, { "cell_type": "code", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "5vAxLCIoUTVT", "outputId": "17a2496e-a140-46df-99b2-2dd2cc04d0bc" }, "source": [ "post_spend.loc[True, 'post_spend'] - post_spend.loc[False, 'post_spend']" ], "execution_count": 7, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "206.62813242372852" ] }, "metadata": { "tags": [] }, "execution_count": 7 } ] }, { "cell_type": "markdown", "metadata": { "id": "OKEJdagJr1IO" }, "source": [ "Since we simulated the treatment effect ourselves (50% post-signup spend increase) let's see if we can recover this effect from our data:" ] }, { "cell_type": "code", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "Ia_DKBM6rmS5", "outputId": "3af5a20b-51e0-4c71-8e85-16530609d89a" }, "source": [ "post_spend.loc[True, 'post_spend'] / post_spend.loc[False, 'post_spend']" ], "execution_count": 8, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "1.5120740154875112" ] }, "metadata": { "tags": [] }, "execution_count": 8 } ] }, { "cell_type": "markdown", "metadata": { "id": "urSbFWzysIHV" }, "source": [ "The post-signup spend for treated customers is roughly 50% greater than the spend for untreated customers - exactly the treatment effect we simulated!\n", "\n", "Remember, however, that we are dealing with clean experimental data from a randomized controlled trial (RCT) here! 
The potential outcomes framework tells us that for data from an RCT the simple ATE formula we used here yields the correct treatment effect due to independence of treatment assignment and outcome.\n", "\n", "So the fact that we recovered the actual (simulated) treatment effect is nice to see but not surprising." ] }, { "cell_type": "markdown", "metadata": { "id": "Nfg1xDo7zABH" }, "source": [ "## The issue with randomized controlled trials and observational data\n", "\n", "Our above thought experiment where we randomly assigned our customers to our rewards program isn't very realistic.\n", "\n", "Randomly signing up paying customers to rewards programs without their consent may upset some and may not even be permissible. The same issue with randomized treatment assignment pops up everywhere - clean randomized controlled trials are oftentimes too expensive, infeasible to implement, unethical, or not permitted.\n", "\n", "But since we still need to experiment with our shop to drive spending behaviour we'll still go ahead and implement our rewards program. Only this time we'll place a regular signup page in our shop where our customers can decide for themselves if they want to sign up or not.\n", "\n", "Activating our signup page and simply observing how users and their spend behave gives us **observational data**.\n", "\n", "We usually call \"observational data\" just \"data\" without giving much thought to where they came from. I mean we've all dealt with lots of different kinds of data (marketing data, R&D measurements, HR data, etc.) and all these data were simply \"observed\" and didn't come out of a carefully set up experiment.\n", "\n", "Simulating our observational data we've got the same 10,000 customers over a span of a year. We still have the same high and low spenders.\n", "\n", "Only now our high spenders are far more likely to sign up to our rewards program than our low spenders. 
My reasoning for this is that customers who spend more are also more likely to show greater brand loyalty towards us and our rewards program. Further, they visit our shop more frequently and hence are more likely to notice our new rewards program and the signup page. We could also add this behaviour as random variables to our simulation below but we'll just take a shortcut and give low spenders a 5% chance of signing up and high spenders a 95% chance." ] }, { "cell_type": "code", "metadata": { "id": "vFQ5Bgf2zzKQ", "colab": { "base_uri": "https://localhost:8080/", "height": 419 }, "outputId": "2eef91a7-247a-45c1-fbbf-780900964a51" }, "source": [ "num_users = 10000\n", "num_months = 12\n", "\n", "df = pd.DataFrame({\n", " 'user_id': np.repeat(np.arange(num_users), num_months),\n", " 'month': np.tile(np.arange(1, num_months+1), num_users), # months are from 1 to 12\n", " 'high_spender': np.repeat(np.random.randint(0, 2, size=num_users), num_months),\n", "})\n", "\n", "df['spend'] = None\n", "df.loc[df['high_spender'] == 0, 'spend'] = np.random.poisson(250, df.loc[df['high_spender'] == 0].shape[0])\n", "df.loc[df['high_spender'] == 1, 'spend'] = np.random.poisson(750, df.loc[df['high_spender'] == 1].shape[0])\n", "\n", "signup_months = df[['user_id', 'high_spender']].drop_duplicates().copy()\n", "signup_months['signup_month'] = None\n", "\n", "signup_months.loc[signup_months['high_spender'] == 0, 'signup_month'] = np.random.choice(\n", " np.arange(1, num_months),\n", " (signup_months['high_spender'] == 0).sum()\n", ") * np.random.binomial(1, .05, size=(signup_months['high_spender'] == 0).sum())\n", "\n", "signup_months.loc[signup_months['high_spender'] == 1, 'signup_month'] = np.random.choice(\n", " np.arange(1, num_months),\n", " (signup_months['high_spender'] == 1).sum()\n", ") * np.random.binomial(1, .95, size=(signup_months['high_spender'] == 1).sum())\n", "\n", "df = df.merge(signup_months)\n", "\n", "df[\"treatment\"] = df[\"signup_month\"] > 0\n", "\n", "after_signup = 
(df[\"signup_month\"] < df[\"month\"]) & (df[\"treatment\"])\n", "df.loc[after_signup, \"spend\"] = df[after_signup][\"spend\"] * 1.5\n", "\n", "df" ], "execution_count": 9, "outputs": [ { "output_type": "execute_result", "data": { "text/html": [ "
" ], "text/plain": [ " user_id month high_spender spend signup_month treatment\n", "0 0 1 1 778 11 True\n", "1 0 2 1 726 11 True\n", "2 0 3 1 704 11 True\n", "3 0 4 1 723 11 True\n", "4 0 5 1 718 11 True\n", "... ... ... ... ... ... ...\n", "119995 9999 8 1 745 0 False\n", "119996 9999 9 1 777 0 False\n", "119997 9999 10 1 776 0 False\n", "119998 9999 11 1 744 0 False\n", "119999 9999 12 1 768 0 False\n", "\n", "[120000 rows x 6 columns]" ] }, "metadata": { "tags": [] }, "execution_count": 9 } ] }, { "cell_type": "markdown", "metadata": { "id": "rpP-8x-VYFAa" }, "source": [ "Now imagine you weren't aware of causality, confounders, high / low spenders, and all that. You simply published your rewards signup page and observed your customers' spending behaviour over a span of a year. Chances are you'll compute the average treatment effect the exact same way we did above for our randomized controlled trial:" ] }, { "cell_type": "code", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "MuQ0udCR9qYK", "outputId": "ec2a770f-3736-4345-87b7-f20fc34854ca" }, "source": [ "month = 6\n", "\n", "post_signup_spend = (\n", " df[df.signup_month.isin([0, month])]\n", " .groupby([\"user_id\", \"signup_month\", \"treatment\"])\n", " .apply(\n", " lambda x: pd.Series(\n", " {\n", " \"post_spend\": x.loc[x.month > month, \"spend\"].mean(),\n", " \"pre_spend\": x.loc[x.month < month, \"spend\"].mean(),\n", " }\n", " )\n", " )\n", " .reset_index()\n", ")\n", "print(post_signup_spend)" ], "execution_count": 10, "outputs": [ { "output_type": "stream", "text": [ " user_id signup_month treatment post_spend pre_spend\n", "0 1 0 False 251.666667 239.0\n", "1 3 0 False 246.166667 252.8\n", "2 4 0 False 740.833333 737.2\n", "3 6 0 False 254.333333 247.0\n", "4 7 0 False 249.166667 253.2\n", "... ... ... ... ... 
...\n", "5499 9992 0 False 246.000000 240.6\n", "5500 9995 0 False 254.833333 256.4\n", "5501 9996 0 False 248.833333 239.8\n", "5502 9997 0 False 249.500000 247.8\n", "5503 9999 0 False 761.666667 752.6\n", "\n", "[5504 rows x 5 columns]\n" ], "name": "stdout" } ] }, { "cell_type": "code", "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 142 }, "id": "yoJqX6g4Tjtw", "outputId": "5e17a814-b50e-4978-ffbd-8ce065e141a6" }, "source": [ "post_spend = post_signup_spend\\\n", " .groupby('treatment')\\\n", " .agg({'post_spend': 'mean'})\n", "\n", "post_spend" ], "execution_count": 11, "outputs": [ { "output_type": "execute_result", "data": { "text/html": [ "
" ], "text/plain": [ " post_spend\n", "treatment \n", "False 275.891760\n", "True 1075.543699" ] }, "metadata": { "tags": [] }, "execution_count": 11 } ] }, { "cell_type": "code", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "7D74LF4kUAkB", "outputId": "6d8d05f5-b555-4343-e7d0-111122159cef" }, "source": [ "post_spend.loc[True, 'post_spend'] - post_spend.loc[False, 'post_spend']" ], "execution_count": 12, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "799.6519394104552" ] }, "metadata": { "tags": [] }, "execution_count": 12 } ] }, { "cell_type": "code", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "NbeU9ZmlZGIZ", "outputId": "217c9554-bbae-467e-8405-20dafc6c3098" }, "source": [ "post_spend.loc[True, 'post_spend'] / post_spend.loc[False, 'post_spend']" ], "execution_count": 13, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "3.8984263250854134" ] }, "metadata": { "tags": [] }, "execution_count": 13 } ] }, { "cell_type": "markdown", "metadata": { "id": "W308IJJ4ZKBv" }, "source": [ "Performing the exact same computation as above, we now estimate that treated customers spend about 3.9 times as much as untreated customers, i.e. an increase of roughly 290% instead of the actual 50%!\n", "\n", "So what went wrong here?\n", "\n", "Observational data got us!\n", "\n", "Realize that in our observational data the outcome (spend) is not independent of treatment assignment (rewards program signup): High spenders are far more likely to sign up and hence are overrepresented in our treatment group, while low spenders are overrepresented in our control group (users that didn't sign up).\n", "\n", "So when we compute the above difference or ratio we don't just see the average treatment effect of rewards signup; we also see the inherent difference in spending between high and low spenders.\n", "\n", "So if we ignore how our observational data are generated we'll overestimate the effect our rewards program has and likely make decisions that seem 
to be supported by data but in reality aren't.\n", "\n", "Also notice that we often make this same mistake when training machine learning algorithms on observational data. Chances are someone will ask you to train a regression model to predict the effectiveness of the rewards program and your model will end up with the same inflated estimate as above." ] }, { "cell_type": "markdown", "metadata": { "id": "-wv-YTihe8p4" }, "source": [ "So how do we fix this? And how can we estimate the true treatment effect from our observational data?\n", "\n", "Generally, we know from experience in e-commerce that people who tend to spend more are more likely to sign up to our rewards program. So we could segment our users into spend buckets and compute the treatment effect within each bucket to try and break this confounding link in our observational data.\n", "\n", "Notice that in practice we won't have a `high_spender` flag for our customers so we'll have to go by our customers' observed spending behaviour.\n", "\n", "The causal inference framework offers an established approach here: Relying on our domain knowledge, we define a causal model that describes how we believe our observational data were generated.\n", "\n", "Let's draw this as a graph with nodes and edges:" ] }, { "cell_type": "code", "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 248 }, "id": "Y51lPko52klm", "outputId": "f02cc228-c836-41e1-a008-6bbb485ec8d4" }, "source": [ "import os, sys\n", "sys.path.append(os.path.abspath(\"../../../\"))\n", "import dowhy\n", "\n", "causal_graph = \"\"\"digraph {\n", "treatment[label=\"Program Signup in month i\"];\n", "pre_spend;\n", "post_spend;\n", "U[label=\"Unobserved Confounders\"]; \n", "pre_spend -> treatment;\n", "pre_spend -> post_spend;\n", "treatment->post_spend;\n", "U->treatment; U->pre_spend; U->post_spend;\n", "}\"\"\"\n", "\n", "model = dowhy.CausalModel(\n", " data=post_signup_spend,\n", " graph=causal_graph.replace(\"\\n\", \" 
\"),\n", " treatment=\"treatment\",\n", " outcome=\"post_spend\"\n", ")\n", "model.view_model()" ], "execution_count": 14, "outputs": [ { "output_type": "display_data", "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "tags": [] } } ] }, { "cell_type": "markdown", "metadata": { "id": "QxJ6MxYXmvhD" }, "source": [ "Our causal model states what we described above: Pre-signup spend influences both rewards signup (treatment assignment) and post-signup spend. This is the story about our high and low spenders.\n", "\n", "Treatment (rewards signup) influences post-signup spending behaviour - this is the effect we're actually interested in.\n", "\n", "We also added a node `U` to signify possible other confounding factors that may exist in reality but weren't observed as part of our data." ] }, { "cell_type": "markdown", "metadata": { "id": "nkek0w7Re8p5" }, "source": [ "## Identification / identifying the causal effect\n", "\n", "We will now apply do-calculus to our causal model from above to figure out ways to cleanly estimate the treatment effect we're after:" ] }, { "cell_type": "code", "metadata": { "scrolled": true, "colab": { "base_uri": "https://localhost:8080/" }, "id": "j1smWlfve8p5", "outputId": "7d06e053-e073-4354-ac65-50c63ef8fa7f" }, "source": [ "identified_estimand = model.identify_effect(proceed_when_unidentifiable=True)\n", "print(identified_estimand)" ], "execution_count": 15, "outputs": [ { "output_type": "stream", "text": [ "Estimand type: nonparametric-ate\n", "\n", "### Estimand : 1\n", "Estimand name: backdoor\n", "Estimand expression:\n", " d \n", "────────────(Expectation(post_spend|pre_spend))\n", "d[treatment] \n", "Estimand assumption 1, Unconfoundedness: If U→{treatment} and U→post_spend then P(post_spend|treatment,pre_spend,U) = P(post_spend|treatment,pre_spend)\n", "\n", "### Estimand : 2\n", "Estimand name: iv\n", "No such variable found!\n", "\n", "### Estimand : 3\n", "Estimand name: frontdoor\n", "No such variable found!\n", "\n" ], "name": "stdout" } ] }, { "cell_type": "markdown", "metadata": { "id": "RZKrj9Qne8p6" }, "source": [ "Very broadly and sloppily stated there are three ways to segment (or slice and dice) our observational 
data to get to subsets of our data within which we can cleanly compute the average treatment effect:\n", "\n", "- Backdoor adjustment,\n", "- Frontdoor adjustment, and\n", "- Instrumental variables.\n", "\n", "The above printout tells us that based on both the causal model we constructed and our observational data there is a backdoor-adjusted estimator for our desired treatment effect.\n", "\n", "This backdoor adjustment actually follows closely what we already said above: We'll compute the post-spend given pre-spend (segment our customers based on their spending behaviour)." ] }, { "cell_type": "markdown", "metadata": { "id": "KwNeZccJe8p6" }, "source": [ "## Estimating the treatment effect\n", "\n", "There are various ways to perform the backdoor adjustment that DoWhy identified for us above. One of them is called propensity score matching:" ] }, { "cell_type": "code", "metadata": { "scrolled": true, "colab": { "base_uri": "https://localhost:8080/" }, "id": "QXRu-N1ie8p7", "outputId": "ce197642-0ac6-4d68-c0b3-a1df51fd1f1e" }, "source": [ "estimate = model.estimate_effect(\n", " identified_estimand,\n", " method_name=\"backdoor1.propensity_score_matching\",\n", " target_units=\"ate\"\n", ")\n", "\n", "print(estimate)" ], "execution_count": 16, "outputs": [ { "output_type": "stream", "text": [ "/usr/local/lib/python3.7/dist-packages/sklearn/utils/validation.py:760: DataConversionWarning: A column-vector y was passed when a 1d array was expected. 
Please change the shape of y to (n_samples, ), for example using ravel().\n", " y = column_or_1d(y, warn=True)\n" ], "name": "stderr" }, { "output_type": "stream", "text": [ "*** Causal Estimate ***\n", "\n", "## Identified estimand\n", "Estimand type: nonparametric-ate\n", "\n", "## Realized estimand\n", "b: post_spend~treatment+pre_spend\n", "Target units: ate\n", "\n", "## Estimate\n", "Mean value: 159.15597747093042\n", "\n" ], "name": "stdout" } ] }, { "cell_type": "markdown", "metadata": { "id": "RhH_KpINxzih" }, "source": [ "DoWhy provides us with an estimated ATE of roughly 159 for our observational data, which is far closer to the ATE of roughly 207 we computed for our experimental data from our randomized controlled trial than the naive difference of roughly 800.\n", "\n", "Even if the ATE estimate DoWhy provides doesn't exactly match our experimental ATE, we're now in a much better position to make a decision regarding our rewards program based on our observational data.\n", "\n", "So next time we want to base decisions on observational data it'll be worthwhile defining a causal model of the underlying data-generating process and using a library such as DoWhy that helps us identify and apply adjustment strategies." ] } ] }