{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# The Effects of Marketing Decisions using the Bank Marketing Dataset" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The [UCI Bank Marketing dataset](https://archive.ics.uci.edu/ml/datasets/Bank+Marketing) comes from a real-world bank's direct marketing campaign. It is a popular dataset, most commonly used for classification: predicting whether a client will open a term deposit account. In this notebook, we show that it is equally suitable for causal inference. The fraction of clients making term deposits is an *outcome* that the bank would like to increase, and the dataset contains several variables that could be seen as *interventions* or *treatments* (we will use the terms interchangeably) for doing so." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "1. [Understand Data](#data)\n", "    1. [Identify treatment variables](#data-treatment)\n", "    2. [Identify potential confounders](#data-confounders)\n", "2. [Effect of Contact Mode](#contact)\n", "    1. [Inverse propensity weighting: First attempt](#contact-ipw1)\n", "    2. [Characterizing the region of treatment non-overlap using rules](#contact-non-overlap)\n", "    3. [Inverse propensity weighting: After excluding non-overlap region](#contact-ipw2)\n", "    4. [Standardization](#contact-standardization)\n", "    5. [Summary and comparison with non-causal analysis](#contact-summary)\n", "3. [Effect of Number of Contacts](#campaign)\n", "    1. [Redefine treatment variable and potential confounders](#campaign-treatment-confounders)\n", "    2. [A closer look at the intervention scenario](#campaign-closer-look)\n", "    3. [Inverse propensity weighting](#campaign-ipw)\n", "    4. [Standardization](#campaign-standardization)\n", "    5. [Summary and comparison with non-causal analysis](#campaign-summary)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "**NOTE:** [Section 2.B](#contact-non-overlap) uses another package called [AIX360](https://github.com/Trusted-AI/AIX360) (also created by IBM Research). If you do not wish to install and run AIX360, you can simply skip the first code cell in that section." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Understand Data\n", "\n", "Load the data:" ] }, { "cell_type": "code", "execution_count": 84, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "\n", "def read_data_from_UCI():\n", "    \"\"\"Reads the bank-marketing data table from a zip file directly from UCI\"\"\"\n", "    import zipfile\n", "    import io\n", "    from urllib import request\n", "\n", "    url = \"https://archive.ics.uci.edu/ml/machine-learning-databases/00222/bank-additional.zip\"\n", "    with request.urlopen(url) as r:\n", "        with zipfile.ZipFile(io.BytesIO(r.read())) as zf:\n", "            csv_file = zf.open(\"bank-additional/bank-additional-full.csv\")\n", "            df = pd.read_csv(csv_file, sep=\";\")\n", "    return df" ] }, { "cell_type": "code", "execution_count": 85, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(41188, 21)" ] }, "execution_count": 85, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data = read_data_from_UCI()\n", "data.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The first step is to understand what variables are present in the data." 
] }, { "cell_type": "code", "execution_count": 86, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index(['age', 'job', 'marital', 'education', 'default', 'housing', 'loan',\n", " 'contact', 'month', 'day_of_week', 'duration', 'campaign', 'pdays',\n", " 'previous', 'poutcome', 'emp.var.rate', 'cons.price.idx',\n", " 'cons.conf.idx', 'euribor3m', 'nr.employed', 'y'],\n", " dtype='object')" ] }, "execution_count": 86, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "According to the [data description](https://archive.ics.uci.edu/ml/datasets/Bank+Marketing):\n", "* The first seven variables (`'age'`-`'loan'`) relate to the client, including basic credit characteristics (`'default'`, `'housing'`, `'loan'`).\n", "* The next four variables relate to the last contact with the client during the current campaign: the mode of communication (`'contact'`, cellular/telephone), date (`'month'`, `'day_of_week'`), and duration of the contact (`'duration'`). `'campaign'` is the number of contacts made during this campaign.\n", "* The three subsequent variables relate to previous marketing campaigns, if applicable: the number of days since the last contact from a previous campaign (`'pdays'`), the number of contacts in previous campaigns (`'previous'`), and their outcome (`'poutcome'`).\n", "* Variables `'emp.var.rate'`-`'nr.employed'` are economic indicators such as the employment rate and consumer price index.\n", "* The last variable `'y'` is the outcome of whether the client opened a term deposit." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We go ahead and binarize `'y'`, mapping `'yes'` to value 1." 
] }, { "cell_type": "code", "execution_count": 87, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['no' 'yes']\n" ] }, { "data": { "text/plain": [ "0.11265417111780131" ] }, "execution_count": 87, "metadata": {}, "output_type": "execute_result" } ], "source": [ "print(data['y'].unique())\n", "\n", "from sklearn.preprocessing import LabelEncoder\n", "le = LabelEncoder()\n", "y = pd.Series(le.fit_transform(data['y']))\n", "y.mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Only 11.3% of clients sign up for a term deposit." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Identify treatment variables\n", "\n", "Next, we consider which of the above variables could be regarded as interventions, undertaken by bank employees, to increase the rate of positive outcomes. These are immediately limited to the variables associated with the current campaign, since client characteristics and economic conditions cannot be controlled by the bank, nor can past events be changed. In addition, as discussed [here](https://archive.ics.uci.edu/ml/datasets/Bank+Marketing), `'duration'` is not known until the last contact is completed and is mostly determined by the client. Of the remaining variables, in this notebook we will investigate the effects of `'contact'` (mode of communication) and `'campaign'` (number of contacts). 
`'day_of_week'` can be treated similarly to `'contact'`, as indicated below.\n", "\n", "We consider `'contact'` first and encode it as a 0/1-valued variable `a` (0 for cellular, 1 for telephone, since `LabelEncoder` assigns codes to the class labels in sorted order):" ] }, { "cell_type": "code", "execution_count": 88, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['telephone' 'cellular']\n" ] }, { "data": { "text/plain": [ "0.3652520151500437" ] }, "execution_count": 88, "metadata": {}, "output_type": "execute_result" } ], "source": [ "print(data['contact'].unique())\n", "a = pd.Series(le.fit_transform(data['contact']))\n", "a.mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Identify potential confounders\n", "\n", "To estimate causal effects from observational data, we must also identify which of the variables are potential *confounders*: variables that could affect both the outcome `'y'` and the decision to intervene. We will want to \"adjust for\" (i.e., control for) these confounders to isolate the causal effect of the intervention on `'y'`. For this dataset, confounder selection can be done by:\n", "1. Following the rule of thumb of avoiding post-intervention variables, i.e., those that may be affected by the intervention, and\n", "2. Putting ourselves in the shoes of the hypothetical bank employee who made the decision.\n", "\n", "Consideration 1 eliminates `'duration'` since it is a result of the last contact with the client. On the flip side, we will always include as potential confounders:\n", "* Client characteristics `'age'`-`'loan'`: These clearly affect the client's decision to invest in a term deposit (the outcome). 
We assume that the bank may also have most of this information in its records and that a bank employee may consult it when contacting the client.\n", "* Previous campaigns `'pdays', 'previous', 'poutcome'`: These indicate the client's previous receptiveness to the bank's products and would also be part of the client's record.\n", "* Economic indicators `'emp.var.rate'`-`'nr.employed'`: These conditions may influence the client's decision as well as the bank's practices." ] }, { "cell_type": "code", "execution_count": 89, "metadata": {}, "outputs": [], "source": [ "confounders = ['age', 'job', 'marital', 'education', 'default', 'housing', 'loan', 'pdays',\n", "               'previous', 'poutcome', 'emp.var.rate', 'cons.price.idx', 'cons.conf.idx', 'euribor3m', 'nr.employed']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For the `'contact'` intervention, the decision is whether the bank employee should call a cell phone or landline as the next (and last) contact with the client. Thus we will also include as potential confounders `'month'`, to account for any seasonality effects, and `'campaign'`, the number of contacts up until this point." ] }, { "cell_type": "code", "execution_count": 90, "metadata": {}, "outputs": [], "source": [ "confounders += ['month', 'campaign']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will discuss the `'campaign'` intervention later.\n", "\n", "Now we extract the confounders into a variable `X` and dummy-code (i.e., one-hot encode) the categorical ones in preparation for modelling." 
] }, { "cell_type": "code", "execution_count": 91, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "age int64\n", "job object\n", "marital object\n", "education object\n", "default object\n", "housing object\n", "loan object\n", "pdays int64\n", "previous int64\n", "poutcome object\n", "emp.var.rate float64\n", "cons.price.idx float64\n", "cons.conf.idx float64\n", "euribor3m float64\n", "nr.employed float64\n", "month object\n", "campaign int64\n", "dtype: object" ] }, "execution_count": 91, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X = data[confounders]\n", "X.dtypes" ] }, { "cell_type": "code", "execution_count": 92, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
|   | age | pdays | previous | emp.var.rate | cons.price.idx | cons.conf.idx | euribor3m | nr.employed | campaign | job=blue-collar | ... | poutcome=success | month=aug | month=dec | month=jul | month=jun | month=mar | month=may | month=nov | month=oct | month=sep |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 56 | 999 | 0 | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 57 | 999 | 0 | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 37 | 999 | 0 | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 3 | 40 | 999 | 0 | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 4 | 56 | 999 | 0 | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |

5 rows × 47 columns
\n", "