{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# \"Causality modelling in Python for data scientists\"\n", "> \"Data science is increasingly commonplace in industry and the enterprise. Industrial data scientists have a vast toolbox for descriptive and predictive analyses at their disposal. However, data science tools for decision-making in industry and the enterprise are less well established. Here we survey Python packages that can help industrial data scientists facilitate intelligent decision-making through causality modelling.\"\n", "- hidden: true\n", "- toc: true\n", "- branch: master\n", "- badges: true\n", "- comments: true\n", "- categories: [causal inference, causal discovery, causality modelling, python]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The need for causality modelling\n", "\n", "Intelligent planning and decision-making lie at the heart of most business success.\n", "\n", "The decisions that our business needs to evaluate range from low-effort ones that we take potentially thousands or millions of times a day to high-effort ones that we take only every couple of months:\n", "\n", "1. What will happen if I show an advertising banner to a particular user?\n", "2. What will happen if I change the retail prices for certain products in my shop?\n", "3. What will happen if I alter my manufacturing process?\n", "4. What will happen if I swap out a particular mechanical piece in a vehicle I develop?\n", "5. What will happen if I invest in new property, machinery, or processes?\n", "6. What will happen if I hire this applicant?\n", "7. What will happen if I increase the remuneration of my workforce?\n", "\n", "As industrial data scientists we are often called upon to evaluate these proposed business decisions using analytics, machine learning methodologies, and past data.\n", "\n", "What we may end up doing for the above proposed business decisions is:\n", "\n",
"1. Compute and rank past click-through rates for given pairs of ad banner and user,\n", "2. Correlate past demand with set retail prices for product groups of interest,\n", "3. Correlate past manufacturing parameters with achieved output quality,\n", "4. Correlate the mechanical behavior of my vehicles with the mechanical parts used in them,\n", "5. Use past data to forecast the development of real estate prices,\n", "6. Use past data to correlate and predict the productivity of my team given e.g. its size or makeup, and\n", "7. Use past data to correlate productivity and remuneration levels.\n", "\n", "The way I formulated these is already pretty suggestive: some of our common approaches to evaluating business decisions do not compare our business outcomes with and without said decisions. Instead, they look at our data outside the context of decision-making.\n", "\n", "Put another way, we often analyze past data without considering the state our business or customer was in when those data were generated.\n", "For illustration:\n", "\n", "![Data fusion process (5)](https://user-images.githubusercontent.com/3273502/85201681-a999f580-b301-11ea-9174-056649a1bebb.png)\n", "\n", "So really, when tasked with evaluating the above proposed business decisions, we should instead think in terms of questions akin to the following:\n", "\n", "1. How would the user of interest behave differently if we didn't show them (and pay for) a banner now?\n", "2. For each euro we shave off a price tag, how much higher will our revenue be because more customers are inclined to place an order?\n", "3. \n", "4.\n", "5.\n", "6.\n", "7." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## How to do causality modelling\n", "\n", "Hünermund and Bareinboim (https://arxiv.org/abs/1912.09104) proposed a methodology they call the data-fusion process.\n", "\n", "The data-fusion process maps out the individual steps necessary for evaluating the impact of past and potential future decisions:\n", "\n", "![The data-fusion process.](https://user-images.githubusercontent.com/3273502/85201682-aacb2280-b301-11ea-9529-59e63c19f945.png \"With the data-fusion process we iterate through applying and validating our causal understanding of our system, modelling the impact of our proposed business decision on our system, and estimating its impact using our validated causal model and historical data. Adapted from Hünermund and Bareinboim: https://arxiv.org/abs/1912.09104.\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Use case: The impact of direct marketing on customer behavior\n", "\n", "We'll use a data set provided by UCI (https://archive.ics.uci.edu/ml/datasets/Bank+Marketing) that illustrates the potential impact of direct marketing on customer behavior.\n", "\n", "Let's dive right in, download the data set, and see what we are working with." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### The direct marketing success data set" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "!wget --quiet https://archive.ics.uci.edu/ml/machine-learning-databases/00222/bank.zip" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "!unzip -oqq bank.zip" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [], "source": [ "#collapse\n", "df = pd.read_csv('bank.csv', delimiter=';')\n", "df['success'] = df['y']\n", "del df['y']\n", "df['success'] = df['success'].replace('no', 0)\n", "df['success'] = df['success'].replace('yes', 1)\n", "del df['duration']\n", "df['no_contacts'] = df['campaign']\n", "del df['campaign']" ] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>age</th><th>job</th><th>marital</th><th>education</th><th>default</th><th>balance</th><th>housing</th><th>loan</th><th>contact</th><th>day</th><th>month</th><th>pdays</th><th>previous</th><th>poutcome</th><th>success</th><th>no_contacts</th></tr>\n",
"  </thead>\n", "  <tbody>\n",
"    <tr><th>0</th><td>30</td><td>unemployed</td><td>married</td><td>primary</td><td>no</td><td>1787</td><td>no</td><td>no</td><td>cellular</td><td>19</td><td>oct</td><td>-1</td><td>0</td><td>unknown</td><td>0</td><td>1</td></tr>\n",
"    <tr><th>1</th><td>33</td><td>services</td><td>married</td><td>secondary</td><td>no</td><td>4789</td><td>yes</td><td>yes</td><td>cellular</td><td>11</td><td>may</td><td>339</td><td>4</td><td>failure</td><td>0</td><td>1</td></tr>\n",
"    <tr><th>2</th><td>35</td><td>management</td><td>single</td><td>tertiary</td><td>no</td><td>1350</td><td>yes</td><td>no</td><td>cellular</td><td>16</td><td>apr</td><td>330</td><td>1</td><td>failure</td><td>0</td><td>1</td></tr>\n",
"    <tr><th>3</th><td>30</td><td>management</td><td>married</td><td>tertiary</td><td>no</td><td>1476</td><td>yes</td><td>yes</td><td>unknown</td><td>3</td><td>jun</td><td>-1</td><td>0</td><td>unknown</td><td>0</td><td>4</td></tr>\n",
"    <tr><th>4</th><td>59</td><td>blue-collar</td><td>married</td><td>secondary</td><td>no</td><td>0</td><td>yes</td><td>no</td><td>unknown</td><td>5</td><td>may</td><td>-1</td><td>0</td><td>unknown</td><td>0</td><td>1</td></tr>\n",
"  </tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " age job marital education default balance housing loan \\\n", "0 30 unemployed married primary no 1787 no no \n", "1 33 services married secondary no 4789 yes yes \n", "2 35 management single tertiary no 1350 yes no \n", "3 30 management married tertiary no 1476 yes yes \n", "4 59 blue-collar married secondary no 0 yes no \n", "\n", " contact day month pdays previous poutcome success no_contacts \n", "0 cellular 19 oct -1 0 unknown 0 1 \n", "1 cellular 11 may 339 4 failure 0 1 \n", "2 cellular 16 apr 330 1 failure 0 1 \n", "3 unknown 3 jun -1 0 unknown 0 4 \n", "4 unknown 5 may -1 0 unknown 0 1 " ] }, "execution_count": 52, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Our tabular marketing and sales data shows a number of features we observe about a given customer and our interaction with them:\n", "\n", "- The customer's age, job, marital status, education, current account balance, and whether or not they already took out a loan are recorded,\n", "- Our direct marketing interaction with a given customer is also recorded, for instance, how often we already contacted them.\n", "\n", "A more detailed description of the features in our data can be found here:\n", "\n", "https://archive.ics.uci.edu/ml/datasets/Bank+Marketing\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Trying to help our business with machine learning only" ] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [], "source": [ "target = 'success'\n", "features = [column for column in df.columns if column != target]" ] }, { "cell_type": "code", "execution_count": 54, "metadata": {}, "outputs": [], "source": [ "import lightgbm as lgb\n", "from sklearn.preprocessing import OrdinalEncoder" ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [], "source": [ "model = lgb.LGBMClassifier()" ] }, { "cell_type": "code", 
"execution_count": 56, "metadata": {}, "outputs": [], "source": [ "X, y = df[features], df[target]" ] }, { "cell_type": "code", "execution_count": 57, "metadata": {}, "outputs": [], "source": [ "numerical_features = ['age', 'balance', 'no_contacts', 'previous', 'pdays']\n", "categorical_features = [feature for feature in features if feature not in numerical_features]" ] }, { "cell_type": "code", "execution_count": 58, "metadata": {}, "outputs": [], "source": [ "encoder = OrdinalEncoder(dtype=int)" ] }, { "cell_type": "code", "execution_count": 59, "metadata": {}, "outputs": [], "source": [ "X_numeric = pd.concat(\n", " [\n", " X[numerical_features],\n", " pd.DataFrame(\n", " data=encoder.fit_transform(X[categorical_features]),\n", " columns=categorical_features\n", " )\n", " ],\n", " axis=1\n", ")" ] }, { "cell_type": "code", "execution_count": 60, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>age</th><th>balance</th><th>no_contacts</th><th>previous</th><th>pdays</th><th>job</th><th>marital</th><th>education</th><th>default</th><th>housing</th><th>loan</th><th>contact</th><th>day</th><th>month</th><th>poutcome</th></tr>\n",
"  </thead>\n", "  <tbody>\n",
"    <tr><th>0</th><td>30</td><td>1787</td><td>1</td><td>0</td><td>-1</td><td>10</td><td>1</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>18</td><td>10</td><td>3</td></tr>\n",
"    <tr><th>1</th><td>33</td><td>4789</td><td>1</td><td>4</td><td>339</td><td>7</td><td>1</td><td>1</td><td>0</td><td>1</td><td>1</td><td>0</td><td>10</td><td>8</td><td>0</td></tr>\n",
"    <tr><th>2</th><td>35</td><td>1350</td><td>1</td><td>1</td><td>330</td><td>4</td><td>2</td><td>2</td><td>0</td><td>1</td><td>0</td><td>0</td><td>15</td><td>0</td><td>0</td></tr>\n",
"    <tr><th>3</th><td>30</td><td>1476</td><td>4</td><td>0</td><td>-1</td><td>4</td><td>1</td><td>2</td><td>0</td><td>1</td><td>1</td><td>2</td><td>2</td><td>6</td><td>3</td></tr>\n",
"    <tr><th>4</th><td>59</td><td>0</td><td>1</td><td>0</td><td>-1</td><td>1</td><td>1</td><td>1</td><td>0</td><td>1</td><td>0</td><td>2</td><td>4</td><td>8</td><td>3</td></tr>\n",
"  </tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " age balance no_contacts previous pdays job marital education \\\n", "0 30 1787 1 0 -1 10 1 0 \n", "1 33 4789 1 4 339 7 1 1 \n", "2 35 1350 1 1 330 4 2 2 \n", "3 30 1476 4 0 -1 4 1 2 \n", "4 59 0 1 0 -1 1 1 1 \n", "\n", " default housing loan contact day month poutcome \n", "0 0 0 0 0 18 10 3 \n", "1 0 1 1 0 10 8 0 \n", "2 0 1 0 0 15 0 0 \n", "3 0 1 1 2 2 6 3 \n", "4 0 1 0 2 4 8 3 " ] }, "execution_count": 60, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X_numeric.head()" ] }, { "cell_type": "code", "execution_count": 61, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "LGBMClassifier()" ] }, "execution_count": 61, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model.fit(X_numeric, y)" ] }, { "cell_type": "code", "execution_count": 62, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline" ] }, { "cell_type": "code", "execution_count": 63, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "lgb.plot_importance(model);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are numerous ways to compute feature importance. The one used here, implemented in the LightGBM library, counts the number of times a given feature is used in the constructed trees:\n", "\n", "https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.plot_importance.html\n", "\n", "In general, feature importance gives us a measure of how useful a given measured variable is for predicting the target (marketing success in our case), that is, how strongly the two are associated.\n", "\n", "The question here is: How can we use our trained success predictor and our feature importances to aid intelligent planning and decision-making in our business?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Uses for " ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.7" } }, "nbformat": 4, "nbformat_minor": 4 }