{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Multi-Touch Attribution with Markov Chains\n", "\n", "This notebook explores implementing a multi-touch attribution model using Markov chains. The following articles \n", "explain this approach:\n", "* [A Beginner’s Guide to Channel Attribution Modeling in Marketing](https://www.analyticsvidhya.com/blog/2018/01/channel-attribution-modeling-using-markov-chains-in-r/)\n", "* [Marketing Multi-Channel Attribution model with R](https://analyzecore.com/2016/08/03/attribution-model-r-part-1/)\n", "* [Markov Chains in Python: Beginner Tutorial](https://www.datacamp.com/community/tutorials/markov-chains-python-tutorial)\n", "\n", "\n", "This is the graph we are using as our test case:
\n", "![](https://i0.wp.com/analyzecore.com/wp-content/uploads/2016/07/Screenshot-2016-07-22-14.26.50.png?w=938&ssl=1)" ] }, { "cell_type": "code", "execution_count": 102, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Import the Data" ] }, { "cell_type": "code", "execution_count": 76, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Load the journey dataset and insert a \"Start\" column at the beginning, to explicitly mark the initial state.\n", "df = pd.read_csv('simple_conversion_data.csv', header=None)\n", "df.insert(0, 'Start', 'Start')" ] }, { "cell_type": "code", "execution_count": 77, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Start0123
0StartC1C2C3Conversion
1StartC1NaNNaNNaN
2StartC2C3NaNNaN
\n", "
" ], "text/plain": [ " Start 0 1 2 3\n", "0 Start C1 C2 C3 Conversion\n", "1 Start C1 NaN NaN NaN\n", "2 Start C2 C3 NaN NaN" ] }, "execution_count": 77, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Transform the Data" ] }, { "cell_type": "code", "execution_count": 78, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Reshape the journey dataset to get a new, two-column dataset that contains each \n", "# transition. That is, each row in the new dataset will contain: (T, T+1) \n", "transitions = list()\n", "for n in range(0, len(df.columns)-1):\n", " df_transition = pd.DataFrame({\n", " 't': df.iloc[:,n], \n", " 't_plus_1': df.iloc[:,n+1]\n", " })\n", " transitions.append(df_transition)\n", "df_transitions = pd.concat(transitions)\n", "\n", "# We can drop all rows with NaN values in state T (the starting state).\n", "# These represent journey's that have already been completed.\n", "df_transitions.dropna(subset=['t'], inplace=True)\n", "\n", "# Let's replace the NaN's in the T+1 columns with a value indicating that the user bailed.\n", "# This will avoid problems later when we do groupby functions on the dataframe.\n", "df_transitions.fillna('Exit', inplace=True)" ] }, { "cell_type": "code", "execution_count": 79, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
tt_plus_1
0StartC1
1StartC1
2StartC2
0C1C2
1C1Exit
2C2C3
0C2C3
2C3Exit
0C3Conversion
\n", "
" ], "text/plain": [ " t t_plus_1\n", "0 Start C1\n", "1 Start C1\n", "2 Start C2\n", "0 C1 C2\n", "1 C1 Exit\n", "2 C2 C3\n", "0 C2 C3\n", "2 C3 Exit\n", "0 C3 Conversion" ] }, "execution_count": 79, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_transitions" ] }, { "cell_type": "code", "execution_count": 80, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Create a separate lookup table with row counts for each starting state, t, in the \n", "# transitions table. We'll use this to calculate probabilities later, and it will \n", "# help us avoid the slow performing df.apply() method.\n", "df_initial_state_counts = df_transitions.groupby(by=['t'], as_index=False).count()\n", "df_initial_state_counts.rename(columns={'t_plus_1':'count_of_t'}, inplace=True)" ] }, { "cell_type": "code", "execution_count": 81, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
tcount_of_t
0C12
1C22
2C32
3Start3
\n", "
" ], "text/plain": [ " t count_of_t\n", "0 C1 2\n", "1 C2 2\n", "2 C3 2\n", "3 Start 3" ] }, "execution_count": 81, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_initial_state_counts" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. Calculate Transition Probabilities" ] }, { "cell_type": "code", "execution_count": 82, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Join the lookup table to the transitions table, to pull in the counts for each starting state, t.\n", "df_transitions = pd.merge(df_transitions, df_initial_state_counts, on='t', how='inner')" ] }, { "cell_type": "code", "execution_count": 83, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
tt_plus_1count_of_t
0StartC13
1StartC13
2StartC23
3C1C22
4C1Exit2
5C2C32
6C2C32
7C3Exit2
8C3Conversion2
\n", "
" ], "text/plain": [ " t t_plus_1 count_of_t\n", "0 Start C1 3\n", "1 Start C1 3\n", "2 Start C2 3\n", "3 C1 C2 2\n", "4 C1 Exit 2\n", "5 C2 C3 2\n", "6 C2 C3 2\n", "7 C3 Exit 2\n", "8 C3 Conversion 2" ] }, "execution_count": 83, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_transitions" ] }, { "cell_type": "code", "execution_count": 84, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Calculate the individual probability for each transition instance\n", "df_transitions['probability'] = 1/df_transitions['count_of_t']" ] }, { "cell_type": "code", "execution_count": 85, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
tt_plus_1count_of_tprobability
0StartC130.333333
1StartC130.333333
2StartC230.333333
3C1C220.500000
4C1Exit20.500000
5C2C320.500000
6C2C320.500000
7C3Exit20.500000
8C3Conversion20.500000
\n", "
" ], "text/plain": [ " t t_plus_1 count_of_t probability\n", "0 Start C1 3 0.333333\n", "1 Start C1 3 0.333333\n", "2 Start C2 3 0.333333\n", "3 C1 C2 2 0.500000\n", "4 C1 Exit 2 0.500000\n", "5 C2 C3 2 0.500000\n", "6 C2 C3 2 0.500000\n", "7 C3 Exit 2 0.500000\n", "8 C3 Conversion 2 0.500000" ] }, "execution_count": 85, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_transitions" ] }, { "cell_type": "code", "execution_count": 86, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Calculate the total probability of transitioning from one state to another\n", "df_transition_prob = df_transitions.groupby(by=['t', 't_plus_1'], as_index=False).sum()\n", "df_transition_prob.drop(['count_of_t'], axis=1, inplace=True) # We don't need this column anymore" ] }, { "cell_type": "code", "execution_count": 87, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
tt_plus_1probability
0C1C20.500000
1C1Exit0.500000
2C2C31.000000
3C3Conversion0.500000
4C3Exit0.500000
5StartC10.666667
6StartC20.333333
\n", "
" ], "text/plain": [ " t t_plus_1 probability\n", "0 C1 C2 0.500000\n", "1 C1 Exit 0.500000\n", "2 C2 C3 1.000000\n", "3 C3 Conversion 0.500000\n", "4 C3 Exit 0.500000\n", "5 Start C1 0.666667\n", "6 Start C2 0.333333" ] }, "execution_count": 87, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_transition_prob" ] }, { "cell_type": "code", "execution_count": 88, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
tprobability
0C11.0
1C21.0
2C31.0
3Start1.0
\n", "
" ], "text/plain": [ " t probability\n", "0 C1 1.0\n", "1 C2 1.0\n", "2 C3 1.0\n", "3 Start 1.0" ] }, "execution_count": 88, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Double-check to make sure the total probability for each starting state, t, equals 1.0 (i.e. 100%)\n", "df_test = df_transition_prob.groupby(by='t', as_index=False).sum()\n", "df_test" ] }, { "cell_type": "code", "execution_count": 89, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " \n", "The probability calculation test passed!!! :)\n", " \n" ] } ], "source": [ "print(' ')\n", "if df_test['probability'].sum() != len(df_test):\n", " print('[ERROR]: The probability calculation test failed. :(')\n", "else:\n", " print('The probability calculation test passed!!! :)')\n", "print(' ')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4. Calculate Total Conversion Probability" ] }, { "cell_type": "code", "execution_count": 90, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def print_node(node, debug=False):\n", " # Print each node as it's processed. For debugging purposes.\n", " \n", " if not debug:\n", " return\n", " \n", " if node['t_plus_1'] in ['Exit', 'Conversion']:\n", " node_type = 'Leaf'\n", " else:\n", " node_type = 'Parent'\n", " \n", " print('%s > %s' % (node['t'], node['t_plus_1']))\n", " print('Type: %s' % node_type)\n", " print('Prob: %0.2f' % node['probability'])\n", " print('----------------------------')" ] }, { "cell_type": "code", "execution_count": 91, "metadata": { "collapsed": true }, "outputs": [], "source": [ "calculated_node_probabilities = dict()\n", "\n", "def calc_conversion_probability(starting_state, df_transitions, cum_probability, calculated_nodes, debug=True):\n", " # Calculates the cumulative probability of reaching a conversion, given a starting state.\n", " # Assumes the transition dataframe represents a Directed Acyclic Graph (DAG)\n", " \n", " # Get the transition probabilities for the starting state we're evaluating\n", " df_nodes = df_transitions[df_transitions['t'] == starting_state]\n", "\n", " \n", " # Loop through the starting states and either return the probability for \n", " # a leaf node, or recurse to keep following the tree.\n", " \n", " node_conversion_probability = 0\n", " \n", " child_node_proabilities = []\n", " \n", " for index, row in df_nodes.iterrows():\n", " \n", " # These are leaf nodes: either an exit or conversion\n", " if row['t_plus_1'] == 'Exit':\n", " print_node(row, debug)\n", " child_node_proabilities.append(0)\n", " \n", " elif row['t_plus_1'] == 'Conversion':\n", " print_node(row, debug)\n", " child_node_proabilities.append(row['probability'])\n", " \n", " # This is a parent node: Keep following the rabbit hole\n", " else:\n", " \n", " # Have we cached the total probability for this node???\n", " if row['t_plus_1'] in calculated_nodes:\n", " if debug:\n", " print('Cache Hit for %s! Cum probability from child: %0.2f' % (row['t_plus_1'], calculated_nodes[row['t_plus_1']]))\n", " child_probability = calculated_nodes[row['t_plus_1']]\n", " \n", " # No cached value found. We'll walk through the tree to calculated the value.\n", " else:\n", " # Recursive call\n", " child_probability = calc_conversion_probability(row['t_plus_1'], \n", " df_transitions, \n", " cum_probability + row['probability'],\n", " calculated_nodes, \n", " debug)\n", " node_conversion_prob = child_probability * row['probability']\n", " \n", " print_node(row, debug)\n", " child_node_proabilities.append(node_conversion_prob)\n", " \n", " if debug:\n", " print('%s > %s' % (row['t'], row['t_plus_1']))\n", " print('Cum Prob from Child : %0.2f' % child_probability)\n", " print('Prob to Child Node : %0.2f' % row['probability'])\n", " print('Node Conv Proability: %0.2f' % node_conversion_prob)\n", " print('----------------------------')\n", " \n", " total_node_probability = sum(child_node_proabilities)\n", " if debug:\n", " print('Node Conversion Probability for %s: %0.2f' % (starting_state, total_node_probability))\n", " print('----------------------------')\n", " \n", " # We'll cache the calculated total probability for the node, so we don't have to calculate it again.\n", " calculated_node_probabilities[starting_state] = total_node_probability\n", " \n", " return total_node_probability\n" ] }, { "cell_type": "code", "execution_count": 92, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "====== START DEBUG PRINT ======\n", "C3 > Conversion\n", "Type: Leaf\n", "Prob: 0.50\n", "----------------------------\n", "C3 > Exit\n", "Type: Leaf\n", "Prob: 0.50\n", "----------------------------\n", "Node Conversion Probability for C3: 0.50\n", "----------------------------\n", "C2 > C3\n", "Type: Parent\n", "Prob: 1.00\n", "----------------------------\n", "C2 > C3\n", "Cum Prob from Child : 0.50\n", "Prob to Child Node : 1.00\n", "Node Conv Proability: 0.50\n", "----------------------------\n", "Node Conversion Probability for C2: 0.50\n", "----------------------------\n", "C1 > C2\n", "Type: Parent\n", "Prob: 0.50\n", "----------------------------\n", "C1 > C2\n", "Cum Prob from Child : 0.50\n", "Prob to Child Node : 0.50\n", "Node Conv Proability: 0.25\n", "----------------------------\n", "C1 > Exit\n", "Type: Leaf\n", "Prob: 0.50\n", "----------------------------\n", "Node Conversion Probability for C1: 0.25\n", "----------------------------\n", "Start > C1\n", "Type: Parent\n", "Prob: 0.67\n", "----------------------------\n", "Start > C1\n", "Cum Prob from Child : 0.25\n", "Prob to Child Node : 0.67\n", "Node Conv Proability: 0.17\n", "----------------------------\n", "Cache Hit for C2! Cum probability from child: 0.50\n", "Start > C2\n", "Type: Parent\n", "Prob: 0.33\n", "----------------------------\n", "Start > C2\n", "Cum Prob from Child : 0.50\n", "Prob to Child Node : 0.33\n", "Node Conv Proability: 0.17\n", "----------------------------\n", "Node Conversion Probability for Start: 0.33\n", "----------------------------\n", "====== END DEBUG PRINT ======\n", " \n", "Total Conversion Probability from Start: 0.33\n" ] } ], "source": [ "starting_node = 'Start'\n", "print('====== START DEBUG PRINT ======')\n", "total_probability = calc_conversion_probability(starting_node, df_transition_prob, 0, calculated_node_probabilities)\n", "print('====== END DEBUG PRINT ======')\n", "print(' ')\n", "print('Total Conversion Probability from %s: %0.2f' % (starting_node, total_probability))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5. Calculate Removal Effect" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
tt_plus_1probability
0C1C20.500000
1C1Exit0.500000
2C2C31.000000
3C3Conversion0.500000
4C3Exit0.500000
5StartC10.666667
6StartC20.333333
\n", "
" ], "text/plain": [ " t t_plus_1 probability\n", "0 C1 C2 0.500000\n", "1 C1 Exit 0.500000\n", "2 C2 C3 1.000000\n", "3 C3 Conversion 0.500000\n", "4 C3 Exit 0.500000\n", "5 Start C1 0.666667\n", "6 Start C2 0.333333" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Let's look at our transition probability table again. We'll tweak this to calculate the \n", "# removal effect.\n", "df_transition_prob" ] }, { "cell_type": "code", "execution_count": 63, "metadata": { "collapsed": true }, "outputs": [], "source": [ "unique_channels = df_transition_prob[df_transition_prob['t'] != 'Start']['t'].unique()" ] }, { "cell_type": "code", "execution_count": 100, "metadata": {}, "outputs": [], "source": [ "pd.options.mode.chained_assignment = None # I know, this makes me a very bad person.\n", "removal_effects = dict()\n", "\n", "# Remove each channel and calculate the impact\n", "for channel in unique_channels:\n", " \n", " # Remove the channel from the transition probability matrix\n", " df_reduced_graph = df_transition_prob[df_transition_prob['t'] != channel]\n", " df_reduced_graph.loc[df_reduced_graph['t_plus_1']==channel, 't_plus_1'] = 'Exit'\n", " \n", " # Recalculate the total conversion probability\n", " calculated_node_probabilities = dict()\n", " new_total_probability = calc_conversion_probability('Start', \n", " df_reduced_graph, \n", " 0, \n", " calculated_node_probabilities, \n", " debug=False)\n", " \n", " # Calculate the difference in conversion probability\n", " removal_effect = (total_probability - new_total_probability)/total_probability\n", " removal_effects[channel] = removal_effect\n", " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 6. Results: Which channels have the greatest impact?" ] }, { "cell_type": "code", "execution_count": 101, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Removal effect by channel:\n", "C1: 0.50\n", "C2: 1.00\n", "C3: 1.00\n" ] } ], "source": [ "print('Removal effect by channel:')\n", "for key, value in removal_effects.items(): \n", " print('%s: %0.2f' % (key, value))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.1" } }, "nbformat": 4, "nbformat_minor": 2 }