# Churn prediction with Adventure Works

In this notebook we'll use the staging tables created in the previous challenges and try to predict whether our customers will churn.

Since the dataset doesn't contain a churn flag, we'll start by deciding what churn means in our case, and then add that information to our data so that we can train a model.

## Parameters


In [None]:
REGION="us-central1" #@param region {type:"string"}

## Getting started with `pandas`

The well-known `pandas` framework supports reading data from BigQuery tables, and Colab comes pre-installed with all of the required libraries.

> BigQuery also provides the [BigQuery DataFrames](https://cloud.google.com/bigquery/docs/dataframes-quickstart) (`bigframes`) package that's designed to be compatible with `pandas` data frames and can handle large amounts of data, but since we're dealing with relatively small datasets we'll stick to the familiar `pandas` data frames.

In [None]:
import pandas as pd

df = pd.read_gbq("curated.stg_sales_order_header")

Let's have a quick look at our data.

In [None]:
df.head(20)

## Churned or not

To decide whether a customer can be considered _churned_, we're going to look at their last purchase date. If that's beyond a threshold, i.e. the customer hasn't purchased anything in the last _N_ days, we'll mark them as churned.

> There are a number of different methods for churn analysis, including survival analysis, time-to-event prediction etc. These are beyond the scope of this exercise, so we're keeping things very simple.

But what's a good threshold for our dataset? Let's analyze our customer base and find out how many days have passed since the last purchase date of every customer.

In [None]:
lpd = df.groupby("customer_id")["order_date"].max()

Now that we have the last purchase date for each customer, we could subtract it from the current date. However, the dataset we're using (although its dates have been updated) only covers a specific period. Let's find the date that we can use as the _current date_ for this dataset.

In [None]:
mpd = lpd.max()

Okay, we're ready to calculate the number of days since last purchase.

In [None]:
days_since_last_purchase = pd.to_datetime(mpd) - pd.to_datetime(lpd)

What does this distribution look like?

In [None]:
days_since_last_purchase.dt.days.hist();

We see that the majority of our customers have made relatively recent purchases, although there are a few that haven't bought anything in **3** years; those have certainly churned.

That's useful information, but we need more data. To determine a potential threshold, we need to find out how much time typically passes between two consecutive purchases.

In [None]:
diffs = df.sort_values(["customer_id", "order_date"]).groupby("customer_id")["order_date"].diff()
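To see what the per-customer `diff` produces, here is a minimal sketch on a hypothetical toy frame (the customer ids and dates are made up, only the column names match our table). Note that the first order of every customer has no predecessor, so its gap is `NaT` and is simply ignored by the histogram.

```python
import pandas as pd

# Hypothetical toy orders: customer 1 has three orders, customer 2 has one.
orders = pd.DataFrame(
    {
        "customer_id": [1, 2, 1, 1],
        "order_date": pd.to_datetime(
            ["2024-03-01", "2024-01-15", "2024-01-01", "2024-01-11"]
        ),
    }
)

# Sort so that each customer's orders are in chronological order,
# then take the difference between consecutive rows within each customer.
gaps = (
    orders.sort_values(["customer_id", "order_date"])
    .groupby("customer_id")["order_date"]
    .diff()
)

# Keep only the real gaps (in days); first orders yield NaT and are dropped.
gap_days = gaps.dt.days.dropna().tolist()
print(gap_days)
```

A customer with a single order contributes no gap at all, so one-time buyers don't influence the threshold we derive from this data.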

A picture is worth a thousand words, so let's visualize a histogram of this data.

In [None]:
diffs.dt.days.hist();

It looks like most repeat purchases happen within 100 days. So, to stay on the safe side, we're going to use **180** days (almost 6 months) as our threshold: if a customer hasn't made a purchase for more than 180 days, we'll consider them churned.

In [None]:
df["last_purchase_date"] = df.groupby("customer_id")["order_date"].transform("max")

In [None]:
df["churned"] = (pd.to_datetime(mpd) - pd.to_datetime(df["last_purchase_date"])).dt.days > 180
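The labelling rule can be checked end to end on a small example. The sketch below uses hypothetical toy orders (ids and dates are made up) and the same 180-day threshold; the variable names are illustrative and not part of the notebook.

```python
import pandas as pd

THRESHOLD_DAYS = 180  # same threshold as chosen for the real data

# Hypothetical toy orders for three customers.
orders = pd.DataFrame(
    {
        "customer_id": [1, 1, 2, 3],
        "order_date": pd.to_datetime(
            ["2023-01-10", "2024-06-01", "2023-11-01", "2024-05-20"]
        ),
    }
)

# The most recent order in the data stands in for "today".
reference_date = orders["order_date"].max()

# Last purchase per customer, then days elapsed up to the reference date.
last_purchase = orders.groupby("customer_id")["order_date"].max()
days_inactive = (reference_date - last_purchase).dt.days

# A customer is labelled churned if inactive for more than the threshold.
labels = (days_inactive > THRESHOLD_DAYS).rename("churned")
print(labels.to_dict())
```

Only the customer whose last order lies more than 180 days before the reference date ends up with `churned` set to `True`.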

Now that we've established our definition of a churned customer, let's have a look at the distribution of customers who have churned.

In [None]:
df.groupby("churned")["churned"].count().plot.bar();

That looks pretty nice and balanced, although in the real world we'd expect (or hope for) fewer customers churning.
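One thing to keep in mind when reading such a chart: assuming `df` holds one row per order (as the per-customer grouping suggests), counting rows weights every customer by their number of orders. The following sketch, on a hypothetical toy frame, shows how order-level and customer-level counts can differ.

```python
import pandas as pd

# Hypothetical order-level frame: customer 1 placed three orders and is active,
# customers 2 and 3 placed one order each and have churned.
orders = pd.DataFrame(
    {
        "customer_id": [1, 1, 1, 2, 3],
        "churned": [False, False, False, True, True],
    }
)

# Counting rows weights every customer by their number of orders.
per_order = orders["churned"].value_counts().to_dict()

# Keeping one row per customer first gives every customer a single vote.
per_customer = (
    orders.drop_duplicates("customer_id")["churned"].value_counts().to_dict()
)

print(per_order)
print(per_customer)
```

The training table built later has one row per customer, so the customer-level balance is the one that matters for the model.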

## Training data

Alright, we're almost ready to do some training. We've now established which customers have churned; the next step is to combine that information with, for example, customer details, so that we can make predictions based on those details.

### Exploration playground

Data scientists typically need a separate place where they can create different types of derived tables, so let's create another dataset.

In [None]:
! bq show exploration || bq mk --location=$REGION exploration

Now that we have a separate dataset, let's store the churn information that we derived, one row per customer.

In [None]:
# One row per customer; "churned" is constant within a customer, so max() just picks that value.
tdf = df.groupby("customer_id", as_index=False)["churned"].max()

In [None]:
tdf[["customer_id", "churned"]].to_gbq("exploration.churn_labels", if_exists="replace")

The training data consists of customer details joined with the churn information. We can do that using `pandas` dataframes or, since both tables are now in BigQuery, using `SQL`.

In [None]:
%%bigquery
CREATE OR REPLACE TABLE
 exploration.churn_training AS
SELECT
 c.customer_id,
 p.*,
 l.churned
FROM
 curated.stg_customer c,
 curated.stg_person p,
 exploration.churn_labels l
WHERE
 c.person_id = p.business_entity_id AND
 c.customer_id = l.customer_id
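For comparison, the same join could be expressed with `pandas`. This is a minimal sketch on hypothetical stand-in frames: only the join keys (`customer_id`, `person_id`, `business_entity_id`) come from the query above, while `first_name` and all values are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-ins for curated.stg_customer, curated.stg_person
# and exploration.churn_labels.
customers = pd.DataFrame({"customer_id": [10, 11, 12], "person_id": [1, 2, 3]})
persons = pd.DataFrame(
    {"business_entity_id": [1, 2, 4], "first_name": ["Ann", "Bob", "Cy"]}
)
labels = pd.DataFrame({"customer_id": [10, 11, 12], "churned": [True, False, True]})

# Inner joins mirror the WHERE conditions of the SQL statement:
# rows without a match on either key are dropped.
training = (
    customers[["customer_id", "person_id"]]
    .merge(persons, left_on="person_id", right_on="business_entity_id", how="inner")
    .merge(labels, on="customer_id", how="inner")
    .drop(columns="person_id")  # the SQL selects only c.customer_id from stg_customer
)
print(training)
```

Customer 12 has no matching person record in this toy data, so it drops out of the result, just as it would with the comma-style join in SQL.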

## Model training

Now that we have the data, we have multiple options. We can use any framework to train a new model: scikit-learn, TensorFlow, PyTorch etc. We could also use Vertex AI to run the training as a managed service on specific hardware. But since the star of this hack is BigQuery, we'll use **BQML**.

> Note that we're keeping things very simple. Building an end-to-end MLOps pipeline is beyond the scope of this hack; however, if you're interested in that, we have another [gHack](https://ghacks.dev/hacks/mlops-on-gcp) specifically designed for it.

Training a model with BigQuery is quite trivial. You can stick to the defaults for most of the parameters, but see the [docs](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-create-glm) for more information. BigQuery even automatically [pre-processes the features](https://cloud.google.com/bigquery/docs/auto-preprocessing)!

Go ahead and create a new _Logistic Regression_ model in the dataset `dwh` with the name `churn_model`, based on the training data created in the previous step.

In [None]:
%%bigquery
# TODO Challenge 6: Create or replace a new Logistic Regression model with BQML

The training should take less than a minute, as we're dealing with a small dataset and the model converges relatively quickly.

The model is stored in BigQuery; however, it's also possible to [register it in the Vertex AI Model Registry](https://cloud.google.com/bigquery/docs/create_vertex) in order to use the rest of the Vertex AI services.

### Evaluation

Great, we have a model now, but how good is it?

In [None]:
%%bigquery
SELECT
 *
FROM
 ML.EVALUATE(MODEL `dwh.churn_model`)

That doesn't look too bad for the amount of effort that we spent on this (you should see an ROC AUC value of > 0.8)!
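To build some intuition for that number: ROC AUC can be read as the probability that a randomly chosen churned customer gets a higher predicted score than a randomly chosen non-churned one. The sketch below computes it by brute force on hypothetical labels and scores; it only illustrates the metric and is not how BigQuery calculates it internally.

```python
def roc_auc(labels, scores):
    """Probability that a random positive outranks a random negative.

    Ties count as half a win. Quadratic in the input size, so only
    meant for illustration.
    """
    positives = [s for y, s in zip(labels, scores) if y]
    negatives = [s for y, s in zip(labels, scores) if not y]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical churn labels and predicted churn probabilities.
y_true = [True, True, False, False, True]
y_score = [0.9, 0.4, 0.5, 0.1, 0.8]
auc = roc_auc(y_true, y_score)
print(auc)
```

A value of 0.5 corresponds to random guessing and 1.0 to a perfect ranking, so anything above 0.8 means the model ranks churners above non-churners most of the time.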

## Conclusion

This concludes our data science adventure. With this notebook we've shown how to connect to BigQuery from an interactive environment, use familiar Python libraries and train models using BQML.