{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load data\n",
"\n",
    "This dataset contains transactions from an online retail store based in the UK. The file takes a moment to load. I downloaded it as an .xls file from the [UCI machine learning dataset library](https://archive.ics.uci.edu/ml/datasets/online+retail#) and used Excel to save it as a CSV. \n",
"\n",
    "I will largely follow the process described in a [Feature Tools tutorial](https://github.com/Featuretools/Automated-Manual-Comparison/blob/master/Retail%20Spending/notebooks/Automated%20Retail%20Spending.ipynb). Note that the tutorial renamed the columns from the original file (which I have not done). "
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"%matplotlib inline\n",
"\n",
"create_data = False"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Field info (from UCI machine learning dataset website):\n",
    "* InvoiceNo: Invoice number. Nominal, a 6-digit integral number uniquely assigned to each transaction. If this code starts with the letter 'C', it indicates a cancellation.\n",
"* StockCode: Product (item) code. Nominal, a 5-digit integral number uniquely assigned to each distinct product.\n",
"* Description: Product (item) name. Nominal.\n",
"* Quantity: The quantities of each product (item) per transaction. Numeric.\n",
    "* InvoiceDate: Invoice date and time. Numeric, the day and time when each transaction was generated.\n",
"* UnitPrice: Unit price. Numeric, Product price per unit in sterling.\n",
"* CustomerID: Customer number. Nominal, a 5-digit integral number uniquely assigned to each customer.\n",
"* Country: Country name. Nominal, the name of the country where each customer resides. "
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"df = pd.read_csv('./data/raw/OnlineRetail.csv', parse_dates=[\"InvoiceDate\"])"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
    "# Restrict data to 2011, following the tutorial (which does not say why).\n",
"df = df[df['InvoiceDate'].dt.year == 2011]"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Index(['InvoiceNo', 'StockCode', 'Description', 'Quantity', 'InvoiceDate',\n",
" 'UnitPrice', 'CustomerID', 'Country'],\n",
" dtype='object')"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.columns"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"499428"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(df)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"# label cancellation orders\n",
"df['Cancelled'] = df['InvoiceNo'].str.startswith('C')"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"# drop the duplicates\n",
"df = df.drop_duplicates()"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
    "# drop rows containing nulls (in this data, chiefly missing CustomerID values)\n",
"df = df.dropna(axis=0)"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
    "# Convert sterling to dollars (at 1.65 USD/GBP) and create a field for the total spent per line item\n",
"df['UnitPrice'] = df['UnitPrice'] * 1.65\n",
"df['Total'] = df['UnitPrice'] * df['Quantity']"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" Quantity UnitPrice CustomerID Total\n",
"count 375250.000000 375250.000000 375250.000000 375250.000000\n",
"mean 12.252296 5.762031 15265.136168 33.972397\n",
"std 258.180877 119.054102 1710.927381 733.803756\n",
"min -80995.000000 0.000000 12346.000000 -277974.840000\n",
"25% 2.000000 2.062500 13901.000000 7.012500\n",
"50% 5.000000 3.217500 15108.000000 19.305000\n",
"75% 12.000000 6.187500 16767.000000 32.670000\n",
"max 80995.000000 64300.500000 18287.000000 277974.840000"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.describe()"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" InvoiceNo StockCode Description Quantity \\\n",
"42481 539993 22386 JUMBO BAG PINK POLKADOT 10 \n",
"42482 539993 21499 BLUE POLKADOT WRAP 25 \n",
"42483 539993 21498 RED RETROSPOT WRAP 25 \n",
"42484 539993 22379 RECYCLING BAG RETROSPOT 5 \n",
"42485 539993 20718 RED RETROSPOT SHOPPER BAG 10 \n",
"\n",
" InvoiceDate UnitPrice CustomerID Country Cancelled \\\n",
"42481 2011-01-04 10:00:00 3.2175 13313.0 United Kingdom False \n",
"42482 2011-01-04 10:00:00 0.6930 13313.0 United Kingdom False \n",
"42483 2011-01-04 10:00:00 0.6930 13313.0 United Kingdom False \n",
"42484 2011-01-04 10:00:00 3.4650 13313.0 United Kingdom False \n",
"42485 2011-01-04 10:00:00 2.0625 13313.0 United Kingdom False \n",
"\n",
" Total \n",
"42481 32.175 \n",
"42482 17.325 \n",
"42483 17.325 \n",
"42484 17.325 \n",
"42485 20.625 "
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.head()"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"InvoiceNo object\n",
"StockCode object\n",
"Description object\n",
"Quantity int64\n",
"InvoiceDate datetime64[ns]\n",
"UnitPrice float64\n",
"CustomerID float64\n",
"Country object\n",
"Cancelled bool\n",
"Total float64\n",
"dtype: object"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.dtypes"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Create response variable\n",
"\n",
    "The following code is a little involved: for each customer, I compute how much they spent each month, including months with no transactions."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [],
"source": [
"def monthly_spend(df):\n",
" \"\"\"Identify how much each customer spent each month.\"\"\"\n",
" df.index = df['InvoiceDate']\n",
" monthly_sums = df.groupby('CustomerID').resample('MS')['Total'].sum()\n",
" return monthly_sums.reset_index()\n",
"\n",
"\n",
"def customer_by_month(monthly_data, df):\n",
" \"\"\"Create an index for each customer for each month of the data set.\"\"\"\n",
" labels = monthly_data.set_index(['InvoiceDate', 'CustomerID'])\n",
" midx = pd.MultiIndex.from_product(\n",
" [pd.date_range('2011-01-01', '2012-01-01', freq='MS'), df['CustomerID'].unique()],\n",
" names=labels.index.names)\n",
" return labels.reindex(midx, fill_value=0).reset_index()\n",
"\n",
"\n",
"def monthly_min_date(monthly_data):\n",
" \"\"\"Create a table which has all months since a customer's first transaction\"\"\"\n",
" min_dates = (monthly_data\n",
" .groupby('CustomerID')['InvoiceDate']\n",
" .min()\n",
" .apply(lambda x: pd.date_range(x, '2012-01-01', freq='MS').tolist())\n",
" .reset_index()\n",
" )\n",
" return pd.DataFrame([(x, i) for x, y in zip(min_dates['CustomerID'], min_dates['InvoiceDate']) for i in y],\n",
" columns=['CustomerID', 'InvoiceDate'])\n",
"\n",
"\n",
    "# Join the table where each customer has a record for every month with the table where each customer\n",
    "# has a record for every month since their first transaction. This way customers do not have records\n",
    "# before their first transaction.\n",
"monthly_df = monthly_spend(df)\n",
"labels = customer_by_month(monthly_df, df).merge(monthly_min_date(monthly_df),\n",
" on=['CustomerID', 'InvoiceDate'],\n",
" how='inner')"
]
},
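  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To see what the `MultiIndex` reindexing above does, here is a tiny self-contained sketch on toy data (not the retail set): `reindex` with `fill_value=0` gives every (month, customer) pair a row, so months with no transactions show a total of zero.\n",
    "\n",
    "```python\n",
    "toy = pd.DataFrame({'InvoiceDate': pd.to_datetime(['2011-01-01']),\n",
    "                    'CustomerID': [1], 'Total': [10.0]}).set_index(['InvoiceDate', 'CustomerID'])\n",
    "midx = pd.MultiIndex.from_product(\n",
    "    [pd.date_range('2011-01-01', '2011-03-01', freq='MS'), [1, 2]],\n",
    "    names=['InvoiceDate', 'CustomerID'])\n",
    "toy.reindex(midx, fill_value=0)  # six rows, one per (month, customer) pair\n",
    "```"
   ]
  },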
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
    "# The tutorial starts with March data.\n",
"labels = labels[labels['InvoiceDate'] >= '2011-03-01']"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" InvoiceDate CustomerID Total\n",
"2065 2011-03-01 13313.0 0.000\n",
"2066 2011-03-01 18097.0 0.000\n",
"2067 2011-03-01 16656.0 589.248\n",
"2068 2011-03-01 16875.0 0.000\n",
"2069 2011-03-01 13094.0 115.434"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"labels.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
    "Notice that I have more samples than the Featuretools tutorial because they accidentally removed all transactions that occurred on the first of the month."
]
},
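  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A quick way to check this (a sketch relying on the `df` defined above) is to count how many line items fall on the first day of a month:\n",
    "\n",
    "```python\n",
    "# line items recorded on the first of any month\n",
    "(df['InvoiceDate'].dt.day == 1).sum()\n",
    "```"
   ]
  },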
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" CustomerID Total\n",
"count 35339.000000 35339.000000\n",
"mean 15276.044399 318.284416\n",
"std 1720.194897 1698.736413\n",
"min 12346.000000 -7074.589500\n",
"25% 13791.000000 0.000000\n",
"50% 15247.000000 0.000000\n",
"75% 16757.000000 260.460750\n",
"max 18287.000000 115906.725000"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"labels.describe()"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" InvoiceDate CustomerID Total\n",
"2731 2011-03-01 12347.0 0.0000\n",
"4543 2011-04-01 12347.0 1049.8125\n",
"6715 2011-05-01 12347.0 0.0000\n",
"9204 2011-06-01 12347.0 631.1580\n",
"11956 2011-07-01 12347.0 0.0000\n",
"14926 2011-08-01 12347.0 965.1015\n",
"18080 2011-09-01 12347.0 0.0000\n",
"21547 2011-10-01 12347.0 2135.6280\n",
"25385 2011-11-01 12347.0 0.0000\n",
"29582 2011-12-01 12347.0 370.9530\n",
"33826 2012-01-01 12347.0 0.0000"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"labels.loc[labels['CustomerID'] == 12347]"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"370.953"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df[(df['CustomerID'] == 12347) & (df['InvoiceDate'] >= '2011-12-01')]['Total'].sum()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Feature Automation"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
    "This next series of commands loads the data into Featuretools and describes the dataset's structure to it."
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [],
"source": [
"import featuretools as ft"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"2018-09-22 09:06:26,324 featuretools.entityset - WARNING index purchases_index not found in dataframe, creating new integer column\n"
]
},
{
"data": {
"text/plain": [
"Entity: purchases\n",
" Variables:\n",
" purchases_index (dtype: index)\n",
" InvoiceNo (dtype: categorical)\n",
" StockCode (dtype: categorical)\n",
" Quantity (dtype: numeric)\n",
" InvoiceDate (dtype: datetime_time_index)\n",
" UnitPrice (dtype: numeric)\n",
" CustomerID (dtype: numeric)\n",
" Country (dtype: categorical)\n",
" Cancelled (dtype: boolean)\n",
" Total (dtype: numeric)\n",
" Description (dtype: text)\n",
" Shape:\n",
" (Rows: 375250, Columns: 11)"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"es = ft.EntitySet(id=\"Online Retail Logs\")\n",
"\n",
"# Add the entire data table as an entity\n",
"es.entity_from_dataframe(\"purchases\", # name of entity set\n",
" dataframe=df, # data\n",
" index=\"purchases_index\", # name of new index\n",
" time_index = 'InvoiceDate', # time associated with each row\n",
    "                         variable_types = {'Description': ft.variable_types.Text}) # specify variable type\n",
"\n",
"es['purchases']"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" StockCode Description first_purchases_time\n",
"20682 20682 RED RETROSPOT CHILDRENS UMBRELLA 2011-01-04 10:00:00\n",
"20718 20718 RED RETROSPOT SHOPPER BAG 2011-01-04 10:00:00\n",
"21498 21498 RED RETROSPOT WRAP 2011-01-04 10:00:00\n",
"21499 21499 BLUE POLKADOT WRAP 2011-01-04 10:00:00\n",
"22302 22302 COFFEE MUG PEARS DESIGN 2011-01-04 10:00:00"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"es.normalize_entity(new_entity_id=\"products\", # new entity for products. \n",
" base_entity_id=\"purchases\", # built from purchases entity\n",
" index=\"StockCode\", # Index with StockCode column from purchases entity. (unique key)\n",
" additional_variables=[\"Description\"]) # bring in this variable\n",
"\n",
"es['products'].df.head()"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Entityset: Online Retail Logs\n",
" Entities:\n",
" purchases [Rows: 375250, Columns: 8]\n",
" products [Rows: 3612, Columns: 3]\n",
" customers [Rows: 4244, Columns: 2]\n",
" orders [Rows: 20482, Columns: 4]\n",
" Relationships:\n",
" purchases.StockCode -> products.StockCode\n",
" purchases.CustomerID -> customers.CustomerID\n",
" purchases.InvoiceNo -> orders.InvoiceNo"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"es.normalize_entity(new_entity_id=\"customers\", # customer entity\n",
" base_entity_id=\"purchases\", # from purchases\n",
" index=\"CustomerID\") # unique key is CustomerID\n",
"\n",
"es.normalize_entity(new_entity_id=\"orders\", # order entity\n",
" base_entity_id=\"purchases\", # from purchases\n",
" index=\"InvoiceNo\", # unique key is InvoiceNo\n",
" additional_variables=[\"Country\", 'Cancelled']) # Include these variables.\n",
"\n",
"es"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" CustomerID first_purchases_time\n",
"13313.0 13313.0 2011-01-04 10:00:00\n",
"18097.0 18097.0 2011-01-04 10:22:00\n",
"16656.0 16656.0 2011-01-04 10:23:00\n",
"13094.0 13094.0 2011-01-04 10:37:00\n",
"16875.0 16875.0 2011-01-04 10:37:00"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"es['customers'].df.head()"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" InvoiceNo Country Cancelled first_purchases_time\n",
"539993 539993 United Kingdom False 2011-01-04 10:00:00\n",
"540001 540001 United Kingdom False 2011-01-04 10:22:00\n",
"540002 540002 United Kingdom False 2011-01-04 10:23:00\n",
"540003 540003 United Kingdom False 2011-01-04 10:37:00\n",
"540004 540004 United Kingdom False 2011-01-04 10:37:00"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"es['orders'].df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Deep Feature Synthesis"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
    "Here, we use Featuretools' deep feature synthesis to create new features."
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [],
"source": [
"if create_data:\n",
" feature_matrix,_ = ft.dfs(entityset=es, # entity\n",
" target_entity='customers', # what we're trying to predict\n",
" agg_primitives=['max', 'min', 'mode', 'mean', 'avg_time_between'], # requested aggregations\n",
" trans_primitives=['day', 'month', 'hour', 'weekend'], # requested transformations\n",
" cutoff_time=labels, # define time period of predictions\n",
" verbose=1, # how much stdout\n",
" cutoff_time_in_index=True, # specify that we've given cutoff times\n",
" chunk_size=50, # how much data to give each worker\n",
" n_jobs=-1, # how many threads to create\n",
" max_depth=1) # how many aggregations\n",
"\n",
" feature_matrix = feature_matrix.reset_index()\n",
" feature_matrix.to_csv('./data/processed/dfs_depth1.csv')\n",
" feature_matrix.head()\n",
"else:\n",
" feature_matrix = pd.read_csv('./data/processed/dfs_depth1.csv')\n",
" feature_matrix['time'] = pd.to_datetime(feature_matrix['time'])\n",
" feature_matrix = feature_matrix.drop('Unnamed: 0', axis=1)"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Index(['CustomerID', 'time', 'MAX(purchases.Quantity)',\n",
" 'MAX(purchases.UnitPrice)', 'MAX(purchases.Total)',\n",
" 'MIN(purchases.Quantity)', 'MIN(purchases.UnitPrice)',\n",
" 'MIN(purchases.Total)', 'MODE(purchases.InvoiceNo)',\n",
" 'MODE(purchases.StockCode)', 'MEAN(purchases.Quantity)',\n",
" 'MEAN(purchases.UnitPrice)', 'MEAN(purchases.Total)',\n",
" 'AVG_TIME_BETWEEN(purchases.InvoiceDate)', 'DAY(first_purchases_time)',\n",
" 'MONTH(first_purchases_time)', 'HOUR(first_purchases_time)',\n",
" 'WEEKEND(first_purchases_time)', 'Total'],\n",
" dtype='object')"
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"feature_matrix.columns"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"CustomerID 12346\n",
"time 2011-03-01 00:00:00\n",
"MAX(purchases.Quantity) 74215\n",
"Name: 0, dtype: object"
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# look at a single output\n",
"feature_matrix.iloc[0, :3]"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"74215"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# demonstrate we understand the output\n",
"df[(df['CustomerID'] == 12346) & (df['InvoiceDate'] < '2011-03-01')]['Quantity'].max()"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(35339, 19)"
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"feature_matrix.shape"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [],
"source": [
"# create categorical response variable\n",
"feature_matrix['Total'] = feature_matrix['Total'].apply(lambda x: 1 if x > 500 else 0) "
]
},
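  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "An equivalent, vectorized way to write the thresholding above (a sketch; it produces the same 0/1 labels) is:\n",
    "\n",
    "```python\n",
    "feature_matrix['Total'] = (feature_matrix['Total'] > 500).astype(int)\n",
    "```"
   ]
  },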
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"def create_train_test(month, feature_matrix, drop_cols=['CustomerID', 'time', 'month', 'Total']):\n",
" \"\"\"Basic cleaning and return train/test data.\"\"\"\n",
" \n",
" # remove columns we know will not contribute\n",
    "    feature_matrix = feature_matrix.drop(columns=['MODE(purchases.InvoiceNo)', 'MODE(purchases.StockCode)'])\n",
" # dummy code strings\n",
" feature_matrix = pd.get_dummies(feature_matrix)\n",
" # fill nans\n",
" feature_matrix = feature_matrix.fillna(0)\n",
" \n",
" # Labels\n",
" feature_matrix['month'] = feature_matrix['time'].dt.month\n",
" train_labels = feature_matrix.loc[feature_matrix['month'] < month, 'Total']\n",
" test_labels = feature_matrix.loc[feature_matrix['month'] >= month, 'Total']\n",
" y_train = np.array(train_labels).reshape((-1, ))\n",
" y_test = np.array(test_labels).reshape((-1, ))\n",
"\n",
" # Features\n",
" X_train = feature_matrix[feature_matrix['time'].dt.month < month].drop(columns=drop_cols)\n",
" X_test = feature_matrix[feature_matrix['time'].dt.month >= month].drop(columns=drop_cols)\n",
" \n",
" return (X_train, X_test, y_train, y_test)"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [],
"source": [
"X_train, X_test, y_train, y_test = create_train_test(11, feature_matrix)"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0.183619599970258\n",
"0.16076294277929154\n"
]
}
],
"source": [
"print(np.mean(y_train))\n",
"print(np.mean(y_test))"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"LogisticRegression(C=1.0, class_weight='balanced', dual=False,\n",
" fit_intercept=True, intercept_scaling=1, max_iter=100,\n",
" multi_class='ovr', n_jobs=1, penalty='l2', random_state=0,\n",
" solver='liblinear', tol=0.0001, verbose=0, warm_start=False)"
]
},
"execution_count": 34,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sklearn.linear_model import LogisticRegression\n",
"\n",
"model = LogisticRegression(random_state=0, class_weight='balanced')\n",
"model.fit(X_train, y_train)"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score\n",
"\n",
"def print_performance(X, y, model):\n",
" \"\"\"Print model performance metrics.\"\"\"\n",
" predictions = model.predict(X)\n",
" probs = model.predict_proba(X)[:, 1]\n",
"\n",
" # Calculate metrics\n",
" print('Precision: {}'.format(round(precision_score(y, predictions), 5)))\n",
" print('Recall: {}'.format(round(recall_score(y, predictions), 5)))\n",
" print('F1 Score: {}'.format(round(f1_score(y, predictions), 5)))\n",
" print('ROC AUC: {}'.format(round(roc_auc_score(y, probs), 5)))"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Precision: 0.44233\n",
"Recall: 0.43794\n",
"F1 Score: 0.44013\n",
"ROC AUC: 0.73158\n"
]
}
],
"source": [
"print_performance(X_train, y_train, model)"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Precision: 0.40087\n",
"Recall: 0.33972\n",
"F1 Score: 0.36777\n",
"ROC AUC: 0.69795\n"
]
}
],
"source": [
"print_performance(X_test, y_test, model)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Deeper Features"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
    "Beware: with `create_data = True`, this cell takes a long time to run!"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [],
"source": [
"if create_data:\n",
" feature_matrix,_ = ft.dfs(entityset=es,\n",
" target_entity='customers', # what we're trying to predict\n",
" agg_primitives=['max', 'min', 'mode', 'mean', 'avg_time_between'], # requested aggs\n",
" trans_primitives=['day', 'month', 'hour', 'weekend'], # requested transformations\n",
" n_jobs=-1,\n",
" chunk_size=50,\n",
" max_depth=2, # how many aggregations to do\n",
" cutoff_time=labels,\n",
" cutoff_time_in_index=True,\n",
" verbose=1)\n",
" \n",
" feature_matrix = feature_matrix.reset_index()\n",
" feature_matrix.to_csv('./data/processed/dfs_depth2.csv')\n",
" feature_matrix.head()\n",
"else:\n",
" feature_matrix = pd.read_csv('./data/processed/dfs_depth2.csv')\n",
" feature_matrix['time'] = pd.to_datetime(feature_matrix['time']) \n",
" feature_matrix = feature_matrix.drop('Unnamed: 0', axis=1)"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(35339, 23)"
]
},
"execution_count": 39,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"feature_matrix.shape"
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Index(['CustomerID', 'time', 'MAX(purchases.Quantity)',\n",
" 'MAX(purchases.UnitPrice)', 'MAX(purchases.Total)',\n",
" 'MIN(purchases.Quantity)', 'MIN(purchases.UnitPrice)',\n",
" 'MIN(purchases.Total)', 'MODE(purchases.InvoiceNo)',\n",
" 'MODE(purchases.StockCode)', 'MEAN(purchases.Quantity)',\n",
" 'MEAN(purchases.UnitPrice)', 'MEAN(purchases.Total)',\n",
" 'AVG_TIME_BETWEEN(purchases.InvoiceDate)', 'DAY(first_purchases_time)',\n",
" 'MONTH(first_purchases_time)', 'HOUR(first_purchases_time)',\n",
" 'WEEKEND(first_purchases_time)', 'MODE(purchases.DAY(InvoiceDate))',\n",
" 'MODE(purchases.MONTH(InvoiceDate))',\n",
" 'MODE(purchases.HOUR(InvoiceDate))', 'MODE(purchases.orders.Country)',\n",
" 'Total'],\n",
" dtype='object')"
]
},
"execution_count": 40,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"feature_matrix.columns"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"CustomerID 13481\n",
"time 2011-06-01 00:00:00\n",
"MODE(purchases.MONTH(InvoiceDate)) 3\n",
"Name: 7000, dtype: object"
]
},
"execution_count": 41,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# look at a single output\n",
"feature_matrix.iloc[7000, [0, 1, 19]]"
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"month\n",
"3 43\n",
"4 1\n",
"5 17\n",
"Name: Total, dtype: int64"
]
},
"execution_count": 42,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# demonstrate we understand the output\n",
"df['month'] = df['InvoiceDate'].dt.month\n",
"df[(df['CustomerID'] == 13481) & (df['InvoiceDate'] < '2011-06-01')].groupby('month')['Total'].count()"
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {},
"outputs": [],
"source": [
"# create categorical response variable\n",
"feature_matrix['Total'] = feature_matrix['Total'].apply(lambda x: 1 if x > 500 else 0) "
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {},
"outputs": [],
"source": [
"X_train, X_test, y_train, y_test = create_train_test(11, feature_matrix)"
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"LogisticRegression(C=1.0, class_weight='balanced', dual=False,\n",
" fit_intercept=True, intercept_scaling=1, max_iter=100,\n",
" multi_class='ovr', n_jobs=1, penalty='l2', random_state=0,\n",
" solver='liblinear', tol=0.0001, verbose=0, warm_start=False)"
]
},
"execution_count": 45,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"model = LogisticRegression(random_state=0, class_weight='balanced')\n",
"model.fit(X_train, y_train)"
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Precision: 0.4444\n",
"Recall: 0.44503\n",
"F1 Score: 0.44471\n",
"ROC AUC: 0.72874\n"
]
}
],
"source": [
"print_performance(X_train, y_train, model)"
]
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Precision: 0.41033\n",
"Recall: 0.32203\n",
"F1 Score: 0.36086\n",
"ROC AUC: 0.68288\n"
]
}
],
"source": [
"print_performance(X_test, y_test, model)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python [conda env:featureautomation]",
"language": "python",
"name": "conda-env-featureautomation-py"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.6"
}
},
"nbformat": 4,
"nbformat_minor": 2
}