{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Load data\n", "\n", "This dataset contains transactions of an online retail store in the UK. The files takes a bit to load. I downloaded this file as an xls from the [UCI machine learning dataset library](https://archive.ics.uci.edu/ml/datasets/online+retail#). I used excel to save it as a csv. \n", "\n", "I will largely follow ths process described in a [Feature Tools tutorial](https://github.com/Featuretools/Automated-Manual-Comparison/blob/master/Retail%20Spending/notebooks/Automated%20Retail%20Spending.ipynb). Note that the tutorial changed the column names from the original file (which I have not done). " ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "%matplotlib inline\n", "\n", "create_data = False" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Field info (from UCI machine learning dataset website):\n", "* InvoiceNo: Invoice number. Nominal, a 6-digit integral number uniquely assigned to each transaction. If this code starts with letter 'c', it indicates a cancellation.\n", "* StockCode: Product (item) code. Nominal, a 5-digit integral number uniquely assigned to each distinct product.\n", "* Description: Product (item) name. Nominal.\n", "* Quantity: The quantities of each product (item) per transaction. Numeric.\n", "* InvoiceDate: Invice Date and time. Numeric, the day and time when each transaction was generated.\n", "* UnitPrice: Unit price. Numeric, Product price per unit in sterling.\n", "* CustomerID: Customer number. Nominal, a 5-digit integral number uniquely assigned to each customer.\n", "* Country: Country name. Nominal, the name of the country where each customer resides. 
" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "df = pd.read_csv('./data/raw/OnlineRetail.csv', parse_dates=[\"InvoiceDate\"])" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "# Restrict data to 2011 - from tutorial. not sure why they did this.\n", "df = df[df['InvoiceDate'].dt.year == 2011]" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index(['InvoiceNo', 'StockCode', 'Description', 'Quantity', 'InvoiceDate',\n", " 'UnitPrice', 'CustomerID', 'Country'],\n", " dtype='object')" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.columns" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "499428" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(df)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "# label cancellation orders\n", "df['Cancelled'] = df['InvoiceNo'].str.startswith('C')" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "# drop the duplicates\n", "df = df.drop_duplicates()" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "# drop rows with null customer id\n", "df = df.dropna(axis=0)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "# Convert to dollars and create a field representing the total spent\n", "df['UnitPrice'] = df['UnitPrice'] * 1.65\n", "df['Total'] = df['UnitPrice'] * df['Quantity']" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
QuantityUnitPriceCustomerIDTotal
count375250.000000375250.000000375250.000000375250.000000
mean12.2522965.76203115265.13616833.972397
std258.180877119.0541021710.927381733.803756
min-80995.0000000.00000012346.000000-277974.840000
25%2.0000002.06250013901.0000007.012500
50%5.0000003.21750015108.00000019.305000
75%12.0000006.18750016767.00000032.670000
max80995.00000064300.50000018287.000000277974.840000
\n", "
" ], "text/plain": [ " Quantity UnitPrice CustomerID Total\n", "count 375250.000000 375250.000000 375250.000000 375250.000000\n", "mean 12.252296 5.762031 15265.136168 33.972397\n", "std 258.180877 119.054102 1710.927381 733.803756\n", "min -80995.000000 0.000000 12346.000000 -277974.840000\n", "25% 2.000000 2.062500 13901.000000 7.012500\n", "50% 5.000000 3.217500 15108.000000 19.305000\n", "75% 12.000000 6.187500 16767.000000 32.670000\n", "max 80995.000000 64300.500000 18287.000000 277974.840000" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.describe()" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
InvoiceNoStockCodeDescriptionQuantityInvoiceDateUnitPriceCustomerIDCountryCancelledTotal
4248153999322386JUMBO BAG PINK POLKADOT102011-01-04 10:00:003.217513313.0United KingdomFalse32.175
4248253999321499BLUE POLKADOT WRAP252011-01-04 10:00:000.693013313.0United KingdomFalse17.325
4248353999321498RED RETROSPOT WRAP252011-01-04 10:00:000.693013313.0United KingdomFalse17.325
4248453999322379RECYCLING BAG RETROSPOT52011-01-04 10:00:003.465013313.0United KingdomFalse17.325
4248553999320718RED RETROSPOT SHOPPER BAG102011-01-04 10:00:002.062513313.0United KingdomFalse20.625
\n", "
" ], "text/plain": [ " InvoiceNo StockCode Description Quantity \\\n", "42481 539993 22386 JUMBO BAG PINK POLKADOT 10 \n", "42482 539993 21499 BLUE POLKADOT WRAP 25 \n", "42483 539993 21498 RED RETROSPOT WRAP 25 \n", "42484 539993 22379 RECYCLING BAG RETROSPOT 5 \n", "42485 539993 20718 RED RETROSPOT SHOPPER BAG 10 \n", "\n", " InvoiceDate UnitPrice CustomerID Country Cancelled \\\n", "42481 2011-01-04 10:00:00 3.2175 13313.0 United Kingdom False \n", "42482 2011-01-04 10:00:00 0.6930 13313.0 United Kingdom False \n", "42483 2011-01-04 10:00:00 0.6930 13313.0 United Kingdom False \n", "42484 2011-01-04 10:00:00 3.4650 13313.0 United Kingdom False \n", "42485 2011-01-04 10:00:00 2.0625 13313.0 United Kingdom False \n", "\n", " Total \n", "42481 32.175 \n", "42482 17.325 \n", "42483 17.325 \n", "42484 17.325 \n", "42485 20.625 " ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head()" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "InvoiceNo object\n", "StockCode object\n", "Description object\n", "Quantity int64\n", "InvoiceDate datetime64[ns]\n", "UnitPrice float64\n", "CustomerID float64\n", "Country object\n", "Cancelled bool\n", "Total float64\n", "dtype: object" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.dtypes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Create response variable\n", "\n", "The following code is a little complex. I compute how much a customer spent each month (including months with no transactions)." 
] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "def monthly_spend(df):\n", " \"\"\"Identify how much each customer spent each month.\"\"\"\n", " df.index = df['InvoiceDate']\n", " monthly_sums = df.groupby('CustomerID').resample('MS')['Total'].sum()\n", " return monthly_sums.reset_index()\n", "\n", "\n", "def customer_by_month(monthly_data, df):\n", " \"\"\"Create an index for each customer for each month of the data set.\"\"\"\n", " labels = monthly_data.set_index(['InvoiceDate', 'CustomerID'])\n", " midx = pd.MultiIndex.from_product(\n", " [pd.date_range('2011-01-01', '2012-01-01', freq='MS'), df['CustomerID'].unique()],\n", " names=labels.index.names)\n", " return labels.reindex(midx, fill_value=0).reset_index()\n", "\n", "\n", "def monthly_min_date(monthly_data):\n", " \"\"\"Create a table which has all months since a customer's first transaction\"\"\"\n", " min_dates = (monthly_data\n", " .groupby('CustomerID')['InvoiceDate']\n", " .min()\n", " .apply(lambda x: pd.date_range(x, '2012-01-01', freq='MS').tolist())\n", " .reset_index()\n", " )\n", " return pd.DataFrame([(x, i) for x, y in zip(min_dates['CustomerID'], min_dates['InvoiceDate']) for i in y],\n", " columns=['CustomerID', 'InvoiceDate'])\n", "\n", "\n", "# join the table where each customer has a record each month with the table where each customer has a \n", "# record since account creation. 
This way customers do not have records before their first transaction.\n", "monthly_df = monthly_spend(df)\n", "labels = customer_by_month(monthly_df, df).merge(monthly_min_date(monthly_df),\n", " on=['CustomerID', 'InvoiceDate'],\n", " how='inner')" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "# The tutorial starts with March data.\n", "labels = labels[labels['InvoiceDate'] >= '2011-03-01']" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
InvoiceDateCustomerIDTotal
20652011-03-0113313.00.000
20662011-03-0118097.00.000
20672011-03-0116656.0589.248
20682011-03-0116875.00.000
20692011-03-0113094.0115.434
\n", "
" ], "text/plain": [ " InvoiceDate CustomerID Total\n", "2065 2011-03-01 13313.0 0.000\n", "2066 2011-03-01 18097.0 0.000\n", "2067 2011-03-01 16656.0 589.248\n", "2068 2011-03-01 16875.0 0.000\n", "2069 2011-03-01 13094.0 115.434" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "labels.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that I have more samples than the feature tools tutorial because they accidently removed all transactions on the first of the month..." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
CustomerIDTotal
count35339.00000035339.000000
mean15276.044399318.284416
std1720.1948971698.736413
min12346.000000-7074.589500
25%13791.0000000.000000
50%15247.0000000.000000
75%16757.000000260.460750
max18287.000000115906.725000
\n", "
" ], "text/plain": [ " CustomerID Total\n", "count 35339.000000 35339.000000\n", "mean 15276.044399 318.284416\n", "std 1720.194897 1698.736413\n", "min 12346.000000 -7074.589500\n", "25% 13791.000000 0.000000\n", "50% 15247.000000 0.000000\n", "75% 16757.000000 260.460750\n", "max 18287.000000 115906.725000" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "labels.describe()" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
InvoiceDateCustomerIDTotal
27312011-03-0112347.00.0000
45432011-04-0112347.01049.8125
67152011-05-0112347.00.0000
92042011-06-0112347.0631.1580
119562011-07-0112347.00.0000
149262011-08-0112347.0965.1015
180802011-09-0112347.00.0000
215472011-10-0112347.02135.6280
253852011-11-0112347.00.0000
295822011-12-0112347.0370.9530
338262012-01-0112347.00.0000
\n", "
" ], "text/plain": [ " InvoiceDate CustomerID Total\n", "2731 2011-03-01 12347.0 0.0000\n", "4543 2011-04-01 12347.0 1049.8125\n", "6715 2011-05-01 12347.0 0.0000\n", "9204 2011-06-01 12347.0 631.1580\n", "11956 2011-07-01 12347.0 0.0000\n", "14926 2011-08-01 12347.0 965.1015\n", "18080 2011-09-01 12347.0 0.0000\n", "21547 2011-10-01 12347.0 2135.6280\n", "25385 2011-11-01 12347.0 0.0000\n", "29582 2011-12-01 12347.0 370.9530\n", "33826 2012-01-01 12347.0 0.0000" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "labels.loc[labels['CustomerID'] == 12347]" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "370.953" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[(df['CustomerID'] == 12347) & (df['InvoiceDate'] >= '2011-12-01')]['Total'].sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Feature Automation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This next series of commands will load data into feature tools and explain the dataset to feature tools." 
] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "import featuretools as ft" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2018-09-22 09:06:26,324 featuretools.entityset - WARNING index purchases_index not found in dataframe, creating new integer column\n" ] }, { "data": { "text/plain": [ "Entity: purchases\n", " Variables:\n", " purchases_index (dtype: index)\n", " InvoiceNo (dtype: categorical)\n", " StockCode (dtype: categorical)\n", " Quantity (dtype: numeric)\n", " InvoiceDate (dtype: datetime_time_index)\n", " UnitPrice (dtype: numeric)\n", " CustomerID (dtype: numeric)\n", " Country (dtype: categorical)\n", " Cancelled (dtype: boolean)\n", " Total (dtype: numeric)\n", " Description (dtype: text)\n", " Shape:\n", " (Rows: 375250, Columns: 11)" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "es = ft.EntitySet(id=\"Online Retail Logs\")\n", "\n", "# Add the entire data table as an entity\n", "es.entity_from_dataframe(\"purchases\", # name of entity set\n", " dataframe=df, # data\n", " index=\"purchases_index\", # name of new index\n", " time_index = 'InvoiceDate', # time associated with each row\n", " variable_types = {'Description': ft.variable_types.Text}) # specifiy variable type\n", "\n", "es['purchases']" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
StockCodeDescriptionfirst_purchases_time
2068220682RED RETROSPOT CHILDRENS UMBRELLA2011-01-04 10:00:00
2071820718RED RETROSPOT SHOPPER BAG2011-01-04 10:00:00
2149821498RED RETROSPOT WRAP2011-01-04 10:00:00
2149921499BLUE POLKADOT WRAP2011-01-04 10:00:00
2230222302COFFEE MUG PEARS DESIGN2011-01-04 10:00:00
\n", "
" ], "text/plain": [ " StockCode Description first_purchases_time\n", "20682 20682 RED RETROSPOT CHILDRENS UMBRELLA 2011-01-04 10:00:00\n", "20718 20718 RED RETROSPOT SHOPPER BAG 2011-01-04 10:00:00\n", "21498 21498 RED RETROSPOT WRAP 2011-01-04 10:00:00\n", "21499 21499 BLUE POLKADOT WRAP 2011-01-04 10:00:00\n", "22302 22302 COFFEE MUG PEARS DESIGN 2011-01-04 10:00:00" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "es.normalize_entity(new_entity_id=\"products\", # new entity for products. \n", " base_entity_id=\"purchases\", # built from purchases entity\n", " index=\"StockCode\", # Index with StockCode column from purchases entity. (unique key)\n", " additional_variables=[\"Description\"]) # bring in this variable\n", "\n", "es['products'].df.head()" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Entityset: Online Retail Logs\n", " Entities:\n", " purchases [Rows: 375250, Columns: 8]\n", " products [Rows: 3612, Columns: 3]\n", " customers [Rows: 4244, Columns: 2]\n", " orders [Rows: 20482, Columns: 4]\n", " Relationships:\n", " purchases.StockCode -> products.StockCode\n", " purchases.CustomerID -> customers.CustomerID\n", " purchases.InvoiceNo -> orders.InvoiceNo" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "es.normalize_entity(new_entity_id=\"customers\", # customer entity\n", " base_entity_id=\"purchases\", # from purchases\n", " index=\"CustomerID\") # unique key is CustomerID\n", "\n", "es.normalize_entity(new_entity_id=\"orders\", # order entity\n", " base_entity_id=\"purchases\", # from purchases\n", " index=\"InvoiceNo\", # unique key is InvoiceNo\n", " additional_variables=[\"Country\", 'Cancelled']) # Include these variables.\n", "\n", "es" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
CustomerIDfirst_purchases_time
13313.013313.02011-01-04 10:00:00
18097.018097.02011-01-04 10:22:00
16656.016656.02011-01-04 10:23:00
13094.013094.02011-01-04 10:37:00
16875.016875.02011-01-04 10:37:00
\n", "
" ], "text/plain": [ " CustomerID first_purchases_time\n", "13313.0 13313.0 2011-01-04 10:00:00\n", "18097.0 18097.0 2011-01-04 10:22:00\n", "16656.0 16656.0 2011-01-04 10:23:00\n", "13094.0 13094.0 2011-01-04 10:37:00\n", "16875.0 16875.0 2011-01-04 10:37:00" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "es['customers'].df.head()" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
InvoiceNoCountryCancelledfirst_purchases_time
539993539993United KingdomFalse2011-01-04 10:00:00
540001540001United KingdomFalse2011-01-04 10:22:00
540002540002United KingdomFalse2011-01-04 10:23:00
540003540003United KingdomFalse2011-01-04 10:37:00
540004540004United KingdomFalse2011-01-04 10:37:00
\n", "
" ], "text/plain": [ " InvoiceNo Country Cancelled first_purchases_time\n", "539993 539993 United Kingdom False 2011-01-04 10:00:00\n", "540001 540001 United Kingdom False 2011-01-04 10:22:00\n", "540002 540002 United Kingdom False 2011-01-04 10:23:00\n", "540003 540003 United Kingdom False 2011-01-04 10:37:00\n", "540004 540004 United Kingdom False 2011-01-04 10:37:00" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "es['orders'].df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Deep Feature Synthesis" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here, we use featuretools to create new features." ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [], "source": [ "if create_data:\n", " feature_matrix,_ = ft.dfs(entityset=es, # entity\n", " target_entity='customers', # what we're trying to predict\n", " agg_primitives=['max', 'min', 'mode', 'mean', 'avg_time_between'], # requested aggregations\n", " trans_primitives=['day', 'month', 'hour', 'weekend'], # requested transformations\n", " cutoff_time=labels, # define time period of predictions\n", " verbose=1, # how much stdout\n", " cutoff_time_in_index=True, # specify that we've given cutoff times\n", " chunk_size=50, # how much data to give each worker\n", " n_jobs=-1, # how many threads to create\n", " max_depth=1) # how many aggregations\n", "\n", " feature_matrix = feature_matrix.reset_index()\n", " feature_matrix.to_csv('./data/processed/dfs_depth1.csv')\n", " feature_matrix.head()\n", "else:\n", " feature_matrix = pd.read_csv('./data/processed/dfs_depth1.csv')\n", " feature_matrix['time'] = pd.to_datetime(feature_matrix['time'])\n", " feature_matrix = feature_matrix.drop('Unnamed: 0', axis=1)" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index(['CustomerID', 'time', 'MAX(purchases.Quantity)',\n", " 'MAX(purchases.UnitPrice)', 
'MAX(purchases.Total)',\n", " 'MIN(purchases.Quantity)', 'MIN(purchases.UnitPrice)',\n", " 'MIN(purchases.Total)', 'MODE(purchases.InvoiceNo)',\n", " 'MODE(purchases.StockCode)', 'MEAN(purchases.Quantity)',\n", " 'MEAN(purchases.UnitPrice)', 'MEAN(purchases.Total)',\n", " 'AVG_TIME_BETWEEN(purchases.InvoiceDate)', 'DAY(first_purchases_time)',\n", " 'MONTH(first_purchases_time)', 'HOUR(first_purchases_time)',\n", " 'WEEKEND(first_purchases_time)', 'Total'],\n", " dtype='object')" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "feature_matrix.columns" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "CustomerID 12346\n", "time 2011-03-01 00:00:00\n", "MAX(purchases.Quantity) 74215\n", "Name: 0, dtype: object" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# look at a single output\n", "feature_matrix.iloc[0, :3]" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "74215" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# demonstrate we understand the output\n", "df[(df['CustomerID'] == 12346) & (df['InvoiceDate'] < '2011-03-01')]['Quantity'].max()" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(35339, 19)" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "feature_matrix.shape" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [], "source": [ "# create categorical response variable\n", "feature_matrix['Total'] = feature_matrix['Total'].apply(lambda x: 1 if x > 500 else 0) " ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "def create_train_test(month, feature_matrix, drop_cols=['CustomerID', 'time', 'month', 
'Total']):\n", " \"\"\"Basic cleaning and return train/test data.\"\"\"\n", " \n", " # remove columns we know will not contribute\n", " feature_matrix = feature_matrix.drop(columns= ['MODE(purchases.InvoiceNo)', 'MODE(purchases.StockCode)'])\n", " # dummy code strings\n", " feature_matrix = pd.get_dummies(feature_matrix)\n", " # fill nans\n", " feature_matrix = feature_matrix.fillna(0)\n", " \n", " # Labels\n", " feature_matrix['month'] = feature_matrix['time'].dt.month\n", " train_labels = feature_matrix.loc[feature_matrix['month'] < month, 'Total']\n", " test_labels = feature_matrix.loc[feature_matrix['month'] >= month, 'Total']\n", " y_train = np.array(train_labels).reshape((-1, ))\n", " y_test = np.array(test_labels).reshape((-1, ))\n", "\n", " # Features\n", " X_train = feature_matrix[feature_matrix['time'].dt.month < month].drop(columns=drop_cols)\n", " X_test = feature_matrix[feature_matrix['time'].dt.month >= month].drop(columns=drop_cols)\n", " \n", " return (X_train, X_test, y_train, y_test)" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [], "source": [ "X_train, X_test, y_train, y_test = create_train_test(11, feature_matrix)" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.183619599970258\n", "0.16076294277929154\n" ] } ], "source": [ "print(np.mean(y_train))\n", "print(np.mean(y_test))" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "LogisticRegression(C=1.0, class_weight='balanced', dual=False,\n", " fit_intercept=True, intercept_scaling=1, max_iter=100,\n", " multi_class='ovr', n_jobs=1, penalty='l2', random_state=0,\n", " solver='liblinear', tol=0.0001, verbose=0, warm_start=False)" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.linear_model import LogisticRegression\n", "\n", "model = 
LogisticRegression(random_state=0, class_weight='balanced')\n", "model.fit(X_train, y_train)" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [], "source": [ "from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score\n", "\n", "def print_performance(X, y, model):\n", " \"\"\"Print model performance metrics.\"\"\"\n", " predictions = model.predict(X)\n", " probs = model.predict_proba(X)[:, 1]\n", "\n", " # Calculate metrics\n", " print('Precision: {}'.format(round(precision_score(y, predictions), 5)))\n", " print('Recall: {}'.format(round(recall_score(y, predictions), 5)))\n", " print('F1 Score: {}'.format(round(f1_score(y, predictions), 5)))\n", " print('ROC AUC: {}'.format(round(roc_auc_score(y, probs), 5)))" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Precision: 0.44233\n", "Recall: 0.43794\n", "F1 Score: 0.44013\n", "ROC AUC: 0.73158\n" ] } ], "source": [ "print_performance(X_train, y_train, model)" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Precision: 0.40087\n", "Recall: 0.33972\n", "F1 Score: 0.36777\n", "ROC AUC: 0.69795\n" ] } ], "source": [ "print_performance(X_test, y_test, model)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Deeper Features" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Beware that this takes forever!!!" 
] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [], "source": [ "if create_data:\n", " feature_matrix,_ = ft.dfs(entityset=es,\n", " target_entity='customers', # what we're trying to predict\n", " agg_primitives=['max', 'min', 'mode', 'mean', 'avg_time_between'], # requested aggs\n", " trans_primitives=['day', 'month', 'hour', 'weekend'], # requested transformations\n", " n_jobs=-1,\n", " chunk_size=50,\n", " max_depth=2, # how many aggregations to do\n", " cutoff_time=labels,\n", " cutoff_time_in_index=True,\n", " verbose=1)\n", " \n", " feature_matrix = feature_matrix.reset_index()\n", " feature_matrix.to_csv('./data/processed/dfs_depth2.csv')\n", " feature_matrix.head()\n", "else:\n", " feature_matrix = pd.read_csv('./data/processed/dfs_depth2.csv')\n", " feature_matrix['time'] = pd.to_datetime(feature_matrix['time']) \n", " feature_matrix = feature_matrix.drop('Unnamed: 0', axis=1)" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(35339, 23)" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "feature_matrix.shape" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index(['CustomerID', 'time', 'MAX(purchases.Quantity)',\n", " 'MAX(purchases.UnitPrice)', 'MAX(purchases.Total)',\n", " 'MIN(purchases.Quantity)', 'MIN(purchases.UnitPrice)',\n", " 'MIN(purchases.Total)', 'MODE(purchases.InvoiceNo)',\n", " 'MODE(purchases.StockCode)', 'MEAN(purchases.Quantity)',\n", " 'MEAN(purchases.UnitPrice)', 'MEAN(purchases.Total)',\n", " 'AVG_TIME_BETWEEN(purchases.InvoiceDate)', 'DAY(first_purchases_time)',\n", " 'MONTH(first_purchases_time)', 'HOUR(first_purchases_time)',\n", " 'WEEKEND(first_purchases_time)', 'MODE(purchases.DAY(InvoiceDate))',\n", " 'MODE(purchases.MONTH(InvoiceDate))',\n", " 'MODE(purchases.HOUR(InvoiceDate))', 'MODE(purchases.orders.Country)',\n", " 'Total'],\n", " 
dtype='object')" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "feature_matrix.columns" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "CustomerID 13481\n", "time 2011-06-01 00:00:00\n", "MODE(purchases.MONTH(InvoiceDate)) 3\n", "Name: 7000, dtype: object" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# look at a single output\n", "feature_matrix.iloc[7000, [0, 1, 19]]" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "month\n", "3 43\n", "4 1\n", "5 17\n", "Name: Total, dtype: int64" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# demonstrate we understand the output\n", "df['month'] = df['InvoiceDate'].dt.month\n", "df[(df['CustomerID'] == 13481) & (df['InvoiceDate'] < '2011-06-01')].groupby('month')['Total'].count()" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [], "source": [ "# create categorical response variable\n", "feature_matrix['Total'] = feature_matrix['Total'].apply(lambda x: 1 if x > 500 else 0) " ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [], "source": [ "X_train, X_test, y_train, y_test = create_train_test(11, feature_matrix)" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "LogisticRegression(C=1.0, class_weight='balanced', dual=False,\n", " fit_intercept=True, intercept_scaling=1, max_iter=100,\n", " multi_class='ovr', n_jobs=1, penalty='l2', random_state=0,\n", " solver='liblinear', tol=0.0001, verbose=0, warm_start=False)" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model = LogisticRegression(random_state=0, class_weight='balanced')\n", "model.fit(X_train, y_train)" ] }, { "cell_type": "code", "execution_count": 
46, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Precision: 0.4444\n", "Recall: 0.44503\n", "F1 Score: 0.44471\n", "ROC AUC: 0.72874\n" ] } ], "source": [ "print_performance(X_train, y_train, model)" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Precision: 0.41033\n", "Recall: 0.32203\n", "F1 Score: 0.36086\n", "ROC AUC: 0.68288\n" ] } ], "source": [ "print_performance(X_test, y_test, model)" ] } ], "metadata": { "kernelspec": { "display_name": "Python [conda env:featureautomation]", "language": "python", "name": "conda-env-featureautomation-py" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.6" } }, "nbformat": 4, "nbformat_minor": 2 }