{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Load data\n", "\n", "This dataset contains transactions of an online retail store in the UK. The files takes a bit to load. I downloaded this file as an xls from the [UCI machine learning dataset library](https://archive.ics.uci.edu/ml/datasets/online+retail#). I used excel to save it as a csv. \n", "\n", "I will largely follow ths process described in a [Feature Tools tutorial](https://github.com/Featuretools/Automated-Manual-Comparison/blob/master/Retail%20Spending/notebooks/Automated%20Retail%20Spending.ipynb). Note that the tutorial changed the column names from the original file (which I have not done). " ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "%matplotlib inline\n", "\n", "create_data = False" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Field info (from UCI machine learning dataset website):\n", "* InvoiceNo: Invoice number. Nominal, a 6-digit integral number uniquely assigned to each transaction. If this code starts with letter 'c', it indicates a cancellation.\n", "* StockCode: Product (item) code. Nominal, a 5-digit integral number uniquely assigned to each distinct product.\n", "* Description: Product (item) name. Nominal.\n", "* Quantity: The quantities of each product (item) per transaction. Numeric.\n", "* InvoiceDate: Invice Date and time. Numeric, the day and time when each transaction was generated.\n", "* UnitPrice: Unit price. Numeric, Product price per unit in sterling.\n", "* CustomerID: Customer number. Nominal, a 5-digit integral number uniquely assigned to each customer.\n", "* Country: Country name. Nominal, the name of the country where each customer resides. 
" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "df = pd.read_csv('./data/raw/OnlineRetail.csv', parse_dates=[\"InvoiceDate\"])" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "# Restrict data to 2011 - from tutorial. not sure why they did this.\n", "df = df[df['InvoiceDate'].dt.year == 2011]" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index(['InvoiceNo', 'StockCode', 'Description', 'Quantity', 'InvoiceDate',\n", " 'UnitPrice', 'CustomerID', 'Country'],\n", " dtype='object')" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.columns" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "499428" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(df)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "# label cancellation orders\n", "df['Cancelled'] = df['InvoiceNo'].str.startswith('C')" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "# drop the duplicates\n", "df = df.drop_duplicates()" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "# drop rows with null customer id\n", "df = df.dropna(axis=0)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "# Convert to dollars and create a field representing the total spent\n", "df['UnitPrice'] = df['UnitPrice'] * 1.65\n", "df['Total'] = df['UnitPrice'] * df['Quantity']" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
QuantityUnitPriceCustomerIDTotal
count375250.000000375250.000000375250.000000375250.000000
mean12.2522965.76203115265.13616833.972397
std258.180877119.0541021710.927381733.803756
min-80995.0000000.00000012346.000000-277974.840000
25%2.0000002.06250013901.0000007.012500
50%5.0000003.21750015108.00000019.305000
75%12.0000006.18750016767.00000032.670000
max80995.00000064300.50000018287.000000277974.840000
\n", "
" ], "text/plain": [ " Quantity UnitPrice CustomerID Total\n", "count 375250.000000 375250.000000 375250.000000 375250.000000\n", "mean 12.252296 5.762031 15265.136168 33.972397\n", "std 258.180877 119.054102 1710.927381 733.803756\n", "min -80995.000000 0.000000 12346.000000 -277974.840000\n", "25% 2.000000 2.062500 13901.000000 7.012500\n", "50% 5.000000 3.217500 15108.000000 19.305000\n", "75% 12.000000 6.187500 16767.000000 32.670000\n", "max 80995.000000 64300.500000 18287.000000 277974.840000" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.describe()" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
InvoiceNoStockCodeDescriptionQuantityInvoiceDateUnitPriceCustomerIDCountryCancelledTotal
4248153999322386JUMBO BAG PINK POLKADOT102011-01-04 10:00:003.217513313.0United KingdomFalse32.175
4248253999321499BLUE POLKADOT WRAP252011-01-04 10:00:000.693013313.0United KingdomFalse17.325
4248353999321498RED RETROSPOT WRAP252011-01-04 10:00:000.693013313.0United KingdomFalse17.325
4248453999322379RECYCLING BAG RETROSPOT52011-01-04 10:00:003.465013313.0United KingdomFalse17.325
4248553999320718RED RETROSPOT SHOPPER BAG102011-01-04 10:00:002.062513313.0United KingdomFalse20.625
\n", "
" ], "text/plain": [ " InvoiceNo StockCode Description Quantity \\\n", "42481 539993 22386 JUMBO BAG PINK POLKADOT 10 \n", "42482 539993 21499 BLUE POLKADOT WRAP 25 \n", "42483 539993 21498 RED RETROSPOT WRAP 25 \n", "42484 539993 22379 RECYCLING BAG RETROSPOT 5 \n", "42485 539993 20718 RED RETROSPOT SHOPPER BAG 10 \n", "\n", " InvoiceDate UnitPrice CustomerID Country Cancelled \\\n", "42481 2011-01-04 10:00:00 3.2175 13313.0 United Kingdom False \n", "42482 2011-01-04 10:00:00 0.6930 13313.0 United Kingdom False \n", "42483 2011-01-04 10:00:00 0.6930 13313.0 United Kingdom False \n", "42484 2011-01-04 10:00:00 3.4650 13313.0 United Kingdom False \n", "42485 2011-01-04 10:00:00 2.0625 13313.0 United Kingdom False \n", "\n", " Total \n", "42481 32.175 \n", "42482 17.325 \n", "42483 17.325 \n", "42484 17.325 \n", "42485 20.625 " ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head()" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "InvoiceNo object\n", "StockCode object\n", "Description object\n", "Quantity int64\n", "InvoiceDate datetime64[ns]\n", "UnitPrice float64\n", "CustomerID float64\n", "Country object\n", "Cancelled bool\n", "Total float64\n", "dtype: object" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.dtypes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Create response variable\n", "\n", "The following code is a little complex. I compute how much a customer spent each month (including months with no transactions)." 
] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "def monthly_spend(df):\n", " \"\"\"Identify how much each customer spent each month.\"\"\"\n", " df.index = df['InvoiceDate']\n", " monthly_sums = df.groupby('CustomerID').resample('MS')['Total'].sum()\n", " return monthly_sums.reset_index()\n", "\n", "\n", "def customer_by_month(monthly_data, df):\n", " \"\"\"Create an index for each customer for each month of the data set.\"\"\"\n", " labels = monthly_data.set_index(['InvoiceDate', 'CustomerID'])\n", " midx = pd.MultiIndex.from_product(\n", " [pd.date_range('2011-01-01', '2012-01-01', freq='MS'), df['CustomerID'].unique()],\n", " names=labels.index.names)\n", " return labels.reindex(midx, fill_value=0).reset_index()\n", "\n", "\n", "def monthly_min_date(monthly_data):\n", " \"\"\"Create a table which has all months since a customer's first transaction\"\"\"\n", " min_dates = (monthly_data\n", " .groupby('CustomerID')['InvoiceDate']\n", " .min()\n", " .apply(lambda x: pd.date_range(x, '2012-01-01', freq='MS').tolist())\n", " .reset_index()\n", " )\n", " return pd.DataFrame([(x, i) for x, y in zip(min_dates['CustomerID'], min_dates['InvoiceDate']) for i in y],\n", " columns=['CustomerID', 'InvoiceDate'])\n", "\n", "\n", "# join the table where each customer has a record each month with the table where each customer has a \n", "# record since account creation. 
This way customers do not have records before their first transaction.\n", "monthly_df = monthly_spend(df)\n", "labels = customer_by_month(monthly_df, df).merge(monthly_min_date(monthly_df),\n", " on=['CustomerID', 'InvoiceDate'],\n", " how='inner')" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "# The tutorial starts with March data.\n", "labels = labels[labels['InvoiceDate'] >= '2011-03-01']" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
InvoiceDateCustomerIDTotal
20652011-03-0113313.00.000
20662011-03-0118097.00.000
20672011-03-0116656.0589.248
20682011-03-0116875.00.000
20692011-03-0113094.0115.434
\n", "
" ], "text/plain": [ " InvoiceDate CustomerID Total\n", "2065 2011-03-01 13313.0 0.000\n", "2066 2011-03-01 18097.0 0.000\n", "2067 2011-03-01 16656.0 589.248\n", "2068 2011-03-01 16875.0 0.000\n", "2069 2011-03-01 13094.0 115.434" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "labels.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that I have more samples than the feature tools tutorial because they accidently removed all transactions on the first of the month..." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
CustomerIDTotal
count35339.00000035339.000000
mean15276.044399318.284416
std1720.1948971698.736413
min12346.000000-7074.589500
25%13791.0000000.000000
50%15247.0000000.000000
75%16757.000000260.460750
max18287.000000115906.725000
\n", "
" ], "text/plain": [ " CustomerID Total\n", "count 35339.000000 35339.000000\n", "mean 15276.044399 318.284416\n", "std 1720.194897 1698.736413\n", "min 12346.000000 -7074.589500\n", "25% 13791.000000 0.000000\n", "50% 15247.000000 0.000000\n", "75% 16757.000000 260.460750\n", "max 18287.000000 115906.725000" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "labels.describe()" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
InvoiceDateCustomerIDTotal
27312011-03-0112347.00.0000
45432011-04-0112347.01049.8125
67152011-05-0112347.00.0000
92042011-06-0112347.0631.1580
119562011-07-0112347.00.0000
149262011-08-0112347.0965.1015
180802011-09-0112347.00.0000
215472011-10-0112347.02135.6280
253852011-11-0112347.00.0000
295822011-12-0112347.0370.9530
338262012-01-0112347.00.0000
\n", "
" ], "text/plain": [ " InvoiceDate CustomerID Total\n", "2731 2011-03-01 12347.0 0.0000\n", "4543 2011-04-01 12347.0 1049.8125\n", "6715 2011-05-01 12347.0 0.0000\n", "9204 2011-06-01 12347.0 631.1580\n", "11956 2011-07-01 12347.0 0.0000\n", "14926 2011-08-01 12347.0 965.1015\n", "18080 2011-09-01 12347.0 0.0000\n", "21547 2011-10-01 12347.0 2135.6280\n", "25385 2011-11-01 12347.0 0.0000\n", "29582 2011-12-01 12347.0 370.9530\n", "33826 2012-01-01 12347.0 0.0000" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "labels.loc[labels['CustomerID'] == 12347]" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "370.953" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[(df['CustomerID'] == 12347) & (df['InvoiceDate'] >= '2011-12-01')]['Total'].sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Feature Automation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This next series of commands will load data into feature tools and explain the dataset to feature tools." 
] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "import featuretools as ft" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2018-09-22 09:06:26,324 featuretools.entityset - WARNING index purchases_index not found in dataframe, creating new integer column\n" ] }, { "data": { "text/plain": [ "Entity: purchases\n", " Variables:\n", " purchases_index (dtype: index)\n", " InvoiceNo (dtype: categorical)\n", " StockCode (dtype: categorical)\n", " Quantity (dtype: numeric)\n", " InvoiceDate (dtype: datetime_time_index)\n", " UnitPrice (dtype: numeric)\n", " CustomerID (dtype: numeric)\n", " Country (dtype: categorical)\n", " Cancelled (dtype: boolean)\n", " Total (dtype: numeric)\n", " Description (dtype: text)\n", " Shape:\n", " (Rows: 375250, Columns: 11)" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "es = ft.EntitySet(id=\"Online Retail Logs\")\n", "\n", "# Add the entire data table as an entity\n", "es.entity_from_dataframe(\"purchases\", # name of entity set\n", " dataframe=df, # data\n", " index=\"purchases_index\", # name of new index\n", " time_index = 'InvoiceDate', # time associated with each row\n", " variable_types = {'Description': ft.variable_types.Text}) # specifiy variable type\n", "\n", "es['purchases']" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
StockCodeDescriptionfirst_purchases_time
2068220682RED RETROSPOT CHILDRENS UMBRELLA2011-01-04 10:00:00
2071820718RED RETROSPOT SHOPPER BAG2011-01-04 10:00:00
2149821498RED RETROSPOT WRAP2011-01-04 10:00:00
2149921499BLUE POLKADOT WRAP2011-01-04 10:00:00
2230222302COFFEE MUG PEARS DESIGN2011-01-04 10:00:00
\n", "
" ], "text/plain": [ " StockCode Description first_purchases_time\n", "20682 20682 RED RETROSPOT CHILDRENS UMBRELLA 2011-01-04 10:00:00\n", "20718 20718 RED RETROSPOT SHOPPER BAG 2011-01-04 10:00:00\n", "21498 21498 RED RETROSPOT WRAP 2011-01-04 10:00:00\n", "21499 21499 BLUE POLKADOT WRAP 2011-01-04 10:00:00\n", "22302 22302 COFFEE MUG PEARS DESIGN 2011-01-04 10:00:00" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "es.normalize_entity(new_entity_id=\"products\", # new entity for products. \n", " base_entity_id=\"purchases\", # built from purchases entity\n", " index=\"StockCode\", # Index with StockCode column from purchases entity. (unique key)\n", " additional_variables=[\"Description\"]) # bring in this variable\n", "\n", "es['products'].df.head()" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Entityset: Online Retail Logs\n", " Entities:\n", " purchases [Rows: 375250, Columns: 8]\n", " products [Rows: 3612, Columns: 3]\n", " customers [Rows: 4244, Columns: 2]\n", " orders [Rows: 20482, Columns: 4]\n", " Relationships:\n", " purchases.StockCode -> products.StockCode\n", " purchases.CustomerID -> customers.CustomerID\n", " purchases.InvoiceNo -> orders.InvoiceNo" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "es.normalize_entity(new_entity_id=\"customers\", # customer entity\n", " base_entity_id=\"purchases\", # from purchases\n", " index=\"CustomerID\") # unique key is CustomerID\n", "\n", "es.normalize_entity(new_entity_id=\"orders\", # order entity\n", " base_entity_id=\"purchases\", # from purchases\n", " index=\"InvoiceNo\", # unique key is InvoiceNo\n", " additional_variables=[\"Country\", 'Cancelled']) # Include these variables.\n", "\n", "es" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
CustomerIDfirst_purchases_time
13313.013313.02011-01-04 10:00:00
18097.018097.02011-01-04 10:22:00
16656.016656.02011-01-04 10:23:00
13094.013094.02011-01-04 10:37:00
16875.016875.02011-01-04 10:37:00
\n", "
" ], "text/plain": [ " CustomerID first_purchases_time\n", "13313.0 13313.0 2011-01-04 10:00:00\n", "18097.0 18097.0 2011-01-04 10:22:00\n", "16656.0 16656.0 2011-01-04 10:23:00\n", "13094.0 13094.0 2011-01-04 10:37:00\n", "16875.0 16875.0 2011-01-04 10:37:00" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "es['customers'].df.head()" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
InvoiceNoCountryCancelledfirst_purchases_time
539993539993United KingdomFalse2011-01-04 10:00:00
540001540001United KingdomFalse2011-01-04 10:22:00
540002540002United KingdomFalse2011-01-04 10:23:00
540003540003United KingdomFalse2011-01-04 10:37:00
540004540004United KingdomFalse2011-01-04 10:37:00
\n", "
" ], "text/plain": [ " InvoiceNo Country Cancelled first_purchases_time\n", "539993 539993 United Kingdom False 2011-01-04 10:00:00\n", "540001 540001 United Kingdom False 2011-01-04 10:22:00\n", "540002 540002 United Kingdom False 2011-01-04 10:23:00\n", "540003 540003 United Kingdom False 2011-01-04 10:37:00\n", "540004 540004 United Kingdom False 2011-01-04 10:37:00" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "es['orders'].df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Deep Feature Synthesis" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here, we use featuretools to create new features." ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [], "source": [ "if create_data:\n", " feature_matrix,_ = ft.dfs(entityset=es, # entity\n", " target_entity='customers', # what we're trying to predict\n", " agg_primitives=['max', 'min', 'mode', 'mean', 'avg_time_between'], # requested aggregations\n", " trans_primitives=['day', 'month', 'hour', 'weekend'], # requested transformations\n", " cutoff_time=labels, # define time period of predictions\n", " verbose=1, # how much stdout\n", " cutoff_time_in_index=True, # specify that we've given cutoff times\n", " chunk_size=50, # how much data to give each worker\n", " n_jobs=-1, # how many threads to create\n", " max_depth=1) # how many aggregations\n", "\n", " feature_matrix = feature_matrix.reset_index()\n", " feature_matrix.to_csv('./data/processed/dfs_depth1.csv')\n", " feature_matrix.head()\n", "else:\n", " feature_matrix = pd.read_csv('./data/processed/dfs_depth1.csv')\n", " feature_matrix['time'] = pd.to_datetime(feature_matrix['time'])\n", " feature_matrix = feature_matrix.drop('Unnamed: 0', axis=1)" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index(['CustomerID', 'time', 'MAX(purchases.Quantity)',\n", " 'MAX(purchases.UnitPrice)', 
'MAX(purchases.Total)',\n", " 'MIN(purchases.Quantity)', 'MIN(purchases.UnitPrice)',\n", " 'MIN(purchases.Total)', 'MODE(purchases.InvoiceNo)',\n", " 'MODE(purchases.StockCode)', 'MEAN(purchases.Quantity)',\n", " 'MEAN(purchases.UnitPrice)', 'MEAN(purchases.Total)',\n", " 'AVG_TIME_BETWEEN(purchases.InvoiceDate)', 'DAY(first_purchases_time)',\n", " 'MONTH(first_purchases_time)', 'HOUR(first_purchases_time)',\n", " 'WEEKEND(first_purchases_time)', 'Total'],\n", " dtype='object')" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "feature_matrix.columns" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "CustomerID 12346\n", "time 2011-03-01 00:00:00\n", "MAX(purchases.Quantity) 74215\n", "Name: 0, dtype: object" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# look at a single output\n", "feature_matrix.iloc[0, :3]" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "74215" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# demonstrate we understand the output\n", "df[(df['CustomerID'] == 12346) & (df['InvoiceDate'] < '2011-03-01')]['Quantity'].max()" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(35339, 19)" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "feature_matrix.shape" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [], "source": [ "# create categorical response variable\n", "feature_matrix['Total'] = feature_matrix['Total'].apply(lambda x: 1 if x > 500 else 0) " ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "def create_train_test(month, feature_matrix, drop_cols=['CustomerID', 'time', 'month', 
'Total']):\n", " \"\"\"Basic cleaning and return train/test data.\"\"\"\n", " \n", " # remove columns we know will not contribute\n", " feature_matrix = feature_matrix.drop(columns= ['MODE(purchases.InvoiceNo)', 'MODE(purchases.StockCode)'])\n", " # dummy code strings\n", " feature_matrix = pd.get_dummies(feature_matrix)\n", " # fill nans\n", " feature_matrix = feature_matrix.fillna(0)\n", " \n", " # Labels\n", " feature_matrix['month'] = feature_matrix['time'].dt.month\n", " train_labels = feature_matrix.loc[feature_matrix['month'] < month, 'Total']\n", " test_labels = feature_matrix.loc[feature_matrix['month'] >= month, 'Total']\n", " y_train = np.array(train_labels).reshape((-1, ))\n", " y_test = np.array(test_labels).reshape((-1, ))\n", "\n", " # Features\n", " X_train = feature_matrix[feature_matrix['time'].dt.month < month].drop(columns=drop_cols)\n", " X_test = feature_matrix[feature_matrix['time'].dt.month >= month].drop(columns=drop_cols)\n", " \n", " return (X_train, X_test, y_train, y_test)" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [], "source": [ "X_train, X_test, y_train, y_test = create_train_test(11, feature_matrix)" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.183619599970258\n", "0.16076294277929154\n" ] } ], "source": [ "print(np.mean(y_train))\n", "print(np.mean(y_test))" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "LogisticRegression(C=1.0, class_weight='balanced', dual=False,\n", " fit_intercept=True, intercept_scaling=1, max_iter=100,\n", " multi_class='ovr', n_jobs=1, penalty='l2', random_state=0,\n", " solver='liblinear', tol=0.0001, verbose=0, warm_start=False)" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.linear_model import LogisticRegression\n", "\n", "model = 
LogisticRegression(random_state=0, class_weight='balanced')\n", "model.fit(X_train, y_train)" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [], "source": [ "from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score\n", "\n", "def print_performance(X, y, model):\n", " \"\"\"Print model performance metrics.\"\"\"\n", " predictions = model.predict(X)\n", " probs = model.predict_proba(X)[:, 1]\n", "\n", " # Calculate metrics\n", " print('Precision: {}'.format(round(precision_score(y, predictions), 5)))\n", " print('Recall: {}'.format(round(recall_score(y, predictions), 5)))\n", " print('F1 Score: {}'.format(round(f1_score(y, predictions), 5)))\n", " print('ROC AUC: {}'.format(round(roc_auc_score(y, probs), 5)))" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Precision: 0.44233\n", "Recall: 0.43794\n", "F1 Score: 0.44013\n", "ROC AUC: 0.73158\n" ] } ], "source": [ "print_performance(X_train, y_train, model)" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Precision: 0.40087\n", "Recall: 0.33972\n", "F1 Score: 0.36777\n", "ROC AUC: 0.69795\n" ] } ], "source": [ "print_performance(X_test, y_test, model)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Deeper Features" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Beware that this takes forever!!!" 
] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [], "source": [ "if create_data:\n", " feature_matrix,_ = ft.dfs(entityset=es,\n", " target_entity='customers', # what we're trying to predict\n", " agg_primitives=['max', 'min', 'mode', 'mean', 'avg_time_between'], # requested aggs\n", " trans_primitives=['day', 'month', 'hour', 'weekend'], # requested transformations\n", " n_jobs=-1,\n", " chunk_size=50,\n", " max_depth=2, # how many aggregations to do\n", " cutoff_time=labels,\n", " cutoff_time_in_index=True,\n", " verbose=1)\n", " \n", " feature_matrix = feature_matrix.reset_index()\n", " feature_matrix.to_csv('./data/processed/dfs_depth2.csv')\n", " feature_matrix.head()\n", "else:\n", " feature_matrix = pd.read_csv('./data/processed/dfs_depth2.csv')\n", " feature_matrix['time'] = pd.to_datetime(feature_matrix['time']) \n", " feature_matrix = feature_matrix.drop('Unnamed: 0', axis=1)" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(35339, 23)" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "feature_matrix.shape" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index(['CustomerID', 'time', 'MAX(purchases.Quantity)',\n", " 'MAX(purchases.UnitPrice)', 'MAX(purchases.Total)',\n", " 'MIN(purchases.Quantity)', 'MIN(purchases.UnitPrice)',\n", " 'MIN(purchases.Total)', 'MODE(purchases.InvoiceNo)',\n", " 'MODE(purchases.StockCode)', 'MEAN(purchases.Quantity)',\n", " 'MEAN(purchases.UnitPrice)', 'MEAN(purchases.Total)',\n", " 'AVG_TIME_BETWEEN(purchases.InvoiceDate)', 'DAY(first_purchases_time)',\n", " 'MONTH(first_purchases_time)', 'HOUR(first_purchases_time)',\n", " 'WEEKEND(first_purchases_time)', 'MODE(purchases.DAY(InvoiceDate))',\n", " 'MODE(purchases.MONTH(InvoiceDate))',\n", " 'MODE(purchases.HOUR(InvoiceDate))', 'MODE(purchases.orders.Country)',\n", " 'Total'],\n", " 
dtype='object')" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "feature_matrix.columns" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "CustomerID 13481\n", "time 2011-06-01 00:00:00\n", "MODE(purchases.MONTH(InvoiceDate)) 3\n", "Name: 7000, dtype: object" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# look at a single output\n", "feature_matrix.iloc[7000, [0, 1, 19]]" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "month\n", "3 43\n", "4 1\n", "5 17\n", "Name: Total, dtype: int64" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# demonstrate we understand the output\n", "df['month'] = df['InvoiceDate'].dt.month\n", "df[(df['CustomerID'] == 13481) & (df['InvoiceDate'] < '2011-06-01')].groupby('month')['Total'].count()" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [], "source": [ "# create categorical response variable\n", "feature_matrix['Total'] = feature_matrix['Total'].apply(lambda x: 1 if x > 500 else 0) " ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [], "source": [ "X_train, X_test, y_train, y_test = create_train_test(11, feature_matrix)" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "LogisticRegression(C=1.0, class_weight='balanced', dual=False,\n", " fit_intercept=True, intercept_scaling=1, max_iter=100,\n", " multi_class='ovr', n_jobs=1, penalty='l2', random_state=0,\n", " solver='liblinear', tol=0.0001, verbose=0, warm_start=False)" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model = LogisticRegression(random_state=0, class_weight='balanced')\n", "model.fit(X_train, y_train)" ] }, { "cell_type": "code", "execution_count": 
46, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Precision: 0.4444\n", "Recall: 0.44503\n", "F1 Score: 0.44471\n", "ROC AUC: 0.72874\n" ] } ], "source": [ "print_performance(X_train, y_train, model)" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Precision: 0.41033\n", "Recall: 0.32203\n", "F1 Score: 0.36086\n", "ROC AUC: 0.68288\n" ] } ], "source": [ "print_performance(X_test, y_test, model)" ] } ], "metadata": { "kernelspec": { "display_name": "Python [conda env:featureautomation]", "language": "python", "name": "conda-env-featureautomation-py" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.6" } }, "nbformat": 4, "nbformat_minor": 2 }