{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Online Retail - Predicting order cancellations\n", "\n", "In this tutorial, we demonstrate how getML can be applied in an e-commerce context. Using a dataset of about 400,000 orders, our goal is to predict whether an order will be cancelled.\n", "\n", "We also show that we can significantly improve our results by using getML's built-in hyperparameter tuning routines.\n", "\n", "Summary:\n", "\n", "- Prediction type: __Classification model__\n", "- Domain: __E-commerce__\n", "- Prediction target: __Whether an order will be cancelled__ \n", "- Population size: __397925__" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Background\n", "\n", "The data set contains about 400,000 orders from a British online retailer. Each order consists of a product that has been ordered and a corresponding quantity. Several orders can be summarized onto a single invoice. The goal is to predict whether an order will be cancelled.\n", "\n", "Because the company mainly sells to other businesses, the cancellation rate is relatively low, namely 1.83%.\n", "\n", "The data set has been originally collected for this study:\n", "\n", "> Daqing Chen, Sai Liang Sain, and Kun Guo, Data mining for the online retail industry: A case study of RFM model-based customer segmentation using data mining, Journal of Database Marketing and Customer Strategy Management, Vol. 19, No. 3, pp. 197-208, 2012 (Published online before print: 27 August 2012. doi: 10.1057/dbm.2012.17).\n", "\n", "It has been downloaded from the UCI Machine Learning Repository:\n", "\n", "> Dua, D. and Graff, C. (2019). [UCI Machine Learning Repository](http://archive.ics.uci.edu/dataset/352/online+retail). Irvine, CA: University of California, School of Information and Computer Science." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Analysis" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's get started with the analysis and set up your session:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "%pip install -q \"getml==1.5.0\" \"pyspark==3.5.2\" \"ipywidgets==8.1.5\"" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "getML API version: 1.5.0\n", "\n" ] } ], "source": [ "import os\n", "from urllib import request\n", "\n", "import numpy as np\n", "import pandas as pd\n", "\n", "from pyspark.sql import SparkSession\n", "import getml\n", "\n", "os.environ[\"PYARROW_IGNORE_TIMEZONE\"] = \"1\"\n", "\n", "print(f\"getML API version: {getml.__version__}\\n\")" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Launching ./getML --allow-push-notifications=true --allow-remote-ips=true --home-directory=/home/user --in-memory=true --install=false --launch-browser=true --log=false --token=token in /home/user/.getML/getml-1.5.0-x64-linux...\n", "Launched the getML Engine. The log output will be stored in /home/user/.getML/logs/20240912145332.log.\n", "\u001b[2K Loading pipelines... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[?25h" ] }, { "data": { "text/html": [ "
\n" ], "text/plain": [ "Connected to project \u001b[32m'online_retail'\u001b[0m.\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "getml.engine.launch(allow_remote_ips=True, token='token')\n", "getml.engine.set_project('online_retail')" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "RUN_SPARK = False" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1. Loading data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 1.1 Download from source\n", "\n", "We begin by downloading the data from the source file:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "fname = \"online_retail.csv\"\n", "\n", "if not os.path.exists(fname):\n", " fname, res = request.urlretrieve(\n", " \"https://static.getml.com/datasets/online_retail/\" + fname, \n", " fname\n", " )\n", " \n", "full_data_pandas = pd.read_csv(fname, sep=\"|\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 1.2 Data preparation\n", "\n", "The invoice dates are in a somewhat unusual format, fo we need to rectify that." 
] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "def add_zero(string):\n", " if len(string) == 1:\n", " return \"0\" + string\n", " return string" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "def format_date(string):\n", " datetime = string.split(\" \")\n", " assert len(datetime) == 2, \"Expected date and time\"\n", " \n", " date_components = datetime[0].split(\"/\")\n", " assert len(date_components) == 3, \"Expected three date components\"\n", " \n", " date_components = [add_zero(x) for x in date_components]\n", " \n", " return \"-\".join(date_components) + \" \" + datetime[1] " ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "full_data_pandas[\"InvoiceDate\"] = [\n", " format_date(string) for string in np.asarray(full_data_pandas[\"InvoiceDate\"])\n", "]" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
InvoiceStockCodeDescriptionQuantityInvoiceDatePriceCustomer IDCountry
053636585123AWHITE HANGING HEART T-LIGHT HOLDER62010-12-01 08:262.5517850.0United Kingdom
153636571053WHITE METAL LANTERN62010-12-01 08:263.3917850.0United Kingdom
253636584406BCREAM CUPID HEARTS COAT HANGER82010-12-01 08:262.7517850.0United Kingdom
353636584029GKNITTED UNION FLAG HOT WATER BOTTLE62010-12-01 08:263.3917850.0United Kingdom
453636584029ERED WOOLLY HOTTIE WHITE HEART.62010-12-01 08:263.3917850.0United Kingdom
...........................
54190558158722899CHILDREN'S APRON DOLLY GIRL62011-12-09 12:502.1012680.0France
54190658158723254CHILDRENS CUTLERY DOLLY GIRL42011-12-09 12:504.1512680.0France
54190758158723255CHILDRENS CUTLERY CIRCUS PARADE42011-12-09 12:504.1512680.0France
54190858158722138BAKING SET 9 PIECE RETROSPOT32011-12-09 12:504.9512680.0France
541909581587POSTPOSTAGE12011-12-09 12:5018.0012680.0France
\n", "

541910 rows × 8 columns

\n", "
" ], "text/plain": [ " Invoice StockCode Description Quantity \\\n", "0 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6 \n", "1 536365 71053 WHITE METAL LANTERN 6 \n", "2 536365 84406B CREAM CUPID HEARTS COAT HANGER 8 \n", "3 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6 \n", "4 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6 \n", "... ... ... ... ... \n", "541905 581587 22899 CHILDREN'S APRON DOLLY GIRL 6 \n", "541906 581587 23254 CHILDRENS CUTLERY DOLLY GIRL 4 \n", "541907 581587 23255 CHILDRENS CUTLERY CIRCUS PARADE 4 \n", "541908 581587 22138 BAKING SET 9 PIECE RETROSPOT 3 \n", "541909 581587 POST POSTAGE 1 \n", "\n", " InvoiceDate Price Customer ID Country \n", "0 2010-12-01 08:26 2.55 17850.0 United Kingdom \n", "1 2010-12-01 08:26 3.39 17850.0 United Kingdom \n", "2 2010-12-01 08:26 2.75 17850.0 United Kingdom \n", "3 2010-12-01 08:26 3.39 17850.0 United Kingdom \n", "4 2010-12-01 08:26 3.39 17850.0 United Kingdom \n", "... ... ... ... ... \n", "541905 2011-12-09 12:50 2.10 12680.0 France \n", "541906 2011-12-09 12:50 4.15 12680.0 France \n", "541907 2011-12-09 12:50 4.15 12680.0 France \n", "541908 2011-12-09 12:50 4.95 12680.0 France \n", "541909 2011-12-09 12:50 18.00 12680.0 France \n", "\n", "[541910 rows x 8 columns]" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "full_data_pandas" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this data set, the targets aren't as clearly defined as we would like to, so we have do define them ourselves." 
] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "def add_target(df):\n", " df = df.sort_values(by=[\"Customer ID\", \"InvoiceDate\"])\n", " \n", " cancelled = np.zeros(df.shape[0])\n", "\n", " invoice = np.asarray(df[\"Invoice\"])\n", " stock_code = np.asarray(df[\"StockCode\"])\n", " customer_id = np.asarray(df[\"Customer ID\"])\n", "\n", " for i in range(len(invoice)):\n", " if (invoice[i][0] == 'C') or (i == len(invoice) - 1):\n", " continue\n", "\n", " j = i + 1\n", "\n", " while customer_id[j] == customer_id[i]:\n", " if (invoice[j][0] == 'C') and (stock_code[i] == stock_code[j]):\n", " cancelled[i] = 1.0\n", " break\n", "\n", " if stock_code[i] == stock_code[j]:\n", " break\n", "\n", " j += 1\n", " \n", " df[\"cancelled\"] = cancelled\n", " \n", " return df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Also, we want to remove any orders in the data set that are actually cancellations." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "def remove_cancellations(df):\n", " invoice = np.asarray(df[\"Invoice\"])\n", "\n", " is_order = [inv[0] != 'C' for inv in invoice]\n", " \n", " df = df[is_order]\n", " \n", " return df" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "full_data_pandas = add_target(full_data_pandas)\n", "full_data_pandas = remove_cancellations(full_data_pandas)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, there are some order for which we do not have a customer ID. We want to remove those." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "full_data_pandas = full_data_pandas[~np.isnan(full_data_pandas[\"Customer ID\"])]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can upload the data to getML." 
] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " 
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", 
" \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
name Quantity Price Customer ID cancelledInvoice StockCode Description InvoiceDate Country
roleunused_floatunused_floatunused_floatunused_floatunused_stringunused_stringunused_string unused_string unused_string
0\n", " 74215 \n", " \n", " 1.04\n", " \n", " 12346 \n", " \n", " 1 \n", " 54143123166MEDIUM CERAMIC TOP STORAGE JAR2011-01-18 10:01United Kingdom
1\n", " 12 \n", " \n", " 2.1\n", " \n", " 12347 \n", " \n", " 0 \n", " 53762685116BLACK CANDELABRA T-LIGHT HOLDER2010-12-07 14:57Iceland
2\n", " 4 \n", " \n", " 4.25\n", " \n", " 12347 \n", " \n", " 0 \n", " 53762622375AIRLINE BAG VINTAGE JET SET BROW...2010-12-07 14:57Iceland
3\n", " 12 \n", " \n", " 3.25\n", " \n", " 12347 \n", " \n", " 0 \n", " 53762671477COLOUR GLASS. STAR T-LIGHT HOLDE...2010-12-07 14:57Iceland
4\n", " 36 \n", " \n", " 0.65\n", " \n", " 12347 \n", " \n", " 0 \n", " 53762622492MINI PAINT SET VINTAGE2010-12-07 14:57Iceland
\n", " ... \n", " \n", " ... \n", " \n", " ... \n", " \n", " ... \n", " ...............
397920\n", " 12 \n", " \n", " 0.42\n", " \n", " 18287 \n", " \n", " 0 \n", " 57071522419LIPSTICK PEN RED2011-10-12 10:23United Kingdom
397921\n", " 12 \n", " \n", " 2.1\n", " \n", " 18287 \n", " \n", " 0 \n", " 57071522866HAND WARMER SCOTTY DOG DESIGN2011-10-12 10:23United Kingdom
397922\n", " 36 \n", " \n", " 1.25\n", " \n", " 18287 \n", " \n", " 0 \n", " 57316723264SET OF 3 WOODEN SLEIGH DECORATIO...2011-10-28 09:29United Kingdom
397923\n", " 48 \n", " \n", " 0.39\n", " \n", " 18287 \n", " \n", " 0 \n", " 57316721824PAINTED METAL STAR WITH HOLLY BE...2011-10-28 09:29United Kingdom
397924\n", " 24 \n", " \n", " 0.29\n", " \n", " 18287 \n", " \n", " 0 \n", " 57316721014SWISS CHALET TREE DECORATION2011-10-28 09:29United Kingdom
\n", "\n", "

\n", " 397925 rows x 9 columns
\n", " memory usage: 57.28 MB
\n", " name: full_data
\n", " type: getml.DataFrame
\n", " \n", "

\n" ], "text/plain": [ " name Quantity Price Customer ID cancelled Invoice StockCode \n", " role unused_float unused_float unused_float unused_float unused_string unused_string\n", " 0 74215 1.04 12346 1 541431 23166 \n", " 1 12 2.1 12347 0 537626 85116 \n", " 2 4 4.25 12347 0 537626 22375 \n", " 3 12 3.25 12347 0 537626 71477 \n", " 4 36 0.65 12347 0 537626 22492 \n", " ... ... ... ... ... ... \n", "397920 12 0.42 18287 0 570715 22419 \n", "397921 12 2.1 18287 0 570715 22866 \n", "397922 36 1.25 18287 0 573167 23264 \n", "397923 48 0.39 18287 0 573167 21824 \n", "397924 24 0.29 18287 0 573167 21014 \n", "\n", " name Description InvoiceDate Country \n", " role unused_string unused_string unused_string \n", " 0 MEDIUM CERAMIC TOP STORAGE JAR 2011-01-18 10:01 United Kingdom\n", " 1 BLACK CANDELABRA T-LIGHT HOLDER 2010-12-07 14:57 Iceland \n", " 2 AIRLINE BAG VINTAGE JET SET BROW... 2010-12-07 14:57 Iceland \n", " 3 COLOUR GLASS. STAR T-LIGHT HOLDE... 2010-12-07 14:57 Iceland \n", " 4 MINI PAINT SET VINTAGE 2010-12-07 14:57 Iceland \n", " ... ... ... \n", "397920 LIPSTICK PEN RED 2011-10-12 10:23 United Kingdom\n", "397921 HAND WARMER SCOTTY DOG DESIGN 2011-10-12 10:23 United Kingdom\n", "397922 SET OF 3 WOODEN SLEIGH DECORATIO... 2011-10-28 09:29 United Kingdom\n", "397923 PAINTED METAL STAR WITH HOLLY BE... 2011-10-28 09:29 United Kingdom\n", "397924 SWISS CHALET TREE DECORATION 2011-10-28 09:29 United Kingdom\n", "\n", "\n", "397925 rows x 9 columns\n", "memory usage: 57.28 MB\n", "type: getml.DataFrame" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "full_data = getml.data.DataFrame.from_pandas(full_data_pandas, \"full_data\")\n", "\n", "full_data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 1.3 Prepare data for getML" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "getML requires that we define *roles* for each of the columns." 
] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "full_data.set_role(\"InvoiceDate\", getml.data.roles.time_stamp, time_formats=['%Y-%m-%d %H:%M'])\n", "full_data.set_role([\"Customer ID\", \"Invoice\"], getml.data.roles.join_key)\n", "full_data.set_role([\"cancelled\"], getml.data.roles.target)\n", "full_data.set_role([\"Quantity\", \"Price\"], getml.data.roles.numerical)\n", "full_data.set_role(\"Country\", getml.data.roles.categorical)\n", "full_data.set_role(\"Description\", getml.data.roles.text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The *StockCode* is a 5-digit code that uniquely defines a product. It is hierarchical, meaning that every digit has a meaning. We want to make use of that, so we assign a unit to the stock code, which we can reference in our preprocessors." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "full_data.set_unit(\"StockCode\", \"code\")" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
0train
1validation
2train
3validation
4validation
...
\n", "\n", "

\n", " infinite number of rows
\n", " \n", " type: StringColumnView
\n", " \n", "

\n" ], "text/plain": [ " \n", " 0 train \n", " 1 validation\n", " 2 train \n", " 3 validation\n", " 4 validation\n", " ... \n", "\n", "\n", "infinite number of rows\n", "type: StringColumnView" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "split = getml.data.split.random(train=0.7, validation=0.15, test=0.15)\n", "split" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2. Predictive modeling\n", "\n", "We loaded the data and defined the roles and units. Next, we create a getML pipeline for relational learning." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 2.1 Define relational model\n", "\n", "To get started with relational learning, we need to specify the data model.\n", "\n", "In our case, there are two joins we are interested in: \n", "\n", "1) We want to take a look at all of the other orders on the same invoice.\n", "\n", "2) We want to check out how often a certain customer has cancelled orders in the past. Here, we limit ourselves to the last 90 days. To avoid data leaks, we set a horizon of one day." ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/html": [ "data model\n", "
\n", "
diagram
\n", "
full_datafull_datapopulationInvoice = InvoiceCustomer ID = Customer IDInvoiceDate <= InvoiceDateMemory: 90.0 daysHorizon: 1.0 daysLagged targets allowed
\n", "
\n", "\n", "
\n", "
staging
\n", " \n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
data framesstaging table
0populationPOPULATION__STAGING_TABLE_1
1full_dataFULL_DATA__STAGING_TABLE_2
\n", "
\n", " \n", "container\n", "
\n", "
\n", "
population
\n", " \n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
subset name rowstype
0testfull_data60013View
1trainfull_data278171View
2validationfull_data59741View
\n", "
\n", "
\n", "
peripheral
\n", " \n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
name rowstype
0full_data397925View
\n", "
\n", "
" ], "text/plain": [ "data model\n", "\n", " population:\n", " columns:\n", " - Country: categorical\n", " - Customer ID: join_key\n", " - Invoice: join_key\n", " - Quantity: numerical\n", " - Price: numerical\n", " - ...\n", "\n", " joins:\n", " - right: 'full_data'\n", " on: (population.Invoice, full_data.Invoice)\n", " relationship: 'many-to-many'\n", " lagged_targets: False\n", " - right: 'full_data'\n", " on: (population.Customer ID, full_data.Customer ID)\n", " time_stamps: (population.InvoiceDate, full_data.InvoiceDate)\n", " relationship: 'many-to-many'\n", " memory: 7776000.0\n", " horizon: 86400.0\n", " lagged_targets: True\n", "\n", " full_data:\n", " columns:\n", " - Country: categorical\n", " - Customer ID: join_key\n", " - Invoice: join_key\n", " - Quantity: numerical\n", " - Price: numerical\n", " - ...\n", "\n", " full_data:\n", " columns:\n", " - Country: categorical\n", " - Customer ID: join_key\n", " - Invoice: join_key\n", " - Quantity: numerical\n", " - Price: numerical\n", " - ...\n", "\n", "\n", "container\n", "\n", " population\n", " subset name rows type\n", " 0 test full_data 60013 View\n", " 1 train full_data 278171 View\n", " 2 validation full_data 59741 View\n", "\n", " peripheral\n", " name rows type\n", " 0 full_data 397925 View" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "star_schema = getml.data.StarSchema(\n", " population=full_data, \n", " alias=\"population\",\n", " split=split,\n", ")\n", "\n", "star_schema.join(\n", " full_data.drop(\"Description\"),\n", " alias=\"full_data\",\n", " on='Invoice',\n", ")\n", "\n", "star_schema.join(\n", " full_data.drop(\"Description\"),\n", " alias=\"full_data\",\n", " on='Customer ID',\n", " time_stamps='InvoiceDate',\n", " horizon=getml.data.time.days(1),\n", " memory=getml.data.time.days(90),\n", " lagged_targets=True,\n", ")\n", "\n", "star_schema" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 2.2 getML pipeline" ] }, { 
"cell_type": "markdown", "metadata": {}, "source": [ "\n", "__Set-up the feature learner & predictor__" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We have mentioned that the *StockCode* is a hierarchical code. To make use of that fact, we use getML's substring preprocessor, extracting the first digit, the first two digits etc. Since we have assigned the unit *code* to the *StockCode*, the preprocessors know which column they should be applied to." ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "substr1 = getml.preprocessors.Substring(0, 1, \"code\")\n", "substr2 = getml.preprocessors.Substring(0, 2, \"code\")\n", "substr3 = getml.preprocessors.Substring(0, 3, \"code\")\n", "\n", "mapping = getml.preprocessors.Mapping()\n", "\n", "text_field_splitter = getml.preprocessors.TextFieldSplitter()\n", "\n", "fast_prop = getml.feature_learning.FastProp(\n", " loss_function=getml.feature_learning.loss_functions.CrossEntropyLoss,\n", " num_threads=1,\n", " sampling_factor=0.1,\n", ")\n", "\n", "feature_selector = getml.predictors.XGBoostClassifier()\n", "\n", "predictor = getml.predictors.XGBoostClassifier()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__Build the pipeline__" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "Pipeline(data_model='population',\n", " feature_learners=['FastProp'],\n", " feature_selectors=['XGBoostClassifier'],\n", " include_categorical=False,\n", " loss_function='CrossEntropyLoss',\n", " peripheral=['full_data'],\n", " predictors=['XGBoostClassifier'],\n", " preprocessors=['Substring', 'Substring', 'Substring', 'Mapping', 'TextFieldSplitter'],\n", " share_selected_features=0.2,\n", " tags=['fast_prop'])" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pipe = getml.pipeline.Pipeline(\n", " tags=['fast_prop'],\n", " data_model=star_schema.data_model,\n", " preprocessors=[substr1, substr2, substr3, mapping, text_field_splitter],\n", " feature_learners=[fast_prop],\n", " feature_selectors=[feature_selector],\n", " predictors=[predictor],\n", " share_selected_features=0.2,\n", ")\n", "\n", "pipe" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 2.3 Model training" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n" ], "text/plain": [ "Checking data model\u001b[33m...\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "\u001b[2K Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:13\n", "\u001b[2K Checking... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[?25h" ] }, { "data": { "text/html": [ "
\n" ], "text/plain": [ "OK.\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "pipe.check(star_schema.train)" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n" ], "text/plain": [ "Checking data model\u001b[33m...\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "\u001b[2K Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[?25h" ] }, { "data": { "text/html": [ "
\n" ], "text/plain": [ "OK.\n" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "\u001b[2K Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:01\n", "\u001b[2K Indexing text fields... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K FastProp: Trying 206 features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:06\n", "\u001b[2K FastProp: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:54\n", "\u001b[2K XGBoost: Training as feature selector... ━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 01:40\n", "\u001b[2K XGBoost: Training as predictor... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:21\n", "\u001b[?25h" ] }, { "data": { "text/html": [ "
Trained pipeline.\n",
                            "
\n" ], "text/plain": [ "Trained pipeline.\n" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Time taken: 0:03:05.120829.\n", "\n" ] }, { "data": { "text/html": [ "
Pipeline(data_model='population',\n",
                            "         feature_learners=['FastProp'],\n",
                            "         feature_selectors=['XGBoostClassifier'],\n",
                            "         include_categorical=False,\n",
                            "         loss_function='CrossEntropyLoss',\n",
                            "         peripheral=['full_data'],\n",
                            "         predictors=['XGBoostClassifier'],\n",
                            "         preprocessors=['Substring', 'Substring', 'Substring', 'Mapping', 'TextFieldSplitter'],\n",
                            "         share_selected_features=0.2,\n",
                            "         tags=['fast_prop', 'container-TWm7IL'])
" ], "text/plain": [ "Pipeline(data_model='population',\n", " feature_learners=['FastProp'],\n", " feature_selectors=['XGBoostClassifier'],\n", " include_categorical=False,\n", " loss_function='CrossEntropyLoss',\n", " peripheral=['full_data'],\n", " predictors=['XGBoostClassifier'],\n", " preprocessors=['Substring', 'Substring', 'Substring', 'Mapping', 'TextFieldSplitter'],\n", " share_selected_features=0.2,\n", " tags=['fast_prop', 'container-TWm7IL'])" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pipe.fit(star_schema.train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 2.4 Model evaluation" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "lines_to_next_cell": 0 }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[2K Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K FastProp: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:01\n", "\u001b[?25h" ] }, { "data": { "text/html": [ "\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
date time set usedtarget accuracy auccross entropy
02024-09-12 11:51:44traincancelled0.98250.84460.0736
12024-09-12 11:51:47testcancelled0.98250.81190.07529
" ], "text/plain": [ " date time set used target accuracy auc cross entropy\n", "0 2024-09-12 11:51:44 train cancelled 0.9825 0.8446 0.0736 \n", "1 2024-09-12 11:51:47 test cancelled 0.9825 0.8119 0.07529" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pipe.score(star_schema.test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 2.5 Features\n", "\n", "The most important feature looks as follows:" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/markdown": [ "```sql\n", "DROP TABLE IF EXISTS \"FEATURE_1_194\";\n", "\n", "CREATE TABLE \"FEATURE_1_194\" AS\n", "SELECT AVG( t2.\"description__mapping_3_target_1_avg\" ) AS \"feature_1_194\",\n", " t1.rowid AS rownum\n", "FROM \"POPULATION__STAGING_TABLE_1\" t1\n", "INNER JOIN \"POPULATION__STAGING_TABLE_1__DESCRIPTION\" t2\n", "ON t1.\"rowid\" = t2.\"rownum\"\n", "GROUP BY t1.rowid;\n", "```" ], "text/plain": [ "'DROP TABLE IF EXISTS \"FEATURE_1_194\";\\n\\nCREATE TABLE \"FEATURE_1_194\" AS\\nSELECT AVG( t2.\"description__mapping_3_target_1_avg\" ) AS \"feature_1_194\",\\n t1.rowid AS rownum\\nFROM \"POPULATION__STAGING_TABLE_1\" t1\\nINNER JOIN \"POPULATION__STAGING_TABLE_1__DESCRIPTION\" t2\\nON t1.\"rowid\" = t2.\"rownum\"\\nGROUP BY t1.rowid;'" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pipe.features.to_sql()[pipe.features.sort(by=\"importances\")[0].name]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 2.6 Productionization\n", "\n", "It is possible to productionize the pipeline by transpiling the features into production-ready SQL code." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pipe.features.to_sql(dialect=getml.pipeline.dialect.spark_sql).save(\"online_retail_spark\")" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [], "source": [ "if RUN_SPARK:\n", " spark = SparkSession.builder.appName(\n", " \"online_retail\"\n", " ).config(\n", " \"spark.driver.maxResultSize\",\"5g\"\n", " ).config(\n", " \"spark.driver.memory\", \"5g\"\n", " ).config(\n", " \"spark.executor.memory\", \"5g\"\n", " ).config(\n", " \"spark.sql.execution.arrow.pyspark.enabled\", \"true\"\n", " ).config(\n", " \"spark.sql.session.timeZone\", \"UTC\"\n", " ).enableHiveSupport().getOrCreate()\n", "\n", " spark.sparkContext.setLogLevel(\"ERROR\")" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [], "source": [ "if RUN_SPARK:\n", " population_spark = star_schema.train.population.to_pyspark(spark, name=\"population\")\n", " peripheral_spark = star_schema.full_data.to_pyspark(spark, name=\"full_data\")" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [], "source": [ "if RUN_SPARK:\n", " getml.spark.execute(spark, \"online_retail_spark\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The resulting features are stored in a table called `FEATURES`. Here is how you can retrieve them:" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [], "source": [ "if RUN_SPARK:\n", " spark.sql(\"SELECT * FROM `FEATURES` LIMIT 20\").toPandas()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "getml.engine.shutdown()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3. Conclusion\n", "\n", "In this notebook, we have demonstrated how getML can be applied in an e-commerce setting. In particular, we have seen how results can be improved using the built-in hyperparameter tuning routines." 
] } ], "metadata": { "jupytext": { "cell_metadata_filter": "-all", "encoding": "# -*- coding: utf-8 -*-", "notebook_metadata_filter": "-all" }, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.4" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": false, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": true } }, "nbformat": 4, "nbformat_minor": 4 }