{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Online Retail - Predicting order cancellations\n", "\n", "In this tutorial, we demonstrate how getML can be applied in an e-commerce context. Using a dataset of about 400,000 orders, our goal is to predict whether an order will be cancelled.\n", "\n", "We also show that we can significantly improve our results by using getML's built-in hyperparameter tuning routines.\n", "\n", "Summary:\n", "\n", "- Prediction type: __Classification model__\n", "- Domain: __E-commerce__\n", "- Prediction target: __Whether an order will be cancelled__ \n", "- Population size: __397925__" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Background\n", "\n", "The data set contains about 400,000 orders from a British online retailer. Each order consists of a product that has been ordered and a corresponding quantity. Several orders can be summarized onto a single invoice. The goal is to predict whether an order will be cancelled.\n", "\n", "Because the company mainly sells to other businesses, the cancellation rate is relatively low, namely 1.83%.\n", "\n", "The data set has been originally collected for this study:\n", "\n", "> Daqing Chen, Sai Liang Sain, and Kun Guo, Data mining for the online retail industry: A case study of RFM model-based customer segmentation using data mining, Journal of Database Marketing and Customer Strategy Management, Vol. 19, No. 3, pp. 197-208, 2012 (Published online before print: 27 August 2012. doi: 10.1057/dbm.2012.17).\n", "\n", "It has been downloaded from the UCI Machine Learning Repository:\n", "\n", "> Dua, D. and Graff, C. (2019). [UCI Machine Learning Repository](http://archive.ics.uci.edu/dataset/352/online+retail). Irvine, CA: University of California, School of Information and Computer Science." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Analysis" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's get started with the analysis and set up your session:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "%pip install -q \"getml==1.5.0\" \"pyspark==3.5.2\" \"ipywidgets==8.1.5\"" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "getML API version: 1.5.0\n", "\n" ] } ], "source": [ "import os\n", "from urllib import request\n", "\n", "import numpy as np\n", "import pandas as pd\n", "\n", "from pyspark.sql import SparkSession\n", "import getml\n", "\n", "os.environ[\"PYARROW_IGNORE_TIMEZONE\"] = \"1\"\n", "\n", "print(f\"getML API version: {getml.__version__}\\n\")" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Launching ./getML --allow-push-notifications=true --allow-remote-ips=true --home-directory=/home/user --in-memory=true --install=false --launch-browser=true --log=false --token=token in /home/user/.getML/getml-1.5.0-x64-linux...\n", "Launched the getML Engine. The log output will be stored in /home/user/.getML/logs/20240912145332.log.\n", "\u001b[2K Loading pipelines... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[?25h" ] }, { "data": { "text/html": [ "
\n" ], "text/plain": [ "Connected to project \u001b[32m'online_retail'\u001b[0m.\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "getml.engine.launch(allow_remote_ips=True, token='token')\n", "getml.engine.set_project('online_retail')" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "RUN_SPARK = False" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1. Loading data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 1.1 Download from source\n", "\n", "We begin by downloading the data from the source file:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "fname = \"online_retail.csv\"\n", "\n", "if not os.path.exists(fname):\n", " fname, res = request.urlretrieve(\n", " \"https://static.getml.com/datasets/online_retail/\" + fname, \n", " fname\n", " )\n", " \n", "full_data_pandas = pd.read_csv(fname, sep=\"|\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 1.2 Data preparation\n", "\n", "The invoice dates are in a somewhat unusual format, fo we need to rectify that." 
] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "def add_zero(string):\n", " if len(string) == 1:\n", " return \"0\" + string\n", " return string" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "def format_date(string):\n", " datetime = string.split(\" \")\n", " assert len(datetime) == 2, \"Expected date and time\"\n", " \n", " date_components = datetime[0].split(\"/\")\n", " assert len(date_components) == 3, \"Expected three date components\"\n", " \n", " date_components = [add_zero(x) for x in date_components]\n", " \n", " return \"-\".join(date_components) + \" \" + datetime[1] " ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "full_data_pandas[\"InvoiceDate\"] = [\n", " format_date(string) for string in np.asarray(full_data_pandas[\"InvoiceDate\"])\n", "]" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
InvoiceStockCodeDescriptionQuantityInvoiceDatePriceCustomer IDCountry
053636585123AWHITE HANGING HEART T-LIGHT HOLDER62010-12-01 08:262.5517850.0United Kingdom
153636571053WHITE METAL LANTERN62010-12-01 08:263.3917850.0United Kingdom
253636584406BCREAM CUPID HEARTS COAT HANGER82010-12-01 08:262.7517850.0United Kingdom
353636584029GKNITTED UNION FLAG HOT WATER BOTTLE62010-12-01 08:263.3917850.0United Kingdom
453636584029ERED WOOLLY HOTTIE WHITE HEART.62010-12-01 08:263.3917850.0United Kingdom
...........................
54190558158722899CHILDREN'S APRON DOLLY GIRL62011-12-09 12:502.1012680.0France
54190658158723254CHILDRENS CUTLERY DOLLY GIRL42011-12-09 12:504.1512680.0France
54190758158723255CHILDRENS CUTLERY CIRCUS PARADE42011-12-09 12:504.1512680.0France
54190858158722138BAKING SET 9 PIECE RETROSPOT32011-12-09 12:504.9512680.0France
541909581587POSTPOSTAGE12011-12-09 12:5018.0012680.0France
\n", "

541910 rows × 8 columns

\n", "
" ], "text/plain": [ " Invoice StockCode Description Quantity \\\n", "0 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6 \n", "1 536365 71053 WHITE METAL LANTERN 6 \n", "2 536365 84406B CREAM CUPID HEARTS COAT HANGER 8 \n", "3 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6 \n", "4 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6 \n", "... ... ... ... ... \n", "541905 581587 22899 CHILDREN'S APRON DOLLY GIRL 6 \n", "541906 581587 23254 CHILDRENS CUTLERY DOLLY GIRL 4 \n", "541907 581587 23255 CHILDRENS CUTLERY CIRCUS PARADE 4 \n", "541908 581587 22138 BAKING SET 9 PIECE RETROSPOT 3 \n", "541909 581587 POST POSTAGE 1 \n", "\n", " InvoiceDate Price Customer ID Country \n", "0 2010-12-01 08:26 2.55 17850.0 United Kingdom \n", "1 2010-12-01 08:26 3.39 17850.0 United Kingdom \n", "2 2010-12-01 08:26 2.75 17850.0 United Kingdom \n", "3 2010-12-01 08:26 3.39 17850.0 United Kingdom \n", "4 2010-12-01 08:26 3.39 17850.0 United Kingdom \n", "... ... ... ... ... \n", "541905 2011-12-09 12:50 2.10 12680.0 France \n", "541906 2011-12-09 12:50 4.15 12680.0 France \n", "541907 2011-12-09 12:50 4.15 12680.0 France \n", "541908 2011-12-09 12:50 4.95 12680.0 France \n", "541909 2011-12-09 12:50 18.00 12680.0 France \n", "\n", "[541910 rows x 8 columns]" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "full_data_pandas" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this data set, the targets aren't as clearly defined as we would like to, so we have do define them ourselves." 
] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "def add_target(df):\n", " df = df.sort_values(by=[\"Customer ID\", \"InvoiceDate\"])\n", " \n", " cancelled = np.zeros(df.shape[0])\n", "\n", " invoice = np.asarray(df[\"Invoice\"])\n", " stock_code = np.asarray(df[\"StockCode\"])\n", " customer_id = np.asarray(df[\"Customer ID\"])\n", "\n", " for i in range(len(invoice)):\n", " if (invoice[i][0] == 'C') or (i == len(invoice) - 1):\n", " continue\n", "\n", " j = i + 1\n", "\n", " while customer_id[j] == customer_id[i]:\n", " if (invoice[j][0] == 'C') and (stock_code[i] == stock_code[j]):\n", " cancelled[i] = 1.0\n", " break\n", "\n", " if stock_code[i] == stock_code[j]:\n", " break\n", "\n", " j += 1\n", " \n", " df[\"cancelled\"] = cancelled\n", " \n", " return df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Also, we want to remove any orders in the data set that are actually cancellations." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "def remove_cancellations(df):\n", " invoice = np.asarray(df[\"Invoice\"])\n", "\n", " is_order = [inv[0] != 'C' for inv in invoice]\n", " \n", " df = df[is_order]\n", " \n", " return df" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "full_data_pandas = add_target(full_data_pandas)\n", "full_data_pandas = remove_cancellations(full_data_pandas)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, there are some order for which we do not have a customer ID. We want to remove those." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "full_data_pandas = full_data_pandas[~np.isnan(full_data_pandas[\"Customer ID\"])]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can upload the data to getML." 
] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " 
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", 
" \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
name Quantity Price Customer ID cancelledInvoice StockCode Description InvoiceDate Country
roleunused_floatunused_floatunused_floatunused_floatunused_stringunused_stringunused_string unused_string unused_string
0\n", " 74215 \n", " \n", " 1.04\n", " \n", " 12346 \n", " \n", " 1 \n", " 54143123166MEDIUM CERAMIC TOP STORAGE JAR2011-01-18 10:01United Kingdom
1\n", " 12 \n", " \n", " 2.1\n", " \n", " 12347 \n", " \n", " 0 \n", " 53762685116BLACK CANDELABRA T-LIGHT HOLDER2010-12-07 14:57Iceland
2\n", " 4 \n", " \n", " 4.25\n", " \n", " 12347 \n", " \n", " 0 \n", " 53762622375AIRLINE BAG VINTAGE JET SET BROW...2010-12-07 14:57Iceland
3\n", " 12 \n", " \n", " 3.25\n", " \n", " 12347 \n", " \n", " 0 \n", " 53762671477COLOUR GLASS. STAR T-LIGHT HOLDE...2010-12-07 14:57Iceland
4\n", " 36 \n", " \n", " 0.65\n", " \n", " 12347 \n", " \n", " 0 \n", " 53762622492MINI PAINT SET VINTAGE2010-12-07 14:57Iceland
\n", " ... \n", " \n", " ... \n", " \n", " ... \n", " \n", " ... \n", " ...............
397920\n", " 12 \n", " \n", " 0.42\n", " \n", " 18287 \n", " \n", " 0 \n", " 57071522419LIPSTICK PEN RED2011-10-12 10:23United Kingdom
397921\n", " 12 \n", " \n", " 2.1\n", " \n", " 18287 \n", " \n", " 0 \n", " 57071522866HAND WARMER SCOTTY DOG DESIGN2011-10-12 10:23United Kingdom
397922\n", " 36 \n", " \n", " 1.25\n", " \n", " 18287 \n", " \n", " 0 \n", " 57316723264SET OF 3 WOODEN SLEIGH DECORATIO...2011-10-28 09:29United Kingdom
397923\n", " 48 \n", " \n", " 0.39\n", " \n", " 18287 \n", " \n", " 0 \n", " 57316721824PAINTED METAL STAR WITH HOLLY BE...2011-10-28 09:29United Kingdom
397924\n", " 24 \n", " \n", " 0.29\n", " \n", " 18287 \n", " \n", " 0 \n", " 57316721014SWISS CHALET TREE DECORATION2011-10-28 09:29United Kingdom
\n", "\n", "

\n", " 397925 rows x 9 columns
\n", " memory usage: 57.28 MB
\n", " name: full_data
\n", " type: getml.DataFrame
\n", " \n", "

\n" ], "text/plain": [ " name Quantity Price Customer ID cancelled Invoice StockCode \n", " role unused_float unused_float unused_float unused_float unused_string unused_string\n", " 0 74215 1.04 12346 1 541431 23166 \n", " 1 12 2.1 12347 0 537626 85116 \n", " 2 4 4.25 12347 0 537626 22375 \n", " 3 12 3.25 12347 0 537626 71477 \n", " 4 36 0.65 12347 0 537626 22492 \n", " ... ... ... ... ... ... \n", "397920 12 0.42 18287 0 570715 22419 \n", "397921 12 2.1 18287 0 570715 22866 \n", "397922 36 1.25 18287 0 573167 23264 \n", "397923 48 0.39 18287 0 573167 21824 \n", "397924 24 0.29 18287 0 573167 21014 \n", "\n", " name Description InvoiceDate Country \n", " role unused_string unused_string unused_string \n", " 0 MEDIUM CERAMIC TOP STORAGE JAR 2011-01-18 10:01 United Kingdom\n", " 1 BLACK CANDELABRA T-LIGHT HOLDER 2010-12-07 14:57 Iceland \n", " 2 AIRLINE BAG VINTAGE JET SET BROW... 2010-12-07 14:57 Iceland \n", " 3 COLOUR GLASS. STAR T-LIGHT HOLDE... 2010-12-07 14:57 Iceland \n", " 4 MINI PAINT SET VINTAGE 2010-12-07 14:57 Iceland \n", " ... ... ... \n", "397920 LIPSTICK PEN RED 2011-10-12 10:23 United Kingdom\n", "397921 HAND WARMER SCOTTY DOG DESIGN 2011-10-12 10:23 United Kingdom\n", "397922 SET OF 3 WOODEN SLEIGH DECORATIO... 2011-10-28 09:29 United Kingdom\n", "397923 PAINTED METAL STAR WITH HOLLY BE... 2011-10-28 09:29 United Kingdom\n", "397924 SWISS CHALET TREE DECORATION 2011-10-28 09:29 United Kingdom\n", "\n", "\n", "397925 rows x 9 columns\n", "memory usage: 57.28 MB\n", "type: getml.DataFrame" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "full_data = getml.data.DataFrame.from_pandas(full_data_pandas, \"full_data\")\n", "\n", "full_data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 1.3 Prepare data for getML" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "getML requires that we define *roles* for each of the columns." 
] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "full_data.set_role(\"InvoiceDate\", getml.data.roles.time_stamp, time_formats=['%Y-%m-%d %H:%M'])\n", "full_data.set_role([\"Customer ID\", \"Invoice\"], getml.data.roles.join_key)\n", "full_data.set_role([\"cancelled\"], getml.data.roles.target)\n", "full_data.set_role([\"Quantity\", \"Price\"], getml.data.roles.numerical)\n", "full_data.set_role(\"Country\", getml.data.roles.categorical)\n", "full_data.set_role(\"Description\", getml.data.roles.text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The *StockCode* is a 5-digit code that uniquely defines a product. It is hierarchical, meaning that every digit has a meaning. We want to make use of that, so we assign a unit to the stock code, which we can reference in our preprocessors." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "full_data.set_unit(\"StockCode\", \"code\")" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
0train
1validation
2train
3validation
4validation
...
\n", "\n", "

\n", " infinite number of rows
\n", " \n", " type: StringColumnView
\n", " \n", "

\n" ], "text/plain": [ " \n", " 0 train \n", " 1 validation\n", " 2 train \n", " 3 validation\n", " 4 validation\n", " ... \n", "\n", "\n", "infinite number of rows\n", "type: StringColumnView" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "split = getml.data.split.random(train=0.7, validation=0.15, test=0.15)\n", "split" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2. Predictive modeling\n", "\n", "We loaded the data and defined the roles and units. Next, we create a getML pipeline for relational learning." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 2.1 Define relational model\n", "\n", "To get started with relational learning, we need to specify the data model.\n", "\n", "In our case, there are two joins we are interested in: \n", "\n", "1) We want to take a look at all of the other orders on the same invoice.\n", "\n", "2) We want to check out how often a certain customer has cancelled orders in the past. Here, we limit ourselves to the last 90 days. To avoid data leaks, we set a horizon of one day." ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/html": [ "data model\n", "
\n", "
diagram
\n", "
full_datafull_datapopulationInvoice = InvoiceCustomer ID = Customer IDInvoiceDate <= InvoiceDateMemory: 90.0 daysHorizon: 1.0 daysLagged targets allowed
\n", "
\n", "\n", "
\n", "
staging
\n", " \n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
data framesstaging table
0populationPOPULATION__STAGING_TABLE_1
1full_dataFULL_DATA__STAGING_TABLE_2
\n", "
\n", " \n", "container\n", "
\n", "
\n", "
population
\n", " \n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
subset name rowstype
0testfull_data60013View
1trainfull_data278171View
2validationfull_data59741View
\n", "
\n", "
\n", "
peripheral
\n", " \n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
name rowstype
0full_data397925View
\n", "
\n", "
" ], "text/plain": [ "data model\n", "\n", " population:\n", " columns:\n", " - Country: categorical\n", " - Customer ID: join_key\n", " - Invoice: join_key\n", " - Quantity: numerical\n", " - Price: numerical\n", " - ...\n", "\n", " joins:\n", " - right: 'full_data'\n", " on: (population.Invoice, full_data.Invoice)\n", " relationship: 'many-to-many'\n", " lagged_targets: False\n", " - right: 'full_data'\n", " on: (population.Customer ID, full_data.Customer ID)\n", " time_stamps: (population.InvoiceDate, full_data.InvoiceDate)\n", " relationship: 'many-to-many'\n", " memory: 7776000.0\n", " horizon: 86400.0\n", " lagged_targets: True\n", "\n", " full_data:\n", " columns:\n", " - Country: categorical\n", " - Customer ID: join_key\n", " - Invoice: join_key\n", " - Quantity: numerical\n", " - Price: numerical\n", " - ...\n", "\n", " full_data:\n", " columns:\n", " - Country: categorical\n", " - Customer ID: join_key\n", " - Invoice: join_key\n", " - Quantity: numerical\n", " - Price: numerical\n", " - ...\n", "\n", "\n", "container\n", "\n", " population\n", " subset name rows type\n", " 0 test full_data 60013 View\n", " 1 train full_data 278171 View\n", " 2 validation full_data 59741 View\n", "\n", " peripheral\n", " name rows type\n", " 0 full_data 397925 View" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "star_schema = getml.data.StarSchema(\n", " population=full_data, \n", " alias=\"population\",\n", " split=split,\n", ")\n", "\n", "star_schema.join(\n", " full_data.drop(\"Description\"),\n", " alias=\"full_data\",\n", " on='Invoice',\n", ")\n", "\n", "star_schema.join(\n", " full_data.drop(\"Description\"),\n", " alias=\"full_data\",\n", " on='Customer ID',\n", " time_stamps='InvoiceDate',\n", " horizon=getml.data.time.days(1),\n", " memory=getml.data.time.days(90),\n", " lagged_targets=True,\n", ")\n", "\n", "star_schema" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 2.2 getML pipeline" ] }, { 
"cell_type": "markdown", "metadata": {}, "source": [ "\n", "__Set-up the feature learner & predictor__" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We have mentioned that the *StockCode* is a hierarchical code. To make use of that fact, we use getML's substring preprocessor, extracting the first digit, the first two digits etc. Since we have assigned the unit *code* to the *StockCode*, the preprocessors know which column they should be applied to." ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "substr1 = getml.preprocessors.Substring(0, 1, \"code\")\n", "substr2 = getml.preprocessors.Substring(0, 2, \"code\")\n", "substr3 = getml.preprocessors.Substring(0, 3, \"code\")\n", "\n", "mapping = getml.preprocessors.Mapping()\n", "\n", "text_field_splitter = getml.preprocessors.TextFieldSplitter()\n", "\n", "fast_prop = getml.feature_learning.FastProp(\n", " loss_function=getml.feature_learning.loss_functions.CrossEntropyLoss,\n", " num_threads=1,\n", " sampling_factor=0.1,\n", ")\n", "\n", "feature_selector = getml.predictors.XGBoostClassifier()\n", "\n", "predictor = getml.predictors.XGBoostClassifier()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__Build the pipeline__" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "Pipeline(data_model='population',\n", " feature_learners=['FastProp'],\n", " feature_selectors=['XGBoostClassifier'],\n", " include_categorical=False,\n", " loss_function='CrossEntropyLoss',\n", " peripheral=['full_data'],\n", " predictors=['XGBoostClassifier'],\n", " preprocessors=['Substring', 'Substring', 'Substring', 'Mapping', 'TextFieldSplitter'],\n", " share_selected_features=0.2,\n", " tags=['fast_prop'])" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pipe = getml.pipeline.Pipeline(\n", " tags=['fast_prop'],\n", " data_model=star_schema.data_model,\n", " preprocessors=[substr1, substr2, substr3, mapping, text_field_splitter],\n", " feature_learners=[fast_prop],\n", " feature_selectors=[feature_selector],\n", " predictors=[predictor],\n", " share_selected_features=0.2,\n", ")\n", "\n", "pipe" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 2.3 Model training" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n" ], "text/plain": [ "Checking data model\u001b[33m...\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "\u001b[2K Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:13\n", "\u001b[2K Checking... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[?25h" ] }, { "data": { "text/html": [ "
\n" ], "text/plain": [ "OK.\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "pipe.check(star_schema.train)" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n" ], "text/plain": [ "Checking data model\u001b[33m...\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "\u001b[2K Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[?25h" ] }, { "data": { "text/html": [ "
\n" ], "text/plain": [ "OK.\n" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "\u001b[2K Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:01\n", "\u001b[2K Indexing text fields... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K FastProp: Trying 206 features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:06\n", "\u001b[2K FastProp: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:54\n", "\u001b[2K XGBoost: Training as feature selector... ━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 01:40\n", "\u001b[2K XGBoost: Training as predictor... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:21\n", "\u001b[?25h" ] }, { "data": { "text/html": [ "
Trained pipeline.\n",
                            "
\n" ], "text/plain": [ "Trained pipeline.\n" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Time taken: 0:03:05.120829.\n", "\n" ] }, { "data": { "text/html": [ "
Pipeline(data_model='population',\n",
                            "         feature_learners=['FastProp'],\n",
                            "         feature_selectors=['XGBoostClassifier'],\n",
                            "         include_categorical=False,\n",
                            "         loss_function='CrossEntropyLoss',\n",
                            "         peripheral=['full_data'],\n",
                            "         predictors=['XGBoostClassifier'],\n",
                            "         preprocessors=['Substring', 'Substring', 'Substring', 'Mapping', 'TextFieldSplitter'],\n",
                            "         share_selected_features=0.2,\n",
                            "         tags=['fast_prop', 'container-TWm7IL'])
" ], "text/plain": [ "Pipeline(data_model='population',\n", " feature_learners=['FastProp'],\n", " feature_selectors=['XGBoostClassifier'],\n", " include_categorical=False,\n", " loss_function='CrossEntropyLoss',\n", " peripheral=['full_data'],\n", " predictors=['XGBoostClassifier'],\n", " preprocessors=['Substring', 'Substring', 'Substring', 'Mapping', 'TextFieldSplitter'],\n", " share_selected_features=0.2,\n", " tags=['fast_prop', 'container-TWm7IL'])" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pipe.fit(star_schema.train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 2.4 Model evaluation" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "lines_to_next_cell": 0 }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[2K Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K FastProp: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:01\n", "\u001b[?25h" ] }, { "data": { "text/html": [ "\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
date time set usedtarget accuracy auccross entropy
02024-09-12 11:51:44traincancelled0.98250.84460.0736
12024-09-12 11:51:47testcancelled0.98250.81190.07529
" ], "text/plain": [ " date time set used target accuracy auc cross entropy\n", "0 2024-09-12 11:51:44 train cancelled 0.9825 0.8446 0.0736 \n", "1 2024-09-12 11:51:47 test cancelled 0.9825 0.8119 0.07529" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pipe.score(star_schema.test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 2.5 Features\n", "\n", "The most important feature looks as follows:" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/markdown": [ "```sql\n", "DROP TABLE IF EXISTS \"FEATURE_1_194\";\n", "\n", "CREATE TABLE \"FEATURE_1_194\" AS\n", "SELECT AVG( t2.\"description__mapping_3_target_1_avg\" ) AS \"feature_1_194\",\n", " t1.rowid AS rownum\n", "FROM \"POPULATION__STAGING_TABLE_1\" t1\n", "INNER JOIN \"POPULATION__STAGING_TABLE_1__DESCRIPTION\" t2\n", "ON t1.\"rowid\" = t2.\"rownum\"\n", "GROUP BY t1.rowid;\n", "```" ], "text/plain": [ "'DROP TABLE IF EXISTS \"FEATURE_1_194\";\\n\\nCREATE TABLE \"FEATURE_1_194\" AS\\nSELECT AVG( t2.\"description__mapping_3_target_1_avg\" ) AS \"feature_1_194\",\\n t1.rowid AS rownum\\nFROM \"POPULATION__STAGING_TABLE_1\" t1\\nINNER JOIN \"POPULATION__STAGING_TABLE_1__DESCRIPTION\" t2\\nON t1.\"rowid\" = t2.\"rownum\"\\nGROUP BY t1.rowid;'" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pipe.features.to_sql()[pipe.features.sort(by=\"importances\")[0].name]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 2.6 Productionization\n", "\n", "It is possible to productionize the pipeline by transpiling the features into production-ready SQL code." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pipe.features.to_sql(dialect=getml.pipeline.dialect.spark_sql).save(\"online_retail_spark\")" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [], "source": [ "if RUN_SPARK:\n", " spark = SparkSession.builder.appName(\n", " \"online_retail\"\n", " ).config(\n", " \"spark.driver.maxResultSize\",\"5g\"\n", " ).config(\n", " \"spark.driver.memory\", \"5g\"\n", " ).config(\n", " \"spark.executor.memory\", \"5g\"\n", " ).config(\n", " \"spark.sql.execution.arrow.pyspark.enabled\", \"true\"\n", " ).config(\n", " \"spark.sql.session.timeZone\", \"UTC\"\n", " ).enableHiveSupport().getOrCreate()\n", "\n", " spark.sparkContext.setLogLevel(\"ERROR\")" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [], "source": [ "if RUN_SPARK:\n", " population_spark = star_schema.train.population.to_pyspark(spark, name=\"population\")\n", " peripheral_spark = star_schema.full_data.to_pyspark(spark, name=\"full_data\")" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [], "source": [ "if RUN_SPARK:\n", " getml.spark.execute(spark, \"online_retail_spark\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The resulting features are stored in a table called `FEATURES`. Here is how you can retrieve them:" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [], "source": [ "if RUN_SPARK:\n", " spark.sql(\"SELECT * FROM `FEATURES` LIMIT 20\").toPandas()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "getml.engine.shutdown()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3. Conclusion\n", "\n", "In this notebook, we have demonstrated how getML can be applied in an e-commerce setting. In particular, we have seen how results can be improved using the built-in hyperparameter tuning routines." 
] } ], "metadata": { "jupytext": { "cell_metadata_filter": "-all", "encoding": "# -*- coding: utf-8 -*-", "notebook_metadata_filter": "-all" }, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.4" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": false, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": true } }, "nbformat": 4, "nbformat_minor": 4 }