{ "cells": [ { "cell_type": "markdown", "metadata": { "toc": true }, "source": [ "

Table of Contents

\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Model Training with Aerospike Feature Store\n", "This notebook is the second in the series of notebooks that show how Aerospike can be used as a feature store.\n", "\n", "This notebook requires the Aerospike Database and Spark running locally with Aerospike Spark Connector. To create a Docker container that satisfies the requirements and holds a copy of Aerospike notebooks, visit the [Aerospike Notebooks Repo](https://github.com/aerospike-examples/interactive-notebooks)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Introduction\n", "This notebook shows how Aerospike can be used as a Feature Store for Machine Learning applications on Spark using Aerospike Spark Connector. It is Part 2 of the Feature Store series of notebooks, and focuses on Model Training aspects concerning a Feature Store. The [first notebook](feature-store-feature-eng.ipynb) in the series discusses Feature Engineering, and [the next one](feature-store-model-serving.ipynb) describes Model Serving." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![Reference Architecture](resources/fs-arch.jpg)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This notebook is organized as follows:\n", "- Summary of the prior (Data Engineering) notebook\n", "- Exploring features and datasets\n", "- Defining and saving a dataset\n", "- Training and saving an AI/ML model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Prerequisites\n", "This tutorial assumes familiarity with the following topics:\n", "\n", "- [Aerospike Notebooks - Readme and Tips](../readme_tips.ipynb)\n", "- [Hello World](../python/hello_world.ipynb)\n", "- [Aerospike Connect for Spark Tutorial for Python](AerospikeSparkPython.ipynb)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Setup\n", "Set up Aerospike Server. Spark Server, and Spark Connector." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Ensure Database Is Running\n", "This notebook requires that Aerospike datbase is running." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Aerospike database is running!\r\n" ] } ], "source": [ "!asd >& /dev/null\n", "!pgrep -x asd >/dev/null && echo \"Aerospike database is running!\" || echo \"**Aerospike database is not running!**\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Initialize Spark\n", "We will be using Spark functionality in this notebook." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Initialize Paths and Env Variables" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# directory where spark notebook requisites are installed\n", "SPARK_NB_DIR = '/opt/spark-nb'\n", "SPARK_DIR = 'spark-dir-link'\n", "SPARK_HOME = SPARK_NB_DIR + '/' + SPARK_DIR\n", "AEROSPIKE_JAR = 'aerospike-jar-link'\n", "AEROSPIKE_JAR_PATH = SPARK_NB_DIR + '/' + AEROSPIKE_JAR" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "# IP Address or DNS name for one host in your Aerospike cluster\n", "AS_HOST =\"localhost\"\n", "# Name of one of your namespaces. 
Type 'show namespaces' at the aql prompt if you are not sure\n", "AS_NAMESPACE = \"test\" \n", "AS_PORT = 3000 # Usually 3000, but change here if not\n", "AS_CONNECTION_STRING = AS_HOST + \":\"+ str(AS_PORT)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "# Locate the Spark installation using the SPARK_HOME parameter.\n", "import findspark\n", "findspark.init(SPARK_HOME)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "# Specify the Aerospike Spark Connector jar in the command used to interact with Aerospike.\n", "import os \n", "os.environ[\"PYSPARK_SUBMIT_ARGS\"] = '--jars ' + AEROSPIKE_JAR_PATH + ' pyspark-shell'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Configure Spark Session\n", "Please visit [Configuring Aerospike Connect for Spark](https://docs.aerospike.com/docs/connect/processing/spark/configuration.html) for more information about the properties used on this page." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "# imports\n", "import pyspark\n", "from pyspark.context import SparkContext\n", "from pyspark.sql.context import SQLContext\n", "from pyspark.sql.session import SparkSession\n", "from pyspark.sql.types import StringType, StructField, StructType, ArrayType, IntegerType, MapType, LongType, DoubleType" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "sc = SparkContext.getOrCreate()\n", "conf=sc._conf.setAll([(\"aerospike.namespace\",AS_NAMESPACE),(\"aerospike.seedhost\",AS_CONNECTION_STRING)])\n", "sc.stop()\n", "sc = pyspark.SparkContext(conf=conf)\n", "spark = SparkSession(sc)\n", "sqlContext = SQLContext(sc)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Access Shell Commands\n", "You may execute shell commands including Aerospike tools like [aql](https://docs.aerospike.com/docs/tools/aql/index.html) and [asadm](https://docs.aerospike.com/docs/tools/asadm/index.html) in the terminal tab throughout this tutorial. Open a terminal tab by selecting File->Open from the notebook menu, and then New->Terminal." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Context from Part 1 (Feature Engineering Notebook)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the [previous notebook](feature-store-feature-eng.ipynb) in the Feature Store series, we showed how features engineered using the Spark platform can be efficiently stored in the Aerospike feature store. We implemented a simple example feature store interface that leverages the Aerospike Spark connector capabilities for this purpose. We implemented a simple object model to save and query features, and illustrated its use with two examples.\n", "\n", "You are encouraged to review the Feature Engineering notebook as we will use the same object model, implementation (with some extensions), and data in this notebook. \n", "\n", "The code from Part 1 is replicated below as we will be using it later." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Code: Feature Group, Feature, and Entity\n", "Below, we have copied over the code for the Feature Group, Feature, and Entity classes for use in the following sections. Please review the object model described in the Feature Engineering notebook. 
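\n", "\n", "For orientation, here is a minimal, hypothetical usage sketch of these classes (the argument values are illustrative only; the actual class definitions and tests follow in the next cells):\n", "\n", "```\n", "fg = FeatureGroup(\"CC1\", \"Credit card transaction data\", \"Kaggle\", {\"entity\": \"cctxn\"}, [\"demo\"])\n", "fg.save()                                 # writes the metadata record to the fg-metadata set\n", "fg = FeatureGroup.load(\"CC1\")             # reads it back by its primary key (name)\n", "f_df = Feature.query(\"fgname == 'CC1'\")   # returns feature metadata as a Spark dataframe\n", "```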
" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "import copy\n", "\n", "# Feature Group\n", "class FeatureGroup:\n", " schema = StructType([StructField(\"name\", StringType(), False),\n", " StructField(\"description\", StringType(), True),\n", " StructField(\"source\", StringType(), True),\n", " StructField(\"attrs\", MapType(StringType(), StringType()), True),\n", " StructField(\"tags\", ArrayType(StringType()), True)])\n", " \n", " def __init__(self, name, description, source, attrs, tags):\n", " self.name = name\n", " self.description = description\n", " self.source = source\n", " self.attrs = attrs \n", " self.tags = tags \n", " return\n", "\n", " def __str__(self):\n", " return str(self.__class__) + \": \" + str(self.__dict__)\n", "\n", " def save(self):\n", " inputBuf = [(self.name, self.description, self.source, self.attrs, self.tags)]\n", " inputRDD = spark.sparkContext.parallelize(inputBuf) \n", " inputDF = spark.createDataFrame(inputRDD, FeatureGroup.schema)\n", " #Write the data frame to Aerospike, the name field is used as the primary key\n", " inputDF.write \\\n", " .mode('overwrite') \\\n", " .format(\"aerospike\") \\\n", " .option(\"aerospike.writeset\", \"fg-metadata\")\\\n", " .option(\"aerospike.updateByKey\", \"name\") \\\n", " .save()\n", " return \n", "\n", " def load(name):\n", " fg = None\n", " schema = copy.deepcopy(FeatureGroup.schema)\n", " schema.add(\"__key\", StringType(), False)\n", " fgdf = spark.read \\\n", " .format(\"aerospike\") \\\n", " .option(\"aerospike.set\", \"fg-metadata\") \\\n", " .schema(schema) \\\n", " .load().where(\"__key = \\\"\" + name + \"\\\"\") \n", " if fgdf.count() > 0:\n", " fgtuple = fgdf.collect()[0]\n", " fg = FeatureGroup(*fgtuple[:-1]) \n", " return fg\n", " \n", " def query(predicate): #returns a dataframe\n", " fg_df = spark.read \\\n", " .format(\"aerospike\") \\\n", " .schema(FeatureGroup.schema) \\\n", " .option(\"aerospike.set\", \"fg-metadata\") \\\n", " .load().where(predicate)\n", " return fg_df\n", " \n", "# Feature\n", "class Feature:\n", " schema = StructType([StructField(\"fid\", StringType(), False),\n", " StructField(\"fgname\", StringType(), False),\n", " StructField(\"name\", StringType(), False),\n", " StructField(\"type\", StringType(), False),\n", " StructField(\"description\", StringType(), True),\n", " StructField(\"attrs\", MapType(StringType(), StringType()), True),\n", " StructField(\"tags\", ArrayType(StringType()), True)])\n", "\n", " def __init__(self, fgname, name, ftype, description, attrs, tags):\n", " self.fid = fgname + '_' + name\n", " self.fgname = fgname\n", " self.name = name\n", " self.ftype = ftype\n", " self.description = description\n", " self.attrs = attrs \n", " self.tags = tags \n", " return\n", " \n", " def __str__(self):\n", " return str(self.__class__) + \": \" + str(self.__dict__)\n", "\n", " def save(self):\n", " inputBuf = [(self.fid, self.fgname, self.name, self.ftype, self.description, self.attrs, self.tags)]\n", " inputRDD = spark.sparkContext.parallelize(inputBuf) \n", " inputDF = spark.createDataFrame(inputRDD, Feature.schema)\n", " # Write the data frame to Aerospike, the fid field is used as the primary key\n", " inputDF.write \\\n", " .mode('overwrite') \\\n", " .format(\"aerospike\") \\\n", " .option(\"aerospike.writeset\", \"feature-metadata\")\\\n", " .option(\"aerospike.updateByKey\", \"fid\") \\\n", " .save()\n", " return \n", "\n", " def load(fgname, name):\n", " f = None\n", " schema = copy.deepcopy(Feature.schema)\n", 
" schema.add(\"__key\", StringType(), False)\n", " f_df = spark.read \\\n", " .format(\"aerospike\") \\\n", " .schema(schema) \\\n", " .option(\"aerospike.set\", \"feature-metadata\") \\\n", " .load().where(\"__key = \\\"\" + fgname+'_'+name + \"\\\"\") \n", " if f_df.count() > 0:\n", " f_tuple = f_df.collect()[0]\n", " f = Feature(*f_tuple[1:-1]) \n", " return f\n", " \n", " def query(predicate, pushdown_expr=None): #returns a dataframe\n", " f_df = spark.read \\\n", " .format(\"aerospike\") \\\n", " .schema(Feature.schema) \\\n", " .option(\"aerospike.set\", \"feature-metadata\") \n", " # see the section on pushdown expressions\n", " if pushdown_expr:\n", " f_df = f_df.option(\"aerospike.pushdown.expressions\", pushdown_expr) \\\n", " .load()\n", " else:\n", " f_df = f_df.load().where(predicate)\n", " return f_df\n", " \n", "# Entity\n", "class Entity:\n", " \n", " def __init__(self, etype, record, id_col):\n", " # record is an array of triples (name, type, value)\n", " self.etype = etype\n", " self.record = record\n", " self.id_col = id_col\n", " return\n", " \n", " def __str__(self):\n", " return str(self.__class__) + \": \" + str(self.__dict__)\n", " \n", " def get_schema(record): \n", " schema = StructType()\n", " for f in record:\n", " schema.add(f[0], f[1], True)\n", " return schema\n", " \n", " def get_id_type(schema, id_col): \n", " return schema[id_col].dataType.typeName()\n", "\n", " def save(self, schema):\n", " fvalues = [f[2] for f in self.record]\n", " inputBuf = [tuple(fvalues)]\n", " inputRDD = spark.sparkContext.parallelize(inputBuf) \n", " inputDF = spark.createDataFrame(inputRDD, schema)\n", " #Write the data frame to Aerospike, the id_col field is used as the primary key\n", " inputDF.write \\\n", " .mode('overwrite') \\\n", " .format(\"aerospike\") \\\n", " .option(\"aerospike.writeset\", self.etype+'-features')\\\n", " .option(\"aerospike.updateByKey\", self.id_col) \\\n", " .save()\n", " return \n", "\n", " def load(etype, eid, schema, id_col):\n", " ent = None\n", " schema = copy.deepcopy(schema)\n", " schema.add(\"__key\", StringType(), False)\n", " ent_df = spark.read \\\n", " .format(\"aerospike\") \\\n", " .schema(schema) \\\n", " .option(\"aerospike.set\", etype+'-features') \\\n", " .load().where(\"__key = \\\"\" + eid + \"\\\"\") \n", " if ent_df.count() > 0:\n", " ent_tuple = ent_df.collect()[0]\n", " record = [(schema[i].name, schema[i].dataType.typeName(), fv) for i, fv in enumerate(ent_tuple[:-1])]\n", " ent = Entity(etype, record, id_col) \n", " return ent\n", " \n", " def saveDF(df, etype, id_col): # save a dataframe\n", " # df: dataframe consisting of entiry records\n", " # etype: entity type (such as user or sensor)\n", " # id_col: column name that holds the primary key\n", " #Write the data frame to Aerospike, the column in id_col is used as the key bin\n", " df.write \\\n", " .mode('overwrite') \\\n", " .format(\"aerospike\") \\\n", " .option(\"aerospike.writeset\", etype+'-features')\\\n", " .option(\"aerospike.updateByKey\", id_col) \\\n", " .save()\n", " return \n", " \n", " \n", " def query(etype, predicate, schema, id_col): #returns a dataframe\n", " ent_df = spark.read \\\n", " .format(\"aerospike\") \\\n", " .schema(schema) \\\n", " .option(\"aerospike.set\", etype+'-features') \\\n", " .load().where(predicate)\n", " return ent_df\n", " \n", " def get_feature_vector(etype, eid, feature_list): # elements in feature_list are in \"fgname_name\" form\n", " # deferred to Model Serving tutorial \n", " pass" ] }, { "cell_type": "code", 
"execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "truncate test\r\n", "OK\r\n", "\r\n", "\r\n" ] } ], "source": [ "# clear the database by truncating the namespace test\n", "!aql -c \"truncate test\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Create set indexes on all sets." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "ok\n", "ok\n", "ok\n" ] } ], "source": [ "!asinfo -v \"set-config:context=namespace;id=test;set=fg-metadata;enable-index=true\"\n", "!asinfo -v \"set-config:context=namespace;id=test;set=feature-metadata;enable-index=true\"\n", "!asinfo -v \"set-config:context=namespace;id=test;set=dataset-metadata;enable-index=true\"\n", "#!asinfo -v \"set-config:context=namespace;id=test;set=cctxn-features;enable-index=true\"" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Feature group with name fg_name1:\n", ": {'name': 'fg_name1', 'description': 'fg_desc1', 'source': 'fg_source1', 'attrs': {'etype': 'etype1', 'key': 'feature1'}, 'tags': ['tag1', 'tag2']} \n", "\n", "Feature groups with a description containing 'desc':\n", "+--------+-----------+----------+--------------------+------------+\n", "| name|description| source| attrs| tags|\n", "+--------+-----------+----------+--------------------+------------+\n", "|fg_name2| fg_desc2|fg_source2|{etype -> etype1,...|[tag1, tag3]|\n", "|fg_name3| fg_desc3|fg_source3|{etype -> etype2,...|[tag4, tag5]|\n", "|fg_name1| fg_desc1|fg_source1|{etype -> etype1,...|[tag1, tag2]|\n", "+--------+-----------+----------+--------------------+------------+\n", "\n", "Feature groups with the source 'fg_source2':\n", "+--------+-----------+----------+--------------------+------------+\n", "| name|description| source| attrs| tags|\n", "+--------+-----------+----------+--------------------+------------+\n", "|fg_name2| fg_desc2|fg_source2|{etype -> etype1,...|[tag1, tag3]|\n", "+--------+-----------+----------+--------------------+------------+\n", "\n", "Feature groups with the attribute 'etype'='etype2':\n", "+--------+-----------+----------+--------------------+------------+\n", "| name|description| source| attrs| tags|\n", "+--------+-----------+----------+--------------------+------------+\n", "|fg_name3| fg_desc3|fg_source3|{etype -> etype2,...|[tag4, tag5]|\n", "+--------+-----------+----------+--------------------+------------+\n", "\n", "Feature groups with a tag 'tag1':\n", "+--------+-----------+----------+--------------------+------------+\n", "| name|description| source| attrs| tags|\n", "+--------+-----------+----------+--------------------+------------+\n", "|fg_name2| fg_desc2|fg_source2|{etype -> etype1,...|[tag1, tag3]|\n", "|fg_name1| fg_desc1|fg_source1|{etype -> etype1,...|[tag1, tag2]|\n", "+--------+-----------+----------+--------------------+------------+\n", "\n" ] } ], "source": [ "# test feature group \n", "# test save and load\n", "# save\n", "fg1 = FeatureGroup(\"fg_name1\", \"fg_desc1\", \"fg_source1\", {\"etype\":\"etype1\", \"key\":\"feature1\"}, [\"tag1\", \"tag2\"])\n", "fg1.save()\n", "# load\n", "fg2 = FeatureGroup.load(\"fg_name1\")\n", "print(\"Feature group with name fg_name1:\")\n", "print(fg2, '\\n')\n", "# test query\n", "fg2 = FeatureGroup(\"fg_name2\", \"fg_desc2\", \"fg_source2\", {\"etype\":\"etype1\", \"key\":\"fname1\"}, [\"tag1\", \"tag3\"])\n", "fg2.save()\n", 
"fg3 = FeatureGroup(\"fg_name3\", \"fg_desc3\", \"fg_source3\", {\"etype\":\"etype2\", \"key\":\"fname3\"}, [\"tag4\", \"tag5\"])\n", "fg3.save()\n", "# query 1\n", "print(\"Feature groups with a description containing 'desc':\")\n", "fg_df = FeatureGroup.query(\"description like '%desc%'\")\n", "fg_df.show()\n", "# query 2\n", "print(\"Feature groups with the source 'fg_source2':\")\n", "fg_df = FeatureGroup.query(\"source = 'fg_source2'\")\n", "fg_df.show()\n", "# query 3\n", "print(\"Feature groups with the attribute 'etype'='etype2':\")\n", "fg_df = FeatureGroup.query(\"attrs.etype = 'etype2'\")\n", "fg_df.show()\n", "# query 4\n", "print(\"Feature groups with a tag 'tag1':\")\n", "fg_df = FeatureGroup.query(\"array_contains(tags, 'tag1')\")\n", "fg_df.show()" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Feature with group 'fgname1' and name 'f_name1:\n", ": {'fid': 'fgname1_f_name1', 'fgname': 'fgname1', 'name': 'f_name1', 'ftype': 'integer', 'description': 'f_desc1', 'attrs': {'etype': 'etype1', 'f_attr1': 'v1'}, 'tags': ['f_tag1', 'f_tag2']} \n", "\n", "Features in feature group 'fg_name1':\n", "+---------------+-------+-------+-------+-----------+--------------------+----------------+\n", "| fid| fgname| name| type|description| attrs| tags|\n", "+---------------+-------+-------+-------+-----------+--------------------+----------------+\n", "|fgname1_f_name1|fgname1|f_name1|integer| f_desc1|{etype -> etype1,...|[f_tag1, f_tag2]|\n", "|fgname1_f_name2|fgname1|f_name2| double| f_desc2|{etype -> etype1,...|[f_tag1, f_tag3]|\n", "+---------------+-------+-------+-------+-----------+--------------------+----------------+\n", "\n", "Features of type 'integer':\n", "+---------------+-------+-------+-------+-----------+--------------------+----------------+\n", "| fid| fgname| name| type|description| attrs| tags|\n", "+---------------+-------+-------+-------+-----------+--------------------+----------------+\n", "|fgname1_f_name1|fgname1|f_name1|integer| f_desc1|{etype -> etype1,...|[f_tag1, f_tag2]|\n", "+---------------+-------+-------+-------+-----------+--------------------+----------------+\n", "\n", "Features with the attribute 'etype'='etype1':\n", "+---------------+-------+-------+-------+-----------+--------------------+----------------+\n", "| fid| fgname| name| type|description| attrs| tags|\n", "+---------------+-------+-------+-------+-----------+--------------------+----------------+\n", "|fgname1_f_name1|fgname1|f_name1|integer| f_desc1|{etype -> etype1,...|[f_tag1, f_tag2]|\n", "|fgname1_f_name2|fgname1|f_name2| double| f_desc2|{etype -> etype1,...|[f_tag1, f_tag3]|\n", "+---------------+-------+-------+-------+-----------+--------------------+----------------+\n", "\n", "Features with the tag 'f_tag2':\n", "+---------------+-------+-------+-------+-----------+--------------------+----------------+\n", "| fid| fgname| name| type|description| attrs| tags|\n", "+---------------+-------+-------+-------+-----------+--------------------+----------------+\n", "|fgname1_f_name1|fgname1|f_name1|integer| f_desc1|{etype -> etype1,...|[f_tag1, f_tag2]|\n", "|fgname2_f_name3|fgname2|f_name3| double| f_desc3|{etype -> etype2,...|[f_tag2, f_tag4]|\n", "+---------------+-------+-------+-------+-----------+--------------------+----------------+\n", "\n" ] } ], "source": [ "# test feature \n", "# test save and load\n", "# save\n", "feature1 = Feature(\"fgname1\", \"f_name1\", \"integer\", \"f_desc1\", 
{\"etype\":\"etype1\", \"f_attr1\":\"v1\"}, \n", " [\"f_tag1\", \"f_tag2\"])\n", "feature1.save()\n", "# load\n", "f1 = Feature.load(\"fgname1\", \"f_name1\")\n", "print(\"Feature with group 'fgname1' and name 'f_name1:\")\n", "print(f1, '\\n')\n", "# test query\n", "feature2 = Feature(\"fgname1\", \"f_name2\", \"double\", \"f_desc2\", {\"etype\":\"etype1\", \"f_attr1\":\"v2\"}, \n", " [\"f_tag1\", \"f_tag3\"])\n", "feature2.save()\n", "feature3 = Feature(\"fgname2\", \"f_name3\", \"double\", \"f_desc3\", {\"etype\":\"etype2\", \"f_attr2\":\"v3\"}, \n", " [\"f_tag2\", \"f_tag4\"])\n", "feature3.save()\n", "# query 1\n", "print(\"Features in feature group 'fg_name1':\")\n", "f_df = Feature.query(\"fgname = 'fgname1'\")\n", "f_df.show()\n", "# query 2\n", "print(\"Features of type 'integer':\")\n", "f_df = Feature.query(\"type = 'integer'\")\n", "f_df.show()\n", "# query 3\n", "print(\"Features with the attribute 'etype'='etype1':\")\n", "f_df = Feature.query(\"attrs.etype = 'etype1'\")\n", "f_df.show()\n", "# query 3\n", "print(\"Features with the tag 'f_tag2':\")\n", "f_df = Feature.query(\"array_contains(tags, 'f_tag2')\")\n", "f_df.show()" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Entity of type 'entity_type1' and id 'eid1':\n", ": {'etype': 'entity_type1', 'record': [('eid', 'string', 'eid1'), ('fg1_f_name1', 'integer', 1), ('fg1_f_name2', 'double', 2.0), ('fg1_f_name3', 'string', 'three')], 'id_col': 'eid'} \n", "\n", "Instances of entity type entity_type1 with id ending in 1:\n", "+----+-----------+-----------+-----------+\n", "| eid|fg1_f_name1|fg1_f_name2|fg1_f_name3|\n", "+----+-----------+-----------+-----------+\n", "|eid1| 1| 2.0| three|\n", "+----+-----------+-----------+-----------+\n", "\n", "Instances of entity type entity_type2 meeting the specified condition:\n", "+----+-----------+-----------+-----------+\n", "| eid|fg1_f_name1|fg1_f_name2|fg1_f_name3|\n", "+----+-----------+-----------+-----------+\n", "|eid2| 10| 20.0| thirty|\n", "+----+-----------+-----------+-----------+\n", "\n" ] } ], "source": [ "# test Entity\n", "# test save and load\n", "# save\n", "features1 = [('fg1_f_name1', IntegerType(), 1), ('fg1_f_name2', DoubleType(), 2.0), ('fg1_f_name3', StringType(), 'three')]\n", "record1 = [('eid', StringType(), 'eid1')] + features1\n", "ent1 = Entity('entity_type1', record1, 'eid')\n", "schema = Entity.get_schema(record1)\n", "ent1.save(schema);\n", "# load\n", "e1 = Entity.load('entity_type1', 'eid1', schema, 'eid')\n", "print(\"Entity of type 'entity_type1' and id 'eid1':\")\n", "print(e1, '\\n')\n", "# test query\n", "features2 = [('fg1_f_name1', IntegerType(), 10), ('fg1_f_name2', DoubleType(), 20.0), ('fg1_f_name3', StringType(), 'thirty')]\n", "record2 = [('eid', StringType(), 'eid2')] + features2\n", "ent2 = Entity('entity_type2', record2, 'eid')\n", "ent2.save(schema);\n", "# query 1\n", "print(\"Instances of entity type entity_type1 with id ending in 1:\")\n", "instances = Entity.query('entity_type1', 'eid like \"%1\"', schema, 'eid')\n", "instances.show()\n", "# query 2\n", "print(\"Instances of entity type entity_type2 meeting the specified condition:\")\n", "instances = Entity.query('entity_type2', 'eid in (\"eid2\")', schema, 'eid')\n", "instances.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Feature Data: Credit Card Transactions\n", "The following cell populates the data from Part 1 in the database for use below.\n", "### Read and 
Transform Data" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
TxnIdCC1_ClassCC1_AmountCC1_V1CC1_V2CC1_V3CC1_V4CC1_V5CC1_V6CC1_V7...CC1_V19CC1_V20CC1_V21CC1_V22CC1_V23CC1_V24CC1_V25CC1_V26CC1_V27CC1_V28
010149.62-1.359807-0.0727812.5363471.378155-0.3383210.4623880.239599...0.4039930.251412-0.0183070.277838-0.1104740.0669280.128539-0.1891150.133558-0.021053
1202.691.1918570.2661510.1664800.4481540.060018-0.082361-0.078803...-0.145783-0.069083-0.225775-0.6386720.101288-0.3398460.1671700.125895-0.0089830.014724
230378.66-1.358354-1.3401631.7732090.379780-0.5031981.8004990.791461...-2.2618570.5249800.2479980.7716790.909412-0.689281-0.327642-0.139097-0.055353-0.059752
340123.50-0.966272-0.1852261.792993-0.863291-0.0103091.2472030.237609...-1.232622-0.208038-0.1083000.005274-0.190321-1.1755750.647376-0.2219290.0627230.061458
45069.99-1.1582330.8777371.5487180.403034-0.4071930.0959210.592941...0.8034870.408542-0.0094310.798278-0.1374580.141267-0.2060100.5022920.2194220.215153
\n", "

5 rows × 31 columns

\n", "
" ], "text/plain": [ " TxnId CC1_Class CC1_Amount CC1_V1 CC1_V2 CC1_V3 CC1_V4 \\\n", "0 1 0 149.62 -1.359807 -0.072781 2.536347 1.378155 \n", "1 2 0 2.69 1.191857 0.266151 0.166480 0.448154 \n", "2 3 0 378.66 -1.358354 -1.340163 1.773209 0.379780 \n", "3 4 0 123.50 -0.966272 -0.185226 1.792993 -0.863291 \n", "4 5 0 69.99 -1.158233 0.877737 1.548718 0.403034 \n", "\n", " CC1_V5 CC1_V6 CC1_V7 ... CC1_V19 CC1_V20 CC1_V21 CC1_V22 \\\n", "0 -0.338321 0.462388 0.239599 ... 0.403993 0.251412 -0.018307 0.277838 \n", "1 0.060018 -0.082361 -0.078803 ... -0.145783 -0.069083 -0.225775 -0.638672 \n", "2 -0.503198 1.800499 0.791461 ... -2.261857 0.524980 0.247998 0.771679 \n", "3 -0.010309 1.247203 0.237609 ... -1.232622 -0.208038 -0.108300 0.005274 \n", "4 -0.407193 0.095921 0.592941 ... 0.803487 0.408542 -0.009431 0.798278 \n", "\n", " CC1_V23 CC1_V24 CC1_V25 CC1_V26 CC1_V27 CC1_V28 \n", "0 -0.110474 0.066928 0.128539 -0.189115 0.133558 -0.021053 \n", "1 0.101288 -0.339846 0.167170 0.125895 -0.008983 0.014724 \n", "2 0.909412 -0.689281 -0.327642 -0.139097 -0.055353 -0.059752 \n", "3 -0.190321 -1.175575 0.647376 -0.221929 0.062723 0.061458 \n", "4 -0.137458 0.141267 -0.206010 0.502292 0.219422 0.215153 \n", "\n", "[5 rows x 31 columns]" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# read and transform the sample credit card transactions data from a csv file\n", "from pyspark.sql.functions import expr\n", "df = spark.read.options(header=\"True\", inferSchema=\"True\") \\\n", " .csv(\"resources/creditcard_small.csv\") \\\n", " . orderBy(['_c0'], ascending=[True])\n", "new_col_names = ['CC1_' + (c if c != '_c0' else 'OldIdx') for c in df.columns]\n", "df = df.toDF(*new_col_names) \\\n", " .withColumn('TxnId', expr('CC1_OldIdx+1').cast(StringType())) \\\n", " .select(['TxnId','CC1_Class','CC1_Amount']+['CC1_V'+str(i) for i in range(1,29)])\n", "df.toPandas().head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Save Features\n", "Insert the credit card transaction features in the feature store." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Features stored to Feature Store.\n" ] } ], "source": [ "# 1. Create a feature group.\n", "FG_NAME = 'CC1'\n", "FG_DESCRIPTION = 'Credit card transaction data'\n", "FG_SOURCE = 'European cardholder dataset from Kaggle'\n", "fg = FeatureGroup(FG_NAME, FG_DESCRIPTION, FG_SOURCE,\n", " attrs={'entity':'cctxn', 'class':'fraud'}, tags=['kaggle', 'demo'])\n", "fg.save()\n", "\n", "# 2. Create feature metadata\n", "FEATURE_AMOUNT = 'Amount'\n", "f = Feature(FG_NAME, FEATURE_AMOUNT, 'double', \"Transaction amount\", \n", " attrs={'entity':'cctxn'}, tags=['usd'])\n", "f.save()\n", "FEATURE_CLASS = 'Class'\n", "f = Feature(FG_NAME, FEATURE_CLASS, 'integer', \"Label indicating fraud or not\", \n", " attrs={'entity':'cctxn'}, tags=['label'])\n", "f.save()\n", "FEATURE_PCA_XFORM = \"V\"\n", "for i in range(1,29):\n", " f = Feature(FG_NAME, FEATURE_PCA_XFORM+str(i), 'double', \"Transformed version of PCA\", \n", " attrs={'entity':'cctxn'}, tags=['pca'])\n", " f.save()\n", "\n", "# 3. 
Save feature values in entity records\n", "ENTITY_TYPE = 'cctxn'\n", "ID_COLUMN = 'TxnId'\n", "Entity.saveDF(df, ENTITY_TYPE, ID_COLUMN)\n", "print('Features stored to Feature Store.')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Implementing Dataset\n", "We created example implementations of Feature Group, Feature, and Entity objects as above. Let us now create a similar implementation of Dataset. \n", "\n", "## Object Model\n", "A dataset is a subset of features and entities selected for an ML model. A Dataset object holds the selected features and entity instances. The actual (materialized) copy of entity records is stored outside the feature store (for instance, in a file system). \n", "\n", "### Attributes\n", "A dataset record has the following attributes.\n", "\n", "- name: name of the data set, serves as the primary key for the record\n", "- description: human readable description\n", "- features: a list of the dataset features\n", "- predicate: query predicate to enumerate the entity instances in the dataset\n", "- location: external location where the dataset is stored\n", "- attrs: other metadata\n", "- tags: associated tags\n", "\n", "Datasets are stored in the set \"dataset-metadata\".\n", "\n", "### Operations\n", "Dataset is used during Model Training. The following operations are needed.\n", "\n", "- create\n", "- load (get)\n", "- query (returns dataset metadata records)\n", "- materialize (returns entity records as defined by a dataset)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Dataset Implementation\n", "Below is an example implementation of Dataset as described above." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "# Dataset\n", "class Dataset:\n", " schema = StructType([StructField(\"name\", StringType(), False),\n", " StructField(\"description\", StringType(), True),\n", " StructField(\"entity\", StringType(), False), \n", " StructField(\"id_col\", StringType(), False), \n", " StructField(\"id_type\", StringType(), False), \n", " StructField(\"features\", ArrayType(StringType()), True),\n", " StructField(\"query\", StringType(), True), \n", " StructField(\"location\", StringType(), True), \n", " StructField(\"attrs\", MapType(StringType(), StringType()), True),\n", " StructField(\"tags\", ArrayType(StringType()), True)])\n", " \n", " def __init__(self, name, description, entity, id_col, id_type,\n", " features, query, location, attrs, tags):\n", " self.name = name\n", " self.description = description\n", " self.entity = entity\n", " self.id_col = id_col\n", " self.id_type = id_type\n", " self.features = features\n", " self.query = query\n", " self.location = location\n", " self.attrs = attrs \n", " self.tags = tags \n", " return\n", "\n", " def __str__(self):\n", " return str(self.__class__) + \": \" + str(self.__dict__)\n", "\n", " def save(self):\n", " inputBuf = [(self.name, self.description, self.entity, self.id_col, self.id_type,\n", " self.features, self.query, self.location, self.attrs, self.tags)]\n", " inputRDD = spark.sparkContext.parallelize(inputBuf) \n", " inputDF = spark.createDataFrame(inputRDD, Dataset.schema)\n", " #Write the data frame to Aerospike, the name field is used as the primary key\n", " inputDF.write \\\n", " .mode('overwrite') \\\n", " .format(\"aerospike\") \\\n", " .option(\"aerospike.writeset\", \"dataset-metadata\")\\\n", " .option(\"aerospike.updateByKey\", \"name\") \\\n", " .save()\n", " return \n", "\n", " def load(name):\n", " dataset = 
None\n", " ds_df = spark.read \\\n", " .format(\"aerospike\") \\\n", " .option(\"aerospike.set\", \"dataset-metadata\") \\\n", " .schema(Dataset.schema) \\\n", " .option(\"aerospike.updateByKey\", \"name\") \\\n", " .load().where(\"name = \\\"\" + name + \"\\\"\") \n", " if ds_df.count() > 0:\n", " dstuple = ds_df.collect()[0]\n", " dataset = Dataset(*dstuple) \n", " return dataset\n", " \n", " def query(predicate): #returns a dataframe\n", " ds_df = spark.read \\\n", " .format(\"aerospike\") \\\n", " .schema(Dataset.schema) \\\n", " .option(\"aerospike.set\", \"dataset-metadata\") \\\n", " .load().where(predicate)\n", " return ds_df\n", " \n", " def features_to_schema(entity, id_col, id_type, features):\n", " def convert_field_type(ftype):\n", " return DoubleType() if ftype == 'double' \\\n", " else (IntegerType() if ftype in ['integer','long'] \\\n", " else StringType()) \n", " schema = StructType()\n", " schema.add(id_col, convert_field_type(id_type), False)\n", " for fid in features:\n", " sep = fid.find('_')\n", " f = Feature.load(fid[:sep] if sep != -1 else \"\", fid[sep+1:]) \n", " if f:\n", " schema.add(f.fid, convert_field_type(f.ftype), True)\n", " return schema\n", " \n", " def materialize_to_df(self):\n", " df = Entity.query(self.entity, self.query, \n", " Dataset.features_to_schema(self.entity, self.id_col, self.id_type, \n", " self.features), self.id_col) \n", " return df\n", " " ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Dataset named 'ds_test1':\n", ": {'name': 'ds_test1', 'description': 'Test dataset', 'entity': 'cctxn', 'id_col': 'TxnId', 'id_type': 'string', 'features': ['CC1_Amount', 'CC1_Class', 'CC1_V1'], 'query': 'CC1_Amount > 1500', 'location': '', 'attrs': {'risk': 'high'}, 'tags': ['test', 'dataset']} \n", "\n", "Datasets with attribute 'risk'='high' and tag 'test':\n", "+--------+------------+------+------+-------+--------------------+-----------------+--------+--------------+---------------+\n", "| name| description|entity|id_col|id_type| features| query|location| attrs| tags|\n", "+--------+------------+------+------+-------+--------------------+-----------------+--------+--------------+---------------+\n", "|ds_test1|Test dataset| cctxn| TxnId| string|[CC1_Amount, CC1_...|CC1_Amount > 1500| |{risk -> high}|[test, dataset]|\n", "+--------+------------+------+------+-------+--------------------+-----------------+--------+--------------+---------------+\n", "\n", "Materialize dataset ds_test1 as defined above:\n", "Records in the dataset: 4\n", "+------+----------+---------+-----------------+\n", "| TxnId|CC1_Amount|CC1_Class| CC1_V1|\n", "+------+----------+---------+-----------------+\n", "| 6972| 1809.68| 1|-3.49910753739178|\n", "| 165| 3828.04| 0|-6.09324780457494|\n", "|249168| 1504.93| 1|-1.60021129907252|\n", "|176050| 2125.87| 1|-2.00345953080582|\n", "+------+----------+---------+-----------------+\n", "\n" ] } ], "source": [ "# test Dataset\n", "# test save and load\n", "# save\n", "features = [\"CC1_Amount\", \"CC1_Class\", \"CC1_V1\"]\n", "ds = Dataset(\"ds_test1\", \"Test dataset\", \"cctxn\", \"TxnId\", \"string\",\n", " features, \"CC1_Amount > 1500\", \"\", {\"risk\":\"high\"}, [\"test\", \"dataset\"])\n", "ds.save()\n", "# load\n", "ds = Dataset.load(\"ds_test1\")\n", "print(\"Dataset named 'ds_test1':\")\n", "print(ds, '\\n')\n", "# test query\n", "print(\"Datasets with attribute 'risk'='high' and tag 'test':\")\n", "dsq_df = Dataset.query(\"attrs.risk 
== 'high' and array_contains(tags, 'test')\")\n", "dsq_df.show()\n", "# test materialize_to_df\n", "print(\"Materialize dataset ds_test1 as defined above:\")\n", "ds_df = ds.materialize_to_df()\n", "print(\"Records in the dataset: \", ds_df.count())\n", "ds_df.show(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Using Pushdown Expressions\n", "In order to get the best performance from the Aerospike feature store, one important optimization is to \"push down\" processing to the database and minimize the amount of data retrieved into Spark. This is especially important when querying large amounts of underlying data, such as when creating a dataset. It is achieved by \"pushing down\" filters so that they are processed in the database. \n", "\n", "Currently the Spark Connector allows two mutually exclusive ways of specifying filters in a dataframe load: \n", "1. The `where` clause \n", "2. The `pushdown expressions` option\n", "\n", "Only one may be specified because the underlying Aerospike database mechanisms used to process them are different and exclusive. The latter takes precedence if both are specified. \n", "\n", "The `where` clause filter may be pushed down in part or in full, depending on the parts in the filter (that is, whether the database supports them and the Spark Connector takes advantage of it). The `pushdown expression` filter, however, is fully processed in the database, which ensures the best performance.\n", "\n", "Aerospike expressions provide some filtering capabilities that are not available in Spark (such as record metadata based filtering). Also, expression based filtering is processed more efficiently in the database. On the other hand, the `where` clause also has many capabilities that are not available in Aerospike expressions. So it may be necessary to use both, in which case it is best to use a pushdown expression to retrieve a dataframe, and then process it further using Spark dataframe capabilities." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Creating Pushdown Expressions\n", "The Spark Connector currently requires the base64 encoding of the expression. Exporting the base64 encoded expression currently requires the Java client (which can be run in a parallel notebook) and entails the following steps:\n", "1. Write the expression in Java.\n", "2. Test the expression with the desired data.\n", "3. Obtain the base64 encoding.\n", "4. Use the base64 representation in this notebook as shown below.\n", "\n", "You can run the adjunct notebook [Pushdown Expressions for Spark Connector](resources/pushdown-expressions.ipynb) to follow the above recipe and obtain the base64 representation of an expression for use in the following examples." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Examples\n", "We illustrate pushdown expressions with `Feature` class queries, but the `query` method implementation can be adapted for the other objects. \n", "\n", "The examples below illustrate the capabilities and process of working with pushdown expressions. 
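\n", "\n", "As a rough sketch of the combined approach described above (\"<base64-expr>\" is a placeholder for an encoded expression and the final filter is illustrative; the option name matches the one used in the `Feature.query` implementation earlier in this notebook), a pushdown expression can be supplied at load time and the resulting dataframe can then be refined in Spark:\n", "\n", "```\n", "# push an encoded expression down to Aerospike, then refine the result in Spark\n", "f_df = spark.read \\\n", "    .format(\"aerospike\") \\\n", "    .schema(Feature.schema) \\\n", "    .option(\"aerospike.set\", \"feature-metadata\") \\\n", "    .option(\"aerospike.pushdown.expressions\", \"<base64-expr>\") \\\n", "    .load() \\\n", "    .where(\"array_contains(tags, 'pca')\")\n", "```\n", "\n", "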
More details on expressions are explained in [Understanding Expressions in Aerospike](../java/expressions.ipynb) notebook.\n", "\n", "### Records with Specific Tags\n", "Examine the expression in Java:\n", "```\n", " Exp.gt(\n", " ListExp.getByValueList(ListReturnType.COUNT, \n", " Exp.val(new ArrayList(Arrays.asList(\"label\",\"f_tag1\"))), \n", " Exp.listBin(\"tags\")),\n", " Exp.val(0))\n", "```\n", "The outer expression compares for the value returned from the first argument to be greater than 0. The first argument is the count of matching tags from the specified tags in the list bin `tags`.\n", "\n", "Obtain the base64 representation from [Understanding Expressions in Aerospike](../java/expressions.ipynb) notebook. It is \"kwOVfwIAkxcFkn6SpgNsYWJlbKcDZl90YWcxk1EEpHRhZ3MA\"" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
fidfgnamenametypedescriptionattrstags
0fgname1_f_name1fgname1f_name1integerf_desc1{'etype': 'etype1', 'f_attr1': 'v1'}[f_tag1, f_tag2]
1CC1_ClassCC1ClassintegerLabel indicating fraud or not{'entity': 'cctxn'}[label]
2fgname1_f_name2fgname1f_name2doublef_desc2{'etype': 'etype1', 'f_attr1': 'v2'}[f_tag1, f_tag3]
\n", "
" ], "text/plain": [ " fid fgname name type description \\\n", "0 fgname1_f_name1 fgname1 f_name1 integer f_desc1 \n", "1 CC1_Class CC1 Class integer Label indicating fraud or not \n", "2 fgname1_f_name2 fgname1 f_name2 double f_desc2 \n", "\n", " attrs tags \n", "0 {'etype': 'etype1', 'f_attr1': 'v1'} [f_tag1, f_tag2] \n", "1 {'entity': 'cctxn'} [label] \n", "2 {'etype': 'etype1', 'f_attr1': 'v2'} [f_tag1, f_tag3] " ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "base64_expr = \"kwOVfwIAkxcFkn6SpgNsYWJlbKcDZl90YWcxk1EEpHRhZ3MA\"\n", "f_df = Feature.query(None, pushdown_expr=base64_expr)\n", "f_df.toPandas()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Records with Specific Attribute Value\n", "Examine the expression in Java:\n", "```\n", "MapExp.getByKey(MapReturnType.VALUE, \n", " Exp.Type.STRING, Exp.val(\"f_attr1\"), Exp.mapBin(\"attrs\")), \n", " Exp.val(\"v1\"))\n", "```\n", "It would filter records having a key \"f_attr1\" with value \"v1\" from the map bin `attrs`. \n", "\n", "Obtain the base64 representation from [Understanding Expressions in Aerospike](../java/expressions.ipynb) notebook. It is \"kwGVfwMAk2EHqANmX2F0dHIxk1EFpWF0dHJzowN2MQ==\"." ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
fidfgnamenametypedescriptionattrstags
0fgname1_f_name1fgname1f_name1integerf_desc1{'etype': 'etype1', 'f_attr1': 'v1'}[f_tag1, f_tag2]
\n", "
" ], "text/plain": [ " fid fgname name type description \\\n", "0 fgname1_f_name1 fgname1 f_name1 integer f_desc1 \n", "\n", " attrs tags \n", "0 {'etype': 'etype1', 'f_attr1': 'v1'} [f_tag1, f_tag2] " ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "base64_expr = \"kwGVfwMAk2EHqANmX2F0dHIxk1EFpWF0dHJzowN2MQ==\"\n", "f_df = Feature.query(None, pushdown_expr=base64_expr)\n", "f_df.toPandas()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Records with String Matching Pattern \n", "Examine the expression in Java:\n", "```\n", "Exp.regexCompare(\"^c.*2$\", RegexFlag.ICASE, Exp.stringBin(\"fid\"))\n", "```\n", "It would filter records with fid starting with \"c\" and ending in \"2\" (case insensitive).\n", "\n", "Obtain the base64 representation from [Understanding Expressions in Aerospike](../java/expressions.ipynb) notebook. It is \"lAcCpl5DLioyJJNRA6NmaWQ=\"." ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
fidfgnamenametypedescriptionattrstags
0CC1_V2CC1V2doubleTransformed version of PCA{'entity': 'cctxn'}[pca]
1CC1_V12CC1V12doubleTransformed version of PCA{'entity': 'cctxn'}[pca]
2CC1_V22CC1V22doubleTransformed version of PCA{'entity': 'cctxn'}[pca]
\n", "
" ], "text/plain": [ " fid fgname name type description \\\n", "0 CC1_V2 CC1 V2 double Transformed version of PCA \n", "1 CC1_V12 CC1 V12 double Transformed version of PCA \n", "2 CC1_V22 CC1 V22 double Transformed version of PCA \n", "\n", " attrs tags \n", "0 {'entity': 'cctxn'} [pca] \n", "1 {'entity': 'cctxn'} [pca] \n", "2 {'entity': 'cctxn'} [pca] " ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "base64_expr = \"lAcCpl5DLioyJJNRA6NmaWQ=\"\n", "f_df = Feature.query(None, pushdown_expr=base64_expr)\n", "f_df.toPandas()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Exploring Features in Feature Store\n", "Now let's explore the features available in the Feature Store prior to using them to train a model. We will illustrate this with the querying functions on the metadata objects we have implemented above, as well as Spark functions." ] }, { "cell_type": "raw", "metadata": {}, "source": [ "## Exploring Datasets\n", "As we are interested in building a fraud detection model, let's see if there are any existing datasets that have \"fraud' in their description. At present there should be no datasets in the database until we create and save one in later sections." ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+----+-----------+------+------+-------+--------+-----+--------+-----+----+\n", "|name|description|entity|id_col|id_type|features|query|location|attrs|tags|\n", "+----+-----------+------+------+-------+--------+-----+--------+-----+----+\n", "+----+-----------+------+------+-------+--------+-----+--------+-----+----+\n", "\n" ] } ], "source": [ "ds_df = Dataset.query(\"description like '%fraud%'\")\n", "ds_df.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exploring Feature Groups\n", "Let's identify feature groups for the entity type \"cctxn\" (credit card transactions) that have an attribute \"class\"=\"fraud\"\n" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
0
nameCC1
descriptionCredit card transaction data
sourceEuropean cardholder dataset from Kaggle
attrs{'class': 'fraud', 'entity': 'cctxn'}
tags[kaggle, demo]
\n", "
" ], "text/plain": [ " 0\n", "name CC1\n", "description Credit card transaction data\n", "source European cardholder dataset from Kaggle\n", "attrs {'class': 'fraud', 'entity': 'cctxn'}\n", "tags [kaggle, demo]" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fg_df = FeatureGroup.query(\"attrs.entity == 'cctxn' and attrs.class == 'fraud'\")\n", "fg_df.toPandas().transpose().head()" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
fidfgnamenametypedescriptionattrstags
0CC1_V23CC1V23doubleTransformed version of PCA{'entity': 'cctxn'}[pca]
1CC1_V10CC1V10doubleTransformed version of PCA{'entity': 'cctxn'}[pca]
2CC1_ClassCC1ClassintegerLabel indicating fraud or not{'entity': 'cctxn'}[label]
3CC1_V20CC1V20doubleTransformed version of PCA{'entity': 'cctxn'}[pca]
4CC1_V16CC1V16doubleTransformed version of PCA{'entity': 'cctxn'}[pca]
5CC1_V1CC1V1doubleTransformed version of PCA{'entity': 'cctxn'}[pca]
6CC1_V6CC1V6doubleTransformed version of PCA{'entity': 'cctxn'}[pca]
7CC1_V25CC1V25doubleTransformed version of PCA{'entity': 'cctxn'}[pca]
8CC1_V9CC1V9doubleTransformed version of PCA{'entity': 'cctxn'}[pca]
9CC1_V2CC1V2doubleTransformed version of PCA{'entity': 'cctxn'}[pca]
10CC1_V3CC1V3doubleTransformed version of PCA{'entity': 'cctxn'}[pca]
11CC1_V12CC1V12doubleTransformed version of PCA{'entity': 'cctxn'}[pca]
12CC1_V21CC1V21doubleTransformed version of PCA{'entity': 'cctxn'}[pca]
13CC1_V27CC1V27doubleTransformed version of PCA{'entity': 'cctxn'}[pca]
14CC1_AmountCC1AmountdoubleTransaction amount{'entity': 'cctxn'}[usd]
15CC1_V24CC1V24doubleTransformed version of PCA{'entity': 'cctxn'}[pca]
16CC1_V7CC1V7doubleTransformed version of PCA{'entity': 'cctxn'}[pca]
17CC1_V28CC1V28doubleTransformed version of PCA{'entity': 'cctxn'}[pca]
18CC1_V4CC1V4doubleTransformed version of PCA{'entity': 'cctxn'}[pca]
19CC1_V13CC1V13doubleTransformed version of PCA{'entity': 'cctxn'}[pca]
20CC1_V17CC1V17doubleTransformed version of PCA{'entity': 'cctxn'}[pca]
21CC1_V18CC1V18doubleTransformed version of PCA{'entity': 'cctxn'}[pca]
22CC1_V26CC1V26doubleTransformed version of PCA{'entity': 'cctxn'}[pca]
23CC1_V19CC1V19doubleTransformed version of PCA{'entity': 'cctxn'}[pca]
24CC1_V14CC1V14doubleTransformed version of PCA{'entity': 'cctxn'}[pca]
25CC1_V11CC1V11doubleTransformed version of PCA{'entity': 'cctxn'}[pca]
26CC1_V8CC1V8doubleTransformed version of PCA{'entity': 'cctxn'}[pca]
27CC1_V5CC1V5doubleTransformed version of PCA{'entity': 'cctxn'}[pca]
28CC1_V22CC1V22doubleTransformed version of PCA{'entity': 'cctxn'}[pca]
29CC1_V15CC1V15doubleTransformed version of PCA{'entity': 'cctxn'}[pca]
\n", "
" ], "text/plain": [ " fid fgname name type description \\\n", "0 CC1_V23 CC1 V23 double Transformed version of PCA \n", "1 CC1_V10 CC1 V10 double Transformed version of PCA \n", "2 CC1_Class CC1 Class integer Label indicating fraud or not \n", "3 CC1_V20 CC1 V20 double Transformed version of PCA \n", "4 CC1_V16 CC1 V16 double Transformed version of PCA \n", "5 CC1_V1 CC1 V1 double Transformed version of PCA \n", "6 CC1_V6 CC1 V6 double Transformed version of PCA \n", "7 CC1_V25 CC1 V25 double Transformed version of PCA \n", "8 CC1_V9 CC1 V9 double Transformed version of PCA \n", "9 CC1_V2 CC1 V2 double Transformed version of PCA \n", "10 CC1_V3 CC1 V3 double Transformed version of PCA \n", "11 CC1_V12 CC1 V12 double Transformed version of PCA \n", "12 CC1_V21 CC1 V21 double Transformed version of PCA \n", "13 CC1_V27 CC1 V27 double Transformed version of PCA \n", "14 CC1_Amount CC1 Amount double Transaction amount \n", "15 CC1_V24 CC1 V24 double Transformed version of PCA \n", "16 CC1_V7 CC1 V7 double Transformed version of PCA \n", "17 CC1_V28 CC1 V28 double Transformed version of PCA \n", "18 CC1_V4 CC1 V4 double Transformed version of PCA \n", "19 CC1_V13 CC1 V13 double Transformed version of PCA \n", "20 CC1_V17 CC1 V17 double Transformed version of PCA \n", "21 CC1_V18 CC1 V18 double Transformed version of PCA \n", "22 CC1_V26 CC1 V26 double Transformed version of PCA \n", "23 CC1_V19 CC1 V19 double Transformed version of PCA \n", "24 CC1_V14 CC1 V14 double Transformed version of PCA \n", "25 CC1_V11 CC1 V11 double Transformed version of PCA \n", "26 CC1_V8 CC1 V8 double Transformed version of PCA \n", "27 CC1_V5 CC1 V5 double Transformed version of PCA \n", "28 CC1_V22 CC1 V22 double Transformed version of PCA \n", "29 CC1_V15 CC1 V15 double Transformed version of PCA \n", "\n", " attrs tags \n", "0 {'entity': 'cctxn'} [pca] \n", "1 {'entity': 'cctxn'} [pca] \n", "2 {'entity': 'cctxn'} [label] \n", "3 {'entity': 'cctxn'} [pca] \n", "4 {'entity': 'cctxn'} [pca] \n", "5 {'entity': 'cctxn'} [pca] \n", "6 {'entity': 'cctxn'} [pca] \n", "7 {'entity': 'cctxn'} [pca] \n", "8 {'entity': 'cctxn'} [pca] \n", "9 {'entity': 'cctxn'} [pca] \n", "10 {'entity': 'cctxn'} [pca] \n", "11 {'entity': 'cctxn'} [pca] \n", "12 {'entity': 'cctxn'} [pca] \n", "13 {'entity': 'cctxn'} [pca] \n", "14 {'entity': 'cctxn'} [usd] \n", "15 {'entity': 'cctxn'} [pca] \n", "16 {'entity': 'cctxn'} [pca] \n", "17 {'entity': 'cctxn'} [pca] \n", "18 {'entity': 'cctxn'} [pca] \n", "19 {'entity': 'cctxn'} [pca] \n", "20 {'entity': 'cctxn'} [pca] \n", "21 {'entity': 'cctxn'} [pca] \n", "22 {'entity': 'cctxn'} [pca] \n", "23 {'entity': 'cctxn'} [pca] \n", "24 {'entity': 'cctxn'} [pca] \n", "25 {'entity': 'cctxn'} [pca] \n", "26 {'entity': 'cctxn'} [pca] \n", "27 {'entity': 'cctxn'} [pca] \n", "28 {'entity': 'cctxn'} [pca] \n", "29 {'entity': 'cctxn'} [pca] " ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# View all available features in this feature group\n", "f_df = Feature.query(\"fgname == 'CC1'\")\n", "f_df.toPandas()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The features look promising for a fraud prediction model. Let's look at the actual feature data and its characteristics by querying the entity records." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exploring Feature Data\n", "We can further explore the feature data to determine what features should be part of the dataset. 
The feature data resides in Entity records and we can use the above info to form the schema and retrieve the records." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Defining Schema\n", "In order to query using the Aerospike Spark Connector, we must define the schema \n", "for the record." ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "# define the schema for the record.\n", "FG_NAME = 'CC1'\n", "ENTITY_TYPE = 'cctxn'\n", "ID_COLUMN = 'TxnId'\n", "FEATURE_AMOUNT = 'Amount'\n", "FEATURE_CLASS = 'Class'\n", "FEATURE_PCA_XFORM = \"V\"\n", "\n", "schema = StructType([StructField(ID_COLUMN, StringType(), False),\n", " StructField(FG_NAME+'_'+FEATURE_CLASS, IntegerType(), False),\n", " StructField(FG_NAME+'_'+FEATURE_AMOUNT, DoubleType(), False)])\n", "for i in range(1,29):\n", " schema.add(FG_NAME+'_'+FEATURE_PCA_XFORM+str(i), DoubleType(), True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Retrieving Data\n", "Here we get all records from the sample data in the database. A small subset of the data would suffice in practice." ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Records retrieved: 984\n", "root\n", " |-- TxnId: string (nullable = false)\n", " |-- CC1_Class: integer (nullable = false)\n", " |-- CC1_Amount: double (nullable = false)\n", " |-- CC1_V1: double (nullable = true)\n", " |-- CC1_V2: double (nullable = true)\n", " |-- CC1_V3: double (nullable = true)\n", " |-- CC1_V4: double (nullable = true)\n", " |-- CC1_V5: double (nullable = true)\n", " |-- CC1_V6: double (nullable = true)\n", " |-- CC1_V7: double (nullable = true)\n", " |-- CC1_V8: double (nullable = true)\n", " |-- CC1_V9: double (nullable = true)\n", " |-- CC1_V10: double (nullable = true)\n", " |-- CC1_V11: double (nullable = true)\n", " |-- CC1_V12: double (nullable = true)\n", " |-- CC1_V13: double (nullable = true)\n", " |-- CC1_V14: double (nullable = true)\n", " |-- CC1_V15: double (nullable = true)\n", " |-- CC1_V16: double (nullable = true)\n", " |-- CC1_V17: double (nullable = true)\n", " |-- CC1_V18: double (nullable = true)\n", " |-- CC1_V19: double (nullable = true)\n", " |-- CC1_V20: double (nullable = true)\n", " |-- CC1_V21: double (nullable = true)\n", " |-- CC1_V22: double (nullable = true)\n", " |-- CC1_V23: double (nullable = true)\n", " |-- CC1_V24: double (nullable = true)\n", " |-- CC1_V25: double (nullable = true)\n", " |-- CC1_V26: double (nullable = true)\n", " |-- CC1_V27: double (nullable = true)\n", " |-- CC1_V28: double (nullable = true)\n", "\n" ] } ], "source": [ "## let's get the entity records to assess the data\n", "txn_df = Entity.query(ENTITY_TYPE, \"TxnId like '%'\", schema, \"TxnId\")\n", "print(\"Records retrieved: \", txn_df.count())\n", "txn_df.printSchema()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Examining Data\n", "We will examine the statistical properties as well as null values of the feature columns. Note that the column CC1_Class is the label (fraud or not). " ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " 0 1 2 \\\n", "summary count mean stddev \n", "TxnId 984 59771.279471544716 83735.17714512878 \n", "CC1_Class 984 0.5 0.5002542588519272 \n", "CC1_Amount 984 96.22459349593494 240.14239707065826 \n", "CC1_V1 984 -2.4674030372100715 5.40712231422648 \n", "CC1_V2 984 1.9053035968231344 3.5961094277406076 \n", "CC1_V3 984 -3.0838842028294335 6.435904925385388 \n", "CC1_V4 984 2.456780057740528 3.0427216170397466 \n", "CC1_V5 984 -1.5617259373325372 4.202691637741722 \n", "CC1_V6 984 -0.572583991041022 1.8036571668000605 \n", "CC1_V7 984 -2.73090333834317 5.863241960076915 \n", "CC1_V8 984 0.26108185138806433 4.850081053008372 \n", "CC1_V9 984 -1.301144796452937 2.2667801026716186 \n", "CC1_V10 984 -2.805194376398951 4.549492504413138 \n", "CC1_V11 984 1.9525351017305452 2.7369799649027207 \n", "CC1_V12 984 -2.995316874600595 4.657383279424635 \n", "CC1_V13 984 -0.0902914283635715 1.0102129366924129 \n", "CC1_V14 984 -3.597226605511213 4.5682405087763325 \n", "CC1_V15 984 0.06275139057382162 1.0021871899317296 \n", "CC1_V16 984 -2.1571248198091597 3.42439305003353 \n", "CC1_V17 984 -3.36609535335953 5.953540928078054 \n", "CC1_V18 984 -1.2187062731658431 2.3587681071910915 \n", "CC1_V19 984 0.3359445791509033 1.2843379816775733 \n", "CC1_V20 984 0.21117939872897198 1.0613528102262861 \n", "CC1_V21 984 0.3548982757919287 2.7872670478499595 \n", "CC1_V22 984 -0.04448149211405776 1.1450798238059015 \n", "CC1_V23 984 -0.036528942589509734 1.148960101817997 \n", "CC1_V24 984 -0.047380430113435276 0.5866834793500019 \n", "CC1_V25 984 0.08757054553217883 0.6404192414977024 \n", "CC1_V26 984 0.026120460105754934 0.4682991121957343 \n", "CC1_V27 984 0.09618165650018666 1.0037324673667467 \n", "CC1_V28 984 0.027865303758426337 0.4429545316584082 \n", "\n", " 3 4 \n", "summary min max \n", "TxnId 1 99507 \n", "CC1_Class 0 1 \n", "CC1_Amount 0.0 3828.04 \n", "CC1_V1 -30.552380043581 2.13238602134104 \n", "CC1_V2 -12.1142127363483 22.0577289904909 \n", "CC1_V3 -31.1036848245812 3.77285685226266 \n", "CC1_V4 -4.51582435488105 12.1146718424589 \n", "CC1_V5 -22.105531524316 11.0950886001596 \n", "CC1_V6 -6.40626663445964 6.47411462748849 \n", "CC1_V7 -43.5572415712451 5.80253735302589 \n", "CC1_V8 -41.0442609210741 20.0072083651213 \n", "CC1_V9 -13.4340663182301 5.43663339611854 \n", "CC1_V10 -24.5882624372475 8.73745780611353 \n", "CC1_V11 -2.33201137167952 12.0189131816199 \n", "CC1_V12 -18.6837146333443 2.15205511590243 \n", "CC1_V13 -3.12779501198771 2.81543981456255 \n", "CC1_V14 -19.2143254902614 3.44242199594215 \n", "CC1_V15 -4.49894467676621 2.47135790380837 \n", "CC1_V16 -14.1298545174931 3.13965565883069 \n", "CC1_V17 -25.1627993693248 6.73938438478335 \n", "CC1_V18 -9.49874592104677 3.79031621184375 \n", "CC1_V19 -3.68190355226504 5.2283417900513 \n", "CC1_V20 -4.12818582871798 11.0590042933942 \n", "CC1_V21 -22.7976039055519 27.2028391573154 \n", "CC1_V22 -8.88701714094871 8.36198519168435 \n", "CC1_V23 -19.2543276173719 5.46622995370963 \n", "CC1_V24 -2.02802422921896 1.21527882183022 \n", "CC1_V25 -4.78160552206407 2.20820917836653 \n", "CC1_V26 -1.24392415371264 3.06557569653728 \n", "CC1_V27 -7.26348214633855 3.05235768679424 \n", "CC1_V28 -2.73388711897575 1.77936385243205 " ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# examine the statistical properties\n", "txn_df.describe().toPandas().transpose()" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " TxnId CC1_Class CC1_Amount CC1_V1 CC1_V2 CC1_V3 CC1_V4 CC1_V5 \\\n", "0 0 0 0 0 0 0 0 0 \n", "\n", " CC1_V6 CC1_V7 ... CC1_V19 CC1_V20 CC1_V21 CC1_V22 CC1_V23 CC1_V24 \\\n", "0 0 0 ... 0 0 0 0 0 0 \n", "\n", " CC1_V25 CC1_V26 CC1_V27 CC1_V28 \n", "0 0 0 0 0 \n", "\n", "[1 rows x 31 columns]" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# check for null values\n", "from pyspark.sql.functions import count, when, isnan\n", "txn_df.select([count(when(isnan(c), c)).alias(c) for c in txn_df.columns]).toPandas().head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Defining Dataset\n", "Based on the above exploration, we will choose features V1-V28 for our training dataset, which we will define below.\n", "\n", "In addition to the features, we also need to choose the data records for the dataset. We only have a small data from the original dataset, and therefore we will use all the available records by setting the dataset query predicate to \"true\". \n", "\n", "It is possible to create a random dataset of random records by performing an \"aerolookup\" of randomly selected key values." ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Records in the dataset: 984\n" ] } ], "source": [ "# Create a dataset with the V1-V28 features. \n", "CC_FRAUD_DATASET = \"CC_FRAUD_DETECTION\"\n", "features = [\"CC1_V\"+str(i) for i in range(1,29)]\n", "features_and_label = [\"CC1_Class\"] + features\n", "ds = Dataset(CC_FRAUD_DATASET, \"Training dataset for fraud detection model\", \"cctxn\", \"TxnId\", \"string\",\n", " features_and_label, \"true\", \"\", {\"class\":\"fraud\"}, [\"test\", \"2017\"])\n", "ds_df = ds.materialize_to_df()\n", "print(\"Records in the dataset: \", ds_df.count())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Save Dataset\n", "Save the dataset in Feature Store for future use. " ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [], "source": [ "# save the materialized dataset externally in a file\n", "DATASET_PATH = 'resources/fs_part2_dataset_cctxn.csv'\n", "ds_df.write.csv(path=DATASET_PATH, header=\"true\", mode=\"overwrite\", sep=\"\\t\")\n", " \n", "# save the dataset metadata in the feature store\n", "ds.location = DATASET_PATH\n", "ds.save()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Query and Verify Dataset\n", "Verify the saved dataset is in the feature store for future exploration and use." ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " 0\n", "name CC_FRAUD_DETECTION\n", "description Training dataset for fraud detection model\n", "entity cctxn\n", "id_col TxnId\n", "id_type string\n", "features [CC1_Class, CC1_V1, CC1_V2, CC1_V3, CC1_V4, CC...\n", "query true\n", "location resources/fs_part2_dataset_cctxn.csv\n", "attrs {'class': 'fraud'}\n", "tags [test, 2017]" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dsq_df = Dataset.query(\"description like '%fraud%'\")\n", "dsq_df.toPandas().transpose()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Verify the database through an AQL query on the set \"dataset-metadata\"." ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "select * from test.dataset-metadata\r\n", "+--------------------------------------+----------------------------------------------+---------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------+----------+----------------------------------------+----------------------+---------------------+-----------------------------+\r\n", "| attrs | description | entity | features | id_col | id_type | location | name | query | tags |\r\n", "+--------------------------------------+----------------------------------------------+---------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------+----------+----------------------------------------+----------------------+---------------------+-----------------------------+\r\n", "| KEY_ORDERED_MAP('{\"class\":\"fraud\"}') | \"Training dataset for fraud detection model\" | \"cctxn\" | LIST('[\"CC1_Class\", \"CC1_V1\", \"CC1_V2\", \"CC1_V3\", \"CC1_V4\", \"CC1_V5\", \"CC1_V6\", \"CC1_V7\", \"CC1_V8\", \"CC1_V9\", \"CC1_V10\", \"CC1_V11\", \"CC1_V12\", \"CC1_V13\", \"CC1_V14\", \"CC1_V15\", \"CC1_V16\", \"CC1_V17\", \"CC1_V18\", \"CC1_V19\", \"CC1_V20\", \"CC1_V21\", \"CC1_V22\", \" | \"TxnId\" | \"string\" | \"resources/fs_part2_dataset_cctxn.csv\" | \"CC_FRAUD_DETECTION\" | \"true\" | LIST('[\"test\", \"2017\"]') |\r\n", "| KEY_ORDERED_MAP('{\"risk\":\"high\"}') | \"Test dataset\" | \"cctxn\" | LIST('[\"CC1_Amount\", \"CC1_Class\", \"CC1_V1\"]') | \"TxnId\" | \"string\" | \"\" | \"ds_test1\" | \"CC1_Amount > 1500\" | LIST('[\"test\", \"dataset\"]') |\r\n", "+--------------------------------------+----------------------------------------------+---------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------+----------+----------------------------------------+----------------------+---------------------+-----------------------------+\r\n", "2 rows in set (0.014 secs)\r\n", "\r\n", "OK\r\n", "\r\n", "\r\n" ] } ], "source": [ "!aql -c \"select * from test.dataset-metadata\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Create AI/ML Model\n", "Below we will choose two algorithms to predict fraud in a credit card transcation: 
LogisticRegression and RandomForestClassifier." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Create Training and Test Sets\n", "We first split the dataset into training and test sets to train and evaluate a model." ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Training dataset records: 791\n", "Test dataset records: 193\n" ] } ], "source": [ "from pyspark.ml.feature import VectorAssembler\n", "\n", "# create a feature vector from features\n", "assembler = VectorAssembler(inputCols=features, outputCol=\"fvector\")\n", "ds_df2 = assembler.transform(ds_df)\n", "\n", "# split the dataset into randomly selected training and test sets\n", "train, test = ds_df2.randomSplit([0.8,0.2], seed=2021)\n", "print('Training dataset records:', train.count())\n", "print('Test dataset records:', test.count())" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+---------+-----+\n", "|CC1_Class|count|\n", "+---------+-----+\n", "| 1| 380|\n", "| 0| 411|\n", "+---------+-----+\n", "\n" ] } ], "source": [ "# examine the fraud cases in the training set\n", "train.groupby('CC1_Class').count().show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Train Model\n", "We choose two models to train: LogisticRegression and RandomForestClassifier." ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [], "source": [ "from pyspark.ml.classification import LogisticRegression, RandomForestClassifier\n", "lr_algo = LogisticRegression(featuresCol='fvector', labelCol='CC1_Class', maxIter=5)\n", "lr_model = lr_algo.fit(train)\n", "\n", "rf_algo = RandomForestClassifier(featuresCol='fvector', labelCol='CC1_Class')\n", "rf_model = rf_algo.fit(train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Evaluate Model\n", "Run the trained models on the test set and evaluate their performance metrics."
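, "\n", "Accuracy alone does not tell the whole story for a fraud model, so the cells below also report the areas under the ROC and PR curves. As an optional sanity check, the next cell is a minimal sketch (it assumes the `lr_model` and `test` variables created above) that breaks down the logistic regression predictions by actual class." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Optional sanity check (sketch): per-class breakdown of logistic regression predictions.\n", "# Assumes lr_model and test (with the CC1_Class label and fvector columns) from the cells above.\n", "lr_check = lr_model.transform(test)\n", "lr_check.groupBy('CC1_Class', 'prediction').count().show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now compute the evaluation metrics for both models."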
] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Logistic Regression: Accuracy = 0.9853395061728388\n", "Logistic Regression: Area under ROC = 0.9298321136461472\n", "Logistic Regression: Area under PR = 0.8910277315666429\n" ] } ], "source": [ "from pyspark.mllib.evaluation import BinaryClassificationMetrics\n", "from pyspark.ml.evaluation import BinaryClassificationEvaluator\n", "\n", "# rename label column\n", "test = test.withColumnRenamed('CC1_Class', 'label')\n", "\n", "# use the logistic regression model to predict test cases\n", "lr_predictions = lr_model.transform(test)\n", "\n", "# instantiate evaluator\n", "evaluator = BinaryClassificationEvaluator()\n", "\n", "# Logistic Regression performance metrics\n", "print(\"Logistic Regression: Accuracy = {}\".format(evaluator.evaluate(lr_predictions)))\n", "\n", "lr_labels_and_predictions = test.rdd.map(lambda x: float(x.label)).zip(lr_predictions.rdd.map(lambda x: x.prediction))\n", "lr_metrics = BinaryClassificationMetrics(lr_labels_and_predictions)\n", "print(\"Logistic Regression: Area under ROC = %s\" % lr_metrics.areaUnderROC)\n", "print(\"Logistic Regression: Area under PR = %s\" % lr_metrics.areaUnderPR)" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Random Forest Classifier: Accuracy = 0.9876543209876544\n", "Random Forest Classifier: Area under ROC = 0.935483870967742\n", "Random Forest Classifier: Area under PR = 0.8928571428571429\n" ] } ], "source": [ "# use the random forest model to predict test cases\n", "rf_predictions = rf_model.transform(test)\n", "\n", "# Random Forest Classifier performance metrics\n", "print(\"Random Forest Classifier: Accuracy = {}\".format(evaluator.evaluate(rf_predictions)))\n", "\n", "rf_labels_and_predictions = test.rdd.map(lambda x: float(x.label)).zip(rf_predictions.rdd.map(lambda x: x.prediction))\n", "rf_metrics = BinaryClassificationMetrics(rf_labels_and_predictions)\n", "print(\"Random Forest Classifier: Area under ROC = %s\" % rf_metrics.areaUnderROC)\n", "print(\"Random Forest Classifier: Area under PR = %s\" % rf_metrics.areaUnderPR)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Save Model\n", "Save both trained models so they can be reloaded later." ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [], "source": [ "# Save each model\n", "lr_model.write().overwrite().save(\"resources/fs_model_lr\")\n", "rf_model.write().overwrite().save(\"resources/fs_model_rf\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load and Test Model\n", "Load the saved models and test them by predicting a few test instances."
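, "\n", "Before reloading, you can optionally confirm that the model artifacts were written to disk; the next cell is a small sketch that simply lists the saved model directories." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Optional check: list the saved model artifacts on disk\n", "!ls resources/fs_model_lr resources/fs_model_rf" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now reload both models and predict a few test instances to verify the save/load round trip."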
] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Logistic Regression model save/load test:\n", "+-----+----------+\n", "|label|prediction|\n", "+-----+----------+\n", "| 1| 1.0|\n", "| 0| 0.0|\n", "| 0| 0.0|\n", "| 0| 0.0|\n", "| 0| 0.0|\n", "+-----+----------+\n", "\n", "Random Forest model save/load test:\n", "+-----+----------+\n", "|label|prediction|\n", "+-----+----------+\n", "| 1| 1.0|\n", "| 0| 0.0|\n", "| 0| 0.0|\n", "| 0| 0.0|\n", "| 0| 0.0|\n", "+-----+----------+\n", "\n" ] } ], "source": [ "from pyspark.ml.classification import LogisticRegressionModel, RandomForestClassificationModel\n", "\n", "lr_model2 = LogisticRegressionModel.load(\"resources/fs_model_lr\")\n", "print(\"Logistic Regression model save/load test:\")\n", "lr_predictions2 = lr_model2.transform(test.limit(5))\n", "lr_predictions2['label', 'prediction'].show()\n", "\n", "print(\"Random Forest model save/load test:\")\n", "rf_model2 = RandomForestClassificationModel.read().load(\"resources/fs_model_rf\")\n", "rf_predictions2 = rf_model2.transform(test.limit(5))\n", "rf_predictions2['label', 'prediction'].show()\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Takeaways and Conclusion\n", "In this notebook, we explored how Aerospike can be used as a Feature Store for ML applications. Specifically, we showed how features and datasets stored in Aerospike can be explored and reused for model training. We implemented a simple example feature store interface that leverages the Aerospike Spark Connector capabilities for this purpose. We used the APIs to create, save, and query features and datasets for model training.\n", "\n", "This is the second notebook in the series of notebooks on how Aerospike can be used as a feature store. The [first notebook](feature-store-feature-eng.ipynb) discusses Feature Engineering aspects, whereas the [third notebook](feature-store-model-serving.ipynb) explores the use of Aerospike Feature Store for Model Serving." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Cleaning Up\n", "Close the Spark session and remove the tutorial data."
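, "\n", "If you want to see what the tutorial has created before removing anything, the optional cell below lists the sets in the database using aql." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Optional: list the sets in the database before cleaning up\n", "!aql -c \"show sets\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Close the Spark session. To remove the tutorial data, uncomment the truncate command in the next cell before running it."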
] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [], "source": [ "try:\n", "    spark.stop()\n", "except:\n", "    pass  # ignore errors if the session is already stopped\n", "# To remove all data in the namespace test, uncomment the following line and run:\n", "#!aql -c \"truncate test\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Further Exploration and Resources\n", "Here are some links for further exploration.\n", "\n", "## Resources\n", "- Related notebooks\n", "  - [Feature Store with Aerospike (Part 1)](feature-store-feature-eng.ipynb)\n", "  - [Model Serving with Aerospike Feature Store (Part 3)](feature-store-model-serving.ipynb)\n", "  - [Aerospike Connect for Spark Tutorial for Python](AerospikeSparkPython.ipynb)\n", "  - [Pushdown Expressions for Spark Connector](resources/pushdown-expressions.ipynb)\n", "- Related blog posts\n", "  - [Let AI/ML workloads take off with Aerospike and Spark 3.0](https://medium.com/aerospike-developer-blog/let-ai-ml-workloads-take-off-with-aerospike-and-spark-3-0-82de2d834b99)\n", "  - [Using Aerospike Connect For Spark](https://medium.com/aerospike-developer-blog/aerospike-is-a-highly-scalable-key-value-database-offering-best-in-class-performance-5922450aaa78)\n", "- Aerospike Developer Hub\n", "  - [Developer Hub](https://developer.aerospike.com/)\n", "- Github repos\n", "  - [Spark Aerospike Example](https://github.com/aerospike-examples/spark-aerospike-example)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exploring Other Notebooks\n", "\n", "Visit [Aerospike notebooks repo](https://github.com/aerospike-examples/interactive-notebooks) to run additional Aerospike notebooks. To run a different notebook, download the notebook from the repo to your local machine, and then click on File->Open in the notebook menu, and select Upload." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.6" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": false, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": true, "toc_position": {}, "toc_section_display": false, "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 2 }