{ "cells": [ { "cell_type": "markdown", "metadata": { "toc": true }, "source": [ "
\n",
"-- count and sum reducer\n",
"local function add_values(val1, val2)\n",
" return (val1 or 0) + (val2 or 0)\n",
"end\n",
"\n",
"-- count mapper\n",
"-- note closures are used to access aggregate parameters such as bin\n",
"local function rec_to_count_closure(bin)\n",
" local function rec_to_count(rec) \n",
" -- if bin is specified: if bin exists in record return 1 else 0; if no bin is specified, return 1\n",
" return (not bin and 1) or ((rec[bin] and 1) or 0)\n",
" end\n",
" return rec_to_count\n",
"end\n",
"\n",
"-- count\n",
"function count(stream)\n",
" return stream : map(rec_to_count_closure()) : reduce(add_values)\n",
"end\n",
"\n",
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"SUM\n",
"\n",
"Examine the following Lua code that implements SUM. The pipeline consists of map and reduce operators.\n",
"- the map function \"rec_to_bin_value_closure\" is a closure for \"rec_to_bin_value\" which takes a record and returns the bin value. In this and subsequent examples, closures are used to access the aggregate parameters such as the bin in this case.\n",
"- the reduce function \"add_values\" adds the two input values and returns their sum.\n",
"\n",
"\n",
"\n",
"- mapper for various single bin aggregates\n",
"local function rec_to_bin_value_closure(bin)\n",
" local function rec_to_bin_value(rec)\n",
" -- if a numeric bin exists in record return its value; otherwise return nil\n",
" local val = rec[bin]\n",
" if (type(val) ~= \"number\") then val = nil end\n",
" return val\n",
" end\n",
" return rec_to_bin_value \n",
"end\n",
"\n",
"-- sum\n",
"function sum(stream, bin)\n",
" return stream : map(rec_to_bin_value_closure(bin)) : reduce(add_values)\n",
"end\n",
"\n",
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Register UDF\n",
"Register the UDF with the server by executing the following code cell. \n",
"\n",
"The registerUDF() function below can be run conveniently when the UDF is modified (you are encouraged to experiment with the UDF code). The function invalidates the cache, removes the currently registered module, and registers the latest version."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Registered the UDF module aggregate_fns.lua."
]
}
],
"source": [
"import com.aerospike.client.policy.Policy;\n",
"import com.aerospike.client.task.RegisterTask;\n",
"import com.aerospike.client.Language;\n",
"import com.aerospike.client.lua.LuaConfig;\n",
"import com.aerospike.client.lua.LuaCache;\n",
"\n",
"LuaConfig.SourceDirectory = \"../udf\";\n",
"String UDFFile = \"aggregate_fns.lua\";\n",
"String UDFModule = \"aggregate_fns\";\n",
"\n",
"void registerUDF() {\n",
" // clear the lua cache\n",
" LuaCache.clearPackages();\n",
" Policy policy = new Policy();\n",
" // remove the current module, if any\n",
" client.removeUdf(null, UDFFile);\n",
" RegisterTask task = client.register(policy, LuaConfig.SourceDirectory+\"/\"+UDFFile, \n",
" UDFFile, Language.LUA);\n",
" task.waitTillComplete();\n",
" System.out.format(\"Registered the UDF module %s.\", UDFFile);;\n",
"}\n",
"\n",
"registerUDF();"
]
},
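{
"cell_type": "markdown",
"metadata": {},
"source": [
"Optionally, you can verify the registration. As a hedged sketch (not part of the original flow), the \"udf-list\" info command lists the UDF modules registered on a cluster node:\n",
"\n",
"import com.aerospike.client.Info;\n",
"\n",
"// sketch: list registered UDF modules on one node;\n",
"// the output should include aggregate_fns.lua\n",
"System.out.println(Info.request(client.getNodes()[0], \"udf-list\"));\n",
""
]
},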
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Execute UDF\n",
"`SELECT COUNT(bin2) FROM test.sql-aggregate`\n",
"\n",
"`SELECT SUM(bin2) FROM test.sql-aggregate`\n",
"\n",
"Here we will execute the \"count\" and \"sum\" functions on \"bin2\" in all (1000) records in the set. The expected sum for bin2 values (1001 + 1002 + ... + 2000) is 1500500."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Executed COUNT.\n",
"Returned object: 1000\n",
"Executed SUM.\n",
"Returned object: 1500500\n"
]
}
],
"source": [
"import com.aerospike.client.query.Statement;\n",
"import com.aerospike.client.Value;\n",
"import com.aerospike.client.query.RecordSet;\n",
"import com.aerospike.client.query.ResultSet;\n",
"\n",
"// COUNT\n",
"Statement stmt = new Statement();\n",
"stmt.setNamespace(Namespace); \n",
"stmt.setSetName(Set); \n",
"stmt.setAggregateFunction(UDFModule, \"count\", Value.get(\"bin2\"));\n",
"ResultSet rs = client.queryAggregate(null, stmt);\n",
"System.out.println(\"Executed COUNT.\");\n",
"while (rs.next()) {\n",
" Object obj = rs.getObject();\n",
" System.out.format(\"Returned object: %s\\n\", obj.toString());\n",
"}\n",
"rs.close();\n",
"\n",
"// SUM\n",
"Statement stmt = new Statement();\n",
"stmt.setNamespace(Namespace); \n",
"stmt.setSetName(Set); \n",
"stmt.setAggregateFunction(UDFModule, \"sum\", Value.get(\"bin2\"));\n",
"ResultSet rs = client.queryAggregate(null, stmt);\n",
"System.out.println(\"Executed SUM.\");\n",
"while (rs.next()) {\n",
" Object obj = rs.getObject();\n",
" System.out.format(\"Returned object: %s\\n\", obj.toString());\n",
"}\n",
"rs.close();"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Implementing the WHERE Clause\n",
"`SELECT agg(col) FROM namespace.set WHERE condition`\n",
"\n",
"The WHERE clause must be implemented using either query's index predicate or UDF's stream filter. Let's implement this specific query:\n",
"\n",
"`SELECT SUM(bin2) FROM test.sql-aggregate WHERE bin1 >= 3 AND bin1 <= 7`\n",
"\n",
"Let's first use query filter and then UDF stream filter to illustrate. In both cases, the filter is 2<=bin1<=7. The expected sum (1002 + 1003 + .. + 1007) is 6027."
]
},
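{
"cell_type": "markdown",
"metadata": {},
"source": [
"The query filter below relies on a numeric secondary index on bin1, which the setup portion of this notebook creates. If you are running these cells standalone, a minimal sketch to (re)create it, assuming the IndexName variable defined in setup, looks like this:\n",
"\n",
"import com.aerospike.client.query.IndexType;\n",
"import com.aerospike.client.task.IndexTask;\n",
"\n",
"// sketch: ensure a numeric secondary index exists on bin1\n",
"IndexTask itask = client.createIndex(null, Namespace, Set, IndexName, \"bin1\", IndexType.NUMERIC);\n",
"itask.waitTillComplete();\n",
""
]
},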
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Executed SUM using the query filter.\n",
"Returned object: 6027"
]
}
],
"source": [
"import com.aerospike.client.query.Filter;\n",
"import com.aerospike.client.policy.QueryPolicy;\n",
"import com.aerospike.client.exp.Exp;\n",
"\n",
"Statement stmt = new Statement();\n",
"stmt.setNamespace(Namespace); \n",
"stmt.setSetName(Set); \n",
"// range filter using the secondary index on bin1\n",
"stmt.setFilter(Filter.range(\"bin1\", 2, 7));\n",
"stmt.setAggregateFunction(UDFModule, \"sum\", Value.get(\"bin2\"));\n",
"ResultSet rs = client.queryAggregate(null, stmt);\n",
"System.out.println(\"Executed SUM using the query filter.\");\n",
"while (rs.next()) {\n",
" Object obj = rs.getObject();\n",
" System.out.format(\"Returned object: %s\", obj.toString());\n",
"}\n",
"rs.close();"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Using Filter Operator\n",
"Now let's implement the range filter function in UDF.\n",
"\n",
"SUM_RANGE\n",
"\n",
"Examine the following Lua code that implements the SUM with a range filter. It takes sum_bin, range_bin, and range limits range_low and range_high. The pipeline consists of filter followed by map and reduce operators.\n",
"- the filter function \"range_filter\" returns true if the bin value is within the range \\[range_low, range_high\\], false otherwise.\n",
"- the map function \"rec_to_bin_value\" takes a record and returns the numeric \"bin\" value. If \"bin\" doesn't exist or is non-numeric, returns 0.\n",
"- the reduce function \"add_values\" adds the two input values and returns their sum.\n",
"\n",
"\n",
"-- range filter\n",
"local function range_filter_closure(range_bin, range_low, range_high)\n",
" local function range_filter(rec)\n",
" -- if bin value is in [low,high] return true, false otherwise\n",
" local val = rec[range_bin]\n",
" if (not val or type(val) ~= \"number\") then val = nil end\n",
" return (val and (val >= range_low and val <= range_high)) or false\n",
" end\n",
" return ranger_filter\n",
"end\n",
" \n",
"-- sum of range: sum(sum_bin) where range_bin in [range_low, range_high]\n",
"function sum_range(stream, sum_bin, range_bin, range_low, range_high)\n",
" return stream : filter(range_filter_closure(range_bin, range_low, range_high)) \n",
" : map(rec_to_bin_value_closure(sum_bin)) : reduce(add_values)\n",
"end\n",
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Execute SUM_RANGE\n",
"With the same range (2 <= bin1 <= 7), we expect the same results."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Executed SUM-RANGE using the filter operator.\n",
"Returned object: 6027"
]
}
],
"source": [
"Statement stmt = new Statement();\n",
"stmt.setNamespace(Namespace); \n",
"stmt.setSetName(Set); \n",
"stmt.setAggregateFunction(UDFModule, \"sum_range\", \n",
" Value.get(\"bin2\"), Value.get(\"bin1\"), Value.get(2), Value.get(7));\n",
"ResultSet rs = client.queryAggregate(null, stmt);\n",
"System.out.println(\"Executed SUM-RANGE using the filter operator.\");\n",
"while (rs.next()) {\n",
" Object obj = rs.getObject();\n",
" System.out.format(\"Returned object: %s\", obj.toString());\n",
"}\n",
"rs.close();"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Do Not Use Expression Filters\n",
"Note, you cannot use expression filters with queryAggregate as they are ignored. Below, all records in the set are aggregated in sum even when the expression filter 2 <= bin1 <= 7 that is specified in the policy."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Executed SUM using expression filter 2 <= bin1 <=7\n",
"Returned object: 1500500"
]
}
],
"source": [
"import com.aerospike.client.policy.QueryPolicy;\n",
"import com.aerospike.client.exp.Exp;\n",
"\n",
"Statement stmt = new Statement();\n",
"stmt.setNamespace(Namespace); \n",
"stmt.setSetName(Set); \n",
"QueryPolicy policy = new QueryPolicy(client.queryPolicyDefault);\n",
"policy.filterExp = Exp.build(\n",
" Exp.and(\n",
" Exp.ge(Exp.intBin(\"bin1\"), Exp.val(2)),\n",
" Exp.le(Exp.intBin(\"bin1\"), Exp.val(7)))); \n",
"stmt.setAggregateFunction(UDFModule, \"sum\", Value.get(\"bin2\"));\n",
"ResultSet rs = client.queryAggregate(null, stmt);\n",
"System.out.println(\"Executed SUM using expression filter 2 <= bin1 <=7\");\n",
"while (rs.next()) {\n",
" Object obj = rs.getObject();\n",
" System.out.format(\"Returned object: %s\", obj.toString());\n",
"}\n",
"rs.close();"
]
},
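{
"cell_type": "markdown",
"metadata": {},
"source": [
"For contrast, a regular record query does honor the same filter expression. The following hedged sketch (not in the original tutorial) counts the matching records on the client side; with bin1 in [2, 7] it should report 6:\n",
"\n",
"import com.aerospike.client.query.RecordSet;\n",
"\n",
"// sketch: the expression filter IS applied to a regular record query\n",
"Statement recStmt = new Statement();\n",
"recStmt.setNamespace(Namespace);\n",
"recStmt.setSetName(Set);\n",
"QueryPolicy qp = new QueryPolicy(client.queryPolicyDefault);\n",
"qp.filterExp = Exp.build(\n",
" Exp.and(\n",
" Exp.ge(Exp.intBin(\"bin1\"), Exp.val(2)),\n",
" Exp.le(Exp.intBin(\"bin1\"), Exp.val(7))));\n",
"RecordSet recs = client.query(qp, recStmt);\n",
"int matched = 0;\n",
"while (recs.next()) matched++;\n",
"recs.close();\n",
"System.out.format(\"Records matching the expression filter: %d\\n\", matched); // expect 6\n",
""
]
},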
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## More Simple Aggregates: MIN and MAX\n",
"`SELECT MIN(bin2) FROM test.sql-aggregate`\n",
"\n",
"`SELECT MAX(bin2) FROM test.sql-aggregate`\n",
"\n",
"Examine the following Lua code that implements the aggregate functions MIN and MAX. \n",
"\n",
"MIN\n",
"\n",
"The pipeline consists of a simple map and reduce.\n",
"- the map function \"rec_to_bin_value\" is as described in earlier examples. \n",
"- the reduce function returns the minimum of the input values and handles nil values appropriately.\n",
"\n",
"MAX is very similar to MIN above.\n",
"\n",
"\n",
"-- min reducer\n",
"local function get_min(val1, val2)\n",
" local min = nil\n",
" if val1 then\n",
" if val2 then\n",
" if val1 < val2 then min = val1 else min = val2 end\n",
" else min = val1 \n",
" end\n",
" else \n",
" if val2 then min = val2 end\n",
" end\n",
" return min\n",
"end\n",
"\n",
"-- min\n",
"function min(stream, bin)\n",
" return stream : map(rec_to_bin_value_closure(bin)) : reduce(get_min)\n",
"end\n",
" \n",
"-- max reducer\n",
"local function get_max(val1, val2)\n",
" local max = nil\n",
" if val1 then\n",
" if val2 then\n",
" if val1 > val2 then max = val1 else max = val2 end\n",
" else max = val1 \n",
" end\n",
" else \n",
" if val2 then max = val2 end\n",
" end\n",
" return max\n",
"end\n",
"\n",
"-- max\n",
"function max(stream, bin)\n",
" return stream : map(rec_to_bin_value_closure(bin)) : reduce(get_max)\n",
"end\n",
" \n",
""
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Executed MIN.\n",
"Returned object: 1001\n",
"Executed MAX.\n",
"Returned object: 2000"
]
}
],
"source": [
"// MIN\n",
"Statement stmt = new Statement();\n",
"stmt.setNamespace(Namespace); \n",
"stmt.setSetName(Set); \n",
"stmt.setAggregateFunction(UDFModule, \"min\", Value.get(\"bin2\"));\n",
"ResultSet rs = client.queryAggregate(null, stmt);\n",
"System.out.println(\"Executed MIN.\");\n",
"while (rs.next()) {\n",
" Object obj = rs.getObject();\n",
" System.out.format(\"Returned object: %s\\n\", obj.toString());\n",
"}\n",
"rs.close();\n",
"\n",
"// MAX\n",
"Statement stmt = new Statement();\n",
"stmt.setNamespace(Namespace); \n",
"stmt.setSetName(Set); \n",
"stmt.setAggregateFunction(UDFModule, \"max\", Value.get(\"bin2\"));\n",
"ResultSet rs = client.queryAggregate(null, stmt);\n",
"System.out.println(\"Executed MAX.\");\n",
"while (rs.next()) {\n",
" Object obj = rs.getObject();\n",
" System.out.format(\"Returned object: %s\", obj.toString());\n",
"}\n",
"rs.close();"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Using Aggregate Operator\n",
"`SELECT agg1(bin1), agg2(bin2), ... FROM namespace.set WHERE condition`\n",
"\n",
"The aggregate operator is used when you need to track a more complex state during stream processing. For example, to compute multiple aggregates in one query or to compute aggregates that need other aggregates for evaluation such as AVERAGE (SUM/COUNT) and RANGE (MAX-MIN).\n",
"\n",
"We will illustrate the aggregate operator for AVERAGE and RANGE computations of bin1 and bin2 respectively. The aggregate function will compute SUM, COUNT, MIN, and MAX of appropriate bins needed for AVERAGE and RANGE computations at the end.\n",
"\n",
"`SELECT AVERAGE(bin1), RANGE(bin2), ... FROM test.sql-aggregate`\n",
"\n",
"We will implement a new UDF \"average_range\" for this.\n",
"\n",
"Note that the reducer function entails merging two partial stream aggregates into one by adding their \"sum\" and \"count\" values (\"map merge\"). The final phase of reduce happens on the client to arrive at the final Sum and Count. The final map operator is a client-only operation that takes the aggregate (map) as input and outputs the average and range values. \n",
"\n",
"AVERAGE_RANGE\n",
"\n",
"It takes the bins whose AVERAGE and RANGE are needed. The pipeline consists of map, aggregate, reduce, and map operators.\n",
"- the map function \"rec_to_bins\" returns numeric values of bin_avg and bin_range. \n",
"- the aggregate function \"aggregate_stats\" takes the current aggregate state and two bin values and returns the new aggregate state. \n",
"- the reduce function \"merge_stats\" merges two aggregate state maps by adding corresponding (same key) elements and returns a merged map.\n",
"- the last map operator \"compute_final_stats\" takes the final value of SUM, COUNT, MIN, and MAX stats and outputs two values: AVERAGE (SUM/COUNT) and RANGE (MAX-MIN).\n",
"\n",
"\n",
"-- map function to compute average and range\n",
"local function compute_final_stats(stats)\n",
" local ret = map();\n",
" ret['AVERAGE'] = stats[\"sum\"] / stats[\"count\"]\n",
" ret['RANGE'] = stats[\"max\"] - stats[\"min\"]\n",
" return ret\n",
"end\n",
"\n",
"-- merge partial stream maps into one\n",
"local function merge_stats(a, b)\n",
" local ret = map()\n",
" ret[\"sum\"] = add_values((a[\"sum\"], b[\"sum\"])\n",
" ret[\"count\"] = add_values(a[\"count\"], b[\"count\"])\n",
" ret[\"min\"] = get_min(a[\"min\"], b[\"min\"])\n",
" ret[\"max\"] = get_max(a[\"max\"], b[\"max\"])\n",
" return ret\n",
"end\n",
"\n",
"-- aggregate operator to compute stream state for average_range\n",
"local function aggregate_stats(agg, val)\n",
" agg[\"count\"] = (agg[\"count\"] or 0) + ((val[\"bin_avg\"] and 1) or 0)\n",
" agg[\"sum\"] = (agg[\"sum\"] or 0) + (val[\"bin_avg\"] or 0)\n",
" agg[\"min\"] = get_min(agg[\"min\"], val[\"bin_range\"])\n",
" agg[\"max\"] = get_max(agg[\"max\"], val[\"bin_range\"])\n",
" return agg\n",
"end\n",
"\n",
"-- average_range\n",
"function average_range(stream, bin_avg, bin_range)\n",
" local function rec_to_bins(rec)\n",
" -- extract the values of the two bins in ret \n",
" local ret = map()\n",
" ret[\"bin_avg\"] = rec[bin_avg]\n",
" ret[\"bin_range\"] = rec[bin_range]\n",
" return ret\n",
" end\n",
" return stream : map(rec_to_bins) : aggregate(map(), aggregate_stats) : reduce(merge_stats) : map(compute_final_stats)\n",
"end\n",
"\n",
""
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Executed AVERAGE+RANGE.\n",
"Returned object: {AVERAGE=500.5, RANGE=999}"
]
}
],
"source": [
"Statement stmt = new Statement();\n",
"stmt.setNamespace(Namespace); \n",
"stmt.setSetName(Set); \n",
"stmt.setAggregateFunction(UDFModule, \"average_range\", Value.get(\"bin1\"), Value.get(\"bin2\"));\n",
"ResultSet rs = client.queryAggregate(null, stmt);\n",
"System.out.println(\"Executed AVERAGE+RANGE.\");\n",
"while (rs.next()) {\n",
" Object obj = rs.getObject();\n",
" System.out.format(\"Returned object: %s\", obj.toString());\n",
"}\n",
"rs.close();"
]
},
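{
"cell_type": "markdown",
"metadata": {},
"source": [
"The object printed above is just the toString() of the returned value. A stream UDF that returns a Lua map arrives in Java as a java.util.Map, so, as a hedged sketch reusing the stmt from the previous cell, you could extract the individual aggregates like this:\n",
"\n",
"import java.util.Map;\n",
"\n",
"// sketch: read AVERAGE and RANGE out of the returned map (reuses stmt from the cell above)\n",
"ResultSet rs2 = client.queryAggregate(null, stmt);\n",
"while (rs2.next()) {\n",
" Map<?,?> stats = (Map<?,?>) rs2.getObject();\n",
" System.out.format(\"AVERAGE=%s, RANGE=%s\\n\", stats.get(\"AVERAGE\"), stats.get(\"RANGE\"));\n",
"}\n",
"rs2.close();\n",
""
]
},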
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Takeaways and Conclusion\n",
"Many developers that are familiar with SQL would like to see how SQL operations translate to Aerospike. We looked at how to implement various aggregate statements. This should be generally useful irrespective of the reader's SQL knowledge. While the examples here use synchronous execution, many operations can also be performed asynchronously. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Clean up\n",
"Remove tutorial data and close connection."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"ExecuteTime": {
"end_time": "2020-12-29T20:49:19.972650Z",
"start_time": "2020-12-29T20:49:19.967344Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Removed tutorial data and closed server connection.\n"
]
}
],
"source": [
"client.dropIndex(null, Namespace, Set, IndexName);\n",
"client.truncate(null, Namespace, null, null);\n",
"client.close();\n",
"System.out.println(\"Removed tutorial data and closed server connection.\");"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Further Exploration and Resources\n",
"Here are some links for further exploration\n",
"\n",
"Resources\n",
"- Related notebooks\n",
" - [Queries](https://github.com/aerospike-examples/interactive-notebooks/blob/main/notebooks/python/query.ipynb)\n",
" - Other notebooks in the SQL series on 1) [SELECT](sql_select.ipynb), 2) [Aggregates (Part 2)](sql.aggregates_2.ipynb), and 2) UPDATE, CREATE, and DELETE.\n",
"- [Aerospike Presto Connector](https://www.aerospike.com/docs/connect/access/presto/index.html)\n",
"- Blog post\n",
" - [Introducing Aerospike JDBC Connector](https://medium.com/aerospike-developer-blog/introducing-aerospike-jdbc-driver-fe46d9fc3b4d)\n",
"- Aerospike Developer Hub\n",
" - [Java Developers Resources](https://developer.aerospike.com/java-developers)\n",
"- Github repos\n",
" - [Java code examples](https://github.com/aerospike/aerospike-client-java/tree/master/examples/src/com/aerospike/examples)\n",
"- Documentation\n",
" - [Developing Stream UDFs](https://www.aerospike.com/docs/udf/developing_stream_udfs.html)\n",
" - [Java Client](https://www.aerospike.com/docs/client/java/index.html)\n",
" - [Java API Reference](https://www.aerospike.com/apidocs/java/)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Next steps\n",
"\n",
"Visit [Aerospike notebooks repo](https://github.com/aerospike-examples/interactive-notebooks) to run additional Aerospike notebooks. To run a different notebook, download the notebook from the repo to your local machine, and then click on File->Open, and select Upload."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Java",
"language": "java",
"name": "java"
},
"language_info": {
"codemirror_mode": "java",
"file_extension": ".jshell",
"mimetype": "text/x-java-source",
"name": "Java",
"pygments_lexer": "java",
"version": "11.0.8+10-LTS"
},
"toc": {
"base_numbering": 1,
"nav_menu": {},
"number_sections": true,
"sideBar": true,
"skip_h1_title": false,
"title_cell": "Table of Contents",
"title_sidebar": "Contents",
"toc_cell": true,
"toc_position": {},
"toc_section_display": true,
"toc_window_display": false
}
},
"nbformat": 4,
"nbformat_minor": 4
}