{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# DataFrame\n",
    "\n",
    "Many Smile algorithms take simple `double[]` as input. But we also use the encapsulation class `DataFrame`. As shown in [Data](data.ipynb) notebook, the output of most Smile data parsers is a `DataFrame` object. DataFrames are immutable and contain a fixed number of named columns."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import $ivy.`com.github.haifengl::smile-scala:4.0.0`\n",
    "import $ivy.`org.slf4j:slf4j-simple:2.0.16`  \n",
    "\n",
    "import scala.language.existentials\n",
    "import scala.language.postfixOps\n",
    "import org.apache.commons.csv.CSVFormat\n",
    "import java.nio.file.{Files, Paths}\n",
    "import smile._\n",
    "import smile.data._\n",
    "\n",
    "def display(df: DataFrame, limit: Int = 20, truncate: Boolean = true) = {\n",
    "  import xml.Utility.escape\n",
    "  val header = df.names\n",
    "  val rows = df.toStrings(limit, truncate)\n",
    "  kernel.publish.html(\n",
    "    s\"\"\"\n",
    "      <table>\n",
    "        <tr>${header.map(h => s\"<th>${escape(h)}</th>\").mkString}</tr>\n",
    "        ${rows.map { row =>\n",
    "          s\"<tr>${row.map{c => s\"<td>${escape(c)}</td>\" }.mkString}</tr>\"\n",
    "        }.mkString}\n",
    "      </table>\n",
    "    \"\"\"\n",
    "  )\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this session, we will explore the functionality of `DataFrame` with the `iris` data. The `iris` data is from early statistical work of R.A. Fisher, who used three species of Iris flowers to develop linear discriminant analysis."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "val iris = read.arff(\"../data/weka/iris.arff\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First, let's check out the statistic summary of numeric columns in the data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "iris.summary"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can get a row with the array syntax."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "iris(0)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "When selecting a row, it returns a `Tuple`, which is an immutable finite ordered list (sequence) of elements. Moreover, we can slice a `DataFrame` into a new one. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "iris.slice(10, 20)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can refer a column by its name and it returns a vector. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "iris(\"sepallength\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Similarly, we can select a few columns to create a new data frame."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "iris.select(\"sepallength\", \"sepalwidth\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Advanced operations such as `exists`, `forall`, `find`, `filter` are also supported. The predicate of these functions expect a `Tuple`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "iris.exists(_.getDouble(0) > 4.5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this example, we test if there is any sample with `sepallength > 4.5`. Since `sepallength` is the first column, we use `getDouble(0)` to retrive the value in the predicate labmda. Note that `Tuple` allows generic access by `get()` method, which will incur boxing overhead for primitives. Therefore, `Tuple` also provides the native primitive access method `getXXX()`, where `XXX` is the type.\n",
    "\n",
    "It is invalid to use the native primitive interface to retrieve a value\n",
    "that is null, instead a user must check `isNullAt` before attempting\n",
    "to retrieve a value that might be null."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "iris.forall(_.getDouble(0) < 10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In contrast to `exists`, the function `forall` returns `true` only if all rows pass the test."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "iris.find(_(\"class\") == 1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `find` method returns the first row passes the test if it exists. Otherwise, it returns `Optional.empty`. Note that `_(\"class\")` in the example returns an object of Integer because the nominal data are stored as integers (byte, short, or int, depending on the levels of measurements). To the string representation of `class`, one can use `getString()` method."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "iris.find(_.getString(\"class\").equals(\"Iris-versicolor\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's combine what we just learn into an example of `filter`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "iris.filter { row => row.getDouble(1) > 3 && row(\"class\") != 0 }"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For data wrangling, the most important functions of `DataFrame` are `map` and `groupBy`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "iris.map { row =>\n",
    "  val x = new Array[Double](6)\n",
    "  for (i <- 0 until 4) x(i) = row.getDouble(i)\n",
    "  x(4) = x(0) * x(1)\n",
    "  x(5) = x(2) * x(3)\n",
    "  x\n",
    "}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "iris.groupBy(row => row.getString(\"class\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Besides numeric and nominal values, many other data types are also supported in `DataFrame`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "val strings = read.arff(\"../data/weka/string.arff\")\n",
    "strings.filter(_.getString(0).startsWith(\"AS\"))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "val dates = read.arff(\"../data/weka/date.arff\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Scala (2.13)",
   "language": "scala",
   "name": "scala213"
  },
  "language_info": {
   "codemirror_mode": "text/x-scala",
   "file_extension": ".sc",
   "mimetype": "text/x-scala",
   "name": "scala",
   "nbconvert_exporter": "script",
   "version": "2.13.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}