{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# DataFrame\n", "\n", "Many Smile algorithms take simple `double[]` as input. But we also use the encapsulation class `DataFrame`. As shown in [Data](data.ipynb) notebook, the output of most Smile data parsers is a `DataFrame` object. DataFrames are immutable and contain a fixed number of named columns." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import $ivy.`com.github.haifengl::smile-scala:2.6.0`\n", "import $ivy.`org.slf4j:slf4j-simple:1.7.30` \n", "\n", "import scala.language.postfixOps\n", "import org.apache.commons.csv.CSVFormat\n", "import java.nio.file.{Files, Paths}\n", "import smile._\n", "import smile.data._\n", "\n", "def display(df: DataFrame, limit: Int = 20, truncate: Boolean = true) = {\n", " import xml.Utility.escape\n", " val header = df.names\n", " val rows = df.toStrings(limit, truncate)\n", " kernel.publish.html(\n", " s\"\"\"\n", " \n", " ${header.map(h => s\"\").mkString}\n", " ${rows.map { row =>\n", " s\"${row.map{c => s\"\" }.mkString}\"\n", " }.mkString}\n", "
${escape(h)}
${escape(c)}
\n", " \"\"\"\n", " )\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this session, we will explore the functionality of `DataFrame` with the `iris` data. The `iris` data is from early statistical work of R.A. Fisher, who used three species of Iris flowers to develop linear discriminant analysis." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "val iris = read.arff(\"../data/weka/iris.arff\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, let's check out the statistic summary of numeric columns in the data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "iris.summary" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can get a row with the array syntax." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "iris(0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When selecting a row, it returns a `Tuple`, which is an immutable finite ordered list (sequence) of elements. Moreover, we can slice a `DataFrame` into a new one. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "iris.slice(10, 20)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can refer a column by its name and it returns a vector. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "iris(\"sepallength\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Similarly, we can select a few columns to create a new data frame." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "iris.select(\"sepallength\", \"sepalwidth\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Advanced operations such as `exists`, `forall`, `find`, `filter` are also supported. The predicate of these functions expect a `Tuple`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "iris.exists(_.getDouble(0) > 4.5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this example, we test if there is any sample with `sepallength > 4.5`. Since `sepallength` is the first column, we use `getDouble(0)` to retrive the value in the predicate labmda. Note that `Tuple` allows generic access by `get()` method, which will incur boxing overhead for primitives. Therefore, `Tuple` also provides the native primitive access method `getXXX()`, where `XXX` is the type.\n", "\n", "It is invalid to use the native primitive interface to retrieve a value\n", "that is null, instead a user must check `isNullAt` before attempting\n", "to retrieve a value that might be null." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "iris.forall(_.getDouble(0) < 10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In contrast to `exists`, the function `forall` returns `true` only if all rows pass the test." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "iris.find(_(\"class\") == 1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `find` method returns the first row passes the test if it exists. Otherwise, it returns `Optional.empty`. Note that `_(\"class\")` in the example returns an object of Integer because the nominal data are stored as integers (byte, short, or int, depending on the levels of measurements). To the string representation of `class`, one can use `getString()` method." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "iris.find(_.getString(\"class\").equals(\"Iris-versicolor\"))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's combine what we just learn into an example of `filter`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "iris.filter { row => row.getDouble(1) > 3 && row(\"class\") != 0 }" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For data wrangling, the most important functions of `DataFrame` are `map` and `groupBy`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "iris.map { row =>\n", " val x = new Array[Double](6)\n", " for (i <- 0 until 4) x(i) = row.getDouble(i)\n", " x(4) = x(0) * x(1)\n", " x(5) = x(2) * x(3)\n", " x\n", "}" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "iris.groupBy(row => row.getString(\"class\"))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Besides numeric and nominal values, many other data types are also supported in `DataFrame`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "val strings = read.arff(\"../data/weka/string.arff\")\n", "strings.filter(_.getString(0).startsWith(\"AS\"))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "val dates = read.arff(\"../data/weka/date.arff\")" ] } ], "metadata": { "kernelspec": { "display_name": "Scala (2.13)", "language": "scala", "name": "scala213" }, "language_info": { "codemirror_mode": "text/x-scala", "file_extension": ".scala", "mimetype": "text/x-scala", "name": "scala", "nbconvert_exporter": "script", "version": "2.13.1" } }, "nbformat": 4, "nbformat_minor": 4 }