# DataFrame

Many Smile algorithms take a plain `double[]` as input, but Smile also provides the encapsulation class `DataFrame`. As shown in the [Data](data.ipynb) notebook, most Smile data parsers return a `DataFrame` object. DataFrames are immutable and contain a fixed number of named columns.

In [None]:
import $ivy.`com.github.haifengl::smile-scala:2.6.0`
import $ivy.`org.slf4j:slf4j-simple:1.7.30`  

import scala.language.postfixOps
import org.apache.commons.csv.CSVFormat
import java.nio.file.{Files, Paths}
import smile._
import smile.data._

// Renders up to `limit` rows of a DataFrame as an HTML table in the notebook output.
def display(df: DataFrame, limit: Int = 20, truncate: Boolean = true) = {
  import xml.Utility.escape
  val header = df.names
  val rows = df.toStrings(limit, truncate)
  kernel.publish.html(
    s"""
      <table>
        <tr>${header.map(h => s"<th>${escape(h)}</th>").mkString}</tr>
        ${rows.map { row =>
          s"<tr>${row.map{c => s"<td>${escape(c)}</td>" }.mkString}</tr>"
        }.mkString}
      </table>
    """
  )
}

In this session, we will explore the functionality of `DataFrame` with the `iris` data. The `iris` data comes from the early statistical work of R.A. Fisher, who used three species of Iris flowers to develop linear discriminant analysis.

In [None]:
val iris = read.arff("../data/weka/iris.arff")

First, let's check out the statistical summary of the numeric columns in the data.

In [None]:
iris.summary

We can get a row with the array syntax.

In [None]:
iris(0)

Selecting a row returns a `Tuple`, which is an immutable, finite, ordered list (sequence) of elements. Moreover, we can slice a `DataFrame` into a new one.

In [None]:
iris.slice(10, 20)

We can refer to a column by its name, which returns a vector.

In [None]:
iris("sepallength")
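Once a column is available as a vector, it can be copied into a plain array for ordinary Scala computations. The sketch below is only an illustration: it builds a small toy frame with made-up values through `DataFrame.of` instead of reading `iris.arff`, and it assumes the Java-side `column(name)` and `toDoubleArray()` methods are available in this Smile version.

```scala
import smile.data._

// Toy frame with illustrative values (not read from iris.arff).
val raw: Array[Array[Double]] = Array(
  Array(5.1, 3.5),
  Array(4.9, 3.0),
  Array(6.3, 2.9))
val toy = DataFrame.of(raw, "sepallength", "sepalwidth")

// Copy one column into a plain array, then use regular Scala collection methods.
val lengths: Array[Double] = toy.column("sepallength").toDoubleArray()
val mean = lengths.sum / lengths.length
```

In this session the same pattern would apply to `iris("sepallength")`.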

Similarly, we can select a few columns to create a new data frame.

In [None]:
iris.select("sepallength", "sepalwidth")

Advanced operations such as `exists`, `forall`, `find`, and `filter` are also supported. The predicates of these functions take a `Tuple`.

In [None]:
iris.exists(_.getDouble(0) > 4.5)

In this example, we test whether any sample has `sepallength > 4.5`. Since `sepallength` is the first column, we use `getDouble(0)` to retrieve the value in the predicate lambda. Note that `Tuple` allows generic access through the `get()` method, which incurs boxing overhead for primitives. Therefore, `Tuple` also provides native primitive access methods `getXXX()`, where `XXX` is the type.

It is invalid to use the native primitive interface to retrieve a value
that is null. Instead, check `isNullAt` before attempting
to retrieve a value that might be null.
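The two points above, primitive access and the null check, can be sketched together. This is a minimal illustration on a toy frame with made-up values, built with `DataFrame.of`, not on the iris data itself. It uses name-based access (`getDouble("sepallength")`) as an alternative to the positional `getDouble(0)`, and guards the primitive getter with `isNullAt`.

```scala
import smile.data._

// Toy frame with illustrative values (not read from iris.arff).
val raw: Array[Array[Double]] = Array(
  Array(5.1, 3.5),
  Array(4.9, 3.0),
  Array(6.3, 2.9))
val toy = DataFrame.of(raw, "sepallength", "sepalwidth")

// Name-based primitive access: no boxing and no hard-coded column position.
val anyLong = toy.exists(_.getDouble("sepallength") > 6.0)

// Check isNullAt before calling a primitive getter on a value that may be null.
val wide = toy.filter(row => !row.isNullAt(1) && row.getDouble(1) > 2.95)
```

The toy columns here never contain nulls, so the guard is only there to show the pattern.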

In [None]:
iris.forall(_.getDouble(0) < 10)

In contrast to `exists`, the function `forall` returns `true` only if all rows pass the test.

In [None]:
iris.find(_("class") == 1)

The `find` method returns the first row that passes the test, if one exists. Otherwise, it returns `Optional.empty`. Note that `_("class")` in the example returns an `Integer` object because nominal data are stored as integers (byte, short, or int, depending on the number of levels). To get the string representation of `class`, use the `getString()` method.

In [None]:
iris.find(_.getString("class").equals("Iris-versicolor"))
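Because `find` returns a Java `Optional`, it can be convenient to convert the result into a Scala `Option` before working with it. The helper below, `toOption`, is a hypothetical name introduced for this sketch and is not part of Smile. The example uses stand-in `Optional` values so that it runs without any data file.

```scala
import java.util.Optional

// Convert a Java Optional into a Scala Option so it composes with map/getOrElse.
def toOption[T](o: Optional[T]): Option[T] =
  if (o.isPresent) Some(o.get()) else None

// Stand-in values for what `find` might return.
val hit  = toOption(Optional.of("Iris-versicolor"))
val miss = toOption(Optional.empty[String]())

// A missing row falls back to a default value and does not throw.
val label = miss.getOrElse("no match")
```

In this session, the argument would be the result of a call such as `iris.find(...)`.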

Let's combine what we just learned into an example of `filter`.

In [None]:
iris.filter { row => row.getDouble(1) > 3 && row("class") != 0 }

For data wrangling, the most important functions of `DataFrame` are `map` and `groupBy`.

In [None]:
iris.map { row =>
  val x = new Array[Double](6)
  for (i <- 0 until 4) x(i) = row.getDouble(i)
  x(4) = x(0) * x(1)
  x(5) = x(2) * x(3)
  x
}
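The result of `map` is a collection of whatever the lambda returns, here `Array[Double]` rows with two derived features. One way to continue working with it as a data frame is to wrap the rows again with `DataFrame.of`. The sketch below does this on a toy frame with made-up values. The column names `sepalarea` and `petalarea` are hypothetical, and the code assumes that the result of `map` can be converted with `toArray`.

```scala
import smile.data._

// Toy frame with illustrative values (not read from iris.arff).
val raw: Array[Array[Double]] = Array(
  Array(5.1, 3.5, 1.4, 0.2),
  Array(7.0, 3.2, 4.7, 1.4))
val toy = DataFrame.of(raw, "sepallength", "sepalwidth", "petallength", "petalwidth")

// Copy the four measurements and append two product features per row.
val features: Array[Array[Double]] = toy.map { row =>
  val x = new Array[Double](6)
  for (i <- 0 until 4) x(i) = row.getDouble(i)
  x(4) = x(0) * x(1)
  x(5) = x(2) * x(3)
  x
}.toArray

// Wrap the rows in a new DataFrame; the last two names are hypothetical.
val extended = DataFrame.of(features,
  "sepallength", "sepalwidth", "petallength", "petalwidth", "sepalarea", "petalarea")
```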

In [None]:
iris.groupBy(row => row.getString("class"))
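A common follow-up to `groupBy` is to compute something per group, such as the number of rows. The sketch below assumes that `groupBy` in the Scala API returns a map from each key to a sub-`DataFrame`. It uses a toy frame with made-up values and a derived key, since the toy data has no nominal `class` column.

```scala
import smile.data._

// Toy frame with illustrative values (not read from iris.arff).
val raw: Array[Array[Double]] = Array(
  Array(5.1, 3.5),
  Array(4.9, 3.0),
  Array(6.3, 2.9))
val toy = DataFrame.of(raw, "sepallength", "sepalwidth")

// Group rows by a derived key; each value is assumed to be a DataFrame.
val groups = toy.groupBy { row =>
  if (row.getDouble("sepallength") > 6.0) "long" else "short"
}

// Count the rows in each group.
val counts = groups.map { case (key, group) => key -> group.size() }
```

With the iris data, the key function could be `row => row.getString("class")`, as in the cell above.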

Besides numeric and nominal values, many other data types are also supported in `DataFrame`.

In [None]:
val strings = read.arff("../data/weka/string.arff")
strings.filter(_.getString(0).startsWith("AS"))

In [None]:
val dates = read.arff("../data/weka/date.arff")