# SMILE — DataFrame User Guide & Tutorial This document covers `smile.data.DataFrame`, `smile.data.Tuple`, and the supporting packages `smile.data.type`, `smile.data.measure`, and `smile.data.vector`. --- ## Table of Contents 1. [Architecture overview](#1-architecture-overview) 2. [Data types — `smile.data.type`](#2-data-types--smileddatatype) - [Primitive and nullable types](#21-primitive-and-nullable-types) - [Temporal types](#22-temporal-types) - [Other types](#23-other-types) - [StructField](#24-structfield) - [StructType](#25-structtype) 3. [Levels of measurement — `smile.data.measure`](#3-levels-of-measurement--smiledatameasure) - [NominalScale](#31-nominalscale) - [OrdinalScale](#32-ordinalscale) - [IntervalScale](#33-intervalscale) - [RatioScale](#34-ratioscale) 4. [Column vectors — `smile.data.vector`](#4-column-vectors--smiledatavector) - [Primitive vectors](#41-primitive-vectors) - [Nullable primitive vectors](#42-nullable-primitive-vectors) - [Object vectors](#43-object-vectors) - [Common ValueVector operations](#44-common-valuevector-operations) 5. [Tuple — a single row](#5-tuple--a-single-row) 6. [DataFrame](#6-dataframe) - [Creating a DataFrame](#61-creating-a-dataframe) - [Inspecting a DataFrame](#62-inspecting-a-dataframe) - [Accessing cells, rows, and columns](#63-accessing-cells-rows-and-columns) - [Row indexing with RowIndex](#64-row-indexing-with-rowindex) - [Selecting and dropping columns](#65-selecting-and-dropping-columns) - [Adding and replacing columns](#66-adding-and-replacing-columns) - [Renaming columns](#67-renaming-columns) - [Filtering rows](#68-filtering-rows) - [Sorting](#69-sorting) - [Slicing and sampling](#610-slicing-and-sampling) - [Combining DataFrames](#611-combining-dataframes) - [Missing values](#612-missing-values) - [Categorical encoding with factorize](#613-categorical-encoding-with-factorize) - [Exporting to numeric arrays and matrices](#614-exporting-to-numeric-arrays-and-matrices) - [Statistics with describe](#615-statistics-with-describe) - [Printing and display](#616-printing-and-display) - [Loading from JDBC](#617-loading-from-jdbc) 7. [CategoricalEncoder](#7-categoricalencoder) 8. [End-to-end tutorial](#8-end-to-end-tutorial) 9. [API quick reference](#9-api-quick-reference) --- ## 1. Architecture overview ``` smile.data.type │ ├── DataType (interface) – describes storage format │ ├── Primitive types – BooleanType, ByteType, ShortType, │ │ IntType, LongType, FloatType, DoubleType, │ │ CharType (each has nullable=false/true) │ ├── TemporalType – DateType, TimeType, DateTimeType │ ├── DecimalType – BigDecimal │ ├── StringType – String │ ├── ObjectType – arbitrary Object │ ├── ArrayType – primitive array columns │ └── StructType – schema (ordered list of StructFields) │ ├── StructField(name, dtype, measure?) – one column descriptor └── DataTypes – constants and factory methods smile.data.measure │ ├── Measure (interface) – adds semantic meaning to a numeric column │ ├── CategoricalMeasure – discrete labeled integer codes │ │ ├── NominalScale – unordered categories (gender, country…) │ │ └── OrdinalScale – ordered categories (rating, grade…) │ └── NumericalMeasure – continuous numeric annotations │ ├── IntervalScale – no true zero (temperature °C, year) │ └── RatioScale – true zero (price, weight, count) └── Measure.Currency / Percent – built-in ratio-scale singletons smile.data.vector │ ├── ValueVector (interface) – one typed column of a DataFrame │ ├── PrimitiveVector – non-nullable: IntVector, DoubleVector… │ ├── NullablePrimitiveVector – nullable: NullableIntVector… │ ├── StringVector – String column (always nullable-compatible) │ ├── NumberVector – boxed Number column (BigDecimal…) │ └── ObjectVector – arbitrary Object column smile.data │ ├── Tuple (interface) – one row; immutable ordered field list │ └── Row (record) – Tuple backed by a DataFrame + row index │ ├── RowIndex – optional label → ordinal index for rows │ └── DataFrame (record) – 2-D heterogeneous tabular data schema : StructType columns : List index : RowIndex? (optional row labels) ``` --- ## 2. Data types — `smile.data.type` Every column in a `DataFrame` has a `DataType` that describes how values are stored. Access the singletons via `DataTypes.*`: ```java import smile.data.type.DataTypes; import smile.data.type.DataType; ``` ### 2.1 Primitive and nullable types | Singleton | Java type | Nullable variant | |-----------|-----------|-----------------| | `DataTypes.BooleanType` | `boolean` | `DataTypes.NullableBooleanType` | | `DataTypes.ByteType` | `byte` | `DataTypes.NullableByteType` | | `DataTypes.ShortType` | `short` | `DataTypes.NullableShortType` | | `DataTypes.IntType` | `int` | `DataTypes.NullableIntType` | | `DataTypes.LongType` | `long` | `DataTypes.NullableLongType` | | `DataTypes.FloatType` | `float` | `DataTypes.NullableFloatType` | | `DataTypes.DoubleType` | `double` | `DataTypes.NullableDoubleType` | | `DataTypes.CharType` | `char` | `DataTypes.NullableCharType` | **Non-nullable** primitive types store values directly in a primitive array — no boxing, no null overhead. **Nullable** variants add a `BitSet` null-mask alongside the primitive array; reading a null cell via the primitive accessor (`getDouble`, `getInt`, …) returns `Double.NaN` / `Integer.MIN_VALUE` as the sentinel; always use `isNullAt(i)` first when the column is nullable. ```java boolean isNull = df.isNullAt(row, col); if (!isNull) { double v = df.getDouble(row, col); } ``` ### 2.2 Temporal types | Singleton | Java type | |-----------|-----------| | `DataTypes.DateType` | `java.time.LocalDate` | | `DataTypes.TimeType` | `java.time.LocalTime` | | `DataTypes.DateTimeType` | `java.time.LocalDateTime` | Temporal columns are always represented as object arrays (they are not primitives), but SMILE provides ISO-8601 parsing and formatting out of the box via `dtype.valueOf(String)` and `dtype.toString(Object)`. ### 2.3 Other types | Singleton / factory | Java type | Notes | |--------------------|-----------|-------| | `DataTypes.StringType` | `String` | Always nullable | | `DataTypes.DecimalType` | `BigDecimal` | Always nullable | | `DataTypes.ObjectType` | `Object` | Catch-all | | `DataTypes.object(Class)` | specific class | Resolves to known type if possible | | `DataTypes.IntArrayType` etc. | `int[]` etc. | Primitive array columns | ### 2.4 StructField A `StructField` is an immutable triple `(name, dtype, measure?)` that describes a single column. ```java import smile.data.type.StructField; // Plain double column StructField age = new StructField("age", DataTypes.IntType); // Nullable salary StructField salary = new StructField("salary", DataTypes.NullableDoubleType); // Categorical column with a NominalScale StructField gender = new StructField("gender", DataTypes.ByteType, new NominalScale("Male", "Female")); // Useful predicates boolean numeric = age.isNumeric(); // true for non-nominal numeric boolean nullable = salary.dtype().isNullable(); ``` `StructField` is a Java `record`; two fields are equal when name, dtype, and measure all match. ### 2.5 StructType `StructType` is the schema of a `DataFrame` or `Tuple`. It is an ordered list of `StructField`s with a fast name-to-index lookup map. ```java import smile.data.type.StructType; StructType schema = new StructType( new StructField("name", DataTypes.StringType), new StructField("age", DataTypes.IntType), new StructField("salary", DataTypes.NullableDoubleType) ); // Field access StructField f = schema.field("age"); // by name StructField f = schema.field(1); // by ordinal int j = schema.indexOf("age"); // ordinal of "age" String[] ns = schema.names(); DataType[] dt = schema.dtypes(); int len = schema.length(); ``` `StructType` is **mutable** for internal use by `DataFrame` (columns can be added/renamed in-place). Do not cache a `StructType` reference and assume it is immutable. --- ## 3. Levels of measurement — `smile.data.measure` A `Measure` is an optional annotation on a `StructField` that adds semantic meaning to a numeric column. It controls how values are rendered and how SMILE's algorithms treat the column (e.g., dummy encoding for categorical variables). ```java import smile.data.measure.*; ``` ### 3.1 NominalScale Unordered categories. Each integer code maps to a string label. ```java // From string labels (codes are 0, 1, 2, …) NominalScale gender = new NominalScale("Male", "Female"); // From an enum NominalScale color = new NominalScale(Color.class); // enum Color { Red, Green, Blue } // From explicit code→label pairs NominalScale custom = new NominalScale( new int[] {1, 3, 7}, new String[] {"Low", "Mid", "High"} ); int code = gender.valueOf("Female").intValue(); // 1 String label = gender.level(0); // "Male" int size = gender.size(); // 2 ``` When a column has a `NominalScale`, `getString(i)` returns the label, not the raw integer, and `df.factorize()` will produce columns backed by a `NominalScale`. ### 3.2 OrdinalScale Ordered categories. Levels carry an implied rank; values are kept sorted. ```java OrdinalScale rating = new OrdinalScale("Poor", "Fair", "Good", "Excellent"); // ordinals: 0=Poor, 1=Fair, 2=Good, 3=Excellent ``` The ordinal position is meaningful for comparison and sorting but arithmetic (mean, variance) is not valid on pure ordinal data. ### 3.3 IntervalScale Numeric, no true zero. Arithmetic differences are meaningful, ratios are not. Primarily used as an annotation for documentation/display. ```java IntervalScale celsius = new IntervalScale(NumberFormat.getInstance()); ``` ### 3.4 RatioScale Numeric with a true zero. All arithmetic is valid. ```java // Built-in singletons Measure price = Measure.Currency; // formats as currency Measure pct = Measure.Percent; // formats as percentage // Custom RatioScale weight = new RatioScale(NumberFormat.getInstance()); ``` **Compatibility rules** (enforced by `StructField`'s compact constructor): - `NumericalMeasure` is invalid for `Boolean`, `Char`, `String` columns. - `CategoricalMeasure` is only valid for integral (`int`, `long`, `byte`, `short`) columns. --- ## 4. Column vectors — `smile.data.vector` A `ValueVector` is a typed, indexed, one-dimensional array that forms a single column of a `DataFrame`. ```java import smile.data.vector.*; ``` ### 4.1 Primitive vectors Each primitive type has a corresponding non-nullable vector class: | Class | Backing store | Example | |-------|---------------|---------| | `IntVector` | `int[]` | `new IntVector("age", new int[]{25, 30, 35})` | | `LongVector` | `long[]` | `new LongVector("ts", new long[]{...})` | | `FloatVector` | `float[]` | `new FloatVector("score", new float[]{...})` | | `DoubleVector` | `double[]` | `new DoubleVector("salary", new double[]{...})` | | `BooleanVector` | `boolean[]` | `new BooleanVector("flag", new boolean[]{...})` | | `ByteVector` | `byte[]` | `new ByteVector("cat", new byte[]{...})` | | `ShortVector` | `short[]` | `new ShortVector("rank", new short[]{...})` | | `CharVector` | `char[]` | `new CharVector("grade", new char[]{...})` | All take an optional `StructField` as the first argument when a measure is needed: ```java StructField field = new StructField("gender", DataTypes.ByteType, new NominalScale("Male", "Female")); ByteVector gender = new ByteVector(field, new byte[]{0, 1, 0, 1}); ``` ### 4.2 Nullable primitive vectors Nullable variants store an additional `BitSet` null-mask: ```java import java.util.BitSet; double[] values = {80000.0, Double.NaN, 90000.0}; BitSet nullMask = new BitSet(3); nullMask.set(1); // index 1 is null NullableDoubleVector salary = new NullableDoubleVector( new StructField("salary", DataTypes.NullableDoubleType), values, nullMask ); salary.isNullable(); // true salary.isNullAt(1); // true salary.getDouble(0); // 80000.0 salary.getNullCount(); // 1 ``` Corresponding nullable classes: `NullableIntVector`, `NullableLongVector`, `NullableFloatVector`, `NullableDoubleVector`, `NullableBooleanVector`, `NullableByteVector`, `NullableShortVector`, `NullableCharVector`. ### 4.3 Object vectors | Class | Content | |-------|---------| | `StringVector` | `String[]` — always nullable | | `NumberVector` | `Number[]` — `BigDecimal`, boxed primitives | | `ObjectVector` | `Object[]` — `LocalDate`, `LocalDateTime`, any type | ```java StringVector names = new StringVector("name", new String[]{"Alice","Bob"}); ObjectVector dates = new ObjectVector<>( new StructField("birthday", DataTypes.DateType), new LocalDate[]{ LocalDate.of(1990,1,1), LocalDate.of(1985,6,15) } ); ``` ### 4.4 Common ValueVector operations ```java ValueVector v = df.column("age"); // Size and nullability int n = v.size(); boolean nullable = v.isNullable(); boolean hasNull = v.anyNull(); int nullCount = v.getNullCount(); boolean rowNull = v.isNullAt(2); // Typed reads int i = v.getInt(0); double d = v.getDouble(0); String s = v.getString(0); // uses measure for categorical columns Object o = v.get(0); // boxed / object value, may be null // Bulk export int[] ia = v.toIntArray(); double[] da = v.toDoubleArray(); String[] sa = v.toStringArray(); // Streaming v.intStream().sum(); v.doubleStream().average(); v.stream().filter(Objects::nonNull).count(); // Boolean filter masks boolean[] mask = v.eq(30); // element-wise == boolean[] mask = v.gt(25); // element-wise > boolean[] mask = v.isin(25, 30); // element-wise membership // Sub-selection ValueVector sub = v.get(Index.of(new int[]{0, 2, 3})); // Rename (returns new vector) ValueVector renamed = v.withName("years"); // In-place mutation v.set(0, 99); ``` --- ## 5. Tuple — a single row `Tuple` is an interface representing one row of a `DataFrame`. It is **immutable** from the user's perspective — mutating a `Row` (the concrete implementation) directly modifies the backing `DataFrame`. ### Creating a Tuple ```java import smile.data.Tuple; import smile.data.type.StructType; // Construct standalone from schema + values StructType schema = new StructType( new StructField("x", DataTypes.IntType), new StructField("y", DataTypes.DoubleType) ); Tuple t = Tuple.of(schema, new Object[]{42, 3.14}); Tuple t = Tuple.of(schema, new int[]{42}, new double[]{3.14}); ``` ### Reading fields All accessors have both ordinal (`int i`) and name (`String field`) forms: ```java Tuple row = df.get(0); // By ordinal int age = row.getInt(0); double salary = row.getDouble(2); String name = row.getString(1); // By name int age = row.getInt("age"); double salary = row.getDouble("salary"); // Generic (boxed, may return null) Object val = row.get(0); Object val = row.get("name"); // Null check — always check before primitive accessors on nullable columns boolean isNull = row.isNullAt(2); boolean isNull = row.isNullAt("salary"); boolean anyNull = row.anyNull(); ``` Available typed accessors: `getBoolean`, `getChar`, `getByte`, `getShort`, `getInt`, `getLong`, `getFloat`, `getDouble`, `getString`. ### Exporting to a double array ```java // All fields as doubles (NaN for nulls, level encoding for categoricals) double[] arr = row.toArray(); // Selective fields double[] arr = row.toArray("age", "salary"); // With intercept (bias=1) and dummy encoding double[] arr = row.toArray(true, CategoricalEncoder.DUMMY, "age", "gender", "salary"); ``` ### Schema access ```java StructType schema = row.schema(); int length = row.length(); int j = row.indexOf("salary"); ``` --- ## 6. DataFrame `DataFrame` is a Java `record` — immutable value by reference — backed by: - `StructType schema` — column descriptors - `List columns` — the actual data, one vector per column - `RowIndex index` — optional row labels (may be `null`) The `List` and `StructType` are **mutable** for the in-place operations `add()`, `set()`, `rename()`, and `fillna()`. All other operations that structurally reshape the data (e.g., `select`, `drop`, `sort`, `concat`) return a **new** `DataFrame`. ```java import smile.data.DataFrame; ``` ### 6.1 Creating a DataFrame #### From column vectors (most direct) ```java DataFrame df = new DataFrame( new StringVector ("name", new String[] {"Alice", "Bob", "Charlie"}), new IntVector ("age", new int[] {25, 30, 35}), new DoubleVector ("salary", new double[] {60000., 80000., 90000.}) ); ``` #### From a 2-D double / float / int array ```java double[][] data = {{1.0, 2.0}, {3.0, 4.0}, {5.0, 6.0}}; // Auto-named columns: V1, V2 DataFrame df = DataFrame.of(data); // Custom names DataFrame df = DataFrame.of(data, "x", "y"); DataFrame df = DataFrame.of(new int[][]{{1,2},{3,4}}, "a", "b"); DataFrame df = DataFrame.of(new float[][]{{1f},{2f}}, "v"); ``` #### From a Java Bean / POJO via reflection SMILE uses Java Beans introspection (`getXxx()` methods) to discover columns automatically. Field order in the schema follows the alphabetical order of getter names. ```java public class Person { public String getName() { return name; } public int getAge() { return age; } public Double getSalary() { return salary; } // nullable wrapper → NullableDoubleType public Gender getGender() { return gender; } // enum → ByteVector with NominalScale // … } List persons = List.of( new Person("Alice", 25, 60000., Gender.Female), new Person("Bob", 30, null, Gender.Male) ); DataFrame df = DataFrame.of(Person.class, persons); ``` **Rules for automatic type inference:** - `int` / `long` / `float` / `double` / `boolean` / `char` / `byte` / `short` → non-nullable primitive vector - Boxed `Integer`, `Double`, etc. and `Number` subclasses → nullable vector or `NumberVector` - `String` → `StringVector` - Enum → `ByteVector` (≤127 levels), `ShortVector` (≤32767), or `IntVector`; always annotated with a `NominalScale` - `LocalDate`, `LocalDateTime`, `LocalTime` → `ObjectVector` with temporal type - Anything else → `ObjectVector` #### From a schema + list/stream of Tuples ```java StructType schema = new StructType( new StructField("x", DataTypes.IntType), new StructField("y", DataTypes.DoubleType) ); // Non-empty list List rows = List.of(Tuple.of(schema, 1, 2.0), Tuple.of(schema, 3, 4.0)); DataFrame df = DataFrame.of(schema, rows); // Stream variant DataFrame df = DataFrame.of(schema, rows.stream()); // Empty list — returns a zero-row DataFrame with the correct schema DataFrame empty = DataFrame.of(schema, List.of()); ``` #### From a JDBC ResultSet ```java try (Connection conn = DriverManager.getConnection(url); Statement stmt = conn.createStatement(); ResultSet rs = stmt.executeQuery("SELECT * FROM employees")) { DataFrame df = DataFrame.of(rs); } ``` JDBC type mapping is handled automatically via `StructType.of(ResultSet)`. ### 6.2 Inspecting a DataFrame ```java // Dimensions int nrow = df.nrow(); // number of rows int ncol = df.ncol(); // number of columns int size = df.size(); // alias for nrow() int[] shape = {df.shape(0), df.shape(1)}; // {nrow, ncol} boolean empty = df.isEmpty(); // Schema StructType schema = df.schema(); String[] names = df.names(); DataType[] dtypes = df.dtypes(); Measure[] measures = df.measures(); // Print System.out.println(df); // head(10) System.out.println(df.head(5)); // first 5 rows System.out.println(df.tail(5)); // last 5 rows System.out.println(df.toString(2, 7, true)); // rows [2,7) // Summary statistics System.out.println(df.describe()); ``` `describe()` returns a new `DataFrame` with one row per column and columns: `column`, `type`, `measure`, `count` (non-null), `mode`, `mean`, `std`, `min`, `25%`, `50%`, `75%`, `max`. ### 6.3 Accessing cells, rows, and columns #### Cells ```java Object val = df.get(row, col); // boxed value, may be null int i = df.getInt(row, col); long l = df.getLong(row, col); float f = df.getFloat(row, col); double d = df.getDouble(row, col); String s = df.getString(row, col); // uses measure for categoricals String sc = df.getScale(row, col); // level name for NominalScale / OrdinalScale boolean nil = df.isNullAt(row, col); ``` #### Rows (Tuples) ```java Tuple row = df.get(0); // first row Tuple row = df.apply(0); // Scala alias // Iterate all rows for (Row row : df) { /* … */ } df.stream().forEach(row -> /* … */); List list = df.toList(); ``` #### Columns ```java ValueVector col = df.column(0); // by ordinal ValueVector col = df.column("age"); // by name ValueVector col = df.apply("age"); // Scala alias ``` #### Mutating a cell ```java df.set(row, col, value); df.update(row, col, value); // Scala alias ``` ### 6.4 Row indexing with RowIndex A `RowIndex` maps arbitrary label objects to row ordinals, allowing label-based row selection similar to Pandas `.loc`. ```java // Attach an index from an existing column (column is removed from data) DataFrame indexed = df.setIndex("name"); // Attach an index from an external array DataFrame indexed = df.setIndex(new Object[]{"r0","r1","r2","r3"}); // Look up by label Tuple row = indexed.loc("Alice"); // single row DataFrame sub = indexed.loc("Alice","Bob"); // multiple rows ``` `RowIndex` is also used internally by `join()` to perform an inner join on shared keys. ```java // RowIndex directly RowIndex index = new RowIndex(new Object[]{"a","b","c"}); int i = index.apply("b"); // 1 int i = index.getOrDefault("x"); // -1 (not found) boolean has = index.containsKey("a"); // true int n = index.size(); // 3 ``` **Constraints:** no null values, no duplicate values — both throw `IllegalArgumentException` at construction time. ### 6.5 Selecting and dropping columns All selection/drop operations return a **new** `DataFrame`. ```java // Select by ordinal indices DataFrame sub = df.select(0, 2); // Select by name DataFrame sub = df.select("name", "salary"); DataFrame sub = df.apply("name", "salary"); // Scala alias // Drop by ordinal index DataFrame sub = df.drop(1); DataFrame sub = df.drop(0, 2); // Drop by name DataFrame sub = df.drop("age"); DataFrame sub = df.drop("age", "birthday"); ``` ### 6.6 Adding and replacing columns `add()` and `set()` **mutate `this` in-place** (they modify the internal `List` and `StructType`). ```java // Add new columns — all must have the same size as the DataFrame // and names must not clash with existing columns or each other IntVector bonus = new IntVector("bonus", new int[]{5000, 8000, 12000}); df.add(bonus); // Two columns at once — both names must be distinct from each other // and from existing columns df.add(c1, c2); // Replace an existing column (or add if not present) df.set("salary", updatedSalaryVector); df.update("salary", updatedSalaryVector); // Scala alias ``` ### 6.7 Renaming columns `rename()` mutates both the `StructType` and the backing `ValueVector` in-place: ```java df.rename("age", "years"); // df.names() is now ["name", "years", "salary"] ``` ### 6.8 Filtering rows All row-selection operations return a new `DataFrame`. #### Boolean mask ```java // Build a boolean mask manually or from a column comparison boolean[] mask = df.column("age").gt(28); DataFrame sub = df.get(mask); ``` #### Index object ```java import smile.util.Index; // From explicit row indices DataFrame sub = df.get(Index.of(new int[]{0, 2, 3})); // From a boolean mask DataFrame sub = df.get(Index.of(new boolean[]{true, false, true, true})); ``` #### dropna ```java // Remove any row that has at least one null/NaN value DataFrame clean = df.dropna(); ``` ### 6.9 Sorting `sort()` returns a new `DataFrame` with all rows reordered. Null values always sort to the end regardless of direction. ```java // Ascending (default) DataFrame sorted = df.sort("age"); // Descending DataFrame sorted = df.sort("salary", false); ``` The sort is stable and works on any column type: integral, floating-point, `String`, and any `Comparable`. ### 6.10 Slicing and sampling ```java // Contiguous row range [from, to) DataFrame slice = df.slice(1, 4); // rows 1, 2, 3 DataFrame first = df.slice(0, 1); // first row only DataFrame empty = df.slice(2, 2); // zero rows (valid) // Random sample without replacement DataFrame sample = df.sample(50); // up to 50 rows (capped at nrow()) ``` `slice()` validates that `0 ≤ from ≤ to ≤ nrow()`. ### 6.11 Combining DataFrames #### merge — horizontal (column union) Combines two or more DataFrames side-by-side. All must have the same row count. Clashing column names get a `_2`, `_3`, … suffix. ```java DataFrame wide = left.merge(right); DataFrame wide = a.merge(b, c, d); ``` #### concat — vertical (row union) Stacks DataFrames on top of each other. All must have the **exact same schema**. ```java DataFrame tall = train.concat(test); DataFrame tall = a.concat(b, c); ``` If all frames have a `RowIndex`, the indices are concatenated too. #### join — inner join on RowIndex Performs an inner join using matching row-label keys. If either frame has no `RowIndex`, falls back to `merge()`. ```java DataFrame merged = left.join(right); // Rows present in both left.index and right.index are kept; // unmatched rows are dropped. ``` ### 6.12 Missing values ```java // Drop rows with any null/NaN DataFrame clean = df.dropna(); // Fill NaN/Inf in numeric columns in-place df.fillna(0.0); // replace with zero df.fillna(-1.0); // replace with sentinel ``` `fillna` operates on `DoubleVector`, `FloatVector`, `NullablePrimitiveVector`, and `NumberVector` columns; non-numeric columns are unaffected. ### 6.13 Categorical encoding with factorize `factorize()` converts `String` columns into `IntVector` columns annotated with a `NominalScale`. The integer codes are assigned in alphabetical order of the distinct string values. ```java // Convert all String columns DataFrame f = df.factorize(); // Convert specific columns DataFrame f = df.factorize("color", "country"); // Inspect the resulting scale NominalScale scale = (NominalScale) f.schema().field("color").measure(); String label = scale.level(0); // first level alphabetically int code = scale.valueOf("Red").intValue(); ``` This is the standard step to prepare string data for machine-learning algorithms that require integer inputs. ### 6.14 Exporting to numeric arrays and matrices Both `toArray()` and `toMatrix()` convert the DataFrame to a dense numeric representation with optional bias (intercept) column and categorical encoding. ```java // Default: no bias, level encoding, all columns double[][] X = df.toArray(); // Selective columns double[][] X = df.toArray("age", "salary", "gender"); // With bias + dummy encoding for categoricals double[][] X = df.toArray(true, CategoricalEncoder.DUMMY, "age", "gender", "salary"); // DenseMatrix form (suitable for linear algebra) DenseMatrix M = df.toMatrix(); DenseMatrix M = df.toMatrix(true, CategoricalEncoder.DUMMY, "rowNameColumn"); ``` `NaN` is used for null/missing values in the output array. See [§7](#7-categoricalencoder) for the `CategoricalEncoder` options. ### 6.15 Statistics with describe ```java DataFrame stats = df.describe(); System.out.println(stats); ``` Output columns: `column`, `type`, `measure`, `count`, `mode`, `mean`, `std`, `min`, `25%`, `50%`, `75%`, `max`. - Categorical columns report `mode`, `min`, `median`, `max` over the integer codes. - Floating-point columns report `mean`, `std`, `min`, quartiles, `max`. - Integral columns report all statistics. - String / object columns report only `count` (non-null) and `mode`. ### 6.16 Printing and display ```java System.out.println(df); // head(10) System.out.println(df.head(5)); System.out.println(df.tail(5)); System.out.println(df.toString(from, to, truncate)); ``` `toString(from, to, truncate)`: - `from` must be in `[0, nrow]`; `from > nrow` throws. - `to ≤ from` (or after clamping to `nrow`) returns `"Empty DataFrame\n"`. - Columns wider than `maxColWidth` are truncated with `"..."` when `truncate=true`. ### 6.17 Loading from JDBC ```java import java.sql.*; try (Connection conn = DriverManager.getConnection(jdbcUrl, user, pass); Statement stmt = conn.createStatement(); ResultSet rs = stmt.executeQuery("SELECT * FROM sales")) { DataFrame df = DataFrame.of(rs); System.out.println(df.describe()); } ``` JDBC types are mapped to SMILE types via `StructType.of(ResultSetMetaData)`. --- ## 7. CategoricalEncoder `CategoricalEncoder` controls how categorical (`NominalScale` / `OrdinalScale`) columns are converted when calling `toArray()`, `toMatrix()`, or `Tuple.toArray()`. | Enum value | Meaning | Output columns per category | |-----------|---------|----------------------------| | `LEVEL` | Integer level code (default) | 1 — raw code value | | `DUMMY` | Dummy / treatment encoding | k−1 binary columns (reference = first level) | | `ONE_HOT` | Full one-hot encoding | k binary columns | ```java import smile.data.CategoricalEncoder; // Level encoding (default) — "gender" becomes a single int column double[][] X = df.toArray("age", "gender"); // Dummy encoding — k levels → k-1 binary columns // e.g. gender {Male=0, Female=1} → one binary column "gender_Female" double[][] X = df.toArray(false, CategoricalEncoder.DUMMY, "age", "gender"); // One-hot encoding — k levels → k binary columns double[][] X = df.toArray(false, CategoricalEncoder.ONE_HOT, "age", "gender"); ``` --- ## 8. End-to-end tutorial This tutorial processes an employee dataset from a raw POJO list through to a numeric design matrix ready for a machine-learning algorithm. ### Step 1 — Define the domain object and load data ```java import java.time.LocalDate; import smile.data.DataFrame; import smile.data.measure.*; import smile.data.type.*; import smile.data.vector.*; public enum Department { Engineering, Marketing, HR } public class Employee { public String getName() { return name; } public int getAge() { return age; } public Department getDepartment() { return dept; } public LocalDate getHireDate() { return hireDate; } public Double getSalary() { return salary; } // nullable // …constructor, fields… } List employees = loadEmployees(); // from DB / file / … DataFrame df = DataFrame.of(Employee.class, employees); System.out.println(df.schema()); // age: int // department: byte nominal[Engineering, HR, Marketing] // hireDate: Date // name: String // salary: double? ``` ### Step 2 — Inspect and describe ```java System.out.println(df); System.out.println(df.describe()); // How many rows have a null salary? long nullSalaries = df.column("salary").getNullCount(); System.out.println("Missing salaries: " + nullSalaries); ``` ### Step 3 — Add a derived column ```java import smile.data.vector.IntVector; // Tenure in years = current year - hire year int[] tenure = new int[df.nrow()]; for (int i = 0; i < df.nrow(); i++) { LocalDate d = (LocalDate) df.column("hireDate").get(i); tenure[i] = LocalDate.now().getYear() - d.getYear(); } df.add(new IntVector("tenure", tenure)); ``` ### Step 4 — Handle missing values ```java // Option A: drop rows with any null DataFrame clean = df.dropna(); // Option B: fill salary nulls with median double medianSalary = df.column("salary").doubleStream() .filter(Double::isFinite).sorted() .skip(df.nrow() / 2).findFirst().orElse(0.0); df.fillna(medianSalary); ``` ### Step 5 — Sort and slice ```java // Sort by salary descending DataFrame sorted = df.sort("salary", false); // Top 10 earners DataFrame top10 = sorted.slice(0, Math.min(10, sorted.nrow())); System.out.println(top10.head(10)); ``` ### Step 6 — Select features and encode categoricals ```java // Select the columns we want for the model DataFrame features = df.select("age", "tenure", "salary", "department"); // For algorithms that need integer encoding: // "department" already has NominalScale (auto-detected from enum) // Export design matrix with dummy encoding double[][] X = features.drop("salary") .toArray(false, CategoricalEncoder.DUMMY, "age", "tenure", "department"); // Response vector double[] y = features.column("salary").toDoubleArray(); ``` ### Step 7 — Set a row index for traceability ```java DataFrame indexed = df.setIndex("name"); // Later: look up a specific employee by name Tuple alice = indexed.loc("Alice"); System.out.println("Alice's salary: " + alice.getDouble("salary")); // Join two DataFrames on employee name DataFrame reviews = loadReviews().setIndex("employee"); DataFrame combined = indexed.join(reviews); ``` ### Step 8 — Describe the final feature set ```java DataFrame finalFeatures = df.select("age", "tenure", "department", "salary"); System.out.println(finalFeatures.describe()); // Verify no nulls remain boolean anyNull = finalFeatures.stream().anyMatch(Tuple::anyNull); System.out.println("Any nulls: " + anyNull); ``` --- ## 9. API quick reference ### DataFrame static factories | Method | Description | |--------|-------------| | `new DataFrame(ValueVector...)` | Construct from column vectors | | `new DataFrame(RowIndex, ValueVector...)` | With row index | | `DataFrame.of(double[][], String...)` | From 2-D double array | | `DataFrame.of(float[][], String...)` | From 2-D float array | | `DataFrame.of(int[][], String...)` | From 2-D int array | | `DataFrame.of(Class, List)` | From POJOs via reflection | | `DataFrame.of(StructType, List)` | From tuple list (empty → zero-row frame) | | `DataFrame.of(StructType, Stream)` | From tuple stream | | `DataFrame.of(ResultSet)` | From JDBC ResultSet | ### DataFrame instance methods | Method | Returns | Mutates `this`? | Description | |--------|---------|-----------------|-------------| | `nrow()` / `size()` | `int` | no | Number of rows | | `ncol()` | `int` | no | Number of columns | | `shape(dim)` | `int` | no | Size of dimension 0 (rows) or 1 (cols) | | `isEmpty()` | `boolean` | no | True if zero rows | | `schema()` | `StructType` | no | Column schema | | `names()` | `String[]` | no | Column names | | `dtypes()` | `DataType[]` | no | Column types | | `measures()` | `Measure[]` | no | Column measures | | `column(int)` / `column(String)` | `ValueVector` | no | Column vector | | `get(int, int)` | `Object` | no | Cell (boxed) | | `getInt/Double/…(int,int)` | primitive | no | Cell (typed) | | `getString(int,int)` | `String` | no | Cell as string (uses measure) | | `isNullAt(int,int)` | `boolean` | no | Null check | | `set(int,int,Object)` | `void` | **yes** | Set cell value | | `get(int)` | `Tuple` | no | Row as Tuple | | `get(Index)` | `DataFrame` | no | Rows by Index | | `get(boolean[])` | `DataFrame` | no | Rows by boolean mask | | `slice(int,int)` | `DataFrame` | no | Rows `[from, to)` | | `sample(int)` | `DataFrame` | no | Random sample without replacement | | `sort(String)` | `DataFrame` | no | Ascending sort | | `sort(String,boolean)` | `DataFrame` | no | Sort with direction | | `select(int...)` | `DataFrame` | no | Columns by index | | `select(String...)` | `DataFrame` | no | Columns by name | | `drop(int...)` | `DataFrame` | no | Remove columns by index | | `drop(String...)` | `DataFrame` | no | Remove columns by name | | `add(ValueVector...)` | `DataFrame` | **yes** | Add new columns | | `set(String,ValueVector)` | `DataFrame` | **yes** | Replace or add column | | `rename(String,String)` | `DataFrame` | **yes** | Rename column in-place | | `merge(DataFrame...)` | `DataFrame` | no | Horizontal column union | | `concat(DataFrame...)` | `DataFrame` | no | Vertical row union | | `join(DataFrame)` | `DataFrame` | no | Inner join on RowIndex | | `setIndex(String)` | `DataFrame` | no | Column → RowIndex (removes column) | | `setIndex(Object[])` | `DataFrame` | no | Attach RowIndex array | | `loc(Object)` | `Tuple` | no | Row by label | | `loc(Object...)` | `DataFrame` | no | Rows by labels | | `dropna()` | `DataFrame` | no | Remove rows with any null | | `fillna(double)` | `DataFrame` | **yes** | Fill NaN/null in numeric columns | | `factorize(String...)` | `DataFrame` | no | Encode string columns as NominalScale | | `toArray(String...)` | `double[][]` | no | Numeric array (LEVEL encoding) | | `toArray(boolean,CategoricalEncoder,String...)` | `double[][]` | no | Numeric array with options | | `toMatrix()` | `DenseMatrix` | no | Matrix (LEVEL, no bias) | | `toMatrix(boolean,CategoricalEncoder,String)` | `DenseMatrix` | no | Matrix with options | | `describe()` | `DataFrame` | no | Summary statistics | | `head(int)` | `String` | no | Top-N rows formatted | | `tail(int)` | `String` | no | Bottom-N rows formatted | | `toString(int,int,boolean)` | `String` | no | Row range formatted | | `stream()` | `Stream` | no | Row stream | | `iterator()` | `Iterator` | no | Row iterator | | `toList()` | `List` | no | All rows as list | ### StructType | Method | Description | |--------|-------------| | `new StructType(StructField...)` | Construct from fields | | `field(int)` / `field(String)` | Get field by ordinal or name | | `indexOf(String)` | Ordinal of named field | | `length()` | Number of fields | | `names()` / `dtypes()` / `measures()` | Field property arrays | | `add(StructField)` | Append a field (mutable) | | `rename(String, String)` | Rename a field (mutable) | ### StructField | Constructor / method | Description | |---------------------|-------------| | `new StructField(name, dtype)` | Without measure | | `new StructField(name, dtype, measure)` | With measure | | `withName(String)` | Return renamed copy | | `isNumeric()` | True for non-nominal numeric fields | | `toString(Object)` | Format a value using measure or dtype | ### ValueVector (selected) | Method | Description | |--------|-------------| | `size()` | Element count | | `isNullable()` | True if vector can contain nulls | | `isNullAt(int)` | Null check at position | | `getNullCount()` | Count of null positions | | `anyNull()` | True if any null exists | | `get(int)` | Boxed value (may be null) | | `getInt/Double/…(int)` | Typed value | | `getString(int)` | String form (uses measure) | | `set(int, Object)` | Mutation | | `get(Index)` | Sub-selection | | `withName(String)` | Return renamed copy | | `toIntArray()` / `toDoubleArray()` / `toStringArray()` | Bulk export | | `intStream()` / `longStream()` / `doubleStream()` / `stream()` | Streaming | | `eq(Object)` / `ne` / `lt` / `le` / `gt` / `ge` | Element-wise comparison masks | | `isin(String...)` / `isin(int...)` | Membership mask | | `isNull()` | Per-element null mask | --- *SMILE — © 2010-2026 Haifeng Li. GNU GPL licensed.*