# SMILE — Data I/O User Guide & Tutorial

This document covers the `smile.io` package — every class and interface used to read data into and write data out of SMILE's in-memory representations (`DataFrame`, `SparseDataset`, and serializable objects).

---

## Table of Contents

1. [Architecture overview](#1-architecture-overview)
2. [Input — resolving file paths and URIs](#2-input--resolving-file-paths-and-uris)
3. [Read — the one-stop reading interface](#3-read--the-one-stop-reading-interface)
   - [Auto-dispatch by extension](#31-auto-dispatch-by-extension)
   - [CSV](#32-csv)
   - [JSON](#33-json)
   - [ARFF](#34-arff)
   - [Apache Arrow / Feather](#35-apache-arrow--feather)
   - [Apache Avro](#36-apache-avro)
   - [Apache Parquet](#37-apache-parquet)
   - [SAS7BDAT](#38-sas7bdat)
   - [libsvm sparse format](#39-libsvm-sparse-format)
   - [Java object serialization](#310-java-object-serialization)
4. [Write — the one-stop writing interface](#4-write--the-one-stop-writing-interface)
   - [CSV](#41-csv)
   - [Apache Arrow](#42-apache-arrow)
   - [ARFF](#43-arff)
   - [Java object serialization](#44-java-object-serialization)
5. [CSV in depth](#5-csv-in-depth)
   - [Schema inference](#51-schema-inference)
   - [Explicit schema](#52-explicit-schema)
   - [Format string reference](#53-format-string-reference)
   - [CSVFormat object API](#54-csvformat-object-api)
   - [Charset](#55-charset)
   - [Reading a limited number of rows](#56-reading-a-limited-number-of-rows)
   - [Writing](#57-writing)
6. [JSON in depth](#6-json-in-depth)
   - [Single-line mode](#61-single-line-mode)
   - [Multi-line mode](#62-multi-line-mode)
   - [Schema override](#63-schema-override)
7. [ARFF in depth](#7-arff-in-depth)
   - [ARFF format primer](#71-arff-format-primer)
   - [Reading](#72-reading)
   - [Writing](#73-writing)
8. [Apache Arrow in depth](#8-apache-arrow-in-depth)
9. [Apache Avro in depth](#9-apache-avro-in-depth)
10. [Apache Parquet in depth](#10-apache-parquet-in-depth)
11. [SAS7BDAT in depth](#11-sas7bdat-in-depth)
12. [libsvm sparse format in depth](#12-libsvm-sparse-format-in-depth)
13. [CacheFiles — downloading remote datasets](#13-cachefiles--downloading-remote-datasets)
14. [Paths — test data helper](#14-paths--test-data-helper)
15. [End-to-end tutorials](#15-end-to-end-tutorials)
    - [Load, clean, and save a CSV pipeline](#151-load-clean-and-save-a-csv-pipeline)
    - [Cross-format conversion](#152-cross-format-conversion)
    - [Training a model from libsvm data](#153-training-a-model-from-libsvm-data)
    - [Downloading and caching a remote dataset](#154-downloading-and-caching-a-remote-dataset)
16. [API quick reference](#16-api-quick-reference)

---

## 1. Architecture overview

```
smile.io
│
├── Read       (interface)  Static factory methods for all read operations
├── Write      (interface)  Static factory methods for all write operations
│
├── CSV        (class)      Comma-/delimiter-separated values reader & writer
├── JSON       (class)      JSON reader (single-line and multi-line)
├── Arff       (class)      Weka ARFF reader & writer (AutoCloseable)
├── Arrow      (class)      Apache Arrow IPC stream reader & writer
├── Avro       (class)      Apache Avro reader
├── Parquet    (class)      Apache Parquet reader (via Arrow Dataset API)
├── SAS        (interface)  SAS7BDAT reader (via Parso)
│
├── Input      (interface)  Resolve a String path/URI to InputStream/Reader
├── CacheFiles (interface)  Download remote files to a local cache directory
└── Paths      (interface)  Locate test-data resources on the classpath
```

`Read` and `Write` are the recommended entry points for most use cases. The concrete classes (`CSV`, `JSON`, `Arff`, …) are used directly only when you need fine-grained control — custom charset, explicit schema, or row limit.

---

## 2. Input — resolving file paths and URIs

`Input` is a low-level helper used internally by every reader.
You can also use it directly to get a `BufferedReader` or `InputStream` for any location:

```java
import smile.io.Input;

// Local file path (absolute or relative)
InputStream s1 = Input.stream("/data/iris.csv");
InputStream s2 = Input.stream("data/iris.csv");

// Windows drive-letter path — treated as a local file
InputStream s3 = Input.stream("C:/data/iris.csv");

// file:// URI
InputStream s4 = Input.stream("file:///data/iris.csv");

// HTTP / FTP — streams the remote content directly
InputStream s5 = Input.stream("https://example.com/iris.csv");

// Buffered reader with explicit charset
BufferedReader r = Input.reader("data/iris.csv", StandardCharsets.ISO_8859_1);
```

**Resolution rules:**

| Input string | Resolved as |
|---|---|
| Starts with `file://` | Local path extracted from the URI |
| Scheme is one character (e.g. `C:`) | Windows drive letter — treated as local path |
| No scheme | Local path via `Path.of(path)` |
| `http://`, `https://`, `ftp://` | Remote URL — opened with `URI.toURL().openStream()` |

---

## 3. Read — the one-stop reading interface

`Read` is a static-method interface; you never instantiate it.

```java
import smile.io.Read;
```

### 3.1 Auto-dispatch by extension

`Read.data(path)` examines the **last path segment's** file extension and delegates to the appropriate reader automatically. A query string or fragment in the path is stripped before the extension is extracted, so URIs like `s3://bucket/iris.csv?version=3` are handled correctly.
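The stripping-and-dispatch behaviour just described can be sketched in a few lines of plain Java. This is only an illustration of the rule, not SMILE's actual implementation; `ExtensionSketch` and its `extension` helper are hypothetical names:

```java
// Hypothetical sketch of extension-based dispatch: strip any query string or
// fragment, take the last path segment, and return its lowercase extension.
public class ExtensionSketch {

    static String extension(String path) {
        // Drop the query string and fragment before looking at the file name
        int q = path.indexOf('?');
        if (q >= 0) path = path.substring(0, q);
        int h = path.indexOf('#');
        if (h >= 0) path = path.substring(0, h);
        // Only the last path segment matters
        int slash = path.lastIndexOf('/');
        String name = slash >= 0 ? path.substring(slash + 1) : path;
        int dot = name.lastIndexOf('.');
        return dot >= 0 ? name.substring(dot + 1).toLowerCase() : "";
    }

    public static void main(String[] args) {
        System.out.println(extension("s3://bucket/iris.csv?version=3")); // csv
        System.out.println(extension("file:///data/users.parquet"));     // parquet
        System.out.println(extension("airline.sas7bdat"));               // sas7bdat
    }
}
```

The extracted string would then be matched against the extension table below to pick a reader.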
```java
DataFrame df = Read.data("iris.csv");                     // CSV
DataFrame df = Read.data("weather.arff");                 // ARFF
DataFrame df = Read.data("users.json");                   // JSON (single-line)
DataFrame df = Read.data("airline.sas7bdat");             // SAS
DataFrame df = Read.data("userdata.avro", "schema.avsc"); // Avro + schema path
DataFrame df = Read.data("file:///data/users.parquet");   // Parquet
DataFrame df = Read.data("events.feather");               // Arrow/Feather
```

**Extension → reader mapping:**

| Extension(s) | Reader |
|---|---|
| `csv`, `txt`, `dat` | `Read.csv` |
| `arff` | `Read.arff` |
| `json` | `Read.json` |
| `sas7bdat` | `Read.sas` |
| `avro` | `Read.avro` (format = schema file path) |
| `parquet` | `Read.parquet` |
| `feather` | `Read.arrow` |

The optional `format` parameter is passed through to the underlying reader:

```java
// CSV: comma-separated key=value format options
DataFrame df = Read.data("data.csv", "header=true,delimiter=\\t,comment=#");

// CSV: explicit "csv" keyword overrides unrecognised extensions
DataFrame df = Read.data("data.dat", "csv");
DataFrame df = Read.data("data.txt", "csv,header=true");

// JSON: mode string
DataFrame df = Read.data("records.json", "MULTI_LINE");

// Avro: path to the .avsc schema file
DataFrame df = Read.data("records.avro", "schema/user.avsc");
```

### 3.2 CSV

```java
// Simplest – comma-delimited, no header, schema inferred from first 1000 rows
DataFrame df = Read.csv("iris.csv");

// With format string
DataFrame df = Read.csv("prostate.csv", "header=true,delimiter=\\t");

// With explicit CSVFormat object
CSVFormat fmt = CSVFormat.Builder.create()
    .setDelimiter('\t')
    .setHeader()
    .setSkipHeaderRecord(true)
    .get();
DataFrame df = Read.csv("prostate.csv", fmt);

// With explicit CSVFormat + schema
StructType schema = new StructType(
    new StructField("lcavol", DataTypes.DoubleType),
    new StructField("age", DataTypes.IntType));
DataFrame df = Read.csv("prostate.csv", fmt, schema);

// From a java.nio.file.Path (no URISyntaxException)
DataFrame df = Read.csv(Path.of("/data/iris.csv"));
DataFrame df = Read.csv(Path.of("/data/iris.csv"), fmt);
DataFrame df = Read.csv(Path.of("/data/iris.csv"), fmt, schema);
```

### 3.3 JSON

```java
// Single-line mode: one JSON object per line (default)
DataFrame df = Read.json("books.json");

// Multi-line mode: entire file is a JSON array
DataFrame df = Read.json("books.json", JSON.Mode.MULTI_LINE, null);

// From Path
DataFrame df = Read.json(Path.of("books.json"));
DataFrame df = Read.json(Path.of("books.json"), JSON.Mode.MULTI_LINE, null);
```

### 3.4 ARFF

```java
// String path or URI
DataFrame df = Read.arff("weather.arff");

// java.nio.file.Path
DataFrame df = Read.arff(Path.of("weather.arff"));
```

### 3.5 Apache Arrow / Feather

```java
// String path or URI
DataFrame df = Read.arrow("events.feather");

// java.nio.file.Path
DataFrame df = Read.arrow(Path.of("events.feather"));
```

### 3.6 Apache Avro

Avro requires a separate schema (`.avsc`) file or `InputStream`:

```java
// Schema as a file path string
DataFrame df = Read.avro("users.avro", "schema/user.avsc");

// Schema as an InputStream
InputStream schemaStream = getClass().getResourceAsStream("/user.avsc");
DataFrame df = Read.avro("users.avro", schemaStream);

// From java.nio.file.Path
DataFrame df = Read.avro(Path.of("users.avro"), Path.of("schema/user.avsc"));
DataFrame df = Read.avro(Path.of("users.avro"), schemaStream);
```

### 3.7 Apache Parquet

Parquet is read via the **Apache Arrow Dataset API** and requires a `file://` URI on Windows (SMILE adds the leading `/` automatically):

```java
// From java.nio.file.Path (recommended — SMILE handles URI conversion)
DataFrame df = Read.parquet(Path.of("/data/users.parquet"));

// From a URI string (add leading slash on Windows if needed)
DataFrame df = Read.parquet("file:///data/users.parquet");
```

### 3.8 SAS7BDAT

```java
// String path or URI
DataFrame df = Read.sas("airline.sas7bdat");

// java.nio.file.Path
DataFrame df = Read.sas(Path.of("airline.sas7bdat"));
```
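The `key=value` format strings accepted by `Read.data` and `Read.csv` earlier in this section can be modelled as a small comma-separated option parser. The sketch below is an illustration of the option syntax only — `FormatStringSketch` is a hypothetical class, not part of `smile.io`, and SMILE's real parsing may differ in detail:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: parse "header=true,delimiter=\t,comment=#" into a map.
// Bare keywords such as "csv" are kept as keys with an empty value.
public class FormatStringSketch {

    static Map<String, String> parse(String format) {
        Map<String, String> options = new HashMap<>();
        for (String token : format.split(",")) {
            int eq = token.indexOf('=');
            String key = eq >= 0 ? token.substring(0, eq) : token;
            String value = eq >= 0 ? token.substring(eq + 1) : "";
            // The literal sequence \t in an option string stands for a tab delimiter
            options.put(key.trim(), value.replace("\\t", "\t"));
        }
        return options;
    }

    public static void main(String[] args) {
        Map<String, String> opts = parse("header=true,delimiter=\\t,comment=#");
        System.out.println(opts.get("header"));                  // true
        System.out.println("\t".equals(opts.get("delimiter")));  // true
        System.out.println(parse("csv,header=true").containsKey("csv")); // true
    }
}
```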
### 3.9 libsvm sparse format

`Read.libsvm` returns a `SparseDataset` (not a `DataFrame`):

```java
import smile.data.SparseDataset;

// String path or URI
SparseDataset train = Read.libsvm("news20.dat");

// java.nio.file.Path
SparseDataset test = Read.libsvm(Path.of("news20.t.dat"));

// From a BufferedReader
SparseDataset ds = Read.libsvm(Files.newBufferedReader(path));

// Access samples
int label = train.get(0).y();          // integer class label
double v = train.get(0).x().get(196);  // feature 196 value (0-based index)
int ncol = train.ncol();               // number of features
int nnz = train.nz();                  // total non-zero entries
```

**libsvm format:**

```