# SMILE — Model Validation

The `smile.validation` package provides everything needed to estimate how well a model generalizes to unseen data. It is built around three orthogonal concerns:

1. **Data splitting**: `Bag`, `Bootstrap`, `CrossValidation`, `LOOCV`
2. **Model evaluation**: `ClassificationValidation`, `RegressionValidation` and their aggregating counterparts `ClassificationValidations`, `RegressionValidations`
3. **Model selection**: `ModelSelection` (AIC / BIC)

All types are serializable records or static-method-only interfaces, so they carry no mutable state and compose freely.

---

## Table of Contents

1. [Concepts](#concepts)
2. [Data Splitting](#data-splitting)
   - [Holdout (`Bag.split`)](#holdout-bagsplit)
   - [Stratified Holdout (`Bag.stratify`)](#stratified-holdout-bagstratify)
   - [Bootstrap](#bootstrap)
   - [K-Fold Cross-Validation](#k-fold-cross-validation)
   - [Stratified K-Fold](#stratified-k-fold)
   - [Group (Non-Overlapping) K-Fold](#group-non-overlapping-k-fold)
   - [Leave-One-Out CV (LOOCV)](#leave-one-out-cv-loocv)
3. [Classification Validation](#classification-validation)
   - [Single Split](#single-split)
   - [Multiple Splits](#multiple-splits)
   - [Understanding `ClassificationMetrics`](#understanding-classificationmetrics)
4. [Regression Validation](#regression-validation)
   - [Single Split](#single-split-1)
   - [Multiple Splits](#multiple-splits-1)
   - [Understanding `RegressionMetrics`](#understanding-regressionmetrics)
5. [Model Selection (AIC / BIC)](#model-selection-aic--bic)
6. [Workflows](#workflows)
   - [Quick Holdout Smoke Test](#quick-holdout-smoke-test)
   - [10-Fold CV with Aggregation](#10-fold-cv-with-aggregation)
   - [Repeated CV](#repeated-cv)
   - [Stratified Bootstrap](#stratified-bootstrap)
   - [Group K-Fold for Time-Series-Style Data](#group-k-fold-for-time-series-style-data)
   - [LOOCV for Small Datasets](#loocv-for-small-datasets)
   - [Comparing Models with AIC/BIC](#comparing-models-with-aicbic)
7. [Quick API Reference](#quick-api-reference)
8. [Common Pitfalls](#common-pitfalls)

---

## Concepts

### The `Bag` record

Every splitting strategy returns one or more `Bag` objects.

```java
public record Bag(int[] samples, int[] oob)
```

| Field | Meaning |
|---|---|
| `samples()` | Training indices into the original dataset |
| `oob()` | Held-out (out-of-bag / test) indices |

Both arrays hold positions in the **original** dataset; the splitters never copy the data itself.

### Hard vs. Soft classifiers

The validation layer distinguishes two classifier flavours:

- **Hard** (`Classifier.isSoft() == false`): predicts a single class label. Metrics that require probability estimates (`AUC`, `LogLoss`, cross-entropy) are reported as `Double.NaN`.
- **Soft** (`Classifier.isSoft() == true`): also provides posterior probabilities. All metrics are computed and reported.

---

## Data Splitting

### Holdout (`Bag.split`)

A single random train / test split. The test proportion is set with `holdout` ∈ (0, 1).

```java
// 80% train, 20% test on 1000 raw samples
Bag bag = Bag.split(1000, 0.2);
int[] trainIdx = bag.samples();
int[] testIdx = bag.oob();
```

For `DataFrame` inputs a convenience overload returns a typed pair:

```java
var iris = new Iris();
Tuple2<DataFrame, DataFrame> split = Bag.split(iris.data(), 0.2);
DataFrame train = split._1;
DataFrame test = split._2;
```

Both `n` and `holdout` are validated; `holdout` must be strictly between 0 and 1.

### Stratified Holdout (`Bag.stratify`)

Ensures that the class distribution in each split mirrors the full dataset, which is essential when classes are imbalanced.

```java
// Stratified 70/30 split for a DataFrame, using "species" as the class column
Tuple2<DataFrame, DataFrame> split = Bag.stratify(iris.data(), "species", 0.3);
```

The low-level `int[]` overload is package-private and used internally by validation runners.

### Bootstrap

Bootstrap sampling draws `n` samples **with replacement** from `n` originals, so roughly 63.2% of originals appear in the training set and ~36.8% appear only in the out-of-bag test set.
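
The 63.2% / 36.8% figures follow from a short calculation: a given sample is missed by all `n` draws with probability `(1 − 1/n)^n`, which approaches `1/e ≈ 0.368` as `n` grows. The sketch below checks this empirically using only `java.util.Random`. It is not how `Bootstrap` is implemented, and the names `BootstrapOobDemo` and `oobFraction` are made up for illustration.

```java
import java.util.Random;

final class BootstrapOobDemo {
    /** Draws n indices with replacement and returns the fraction of originals never drawn. */
    static double oobFraction(int n, long seed) {
        Random rng = new Random(seed);
        boolean[] drawn = new boolean[n];
        for (int i = 0; i < n; i++) {
            drawn[rng.nextInt(n)] = true;   // mark the sampled index as in-bag
        }
        int oob = 0;
        for (boolean d : drawn) {
            if (!d) oob++;                  // never sampled, so out-of-bag
        }
        return (double) oob / n;
    }

    public static void main(String[] args) {
        double observed = oobFraction(100_000, 42L);
        System.out.printf("observed OOB fraction: %.4f, 1/e: %.4f%n",
                observed, Math.exp(-1.0));
    }
}
```

For large `n` the observed fraction should land close to `1/e`; for very small `n` it can deviate noticeably.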

```java
// 100 rounds of plain bootstrap for 500 samples
Bag[] bags = Bootstrap.of(500, 100);
```

**Stratified bootstrap** preserves class proportions in each bag:

```java
int[] labels = ...; // class label per sample
Bag[] bags = Bootstrap.of(labels, 100);
```

Bootstrap runners for classifiers and regressors train and evaluate automatically:

```java
var result = Bootstrap.classification(100, iris.formula(), iris.data(), DecisionTree::fit);
System.out.println("Accuracy: " + result.avg().accuracy() + " ± " + result.std().accuracy());
```

### K-Fold Cross-Validation

Partitions the data into `k` folds of (nearly) equal size; each fold serves as the test set exactly once while the remaining `k−1` folds are used for training.

```java
// 5-fold CV splits for 500 samples
Bag[] folds = CrossValidation.of(500, 5);
```

`k` must satisfy `1 ≤ k ≤ n`. The last fold absorbs any remainder when `n` is not divisible by `k`.

### Stratified K-Fold

Guarantees that each fold preserves the original class proportions:

```java
int[] labels = ...; // one per sample
Bag[] folds = CrossValidation.stratify(labels, 5);
```

A warning is logged (SLF4J) if any class has fewer examples than `k`, which would produce degenerate folds.

### Group (Non-Overlapping) K-Fold

Used when samples belong to groups (e.g. subject IDs, document IDs, time windows) and leaking information across groups would inflate results. Each group appears entirely in either the training set or the test set for any given fold.

```java
// group[i] is the group identifier for sample i
int[] group = {0, 0, 1, 1, 1, 2, 2, 3, 3, 3};
Bag[] folds = CrossValidation.nonoverlap(group, 3);
```

Groups are balanced across folds greedily by size. `k` must not exceed the number of distinct groups.

### Leave-One-Out CV (LOOCV)

In LOOCV every sample serves as the test set exactly once, which makes it the most data-efficient strategy but also the most computationally expensive.

```java
// Raw index splits: train[i] contains all indices except i
int[][] trainSets = LOOCV.of(100);
// trainSets[i].length == 99 for every i
```

Full classification and regression training loops are also available and return the same `ClassificationMetrics` / `RegressionMetrics` records as the other strategies.

---

## Classification Validation

### Single Split

Train on an explicit train/test pair and get back a `ClassificationValidation` record containing the model, the truth labels, predictions, optional posteriors, the confusion matrix, and the computed metrics:

```java
// Array-based trainer
ClassificationValidation result =
    ClassificationValidation.of(trainX, trainY, testX, testY, DecisionTree::fit);
System.out.println(result.metrics().accuracy());
System.out.println(result.confusion());
```

With a `Formula` and `DataFrame` the API is symmetric:

```java
var usps = new USPS();
ClassificationValidation result =
    ClassificationValidation.of(usps.formula(), usps.train(), usps.test(), DecisionTree::fit);
System.out.println(result);
```

### Multiple Splits

Pass a `Bag[]` to train and evaluate over many folds and receive a `ClassificationValidations` that aggregates per-fold results:

```java
Bag[] folds = CrossValidation.of(x.length, 10);
ClassificationValidations cv =
    ClassificationValidation.of(folds, x, y, DecisionTree::fit);
ClassificationMetrics avg = cv.avg();
ClassificationMetrics std = cv.std();
System.out.printf("Accuracy: %.2f%% ± %.2f%n", 100 * avg.accuracy(), 100 * std.accuracy());
```

The `std` metrics represent the standard deviation across folds. With a single fold, `std` is `0.0` everywhere (instead of throwing an exception).
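
To make the aggregation concrete, the following sketch computes a mean and standard deviation over per-fold accuracies in plain Java. It assumes the sample standard deviation (dividing by `k − 1`) and returns `0.0` for a single fold to mirror the behaviour described above. Whether the library divides by `k` or `k − 1` is not stated here, and `FoldStats` and the accuracy values are hypothetical.

```java
final class FoldStats {
    /** Arithmetic mean of the per-fold values. */
    static double mean(double[] v) {
        double sum = 0.0;
        for (double x : v) sum += x;
        return sum / v.length;
    }

    /** Sample standard deviation (k - 1 denominator, an assumption); 0.0 for fewer than two folds. */
    static double std(double[] v) {
        if (v.length < 2) return 0.0;
        double m = mean(v);
        double ss = 0.0;
        for (double x : v) ss += (x - m) * (x - m);
        return Math.sqrt(ss / (v.length - 1));
    }

    public static void main(String[] args) {
        double[] accuracy = {0.90, 0.94, 0.92, 0.96, 0.88}; // made-up fold accuracies
        System.out.printf("Accuracy: %.2f%% ± %.2f%n",
                100 * mean(accuracy), 100 * std(accuracy));
    }
}
```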

Bootstrap and cross-validation runners follow the same pattern:

```java
// Bootstrap
var bs = Bootstrap.classification(100, formula, data, DecisionTree::fit);
System.out.println(bs.avg().accuracy());

// Stratified CV
var scv = CrossValidation.classification(5, formula, data, DecisionTree::fit);

// Repeated CV (3 repetitions × 5 folds = 15 training runs)
var rcv = CrossValidation.classification(3, 5, formula, data, DecisionTree::fit);
```

### Understanding `ClassificationMetrics`

```java
public record ClassificationMetrics(
    double fitTime,       // ms to train
    double scoreTime,     // ms to score the test set
    int size,             // number of test samples
    int error,            // number of misclassified samples
    double accuracy,      // correct / total
    double sensitivity,   // TP / (TP + FN); binary only, else NaN
    double specificity,   // TN / (TN + FP); binary only, else NaN
    double precision,     // TP / (TP + FP); binary only, else NaN
    double f1,            // 2·P·R / (P+R); binary only, else NaN
    double mcc,           // Matthews Correlation Coefficient; binary only, else NaN
    double auc,           // Area Under ROC; soft binary, else NaN
    double logloss,       // -log(p_correct); soft binary, else NaN
    double crossEntropy   // mean cross-entropy; soft multiclass, else NaN
)
```

Which fields are populated depends on the classifier and data:

| Scenario | Populated |
|---|---|
| Hard binary | accuracy, error, sensitivity, specificity, precision, F1, MCC |
| Soft binary | all of the above, plus AUC, log loss, cross-entropy |
| Hard multiclass | accuracy, error |
| Soft multiclass | accuracy, error, cross-entropy |

`Double.NaN` is used for metrics that are not meaningful in the current scenario. Always guard display code with `!Double.isNaN(m.auc())` before printing probability-based metrics.
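
One way to apply that guard consistently is a small formatting helper that substitutes a placeholder for `NaN`. The helper below is a sketch, not part of `smile.validation`; the name `MetricFormat` and the `n/a` placeholder are assumptions.

```java
import java.util.Locale;

final class MetricFormat {
    /** Formats a metric to four decimals, or prints "n/a" when the value is NaN. */
    static String format(String name, double value) {
        if (Double.isNaN(value)) {
            return name + ": n/a";
        }
        return String.format(Locale.ROOT, "%s: %.4f", name, value);
    }

    public static void main(String[] args) {
        // Stand-ins for m.accuracy() and m.auc() of a hard classifier
        System.out.println(format("Accuracy", 0.9533));
        System.out.println(format("AUC", Double.NaN));
    }
}
```

Routing every printed metric through one helper keeps reports readable when the same code runs against both hard and soft classifiers.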

---

## Regression Validation

### Single Split

```java
RegressionValidation result =
    RegressionValidation.of(abalone.formula(), abalone.train(), abalone.test(), RegressionTree::fit);
System.out.println(result);
// Prints: RSS, MSE, RMSE, MAD, R²
```

### Multiple Splits

```java
Bag[] folds = CrossValidation.of(x.length, 10);
RegressionValidations cv =
    RegressionValidation.of(folds, x, y, RegressionTree::fit);
System.out.printf("RMSE: %.4f ± %.4f%n", cv.avg().rmse(), cv.std().rmse());
```

Bootstrap and LOOCV variants are also available:

```java
var bs = Bootstrap.regression(100, formula, data, RegressionTree::fit);
```

### Understanding `RegressionMetrics`

```java
public record RegressionMetrics(
    double fitTime,    // ms to train
    double scoreTime,  // ms to score
    int size,          // test set size
    double rss,        // Residual Sum of Squares
    double mse,        // Mean Squared Error
    double rmse,       // Root Mean Squared Error
    double mad,        // Mean Absolute Error (MAE)
    double r2          // Coefficient of Determination
)
```

All regression metrics are always populated; there is no hard/soft distinction.

---

## Model Selection (AIC / BIC)

`ModelSelection` provides two static criteria for comparing models fit to the **same** dataset. Both penalize model complexity to prevent overfitting:

| Criterion | Formula | Penalty |
|---|---|---|
| AIC (Akaike) | `2k − 2 log L` | `2k` |
| BIC (Bayesian) | `k log n − 2 log L` | `k log n` |

Here `L` is the maximised likelihood, `k` is the number of free parameters, and `n` is the sample size (BIC only). **Lower is better** for both AIC and BIC.

```java
double logL1 = -120.0; // log-likelihood of model 1
double logL2 = -125.0; // log-likelihood of model 2 (simpler, fewer params)
int k1 = 10, k2 = 4, n = 500;

double aic1 = ModelSelection.AIC(logL1, k1); // 2*10 - 2*(-120) = 260
double aic2 = ModelSelection.AIC(logL2, k2); // 2*4  - 2*(-125) = 258
System.out.println(aic1 < aic2 ?
    "Model 1 preferred by AIC" : "Model 2 preferred by AIC");

double bic1 = ModelSelection.BIC(logL1, k1, n); // 10*ln(500) + 240 ≈ 302.15
double bic2 = ModelSelection.BIC(logL2, k2, n); // 4*ln(500)  + 250 ≈ 274.86
System.out.println(bic1 < bic2 ?
    "Model 1 preferred by BIC" : "Model 2 preferred by BIC");
```

**When to use which:**

- **AIC** favours predictive accuracy; it is more suitable when the goal is to select a model that predicts well, even if it is slightly over-parameterised.
- **BIC** is consistent: it selects the true model as `n → ∞` if the true model is among the candidates. It is more conservative and tends to prefer smaller models. The `log n` factor means that BIC penalizes complexity more than AIC whenever `n > e² ≈ 7.4`.

---

## Workflows

### Quick Holdout Smoke Test

Use a holdout split when you want the fastest possible sanity check before committing to a full CV run:

```java
var iris = new Iris();
Tuple2<DataFrame, DataFrame> split = Bag.split(iris.data(), 0.2);
var result = ClassificationValidation.of(
    iris.formula(), split._1, split._2, DecisionTree::fit);
System.out.println(result);
```

### 10-Fold CV with Aggregation

The idiomatic workflow for a thorough, low-variance estimate:

```java
var iris = new Iris();
var cv = CrossValidation.classification(10, iris.formula(), iris.data(), DecisionTree::fit);
System.out.printf("Accuracy: %.2f%% ± %.2f%n",
    100 * cv.avg().accuracy(), 100 * cv.std().accuracy());
```

The `std()` aggregate is the fold-to-fold standard deviation of each metric, so you can report how much the estimate varies across folds alongside the average.

### Repeated CV

Repeated CV runs standard k-fold multiple times with different random permutations, giving a more stable estimate at the cost of `round × k` training runs:

```java
// 5 repetitions of 5-fold CV = 25 training runs
var rcv = CrossValidation.classification(5, 5, iris.formula(), iris.data(), DecisionTree::fit);
System.out.printf("Accuracy: %.2f%% ± %.2f%n",
    100 * rcv.avg().accuracy(), 100 * rcv.std().accuracy());
```

### Stratified Bootstrap

Bootstrap is often used for small datasets: every round trains on `n` resampled points, and the out-of-bag test set varies in size from round to round (unlike fixed-fold CV). The stratified variant is recommended whenever classes are imbalanced:

```java
// Class labels; Bootstrap.of(y, 100) yields the stratified bags if you need them directly
int[] y = formula.y(data).toIntArray();
var bs = Bootstrap.classification(100, formula, data, DecisionTree::fit);
System.out.printf("Accuracy: %.2f%% ± %.2f%n",
    100 * bs.avg().accuracy(), 100 * bs.std().accuracy());
```

### Group K-Fold for Time-Series-Style Data

When samples are grouped (e.g. multiple measurements per patient, or overlapping time windows), standard CV leaks information between folds. Use group k-fold:

```java
// subjectId[i] == the subject/group to which sample i belongs
int[] subjectId = ...;
Bag[] folds = CrossValidation.nonoverlap(subjectId, 5);
var cv = ClassificationValidation.of(folds, x, y, SVM::fit);
System.out.println(cv.avg());
```

### LOOCV for Small Datasets

LOOCV is nearly unbiased and uses all but one sample for training in each round, which makes it a good choice when data is scarce:

```java
// Array-based
var metrics = LOOCV.classification(x, y, LogisticRegression::fit);
System.out.printf("Accuracy: %.2f%%%n", 100 * metrics.accuracy());

// Formula / DataFrame-based
var metrics2 = LOOCV.classification(formula, data, DecisionTree::fit);
```

Prefer `CrossValidation.stratify` for datasets larger than ~200 samples, since LOOCV requires `n` training runs, one per sample.
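
To see why the cost grows with `n`, it helps to look at what the leave-one-out splits contain. The sketch below builds the `n` training index arrays in plain Java, following the description of `LOOCV.of` earlier in this document (row `i` holds every index except `i`). It is an illustration with a hypothetical class name (`LooIndices`), not the library's implementation.

```java
import java.util.Arrays;

final class LooIndices {
    /** Returns n arrays of length n - 1; row i lists every index except i. */
    static int[][] leaveOneOut(int n) {
        int[][] train = new int[n][n - 1];
        for (int i = 0; i < n; i++) {
            int p = 0;
            for (int j = 0; j < n; j++) {
                if (j != i) {
                    train[i][p++] = j;   // keep everything but the held-out sample
                }
            }
        }
        return train;
    }

    public static void main(String[] args) {
        // Print the training indices of each round for a tiny dataset
        for (int[] row : leaveOneOut(4)) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

Each of the `n` rows corresponds to one training run, and materializing all rows takes `O(n²)` memory, which is another reason to switch to k-fold as `n` grows.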

### Comparing Models with AIC/BIC

Fit both models to the same training set, extract their log-likelihoods, and compare:

```java
GaussianMixture m1 = GaussianMixture.fit(x, 2); // 2 components
GaussianMixture m2 = GaussianMixture.fit(x, 5); // 5 components
double aic1 = ModelSelection.AIC(m1.logLikelihood(), m1.numParameters());
double aic2 = ModelSelection.AIC(m2.logLikelihood(), m2.numParameters());
System.out.println("Preferred by AIC: " + (aic1 < aic2 ? "2-component" : "5-component"));
```

---

## Quick API Reference

### Data Splitting

| Method | Description |
|---|---|
| `Bag.split(n, holdout)` | Random holdout split on `n` raw indices |
| `Bag.split(data, holdout)` | Random holdout split returning two `DataFrame`s |
| `Bag.stratify(data, column, holdout)` | Stratified holdout split on a `DataFrame` |
| `Bootstrap.of(n, k)` | `k` bootstrap bags from `n` samples |
| `Bootstrap.of(category, k)` | `k` stratified bootstrap bags |
| `CrossValidation.of(n, k)` | Standard k-fold splits |
| `CrossValidation.stratify(labels, k)` | Stratified k-fold splits |
| `CrossValidation.nonoverlap(group, k)` | Group k-fold splits |
| `LOOCV.of(n)` | Leave-one-out training index arrays |

### Running Validation

| Method | Returns |
|---|---|
| `ClassificationValidation.of(formula, train, test, trainer)` | `ClassificationValidation` |
| `ClassificationValidation.of(bags, x, y, trainer)` | `ClassificationValidations` |
| `CrossValidation.classification(k, formula, data, trainer)` | `ClassificationValidations` |
| `CrossValidation.classification(round, k, formula, data, trainer)` | `ClassificationValidations` (repeated) |
| `Bootstrap.classification(k, formula, data, trainer)` | `ClassificationValidations` |
| `LOOCV.classification(x, y, trainer)` | `ClassificationMetrics` |
| `RegressionValidation.of(formula, train, test, trainer)` | `RegressionValidation` |
| `RegressionValidation.of(bags, x, y, trainer)` | `RegressionValidations` |
| `CrossValidation.regression(k, formula, data, trainer)` | `RegressionValidations` |
| `Bootstrap.regression(k, formula, data, trainer)` | `RegressionValidations` |
| `LOOCV.regression(x, y, trainer)` | `RegressionMetrics` |

### Model Selection

| Method | Formula |
|---|---|
| `ModelSelection.AIC(logL, k)` | `2k − 2 logL` |
| `ModelSelection.BIC(logL, k, n)` | `k log n − 2 logL` |

---

## Common Pitfalls

### 1. Comparing cross-validated metrics across different sample sizes

`ClassificationMetrics.size` records the test-set size for each round. When comparing models evaluated on different datasets, compare error rates (`error / size`) rather than raw error counts.

### 2. Using `std` from a single fold

`ClassificationValidations.of` and `RegressionValidations.of` require a list of at least one `ClassificationValidation` / `RegressionValidation`. With exactly one round, `std` is `0.0` for every field; meaningful aggregation requires two or more rounds.

### 3. Ignoring `Double.NaN` in hard-classifier metrics

Probability-based metrics (`auc`, `logloss`, `crossEntropy`) are `Double.NaN` for hard classifiers. Passing them to arithmetic expressions silently propagates `NaN`:

```java
// Unsafe: if auc is NaN this prints NaN
System.out.printf("AUC: %.4f%n", metrics.auc());

// Safe
if (!Double.isNaN(metrics.auc())) {
    System.out.printf("AUC: %.4f%n", metrics.auc());
}
```

### 4. Data leakage with group k-fold

Use `CrossValidation.nonoverlap` whenever samples within a group share information (repeated measurements, sliding windows, augmented copies). Using standard k-fold in these cases inflates accuracy estimates because the same underlying signal appears in both train and test.

### 5. Repeated CV vs. more folds

Increasing `k` beyond 10 rarely reduces the variance of the estimate. Use repeated CV (`CrossValidation.classification(round, k, ...)`) instead to get a more stable estimate, at a cost of `round × k` training runs.

### 6. LOOCV on large datasets

LOOCV fits the model `n` times. For `n = 10 000` with a non-trivial model this is prohibitively slow. Prefer stratified 10-fold CV or bootstrap for datasets larger than ~200–500 samples.

### 7. BIC requires `n > 0`

`ModelSelection.BIC` calls `Math.log(n)`. Passing `n ≤ 0` silently produces `NaN` (negative `n`) or `-Infinity` (`n = 0`). Always ensure `n` is a positive integer matching the training sample count.

---

*SMILE — Copyright © 2010-2026 Haifeng Li. GNU GPL licensed.*