# SMILE Base Module `smile-base` is the foundation module of the [SMILE](https://haifengl.github.io) (Statistical Machine Intelligence and Learning Engine) library. It provides all the data structures, mathematical primitives, statistical utilities, and I/O facilities that the rest of SMILE is built upon. --- ## Package Map | Package | Description | |---------|----------------------------------------------------------------------------------| | `smile.data` | [DataFrame, Tuple, type system, measures, vectors](#data) | | `smile.data.formula` | [Formula language for model matrices](#formula) | | `smile.data.transform` | [Feature-transformation pipelines](#data-transformation) | | `smile.datasets` | [Built-in benchmark datasets](#datasets) | | `smile.io` | [Data I/O — CSV, JSON, Parquet, ARFF, …](#data-io) | | `smile.tensor` | [Tensors and dense/sparse linear algebra](#tensor--linear-algebra) | | `smile.math` | [Core math utilities, MathEx, special functions](#math) | | `smile.math.distance` | [Distance and metric functions](#distances) | | `smile.math.kernel` | [Mercer kernel functions (SVM, GP, …)](#kernels) | | `smile.math.rbf` | [Radial basis functions](#radial-basis-functions) | | `smile.math.random` | [Pseudorandom number generators](#random-number-generators) | | `smile.math.BFGS` | [Quasi-Newton BFGS / L-BFGS optimization](#bfgs-optimisation) | | `smile.cs` | [Compressed sensing](#compressed-sensing) | | `smile.stat.distribution` | [Probability distributions](#probability-distributions) | | `smile.stat.hypothesis` | [Hypothesis testing](#hypothesis-testing) | | `smile.interpolation` | [1-D and 2-D interpolation](#interpolation) | | `smile.neighbor` | [Nearest-neighbor search (KD-tree, ball tree, LSH, …)](#nearest-neighbor-search) | | `smile.graph` | [Graph data structures and algorithms](#graph) | | `smile.gap` | [Genetic algorithm (GAP framework)](#genetic-algorithm) | | `smile.ica` | [Independent component analysis](#independent-component-analysis) | | `smile.hash` | [Non-cryptographic hash functions](#hash-functions) | | `smile.sort` | [Sorting and selection algorithms](#sorting--selection) | | `smile.wavelet` | [Wavelet transforms](#wavelets) | --- ## Data **Package:** `smile.data`, `smile.data.type`, `smile.data.measure`, `smile.data.vector` **Guide:** [DATA_FRAME.md](DATA_FRAME.md) `DataFrame` is SMILE's primary in-memory tabular data structure — a typed, column-oriented table with named fields, rich type metadata, and a streaming API. `Tuple` is a single row of a `DataFrame`. The `smile.data.type` package defines a complete type system (`DataType`, `StructType`, …), and `smile.data.measure` provides nominal, ordinal, interval, and ratio measurement scales. Strongly-typed column vectors (`IntVector`, `DoubleVector`, `StringVector`, …) live in `smile.data.vector`. ```java DataFrame iris = Read.arff(Paths.getTestData("weka/iris.arff")); DataFrame subset = iris.select("sepallength", "sepalwidth", "class"); double[] lengths = iris.column("sepallength").toDoubleArray(); ``` --- ## Formula **Package:** `smile.data.formula` **Guide:** [FORMULA.md](FORMULA.md) A compact, symbolic language for specifying model matrices (design matrices) from a `DataFrame`. Inspired by R's formula notation. ```java // All predictors except class Formula f = Formula.lhs("class"); // Explicit terms with an interaction Formula f = Formula.of("y", "x1", "x2", "x1:x2"); double[][] X = f.x(dataFrame).toArray(); double[] y = f.y(dataFrame).toDoubleArray(); ``` --- ## Data Transformation **Package:** `smile.data.transform` **Guide:** [DATA_TRANSFORMATION.md](DATA_TRANSFORMATION.md) Composable, serializable feature-transformation pipelines. Each transformer is fitted on training data and then applied consistently to new data. Includes scaling (`Standardizer`, `MaxAbsScaler`, `MinMaxScaler`), encoding (`OneHotEncoder`, `KBinsDiscretizer`), imputation, and more. ```java var scaler = Standardizer.fit(trainDF); DataFrame normalized = scaler.apply(trainDF); DataFrame normalizedTest = scaler.apply(testDF); ``` --- ## Datasets **Package:** `smile.datasets` **Guide:** [DATASET.md](DATASET.md) Ready-to-use loaders for over 30 standard machine-learning benchmark datasets (Iris, MNIST, USPS, Breast Cancer, CPU, Abalone, …). Each loader returns structured data suitable for immediate use with SMILE classifiers, regressors, or clustering algorithms. ```java var iris = Iris.load(); var mnist = MNIST.load(); var breastCancer = BreastCancer.load(); ``` --- ## Data I/O **Package:** `smile.io` **Guide:** [DATA_IO.md](DATA_IO.md) Unified read/write API for many tabular and hierarchical formats: | Format | Read | Write | |--------|:----:|:-----:| | CSV / TSV / DSV | ✓ | ✓ | | JSON (row-oriented) | ✓ | ✓ | | Apache Parquet | ✓ | ✓ | | Apache Avro | ✓ | — | | Apache Arrow | ✓ | ✓ | | Weka ARFF | ✓ | — | | LibSVM sparse | ✓ | — | | Matrix Market | ✓ | — | ```java DataFrame df = Read.csv("data.csv"); DataFrame df = Read.parquet("data.parquet"); Write.csv(df, Paths.get("out.csv")); ``` --- ## Tensor & Linear Algebra **Package:** `smile.tensor`, `smile.linalg` **Guide:** [TENSOR.md](TENSOR.md) Core numerical data structures used throughout SMILE: - **`DenseMatrix`** — double-precision dense matrix with BLAS/LAPACK backend - **`SparseMatrix`** — CSR compressed sparse matrix - **`BandMatrix`**, **`SymmMatrix`**, **`Cholesky`**, **`LU`**, **`QR`**, **`SVD`**, **`EVD`** — decompositions - **`IMatrix`** / **`FloatMatrix`** — integer and single-precision variants - **N-D `Tensor`** — general multi-dimensional array (float32 / float64) ```java DenseMatrix A = DenseMatrix.of(new double[][]{{1,2},{3,4}}); DenseMatrix B = A.mm(A); // matrix multiply var svd = A.svd(); // full SVD double[] x = A.solve(b); // least-squares solve ``` --- ## Math **Package:** `smile.math` `MathEx` is a large static utility class with fast implementations of common mathematical operations: log-sum-exp, softmax, entropy, distance helpers, combinatorics, special functions (gamma, beta, erf, …), and more. `smile.math.special` provides the underlying special-function implementations. ```java double lse = MathEx.logSumExp(logProbs); MathEx.softmax(scores); double h = MathEx.entropy(probs); ``` --- ## Distances **Package:** `smile.math.distance` **Guide:** [DISTANCES.md](DISTANCES.md) A rich collection of distance and similarity metrics implementing the `Distance` interface: | Metric | Class | |--------|-------| | Euclidean | `EuclideanDistance` | | Manhattan (L1) | `ManhattanDistance` | | Chebyshev (L∞) | `ChebyshevDistance` | | Minkowski (Lp) | `MinkowskiDistance` | | Cosine | `CosineDistance` | | Mahalanobis | `MahalanobisDistance` | | Hamming | `HammingDistance` | | Levenshtein (edit) | `EditDistance` | | Jaccard | `JaccardDistance` | | Dynamic Time Warping | `DynamicTimeWarping` | | Sparse vectors | `SparseEuclideanDistance`, `SparseManhattanDistance`, … | --- ## Kernels **Package:** `smile.math.kernel` **Guide:** [KERNELS.md](KERNELS.md) Mercer kernel functions used in support vector machines, Gaussian processes, and kernel methods. All implement `MercerKernel`. | Kernel | Class | |--------|-------| | Linear | `LinearKernel` | | Polynomial | `PolynomialKernel` | | Gaussian (RBF) | `GaussianKernel` | | Laplacian | `LaplacianKernel` | | Matérn 3/2, 5/2 | `Matern32Kernel`, `Matern52Kernel` | | Hyperbolic Tangent | `HyperbolicTangentKernel` | | Sparse variants | `SparseGaussianKernel`, `SparseLaplacianKernel`, … | --- ## Radial Basis Functions **Package:** `smile.math.rbf` **Guide:** [RBF.md](RBF.md) Radial basis functions used for interpolation, RBF networks, and other kernel methods. | RBF | Class | |-----|-------| | Gaussian | `GaussianRadialBasis` | | Multiquadric | `MultiquadricRadialBasis` | | Inverse Multiquadric | `InverseMultiquadricRadialBasis` | | Thin-plate spline | `ThinPlateSplineRadialBasis` | --- ## Random Number Generators **Package:** `smile.math.random` **Guide:** [RNG.md](RNG.md) High-quality pseudorandom number generators and distributions: - **`MersenneTwister`** — MT19937, the default general-purpose PRNG - **`XoShiRo256StarStar`**, **`XoRoShiRo128StarStar`** — fast modern RNGs - **`Halton`**, **`SobolSequence`** — low-discrepancy quasi-random sequences ```java var rng = new MersenneTwister(42); double u = rng.nextDouble(); int[] perm = rng.permutation(100); ``` --- ## BFGS Optimization **Package:** `smile.math` **Guide:** [BFGS.md](BFGS.md) Quasi-Newton BFGS and L-BFGS unconstrained optimisers. The `BFGS` class minimises a smooth, differentiable objective function given its gradient. L-BFGS uses a limited-memory approximation suitable for high-dimensional problems. ```java double[] x0 = {0.0, 0.0}; double[] xMin = BFGS.minimize(f, g, x0, 1e-6, 200); ``` --- ## Compressed Sensing **Package:** `smile.cs` **Guide:** [COMPRESSED_SENSING.md](COMPRESSED_SENSING.md) Sparse signal recovery from underdetermined linear measurements using the Basis Pursuit (BP) and LASSO formulations. Useful for compressive imaging, sparse regression, and dictionary learning. --- ## Probability Distributions **Package:** `smile.stat.distribution` **Guide:** [DISTRIBUTIONS.md](DISTRIBUTIONS.md) A comprehensive library of univariate and multivariate distributions, each implementing `Distribution` (or `DiscreteDistribution`). Every distribution supports PDF/PMF, CDF, quantile (inverse CDF), mean, variance, entropy, and random sampling. **Continuous:** `GaussianDistribution`, `ExponentialDistribution`, `GammaDistribution`, `BetaDistribution`, `WeibullDistribution`, `LogNormalDistribution`, `TDistribution`, `FDistribution`, `ChiSquaredDistribution`, `CauchyDistribution`, `LogisticDistribution`, `UniformDistribution`, `EmpiricalDistribution`, `KernelDensityEstimation`, … **Discrete:** `BinomialDistribution`, `PoissonDistribution`, `NegativeBinomialDistribution`, `GeometricDistribution`, `HypergeometricDistribution`, `DiscreteUniformDistribution`, … **Multivariate:** `MultivariateGaussianDistribution`, `DirichletDistribution`, `MultivariateExponentialFamilyMixture`, … ```java var gauss = new GaussianDistribution(0, 1); double p = gauss.cdf(1.96); // ≈ 0.975 double x = gauss.quantile(0.975); // ≈ 1.96 double[] samples = gauss.rand(1000); ``` --- ## Hypothesis Testing **Package:** `smile.stat.hypothesis` **Guide:** [HYPOTHESIS_TESTING.md](HYPOTHESIS_TESTING.md) Parametric and non-parametric tests for comparing means, variances, distributions, and correlations. | Test | Class / Method | |------|---------------| | One-sample t-test | `TTest.test(double[] x, double μ)` | | Two-sample t-test | `TTest.test(double[] x, double[] y)` | | Paired t-test | `TTest.pairedTest(double[] x, double[] y)` | | F-test (variance) | `FTest.test(double[] x, double[] y)` | | Chi-squared goodness-of-fit | `ChiSqTest.test(int[] counts, double[] prob)` | | Chi-squared independence | `ChiSqTest.test(int[][] table)` | | Kolmogorov-Smirnov | `KSTest.test(double[] x, Distribution d)` | | Two-sample KS | `KSTest.test(double[] x, double[] y)` | | Spearman correlation | `SpearmanTest` | | Kendall's τ | `KendallTest` | All test objects expose a `p-value` field for decision making. --- ## Interpolation **Package:** `smile.interpolation` **Guide:** [INTERPOLATION.md](INTERPOLATION.md) Smooth function reconstruction from a discrete set of sample points. **1-D:** `LinearInterpolation`, `PolynomialInterpolation`, `SplineInterpolation` (natural cubic spline), `CubicSplineInterpolation`, `RBFInterpolation1D`, `KrigingInterpolation1D` **2-D:** `BilinearInterpolation`, `BicubicInterpolation`, `CubicSplineInterpolation2D`, `RBFInterpolation2D`, `KrigingInterpolation2D`, `ShepardInterpolation2D` ```java double[] x = {0, 1, 2, 3}; double[] y = {0, 1, 4, 9}; var spline = new CubicSplineInterpolation(x, y); double v = spline.interpolate(1.5); // ≈ 2.25 ``` --- ## Nearest-Neighbor Search **Package:** `smile.neighbor` **Guide:** [NEAREST_NEIGHBOR.md](NEAREST_NEIGHBOR.md) Exact and approximate nearest-neighbor data structures. | Structure | Class | Best for | |-----------|-------|----------| | KD-tree | `KDTree` | Low-dimensional Euclidean data | | Ball tree | `BallTree` | Arbitrary metric spaces | | Cover tree | `CoverTree` | Expandable, metric spaces | | LSH | `LSH` | High-dimensional approximate NN | | MPLSH | `MPLSH` | Multi-probe LSH | All implement `NearestNeighborSearch` and support k-NN and range queries. ```java var kdtree = KDTree.of(data); Neighbor[] nn = kdtree.knn(query, 5); ``` --- ## Graph **Package:** `smile.graph` **Guide:** [GRAPH.md](GRAPH.md) Directed and undirected weighted graphs with adjacency-list and adjacency-matrix representations, plus classic graph algorithms. - **`AdjacencyList`** — sparse graph (recommended for most uses) - **`AdjacencyMatrix`** — dense graph - **Algorithms:** BFS, DFS, topological sort, shortest path (Dijkstra, Bellman-Ford), minimum spanning tree (Prim, Kruskal), strongly connected components (Tarjan) ```java var g = new AdjacencyList(6, false); // 6 vertices, undirected g.addEdge(0, 1, 1.0); int[] order = g.bfs(0); double[] dist = g.dijkstra(0); ``` --- ## Genetic Algorithm **Package:** `smile.gap` **Guide:** [GAP.md](GAP.md) A flexible genetic algorithm framework (GAP — Genetic Algorithm Platform) for combinatorial and continuous optimization. Users implement a `Chromosome` with `fitness()`, `crossover()`, and `mutate()` to define the problem; the `GeneticAlgorithm` driver handles selection, recombination, and termination. ```java var ga = new GeneticAlgorithm<>(population, 0.5, 0.01); Chromosome best = ga.evolve(1000, 1e-4); ``` --- ## Independent Component Analysis **Package:** `smile.ica` **Guide:** [ICA.md](ICA.md) FastICA for blind source separation. Recovers statistically independent components from a linear mixture of signals — widely used in signal processing, fMRI analysis, and feature extraction. ```java var ica = ICA.fit(X, 3); // extract 3 independent components double[][] S = ica.transform(X); ``` --- ## Hash Functions **Package:** `smile.hash` **Guide:** [HASH.md](HASH.md) High-performance, non-cryptographic hash functions and locality-sensitive hashing (LSH) families. | Hash | Class | Notes | |------|-------|-------| | MurmurHash3 | `MurmurHash3` | 32-bit / 128-bit; fast, high quality | | SHA-1 | `SHA` | Cryptographic (SHA-1 used for LSH) | | MinHash | `MinHash` | Jaccard similarity LSH | | SimHash | `SimHash` | Cosine similarity LSH | | Random projection LSH | `RandomProjectionHash` | Euclidean LSH | | Cross-polytope LSH | `CrossPolytope` | Efficient Euclidean LSH | --- ## Sorting & Selection **Package:** `smile.sort` **Guide:** [SORT.md](SORT.md) Allocation-free, cache-friendly sorting and selection algorithms used internally throughout SMILE. | Algorithm | Class / Method | |-----------|---------------| | Introsort (O(n log n)) | `IntroSort` | | Heap sort | `HeapSort` | | Shell sort | `ShellSort` | | Quickselect (k-th element) | `QuickSelect` | | Indexed sort | `QuickSort` (with index array) | | Partition | `QuickSelect.select(a, k)` | --- ## Wavelets **Package:** `smile.wavelet` **Guide:** [WAVELET.md](WAVELET.md) Discrete Wavelet Transform (DWT) and Wavelet Packet Transform (WPT) for signal processing, compression, and multi-resolution analysis. **Families:** Haar, Daubechies (D4–D20), Symlet (S8–S20), Coiflet (C6–C30), BestLocalized (BL14–BL20), Vaidyanathan. ```java double[] signal = ...; var wavelet = new DaubechiesWavelet(4); // D4 wavelet.transform(signal); // in-place DWT wavelet.inverse(signal); // reconstruction ``` --- ## Building ```bash # Build the module ./gradlew :base:build # Run tests ./gradlew :base:test # Generate Javadoc ./gradlew :base:javadoc ``` Dependencies: Java 25+, BLAS/LAPACK native libraries (optional — pure-Java fallback included). --- *SMILE — © 2010-2026 Haifeng Li. GNU GPL licensed.*