# SMILE Scala API

The `smile-scala` module is an idiomatic Scala shim over the SMILE Java library. Except for the `smile.cas` package (a Computer Algebra System), it adds nothing algorithmic — every function ultimately delegates to the same Java `fit`, `of`, or constructor — but it replaces verbose Java patterns with concise, expressive Scala idioms:

- **Implicit conversions ("pimp-my-library")** enrich `DataFrame`, `Tuple`, arrays, and `String` with domain-specific methods.
- **Operator DSL** — R-style formula syntax (`y ~ x1 + x2`), NumPy-style array slicing (`0 ~ 9 ~ 2`), and linear-algebra operators (`%*%`, `\`).
- **Computer Algebra System** — symbolic differentiation and simplification on scalar, vector, and matrix expressions.
- **Top-level functions** with named default arguments replace long static-method argument lists.
- **Idiomatic `object` namespaces** (`read`, `write`, `gpr`, `validate`, `cv`, `loocv`, `bootstrap`) group related operations without polluting the package namespace.
- **Macro-backed rendering** in Scala 2.13 detects notebook environments (Zeppelin, Databricks) at compile time and routes `show(…)` to either an in-process Swing window or inline HTML output.

The module depends on `:core` (ML), `:base` (data and I/O), `:nlp`, `:plot`, and `:json`.

---

## Table of Contents

1. [Installation](#installation)
2. [Data I/O — `read` and `write`](#data-io--read-and-write)
3. [DataFrame Extensions](#dataframe-extensions)
4. [Formula DSL](#formula-dsl)
5. [Math, Arrays, and Linear Algebra](#math-arrays-and-linear-algebra)
6. [Computer Algebra System (CAS)](#computer-algebra-system-cas)
7. [Classification](#classification)
8. [Regression](#regression)
9. [Clustering](#clustering)
10. [Dimensionality Reduction](#dimensionality-reduction)
11. [Manifold Learning](#manifold-learning)
12. [Natural Language Processing](#natural-language-processing)
13. [Sequence Labeling](#sequence-labeling)
14. [Association Rule Mining](#association-rule-mining)
15. [Wavelets](#wavelets)
16. [Model Validation](#model-validation)
17. [Plotting](#plotting)
18. [Utility Helpers](#utility-helpers)
19. [Complete Examples](#complete-examples)

---

## Installation

Add the module to your `build.gradle.kts` (for use inside this Gradle project):

```kotlin
dependencies {
    implementation(project(":scala"))
}
```

Or, from SBT in a standalone project:

```scala
libraryDependencies += "com.github.haifengl" %% "smile-scala" % ""
```

Import the relevant package objects at the top of each file. The most common imports are:

```scala
import smile.io.*             // read, write
import smile.data.*           // DataFrame implicits, summary
import smile.data.formula.*   // formula DSL: ~, +, -, ::, &&, ^
import smile.math.*           // PimpedInt, PimpedDouble, array extensions, linalg
import smile.classification.*
import smile.regression.*
import smile.clustering.*
import smile.manifold.*
import smile.nlp.*
import smile.validation.*
```

---

## Data I/O — `read` and `write`

Both `read` and `write` are top-level Scala `object`s defined in `smile.io`. They serve as namespaces, so you can write `read.csv(…)` instead of importing a static Java method.
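The object-as-namespace idiom is plain Scala. A minimal self-contained sketch (illustrative only — the method bodies and return types are placeholders, not SMILE's implementation) shows how an `object` with an `apply` method serves both as a namespace (`read.csv(…)`) and as a callable (`read(…)`):

```scala
object ReadDemo {
  // "read" as both a namespace and a callable object.
  object read {
    // Namespaced method: invoked as read.csv(...)
    def csv(path: String): String = s"DataFrame loaded from $path"
    // apply lets the object itself be called: read("model.bin")
    def apply(path: String): String = s"Model deserialized from $path"
  }

  def main(args: Array[String]): Unit = {
    println(read.csv("iris.csv"))   // namespace style
    println(read("model.bin"))      // function-call style via apply
  }
}
```

This is why deserialization reads as `read("model.bin")` while format-specific loaders read as `read.csv`, `read.parquet`, and so on.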
### Loading data

```scala
import smile.io.*

// Auto-detect format from extension (.csv, .json, .arff, .parquet, .avro, .sas7bdat)
val df = read.data("path/to/file.csv")
val df = read.data("path/to/file.parquet")
val df = read.data("path/to/file.json", "multi-line")   // JSON mode hint

// CSV with options (all have defaults — delimiter=",", header=true, quote='"')
val df = read.csv("iris.csv")
val df = read.csv("data.tsv", delimiter = "\t")
val df = read.csv("data.csv", header = false, comment = '#')

// Other formats
val df = read.json("records.json")
val df = read.json("records.json", JSON.Mode.MULTI_LINE, schema)
val df = read.arff("weka.arff")
val df = read.sas("dataset.sas7bdat")
val df = read.arrow("data.arrow")
val df = read.avro("data.avro", schemaInputStream)
val df = read.parquet("data.parquet")
val ds = read.libsvm("data.libsvm")                  // returns SparseDataset[Integer]
val (vertices, edges) = read.wavefront("mesh.obj")   // 3-D OBJ geometry

// JDBC result set
val df = read.jdbc(resultSet)

// Deserialize a previously serialized model
val model = read("model.bin")
```

### Saving data

```scala
import smile.io.*

// Serialize any Serializable object (e.g. a trained model)
write(model, "model.bin")

// Write a DataFrame
write.csv(df, "out.csv")
write.csv(df, "out.tsv", delimiter = "\t")
write.arff(df, "out.arff", "relation-name")
write.arrow(df, "out.arrow")

// Write raw arrays
write.array(predictions, "predictions.txt")          // one element per line
write.table(matrix, "matrix.csv", delimiter = ",")   // 2-D array to delimited file
```

---

## DataFrame Extensions

When you import `smile.data.*`, implicit conversions enrich `DataFrame` and `Tuple` with Scala-idiomatic methods.
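The "pimp-my-library" mechanism behind these extensions is an ordinary Scala implicit class. A minimal self-contained sketch (illustrative only — `RichDoubleArray`, `mean`, and `centered` are made-up names, not SMILE's actual `PimpedDoubleArray`):

```scala
object PimpDemo {
  // Implicit value class: wraps Array[Double] and adds extension methods
  // without runtime allocation. SMILE's Pimped* classes use the same pattern.
  implicit class RichDoubleArray(val a: Array[Double]) extends AnyVal {
    def mean: Double = a.sum / a.length
    def centered: Array[Double] = { val m = mean; a.map(_ - m) }
  }

  def main(args: Array[String]): Unit = {
    val xs = Array(1.0, 2.0, 3.0)
    // mean is not a method of Array; the implicit conversion supplies it
    println(xs.mean)
  }
}
```

Because the conversion is resolved at compile time, the enriched methods read as if they were defined on `DataFrame` or `Array` itself.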
### `DataFrameOps` — enriches `DataFrame`

```scala
import smile.data.*

val df: DataFrame = read.csv("iris.csv")

// Select/drop columns by name or Range
val sub   = df.select("sepal.length", "sepal.width")
val fewer = df.drop("class")
val slice = df.of(0 until 100)   // row slice using Scala Range

// Functional operations
val row: Option[Tuple]    = df.find(_.getInt("class") == 1)
val all: Boolean          = df.forall(_.getDouble("petal.length") > 0.0)
val any: Boolean          = df.exists(_.getDouble("sepal.length") > 7.0)
df.foreach(row => println(row))
val mapped: Array[Double] = df.map(_.getDouble(0))
val filtered: DataFrame   = df.filter(_.getDouble("sepal.length") > 5.0)
val (yes, no)             = df.partition(_.getInt("class") == 0)
val groups                = df.groupBy(_.getInt("class"))

// JSON conversion
val json: String = df.toJSON
```

### `TupleOps` — enriches `Tuple`

```scala
val t: Tuple = df.get(0)
val json: String = t.toJSON   // handles categorical fields correctly
```

### Summary statistics (top-level)

```scala
import smile.data.*

summary(intArray)      // prints min/Q1/median/mean/Q3/max for Array[Int]
summary(doubleArray)   // same for Array[Double]
```

---

## Formula DSL

Import `smile.data.formula.*` to unlock an R-style formula language for specifying model structure.

### Basic syntax

```scala
import smile.data.formula.*

// y ~ x means "predict y from x"
val f: Formula = "y" ~ "x"

// Include multiple terms
val f = "price" ~ "size" + "bedrooms" + "location"

// Exclude a term with unary -
val f = "y" ~ "." - "id"   // use all columns except "id"

// Intercept-only: just ". ~ ."
```

### Interaction and crossing

```scala
// Interaction term: a :: b (like R's a:b — the interaction without main effects)
val f = "y" ~ "a" :: "b"

// Crossing (main effects + interactions): a && b
val f = "y" ~ "a" && "b"   // expands to a + b + a:b

// Degree on crossing
val f = "y" ~ ("a" && "b") ^ 3
```

### Function terms

All common `Math` functions are available as formula terms:

```scala
val f = "y" ~ log("income") + sqrt("age") + "gender"
val f = "y" ~ abs("balance") + exp("rate")

// Available: abs, ceil, floor, round, rint, exp, expm1, log, log1p,
//            log10, log2, signum, sign, sqrt, cbrt, sin, cos, tan,
//            sinh, cosh, tanh, asin, acos, atan, ulp
```

---

## Math, Arrays, and Linear Algebra

Import `smile.math.*` to get enriched numeric types, operator overloading for arrays and matrices, and many statistical/linear-algebra helpers.

### Enriched primitives

```scala
import smile.math.*

// PimpedInt — slice construction (Python-like)
val s: Slice = 0 ~ 9       // indices 0..9
val s: Slice = 0 ~ 9 ~ 2   // indices 0, 2, 4, 6, 8 (step 2)

// PimpedDouble — arithmetic with arrays and matrices
2.0 + someArray    // returns VectorExpression
3.0 * someMatrix   // returns MatrixExpression
```

### Array extensions (`PimpedDoubleArray`, `PimpedArray2D`)

```scala
import smile.math.*

val a = Array(1.0, 2.0, 3.0)
val b = Array(4.0, 5.0, 6.0)

a += b     // in-place element-wise addition
a -= b
a *= 2.0
a /= 2.0

// 2-D
val m: Array[Array[Double]] = …
m.toMatrix   // converts to DenseMatrix

// Sampling
val sample = a.sample(50)   // draw 50 elements without replacement
```

### `VectorExpression` operators

```scala
val u: VectorExpression = …
val v: VectorExpression = …

u + v     // element-wise addition → VectorExpression
u - v
u * 3.0
u %*% v   // dot product → Double (via simplify.toVector)
```

### `MatrixExpression` operators

```scala
val A: MatrixExpression = …
val B: MatrixExpression = …

A + B
A - B
A * B     // element-wise
A %*% B   // matrix multiplication (uses optimal chain order)
A.t       // transpose
A * v     // matrix-vector product

// Solve A x = b
val x = A \ b   // via LU or QR depending on shape
```

### Linear algebra helpers (top-level in `smile.math`)

```scala
import smile.math.*

zeros(3, 4)    // 3×4 zero matrix
ones(3, 4)
eye(5)         // identity
rand(3, 3)     // uniform random
randn(3, 3)    // Gaussian random

trace(A)
diag(A)        // extract diagonal or build diagonal matrix
lu(A)          // LU decomposition
qr(A)          // QR decomposition
cholesky(A)
eig(A)         // eigenvalues only
eigen(A)       // full eigendecomposition
svd(A)
det(A)
rank(A)
inv(A)
```

### Statistical tests (top-level)

```scala
import smile.math.*

chisqtest(freq)              // Chi-squared goodness-of-fit
chisqtest2(x, y)             // Two-sample Chi-squared
ftest(x, y)                  // F-test for variance equality
ttest(x, mean)               // One-sample t-test
ttest2(x, y)                 // Two-sample t-test
ttest(x, y, paired = true)   // Paired t-test
kstest(x, dist)              // Kolmogorov-Smirnov
pearsontest(x, y)            // Pearson correlation
spearmantest(x, y)           // Spearman rank correlation
kendalltest(x, y)            // Kendall tau

// Contingency-table Chi-squared
chisqtest(table)
```

### Special functions (top-level)

```scala
import smile.math.*

beta(a, b); erf(x); erfc(x); gamma(x); lgamma(x); digamma(x)
inverf(p); inverfc(p); erfcc(x)
```

---

## Computer Algebra System (CAS)

The `smile.cas` package provides **symbolic** scalars, vectors, and matrices. Import `smile.cas.*` to enable implicit conversions from Scala literals to CAS nodes.

### Scalars

```scala
import smile.cas.*

// Literals become CAS nodes automatically
val x: Var = "x"      // Var — symbolic variable
val a: Val = 3.14     // Val — numeric constant
val n: IntVal = 2     // integer constant

// Arithmetic
val expr = x * x + 2 * x + 1     // Scalar expression
val diff = expr.d("x")           // symbolic derivative w.r.t. x: 2*x + 2
val simplified = diff.simplify   // simplification

// Helper functions
val f = exp(x) + log(x) + sqrt(x) + sin(x) + cos(x) + tan(x)
val g = abs("y") + ceil("z") + floor("w")
```

### Vectors

```scala
import smile.cas.*

val v = Vector("a", "b", "c")   // 3-element symbolic vector
val u = Vector("x", "y", "z")

val dot = v * u      // dot product expression
val jac = v.d("x")   // Jacobian w.r.t. scalar
```

### Matrices

```scala
import smile.cas.*

val M = Matrix("M")   // symbolic matrix variable
val N = Matrix("N")

val prod = M * N          // symbolic matrix product
val inv  = M.inv          // symbolic inverse
val grad = M.d("alpha")   // derivative w.r.t. scalar parameter
```

---

## Classification

Import `smile.classification.*`. Every function is wrapped with `time(…)`, which logs its wall-clock duration.

### K-Nearest Neighbors

```scala
import smile.classification.*

// From a pre-built KNN search structure
val model = knn(knnSearch, y, k = 5)

// Build automatically from feature matrix (custom distance)
val model = knn(x, y, k = 5, distance = new EuclideanDistance)

// Euclidean distance shortcut
val model = knn(x, y, k = 5)
```

### Logistic Regression

```scala
val model = logit(x, y,
  lambda = 0.01,   // L2 regularization (0 = none)
  tol = 1e-5,      // convergence tolerance
  maxIter = 500)
```

### Maximum Entropy (Multinomial Logistic for sparse features)

```scala
// x(i) is a sparse binary feature: array of non-zero feature indices
val model = maxent(x, y,
  p = 50000,   // feature space dimension
  lambda = 0.1,
  tol = 1e-5,
  maxIter = 500)
```

### Multilayer Perceptron

```scala
import smile.model.mlp.*
import smile.util.function.TimeFunction

val layers = Array(
  Layer.input(4),
  Layer.sigmoid(20),
  Layer.mle(3, OutputFunction.SOFTMAX)
)

val model = mlp(x, y, layers,
  epochs = 10,
  learningRate = TimeFunction.linear(0.01, 10000, 0.001),
  momentum = TimeFunction.constant(0.0),
  weightDecay = 0.0,
  rho = 0.0,
  epsilon = 1e-7)
```

### RBF Network

```scala
// Provide explicit RBF neurons
val neurons = RBF.fit(x, k = 10)
val model = rbfnet(x, y, neurons, normalized = false)

// Convenience: build Gaussian RBF with k-means automatically
val model = rbfnet(x, y, k = 10, normalized = false)
```

### Support Vector Machine

```scala
import smile.math.kernel.*

val kernel = new GaussianKernel(sigma = 1.0)
val model = svm(x, y, kernel, C = 1.0, tol = 1e-3, epochs = 1)
```

### Decision Tree (CART)

```scala
import smile.data.formula.*
import smile.model.cart.SplitRule

val model = cart(formula, data,
  splitRule = SplitRule.GINI,
  maxDepth = 20,
  maxNodes = 0,   // 0 = unlimited
  nodeSize = 5)
```

### Random Forest

```scala
val model = randomForest(formula, data,
  ntrees = 500,
  mtry = 0,   // 0 = floor(sqrt(p))
  splitRule = SplitRule.GINI,
  maxDepth = 20,
  maxNodes = 500,
  nodeSize = 1,
  subsample = 1.0,   // 1.0 = with replacement
  classWeight = null,
  seeds = null)
```

### Gradient Boosted Trees

```scala
val model = gbm(formula, data,
  ntrees = 500,
  maxDepth = 20,
  maxNodes = 6,
  nodeSize = 5,
  shrinkage = 0.05,
  subsample = 0.7)
```

### AdaBoost

```scala
val model = adaboost(formula, data,
  ntrees = 500,
  maxDepth = 20,
  maxNodes = 6,
  nodeSize = 1)
```

### Discriminant Analysis

```scala
// Fisher's Linear Discriminant
val model = fisher(x, y, L = -1, tol = 1e-4)

// Linear Discriminant Analysis
val model = lda(x, y, priori = null, tol = 1e-4)

// Quadratic Discriminant Analysis
val model = qda(x, y, priori = null, tol = 1e-4)

// Regularized Discriminant Analysis (blends LDA and QDA)
val model = rda(x, y,
  alpha = 0.5,   // 0 = LDA, 1 = QDA
  priori = null,
  tol = 1e-4)
```

### Naive Bayes

```scala
import smile.classification.DiscreteNaiveBayes

// Document classification with add-k smoothing
val model = naiveBayes(x, y,
  model = DiscreteNaiveBayes.Model.MULTINOMIAL,
  priori = null,
  sigma = 1.0)

// General form with continuous distributions
val model = naiveBayes(priori, condprob)
```

### Multiclass Wrappers

```scala
// One-vs-One (K*(K-1)/2 binary classifiers; max-wins voting)
val model = ovo(x, y) { (x, y) => svm(x, y, kernel, C = 1.0) }

// One-vs-Rest (K binary classifiers; highest confidence wins)
val model = ovr(x, y) { (x, y) => svm(x, y, kernel, C = 1.0) }
```

Both `ovo` and `ovr` accept any trainer function `(Array[T], Array[Int]) => Classifier[T]`, expressed as a curried Scala lambda.

---

## Regression

Import `smile.regression.*`.

### Linear Models

```scala
import smile.data.formula.*
import smile.regression.*

// Ordinary Least Squares
val model = lm(formula, data,
  method = OLS.Method.QR,   // QR or SVD
  stderr = true,
  recursive = true)

// Ridge Regression (L2 penalty)
val model = ridge(formula, data, lambda = 0.1)

// LASSO (L1 penalty; produces sparse solutions)
val model = lasso(formula, data,
  lambda = 0.1,
  tol = 1e-3,
  maxIter = 5000)
```

### Support Vector Regression

```scala
val model = svm(x, y, kernel,
  eps = 0.1,   // epsilon-insensitive loss threshold
  C = 1.0,     // soft-margin penalty
  tol = 1e-3)
```

### Regression Tree and Ensembles

```scala
// Single regression tree
val model = cart(formula, data, maxDepth = 20, maxNodes = 0, nodeSize = 5)

// Random Forest
val model = randomForest(formula, data,
  ntrees = 500,
  mtry = 0,
  maxDepth = 20,
  maxNodes = 500,
  nodeSize = 5,
  subsample = 1.0)

// Gradient Boosted Trees
import smile.model.cart.Loss

val model = gbm(formula, data,
  loss = Loss.lad(),   // least absolute deviation (robust default)
  ntrees = 500,
  maxDepth = 20,
  maxNodes = 6,
  nodeSize = 5,
  shrinkage = 0.05,
  subsample = 0.7)
```

### Gaussian Process Regression

Grouped under the `gpr` object:

```scala
import smile.regression.gpr
import smile.math.kernel.GaussianKernel

val kernel = new GaussianKernel(sigma = 1.0)

// Full GP — O(n³) in training, exact inference
val model = gpr(x, y, kernel,
  noise = 0.01,
  normalize = true,
  tol = 1e-5,
  maxIter = 0)   // maxIter = 0 skips hyperparameter optimization

// Subset-of-Regressors approximation (inducing points t ⊂ x)
val model = gpr.approx(x, y, t, kernel, noise = 0.01)

// Nyström approximation (inducing points may be external)
val model = gpr.nystrom(x, y, t, kernel, noise = 0.01)
```

### RBF Network

```scala
// Provide explicit neurons
val model = rbfnet(x, y, neurons, normalized = false)

// Convenience: Gaussian RBF via k-means
val model = rbfnet(x, y, k = 10)
```

---

## Clustering

Import `smile.clustering.*`.

### Hierarchical Clustering

```scala
import smile.clustering.*

// Euclidean distance; method ∈ "single" | "complete" | "upgma" | "average" |
//                              "upgmc" | "centroid" | "wpgma" | "wpgmc" |
//                              "median" | "ward"
val hc = hclust(data, "ward")

// Custom distance
val hc = hclust(data, myDistance, "complete")

// Cut the dendrogram to obtain k clusters
val labels = hc.partition(k = 5)
```

### Partitional Clustering

```scala
// K-Means (best of 16 runs by default)
val km = kmeans(data, k = 5, maxIter = 100, runs = 16)
println(km.k)            // actual number of clusters
println(km.distortion)   // within-cluster sum of squared distances

// K-Modes (binary / categorical data)
val km = kmodes(data, k = 5, maxIter = 100, runs = 10)

// X-Means — automatically determines k using BIC
val xm = xmeans(data, k = 20)   // k is the upper bound

// G-Means — automatically determines k using a Gaussian normality test
val gm = gmeans(data, k = 20)

// Deterministic Annealing
val da = dac(data, k = 10, alpha = 0.9)

// CLARANS (medoid-based; any distance)
val cl = clarans(data, myDistance, k = 5)
```

### Density-Based Clustering

```scala
// DBSCAN with Euclidean distance
val db = dbscan(data, minPts = 5, radius = 0.5)

// DBSCAN with custom distance
val db = dbscan(data, myDistance, minPts = 5, radius = 0.5)

// DBSCAN with a pre-built RNN search structure
val db = dbscan(data, rnnSearch, minPts = 5, radius = 0.5)

// DENCLUE (kernel-density attractors)
val dc = denclue(data, sigma = 0.5, m = 50)
```

### Information-Theoretic Clustering

```scala
import smile.util.SparseArray

// SIB — co-occurrence data (e.g. document–word)
val sb = sib(sparseData, k = 10, maxIter = 100, runs = 8)

// MEC — minimum conditional entropy (works with any distance)
val mc = mec(data, myDistance, k = 10, radius = 0.5)
val mc = mec(data, myMetric, k = 10, radius = 0.5)
val mc = mec(data, k = 10, radius = 0.5)   // Euclidean shortcut

// Spectral Clustering
val sp = specc(data, k = 5, sigma = 1.0, l = 0, maxIter = 100)
```

Cluster assignments are accessed as:

```scala
model.y           // Array[Int] of cluster labels (-1 = noise in DBSCAN)
model.centroids   // cluster centres (for centroid-based models)
```

---

## Dimensionality Reduction

Import `smile.feature.extraction.*`. All methods are wrapped with `time(…)`.

```scala
import smile.feature.extraction.*

// PCA — Principal Component Analysis
val pca = pca(data)
val pca = pca(data, cor = true)   // use correlation matrix

// Probabilistic PCA (handles missing values)
val ppca = ppca(data, k = 10)

// Kernel PCA
import smile.math.kernel.GaussianKernel
val kpca = kpca(data, kernel = new GaussianKernel(1.0), k = 10)
val kpca = kpca(data, new GaussianKernel(1.0), k = 10, threshold = 1e-4)

// Generalized Hebbian Algorithm (online / incremental PCA)
val gha = gha(data, k = 10)
val gha = gha(data, k = 10, r = 0.0001)
```

After fitting, project new data:

```scala
val embedding = pca.project(newData)
pca.setProjection(k)   // change the number of retained components
```

---

## Manifold Learning

Import `smile.manifold.*`. All methods return low-dimensional coordinate arrays (`Array[Array[Double]]`) or dedicated result objects.
```scala
import smile.manifold.*

// Isomap — geodesic MDS (C-Isomap variant by default)
val coords = isomap(data, k = 10, d = 2, CIsomap = true)

// Locally Linear Embedding
val coords = lle(data, k = 10, d = 2)

// Laplacian Eigenmap
val coords = laplacian(data, k = 10, d = 2, t = -1.0)
// t > 0 uses a Gaussian heat kernel; t ≤ 0 uses binary weights

// t-SNE (2-D or 3-D; input may be a pre-computed distance matrix)
val result = tsne(data, d = 2,
  perplexity = 20.0,
  eta = 200.0,
  earlyExaggeration = 12.0,
  maxIter = 1000)
val coords = result.coordinates

// UMAP
val coords = umap(data,
  k = 15,
  d = 2,
  epochs = 0,   // 0 = auto
  learningRate = 1.0,
  minDist = 0.1,
  spread = 1.0,
  negativeSamples = 5,
  repulsionStrength = 1.0)

// Classical MDS (equivalent to PCA when Euclidean distances are used)
val result = mds(proximity, d = 2)

// Non-metric (Kruskal) MDS
val result = isomds(proximity, d = 2, tol = 1e-4, maxIter = 200)

// Sammon Mapping
val result = sammon(proximity, d = 2, step = 0.2, maxIter = 100)
```

---

## Natural Language Processing

Import `smile.nlp.*`. The implicit conversion `pimpString` enriches every `String` with NLP pipeline methods.

### String extension methods

```scala
import smile.nlp.*

val text = "Dr. Smith went to Washington D.C. He arrived on Tuesday."

// Unicode normalization (NFKC, whitespace normalization, quote normalization)
val clean = text.normalize

// Sentence splitting
val sentences: Array[String] = text.sentences

// Tokenization with stop-word filtering
val words: Array[String] = text.words                    // default stop list
val words: Array[String] = text.words("comprehensive")   // larger stop list
val words: Array[String] = text.words("none")            // no filtering
val words: Array[String] = text.words("the,a,an")        // custom stop list

// Bag-of-words (word → count)
val bag: Map[String, Int] = text.bag()                   // Porter stemming
val bag: Map[String, Int] = text.bag(stemmer = None)     // no stemming
val bag: Map[String, Int] = text.bag(filter = "google")

// Binary bag-of-words (presence/absence)
val bag2: Set[String] = text.bag2()

// Part-of-speech tagging (returns word–POS pairs)
val tagged: Array[(String, PennTreebankPOS)] = "She sells seashells".postag

// Keyword extraction
val keywords: Seq[NGram] = text.keywords(k = 10)
```

### Corpus and n-gram utilities

```scala
import smile.nlp.*

// Build an in-memory corpus
val corp = corpus(Seq("First document text.", "Second document text."))

// Bigram collocations
val topBigrams: Seq[Bigram] = bigram(k = 100, minFreq = 5, docs: _*)
val sigBigrams: Seq[Bigram] = bigram(p = 0.01, minFreq = 5, docs: _*)

// N-gram extraction (Apriori-style)
val grams: Array[Array[NGram]] = ngram(maxNGramSize = 3, minFreq = 3, docs: _*)

// HMM POS tagging on a pre-tokenised sentence
val tags: Array[PennTreebankPOS] = postag(Array("She", "sells", "seashells"))
```

### Stemming

```scala
import smile.nlp.*

porter.stem("running")      // "run"
lancaster.stem("running")   // "run" (more aggressive)
```

### Vectorization and TF-IDF

```scala
import smile.nlp.*

// Term-frequency feature vector
val vocab = Array("machine", "learning", "deep")
val features = vectorize(vocab, bag)   // Array[Double]
val sparse = vectorize(vocab, bag2)    // Array[Int] (indices of present terms)

// Document frequency array
val dfreq: Array[Int] = df(vocab, corpusOfBags)

// Whole-corpus TF-IDF normalized to unit L2 norm
val matrix: Array[Array[Double]] = tfidf(corpusOfBags)

// Single document
val vec: Array[Double] = tfidf(bag, n = corpusSize, df = dfreq)
```

---

## Sequence Labeling

Import `smile.sequence.*`.

```scala
import smile.sequence.*

// Hidden Markov Model
val model = hmm(pi, a, b)          // from initial / transition / emission
val model = hmm(observations, k)   // learns from observation sequences

// Conditional Random Field (linear-chain)
val model = crf(x, y, feature, k, eta = 0.1, lambda = 0.1)

// CRF with Gaussian process smoothing
val model = gcrf(x, y, feature, k, eta = 0.1, lambda = 0.1)
```

---

## Association Rule Mining

Import `smile.association.*`.

```scala
import smile.association.*

val itemsets: Array[Array[Int]] = …

// Build FP-tree
val tree = fptree(itemsets)
val tree = fptree(itemsets.toStream)   // streaming variant

// Mine frequent item sets
val frequent = fpgrowth(tree, minSupport = 3)
val frequent = fpgrowth(itemsets, minSupport = 3)

// Generate association rules
val rules = arm(tree, minSupport = 3, confidence = 0.5)
val rules = arm(itemsets, minSupport = 3, confidence = 0.5)
```

---

## Wavelets

Import `smile.wavelet.*`.

```scala
import smile.wavelet.*

val wt = wavelet("D4")   // Daubechies-4 filter
// Available filters include: "Haar", "D4"–"D20" (even), "Coiflet1"–"Coiflet5", etc.

val signal = Array(1.0, 2.0, 3.0, 4.0, 3.0, 2.0, 1.0, 0.0)

// In-place discrete wavelet transform
dwt(signal, wt)

// In-place inverse DWT
idwt(signal, wt)

// Wavelet shrinkage denoising (modifies in-place)
wsdenoise(signal, wt, soft = true)
```

---

## Model Validation

Import `smile.validation.*`.
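Conceptually, k-fold cross-validation splits the data into k disjoint folds, trains on k−1 of them, tests on the held-out fold, and averages the metric across rounds. A plain-Scala sketch of the fold-splitting logic (illustrative only, not SMILE's implementation):

```scala
object CVDemo {
  // Split indices 0 until n into k disjoint (train, test) pairs.
  // Every index appears in exactly one test fold.
  def folds(n: Int, k: Int): Seq[(Seq[Int], Seq[Int])] = {
    val shuffled = scala.util.Random.shuffle((0 until n).toVector)
    (0 until k).map { i =>
      val test  = shuffled.indices.collect { case j if j % k == i => shuffled(j) }
      val train = shuffled.indices.collect { case j if j % k != i => shuffled(j) }
      (train, test)
    }
  }

  def main(args: Array[String]): Unit = {
    // 10 samples, 5 folds: each round trains on 8 and tests on 2
    for ((train, test) <- folds(n = 10, k = 5))
      println(s"train=${train.size}, test=${test.size}")
  }
}
```

SMILE's `cv` object does this splitting for you and additionally handles stratification, model training, and metric aggregation.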
### One-shot train/test evaluation

```scala
import smile.validation.*

// With raw arrays
val result = validate.classification(x, y, testX, testY) { (x, y) =>
  randomForest(Formula.lhs("label"), DataFrame.of(x, y), ntrees = 100)
}

// With DataFrame + Formula
val result = validate.classification(formula, trainDf, testDf) { (f, df) =>
  randomForest(f, df)
}

// Regression variants
val result = validate.regression(x, y, testX, testY) { (x, y) => lm(…) }
val result = validate.regression(formula, train, test) { (f, df) => lm(f, df) }
```

### Cross-Validation

```scala
val cv5 = cv.classification(k = 5, formula, data) { (f, df) => randomForest(f, df) }
println(cv5.avg.accuracy)

// With raw arrays
val cv5 = cv.classification(k = 5, x, y) { (x, y) => lda(x, y) }

// Regression
val cv5r = cv.regression(k = 5, formula, data) { (f, df) => lm(f, df) }
val cv5r = cv.regression(k = 5, x, y) { (x, y) => ridge(…) }
```

### Leave-One-Out CV

```scala
val loo = loocv.classification(formula, data) { (f, df) => cart(f, df) }
val loo = loocv.regression(x, y) { (x, y) => lasso(…) }
```

### Bootstrap

```scala
val boot = bootstrap.classification(k = 100, x, y) { (x, y) => knn(x, y, 5) }
val boot = bootstrap.regression(k = 100, formula, data) { (f, df) => gbm(f, df) }
```

### Individual metric functions

```scala
import smile.validation.*

// Classification
val cm   = confusion(truth, predictions)
val acc  = accuracy(truth, predictions)
val rec  = recall(truth, predictions)
val prec = precision(truth, predictions)
val f1   = f1(truth, predictions)
val auc  = auc(truth, probabilities)
val ll   = logloss(truth, probabilities)
val ce   = crossentropy(truth, probMatrix)
val mcc  = mcc(truth, predictions)
val sens = sensitivity(truth, predictions)
val spec = specificity(truth, predictions)
val fo   = fallout(truth, predictions)
val fdr  = fdr(truth, predictions)

// Regression
val mseVal  = mse(truth, predictions)
val rmseVal = rmse(truth, predictions)
val rssVal  = rss(truth, predictions)
val madVal  = mad(truth, predictions)

// Clustering
val ri     = randIndex(labels1, labels2)
val ari    = adjustedRandIndex(labels1, labels2)
val nmiVal = nmi(labels1, labels2)
```

---

## Plotting

The module includes two complementary plot APIs:

- **`smile.plot.swing`** — traditional Swing-based `Canvas` charts for desktop use.
- **`smile.plot.vega`** — Vega-Lite declarative charts for notebooks and browser-based output.

### Displaying a chart (`show`)

```scala
import smile.plot.*

// Render a Canvas in a JFrame (desktop) or as HTML (notebook)
show(canvas)
show(multiFigurePane)
show(vegaLiteSpec)
```

In Scala 2.13 notebook environments, the `show` implicit calls are backed by macros that detect Zeppelin/Databricks context at compile time and emit inline HTML instead of opening a Swing window.

### Swing plots (`smile.plot.swing.*`)

Every chart returns a `Canvas` that can be passed to `show(…)`.

```scala
import smile.plot.swing.*

// Scatter plot
val c = plot(x, y, '.')             // Array[Double] x and y
val c = plot(data, labels, marks)   // colour-coded by class label

// Scatter-plot matrix
val c = splom(data, marks, colNames)

// Line plot
val c = line(x, y)
val c = staircase(x, y)

// Box plot
val c = boxplot(data)
val c = boxplot(groups, names)

// Histogram
val c = hist(data)
val c = hist(data, bins = 20)
val c = hist3(x, y, bins = 20)

// Q-Q plot
val c = qqplot(data)                 // vs normal
val c = qqplot(x, y)                 // two-sample
val c = qqplot(data, distribution)   // vs arbitrary distribution

// Heatmap and sparse matrix spy plot
val c = heatmap(matrix)
val c = spy(sparseMatrix)
val c = hexmap(data)

// Contour and surface
val c = contour(x, y, z)
val c = surface(z)
val c = wireframe(vertices, edges)
val c = grid(ax, ay, az)

// Dendrogram
val c = dendrogram(hierarchicalClustering)

// Scree plot (PCA)
val c = screeplot(pca)

// Text annotations
val c = text(coords, labels)
```

### Vega-Lite charts (`smile.plot.vega.*`)

Build declarative specs using a fluent Scala API. The `VegaLite` companion object is the entry point.
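The fluent style works by having each configuration call return an updated spec, so calls chain. A minimal self-contained sketch of the pattern (illustrative only; `Spec`, `withMark`, `withX`, and `withY` are made-up names, not the `VegaLite` API):

```scala
object FluentDemo {
  // Each call returns a new immutable Spec via case-class copy,
  // so configuration reads as a single chained expression.
  final case class Spec(mark: String = "", x: String = "", y: String = "") {
    def withMark(m: String): Spec = copy(mark = m)
    def withX(field: String): Spec = copy(x = field)
    def withY(field: String): Spec = copy(y = field)
  }

  def view(): Spec = Spec()

  def main(args: Array[String]): Unit = {
    val spec = view().withMark("point").withX("sepalLength").withY("petalLength")
    println(spec)
  }
}
```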
```scala
import smile.plot.vega.*

// Single view
val view = VegaLite.view()
  .mark("point")
  .x(Field("sepalLength", "quantitative"))
  .y(Field("petalLength", "quantitative"))
  .color(Field("species", "nominal"))
  .data(irisDataFrame)
show(view)

// Layered chart (multiple marks in the same coordinate system)
val chart = VegaLite.layer(view1, view2)

// Faceted chart
val faceted = VegaLite.facet(view).row("origin").column("cylinders")

// Concatenated charts
val hcat = VegaLite.hconcat(view1, view2, view3)
val vcat = VegaLite.vconcat(view1, view2)

// Scatter-plot matrix
val splomChart = VegaLite.splom(irisDataFrame)

// Fluent global properties
val chart = VegaLite.view()
  .background("#f5f5f5")
  .padding(10)
  .config(JsObject("view" -> JsObject("stroke" -> JsString("transparent"))))
```

---

## Utility Helpers

### `time` — measure and log execution time

```scala
import smile.util.time

// Block form — returns the value, logs elapsed time with a label
val model = time("Random Forest") {
  randomForest(formula, data, ntrees = 500)
}

// Toggle output
time.on()    // enable timing output (default)
time.off()   // suppress timing output
time.echo    // check current state
```

### Implicit Java function converters

```scala
import smile.util.{toJavaFunction, toJavaBiFunction}

// Convert Scala lambdas to java.util.function types automatically
val jf: java.util.function.Function[Int, String] = (i: Int) => i.toString
val jbf: java.util.function.BiFunction[Int, Int, Int] = (a: Int, b: Int) => a + b
```

These conversions are applied automatically wherever SMILE's Java API requires a `Function` or `BiFunction` — for example, when passing trainers to `ovo`, `ovr`, `validate.classification`, or `cv.classification`.

---

## Complete Examples

### Example 1 — Load data and train a classifier

```scala
import smile.io.*
import smile.data.formula.*
import smile.classification.*
import smile.validation.*

val df = read.csv("iris.csv")
val formula = "class" ~ "."
// 5-fold cross-validation on a random forest
val result = cv.classification(k = 5, formula, df) { (f, d) =>
  randomForest(f, d, ntrees = 100)
}
println(f"CV accuracy: ${result.avg.accuracy * 100}%.1f%%")
```

### Example 2 — Text classification pipeline

```scala
import smile.io.*
import smile.nlp.*
import smile.classification.*
import smile.validation.*

val texts = Array("great product", "terrible service", "very happy")
val labels = Array(1, 0, 1)

// Build vocabulary from training data
val bags = texts.map(_.bag())
val vocab = bags.flatMap(_.keys).distinct.sorted

// Vectorise
val x = bags.map(b => vectorize(vocab, b))
val y = labels

// Train and evaluate
val result = cv.classification(k = 3, x, y) { (x, y) => logit(x, y, lambda = 0.01) }
println(result.avg.accuracy)
```

### Example 3 — Regression with cross-validation

```scala
import smile.io.*
import smile.data.formula.*
import smile.regression.*
import smile.validation.*

val longley = read.arff("data/regression/longley.arff")
val formula = "Employed" ~ "."

val cv5 = cv.regression(k = 5, formula, longley) { (f, df) => lm(f, df) }
println(f"RMSE: ${cv5.avg.rmse}%.4f")
```

### Example 4 — Gaussian Process Regression with Nyström approximation

```scala
import smile.regression.gpr
import smile.math.kernel.GaussianKernel

val kernel = new GaussianKernel(sigma = 1.0)

// Inducing inputs (e.g. k-means centroids of x)
import smile.clustering.*
val km = kmeans(x, k = 200)
val t = km.centroids

val model = gpr.nystrom(x, y, t, kernel, noise = 0.01, normalize = true)
val predictions = x.map(model.predict)
```

### Example 5 — NLP keyword extraction

```scala
import smile.nlp.*

val text = """
  Machine learning is a field of artificial intelligence.
  It enables computers to learn from experience
  without being explicitly programmed.
"""

val keywords = text.keywords(k = 5)
keywords.foreach(ng => println(ng.words.mkString(" ")))
```

### Example 6 — Manifold learning and visualization

```scala
import smile.io.*
import smile.manifold.*
import smile.plot.swing.*
import smile.plot.*

val x = read.csv("mnist.csv").toArray   // … high-dimensional data
val embedding = umap(x, k = 15, d = 2)

val canvas = plot(embedding.map(_(0)), embedding.map(_(1)), '.')
show(canvas)
```

### Example 7 — Symbolic differentiation with CAS

```scala
import smile.cas.*

val x: Var = "x"
val y: Var = "y"

// Define f(x, y) = x² y + sin(x) y
val f = (x ** 2) * y + sin(x) * y

// Partial derivatives
val df_dx = f.d("x").simplify   // 2 x y + cos(x) y
val df_dy = f.d("y").simplify   // x² + sin(x)

println(df_dx)
println(df_dy)

// Evaluate at x = 1, y = 2
val env = Map("x" -> 1.0, "y" -> 2.0)
println(df_dx.apply(env))
```

---

## Notable Differences from the Kotlin Shim

| Aspect | Scala | Kotlin |
|---|---|---|
| Extension mechanism | Implicit classes (`PimpedXxx`) via `implicit def` | Extension functions |
| Formula DSL | Rich operator DSL: `~`, `+`, `-`, `::`, `&&`, `^`, function terms | Not present |
| CAS | Full symbolic algebra (`smile.cas`) | Not present |
| Plotting | Both Swing (`smile.plot.swing`) and Vega-Lite (`smile.plot.vega`) | Not present |
| Notebook rendering | Macro-detected at compile time (Scala 2.13) | N/A |
| Validation API | Object-based: `validate`, `cv`, `loocv`, `bootstrap` | Top-level functions |
| Sequence models | HMM, CRF, GCRF | Not present |
| Operator DSL | `%*%` (dot/matmul), `\` (solve), `~` (slice) | N/A |
| Array slicing | `0 ~ 9 ~ 2` (Python-like with step) | N/A |
| `gpr` namespace | `object gpr { apply, approx, nystrom }` | `object gpr` (same) |

Both shims expose the same underlying Java algorithms.
The Kotlin shim focuses on function-level conciseness; the Scala shim additionally provides a richer operator language and is more appropriate for exploratory notebook workflows that involve linear algebra, symbolic math, and interactive visualization.

---

*SMILE — Copyright © 2010–2026 Haifeng Li. GNU GPL licensed.*