# Architecture

Tribuo is a library for creating Machine Learning (ML) models and for using those models to make predictions on previously unseen data. An ML model is the result of applying some training algorithm to a dataset. Most commonly, such algorithms produce output in the form of a large number of floating point values; however, this output may take one of many different forms, such as a tree-structured if/else statement. In Tribuo, a model includes not only this output, but also the necessary feature and output statistics to map from the named feature space into Tribuo's ids, and from Tribuo's output ids into the named output space.

A Tribuo `Model` can also be thought of as a learned mapping from a *sparse* feature space of doubles to a *dense* output space (e.g., class label probabilities or regressed outputs). Every dimension of the input and output is named. This naming system makes it possible to check that the input and the model agree on the feature space they are using.
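
To make this naming concrete, here is a minimal sketch (using the classification `Label` output type; the feature names and label value are invented for illustration) that builds a single `Example` from named features:

```java
import org.tribuo.Feature;
import org.tribuo.classification.Label;
import org.tribuo.impl.ArrayExample;

public class NamedFeatureSketch {
    public static void main(String[] args) {
        // An Example pairs an Output (here a classification Label) with a list of
        // named Features; features that are not present are simply absent, which is
        // what makes the input space sparse.
        ArrayExample<Label> example = new ArrayExample<>(new Label("spam"));
        example.add(new Feature("word-count", 42.0));
        example.add(new Feature("contains-url", 1.0));

        // Both the features and the output are addressed by name, not by position.
        for (Feature f : example) {
            System.out.println(f.getName() + " = " + f.getValue());
        }
    }
}
```

Because both sides are named, a `Model` can check at prediction time whether an incoming `Example`'s feature names match the feature space it was trained on.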

## Data flow overview

*(Figure: Tribuo architecture diagram)*

Tribuo loads data using a `DataSource` implementation, which might load from a location like a DB or a file on disk. This `DataSource` processes the input data, converting it into Tribuo's storage format, an `Example`. An `Example` is a tuple of an `Output` (i.e., what you want to predict) and a list of `Feature`s, where each `Feature` is a tuple of a `String` feature name and a `double` feature value.

The `DataSource` is then read into a `Dataset`, which accumulates statistics about the data for future use in model construction. Datasets can be split into chunks to separate out training and testing data, or to filter out examples according to some criterion. As `Example`s are fed into a `Dataset`, the `Feature`s are observed and have their statistics recorded in a `FeatureMap`. Similarly, the `Output`s are recorded in the appropriate `OutputInfo` subclass for the specified `Output` subclass.

Once the `Dataset` has been processed, it's passed to a `Trainer`, which contains the training algorithm along with any necessary parameter values (in ML these are called hyperparameters, to differentiate them from the learned model parameters). The `Trainer` performs some iterations of the training algorithm before producing the `Model`. A `Model` contains the learned parameters necessary to make predictions, along with a `Provenance` object which records how the `Model` was constructed (e.g., data file name, data hash, trainer hyperparameters, timestamp, etc.). Both `Model`s and `Dataset`s can be serialized out to disk using Java serialization.

Once a model has been trained, it can be fed previously unseen `Example`s to produce `Prediction`s of their `Output`s. If the new `Example`s have known `Output`s, then the `Prediction`s can be passed to an `Evaluator`, which calculates statistics like the accuracy (i.e., the proportion of examples for which the predicted output matches the provided output).
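
As a concrete illustration of this flow, the sketch below trains and evaluates a classifier. It is a minimal example rather than the only way to wire these pieces together: the CSV file name and response column are hypothetical, and `LogisticRegressionTrainer` stands in for any Tribuo `Trainer`.

```java
import java.io.ObjectOutputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.tribuo.DataSource;
import org.tribuo.Model;
import org.tribuo.MutableDataset;
import org.tribuo.Prediction;
import org.tribuo.classification.Label;
import org.tribuo.classification.LabelFactory;
import org.tribuo.classification.evaluation.LabelEvaluation;
import org.tribuo.classification.evaluation.LabelEvaluator;
import org.tribuo.classification.sgd.linear.LogisticRegressionTrainer;
import org.tribuo.data.csv.CSVLoader;
import org.tribuo.evaluation.TrainTestSplitter;

public class DataFlowSketch {
    public static void main(String[] args) throws Exception {
        // DataSource: read a CSV file, using LabelFactory to turn the (hypothetical)
        // "response" column into Label outputs.
        CSVLoader<Label> loader = new CSVLoader<>(new LabelFactory());
        DataSource<Label> source = loader.loadDataSource(Paths.get("data.csv"), "response");

        // Split the source into training and testing chunks (70/30), then read each
        // chunk into a Dataset, which records feature and output statistics.
        TrainTestSplitter<Label> splitter = new TrainTestSplitter<>(source, 0.7, 1L);
        MutableDataset<Label> train = new MutableDataset<>(splitter.getTrain());
        MutableDataset<Label> test = new MutableDataset<>(splitter.getTest());

        // Trainer: the training algorithm plus its hyperparameters.
        Model<Label> model = new LogisticRegressionTrainer().train(train);

        // The Model carries a Provenance describing how it was built.
        System.out.println(model.getProvenance());

        // Models (and Datasets) can be written out with Java serialization.
        try (ObjectOutputStream oos = new ObjectOutputStream(Files.newOutputStream(Paths.get("model.ser")))) {
            oos.writeObject(model);
        }

        // Prediction: apply the Model to a previously unseen Example.
        Prediction<Label> prediction = model.predict(test.getExample(0));
        System.out.println("Predicted: " + prediction.getOutput().getLabel());

        // Evaluator: compare the Model's Predictions against the known Outputs.
        LabelEvaluation evaluation = new LabelEvaluator().evaluate(model, test);
        System.out.println("Accuracy: " + evaluation.accuracy());
    }
}
```

The same skeleton applies to the other prediction tasks by swapping in the task-specific factory, trainer and evaluator.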

## Structure

Tribuo includes several top-level modules:

- Core provides Tribuo's core classes and interfaces.
- Data provides loaders for text, SQL and CSV data, along with the columnar package which provides infrastructure for working with columnar data.
- Math provides Tribuo's linear algebra library, along with kernels and gradient optimizers.
- JSON provides a JSON data loader and a tool to strip provenance from trained models.

Tribuo has separate modules for each prediction task (a short sketch constructing each task's `Output` type follows the list):

- Classification contains an `Output` implementation called `Label`, which represents a multi-class classification. A `Label` is a tuple of a String name and a double precision score value. For each of `OutputFactory`, `OutputInfo`, `Evaluator` and `Evaluation`, the Classification package includes a classification-specific implementation, namely `LabelFactory`, `LabelInfo`, `LabelEvaluator` and `LabelEvaluation`, respectively.
- Regression contains an `Output` implementation called `Regressor`, which represents multidimensional regression. Each `Regressor` is a tuple of dimension names, double precision dimension values, and double precision dimension variances. It has companion implementations of `OutputFactory`, `OutputInfo`, `Evaluator` and `Evaluation` called `RegressionFactory`, `RegressionInfo`, `RegressionEvaluator` and `RegressionEvaluation`, respectively. By default, the dimensions are named "DIM-x", where x is a non-negative integer.
- AnomalyDetection contains an `Output` implementation called `Event`, which represents the detection of an anomalous or expected event (represented by the `EventType` enum, containing `ANOMALOUS` and `EXPECTED`). Each `Event` is a tuple of an `EventType` instance and a double precision score value representing the score of the event type. The AnomalyDetection package has companion implementations of `OutputFactory`, `OutputInfo`, `Evaluator` and `Evaluation` called `AnomalyFactory`, `AnomalyInfo`, `AnomalyEvaluator` and `AnomalyEvaluation`, respectively.
- Clustering contains an `Output` implementation called `ClusterID`, which represents the assigned cluster id number. Each `ClusterID` is a non-negative integer id number and a double precision score representing the strength of association. The Clustering package has companion implementations of `OutputFactory`, `OutputInfo`, `Evaluator` and `Evaluation` called `ClusteringFactory`, `ClusteringInfo`, `ClusteringEvaluator` and `ClusteringEvaluation`, respectively.
- MultiLabel contains an `Output` implementation called `MultiLabel`, which represents a multi-label classification. Each `MultiLabel` is a possibly empty set of `Label` instances with their associated scores. The MultiLabel package has companion implementations of `OutputFactory`, `OutputInfo`, `Evaluator` and `Evaluation` called `MultiLabelFactory`, `MultiLabelInfo`, `MultiLabelEvaluator` and `MultiLabelEvaluation`, respectively. It also has a `Trainer` which accepts a `Trainer<Label>` and uses it to train an independent binary classifier for each label, combining the per-label predictions into a single `MultiLabel`.
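
As a rough illustration of the different `Output` types, the sketch below constructs one instance of each. The names, scores and ids are invented, and each constructor shown is only one of several available:

```java
import java.util.Arrays;
import java.util.HashSet;

import org.tribuo.anomaly.Event;
import org.tribuo.classification.Label;
import org.tribuo.clustering.ClusterID;
import org.tribuo.multilabel.MultiLabel;
import org.tribuo.regression.Regressor;

public class OutputTypesSketch {
    public static void main(String[] args) {
        // Multi-class classification: a named label (optionally with a score).
        Label label = new Label("sports");

        // Multi-dimensional regression: parallel arrays of dimension names and values.
        Regressor regressor = new Regressor(
                new String[]{"DIM-0", "DIM-1"},
                new double[]{0.7, -1.3});

        // Anomaly detection: an EventType (optionally with a score).
        Event anomaly = new Event(Event.EventType.ANOMALOUS);

        // Clustering: a non-negative cluster id number.
        ClusterID cluster = new ClusterID(3);

        // Multi-label classification: a (possibly empty) set of Labels.
        MultiLabel multiLabel = new MultiLabel(
                new HashSet<>(Arrays.asList(new Label("sports"), new Label("news"))));

        System.out.println(label + " " + regressor + " " + anomaly + " " + cluster + " " + multiLabel);
    }
}
```

Each of these types has a matching `OutputFactory` (e.g., `LabelFactory`, `RegressionFactory`) which a `DataSource` uses to convert raw response values into the appropriate `Output`.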