# Architecture
Tribuo is a library for creating Machine Learning (ML) models and for using
those models to make predictions on previously unseen data.
An ML model is the result of applying some training algorithm to a dataset. Most
commonly, such algorithms produce output in the form of a large number of
floating point values; however, this output may take one of many different
forms, such as a tree-structured if/else statement. In Tribuo, a
model includes not only this output, but also the necessary feature and output
statistics to map from the named feature space into Tribuo's ids, and from
Tribuo's output ids into the named output space.
A Tribuo `Model` can also be thought of as a learned mapping from a *sparse*
feature space of doubles to a *dense* output space (e.g., of class label
probabilities or regressed outputs). Every dimension of the input and
output is named. This naming system makes it possible to check that the input
and model agree on the feature space they are using.
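The mapping from named sparse features to named dense outputs can be sketched in plain Java. This is a minimal illustration only: `NamedLinearModelSketch` and its methods are hypothetical names invented for this example, not Tribuo classes, and a linear scorer is assumed purely for simplicity.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a linear model; this is not Tribuo's Model class.
class NamedLinearModelSketch {
    private final Map<String, Integer> featureIds = new HashMap<>(); // feature name -> id
    private final String[] outputNames;                              // output id -> name
    private final double[][] weights;                                // [output id][feature id]

    NamedLinearModelSketch(String[] featureNames, String[] outputNames, double[][] weights) {
        for (int i = 0; i < featureNames.length; i++) {
            featureIds.put(featureNames[i], i);
        }
        this.outputNames = outputNames.clone();
        this.weights = weights;
    }

    // Scores a sparse example: only the features present in the map are visited.
    Map<String, Double> predict(Map<String, Double> sparseExample) {
        double[] scores = new double[outputNames.length];
        int known = 0;
        for (Map.Entry<String, Double> f : sparseExample.entrySet()) {
            Integer id = featureIds.get(f.getKey());
            if (id == null) {
                continue; // feature name unseen at training time, so there is no weight for it
            }
            known++;
            for (int o = 0; o < scores.length; o++) {
                scores[o] += weights[o][id] * f.getValue();
            }
        }
        if (known == 0) {
            throw new IllegalArgumentException("Example shares no feature names with the model");
        }
        Map<String, Double> named = new HashMap<>();
        for (int o = 0; o < scores.length; o++) {
            named.put(outputNames[o], scores[o]); // dense: every output dimension gets a value
        }
        return named;
    }
}
```

Because both sides are keyed by name, a mismatch between the input and the model's feature space shows up as unknown names rather than as silently misaligned indices.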
## Data flow overview
Tribuo loads data using a `DataSource` implementation, which might load from a
location like a DB or a file on disk. This DataSource processes the input
data, converting it into Tribuo's storage format, an `Example`. An Example is a
tuple of an `Output` (i.e., what you want to predict) and a list of Features,
where each `Feature` is a tuple of a `String` feature name and a `double`
feature value. The DataSource is then read into a `Dataset`, which accumulates
statistics about the data for future use in model construction. Datasets can be
split into chunks to separate out training and testing data, or to filter out
examples according to some criterion. As Examples are fed into a Dataset, the
Features are observed and have their statistics recorded in a `FeatureMap`.
Similarly the Outputs are recorded in the appropriate `OutputInfo` subclass for
the specified Output subclass. Once the Dataset has been processed, it's passed
to a `Trainer`, which contains the training algorithm along with any necessary
parameter values (in ML these are called hyperparameters to differentiate them
from the learned model parameters), and the Trainer performs some iterations of
the training algorithm before producing the `Model`. A Model contains the
necessary learned parameters to make predictions along with a `Provenance`
object which records how the Model was constructed (e.g., data file name, data
hash, trainer hyperparameters, time stamp, etc). Both Models and Datasets can
be serialized out to disk using Java Serialization. Once a model has been
trained, it can be fed previously unseen Examples to produce `Prediction`s of
their Outputs. If the new Examples have known Outputs, then the Predictions can
be passed to an `Evaluator`, which calculates statistics like the accuracy
(i.e., the fraction of predictions that match the provided output).
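Two steps of the flow above, recording feature statistics as examples are added and computing accuracy from predictions, can be sketched with standard-library types. The `Ex` and `CountingDataset` classes are hypothetical simplifications written for this illustration and do not reflect Tribuo's actual `Example`, `Dataset` or `FeatureMap` APIs.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class DataFlowSketch {
    // Hypothetical example type: an output plus named feature values.
    static final class Ex {
        final String output;
        final Map<String, Double> features;

        Ex(String output, Map<String, Double> features) {
            this.output = output;
            this.features = features;
        }
    }

    // Hypothetical dataset that records how often each feature name is observed.
    static final class CountingDataset {
        final List<Ex> examples = new ArrayList<>();
        final Map<String, Integer> featureCounts = new HashMap<>();

        void add(Ex e) {
            examples.add(e);
            for (String name : e.features.keySet()) {
                featureCounts.merge(name, 1, Integer::sum);
            }
        }
    }

    // Accuracy: the fraction of predictions that equal the known output.
    static double accuracy(List<String> predicted, List<Ex> truth) {
        if (truth.isEmpty() || predicted.size() != truth.size()) {
            throw new IllegalArgumentException("Need one prediction per example");
        }
        int correct = 0;
        for (int i = 0; i < truth.size(); i++) {
            if (predicted.get(i).equals(truth.get(i).output)) {
                correct++;
            }
        }
        return (double) correct / truth.size();
    }
}
```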
## Structure
Tribuo includes several top level modules:
- Core provides Tribuo's core classes and interfaces.
- Data provides loaders for text, SQL and CSV data, along with the columnar
package which provides infrastructure for working with columnar data.
- Math provides Tribuo's linear algebra library, along with kernels and
gradient optimizers.
- JSON provides a JSON data loader and a tool to strip provenance from trained
models.
Tribuo has separate modules for each prediction task:
- Classification contains an `Output` implementation called `Label`, which
represents a multi-class classification. A `Label` is a tuple of a String
name, and a double precision score value. For each of `OutputFactory`,
`OutputInfo`, `Evaluator` and `Evaluation`, the Classification package includes
a classification-specific implementation, namely `LabelFactory`, `LabelInfo`,
`LabelEvaluator` and `LabelEvaluation`, respectively.
- Regression contains an `Output` implementation called `Regressor`, which
represents multidimensional regression. Each `Regressor` is a tuple of
dimension names, double precision dimension values, and double precision
dimension variances. It has companion implementations of `OutputFactory`,
`OutputInfo`, `Evaluator` and `Evaluation` called `RegressionFactory`,
`RegressionInfo`, `RegressionEvaluator` and `RegressionEvaluation`,
respectively. By default, the dimensions are named "DIM-x" where x is a
non-negative integer.
- AnomalyDetection contains an `Output` implementation called `Event`, which
represents the detection of an anomalous or expected event (represented by
the `EventType` enum containing `ANOMALY` and `EXPECTED`). Each `Event` is a
tuple of an `EventType` instance and a double precision score value,
representing the score of the event type. The AnomalyDetection package has
companion implementations of `OutputFactory`, `OutputInfo`, `Evaluator` and
`Evaluation` called `AnomalyFactory`, `AnomalyInfo`, `AnomalyEvaluator` and
`AnomalyEvaluation`, respectively.
- Clustering contains an `Output` implementation called `ClusterID`, which
represents the cluster id number assigned. Each `ClusterID` is a
non-negative integer id number and a double precision score representing the
strength of association. The Clustering package has companion implementations
of `OutputFactory`, `OutputInfo`, `Evaluator` and `Evaluation` called
`ClusteringFactory`, `ClusteringInfo`, `ClusteringEvaluator` and
`ClusteringEvaluation`, respectively.
- MultiLabel contains an `Output` implementation called `MultiLabel`, which
represents a multi-label classification. Each `MultiLabel` is a possibly
empty set of `Label` instances with their associated scores. The MultiLabel
package has companion implementations of `OutputFactory`, `OutputInfo`,
`Evaluator` and `Evaluation` called `MultiLabelFactory`, `MultiLabelInfo`,
`MultiLabelEvaluator` and `MultiLabelEvaluation`, respectively. It also has a
`Trainer<MultiLabel>` which accepts a `Trainer<Label>` and generates a
`Model<MultiLabel>` by using the inner trainer to make independent predictions
for each `Label`. This is a reasonable baseline strategy to use for multi-label
problems.
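The independent-prediction strategy described for MultiLabel (often called binary relevance) can be sketched as follows. Everything here is a hypothetical illustration: `BinaryTrainer`, `TOY` and the use of feature-name sets are assumptions made to keep the example self-contained, and none of it is Tribuo's API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.Predicate;

class BinaryRelevanceSketch {
    // Hypothetical inner trainer: learns a yes/no predictor from binary targets.
    interface BinaryTrainer {
        Predicate<Set<String>> train(List<Set<String>> features, List<Boolean> targets);
    }

    // Toy inner trainer: predicts "yes" if the example has a feature seen only in positives.
    static final BinaryTrainer TOY = (features, targets) -> {
        Set<String> positive = new HashSet<>();
        Set<String> negative = new HashSet<>();
        for (int i = 0; i < features.size(); i++) {
            (targets.get(i) ? positive : negative).addAll(features.get(i));
        }
        positive.removeAll(negative);
        return example -> {
            for (String f : example) {
                if (positive.contains(f)) {
                    return true;
                }
            }
            return false;
        };
    };

    // Trains one independent binary predictor per label in the label domain.
    static Map<String, Predicate<Set<String>>> train(BinaryTrainer inner,
                                                     List<Set<String>> features,
                                                     List<Set<String>> labelSets,
                                                     Set<String> labelDomain) {
        Map<String, Predicate<Set<String>>> perLabel = new HashMap<>();
        for (String label : labelDomain) {
            List<Boolean> targets = new ArrayList<>();
            for (Set<String> labels : labelSets) {
                targets.add(labels.contains(label));
            }
            perLabel.put(label, inner.train(features, targets));
        }
        return perLabel;
    }

    // The predicted label set is the (possibly empty) set of labels whose predictor fires.
    static Set<String> predict(Map<String, Predicate<Set<String>>> model, Set<String> features) {
        Set<String> out = new TreeSet<>();
        for (Map.Entry<String, Predicate<Set<String>>> e : model.entrySet()) {
            if (e.getValue().test(features)) {
                out.add(e.getKey());
            }
        }
        return out;
    }
}
```

Each label's predictor never sees the other labels, which is what makes this a simple baseline: correlations between labels are ignored.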
Finally, there are cross-cutting module collections:
- Common provides shared infrastructure for the prediction tasks.
- Interop provides infrastructure for working with large external libraries
like TensorFlow and ONNX Runtime.
- Util provides independent libraries that Tribuo uses for specific tasks. For
example, InformationTheory is a library of information theoretic functions,
and Tokens provides the interface Tribuo uses for tokenization along with
implementations of several tokenizers.
## Configuration, Options and Provenance
Many of Tribuo's trainers, datasources and other classes implement the
`Configurable` interface. This is provided by
[OLCUT](https://github.com/oracle/olcut), and allows for runtime configuration
of classes based on configuration files written in a variety of formats. The
default format is XML. Other available formats include JSON, protobuf and edn.
The configuration system is integrated into the command line arguments
`Options` system built into OLCUT's `ConfigurationManager`. Values in
configuration files can be overridden on the command line by supplying
`--@<object-name>.<field-name> <value>` in the arguments. The configuration
system provides the basis of Tribuo's model tracking `Provenance` system, which
records all hyperparameters, dataset parameters (e.g., file location, train/test
split, etc.), and any user-supplied instance information, along with run
specific information such as the file hash, number of training examples, etc.
A model provenance can be converted into a list of configurations for each
`Configurable` object involved in the model training. Similarly, an evaluation
provenance can be converted into the configurations for the model as well as
the configurations for the test dataset. These configurations can be loaded
into a fresh `ConfigurationManager` and optionally saved to disk. The
evaluation or model training can then be repeated or rerun with tweaks like new
data or a hyperparameter change.
Configurable classes have `@Config` annotations on their fields, and such
fields have the value from the configuration file inserted into them upon
construction in the configuration system. A snippet from the classification SGD
trainer is given below to illustrate this:
```java
public class LinearSGDTrainer implements Trainer<Label>, WeightedExamples {
    @Config(description="The classification objective function to use.")
    private LabelObjective objective = new LogMulticlass();

    @Config(description="The gradient optimiser to use.")
    private StochasticGradientOptimiser optimiser = new AdaGrad(1.0,0.1);

    @Config(description="The number of gradient descent epochs.")
    private int epochs = 5;

    @Config(description="Log values after this many updates.")
    private int loggingInterval = -1;

    @Config(description="Minibatch size in SGD.")
    private int minibatchSize = 1;

    @Config(description="Seed for the RNG used to shuffle elements.")
    private long seed = Trainer.DEFAULT_SEED;

    @Config(description="Shuffle the data before each epoch. Only turn off for debugging.")
    private boolean shuffle = true;

    // Internal state: not configurable, so not annotated with @Config.
    private SplittableRandom rng;
    private int trainInvocationCounter;
}
```
Only fields which are configured need to be annotated `@Config`. Other fields
can be set in the appropriate constructor. OLCUT requires that all classes
which implement `Configurable` have a no-args constructor. The `Configurable`
interface allows for a `postConfig` method, which is called after the object
has been constructed and the appropriate field values inserted, but before it
is published or returned from the `ConfigurationManager`. This `postConfig`
method is used to perform the validation that would normally be performed in a
constructor, and it can be called from regular constructors. Default values
for the configurable parameters can be specified in the same way default fields
are usually specified. The `@Config` annotation has optional parameters for
supplying the description, declaring whether the field is mandatory, and
determining whether the field value should be redacted from any configuration
or provenance based on this object. More details about OLCUT can be found in
its [documentation](https://github.com/oracle/OLCUT).
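The construction order described above (no-args constructor, then field injection, then `postConfig`) can be imitated with plain reflection. This is only a sketch of the lifecycle under simplifying assumptions: `Cfg`, `Cfgable`, `ToyTrainer` and `build` are hypothetical stand-ins written for this example, only `int` fields are handled, and OLCUT's real implementation is considerably richer.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Constructor;
import java.lang.reflect.Field;
import java.util.Map;

class ConfigLifecycleSketch {
    // Hypothetical stand-in for a configuration annotation.
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.FIELD)
    @interface Cfg {
        boolean mandatory() default false;
    }

    // Hypothetical stand-in for a configurable interface with a post-configuration hook.
    interface Cfgable {
        default void postConfig() {}
    }

    static final class ToyTrainer implements Cfgable {
        @Cfg(mandatory = true)
        private int epochs;
        @Cfg
        private int minibatchSize = 1;  // default value, used when the config omits the field
        private int invocationCounter;  // internal state, so not annotated

        ToyTrainer() {}                 // no-args constructor used by the configuration system

        ToyTrainer(int epochs, int minibatchSize) {
            this.epochs = epochs;
            this.minibatchSize = minibatchSize;
            postConfig();               // the regular constructor reuses the same validation
        }

        @Override
        public void postConfig() {
            if (epochs < 1 || minibatchSize < 1) {
                throw new IllegalArgumentException("epochs and minibatchSize must be positive");
            }
        }

        int epochs() { return epochs; }
        int minibatchSize() { return minibatchSize; }
    }

    // Builds an object: no-args construction, then field injection, then postConfig.
    static <T extends Cfgable> T build(Class<T> type, Map<String, Integer> values) {
        try {
            Constructor<T> ctor = type.getDeclaredConstructor();
            ctor.setAccessible(true);
            T obj = ctor.newInstance();
            for (Field f : type.getDeclaredFields()) {
                Cfg cfg = f.getAnnotation(Cfg.class);
                if (cfg == null) {
                    continue; // unannotated fields are left alone
                }
                Integer v = values.get(f.getName());
                if (v != null) {
                    f.setAccessible(true);
                    f.setInt(obj, v);
                } else if (cfg.mandatory()) {
                    throw new IllegalArgumentException("Missing mandatory field: " + f.getName());
                }
            }
            obj.postConfig(); // validation runs before the object is handed back
            return obj;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }
}
```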
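The `LinearSGDTrainer` class above is configured by the XML snippet below:
```xml
<config>
    <component name="logistic" type="org.tribuo.classification.sgd.linear.LinearSGDTrainer">
        <property name="objective" value="log"/>
        <property name="optimiser" value="adam"/>
    </component>
    <component name="log" type="org.tribuo.classification.sgd.objectives.LogMulticlass"/>
    <component name="adam" type="org.tribuo.math.optimisers.Adam">
        <property name="initialLearningRate" value="3e-4"/>
        <property name="betaOne" value="0.95"/>
    </component>
</config>
```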
This instantiates a `LinearSGDTrainer` with a logistic regression objective and
an `Adam` gradient optimiser, using [Andrei Karpathy's preferred
learning rate](https://twitter.com/karpathy/status/801621764144971776) and an
adjusted beta one parameter (note these parameters are just demonstration
values, we're not recommending these specific values).
As configuration is part of the class file rather than the public documented API
(because it operates on private fields), OLCUT ships with a CLI utility for
inspecting a configurable class and generating an example configuration in any
supported configuration format. To use this utility from the command line you
can run:
```shell
$ java -cp <classpath> com.oracle.labs.mlrg.olcut.config.DescribeConfigurable -n <class-name> -o -e xml
```
where the `-n` argument denotes what class to describe, `-o` denotes that an
example configuration should be generated, and `-e` gives the file format to
emit the example configuration in.
For example, running `DescribeConfigurable` on `LinearSGDTrainer` gives:
```shell
$ java -cp <classpath> com.oracle.labs.mlrg.olcut.config.DescribeConfigurable -n org.tribuo.classification.sgd.linear.LinearSGDTrainer -o -e xml
```
```
Class: org.tribuo.classification.sgd.linear.LinearSGDTrainer
Field Name       Type                                          Mandatory  Redact  Default                                                        Description
epochs           int                                           false      false   5                                                              The number of gradient descent epochs.
loggingInterval  int                                           false      false   -1                                                             Log values after this many updates.
minibatchSize    int                                           false      false   1                                                              Minibatch size in SGD.
objective        org.tribuo.classification.sgd.LabelObjective  false      false   LogMulticlass                                                  The classification objective function to use.
optimiser        org.tribuo.math.StochasticGradientOptimiser   false      false   AdaGrad(initialLearningRate=1.0,epsilon=0.1,initialValue=0.0)  The gradient optimiser to use.
seed             long                                          false      false   12345                                                          Seed for the RNG used to shuffle elements.
shuffle          boolean                                       false      false   true                                                           Shuffle the data before each epoch. Only turn off for debugging.
Example :
```
It's also possible to access this information programmatically, and OLCUT
offers several ways of doing so, each suited to a different use case.
## Data Loading
### Built-in formats
Tribuo supports several common input formats for loading in data:
- libsvm/svmlight - a sparse numerical format for classification and regression
tasks.
- IDX - a dense multidimensional numerical format for classification and
regression. Tribuo will transparently decompress gzipped idx files as they are
read.
- CSV - a plain text delimited format (using an RFC4180 compliant parser).
- JSON - JavaScript Object Notation. Tribuo natively reads JSON objects, which
are a map from String to primitive value. The whole file is an array of such
objects.
- SQL - Tribuo has a JDBC loader, which can query a database and convert the
result set into Tribuo `Example`s.
- text - a one-document-per-line format in which the response variable comes
before the text, separated from it by ` ## `.
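As a small illustration of the text format in the list above, the sketch below splits one line into its response and document text. `TextLineSketch` is a hypothetical helper written for this example; it is not Tribuo's text loader, and the choice to split on the first delimiter only is an assumption.

```java
class TextLineSketch {
    static final String DELIMITER = " ## ";

    // Splits "response ## document text" into {response, text} at the first delimiter.
    static String[] parse(String line) {
        int idx = line.indexOf(DELIMITER);
        if (idx < 0) {
            throw new IllegalArgumentException("No ' ## ' delimiter found in: " + line);
        }
        String response = line.substring(0, idx).trim();
        String text = line.substring(idx + DELIMITER.length());
        return new String[] {response, text};
    }
}
```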
There are two CSV loaders: A simple one for reading a CSV file (with or
without a header) where all the columns are either features or responses, and a
complex loader based on Tribuo's `RowProcessor`. The `RowProcessor` also
underlies the SQL and JSON loaders, and is extremely configurable. From v4.2
the simple `CSVLoader` wraps a `RowProcessor` to allow simple upgrading as the
CSVs become more complicated. For more details see the [Columnar
Inputs](#columnar-inputs) section below or look at the
[columnar data tutorial](https://github.com/oracle/tribuo/blob/main/tutorials/columnar-tribuo-v4.ipynb).
If there are other common formats of interest, let us know by filing an issue.
Tribuo's interfaces are extensible, and implementing another format simply
requires implementing the `DataSource` interface. We recommend using
`LibSVMDataSource` or `TextDataSource` as examples of how to implement a flat
file format. For columnar data, Tribuo has specialised processing
infrastructure. This is used for the CSV, JSON and SQL loaders, and it provides
a large amount of flexibility.
### Columnar Inputs
Columnar data sources require a configurable extraction step to map the columns
into Tribuo `Example` and `Feature` objects. A single column may contain
multiple features, may be extraneous, or may contain `Example`-level metadata.
In addition, the user must specify which column(s) contain the output variable.
To support this use case, Tribuo provides the `RowProcessor`, a configurable
mechanism for converting a `ColumnarIterator.Row` (which is a tuple of a
`Map<String,String>` and a row number) into an `Example`. The `RowProcessor`
uses four interfaces to process the input map:
- `FieldExtractor` - processes the whole row at once, extracting metadata
fields. These extracted fields, such as an `Example`'s id number, are then
written into the `Example`. As described in the javadoc, the `Example`'s weight
is handled as a special case of the metadata processing.
- `FieldProcessor` - processes a single field to produce a (possibly empty)
list of `Feature`s.
- `FeatureProcessor` - processes all the features after they have been
generated by a `FieldProcessor`. This allows for the generation of features
that depend upon multiple other features, such as conjunctions. It also
facilitates the filtering out of irrelevant, unnecessary or duplicated features.
- `ResponseProcessor` - processes the designated response fields using the
supplied `OutputFactory` to convert the field text into an `Output` instance.
These interfaces are supplied to the `RowProcessor` on construction (or
configuration). By default, `FieldProcessor`s are bound to a single column, but
there is an optional system which generates new `FieldProcessor`s based on
supplied regexes. This system can be used if the data is drawn from a
schema-less format where the presence of fields in particular documents is not
known in advance by the user. The regex system is also useful when the set of
fields is large and the number of unique `FieldProcessor`s is small. For
example, the same field processor can be applied to all columns whose name
begins with "A", thus avoiding the need to write a large configuration or code
file to describe all such columns. Although these regexes are usually
instantiated once, before any rows are processed, `RowProcessor` is
intentionally subclass-able so that developers can trigger expansion whenever
necessary. In the current implementation, there is at most one `FieldProcessor`
per field; we'll reconsider this restriction if there is sufficient interest.
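The column-to-feature mapping and the regex-based binding described above can be sketched in a few lines. `FieldProc`, `expandRegex` and `extractFeatures` are hypothetical names for this illustration and are much simpler than Tribuo's `RowProcessor` and `FieldProcessor`; the feature naming scheme is also an assumption.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

class RowProcessingSketch {
    // Hypothetical field processor: turns one field's text into named feature values.
    interface FieldProc {
        Map<String, Double> process(String column, String value);
    }

    // Emits the parsed number under the column name.
    static final FieldProc NUMERIC =
        (col, val) -> Collections.singletonMap(col, Double.valueOf(val));

    // Emits a one-hot style feature named "column=value".
    static final FieldProc CATEGORICAL =
        (col, val) -> Collections.singletonMap(col + "=" + val, Double.valueOf(1.0));

    // Binds a processor to every column matching the regex; at most one processor per column.
    static void expandRegex(String regex, FieldProc proc, Iterable<String> columns,
                            Map<String, FieldProc> bindings) {
        Pattern p = Pattern.compile(regex);
        for (String c : columns) {
            if (p.matcher(c).matches() && bindings.putIfAbsent(c, proc) != null) {
                throw new IllegalArgumentException("Column " + c + " already has a field processor");
            }
        }
    }

    // Converts one row into features; columns without a binding are ignored as extraneous.
    static Map<String, Double> extractFeatures(Map<String, String> row,
                                               Map<String, FieldProc> bindings) {
        Map<String, Double> features = new LinkedHashMap<>();
        for (Map.Entry<String, FieldProc> b : bindings.entrySet()) {
            String value = row.get(b.getKey());
            if (value != null && !value.isEmpty()) {
                features.putAll(b.getValue().process(b.getKey(), value));
            }
        }
        return features;
    }
}
```

Here a single call such as `expandRegex("A.*", NUMERIC, row.keySet(), bindings)` stands in for writing one binding per column.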
Internally, the `RowProcessor` operates on `ColumnarFeature`, which is a
feature subclass that tracks both the feature name and the column name. It's
used to allow additional flexibility in the `FeatureProcessor`s when generating
conjunction or other cross-cutting features. `ColumnarFeature`s should not be
depended on when outside the columnar processing infrastructure since the
`Example` contract does not guarantee that feature objects are preserved after
being stored in an `Example`.
If your columnar data is not in a format currently supported by Tribuo, you can
subclass `ColumnarDataSource`, provide an implementation of `ColumnarIterator`,
which converts from your input format into `ColumnarIterator.Row`, and then
configure the `RowProcessor` to extract `Example`s from your data.
### Splitting up Datasets
`DataSource`s are not designed for splitting data into chunks; however, Tribuo
provides several other mechanisms for splitting data into training and test
sets, subsampling data based on its properties, and creating cross-validation
folds. The train/test and cross-validation splits are self-explanatory, though
it's worth noting that the cross-validation splits use the feature domain of
the entire, underlying dataset. The `DatasetView` underlies the
cross-validation splits and can also be constructed using a predicate function
(or a list of indices). The predicate function accepts an `Example` and thus
can depend on the features, outputs or metadata encoded in an `Example`.
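The two kinds of split mentioned above, a seeded train/test split and a predicate-based view, can be sketched over index lists. `SplitSketch` and its methods are hypothetical and illustrate only the idea; they are not Tribuo's splitter or `DatasetView` classes.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;
import java.util.function.Predicate;

class SplitSketch {
    // Shuffles the indices with a fixed seed and returns {train indices, test indices}.
    static List<List<Integer>> trainTestSplit(int size, double trainFraction, long seed) {
        if (trainFraction < 0.0 || trainFraction > 1.0) {
            throw new IllegalArgumentException("trainFraction must be in [0, 1]");
        }
        List<Integer> idx = new ArrayList<>();
        for (int i = 0; i < size; i++) {
            idx.add(i);
        }
        Collections.shuffle(idx, new Random(seed));
        int cut = (int) Math.round(size * trainFraction);
        List<List<Integer>> parts = new ArrayList<>();
        parts.add(new ArrayList<>(idx.subList(0, cut)));
        parts.add(new ArrayList<>(idx.subList(cut, size)));
        return parts;
    }

    // A view keeps the indices of the examples that satisfy the predicate; no data is copied.
    static <T> List<Integer> viewIndices(List<T> examples, Predicate<? super T> keep) {
        List<Integer> kept = new ArrayList<>();
        for (int i = 0; i < examples.size(); i++) {
            if (keep.test(examples.get(i))) {
                kept.add(i);
            }
        }
        return kept;
    }
}
```

Storing indices rather than copies is one way a view can keep sharing the underlying dataset's feature domain.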
## Transforming datasets
Tribuo supports independent, feature-based transformations including the
rescaling or binning of features. These feature transformations can be found in
the `org.tribuo.transform` package, which provides the mechanisms for fitting
and applying transformations. Transformations can be chained to create
pipelines that are applied in the supplied sequence to the specified
feature(s). Local transformation pipelines are those that are applied only to
the given named feature(s), whereas global transformation pipelines are
applied after local pipelines and apply to every feature. Local transformation
pipelines can also be applied to features which match a regular expression
(regex). Specifically, every feature name which matches the regex receives a
copy of that transformation pipeline, and that pipeline is then applied to the
feature. An exception will be thrown if an attempt is made to apply a regex
transformation pipeline to a feature that has already received a local
transformation pipeline. Additionally, all transformations are applied to the
feature domain to ensure it maintains the proper statistics.
Currently, transformations must be based on only a single feature at a time,
but we plan to introduce global feature transformations in some future release
to allow operations over the whole feature space such as PCA.
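The fit-then-apply pattern and the local-before-global ordering can be sketched with `DoubleUnaryOperator`. `TransformSketch` is a hypothetical illustration, with min-max rescaling chosen as an example transformation; it does not use the classes in `org.tribuo.transform`.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.DoubleUnaryOperator;

class TransformSketch {
    // Fits a rescaling of the observed values onto [0, 1].
    static DoubleUnaryOperator fitMinMax(double[] observed) {
        if (observed.length == 0) {
            throw new IllegalArgumentException("Need at least one observed value");
        }
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        for (double v : observed) {
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        final double lo = min;
        final double range = max - min;
        return v -> range == 0.0 ? 0.0 : (v - lo) / range;
    }

    // Applies each feature's local pipeline in order, then the global pipeline.
    static Map<String, Double> apply(Map<String, Double> example,
                                     Map<String, List<DoubleUnaryOperator>> local,
                                     List<DoubleUnaryOperator> global) {
        Map<String, Double> out = new HashMap<>();
        for (Map.Entry<String, Double> f : example.entrySet()) {
            double v = f.getValue();
            List<DoubleUnaryOperator> pipeline =
                local.getOrDefault(f.getKey(), Collections.<DoubleUnaryOperator>emptyList());
            for (DoubleUnaryOperator t : pipeline) {
                v = t.applyAsDouble(v);
            }
            for (DoubleUnaryOperator t : global) {
                v = t.applyAsDouble(v);
            }
            out.put(f.getKey(), v);
        }
        return out;
    }
}
```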
## Weights and Metadata
Examples can have metadata attached to them, and this metadata can be used to
filter out Examples or otherwise tag them for special processing. The metadata
takes the form of a `Map<String,Object>`, which can only be appended to; the
values cannot be modified after insertion. In addition, each Example has a
float-valued weight field, which can be used to denote the importance of an
Example in a training or evaluation setting. Only training algorithms that
implement the `WeightedExamples` tag interface support weighted examples;
otherwise the example weights are ignored. The weight field is currently
supported in the `RegressionEvaluator` if the weighted evaluation flag is
turned on. We'll consider adding this support to the other evaluators, although
it may require breaking API changes since the return types of some accessor
methods could change from integer to floating point values.
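To illustrate how example weights can enter an evaluation, the sketch below computes a weighted mean absolute error. The choice of metric and the `WeightedEvalSketch` class are assumptions made for this example; this is not the `RegressionEvaluator` implementation.

```java
class WeightedEvalSketch {
    // Mean absolute error where each example contributes in proportion to its weight.
    static double weightedMae(double[] predicted, double[] truth, float[] weights) {
        if (predicted.length != truth.length || truth.length != weights.length || truth.length == 0) {
            throw new IllegalArgumentException("Arrays must be non-empty and of equal length");
        }
        double errorSum = 0.0;
        double weightSum = 0.0;
        for (int i = 0; i < truth.length; i++) {
            errorSum += weights[i] * Math.abs(predicted[i] - truth[i]);
            weightSum += weights[i];
        }
        return errorSum / weightSum;
    }
}
```

With weights, quantities that were integer counts become weighted sums, which is why accessor return types could need to change from integer to floating point.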
## Obfuscation
One of Tribuo's benefits is its extensive tracking of model metadata and
provenance; however, we realise this metadata isn't necessarily something that
should live in third-party accessible, deployed models. As a result, Tribuo
provides a few transformation mechanisms to remove metadata from a trained
model.
### Provenance
Provenance can be removed from `Model` objects using the `StripProvenance`
program located in the JSON module. There are three kinds of stored provenance:
trainer provenance, data provenance, and instance provenance. Each type of
provenance can be removed separately. It is also possible to insert a SHA-256
hash of the full provenance object into the model as a tracking mechanism. We
intend for the user to store the hash as a key for the original provenance JSON
in an external storage mechanism. Alternatively, `@Config` fields can be marked
`redact=true`, which will prevent those values from being stored in the
provenance or any configuration.
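The hash-as-key idea can be sketched with the JDK's `MessageDigest`. The `ProvenanceHashSketch` class, the in-memory map standing in for an external store, and the use of a UTF-8 JSON string as the hashed input are all assumptions for illustration; how Tribuo computes its provenance hash is not shown here.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;

class ProvenanceHashSketch {
    // Returns the lowercase hex SHA-256 digest of the UTF-8 bytes of the text.
    static String sha256Hex(String text) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder();
            for (byte b : digest) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-256 is required on every Java platform
        }
    }

    // Stores the full provenance JSON under its hash and returns the hash to embed in the model.
    static String archive(Map<String, String> store, String provenanceJson) {
        String key = sha256Hex(provenanceJson);
        store.put(key, provenanceJson);
        return key;
    }
}
```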
### Feature Hashing
In addition to its use as a dimensionality reduction technique, feature hashing
also obfuscates the original feature names in cases where the forward mapping
from original names to hashed names has not been stored by the system. So as to
avoid the storage of such a forward mapping, Tribuo provides an implementation
of feature hashing that lives entirely in the feature domain object. This means
that Tribuo has no knowledge of the true feature names, and the system
transparently hashes the inputs. The feature names tend to be particularly
sensitive when working with NLP problems. For example, without such hashing,
bigrams would appear in the feature domains.
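A minimal sketch of feature hashing follows. It uses `String.hashCode` with a salt purely for brevity, which is an assumption of this example and is not a strong obfuscation mechanism; `FeatureHashingSketch` is hypothetical and does not show Tribuo's hashing implementation.

```java
import java.util.Map;
import java.util.TreeMap;

class FeatureHashingSketch {
    // Maps a feature name to one of `dimension` buckets; the name itself is not stored.
    static int bucket(String featureName, String salt, int dimension) {
        return Math.floorMod((salt + featureName).hashCode(), dimension);
    }

    // Rewrites an example into hashed space; names that collide have their values summed.
    static Map<Integer, Double> hashExample(Map<String, Double> features, String salt, int dimension) {
        Map<Integer, Double> hashed = new TreeMap<>();
        for (Map.Entry<String, Double> f : features.entrySet()) {
            hashed.merge(bucket(f.getKey(), salt, dimension), f.getValue(), Double::sum);
        }
        return hashed;
    }
}
```

The output map contains only bucket ids, so a bigram such as "the cat" never appears in the hashed feature domain.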
## Serialization
Tribuo supports Java serialization (i.e., using `java.io.Serializable`) and from
v4.3 it supports serializing objects to protobufs. Java serialization support is
deprecated, and will be removed in the next major version. While using the Java
serialization support, we recommend the use of a serialization filter; more
information is given in our [Security documentation](Security.md).
Classes which support protobuf serialization
now implement `ProtoSerializable<T extends Message>` where the type bound gives the type of the protobuf they
serialize to. Tribuo's protobuf serialization supports
all the types that Java serialization supports, with the exception of `Example`
metadata values which previously supported any `java.io.Serializable` type, and now
only support `String` values. Helper methods to deserialize objects from protobufs
have been added to all the major interfaces of the form `<interface-name>.deserialize(<interface-name>Proto)`.
The protobuf definitions are packaged into Tribuo's jars, and the protobuf classes
are compiled using protoc `v3.19.4`. Tribuo's protobuf support includes versioning
of the protobufs to allow incremental modifications to the protobuf schemas as the
types evolve. This flexibility should allow protobuf to remain the preferred serialization
format for Tribuo without restricting the evolution of Tribuo's classes and interfaces.
As Java's generic type system is erased, the objects returned from this serialization
mechanism internally validate that the types are consistent, but users must validate
that the `Model` is of the expected type using
`Model.validate(Class<? extends Output<?>>)` or similar.
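Why a runtime check is needed after deserialization can be shown with a small generic holder. `Out`, `Lbl`, `Reg`, `Mdl` and `castChecked` are hypothetical types invented for this sketch; they only mimic the shape of the problem and are not Tribuo's `Model` or `Output` classes.

```java
class ErasureCheckSketch {
    interface Out {}
    static final class Lbl implements Out {}
    static final class Reg implements Out {}

    // Hypothetical model that records its output class, since T is erased at runtime.
    static final class Mdl<T extends Out> {
        private final Class<? extends Out> outputClass;

        Mdl(Class<T> outputClass) {
            this.outputClass = outputClass;
        }

        // True if this model's outputs can be treated as the expected type.
        boolean validate(Class<? extends Out> expected) {
            return expected.isAssignableFrom(outputClass);
        }
    }

    // A loader can only return Mdl<?>; the caller checks the type before narrowing it.
    @SuppressWarnings("unchecked")
    static <T extends Out> Mdl<T> castChecked(Mdl<?> loaded, Class<T> expected) {
        if (!loaded.validate(expected)) {
            throw new ClassCastException("Model does not produce " + expected.getSimpleName());
        }
        return (Mdl<T>) loaded;
    }
}
```

Without the check, the unchecked cast would succeed silently and the mismatch would only surface later, when an output is used as the wrong type.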
## ONNX Export
From v4.2 Tribuo supports exporting some models in the [ONNX](https://onnx.ai)
model format. The ONNX format is a cross-platform model exchange format which
can be loaded in by many machine learning libraries. Tribuo supports
inference on ONNX models via ONNX Runtime. Models which can be exported
implement the `ONNXExportable` interface, which provides methods for
constructing the ONNX protobuf and serializing it to disk. As of the release of
4.2, a subset of Tribuo's models are supported: linear models, sparse linear
models, LibSVM models, factorization machines, and ensembles thereof. We plan
to expand the set of exportable models in future releases. It is unlikely that
Tribuo will support direct ONNX export of TensorFlow models; however, this can
be achieved by saving the Tribuo-trained model in TensorFlow Saved Model
format, and then using the Python
[tf2onnx](https://github.com/onnx/tensorflow-onnx) project to convert that into
an ONNX file.
### ONNX and provenance
Tribuo-exported ONNX files contain the Tribuo model provenance, stored as a
protobuf in the metadata field "TRIBUO\_PROVENANCE". If the model is loaded
back into Tribuo via ONNX Runtime, then the model provenance can be recovered
from the file, allowing the reproducibility system and the model tracking
features to work.
### ONNX and deployment
The ONNX format is widely supported in industry and across cloud providers.
Many hardware accelerators and edge computing vendors provide ONNX support for
their inference platforms, and this allows Tribuo-trained models to be widely
deployed after they have been exported. Tribuo provides an interface to [OCI
Data Science Model
Deployment](https://docs.oracle.com/en-us/iaas/data-science/using/model-dep-about.htm)
which deploys an ONNX model on [Oracle Cloud](https://www.oracle.com/cloud/),
and also can wrap a model deployment REST endpoint so it appears as a Tribuo
Model, allowing cloud deployment and inference from Tribuo. ONNX models are
also supported by [Oracle Machine Learning
Services](https://docs.oracle.com/en/database/oracle/machine-learning/omlss/index.html),
and many other cloud providers also provide ONNX model inference services which
can be used with exported Tribuo ONNX models.
## Reproducibility
From v4.2 Tribuo has a built-in reproducibility system for non-sequence Models.
This accepts a `Model` or `ModelProvenance` instance, automatically extracts
the configuration from the instance and then retrains the model, using the
data loading pipeline and training hyperparameters specified in the model provenance.
The system produces a diff of the reproduced model's provenance against the
original provenance, highlighting areas where the new model may behave differently
to the old one (e.g., showing if the number of features differs, or if the data
files have changed).
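The idea of a provenance diff can be sketched by comparing two flat key/value maps. Real provenance objects are nested, so the flat `Map<String, String>` representation, the key names and the `ProvenanceDiffSketch` class are simplifying assumptions for illustration only.

```java
import java.util.Map;
import java.util.Objects;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class ProvenanceDiffSketch {
    // Returns key -> "old -> new" for every key whose value differs or exists on one side only.
    static Map<String, String> diff(Map<String, String> original, Map<String, String> reproduced) {
        Set<String> keys = new TreeSet<>(original.keySet());
        keys.addAll(reproduced.keySet());
        Map<String, String> changes = new TreeMap<>();
        for (String k : keys) {
            String before = original.get(k);
            String after = reproduced.get(k);
            if (!Objects.equals(before, after)) {
                changes.put(k, before + " -> " + after); // a missing side is rendered as "null"
            }
        }
        return changes;
    }
}
```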
This is useful to check the validity of deployed production models, and to allow
easy comparison between a production model and one trained on current data. Over
time we plan to expand this system to support experimenting with different model
hyperparameters and training data configurations, tracking all this information
using the provenance built into Tribuo.
The reproducibility system requires Java 17, and as such is not included in the
`tribuo-all` Maven Central target. It is designed to be used in a development
environment rather than deployed in a production system like the rest of
Tribuo. As Tribuo migrates to newer versions of Java, we will consider
providing a jlink'd version of this utility.