# Query projects & experiments with ``RubiconJSON``

Users can query ``rubicon-ml`` logs in a JSONPath-like manner with the ``RubiconJSON`` class.

``RubiconJSON`` takes in top-level ``Rubicon`` objects, ``Projects``, and/or ``Experiments`` and
composes a JSON representation of them. Then, with the `search` method, users can query their logged
data using JSONPath syntax.

``RubiconJSON`` relies on [the ``jsonpath_ng`` library](https://github.com/h2non/jsonpath-ng) for query
parsing. The allowed syntax is described in
[the ``jsonpath_ng`` documentation](https://github.com/h2non/jsonpath-ng#jsonpath-syntax).
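
Most queries in this notebook start with ``$..experiment``, which uses JSONPath's recursive descent operator (``..``) to find a key at any depth. As a rough illustration of that idea only, here is a plain-Python sketch. The ``descend`` helper and the ``document`` dictionary are hypothetical stand-ins, not part of ``rubicon-ml`` or ``jsonpath_ng``, and this is not how ``jsonpath_ng`` is implemented.

```python
from typing import Any, Iterator


def descend(node: Any, key: str) -> Iterator[Any]:
    """Yield every value stored under `key` at any depth, in the spirit of `$..key`."""
    if isinstance(node, dict):
        for name, value in node.items():
            if name == key:
                yield value
            # Keep searching inside the value for deeper occurrences.
            yield from descend(value, key)
    elif isinstance(node, list):
        for item in node:
            yield from descend(item, key)


# Hypothetical stand-in for a logged project; not real rubicon-ml output.
document = {
    "project": [
        {
            "name": "demo",
            "experiment": [
                {"id": "a", "tags": ["large"]},
                {"id": "b", "tags": []},
            ],
        }
    ]
}

experiment_lists = list(descend(document, "experiment"))
experiment_ids = list(descend(document, "id"))
print(experiment_ids)
```

The helper finds the ``experiment`` list without the caller spelling out the path through ``project``, which is what makes ``..`` convenient when the nesting depth varies.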

In [1]:
from rubicon_ml import Rubicon

from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer, precision_score, recall_score
from sklearn.model_selection import ParameterGrid, train_test_split

### Train some models, log some experiments

We'll start off by loading a dataset and creating our ``rubicon-ml`` project.

In [2]:
X, y = load_wine(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3, random_state=0)

In [3]:
rubicon = Rubicon(persistence="memory", auto_git_enabled=True)
project = rubicon.get_or_create_project(name="jsonpath querying")

Now, let's train and evaluate some models and log their metadata to ``rubicon-ml``.

In [4]:
for parameters in ParameterGrid({
    "n_estimators": [5, 50, 500],
    "min_samples_leaf": [1, 10, 100],
}):
    rfc = RandomForestClassifier(random_state=0, **parameters)

    tags = ["large"] if parameters["n_estimators"] > 10 else []
    experiment = project.log_experiment(model_name=rfc.__class__.__name__, tags=tags)
    for name, value in parameters.items():
        experiment.log_parameter(name=name, value=value)
    for name in X_train.columns:
        experiment.log_feature(name=name)

    rfc.fit(X_train, y_train)

    precision_scorer = make_scorer(precision_score, average="weighted", zero_division=0.0)
    precision = precision_scorer(rfc, X_test, y_test)
    recall_scorer = make_scorer(recall_score, average="weighted")
    recall = recall_scorer(rfc, X_test, y_test)

    experiment.log_metric(name="precision", value=precision)
    experiment.log_metric(name="recall", value=recall)
    experiment.log_artifact(data_object=rfc, name=rfc.__class__.__name__, tags=["trained"])

### Load experiments into the ``RubiconJSON`` class

The ``RubiconJSON`` class accepts ``Projects``, ``Experiments``, and top-level ``Rubicon`` objects as
input. Once instantiated, a ``RubiconJSON`` object exposes a ``json`` property detailing each project
and experiment. Let's take a look at the representation of one of our experiments:

In [5]:
from rubicon_ml import RubiconJSON

rubicon_json = RubiconJSON(experiments=project.experiments())
rubicon_json.json["experiment"][0]

{'project_name': 'jsonpath querying',
 'id': '560c116e-0522-4ca9-acf9-d5fd6e5c9b44',
 'name': None,
 'description': None,
 'model_name': 'RandomForestClassifier',
 'branch_name': 'jsonpath',
 'commit_hash': 'c60285762eb792f76a8d60bfa1ce6e824cb94531',
 'training_metadata': None,
 'tags': [],
 'created_at': datetime.datetime(2023, 9, 23, 21, 52, 33, 164301),
 'feature': [{'name': 'alcohol',
   'id': 'bed69ee3-3af4-45ed-8955-bb3dca3693c7',
   'description': None,
   'importance': None,
   'tags': [],
   'created_at': datetime.datetime(2023, 9, 23, 21, 52, 33, 164974)},
  {'name': 'malic_acid',
   'id': '7714ebf1-7679-47c0-9fe9-e8cbe330c8b0',
   'description': None,
   'importance': None,
   'tags': [],
   'created_at': datetime.datetime(2023, 9, 23, 21, 52, 33, 165062)},
  {'name': 'ash',
   'id': '9e63133e-4c35-4bb5-9173-78f17c9c92d3',
   'description': None,
   'importance': None,
   'tags': [],
   'created_at': datetime.datetime(2023, 9, 23, 21, 52, 33, 165121)},
  {'name': 'alcalinity
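
The representation above contains ``datetime.datetime`` values in its ``created_at`` fields, which the standard ``json`` module cannot serialize on its own. If you want to write such a dictionary to disk as JSON text, one possible approach is sketched below. The ``experiment`` dictionary is a hand-written stand-in that copies a few fields from the output, not real ``RubiconJSON`` output.

```python
import datetime
import json

# Hand-written stand-in mirroring a few fields from the output above.
experiment = {
    "project_name": "jsonpath querying",
    "model_name": "RandomForestClassifier",
    "tags": [],
    "created_at": datetime.datetime(2023, 9, 23, 21, 52, 33, 164301),
}

# default=str converts any value json cannot encode (here, the datetime) to text.
serialized = json.dumps(experiment, default=str, indent=2)
print(serialized)
```

Note that ``default=str`` is lossy: reading the text back yields a string for ``created_at``, not a ``datetime``.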

### Query experiments with ``RubiconJSON.search``

Once the ``RubiconJSON`` object is created, we can use it to query our experiment metadata. We'll start by getting each
experiment that was tagged "large" during training.

In [6]:
experiment_query = "$..experiment[?(@.tags[*]=='large')]"

for match in rubicon_json.search(experiment_query):
    print(match.value)

{'project_name': 'jsonpath querying', 'id': 'e818e604-5b6c-455b-951f-68d87db287d2', 'name': None, 'description': None, 'model_name': 'RandomForestClassifier', 'branch_name': 'jsonpath', 'commit_hash': 'c60285762eb792f76a8d60bfa1ce6e824cb94531', 'training_metadata': None, 'tags': ['large'], 'created_at': datetime.datetime(2023, 9, 23, 21, 52, 33, 236186), 'feature': [{'name': 'alcohol', 'id': '010d78c9-648b-41c0-9bbb-6ea4c0098c9a', 'description': None, 'importance': None, 'tags': [], 'created_at': datetime.datetime(2023, 9, 23, 21, 52, 33, 236763)}, {'name': 'malic_acid', 'id': '42f821b3-b4c9-459a-ad19-e31dcd861df2', 'description': None, 'importance': None, 'tags': [], 'created_at': datetime.datetime(2023, 9, 23, 21, 52, 33, 236847)}, {'name': 'ash', 'id': '79111980-9804-4ae5-9819-0c7a0e30854d', 'description': None, 'importance': None, 'tags': [], 'created_at': datetime.datetime(2023, 9, 23, 21, 52, 33, 236908)}, {'name': 'alcalinity_of_ash', 'id': '11579ad0-bd9b-4a89-a07f-98ee99d77c7e'

We can access any attribute of the queried objects within the query as well.
Let's get just the IDs of the experiments from the last cell.

In [7]:
experiment_query += ".id"

for match in rubicon_json.search(experiment_query):
    print(match.value)

e818e604-5b6c-455b-951f-68d87db287d2
520c3905-b77b-4fe0-ac65-0fd639fb9e49
ef503b29-92d6-4e57-a79b-4a8d1894f72d
2f7e7339-1a46-4b37-a18d-5de043ac28f3
effc91cd-ad6f-420e-9654-9ba691d744ff
0eb7ec0c-61d0-4e8e-a662-e03eedad959d
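
For readers new to JSONPath filters, the two queries above are roughly equivalent, for data shaped like ours, to the plain-Python comprehensions below. This is a sketch over made-up experiments (the IDs are hypothetical), meant only to show what the filter and the trailing ``.id`` each contribute.

```python
# Hypothetical experiments; the IDs are made up for illustration.
experiments = [
    {"id": "exp-1", "tags": []},
    {"id": "exp-2", "tags": ["large"]},
    {"id": "exp-3", "tags": ["large"]},
]

# Counterpart of [?(@.tags[*]=='large')]: keep experiments carrying a "large" tag.
large_experiments = [e for e in experiments if "large" in e["tags"]]

# Counterpart of appending ".id": project a single attribute from each match.
large_ids = [e["id"] for e in large_experiments]
print(large_ids)
```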


Now, let's get _all_ the metrics from _every_ experiment:

In [8]:
metric_query = "$..experiment[*].metric"

for match in rubicon_json.search(metric_query):
    print(match.value)

[{'name': 'precision', 'value': 0.9513333333333333, 'id': '2c87e3c3-56cf-4dc7-b049-d497faab79e0', 'description': None, 'directionality': 'score', 'created_at': datetime.datetime(2023, 9, 23, 21, 52, 33, 172842), 'tags': []}, {'name': 'recall', 'value': 0.95, 'id': 'efd2e6c8-3cfc-4cca-943e-6bbad7a0c777', 'description': None, 'directionality': 'score', 'created_at': datetime.datetime(2023, 9, 23, 21, 52, 33, 172955), 'tags': []}]
[{'name': 'precision', 'value': 0.9684407096171803, 'id': 'dfe82079-3a29-4c7a-bc35-4d7d13f05b0b', 'description': None, 'directionality': 'score', 'created_at': datetime.datetime(2023, 9, 23, 21, 52, 33, 264896), 'tags': []}, {'name': 'recall', 'value': 0.9666666666666667, 'id': '0886983e-74c9-44b1-ae98-7437fae22057', 'description': None, 'directionality': 'score', 'created_at': datetime.datetime(2023, 9, 23, 21, 52, 33, 265031), 'tags': []}]
[{'name': 'precision', 'value': 0.9843137254901961, 'id': '08d9600b-8135-4073-9991-2fb121d90f16', 'description': None, 'di

Some of those precision scores are much higher than others. Let's keep only the ones at or above 0.96.

In [9]:
best_metric_query = "$..experiment[*].metric[?(@.name=='precision' & @.value>=0.96)]"

for match in rubicon_json.search(best_metric_query):
    print(match.value)

{'name': 'precision', 'value': 0.9684407096171803, 'id': 'dfe82079-3a29-4c7a-bc35-4d7d13f05b0b', 'description': None, 'directionality': 'score', 'created_at': datetime.datetime(2023, 9, 23, 21, 52, 33, 264896), 'tags': []}
{'name': 'precision', 'value': 0.9843137254901961, 'id': '08d9600b-8135-4073-9991-2fb121d90f16', 'description': None, 'directionality': 'score', 'created_at': datetime.datetime(2023, 9, 23, 21, 52, 33, 579588), 'tags': []}
{'name': 'precision', 'value': 0.9684407096171803, 'id': '5c0dd47d-ab4f-4990-a5b4-b30763f0b9df', 'description': None, 'directionality': 'score', 'created_at': datetime.datetime(2023, 9, 23, 21, 52, 34, 96576), 'tags': []}
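
The filter in the previous cell combines two conditions with ``&``, so a metric must satisfy both to match. A plain-Python sketch of the same logic, using made-up metric values, looks like this:

```python
# Hypothetical metrics; the values are made up for illustration.
metrics = [
    {"name": "precision", "value": 0.95},
    {"name": "recall", "value": 0.97},
    {"name": "precision", "value": 0.98},
]

# Both conditions must hold, mirroring `@.name=='precision' & @.value>=0.96`.
best_metrics = [
    m for m in metrics if m["name"] == "precision" and m["value"] >= 0.96
]
print(best_metrics)
```

The recall of 0.97 is excluded because of its name, and the precision of 0.95 is excluded because of its value.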


We can retrieve the IDs of the experiments those metrics belong to for further exploration.

In [10]:
best_experiment_query = "$..experiment[?(@.metric[?(@.name=='precision' & @.value>=0.96)])].id"

for match in rubicon_json.search(best_experiment_query):
    print(match.value)

e818e604-5b6c-455b-951f-68d87db287d2
520c3905-b77b-4fe0-ac65-0fd639fb9e49
2f7e7339-1a46-4b37-a18d-5de043ac28f3
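
The query above nests one filter inside another: the outer filter keeps an experiment when the inner filter finds at least one matching metric. In plain Python this corresponds to an ``any(...)`` test, as in the sketch below. The experiments and the ``has_high_precision`` helper are hypothetical and exist only for illustration.

```python
# Hypothetical experiments; IDs and metric values are made up for illustration.
experiments = [
    {
        "id": "exp-1",
        "metric": [
            {"name": "precision", "value": 0.95},
            {"name": "recall", "value": 0.97},
        ],
    },
    {
        "id": "exp-2",
        "metric": [
            {"name": "precision", "value": 0.98},
            {"name": "recall", "value": 0.97},
        ],
    },
]


def has_high_precision(experiment: dict, threshold: float = 0.96) -> bool:
    """Return True if any logged metric is a precision at or above `threshold`."""
    return any(
        m["name"] == "precision" and m["value"] >= threshold
        for m in experiment["metric"]
    )


best_ids = [e["id"] for e in experiments if has_high_precision(e)]
print(best_ids)
```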


We can use the IDs to retrieve ``rubicon-ml`` experiments and dig deeper into the metadata.

In [11]:
for match in rubicon_json.search(best_experiment_query):
    experiment = project.experiment(id=match.value)

    print(experiment.artifact(name="RandomForestClassifier").get_data(unpickle=True))

RandomForestClassifier(n_estimators=50, random_state=0)
RandomForestClassifier(n_estimators=500, random_state=0)
RandomForestClassifier(min_samples_leaf=10, n_estimators=500, random_state=0)
