# Interpreting numeric split points in H2O POJO tree-based models
This notebook explains how to correctly interpret the split points that appear in POJOs of H2O tree-based models.

*Motivation*: we have seen users parse the H2O POJO and translate the Java code into another representation (SQL statements, ...). While we do not encourage using the POJO this way, we want to clarify how to interpret the numerical values correctly.

## Concept of floating point numbers in computers

Computers and software like H2O use a floating-point representation of real numbers. In this representation a fixed-length sequence of bits (0/1) stores the number with limited precision. H2O mainly uses the 32-bit and 64-bit floating-point representations.
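
As a small illustration of these bit sequences, the sketch below packs one decimal value into both layouts using Python's standard `struct` module. The value `0.1` is chosen only for illustration and is not taken from any H2O model.

```python
import struct

value = 0.1  # illustrative value; it has no exact binary representation

# Pack the same decimal value into the 32-bit and 64-bit IEEE 754 layouts (big-endian).
bits32 = struct.pack(">f", value)
bits64 = struct.pack(">d", value)

# Print the width in bits and the raw bytes as hexadecimal.
print(len(bits32) * 8, bits32.hex())
print(len(bits64) * 8, bits64.hex())
```

The 64-bit layout has more bits available for the fraction, so it can store a value closer to the decimal number that was typed in.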

Let's take a look at one example of a floating-point number, 25.695312, and use the 32-bit and 64-bit representations to compare the behavior.

In [247]:
import numpy as np

In [248]:
f32 = np.float32("25.695312")
f32

25.695312

In [249]:
f64 = np.float64("25.695312")
f64

25.695312

If we compare the two numbers, we see they are not actually equal:

In [250]:
f32 == f64

False

When two numbers are compared, their precision is first adjusted to be the same. This typically means the lower-precision number is converted to the higher-precision representation. In this case `f32` is converted to the float64 representation. We can do the same thing explicitly:

In [251]:
np.float64(f32) == f64

False

The comparison fails because the converted number is actually different:

In [252]:
np.float64(f32)

25.6953125

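Notice the seventh decimal digit (`5`) that appears after the conversion: the 32-bit value closest to 25.695312 is exactly 25.6953125, which is slightly larger than the 64-bit value.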

In [253]:
np.float64(f32) - f64

4.999999987376214e-07

In [254]:
np.float64(f32) > f64

True
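
The exact stored values can also be inspected without NumPy. The sketch below is a minimal standard-library version of the same experiment; the helper name `to_float32` is our own and not part of any library.

```python
import struct
from decimal import Decimal

def to_float32(x: float) -> float:
    """Round a 64-bit Python float to the nearest 32-bit float, then widen it back."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

f64 = 25.695312                # stored as a 64-bit float
f32_widened = to_float32(f64)  # what the 32-bit value becomes after promotion

# Decimal(float) shows the exact binary value with no rounding for display.
print(Decimal(f32_widened))
print(Decimal(f64))
print(f32_widened > f64)
```

`Decimal` prints every digit that is actually stored, which makes it clear that the two representations hold different numbers even though both display as `25.695312` by default.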

## Examining GBM POJO

Understanding how computers compare numbers of different precision is critical for correctly interpreting split points in tree-based POJOs. Let's now train a simple GBM model.

In [255]:
import h2o
from h2o.estimators.gbm import H2OGradientBoostingEstimator

In [256]:
# Connect to a pre-existing cluster
h2o.init()

Checking whether there is an H2O instance running at http://localhost:54321 . connected.


H2O_cluster_uptime: 09 secs
H2O_cluster_timezone: America/New_York
H2O_data_parsing_timezone: UTC
H2O_cluster_version: 3.35.0.99999
H2O_cluster_version_age: 2 hours and 53 minutes
H2O_cluster_name: mkurka
H2O_cluster_total_nodes: 1
H2O_cluster_free_memory: 7.094 Gb
H2O_cluster_total_cores: 16
H2O_cluster_allowed_cores: 16


In [257]:
from h2o.utils.shared_utils import _locate # private function. used to find files within h2o git project directory.

df = h2o.upload_file(path=_locate("smalldata/logreg/prostate.csv"))

Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%


In [258]:
# Remove ID from training frame
train = df.drop("ID")

In [259]:
# For VOL & GLEASON, a zero really means "missing"
vol = train['VOL']
vol[vol == 0] = None
gle = train['GLEASON']
gle[gle == 0] = None

In [260]:
# Convert CAPSULE to a logical factor
train['CAPSULE'] = train['CAPSULE'].asfactor()

In [261]:
# Run GBM
my_gbm = H2OGradientBoostingEstimator(ntrees=1, seed=1234)

my_gbm.train(y="CAPSULE", training_frame=train)

gbm Model Build progress: |██████████████████████████████████████████████████████| (done) 100%
Model Details
H2OGradientBoostingEstimator :  Gradient Boosting Machine
Model Key:  GBM_model_python_1636137917875_1


Model Summary: 


| number_of_trees | number_of_internal_trees | model_size_in_bytes | min_depth | max_depth | mean_depth | min_leaves | max_leaves | mean_leaves |
|---|---|---|---|---|---|---|---|---|
| 1.0 | 1.0 | 360.0 | 5.0 | 5.0 | 5.0 | 24.0 | 24.0 | 24.0 |




ModelMetricsBinomial: gbm
** Reported on train data. **

MSE: 0.22019689456071448
RMSE: 0.4692514193486414
LogLoss: 0.6319753099030868
Mean Per-Class Error: 0.20582476749877632
AUC: 0.8816907085888687
AUCPR: 0.8515845076604194
Gini: 0.7633814171777373

Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.4008312811161997: 


| | 0 | 1 | Error | Rate |
|---|---|---|---|---|
| 0 | 176.0 | 51.0 | 0.2247 | (51.0/227.0) |
| 1 | 29.0 | 124.0 | 0.1895 | (29.0/153.0) |
| Total | 205.0 | 175.0 | 0.2105 | (80.0/380.0) |



Maximum Metrics: Maximum metrics at their respective thresholds


| metric | threshold | value | idx |
|---|---|---|---|
| max f1 | 0.400831 | 0.756098 | 10.0 |
| max f2 | 0.37984 | 0.831486 | 16.0 |
| max f0point5 | 0.429293 | 0.783866 | 6.0 |
| max accuracy | 0.429293 | 0.807895 | 6.0 |
| max precision | 0.463528 | 1.0 | 0.0 |
| max recall | 0.372774 | 1.0 | 18.0 |
| max specificity | 0.463528 | 1.0 | 0.0 |
| max absolute_mcc | 0.412406 | 0.595958 | 7.0 |
| max min_per_class_accuracy | 0.404036 | 0.777778 | 9.0 |
| max mean_per_class_accuracy | 0.404036 | 0.794175 | 9.0 |



Gains/Lift Table: Avg response rate: 40.26 %, avg score: 40.30 %


| group | cumulative_data_fraction | lower_threshold | lift | cumulative_lift | response_rate | score | cumulative_response_rate | cumulative_score | capture_rate | cumulative_capture_rate | gain | cumulative_gain | kolmogorov_smirnov |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.084211 | 0.463528 | 2.48366 | 2.48366 | 1.0 | 0.463528 | 1.0 | 0.463528 | 0.20915 | 0.20915 | 148.366013 | 148.366013 | 0.20915 |
| 2 | 0.128947 | 0.457452 | 2.337562 | 2.432973 | 0.941176 | 0.457452 | 0.979592 | 0.46142 | 0.104575 | 0.313725 | 133.756248 | 143.297319 | 0.30932 |
| 3 | 0.157895 | 0.444791 | 2.032086 | 2.359477 | 0.818182 | 0.444791 | 0.95 | 0.458372 | 0.058824 | 0.372549 | 103.208556 | 135.947712 | 0.359333 |
| 4 | 0.218421 | 0.432692 | 1.835749 | 2.214348 | 0.73913 | 0.436693 | 0.891566 | 0.452364 | 0.111111 | 0.48366 | 83.574879 | 121.434759 | 0.444013 |
| 5 | 0.3 | 0.429622 | 1.682479 | 2.069717 | 0.677419 | 0.430389 | 0.833333 | 0.446389 | 0.137255 | 0.620915 | 68.247944 | 106.971678 | 0.537215 |
| 6 | 0.426316 | 0.404036 | 1.24183 | 1.824417 | 0.5 | 0.412442 | 0.734568 | 0.43633 | 0.156863 | 0.777778 | 24.183007 | 82.441701 | 0.58835 |
| 7 | 0.521053 | 0.392412 | 0.827887 | 1.64323 | 0.333333 | 0.395728 | 0.661616 | 0.428948 | 0.078431 | 0.856209 | -17.211329 | 64.322968 | 0.561055 |
| 8 | 0.660526 | 0.383949 | 0.562338 | 1.414994 | 0.226415 | 0.385145 | 0.569721 | 0.419699 | 0.078431 | 0.934641 | -43.766186 | 41.499362 | 0.45887 |
| 9 | 0.763158 | 0.37984 | 0.445785 | 1.284652 | 0.179487 | 0.380533 | 0.517241 | 0.414432 | 0.045752 | 0.980392 | -55.421485 | 28.465179 | 0.363652 |
| 10 | 0.813158 | 0.373285 | 0.261438 | 1.221736 | 0.105263 | 0.373285 | 0.491909 | 0.411902 | 0.013072 | 0.993464 | -73.856209 | 22.173573 | 0.301834 |




Scoring History: 


| timestamp | duration | number_of_trees | training_rmse | training_logloss | training_auc | training_pr_auc | training_lift | training_classification_error |
|---|---|---|---|---|---|---|---|---|
| 2021-11-05 14:45:28 | 0.022 sec | 0.0 | 0.490428 | 0.674064 | 0.5 | 0.402632 | 1.0 | 0.597368 |
| 2021-11-05 14:45:28 | 0.182 sec | 1.0 | 0.469251 | 0.631975 | 0.881691 | 0.851585 | 2.48366 | 0.210526 |



Variable Importances: 


| variable | relative_importance | scaled_importance | percentage |
|---|---|---|---|
| GLEASON | 20.12532 | 1.0 | 0.496931 |
| PSA | 8.138151 | 0.404374 | 0.200946 |
| VOL | 6.416112 | 0.318808 | 0.158426 |
| DPROS | 5.819649 | 0.28917 | 0.143698 |
| AGE | 0.0 | 0.0 | 0.0 |
| RACE | 0.0 | 0.0 | 0.0 |
| DCAPS | 0.0 | 0.0 | 0.0 |




In [262]:
# Get the POJO
my_gbm.download_pojo()

/*
  Licensed under the Apache License, Version 2.0
    http://www.apache.org/licenses/LICENSE-2.0.html

  AUTOGENERATED BY H2O at 2021-11-05T14:45:28.555-04:00
  3.35.0.99999
  
  Standalone prediction code with sample test data for GBMModel named GBM_model_python_1636137917875_1

  How to download, compile and execute:
      mkdir tmpdir
      cd tmpdir
      curl http://192.168.86.229:54321/3/h2o-genmodel.jar > h2o-genmodel.jar
      curl http://192.168.86.229:54321/3/Models.java/GBM_model_python_1636137917875_1 > GBM_model_python_1636137917875_1.java
      javac -cp h2o-genmodel.jar -J-Xmx2g -J-XX:MaxPermSize=128m GBM_model_python_1636137917875_1.java

     (Note:  Try java argument -XX:+PrintCompilation to show runtime JIT compiler behavior.)
*/
import java.util.Map;
import hex.genmodel.GenModel;
import hex.genmodel.annotations.ModelPojo;

@ModelPojo(name="GBM_model_python_1636137917875_1", algorithm="gbm")
public class GBM_model_python_1636137917875_1 extends GenModel {
  public 

Take a close look at the POJO code. You should see statements like this one:
```
Double.isNaN(data[5]) || data[5 /* VOL */] < 25.695312f ? -0.09571693f : -0.16740088f
```
This code represents one split decision in a GBM tree. `data` represents a single input row. The split decision looks at column `VOL`, stored as element 5 of the `data` array, to decide whether the observation should go to the left subtree or to the right subtree.

It is important to notice that `data` is defined as a double array:
```
double[] data
```
This means data is represented by 64-bit floating point numbers.
The split point itself, however, is written out in 32-bit precision. In Java code this is expressed by the `f` suffix on the number literal, e.g. `25.695312f`.

This means we have the same scenario as outlined at the beginning of this notebook: we are comparing numbers with two different precisions, and we need to pay attention to how the numbers are interpreted.

In [263]:
data = np.array([0, 0, 0, 0, 0, np.float64(25.695312)])
data[5]

25.695312

The Java comparison rewritten in Python looks like this:

In [264]:
data[5] < np.float32(25.695312)

True

This means that the observation represented by array `data` should go to the left subtree of the current node. If we ignored the fact that the split point uses 32-bit precision and treated it as a 64-bit number, the comparison would evaluate to `False` and we would misclassify the observation into the right subtree.

In [265]:
data[5] < np.float64(25.695312)

False
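
For anyone translating a POJO split into another representation, the rule above can be summarized in a small helper. This is a minimal sketch using only the standard library; the function names `widen_float32` and `goes_left` are hypothetical and exist only for this illustration.

```python
import math
import struct

def widen_float32(x: float) -> float:
    """Round x to the nearest 32-bit float and return it as a 64-bit float."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def goes_left(value: float, split_f32: float) -> bool:
    """Mimic a POJO condition of the form `Double.isNaN(v) || v < <split>f`.

    The split literal is first rounded to 32-bit precision, exactly as the
    `f` suffix does in Java, and only then compared against the 64-bit value.
    """
    threshold = widen_float32(split_f32)
    return math.isnan(value) or value < threshold

print(goes_left(25.695312, 25.695312))     # 64-bit input just below the widened split
print(goes_left(25.6953125, 25.695312))    # input equal to the widened split
print(goes_left(float("nan"), 25.695312))  # missing value handled by the isNaN branch
```

The key design point is that the threshold is widened once and then used as an ordinary 64-bit number, so a target system that only supports doubles can reproduce the decision by using the widened value as its constant.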

## Expert options

### Forcing split point in POJO to be written in 64-bit precision

H2O allows users to modify the POJO output by setting the property `sys.ai.h2o.java.output.doubles`. Setting this property to `true` causes the POJO generator to output split points in 64-bit precision (doubles) instead of the default 32-bit precision.

We can set this property even on a running H2O instance by invoking a rapids expression.

In [266]:
h2o.rapids("(setproperty \"{}\" \"{}\")".format("sys.ai.h2o.java.output.doubles", "true"))["string"]

'Old values of sys.ai.h2o.java.output.doubles (per node): null'

In [267]:
my_gbm.download_pojo()

/*
  Licensed under the Apache License, Version 2.0
    http://www.apache.org/licenses/LICENSE-2.0.html

  AUTOGENERATED BY H2O at 2021-11-05T14:45:28.619-04:00
  3.35.0.99999
  
  Standalone prediction code with sample test data for GBMModel named GBM_model_python_1636137917875_1

  How to download, compile and execute:
      mkdir tmpdir
      cd tmpdir
      curl http://192.168.86.229:54321/3/h2o-genmodel.jar > h2o-genmodel.jar
      curl http://192.168.86.229:54321/3/Models.java/GBM_model_python_1636137917875_1 > GBM_model_python_1636137917875_1.java
      javac -cp h2o-genmodel.jar -J-Xmx2g -J-XX:MaxPermSize=128m GBM_model_python_1636137917875_1.java

     (Note:  Try java argument -XX:+PrintCompilation to show runtime JIT compiler behavior.)
*/
import java.util.Map;
import hex.genmodel.GenModel;
import hex.genmodel.annotations.ModelPojo;

@ModelPojo(name="GBM_model_python_1636137917875_1", algorithm="gbm")
public class GBM_model_python_1636137917875_1 extends GenModel {
  public 

In the modified POJO output you can now see the original split is coded as
```
Double.isNaN(data[5]) || data[5 /* VOL */] < 25.6953125 ? -0.0957169309258461 : -0.16740088164806366
```
Notice the last decimal place, and observe that there is no longer an `f` suffix at the end of the number. Compare it to the original version:
```
Double.isNaN(data[5]) || data[5 /* VOL */] < 25.695312f ? -0.09571693f : -0.16740088f
```

The 64-bit precision output might be more natural for users trying to understand how the POJO decides which path a given observation takes through the tree.
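
If only the text of a 32-bit POJO is available, a similar widening can be approximated offline. The sketch below is our own illustration, not an H2O tool: the regular expression and the function `widen_literals` are assumptions, and the pattern only covers simple decimal literals with an `f` suffix.

```python
import re
import struct

# Matches simple Java float literals such as 25.695312f (assumed pattern, not exhaustive).
FLOAT_LITERAL = re.compile(r"(?<![\w.])(\d+\.\d+(?:[eE][-+]?\d+)?)f\b")

def widen_float32(x: float) -> float:
    """Round x to the nearest 32-bit float and return it as a 64-bit float."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def widen_literals(java_expr: str) -> str:
    """Replace each `<number>f` literal with the 64-bit value it actually denotes."""
    return FLOAT_LITERAL.sub(
        lambda m: repr(widen_float32(float(m.group(1)))), java_expr
    )

print(widen_literals("data[5 /* VOL */] < 25.695312f"))
```

Python's `repr` may format some values differently from Java (for example, using exponent notation), but the numeric value it denotes is the same 64-bit number.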

### Convert existing MOJO into POJO with 64-bit precision number representation

Suppose we already have a MOJO model that was created by an older H2O version and we want to see what the POJO would look like with numbers represented in 64 bits.

For this use case H2O provides a conversion tool `MojoConvertTool` as a part of the `h2o.jar`.

In [268]:
mojo_path = my_gbm.download_mojo()
mojo_path

'/Users/mkurka/git/h2o/h2o-3/GBM_model_python_1636137917875_1.zip'

In [269]:
# Find h2o.jar (this is using internal functions)
from h2o.backend import H2OLocalServer
h2o_jar = H2OLocalServer()._find_jar()

In [270]:
# Invoke MojoConvertTool without arguments to print out usage instructions
import subprocess
subprocess.call(["java", "-cp", h2o_jar, "water.tools.MojoConvertTool"], stderr=subprocess.STDOUT, shell=False)

java -cp h2o.jar water.tools.MojoConvertTool source_mojo.zip target_pojo.java


1

In [271]:
# Add path to MOJO file and write output to "pojo.java"
subprocess.call(["java", "-cp", h2o_jar, "water.tools.MojoConvertTool", mojo_path, "pojo.java"], stderr=subprocess.STDOUT, shell=False)


Starting local H2O instance to facilitate MOJO to POJO conversion.

14:45:29.416 [main] INFO  hex.tree.xgboost.util.NativeLibrary - Loaded library from lib/osx_64/libxgboost4j_minimal.dylib (/var/folders/v1/fkjmcbkd11v2mrm4dm6345ym0000gn/T/libxgboost4j_minimal6279070988842798503.dylib)
11-05 14:45:29.543 127.0.0.1:54321       79164        main  INFO water.default: ----- H2O started  -----
11-05 14:45:29.544 127.0.0.1:54321       79164        main  INFO water.default: Build git branch: master
11-05 14:45:29.544 127.0.0.1:54321       79164        main  INFO water.default: Build git hash: b9ba1af5f07c6dbc6369e41113ea43947109e054
11-05 14:45:29.544 127.0.0.1:54321       79164        main  INFO water.default: Build git describe: jenkins-master-5625-7-gb9ba1af5f0
11-05 14:45:29.544 127.0.0.1:54321       79164        main  INFO water.default: Build project version: 3.35.0.99999
11-05 14:45:29.544 127.0.0.1:54321       79164        main  INFO water.default: Build age: 2 hours and 53 minutes
1

0

In [272]:
# Display the content of the POJO
with open('pojo.java', 'r') as f:
    print(f.read())

/*
  Licensed under the Apache License, Version 2.0
    http://www.apache.org/licenses/LICENSE-2.0.html

  AUTOGENERATED BY H2O at 2021-11-05T14:45:30.759-04:00
  3.35.0.99999
  
  Standalone prediction code with sample test data for GBMModel named Generic_model_1636137928927_1

  How to download, compile and execute:
      mkdir tmpdir
      cd tmpdir
      curl http:/localhost/127.0.0.1:54321/3/h2o-genmodel.jar > h2o-genmodel.jar
      curl http:/localhost/127.0.0.1:54321/3/Models.java/Generic_model_1636137928927_1 > Generic_model_1636137928927_1.java
      javac -cp h2o-genmodel.jar -J-Xmx2g -J-XX:MaxPermSize=128m Generic_model_1636137928927_1.java

     (Note:  Try java argument -XX:+PrintCompilation to show runtime JIT compiler behavior.)
*/
import java.util.Map;
import hex.genmodel.GenModel;
import hex.genmodel.annotations.ModelPojo;

@ModelPojo(name="Generic_model_1636137928927_1", algorithm="gbm")
public class Generic_model_1636137928927_1 extends GenModel {
  public hex.ModelC

In [273]:
# Now specify system property sys.ai.h2o.java.output.doubles to output numbers in 64-bit precision
subprocess.call(["java", "-Dsys.ai.h2o.java.output.doubles=true", "-cp", h2o_jar, "water.tools.MojoConvertTool", mojo_path, "pojo64.java"], stderr=subprocess.STDOUT, shell=False)


Starting local H2O instance to facilitate MOJO to POJO conversion.

14:45:31.502 [main] INFO  hex.tree.xgboost.util.NativeLibrary - Loaded library from lib/osx_64/libxgboost4j_minimal.dylib (/var/folders/v1/fkjmcbkd11v2mrm4dm6345ym0000gn/T/libxgboost4j_minimal978915340387551523.dylib)
11-05 14:45:31.628 127.0.0.1:54321       79166        main  INFO water.default: ----- H2O started  -----
11-05 14:45:31.628 127.0.0.1:54321       79166        main  INFO water.default: Build git branch: master
11-05 14:45:31.628 127.0.0.1:54321       79166        main  INFO water.default: Build git hash: b9ba1af5f07c6dbc6369e41113ea43947109e054
11-05 14:45:31.628 127.0.0.1:54321       79166        main  INFO water.default: Build git describe: jenkins-master-5625-7-gb9ba1af5f0
11-05 14:45:31.629 127.0.0.1:54321       79166        main  INFO water.default: Build project version: 3.35.0.99999
11-05 14:45:31.629 127.0.0.1:54321       79166        main  INFO water.default: Build age: 2 hours and 53 minutes
11

0

In [274]:
# Display the content of the POJO with 64-bit number representation
with open('pojo64.java', 'r') as f:
    print(f.read())

/*
  Licensed under the Apache License, Version 2.0
    http://www.apache.org/licenses/LICENSE-2.0.html

  AUTOGENERATED BY H2O at 2021-11-05T14:45:32.815-04:00
  3.35.0.99999
  
  Standalone prediction code with sample test data for GBMModel named Generic_model_1636137931013_1

  How to download, compile and execute:
      mkdir tmpdir
      cd tmpdir
      curl http:/localhost/127.0.0.1:54321/3/h2o-genmodel.jar > h2o-genmodel.jar
      curl http:/localhost/127.0.0.1:54321/3/Models.java/Generic_model_1636137931013_1 > Generic_model_1636137931013_1.java
      javac -cp h2o-genmodel.jar -J-Xmx2g -J-XX:MaxPermSize=128m Generic_model_1636137931013_1.java

     (Note:  Try java argument -XX:+PrintCompilation to show runtime JIT compiler behavior.)
*/
import java.util.Map;
import hex.genmodel.GenModel;
import hex.genmodel.annotations.ModelPojo;

@ModelPojo(name="Generic_model_1636137931013_1", algorithm="gbm")
public class Generic_model_1636137931013_1 extends GenModel {
  public hex.ModelC