# Using eval_metric in XGBoost 

This notebook shows how to use XGBoost's Evaluation Metric (`eval_metric`) with H2O. 



### What is eval_metric?

The original (non-H2O) XGBoost library lets users define one or more evaluation metrics that are calculated on both the training and validation datasets after each iteration. If the user does not define an evaluation metric, XGBoost assigns one based on the choice of the objective function. For example, for binary classification XGBoost uses logloss to report performance on training and validation data. Logloss measures the extent to which predicted probabilities diverge from the actual class labels. You might want to evaluate performance using a different metric: for imbalanced problems, where it is more important to correctly predict the positive minority class, you might use the area under the PR curve and specify `eval_metric="aucpr"`. The evaluation metric can also be used for early stopping.
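
To make the logloss definition concrete, here is a minimal sketch in plain Python. The function name `binary_logloss` and the clipping constant `eps` are illustrative assumptions, not part of XGBoost or H2O. A model that predicts 0.5 for every row scores ln 2 (about 0.693), which is the value reported for 0 trees in the scoring histories later in this notebook.

```python
import math


def binary_logloss(y_true, p_pred, eps=1e-15):
    """Mean negative log-likelihood of the true labels under the predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


labels = [1, 0, 1, 0]
uninformed = binary_logloss(labels, [0.5, 0.5, 0.5, 0.5])  # no information about the label
confident = binary_logloss(labels, [0.9, 0.1, 0.8, 0.2])   # probabilities lean the right way
print(uninformed)
print(confident)
```

The closer the predicted probabilities are to the true labels, the lower the logloss.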

### Is eval_metric needed with H2O XGBoost?

H2O's approach to calculating metrics is different. By default, H2O calculates all appropriate metrics for the given problem. For example, for binary classification H2O reports logloss, AUC, AUCPR, and additional metrics. Users thus typically don't need to worry about selecting the appropriate metric before model training. When early stopping is used, users need to choose one of the built-in early stopping metrics. For consistency between different model types and algorithm implementations, these are always calculated by H2O itself and, in XGBoost's specific case, independently of XGBoost's `eval_metric` implementation.

### When should eval_metric be considered?

While you typically don't need to specify a custom `eval_metric`, there are cases where doing so is beneficial.

Case 1: H2O doesn't provide a suitable built-in metric. Example: If you want to calculate classification error for a different threshold than the one automatically determined by H2O, you can do so only by specifying `eval_metric="error@<your threshold>"` because H2O currently doesn't have this capability.
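
The following plain-Python sketch illustrates what a classification error at a custom threshold measures. The helper name `error_at` and the sample data are hypothetical; the convention that predictions strictly above the threshold count as the positive class follows XGBoost's description of `error@t` and should be checked against the documentation for your version.

```python
def error_at(y_true, p_pred, threshold=0.5):
    """Fraction of rows misclassified when p > threshold is treated as the positive class."""
    wrong = sum(1 for y, p in zip(y_true, p_pred) if (p > threshold) != (y == 1))
    return wrong / len(y_true)


labels = [1, 1, 0, 0, 0]
probs = [0.9, 0.8, 0.6, 0.55, 0.2]

error_default = error_at(labels, probs, threshold=0.5)  # two negatives are predicted positive
error_strict = error_at(labels, probs, threshold=0.7)   # stricter threshold classifies all rows correctly
print(error_default, error_strict)
```

With H2O, the same idea is requested by passing, for example, `eval_metric="error@0.7"` to the XGBoost estimator.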

Case 2: Frequent scoring. By default, H2O uses time-triggered scoring: it tries to make sure that the majority of the time is spent on model training rather than on model scoring. You can override this behavior and manually specify the iterations at which the model should be scored (e.g. each iteration, every 5th iteration, ...). Because H2O calculates all possible metrics, as opposed to just a few in native XGBoost, and needs to extract the XGBoost model from native memory, scoring very frequently can add significant overhead and slow down model building. Relying on `eval_metric` alone avoids most of this overhead, as the last example in this notebook shows.
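
As a rough sketch of how the scoring frequency might be set explicitly, the hypothetical helper below builds keyword arguments for the estimator. `score_each_iteration` is used later in this notebook; `score_tree_interval` is taken from H2O's tree-model parameters as an assumption here, so check the documentation of your H2O version before relying on it.

```python
def scoring_kwargs(every_n_trees):
    """Build estimator keyword arguments that request scoring every `every_n_trees` trees."""
    if every_n_trees < 1:
        raise ValueError("every_n_trees must be a positive integer")
    if every_n_trees == 1:
        return {"score_each_iteration": True}
    # Assumption: score_tree_interval scores the model after every N trees.
    return {"score_tree_interval": every_n_trees}


print(scoring_kwargs(1))
print(scoring_kwargs(5))

# Usage (requires a running H2O cluster, so it is left commented out):
# from h2o.estimators.xgboost import H2OXGBoostEstimator
# model = H2OXGBoostEstimator(ntrees=1000, **scoring_kwargs(5))
```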

### Example

We will create a synthetic classification dataset, show XGBoost model training with default parameters and compare it to the output of model training with eval_metric used for early stopping.

In [1]:
import h2o
h2o.init(strict_version_check=False)

versionFromGradle='3.39.0',projectVersion='3.39.0.99999',branch='michalk_eval-metric-ntb',lastCommitHash='256ed83c89220493cca0574ebd517eb09e3611fd',gitDescribe='jenkins-master-5998-dirty',compiledOn='2022-10-28 14:15:26',compiledBy='kurkami'
Checking whether there is an H2O instance running at http://localhost:54321 . connected.


0,1
H2O_cluster_uptime:,5 hours 14 mins
H2O_cluster_timezone:,America/New_York
H2O_data_parsing_timezone:,UTC
H2O_cluster_version:,3.39.0.99999
H2O_cluster_version_age:,1 day
H2O_cluster_name:,kurkami
H2O_cluster_total_nodes:,1
H2O_cluster_free_memory:,3.051 Gb
H2O_cluster_total_cores:,8
H2O_cluster_allowed_cores:,8


In [2]:
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=100000, 
                           n_informative=5,
                           n_classes=2,
                           random_state=42)
X_train, X_validation, y_train, y_validation = train_test_split(X, y, test_size=0.25)

In [3]:
from h2o import H2OFrame
train = H2OFrame(y_train, column_names=["y"]).cbind(H2OFrame(X_train))
valid = H2OFrame(y_validation, column_names=["y"]).cbind(H2OFrame(X_validation))
train["y"] = train["y"].asfactor()
valid["y"] = valid["y"].asfactor()

Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%


In [4]:
train

y,C1,C2,C3,C4,C5,C6,C7,C8,C9,C10,C11,C12,C13,C14,C15,C16,C17,C18,C19,C20
1,-0.460911,0.587072,1.27653,1.65893,0.82732,0.900837,1.27825,-1.15342,-1.10419,0.231584,1.42246,0.632382,1.6532,0.0807537,0.63636,-0.0407703,-0.712485,-0.987504,0.529023,0.813985
0,-0.122517,0.926967,-3.27666,1.65648,1.26381,0.866601,-0.112373,4.88467,-2.56306,1.37596,1.66414,0.750439,-0.712589,0.011264,0.338901,-0.91639,-1.37998,-1.37817,-0.702391,-0.222837
1,1.07287,1.35581,0.504084,-1.9799,0.860703,-0.753541,1.17781,0.364174,2.49498,-0.309902,0.52471,-0.128492,0.133623,1.77568,-1.70823,0.263603,-1.69966,0.77875,-0.578858,1.53087
1,-0.497834,0.427476,-1.67148,-1.71013,1.11473,0.0850045,-2.81089,3.2892,0.676264,1.04835,1.46596,0.818714,-1.03108,-1.65528,0.301107,1.19965,0.120588,0.820176,-0.930966,-0.82086
1,0.923038,1.51161,1.66059,1.45418,-2.3419,-0.757573,0.308842,-3.57087,-0.939606,0.369506,0.990806,0.67289,1.53918,1.79014,0.726961,0.0339329,-0.542889,-1.01856,1.97659,1.50693
0,0.575923,2.20724,0.595904,-0.416675,0.21922,0.787082,-0.0753548,-0.413841,-0.280823,0.756811,-0.115625,-1.03631,0.617746,1.15633,-1.3063,1.83909,0.610228,0.62274,-0.157638,-1.19873
0,-0.931317,3.38117,-1.25809,0.323422,4.46615,0.366347,-0.178002,4.59046,-0.995683,0.248196,-3.7437,1.22383,-1.00021,-0.754449,0.573945,1.94618,-2.16485,-0.285859,-3.99979,-1.39355
0,0.956366,1.02616,-0.55144,-0.66649,2.47247,1.14188,1.10357,2.61941,1.2818,0.528047,-1.10656,-0.664716,-0.849829,1.49331,-0.183118,0.618475,-0.413351,0.410573,-2.14432,2.35245
0,-0.0523249,1.73885,-1.47901,0.254941,-1.37106,-0.0549625,-1.19023,0.848204,-0.424188,-0.0600663,-0.781962,1.1958,-0.456581,-0.740832,2.26734,2.01902,1.89101,0.348762,0.260157,0.133197
0,-0.818169,-0.746954,-1.28428,-0.0786964,0.708029,-0.252285,-0.19201,1.33074,0.229397,-0.599344,-0.141543,-0.721745,0.442618,0.953017,1.00111,-0.862009,0.790705,1.56841,0.101824,1.48008


##### Train XGBoost model with logloss as stopping metric

We are specifying 1000 trees to be built and providing stopping criteria to make the model stop reasonably early for the purpose of this example.

In [5]:
from h2o.estimators.xgboost import H2OXGBoostEstimator
model_def = H2OXGBoostEstimator(ntrees=1000, max_depth=6, score_each_iteration=True, 
                                stopping_rounds=3, stopping_tolerance=0.05, stopping_metric="logloss")
model_def.train(y="y", training_frame=train, validation_frame=valid)

xgboost Model Build progress: |██████████████████████████████████████████████████| (done) 100%


Unnamed: 0,number_of_trees
,22.0

Unnamed: 0,0,1,Error,Rate
0,35155.0,2174.0,0.0582,(2174.0/37329.0)
1,1112.0,36559.0,0.0295,(1112.0/37671.0)
Total,36267.0,38733.0,0.0438,(3286.0/75000.0)

metric,threshold,value,idx
max f1,0.5101676,0.9569918,214.0
max f2,0.3082806,0.9715081,263.0
max f0point5,0.7532607,0.9592084,147.0
max accuracy,0.5561712,0.9564,201.0
max precision,0.9974565,1.0,0.0
max recall,0.0016028,1.0,399.0
max specificity,0.9974565,1.0,0.0
max absolute_mcc,0.5561712,0.9129277,201.0
max min_per_class_accuracy,0.6239853,0.9548073,184.0
max mean_per_class_accuracy,0.5561712,0.9563597,201.0

group,cumulative_data_fraction,lower_threshold,lift,cumulative_lift,response_rate,score,cumulative_response_rate,cumulative_score,capture_rate,cumulative_capture_rate,gain,cumulative_gain,kolmogorov_smirnov
1,0.0101733,0.9961097,1.9909214,1.9909214,1.0,0.9967516,1.0,0.9967516,0.0202543,0.0202543,99.0921398,99.0921398,0.0202543
2,0.0200267,0.9957153,1.9909214,1.9909214,1.0,0.9958894,1.0,0.9963274,0.0196172,0.0398715,99.0921398,99.0921398,0.0398715
3,0.03068,0.9953964,1.9909214,1.9909214,1.0,0.9955495,1.0,0.9960573,0.0212099,0.0610815,99.0921398,99.0921398,0.0610815
4,0.0401067,0.9951944,1.9881054,1.9902595,0.9985856,0.9952708,0.9996676,0.9958724,0.0187412,0.0798227,98.8105385,99.0259523,0.0797959
5,0.0517733,0.9949297,1.9863707,1.9893832,0.9977143,0.9950084,0.9992274,0.9956777,0.0231743,0.102997,98.6370721,98.9383216,0.1029166
6,0.1000133,0.9933097,1.98817,1.988798,0.998618,0.9941657,0.9989335,0.9949484,0.0959093,0.1989063,98.8169987,98.8798032,0.198692
7,0.15008,0.9904984,1.9840287,1.987207,0.9965379,0.9920577,0.9981343,0.9939841,0.0993337,0.29824,98.4028728,98.7206993,0.2976775
8,0.2000133,0.9849605,1.9824155,1.9860108,0.9957276,0.9879125,0.9975335,0.9924683,0.0989886,0.3972286,98.2415459,98.6010786,0.3962375
9,0.30004,0.9561046,1.9673021,1.9797737,0.9881365,0.9740387,0.9944007,0.9863243,0.1967827,0.5940113,96.7302096,97.9773725,0.5906359
10,0.4000133,0.8770613,1.9123254,1.9629167,0.9605228,0.9210076,0.9859338,0.97,0.1911815,0.7851929,91.2325408,96.2916704,0.773888

Unnamed: 0,0,1,Error,Rate
0,11844.0,840.0,0.0662,(840.0/12684.0)
1,385.0,11931.0,0.0313,(385.0/12316.0)
Total,12229.0,12771.0,0.049,(1225.0/25000.0)

metric,threshold,value,idx
max f1,0.4899956,0.9511699,217.0
max f2,0.2381639,0.9684011,281.0
max f0point5,0.7637601,0.9521032,144.0
max accuracy,0.4899956,0.951,217.0
max precision,0.9974683,1.0,0.0
max recall,0.0018354,1.0,398.0
max specificity,0.9974683,1.0,0.0
max absolute_mcc,0.4899956,0.9026291,217.0
max min_per_class_accuracy,0.6303638,0.9494966,182.0
max mean_per_class_accuracy,0.4899956,0.9512573,217.0

group,cumulative_data_fraction,lower_threshold,lift,cumulative_lift,response_rate,score,cumulative_response_rate,cumulative_score,capture_rate,cumulative_capture_rate,gain,cumulative_gain,kolmogorov_smirnov
1,0.01028,0.9961244,2.0298798,2.0298798,1.0,0.9967313,1.0,0.9967313,0.0208672,0.0208672,102.9879831,102.9879831,0.0208672
2,0.02032,0.9958012,2.0298798,2.0298798,1.0,0.9959175,1.0,0.9963292,0.02038,0.0412472,102.9879831,102.9879831,0.0412472
3,0.03016,0.9954615,2.0216283,2.0271877,0.995935,0.9956102,0.9986737,0.9960946,0.0198928,0.06114,102.1628287,102.7187683,0.0610611
4,0.04108,0.9951944,2.0224444,2.0259268,0.996337,0.9952887,0.9980526,0.9958804,0.0220851,0.0832251,102.2444374,102.5926803,0.0830674
5,0.05256,0.9949297,2.0157343,2.0237006,0.9930314,0.9950085,0.9969559,0.99569,0.0231406,0.1063657,101.5734327,102.3700593,0.1060503
6,0.1,0.9932859,2.0196106,2.0217603,0.994941,0.9941552,0.996,0.9949619,0.0958103,0.202176,101.9610625,102.1760312,0.2013876
7,0.15,0.9902694,2.0233842,2.0223016,0.9968,0.991949,0.9962667,0.9939576,0.1011692,0.3033452,102.3384216,102.2301613,0.3022415
8,0.2,0.984066,2.0136408,2.0201364,0.992,0.9873945,0.9952,0.9923168,0.100682,0.4040273,101.3640792,102.0136408,0.4021351
9,0.3,0.9523897,1.9852225,2.0084984,0.978,0.9717051,0.9894667,0.9854462,0.1985222,0.6025495,98.5222475,100.849843,0.5963212
10,0.4,0.8671057,1.9235141,1.9872524,0.9476,0.9142567,0.979,0.9676488,0.1923514,0.7949009,92.3514128,98.7252355,0.7783447

Unnamed: 0,timestamp,duration,number_of_trees,training_rmse,training_logloss,training_auc,training_pr_auc,training_lift,training_classification_error,validation_rmse,validation_logloss,validation_auc,validation_pr_auc,validation_lift,validation_classification_error
,2022-10-28 14:19:27,0.031 sec,0.0,0.5,0.6931472,0.5,0.5022800,1.0,0.49772,0.5,0.6931472,0.5,0.4926400,1.0,0.50736
,2022-10-28 14:19:27,0.150 sec,1.0,0.3996010,0.5084556,0.9599605,0.9528824,1.9699443,0.08508,0.4002939,0.5095766,0.9584754,0.9501212,2.0073256,0.08816
,2022-10-28 14:19:27,0.213 sec,2.0,0.3370778,0.4032048,0.9693358,0.9638432,1.9749781,0.0786933,0.3378726,0.4043148,0.9689642,0.9627452,2.0180782,0.08072
,2022-10-28 14:19:27,0.272 sec,3.0,0.2957135,0.3335651,0.9735764,0.9692971,1.9823583,0.07124,0.2972804,0.3355690,0.9727585,0.9672690,2.0164813,0.07328
,2022-10-28 14:19:27,0.337 sec,4.0,0.2690565,0.2867050,0.9769386,0.9736661,1.9824404,0.0687733,0.2711930,0.2894303,0.9753680,0.9709177,2.0166126,0.07028
,2022-10-28 14:19:28,0.461 sec,5.0,0.2529770,0.2548994,0.9782349,0.9753981,1.9830013,0.06936,0.2550437,0.2575265,0.9769549,0.9724197,2.0204531,0.06956
,2022-10-28 14:19:28,0.553 sec,6.0,0.2430550,0.2328859,0.9792536,0.9767333,1.9826945,0.06864,0.2452571,0.2357295,0.9778085,0.9737537,2.0198804,0.06872
,2022-10-28 14:19:28,0.646 sec,7.0,0.2366703,0.2173043,0.9801005,0.9779518,1.9840125,0.0655333,0.2388924,0.2202064,0.9787945,0.9751087,2.0189075,0.06696
,2022-10-28 14:19:28,0.742 sec,8.0,0.2274282,0.2004642,0.9816894,0.9795849,1.9883425,0.06172,0.2302864,0.2040544,0.9803330,0.9772714,2.0219506,0.06516
,2022-10-28 14:19:28,0.831 sec,9.0,0.2235048,0.1908878,0.9823444,0.9805219,1.9883948,0.0608,0.2267565,0.1951048,0.9808128,0.9779093,2.0222773,0.06432

variable,relative_importance,scaled_importance,percentage
C11,70178.75,1.0,0.4261358
C2,35732.6640625,0.5091664,0.2169741
C13,21608.4863281,0.3079064,0.13121
C8,15654.5498047,0.2230668,0.0950568
C3,12402.2607422,0.1767239,0.0753084
C19,5739.7641602,0.0817878,0.0348527
C5,2835.9191895,0.0404099,0.0172201
C6,88.9352493,0.0012673,0.00054
C12,51.987648,0.0007408,0.0003157
C17,49.1711044,0.0007007,0.0002986


Scoring history will show us some of the metrics that were calculated on training and validation datasets. In our case, `validation_logloss` was used as the metric for early stopping.

In [6]:
model_def.scoring_history()

Unnamed: 0,Unnamed: 1,timestamp,duration,number_of_trees,training_rmse,training_logloss,training_auc,training_pr_auc,training_lift,training_classification_error,validation_rmse,validation_logloss,validation_auc,validation_pr_auc,validation_lift,validation_classification_error
0,,2022-10-28 14:19:27,0.031 sec,0.0,0.5,0.693147,0.5,0.50228,1.0,0.49772,0.5,0.693147,0.5,0.49264,1.0,0.50736
1,,2022-10-28 14:19:27,0.150 sec,1.0,0.399601,0.508456,0.95996,0.952882,1.969944,0.08508,0.400294,0.509577,0.958475,0.950121,2.007326,0.08816
2,,2022-10-28 14:19:27,0.213 sec,2.0,0.337078,0.403205,0.969336,0.963843,1.974978,0.078693,0.337873,0.404315,0.968964,0.962745,2.018078,0.08072
3,,2022-10-28 14:19:27,0.272 sec,3.0,0.295714,0.333565,0.973576,0.969297,1.982358,0.07124,0.29728,0.335569,0.972759,0.967269,2.016481,0.07328
4,,2022-10-28 14:19:27,0.337 sec,4.0,0.269057,0.286705,0.976939,0.973666,1.98244,0.068773,0.271193,0.28943,0.975368,0.970918,2.016613,0.07028
5,,2022-10-28 14:19:28,0.461 sec,5.0,0.252977,0.254899,0.978235,0.975398,1.983001,0.06936,0.255044,0.257526,0.976955,0.97242,2.020453,0.06956
6,,2022-10-28 14:19:28,0.553 sec,6.0,0.243055,0.232886,0.979254,0.976733,1.982694,0.06864,0.245257,0.235729,0.977808,0.973754,2.01988,0.06872
7,,2022-10-28 14:19:28,0.646 sec,7.0,0.23667,0.217304,0.9801,0.977952,1.984012,0.065533,0.238892,0.220206,0.978794,0.975109,2.018908,0.06696
8,,2022-10-28 14:19:28,0.742 sec,8.0,0.227428,0.200464,0.981689,0.979585,1.988342,0.06172,0.230286,0.204054,0.980333,0.977271,2.021951,0.06516
9,,2022-10-28 14:19:28,0.831 sec,9.0,0.223505,0.190888,0.982344,0.980522,1.988395,0.0608,0.226757,0.195105,0.980813,0.977909,2.022277,0.06432


##### Train XGBoost model and use eval_metric="logloss" for early stopping

We will use the same parameters as in the first case and add `eval_metric="logloss"`. To actually use the value of `eval_metric` for early stopping, we also need to specify `stopping_metric="custom"`.

In [7]:
model_eval = H2OXGBoostEstimator(ntrees=1000, max_depth=6, score_each_iteration=True, 
                                 eval_metric="logloss",
                                 stopping_rounds=3, stopping_tolerance=0.05, stopping_metric="custom")
model_eval.train(y="y", training_frame=train, validation_frame=valid)

xgboost Model Build progress: |██████████████████████████████████████████████████| (done) 100%


Unnamed: 0,number_of_trees
,22.0

Unnamed: 0,0,1,Error,Rate
0,35155.0,2174.0,0.0582,(2174.0/37329.0)
1,1112.0,36559.0,0.0295,(1112.0/37671.0)
Total,36267.0,38733.0,0.0438,(3286.0/75000.0)

metric,threshold,value,idx
max f1,0.5101676,0.9569918,214.0
max f2,0.3082806,0.9715081,263.0
max f0point5,0.7532607,0.9592084,147.0
max accuracy,0.5561712,0.9564,201.0
max precision,0.9974565,1.0,0.0
max recall,0.0016028,1.0,399.0
max specificity,0.9974565,1.0,0.0
max absolute_mcc,0.5561712,0.9129277,201.0
max min_per_class_accuracy,0.6239853,0.9548073,184.0
max mean_per_class_accuracy,0.5561712,0.9563597,201.0

group,cumulative_data_fraction,lower_threshold,lift,cumulative_lift,response_rate,score,cumulative_response_rate,cumulative_score,capture_rate,cumulative_capture_rate,gain,cumulative_gain,kolmogorov_smirnov
1,0.0101733,0.9961097,1.9909214,1.9909214,1.0,0.9967516,1.0,0.9967516,0.0202543,0.0202543,99.0921398,99.0921398,0.0202543
2,0.0200267,0.9957153,1.9909214,1.9909214,1.0,0.9958894,1.0,0.9963274,0.0196172,0.0398715,99.0921398,99.0921398,0.0398715
3,0.03068,0.9953964,1.9909214,1.9909214,1.0,0.9955495,1.0,0.9960573,0.0212099,0.0610815,99.0921398,99.0921398,0.0610815
4,0.0401067,0.9951944,1.9881054,1.9902595,0.9985856,0.9952708,0.9996676,0.9958724,0.0187412,0.0798227,98.8105385,99.0259523,0.0797959
5,0.0517733,0.9949297,1.9863707,1.9893832,0.9977143,0.9950084,0.9992274,0.9956777,0.0231743,0.102997,98.6370721,98.9383216,0.1029166
6,0.1000133,0.9933097,1.98817,1.988798,0.998618,0.9941657,0.9989335,0.9949484,0.0959093,0.1989063,98.8169987,98.8798032,0.198692
7,0.15008,0.9904984,1.9840287,1.987207,0.9965379,0.9920577,0.9981343,0.9939841,0.0993337,0.29824,98.4028728,98.7206993,0.2976775
8,0.2000133,0.9849605,1.9824155,1.9860108,0.9957276,0.9879125,0.9975335,0.9924683,0.0989886,0.3972286,98.2415459,98.6010786,0.3962375
9,0.30004,0.9561046,1.9673021,1.9797737,0.9881365,0.9740387,0.9944007,0.9863243,0.1967827,0.5940113,96.7302096,97.9773725,0.5906359
10,0.4000133,0.8770613,1.9123254,1.9629167,0.9605228,0.9210076,0.9859338,0.97,0.1911815,0.7851929,91.2325408,96.2916704,0.773888

Unnamed: 0,0,1,Error,Rate
0,11844.0,840.0,0.0662,(840.0/12684.0)
1,385.0,11931.0,0.0313,(385.0/12316.0)
Total,12229.0,12771.0,0.049,(1225.0/25000.0)

metric,threshold,value,idx
max f1,0.4899956,0.9511699,217.0
max f2,0.2381639,0.9684011,281.0
max f0point5,0.7637601,0.9521032,144.0
max accuracy,0.4899956,0.951,217.0
max precision,0.9974683,1.0,0.0
max recall,0.0018354,1.0,398.0
max specificity,0.9974683,1.0,0.0
max absolute_mcc,0.4899956,0.9026291,217.0
max min_per_class_accuracy,0.6303638,0.9494966,182.0
max mean_per_class_accuracy,0.4899956,0.9512573,217.0

group,cumulative_data_fraction,lower_threshold,lift,cumulative_lift,response_rate,score,cumulative_response_rate,cumulative_score,capture_rate,cumulative_capture_rate,gain,cumulative_gain,kolmogorov_smirnov
1,0.01028,0.9961244,2.0298798,2.0298798,1.0,0.9967313,1.0,0.9967313,0.0208672,0.0208672,102.9879831,102.9879831,0.0208672
2,0.02032,0.9958012,2.0298798,2.0298798,1.0,0.9959175,1.0,0.9963292,0.02038,0.0412472,102.9879831,102.9879831,0.0412472
3,0.03016,0.9954615,2.0216283,2.0271877,0.995935,0.9956102,0.9986737,0.9960946,0.0198928,0.06114,102.1628287,102.7187683,0.0610611
4,0.04108,0.9951944,2.0224444,2.0259268,0.996337,0.9952887,0.9980526,0.9958804,0.0220851,0.0832251,102.2444374,102.5926803,0.0830674
5,0.05256,0.9949297,2.0157343,2.0237006,0.9930314,0.9950085,0.9969559,0.99569,0.0231406,0.1063657,101.5734327,102.3700593,0.1060503
6,0.1,0.9932859,2.0196106,2.0217603,0.994941,0.9941552,0.996,0.9949619,0.0958103,0.202176,101.9610625,102.1760312,0.2013876
7,0.15,0.9902694,2.0233842,2.0223016,0.9968,0.991949,0.9962667,0.9939576,0.1011692,0.3033452,102.3384216,102.2301613,0.3022415
8,0.2,0.984066,2.0136408,2.0201364,0.992,0.9873945,0.9952,0.9923168,0.100682,0.4040273,101.3640792,102.0136408,0.4021351
9,0.3,0.9523897,1.9852225,2.0084984,0.978,0.9717051,0.9894667,0.9854462,0.1985222,0.6025495,98.5222475,100.849843,0.5963212
10,0.4,0.8671057,1.9235141,1.9872524,0.9476,0.9142567,0.979,0.9676488,0.1923514,0.7949009,92.3514128,98.7252355,0.7783447

Unnamed: 0,timestamp,duration,number_of_trees,training_rmse,training_logloss,training_auc,training_pr_auc,training_lift,training_classification_error,training_custom,validation_rmse,validation_logloss,validation_auc,validation_pr_auc,validation_lift,validation_classification_error,validation_custom
,2022-10-28 14:19:30,0.025 sec,0.0,0.5,0.6931472,0.5,0.5022800,1.0,0.49772,0.693147,0.5,0.6931472,0.5,0.4926400,1.0,0.50736,0.693147
,2022-10-28 14:19:30,0.139 sec,1.0,0.3996010,0.5084556,0.9599605,0.9528824,1.9699443,0.08508,0.508456,0.4002939,0.5095766,0.9584754,0.9501212,2.0073256,0.08816,0.509577
,2022-10-28 14:19:30,0.179 sec,2.0,0.3370778,0.4032048,0.9693358,0.9638432,1.9749781,0.0786933,0.403205,0.3378726,0.4043148,0.9689642,0.9627452,2.0180782,0.08072,0.404315
,2022-10-28 14:19:30,0.228 sec,3.0,0.2957135,0.3335651,0.9735764,0.9692971,1.9823583,0.07124,0.333565,0.2972804,0.3355690,0.9727585,0.9672690,2.0164813,0.07328,0.335569
,2022-10-28 14:19:30,0.280 sec,4.0,0.2690565,0.2867050,0.9769386,0.9736661,1.9824404,0.0687733,0.286705,0.2711930,0.2894303,0.9753680,0.9709177,2.0166126,0.07028,0.28943
,2022-10-28 14:19:30,0.356 sec,5.0,0.2529770,0.2548994,0.9782349,0.9753981,1.9830013,0.06936,0.254899,0.2550437,0.2575265,0.9769549,0.9724197,2.0204531,0.06956,0.257526
,2022-10-28 14:19:30,0.434 sec,6.0,0.2430550,0.2328859,0.9792536,0.9767333,1.9826945,0.06864,0.232886,0.2452571,0.2357295,0.9778085,0.9737537,2.0198804,0.06872,0.235729
,2022-10-28 14:19:30,0.511 sec,7.0,0.2366703,0.2173043,0.9801005,0.9779518,1.9840125,0.0655333,0.217304,0.2388924,0.2202064,0.9787945,0.9751087,2.0189075,0.06696,0.220206
,2022-10-28 14:19:30,0.587 sec,8.0,0.2274282,0.2004642,0.9816894,0.9795849,1.9883425,0.06172,0.200464,0.2302864,0.2040544,0.9803330,0.9772714,2.0219506,0.06516,0.204054
,2022-10-28 14:19:31,0.676 sec,9.0,0.2235048,0.1908878,0.9823444,0.9805219,1.9883948,0.0608,0.190888,0.2267565,0.1951048,0.9808128,0.9779093,2.0222773,0.06432,0.195105

variable,relative_importance,scaled_importance,percentage
C11,70178.75,1.0,0.4261358
C2,35732.6640625,0.5091664,0.2169741
C13,21608.4863281,0.3079064,0.13121
C8,15654.5498047,0.2230668,0.0950568
C3,12402.2607422,0.1767239,0.0753084
C19,5739.7641602,0.0817878,0.0348527
C5,2835.9191895,0.0404099,0.0172201
C6,88.9352493,0.0012673,0.00054
C12,51.987648,0.0007408,0.0003157
C17,49.1711044,0.0007007,0.0002986


The scoring history for the model with `eval_metric="logloss"` looks similar to the scoring history of the first model. This is expected: we didn't actually change the training behavior, we only changed the source of the values that trigger early stopping. In this case, we are stopping on the values of `validation_custom`. This value corresponds to the value calculated and returned by XGBoost. It should be close to H2O's own `validation_logloss` value; there can be only a small difference caused by different precision in XGBoost and H2O (the values should be within an absolute tolerance of 1e-5). This is, however, something to keep in mind: there can be edge cases where an H2O metric differs slightly from the conceptually identical XGBoost metric, and this might cause the models to stop at a different iteration.

The scoring history also contains the value of `eval_metric` for the training frame; see the column `training_custom`.
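
The tolerance claim can be checked with a few lines of plain Python. The pairs below are copied from the first five rows of the model summary printed above (`validation_logloss`, `validation_custom`); the helper names are illustrative. Note that the printed values are rounded, so part of the difference seen here comes from display precision.

```python
import math

# (validation_logloss, validation_custom) for 0 to 4 trees, as printed above.
pairs = [
    (0.6931472, 0.693147),
    (0.5095766, 0.509577),
    (0.4043148, 0.404315),
    (0.3355690, 0.335569),
    (0.2894303, 0.28943),
]


def max_abs_diff(rows):
    """Largest absolute difference between the H2O value and the XGBoost value."""
    return max(abs(h2o_value - xgb_value) for h2o_value, xgb_value in rows)


def all_within(rows, abs_tol=1e-5):
    """True if every pair agrees within the given absolute tolerance."""
    return all(math.isclose(a, b, rel_tol=0.0, abs_tol=abs_tol) for a, b in rows)


print(max_abs_diff(pairs))
print(all_within(pairs))
```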

In [8]:
model_eval.scoring_history()

Unnamed: 0,Unnamed: 1,timestamp,duration,number_of_trees,training_rmse,training_logloss,training_auc,training_pr_auc,training_lift,training_classification_error,training_custom,validation_rmse,validation_logloss,validation_auc,validation_pr_auc,validation_lift,validation_classification_error,validation_custom
0,,2022-10-28 14:19:30,0.025 sec,0.0,0.5,0.693147,0.5,0.50228,1.0,0.49772,0.693147,0.5,0.693147,0.5,0.49264,1.0,0.50736,0.693147
1,,2022-10-28 14:19:30,0.139 sec,1.0,0.399601,0.508456,0.95996,0.952882,1.969944,0.08508,0.508456,0.400294,0.509577,0.958475,0.950121,2.007326,0.08816,0.509577
2,,2022-10-28 14:19:30,0.179 sec,2.0,0.337078,0.403205,0.969336,0.963843,1.974978,0.078693,0.403205,0.337873,0.404315,0.968964,0.962745,2.018078,0.08072,0.404315
3,,2022-10-28 14:19:30,0.228 sec,3.0,0.295714,0.333565,0.973576,0.969297,1.982358,0.07124,0.333565,0.29728,0.335569,0.972759,0.967269,2.016481,0.07328,0.335569
4,,2022-10-28 14:19:30,0.280 sec,4.0,0.269057,0.286705,0.976939,0.973666,1.98244,0.068773,0.286705,0.271193,0.28943,0.975368,0.970918,2.016613,0.07028,0.28943
5,,2022-10-28 14:19:30,0.356 sec,5.0,0.252977,0.254899,0.978235,0.975398,1.983001,0.06936,0.254899,0.255044,0.257526,0.976955,0.97242,2.020453,0.06956,0.257526
6,,2022-10-28 14:19:30,0.434 sec,6.0,0.243055,0.232886,0.979254,0.976733,1.982694,0.06864,0.232886,0.245257,0.235729,0.977808,0.973754,2.01988,0.06872,0.235729
7,,2022-10-28 14:19:30,0.511 sec,7.0,0.23667,0.217304,0.9801,0.977952,1.984012,0.065533,0.217304,0.238892,0.220206,0.978794,0.975109,2.018908,0.06696,0.220206
8,,2022-10-28 14:19:30,0.587 sec,8.0,0.227428,0.200464,0.981689,0.979585,1.988342,0.06172,0.200464,0.230286,0.204054,0.980333,0.977271,2.021951,0.06516,0.204054
9,,2022-10-28 14:19:31,0.676 sec,9.0,0.223505,0.190888,0.982344,0.980522,1.988395,0.0608,0.190888,0.226757,0.195105,0.980813,0.977909,2.022277,0.06432,0.195105


##### Train XGBoost model, use eval_metric="logloss" for early stopping and disable H2O metrics to speed up model training

In this example we keep the same parameters as in the previous case and add the flag `score_eval_metric_only=True`. This flag instructs H2O to disable its own scoring and rely solely on `eval_metric` for early stopping and for recording the scoring history.

In [9]:
model_eval_only = H2OXGBoostEstimator(ntrees=1000, max_depth=6, score_each_iteration=True, 
                                      eval_metric="logloss",
                                      stopping_rounds=3, stopping_tolerance=0.05, stopping_metric="custom",
                                      score_eval_metric_only=True)
model_eval_only.train(y="y", training_frame=train, validation_frame=valid)

xgboost Model Build progress: |██████████████████████████████████████████████████| (done) 100%


Unnamed: 0,number_of_trees
,22.0

Unnamed: 0,0,1,Error,Rate
0,35155.0,2174.0,0.0582,(2174.0/37329.0)
1,1112.0,36559.0,0.0295,(1112.0/37671.0)
Total,36267.0,38733.0,0.0438,(3286.0/75000.0)

metric,threshold,value,idx
max f1,0.5101676,0.9569918,214.0
max f2,0.3082806,0.9715081,263.0
max f0point5,0.7532607,0.9592084,147.0
max accuracy,0.5561712,0.9564,201.0
max precision,0.9974565,1.0,0.0
max recall,0.0016028,1.0,399.0
max specificity,0.9974565,1.0,0.0
max absolute_mcc,0.5561712,0.9129277,201.0
max min_per_class_accuracy,0.6239853,0.9548073,184.0
max mean_per_class_accuracy,0.5561712,0.9563597,201.0

group,cumulative_data_fraction,lower_threshold,lift,cumulative_lift,response_rate,score,cumulative_response_rate,cumulative_score,capture_rate,cumulative_capture_rate,gain,cumulative_gain,kolmogorov_smirnov
1,0.0101733,0.9961097,1.9909214,1.9909214,1.0,0.9967516,1.0,0.9967516,0.0202543,0.0202543,99.0921398,99.0921398,0.0202543
2,0.0200267,0.9957153,1.9909214,1.9909214,1.0,0.9958894,1.0,0.9963274,0.0196172,0.0398715,99.0921398,99.0921398,0.0398715
3,0.03068,0.9953964,1.9909214,1.9909214,1.0,0.9955495,1.0,0.9960573,0.0212099,0.0610815,99.0921398,99.0921398,0.0610815
4,0.0401067,0.9951944,1.9881054,1.9902595,0.9985856,0.9952708,0.9996676,0.9958724,0.0187412,0.0798227,98.8105385,99.0259523,0.0797959
5,0.0517733,0.9949297,1.9863707,1.9893832,0.9977143,0.9950084,0.9992274,0.9956777,0.0231743,0.102997,98.6370721,98.9383216,0.1029166
6,0.1000133,0.9933097,1.98817,1.988798,0.998618,0.9941657,0.9989335,0.9949484,0.0959093,0.1989063,98.8169987,98.8798032,0.198692
7,0.15008,0.9904984,1.9840287,1.987207,0.9965379,0.9920577,0.9981343,0.9939841,0.0993337,0.29824,98.4028728,98.7206993,0.2976775
8,0.2000133,0.9849605,1.9824155,1.9860108,0.9957276,0.9879125,0.9975335,0.9924683,0.0989886,0.3972286,98.2415459,98.6010786,0.3962375
9,0.30004,0.9561046,1.9673021,1.9797737,0.9881365,0.9740387,0.9944007,0.9863243,0.1967827,0.5940113,96.7302096,97.9773725,0.5906359
10,0.4000133,0.8770613,1.9123254,1.9629167,0.9605228,0.9210076,0.9859338,0.97,0.1911815,0.7851929,91.2325408,96.2916704,0.773888

Unnamed: 0,0,1,Error,Rate
0,11844.0,840.0,0.0662,(840.0/12684.0)
1,385.0,11931.0,0.0313,(385.0/12316.0)
Total,12229.0,12771.0,0.049,(1225.0/25000.0)

metric,threshold,value,idx
max f1,0.4899956,0.9511699,217.0
max f2,0.2381639,0.9684011,281.0
max f0point5,0.7637601,0.9521032,144.0
max accuracy,0.4899956,0.951,217.0
max precision,0.9974683,1.0,0.0
max recall,0.0018354,1.0,398.0
max specificity,0.9974683,1.0,0.0
max absolute_mcc,0.4899956,0.9026291,217.0
max min_per_class_accuracy,0.6303638,0.9494966,182.0
max mean_per_class_accuracy,0.4899956,0.9512573,217.0

group,cumulative_data_fraction,lower_threshold,lift,cumulative_lift,response_rate,score,cumulative_response_rate,cumulative_score,capture_rate,cumulative_capture_rate,gain,cumulative_gain,kolmogorov_smirnov
1,0.01028,0.9961244,2.0298798,2.0298798,1.0,0.9967313,1.0,0.9967313,0.0208672,0.0208672,102.9879831,102.9879831,0.0208672
2,0.02032,0.9958012,2.0298798,2.0298798,1.0,0.9959175,1.0,0.9963292,0.02038,0.0412472,102.9879831,102.9879831,0.0412472
3,0.03016,0.9954615,2.0216283,2.0271877,0.995935,0.9956102,0.9986737,0.9960946,0.0198928,0.06114,102.1628287,102.7187683,0.0610611
4,0.04108,0.9951944,2.0224444,2.0259268,0.996337,0.9952887,0.9980526,0.9958804,0.0220851,0.0832251,102.2444374,102.5926803,0.0830674
5,0.05256,0.9949297,2.0157343,2.0237006,0.9930314,0.9950085,0.9969559,0.99569,0.0231406,0.1063657,101.5734327,102.3700593,0.1060503
6,0.1,0.9932859,2.0196106,2.0217603,0.994941,0.9941552,0.996,0.9949619,0.0958103,0.202176,101.9610625,102.1760312,0.2013876
7,0.15,0.9902694,2.0233842,2.0223016,0.9968,0.991949,0.9962667,0.9939576,0.1011692,0.3033452,102.3384216,102.2301613,0.3022415
8,0.2,0.984066,2.0136408,2.0201364,0.992,0.9873945,0.9952,0.9923168,0.100682,0.4040273,101.3640792,102.0136408,0.4021351
9,0.3,0.9523897,1.9852225,2.0084984,0.978,0.9717051,0.9894667,0.9854462,0.1985222,0.6025495,98.5222475,100.849843,0.5963212
10,0.4,0.8671057,1.9235141,1.9872524,0.9476,0.9142567,0.979,0.9676488,0.1923514,0.7949009,92.3514128,98.7252355,0.7783447

Unnamed: 0,timestamp,duration,number_of_trees,training_rmse,training_logloss,training_auc,training_pr_auc,training_lift,training_classification_error,training_custom,validation_rmse,validation_logloss,validation_auc,validation_pr_auc,validation_lift,validation_classification_error,validation_custom
,2022-10-28 14:19:33,0.030 sec,0.0,,,,,,,0.693147,,,,,,,0.693147
,2022-10-28 14:19:33,0.118 sec,1.0,,,,,,,0.508456,,,,,,,0.509577
,2022-10-28 14:19:33,0.127 sec,2.0,,,,,,,0.403205,,,,,,,0.404315
,2022-10-28 14:19:33,0.135 sec,3.0,,,,,,,0.333565,,,,,,,0.335569
,2022-10-28 14:19:33,0.144 sec,4.0,,,,,,,0.286705,,,,,,,0.28943
,2022-10-28 14:19:33,0.151 sec,5.0,,,,,,,0.254899,,,,,,,0.257526
,2022-10-28 14:19:33,0.161 sec,6.0,,,,,,,0.232886,,,,,,,0.235729
,2022-10-28 14:19:33,0.169 sec,7.0,,,,,,,0.217304,,,,,,,0.220206
,2022-10-28 14:19:33,0.177 sec,8.0,,,,,,,0.200464,,,,,,,0.204054
,2022-10-28 14:19:33,0.185 sec,9.0,,,,,,,0.190888,,,,,,,0.195105

variable,relative_importance,scaled_importance,percentage
C11,70178.75,1.0,0.4261358
C2,35732.6640625,0.5091664,0.2169741
C13,21608.4863281,0.3079064,0.13121
C8,15654.5498047,0.2230668,0.0950568
C3,12402.2607422,0.1767239,0.0753084
C19,5739.7641602,0.0817878,0.0348527
C5,2835.9191895,0.0404099,0.0172201
C6,88.9352493,0.0012673,0.00054
C12,51.987648,0.0007408,0.0003157
C17,49.1711044,0.0007007,0.0002986


The scoring history shows undefined values for the H2O metrics in all scoring iterations except the final one. The columns `training_custom` and `validation_custom` are the only ones populated for all iterations.

In the final iteration H2O performs full scoring, which is why all values are defined in the last row of the scoring history.

In [10]:
model_eval_only.scoring_history()

Unnamed: 0,Unnamed: 1,timestamp,duration,number_of_trees,training_rmse,training_logloss,training_auc,training_pr_auc,training_lift,training_classification_error,training_custom,validation_rmse,validation_logloss,validation_auc,validation_pr_auc,validation_lift,validation_classification_error,validation_custom
0,,2022-10-28 14:19:33,0.030 sec,0.0,,,,,,,0.693147,,,,,,,0.693147
1,,2022-10-28 14:19:33,0.118 sec,1.0,,,,,,,0.508456,,,,,,,0.509577
2,,2022-10-28 14:19:33,0.127 sec,2.0,,,,,,,0.403205,,,,,,,0.404315
3,,2022-10-28 14:19:33,0.135 sec,3.0,,,,,,,0.333565,,,,,,,0.335569
4,,2022-10-28 14:19:33,0.144 sec,4.0,,,,,,,0.286705,,,,,,,0.28943
5,,2022-10-28 14:19:33,0.151 sec,5.0,,,,,,,0.254899,,,,,,,0.257526
6,,2022-10-28 14:19:33,0.161 sec,6.0,,,,,,,0.232886,,,,,,,0.235729
7,,2022-10-28 14:19:33,0.169 sec,7.0,,,,,,,0.217304,,,,,,,0.220206
8,,2022-10-28 14:19:33,0.177 sec,8.0,,,,,,,0.200464,,,,,,,0.204054
9,,2022-10-28 14:19:33,0.185 sec,9.0,,,,,,,0.190888,,,,,,,0.195105


Models `model_eval` and `model_eval_only` are guaranteed to be identical in behavior (same trees, same thresholds, ...). The only technical difference between them is that the second one, `model_eval_only`, doesn't have the full scoring history.

We can also see that the flag `score_eval_metric_only=True` saved us some training time. Model `model_eval_only` was built faster:

In [11]:
def total_duration(scoring_history):
    # The "duration" column holds the time elapsed since training started
    # (e.g. "2.057 sec"), so the total is the last entry rather than the sum.
    return float(scoring_history["duration"].iloc[-1].strip().split(" ")[0])


print("Duration (s) with H2O scoring: %s" % total_duration(model_eval.scoring_history()))
print("Duration (s) with only eval_metric scored: %s" % total_duration(model_eval_only.scoring_history()))


Duration (s) with H2O scoring: 2.057
Duration (s) with only eval_metric scored: 0.31


In [12]:
model_eval.scoring_history()["duration"]

0      0.025 sec
1      0.139 sec
2      0.179 sec
3      0.228 sec
4      0.280 sec
5      0.356 sec
6      0.434 sec
7      0.511 sec
8      0.587 sec
9      0.676 sec
10     0.773 sec
11     0.854 sec
12     0.943 sec
13     1.058 sec
14     1.160 sec
15     1.246 sec
16     1.345 sec
17     1.462 sec
18     1.566 sec
19     1.684 sec
20     1.802 sec
21     1.916 sec
22     2.057 sec
Name: duration, dtype: object

In [13]:
model_eval_only.scoring_history()["duration"]

0      0.030 sec
1      0.118 sec
2      0.127 sec
3      0.135 sec
4      0.144 sec
5      0.151 sec
6      0.161 sec
7      0.169 sec
8      0.177 sec
9      0.185 sec
10     0.194 sec
11     0.202 sec
12     0.213 sec
13     0.224 sec
14     0.234 sec
15     0.243 sec
16     0.251 sec
17     0.261 sec
18     0.270 sec
19     0.279 sec
20     0.291 sec
21     0.301 sec
22     0.310 sec
Name: duration, dtype: object
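
To put the two listings side by side, the sketch below estimates the average time per tree from the first and last entries of each listing. It assumes that `duration` is the time elapsed since training started, which is consistent with the steadily increasing values above; the helper names are illustrative, and the parser only handles strings of the form `"0.310 sec"`.

```python
def parse_seconds(text):
    """Convert a string such as '0.310 sec' to a float number of seconds."""
    return float(text.strip().split(" ")[0])


def mean_seconds_per_tree(first, last, n_trees):
    """Average elapsed time per tree between the first and last scoring events."""
    return (parse_seconds(last) - parse_seconds(first)) / n_trees


# First and last entries of the two duration listings above (22 trees each).
with_h2o_scoring = mean_seconds_per_tree("0.025 sec", "2.057 sec", 22)
eval_metric_only = mean_seconds_per_tree("0.030 sec", "0.310 sec", 22)
print(round(with_h2o_scoring, 4), round(eval_metric_only, 4))
```

On this run, each iteration with full H2O scoring takes several times longer than an iteration that only records `eval_metric`.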