## Hyperparameter error examples

Since the schema of the `C` hyperparameter of `LR` specifies an
exclusive minimum of zero, passing zero is not valid. Lale internally
calls an off-the-shelf JSON Schema validator when an operator gets
configured with concrete hyperparameter values.

In [1]:
from sklearn.linear_model import LogisticRegression as LR
import lale
lale.wrap_imported_operators()
from lale.settings import set_disable_data_schema_validation, set_disable_hyperparams_schema_validation
# Enable schema validation explicitly for the notebook.
set_disable_data_schema_validation(False)
set_disable_hyperparams_schema_validation(False)

In [2]:
import jsonschema
import sys
try:
    LR(C=0.0)
except jsonschema.ValidationError as e:
    message = e.message
print(message, file=sys.stderr)
assert message.startswith('Invalid configuration for LR(C=0.0)')

Invalid configuration for LR(C=0.0) due to invalid value C=0.0.
Some possible fixes include:
- set C=1.0
Schema of argument C: {
    "description": "Inverse regularization strength. Smaller values specify stronger regularization.",
    "type": "number",
    "distribution": "loguniform",
    "minimum": 0.0,
    "exclusiveMinimum": true,
    "default": 1.0,
    "minimumForOptimizer": 0.03125,
    "maximumForOptimizer": 32768,
}
Invalid value: 0.0
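
The check that rejects `C=0.0` can be illustrated without Lale. The following is a minimal sketch, assuming the draft-04 reading of the schema printed above, in which a boolean `exclusiveMinimum` modifies `minimum`. The helper `check_minimum` is hypothetical; it is not part of Lale or of the `jsonschema` package.

```python
def check_minimum(value, schema):
    """Return True if value satisfies the draft-04 style minimum keywords."""
    if "minimum" not in schema:
        return True
    if schema.get("exclusiveMinimum", False):
        # Exclusive bound: the minimum itself is not allowed.
        return value > schema["minimum"]
    return value >= schema["minimum"]


# Relevant subset of the schema of argument C shown above.
c_schema = {"type": "number", "minimum": 0.0, "exclusiveMinimum": True}

print(check_minimum(0.0, c_schema))  # zero is excluded by the exclusive bound
print(check_minimum(1.0, c_schema))  # the default value is accepted
```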


Besides per-hyperparameter types, there are also conditional
inter-hyperparameter constraints. These are checked using the
same call to an off-the-shelf JSON Schema validator.

In [3]:
try:
    LR(LR.enum.solver.sag, LR.enum.penalty.l1)
except jsonschema.ValidationError as e:
    message = e.message
print(message, file=sys.stderr)
assert message.find('support only l2 or no penalties') != -1

Invalid configuration for LR(solver='sag', penalty='l1') due to constraint the newton-cg, sag, and lbfgs solvers support only l2 or no penalties.
Some possible fixes include:
- set penalty='l2'
Schema of failing constraint: https://lale.readthedocs.io/en/latest/modules/lale.lib.sklearn.logistic_regression.html#constraint-1
Invalid value: {'solver': 'sag', 'penalty': 'l1', 'dual': False, 'C': 1.0, 'tol': 0.0001, 'fit_intercept': True, 'intercept_scaling': 1.0, 'class_weight': None, 'random_state': None, 'max_iter': 100, 'multi_class': 'auto', 'verbose': 0, 'warm_start': False, 'n_jobs': None, 'l1_ratio': None}
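
To show how a conditional constraint can live in plain JSON Schema, here is a minimal sketch. The encoding of the constraint as an `anyOf` over two `properties` schemas is an assumption made for illustration, and `is_valid` is a hypothetical mini-evaluator that covers only the four keywords used here; a real validator handles the full specification.

```python
def is_valid(value, schema):
    """Evaluate only the enum, not, anyOf, and properties keywords."""
    if "enum" in schema and value not in schema["enum"]:
        return False
    if "not" in schema and is_valid(value, schema["not"]):
        return False
    if "anyOf" in schema and not any(is_valid(value, s) for s in schema["anyOf"]):
        return False
    if isinstance(value, dict):
        for name, sub in schema.get("properties", {}).items():
            if name in value and not is_valid(value[name], sub):
                return False
    return True


# Assumed encoding: either the solver is not one of the restricted ones,
# or the penalty is one that those solvers support.
solver_penalty_constraint = {
    "anyOf": [
        {"properties": {"solver": {"not": {"enum": ["newton-cg", "sag", "lbfgs"]}}}},
        {"properties": {"penalty": {"enum": ["l2", "none"]}}},
    ]
}

print(is_valid({"solver": "sag", "penalty": "l1"}, solver_penalty_constraint))
print(is_valid({"solver": "sag", "penalty": "l2"}, solver_penalty_constraint))
print(is_valid({"solver": "liblinear", "penalty": "l1"}, solver_penalty_constraint))
```

Writing the implication "restricted solver implies supported penalty" as a disjunction keeps the constraint declarative, so the same validator call can check it together with the per-hyperparameter types.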


Some constraints involve three hyperparameters at once, in this case `penalty`, `solver`, and `dual`.

In [4]:
try:
    LR(LR.enum.penalty.l2, LR.enum.solver.sag, dual=True)
except jsonschema.ValidationError as e:
    message = e.message
print(message, file=sys.stderr)
assert message.find('dual formulation is only implemented for') != -1

Invalid configuration for LR(penalty='l2', solver='sag', dual=True) due to constraint the dual formulation is only implemented for l2 penalty with the liblinear solver.
Some possible fixes include:
- set dual=False
Schema of failing constraint: https://lale.readthedocs.io/en/latest/modules/lale.lib.sklearn.logistic_regression.html#constraint-2
Invalid value: {'penalty': 'l2', 'solver': 'sag', 'dual': True, 'C': 1.0, 'tol': 0.0001, 'fit_intercept': True, 'intercept_scaling': 1.0, 'class_weight': None, 'random_state': None, 'max_iter': 100, 'multi_class': 'auto', 'verbose': 0, 'warm_start': False, 'n_jobs': None, 'l1_ratio': None}
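
The effect of this three-way constraint on the configuration space can be sketched by enumeration. The predicate `dual_constraint_ok` below is a hypothetical restatement of the error message, and the lists of penalties and solvers are a small illustrative subset, not the full set that `LR` accepts.

```python
from itertools import product


def dual_constraint_ok(penalty, solver, dual):
    # "dual implies (l2 penalty and liblinear solver)",
    # written as "not dual or (...)".
    return (not dual) or (penalty == "l2" and solver == "liblinear")


penalties = ["l1", "l2"]          # illustrative subset
solvers = ["liblinear", "sag"]    # illustrative subset

# Which (penalty, solver) pairs remain valid once dual=True is requested?
valid_with_dual = [
    (p, s) for p, s in product(penalties, solvers) if dual_constraint_ok(p, s, True)
]
print(valid_with_dual)  # [('l2', 'liblinear')]
```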


## Dataset error example for individual operator

Lale uses JSON Schema validation not only for hyperparameters but also
for data. The dataset `train_X` is multimodal: some columns contain
text strings whereas others contain numbers.

In [5]:
import pandas as pd
from lale.datasets.uci.uci_datasets import fetch_drugscom
train_X, train_y, test_X, test_y = fetch_drugscom()
pd.concat([train_X.head(), train_y.head()], axis=1)

| | drugName | condition | review | date | usefulCount | rating |
|---|---|---|---|---|---|---|
| 0 | Valsartan | Left Ventricular Dysfunction | "It has no side effect, I take it in combinati... | May 20, 2012 | 27 | 9.0 |
| 1 | Guanfacine | ADHD | "My son is halfway through his fourth week of ... | April 27, 2010 | 192 | 8.0 |
| 2 | Lybrel | Birth Control | "I used to take another oral contraceptive, wh... | December 14, 2009 | 17 | 5.0 |
| 3 | Ortho Evra | Birth Control | "This is my first time using any form of birth... | November 3, 2015 | 10 | 8.0 |
| 4 | Buprenorphine / naloxone | Opiate Dependence | "Suboxone has completely turned my life around... | November 27, 2016 | 37 | 9.0 |


In [6]:
# Enable schema validation for data (already done in In [1]; repeated here so the cell stands on its own).
from lale.settings import set_disable_data_schema_validation
set_disable_data_schema_validation(False)

In [7]:
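import lale.datasets.data_schemas
from lale.pretty_print import ipython_display
# Infer a JSON Schema from the pandas dataframe and pretty-print it.
ipython_display(lale.datasets.data_schemas.to_schema(train_X))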

```python
{
    "$schema": "http://json-schema.org/draft-04/schema#",
    "type": "array",
    "items": {
        "type": "array",
        "minItems": 5,
        "maxItems": 5,
        "items": [
            {"description": "drugName", "type": "string"},
            {
                "description": "condition",
                "anyOf": [{"type": "string"}, {"enum": [NaN]}],
            },
            {"description": "review", "type": "string"},
            {"description": "date", "type": "string"},
            {"description": "usefulCount", "type": "integer", "minimum": 0},
        ],
    },
    "minItems": 161297,
    "maxItems": 161297,
}
```
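
The shape of this schema (an outer array over rows, an inner fixed-length array over columns) can be reproduced with a few lines of plain Python. This is only a sketch: `infer_column_schema` and `infer_table_schema` are hypothetical helpers, not Lale's `to_schema`, and they ignore details such as missing values and the `minimum` bound.

```python
def infer_column_schema(name, values):
    """Guess a JSON Schema for one column from its Python values."""
    if all(isinstance(v, int) and not isinstance(v, bool) for v in values):
        json_type = "integer"
    elif all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in values):
        json_type = "number"
    elif all(isinstance(v, str) for v in values):
        json_type = "string"
    else:
        return {"description": name}  # mixed column: leave the type open
    return {"description": name, "type": json_type}


def infer_table_schema(column_names, rows):
    """Build an array-of-arrays schema with one item schema per column."""
    columns = list(zip(*rows))
    items = [infer_column_schema(n, col) for n, col in zip(column_names, columns)]
    return {
        "type": "array",
        "items": {
            "type": "array",
            "minItems": len(items),
            "maxItems": len(items),
            "items": items,
        },
        "minItems": len(rows),
        "maxItems": len(rows),
    }


# Two rows and three columns taken from the table above.
rows = [
    ["Valsartan", "Left Ventricular Dysfunction", 27],
    ["Guanfacine", "ADHD", 192],
]
schema = infer_table_schema(["drugName", "condition", "usefulCount"], rows)
print(schema)
```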

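Since `train_X` contains strings but `LR` expects only numbers,
validating the data against the input schema of `fit` reports a type
error. The cell below triggers that check directly via `validate_schema`.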

In [8]:
trainable_lr = LR(max_iter=1000)
try:
    LR.validate_schema(train_X, train_y)
except ValueError as e:
    message = str(e)
print(message, file=sys.stderr)
assert message.startswith('LR.fit() invalid X')

LR.fit() invalid X, the schema of the actual data is not a subschema of the expected schema of the argument.
actual_schema = {
    "$schema": "http://json-schema.org/draft-04/schema#",
    "type": "array",
    "items": {
        "type": "array",
        "minItems": 5,
        "maxItems": 5,
        "items": [
            {"description": "drugName", "type": "string"},
            {
                "description": "condition",
                "anyOf": [{"type": "string"}, {"enum": [NaN]}],
            },
            {"description": "review", "type": "string"},
            {"description": "date", "type": "string"},
            {"description": "usefulCount", "type": "integer", "minimum": 0},
        ],
    },
    "minItems": 161297,
    "maxItems": 161297,
}
expected_schema = {
    "description": "Features; the outer array is over samples.",
    "type": "array",
    "items": {"type": "array", "items": {"type": "number"}},
}
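
The phrase "not a subschema" can be made concrete with a simplified check. The sketch below assumes that the expected schema constrains every column to a single type and that the actual schema lists one item schema per column. `is_subtype` and `offending_columns` are hypothetical helpers; a full subschema check has to handle many more keywords.

```python
def is_subtype(actual_type, expected_type):
    # Every JSON integer is also a JSON number.
    return actual_type == expected_type or (
        actual_type == "integer" and expected_type == "number"
    )


def offending_columns(actual_schema, expected_schema):
    """List columns whose type is not covered by the expected element type."""
    expected_type = expected_schema["items"]["items"]["type"]
    return [
        col.get("description", "?")
        for col in actual_schema["items"]["items"]
        if not is_subtype(col.get("type"), expected_type)
    ]


# Abbreviated versions of the two schemas printed above.
actual_schema = {
    "type": "array",
    "items": {
        "type": "array",
        "items": [
            {"description": "drugName", "type": "string"},
            {"description": "condition", "anyOf": [{"type": "string"}]},
            {"description": "review", "type": "string"},
            {"description": "date", "type": "string"},
            {"description": "usefulCount", "type": "integer", "minimum": 0},
        ],
    },
}
expected_schema = {
    "type": "array",
    "items": {"type": "array", "items": {"type": "number"}},
}

print(offending_columns(actual_schema, expected_schema))
```

Only `usefulCount` passes, because an integer column fits wherever numbers are expected.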


Load a purely numerical dataset instead.

In [9]:
from lale.datasets import load_iris_df
(train_X, train_y), (test_X, test_y) = load_iris_df()
train_X.head()

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| 0 | 5.0 | 3.4 | 1.6 | 0.4 |
| 1 | 6.3 | 3.3 | 4.7 | 1.6 |
| 2 | 5.1 | 3.4 | 1.5 | 0.2 |
| 3 | 4.8 | 3.0 | 1.4 | 0.1 |
| 4 | 6.7 | 3.1 | 4.7 | 1.5 |


Training `LR` on the Iris dataset works without errors.

In [10]:
trained_lr = trainable_lr.fit(train_X, train_y)

## Lifecycle error example

Lale encourages separating the lifecycle states, here represented
by `trainable_lr` vs. `trained_lr`. The `predict` method should
only be called on a trained model.

In [11]:
predicted = trained_lr.predict(test_X)
print(f'test_y    {[*test_y]}')
print(f'predicted {[*predicted]}')

test_y    [2, 1, 1, 0, 2, 0, 1, 1, 0, 0, 1, 0, 1, 1, 2, 0, 2, 1, 1, 0, 0, 2, 2, 0, 2, 1, 0, 2, 1, 0]
predicted [2, 1, 1, 0, 2, 0, 1, 1, 0, 0, 1, 0, 1, 1, 2, 0, 2, 1, 1, 0, 0, 2, 2, 0, 2, 1, 0, 2, 1, 0]


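On the other hand, the `predict` method should not be called on a trainable model. Doing so emits a `DeprecationWarning`, which the next cell turns into an exception so that it can be caught.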

In [12]:
import warnings
# Turn DeprecationWarning into an exception so that it can be caught below.
warnings.filterwarnings("error", category=DeprecationWarning)
try:
    predicted = trainable_lr.predict(test_X)
except DeprecationWarning as w:
    message = str(w)
print(message, file=sys.stderr)
assert message.startswith('The `predict` method is deprecated on a trainable')
# The call above raised, so `predicted` still holds the result from In [11].
print(f'test_y    {[*test_y]}')
print(f'predicted {[*predicted]}')

test_y    [2, 1, 1, 0, 2, 0, 1, 1, 0, 0, 1, 0, 1, 1, 2, 0, 2, 1, 1, 0, 0, 2, 2, 0, 2, 1, 0, 2, 1, 0]
predicted [2, 1, 1, 0, 2, 0, 1, 1, 0, 0, 1, 0, 1, 1, 2, 0, 2, 1, 1, 0, 0, 2, 2, 0, 2, 1, 0, 2, 1, 0]


The `predict` method is deprecated on a trainable operator, because the learned coefficients could be accidentally overwritten by retraining. Call `predict` on the trained operator returned by `fit` instead.
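
The separation between lifecycle states can be sketched with two small classes. Everything here is hypothetical and much simpler than Lale's operators: `TrainableMean.fit` returns a distinct `TrainedMean` object, and calling `predict` on the trainable object emits a `DeprecationWarning`, which the usage code turns into an exception in the same way as the cell above.

```python
import warnings


class TrainedMean:
    """Trained state: holds the learned value and can predict."""

    def __init__(self, mean):
        self._mean = mean

    def predict(self, X):
        return [self._mean for _ in X]


class TrainableMean:
    """Trainable state: can be fit, which returns a separate trained object."""

    def __init__(self):
        self._last_trained = None

    def fit(self, X, y):
        self._last_trained = TrainedMean(sum(y) / len(y))
        return self._last_trained

    def predict(self, X):
        warnings.warn(
            "predict on a trainable operator is deprecated; "
            "call predict on the trained operator returned by fit",
            DeprecationWarning,
            stacklevel=2,
        )
        if self._last_trained is None:
            raise ValueError("must call fit before predict")
        return self._last_trained.predict(X)


trainable = TrainableMean()
trained = trainable.fit([[0], [1]], [1.0, 3.0])
print(trained.predict([[2]]))  # [2.0]

caught = None
with warnings.catch_warnings():
    # Scope the "error" filter to this block instead of changing it globally.
    warnings.simplefilter("error", category=DeprecationWarning)
    try:
        trainable.predict([[2]])
    except DeprecationWarning as w:
        caught = str(w)
print(caught)
```

Using `warnings.catch_warnings()` restores the previous filters on exit, which avoids the notebook-wide side effect of a bare `warnings.filterwarnings("error", ...)`.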


## Delegate error example

`LogisticRegression` is an estimator and therefore does not have a
`transform` method, even when trained.

In [None]:
try:
    trained_lr.transform(train_X)
except AttributeError as e:
    message = 'AttributeError'
    print(message, file=sys.stderr)
assert message.startswith('AttributeError')
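
One way such a delegate error can arise is through attribute forwarding. The sketch below is an assumption about the general pattern, not Lale's implementation: a hypothetical `Wrapper` forwards unknown attribute lookups to the wrapped object via `__getattr__`, so a method that the wrapped estimator lacks surfaces as an `AttributeError`.

```python
class Impl:
    """Stand-in for an estimator: it has predict but no transform."""

    def predict(self, X):
        return [0 for _ in X]


class Wrapper:
    """Forwards attribute lookups it cannot resolve itself to the impl."""

    def __init__(self, impl):
        self._impl = impl

    def __getattr__(self, name):
        # __getattr__ runs only when normal lookup fails. Private names are
        # not forwarded, which also avoids recursion if _impl is not set yet.
        if name.startswith("_"):
            raise AttributeError(name)
        try:
            return getattr(self._impl, name)
        except AttributeError:
            raise AttributeError(
                f"{type(self._impl).__name__} has no attribute {name!r}"
            ) from None


wrapped = Wrapper(Impl())
print(wrapped.predict([[1], [2]]))  # [0, 0]

try:
    wrapped.transform([[1]])
except AttributeError as e:
    message = str(e)
print(message)
```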