{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Demo Import from Sklearn with Schemas from Lale\n",
"\n",
"This notebook shows how to use Lale directly with sklearn operators.\n",
"The function `lale.wrap_imported_operators()` automatically wraps the\n",
"imported sklearn operators that Lale knows into Lale operators."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Usability\n",
"\n",
"To make Lale easy to learn and use, its APIs imitate those of\n",
"[sklearn](https://scikit-learn.org/): operators are constructed with\n",
"hyperparameters (init), trained with `fit`, applied with `predict`,\n",
"and composed into pipelines."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"truth [6, 9, 3, 7, 2, 1, 5, 2, 5, 2, 1, 9, 4, 0, 4, 2, 3, 7, 8, 8]\n"
]
}
],
"source": [
"import sklearn.datasets\n",
"import sklearn.model_selection\n",
"digits = sklearn.datasets.load_digits()\n",
"X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(\n",
"    digits.data, digits.target, test_size=0.2, random_state=42)\n",
"print(f'truth {y_test.tolist()[:20]}')"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"actual [6, 9, 3, 7, 2, 2, 5, 2, 5, 2, 1, 4, 4, 0, 4, 2, 3, 7, 8, 8]\n"
]
}
],
"source": [
"import lale\n",
"from sklearn.linear_model import LogisticRegression as LR\n",
"lale.wrap_imported_operators()\n",
"\n",
"trainable_lr = LR(LR.solver.lbfgs, C=0.0001)\n",
"trained_lr = trainable_lr.fit(X_train, y_train)\n",
"predictions = trained_lr.predict(X_test)\n",
"print(f'actual {predictions.tolist()[:20]}')"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"accuracy 91.7%\n"
]
}
],
"source": [
"from sklearn.metrics import accuracy_score\n",
"print(f'accuracy {accuracy_score(y_test, predictions):.1%}')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Correctness\n",
"\n",
"Lale uses [JSON Schema](https://json-schema.org/) to check for valid\n",
"hyperparameters. These schemas enable not just validation but also\n",
"interactive documentation. Because validation and documentation both\n",
"come from the same schema, a single source of truth, the documentation\n",
"is correct by construction."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Invalid configuration for LR(solver='adam', C=0.01) due to invalid value solver=adam.\n",
"Schema of argument solver: {\n",
" 'description': 'Algorithm for optimization problem.',\n",
" 'enum': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'],\n",
" 'default': 'liblinear'}\n",
"Value: adam\n"
]
}
],
"source": [
"from jsonschema import ValidationError\n",
"try:\n",
"    lale_lr = LR(solver='adam', C=0.01)\n",
"except ValidationError as e:\n",
"    print(e.message)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'description': 'Inverse regularization strength. Smaller values specify stronger regularization.',\n",
" 'type': 'number',\n",
" 'distribution': 'loguniform',\n",
" 'minimum': 0.0,\n",
" 'exclusiveMinimum': True,\n",
" 'default': 1.0,\n",
" 'minimumForOptimizer': 0.03125,\n",
" 'maximumForOptimizer': 32768}"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"LR.hyperparam_schema('C')"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'solver': 'liblinear',\n",
" 'penalty': 'l2',\n",
" 'dual': False,\n",
" 'C': 1.0,\n",
" 'tol': 0.0001,\n",
" 'fit_intercept': True,\n",
" 'intercept_scaling': 1.0,\n",
" 'class_weight': None,\n",
" 'random_state': None,\n",
" 'max_iter': 100,\n",
" 'multi_class': 'ovr',\n",
" 'verbose': 0,\n",
" 'warm_start': False,\n",
" 'n_jobs': None}"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"LR.hyperparam_defaults()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Automation\n",
"\n",
"Lale includes a compiler that converts types (expressed as JSON\n",
"Schema) to optimizer search spaces. It currently has back-ends for\n",
"[hyperopt](http://hyperopt.github.io/hyperopt/),\n",
"[GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html), and\n",
"[SMAC](http://www.automl.org/smac/).\n",
"Support for other forms of AI automation, using other tools, is under\n",
"active development."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"100%|██████████| 10/10 [00:02<00:00, 4.48trial/s, best loss: -0.9777777777777777]\n",
"best hyperparams {'C': 1685.8724563983353, 'class_weight': None, 'dual': False, 'fit_intercept': True, 'multi_class': 'auto', 'penalty': 'l2', 'solver': 'sag', 'tol': 0.026043867770646253}\n",
"\n",
"accuracy 97.8%\n"
]
}
],
"source": [
"from lale.search.op2hp import hyperopt_search_space\n",
"from hyperopt import STATUS_OK, Trials, fmin, tpe, space_eval\n",
"import lale.helpers\n",
"import warnings\n",
"warnings.filterwarnings(\"ignore\")\n",
"\n",
"def objective(hyperparams):\n",
"    # Train LR with one sampled configuration. Hyperopt minimizes the loss,\n",
"    # so return the negative accuracy. For brevity this demo scores on the\n",
"    # test split; use a validation split or cross-validation in practice.\n",
"    trainable = LR(**lale.helpers.dict_without(hyperparams, 'name'))\n",
"    trained = trainable.fit(X_train, y_train)\n",
"    predictions = trained.predict(X_test)\n",
"    return {'loss': -accuracy_score(y_test, predictions), 'status': STATUS_OK}\n",
"\n",
"search_space = hyperopt_search_space(LR)\n",
"\n",
"trials = Trials()\n",
"fmin(objective, search_space, algo=tpe.suggest, max_evals=10, trials=trials)\n",
"best_hps = space_eval(search_space, trials.argmin)\n",
"print(f'best hyperparams {lale.helpers.dict_without(best_hps, \"name\")}\\n')\n",
"print(f'accuracy {-min(trials.losses()):.1%}')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Composition\n",
"\n",
"Lale supports composite models, which resemble sklearn pipelines but are\n",
"more expressive.\n",
"\n",
"| Symbol | Name | Description | Sklearn feature |\n",
"| ------ | ---- | ----------- | --------------- |\n",
"| >> | pipe | Feed to next | `make_pipeline` |\n",
"| & | and | Run both | `make_union`, includes concat |\n",
"| \\| | or | Choose one | (missing) |"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"image/svg+xml": [
"\n",
"\n",
"\n",
"\n",
"\n"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"from sklearn.decomposition import PCA\n",
"from sklearn.svm import SVC\n",
"from lale.lib.lale import ConcatFeatures as Cat\n",
"from lale.lib.lale import NoOp\n",
"# Wrap again so that the newly imported PCA and SVC become Lale operators.\n",
"lale.wrap_imported_operators()\n",
"\n",
"optimizable = (PCA & NoOp) >> Cat >> (LR | SVC)\n",
"optimizable.visualize()"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"image/svg+xml": [
"\n",
"\n",
"\n",
"\n",
"\n"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"from lale.operators import make_pipeline, make_union, make_choice\n",
"optimizable = make_pipeline(make_union(PCA, NoOp), make_choice(LR, SVC))\n",
"optimizable.visualize()"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"100%|██████████| 10/10 [00:31<00:00, 3.18s/trial, best loss: -0.9902556438206324]\n"
]
}
],
"source": [
"import lale.lib.lale.hyperopt\n",
"Optimizer = lale.lib.lale.hyperopt.Hyperopt\n",
"trained = optimizable.auto_configure(X_train, y_train, optimizer=Optimizer, max_evals=10)"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"accuracy 98.9%\n"
]
},
{
"data": {
"image/svg+xml": [
"\n",
"\n",
"\n",
"\n",
"\n"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"predictions = trained.predict(X_test)\n",
"print(f'accuracy {accuracy_score(y_test, predictions):.1%}')\n",
"trained.visualize()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Input and Output Schemas\n",
"\n",
"Besides schemas for hyperparameters, Lale also provides operator tags\n",
"and schemas for input and output data of operators."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'pre': ['~categoricals'],\n",
" 'op': ['estimator', 'classifier', 'interpretable'],\n",
" 'post': ['probabilities']}"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"LR.get_tags()"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'$schema': 'http://json-schema.org/draft-04/schema#',\n",
" 'type': 'object',\n",
" 'required': ['X', 'y'],\n",
" 'additionalProperties': False,\n",
" 'properties': {'X': {'description': 'Features; the outer array is over samples.',\n",
" 'type': 'array',\n",
" 'items': {'type': 'array', 'items': {'type': 'number'}}},\n",
" 'y': {'description': 'Target class labels; the array is over samples.',\n",
" 'anyOf': [{'type': 'array', 'items': {'type': 'number'}},\n",
" {'type': 'array', 'items': {'type': 'string'}}]}}}"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"LR.get_schema('input_fit')"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'$schema': 'http://json-schema.org/draft-04/schema#',\n",
" 'description': 'Predicted class label per sample.',\n",
" 'anyOf': [{'type': 'array', 'items': {'type': 'number'}},\n",
" {'type': 'array', 'items': {'type': 'string'}}]}"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"LR.get_schema('output_predict')"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.3"
}
},
"nbformat": 4,
"nbformat_minor": 2
}