{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Demo Import from Sklearn with Schemas from Lale\n", "\n", "This notebook shows how to use Lale directly with sklearn operators.\n", "The function `lale.wrap_imported_operators()` will automatically wrap\n", "known sklearn operators into Lale operators." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Usability\n", "\n", "To make Lale easy to learn and use, its APIs imitate those of\n", "[sklearn](https://scikit-learn.org/), with init, fit, and predict,\n", "and with pipelines." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "truth [6, 9, 3, 7, 2, 1, 5, 2, 5, 2, 1, 9, 4, 0, 4, 2, 3, 7, 8, 8]\n" ] } ], "source": [ "import sklearn.datasets\n", "import sklearn.model_selection\n", "digits = sklearn.datasets.load_digits()\n", "X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(\n", " digits.data, digits.target, test_size=0.2, random_state=42)\n", "print(f'truth {y_test.tolist()[:20]}')" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "actual [6, 9, 3, 7, 2, 2, 5, 2, 5, 2, 1, 4, 4, 0, 4, 2, 3, 7, 8, 8]\n" ] } ], "source": [ "import lale\n", "from sklearn.linear_model import LogisticRegression as LR\n", "lale.wrap_imported_operators()\n", "\n", "trainable_lr = LR(LR.enum.solver.lbfgs, C=0.0001)\n", "trained_lr = trainable_lr.fit(X_train, y_train)\n", "predictions = trained_lr.predict(X_test)\n", "print(f'actual {predictions.tolist()[:20]}')" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "accuracy 92.2%\n" ] } ], "source": [ "from sklearn.metrics import accuracy_score\n", "print(f'accuracy {accuracy_score(y_test, predictions):.1%}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 
Correctness\n", "\n", "Lale uses [JSON Schema](https://json-schema.org/) to check for valid\n", "hyperparameters. These schemas enable not just validation but also\n", "interactive documentation. Thanks to using a single source of truth, the\n", "documentation is correct by construction." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Invalid configuration for LR(solver='adam', C=0.01) due to invalid value solver=adam.\n", "Schema of argument solver: {\n", " \"default\": \"lbfgs\",\n", " \"description\": \"Algorithm for optimization problem.\",\n", " \"enum\": [\"newton-cg\", \"lbfgs\", \"liblinear\", \"sag\", \"saga\"],\n", "}\n", "Value: adam\n" ] } ], "source": [ "from jsonschema import ValidationError\n", "try:\n", " lale_lr = LR(solver='adam', C=0.01)\n", "except ValidationError as e:\n", " print(e.message)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'description': 'Inverse regularization strength. 
Smaller values specify stronger regularization.',\n", " 'type': 'number',\n", " 'distribution': 'loguniform',\n", " 'minimum': 0.0,\n", " 'exclusiveMinimum': True,\n", " 'default': 1.0,\n", " 'minimumForOptimizer': 0.03125,\n", " 'maximumForOptimizer': 32768}" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "LR.hyperparam_schema('C')" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "mappingproxy({'solver': 'lbfgs',\n", " 'penalty': 'l2',\n", " 'dual': False,\n", " 'C': 1.0,\n", " 'tol': 0.0001,\n", " 'fit_intercept': True,\n", " 'intercept_scaling': 1.0,\n", " 'class_weight': None,\n", " 'random_state': None,\n", " 'max_iter': 100,\n", " 'multi_class': 'auto',\n", " 'verbose': 0,\n", " 'warm_start': False,\n", " 'n_jobs': None,\n", " 'l1_ratio': None})" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "LR.get_defaults()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Automation\n", "\n", "Lale includes a compiler that converts types (expressed as JSON\n", "Schema) to optimizer search spaces. It currently has back-ends for\n", "[hyperopt](http://hyperopt.github.io/hyperopt/),\n", "[GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html), and\n", "[SMAC](http://www.automl.org/smac/).\n", "We are also actively working on other forms of AI automation\n", "that build on additional tools." 
] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "100%|██████████| 10/10 [00:03<00:00, 2.93trial/s, best loss: -0.975]\n", "best hyperparams {'dual': False, 'fit_intercept': False, 'intercept_scaling': 0.03784617564805115, 'max_iter': 99, 'multi_class': 'auto', 'solver': 'saga', 'tol': 0.005801390831569728}\n", "\n", "accuracy 97.5%\n" ] } ], "source": [ "from lale.search.op2hp import hyperopt_search_space\n", "from hyperopt import STATUS_OK, Trials, fmin, tpe, space_eval\n", "import lale.helpers\n", "import warnings\n", "warnings.filterwarnings(\"ignore\")\n", "\n", "def objective(hyperparams):\n", "    trainable = LR(**lale.helpers.dict_without(hyperparams, 'name'))\n", "    trained = trainable.fit(X_train, y_train)\n", "    predictions = trained.predict(X_test)\n", "    return {'loss': -accuracy_score(y_test, predictions), 'status': STATUS_OK}\n", "\n", "search_space = hyperopt_search_space(LR)\n", "\n", "trials = Trials()\n", "fmin(objective, search_space, algo=tpe.suggest, max_evals=10, trials=trials)\n", "best_hps = space_eval(search_space, trials.argmin)\n", "print(f'best hyperparams {lale.helpers.dict_without(best_hps, \"name\")}\\n')\n", "print(f'accuracy {-min(trials.losses()):.1%}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Composition\n", "\n", "Lale supports composite models, which resemble sklearn pipelines but are\n", "more expressive.\n", "\n", "| Symbol | Name | Description | Sklearn feature |\n", "| ------ | ---- | ------------ | --------------- |\n", "| >> | pipe | Feed to next | `make_pipeline` |\n", "| & | and | Run both | `make_union`, includes concat |\n", "| \\| | or | Choose one | (missing) |" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "cluster:(root)\n", "\n", "\n", "\n", "\n", "\n", "cluster:choice\n", "\n", "\n", "Choice\n", "\n", "\n", "\n", 
"\n", "\n", "pca\n", "\n", "\n", "PCA\n", "\n", "\n", "\n", "\n", "\n", "cat\n", "\n", "\n", "Cat\n", "\n", "\n", "\n", "\n", "\n", "pca->cat\n", "\n", "\n", "\n", "\n", "\n", "no_op\n", "\n", "\n", "No-\n", "Op\n", "\n", "\n", "\n", "\n", "\n", "no_op->cat\n", "\n", "\n", "\n", "\n", "\n", "lr\n", "\n", "\n", "LR\n", "\n", "\n", "\n", "\n", "\n", "cat->lr\n", "\n", "\n", "\n", "\n", "\n", "svc\n", "\n", "\n", "SVC\n", "\n", "\n", "\n", "\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from sklearn.decomposition import PCA\n", "from sklearn.svm import SVC\n", "from lale.lib.lale import ConcatFeatures as Cat\n", "from lale.lib.lale import NoOp\n", "lale.wrap_imported_operators()\n", "\n", "optimizable = (PCA & NoOp) >> Cat >> (LR | SVC)\n", "optimizable.visualize()" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "cluster:(root)\n", "\n", "\n", "\n", "\n", "\n", "cluster:choice\n", "\n", "\n", "Choice\n", "\n", "\n", "\n", "\n", "\n", "pca\n", "\n", "\n", "PCA\n", "\n", "\n", "\n", "\n", "\n", "cat\n", "\n", "\n", "Cat\n", "\n", "\n", "\n", "\n", "\n", "pca->cat\n", "\n", "\n", "\n", "\n", "\n", "no_op\n", "\n", "\n", "No-\n", "Op\n", "\n", "\n", "\n", "\n", "\n", "no_op->cat\n", "\n", "\n", "\n", "\n", "\n", "lr\n", "\n", "\n", "LR\n", "\n", "\n", "\n", "\n", "\n", "cat->lr\n", "\n", "\n", "\n", "\n", "\n", "svc\n", "\n", "\n", "SVC\n", "\n", "\n", "\n", "\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from lale.operators import make_pipeline, make_union, make_choice\n", "optimizable = make_pipeline(make_union(PCA, NoOp), make_choice(LR, SVC))\n", "optimizable.visualize()" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "100%|██████████| 10/10 [00:41<00:00, 4.18s/trial, best loss: 
-0.982597754548974]\n", "1 out of 10 trials failed, call summary() for details.\n", "Run with verbose=True to see per-trial exceptions.\n" ] } ], "source": [ "import lale.lib.lale.hyperopt\n", "Optimizer = lale.lib.lale.hyperopt.Hyperopt\n", "trained = optimizable.auto_configure(X_train, y_train, optimizer=Optimizer, max_evals=10)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "accuracy 98.9%\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "cluster:(root)\n", "\n", "\n", "\n", "\n", "\n", "\n", "pca\n", "\n", "\n", "PCA\n", "\n", "\n", "\n", "\n", "\n", "cat\n", "\n", "\n", "Cat\n", "\n", "\n", "\n", "\n", "\n", "pca->cat\n", "\n", "\n", "\n", "\n", "\n", "no_op\n", "\n", "\n", "No-\n", "Op\n", "\n", "\n", "\n", "\n", "\n", "no_op->cat\n", "\n", "\n", "\n", "\n", "\n", "svc\n", "\n", "\n", "SVC\n", "\n", "\n", "\n", "\n", "\n", "cat->svc\n", "\n", "\n", "\n", "\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "predictions = trained.predict(X_test)\n", "print(f'accuracy {accuracy_score(y_test, predictions):.1%}')\n", "trained.visualize()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Input and Output Schemas\n", "\n", "Besides schemas for hyperparameters, Lale also provides operator tags\n", "and schemas for input and output data of operators." 
] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'pre': ['~categoricals'],\n", " 'op': ['estimator', 'classifier', 'interpretable'],\n", " 'post': ['probabilities']}" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "LR.get_tags()" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'type': 'object',\n", " 'required': ['X', 'y'],\n", " 'additionalProperties': False,\n", " 'properties': {'X': {'description': 'Features; the outer array is over samples.',\n", " 'type': 'array',\n", " 'items': {'type': 'array', 'items': {'type': 'number'}}},\n", " 'y': {'description': 'Target class labels; the array is over samples.',\n", " 'anyOf': [{'type': 'array', 'items': {'type': 'number'}},\n", " {'type': 'array', 'items': {'type': 'string'}},\n", " {'type': 'array', 'items': {'type': 'boolean'}}]}}}" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "LR.get_schema('input_fit')" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'description': 'Predicted class label per sample.',\n", " 'anyOf': [{'type': 'array', 'items': {'type': 'number'}},\n", " {'type': 'array', 'items': {'type': 'string'}},\n", " {'type': 'array', 'items': {'type': 'boolean'}}]}" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "LR.get_schema('output_predict')" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.2" } }, "nbformat": 4, "nbformat_minor": 2 }