{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Lale: Library for Semi-Automated Data Science\n", "\n", "Martin Hirzel, Kiran Kate, Avi Shinnar, Guillaume Baudart, and Pari Ram\n", "\n", "5 November 2019\n", "\n", "Examples, documentation, code: https://github.com/ibm/lale" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This notebook is the basis for a\n", "[talk](https://pydata.org/nyc2019/schedule/presentation/29/type-driven-automated-learning-with-lale/)\n", "about [Lale](https://github.com/ibm/lale). Lale is an open-source\n", "Python library for semi-automated data science. Lale is compatible\n", "with scikit-learn, adding a simple interface to existing\n", "machine-learning automation tools. Lale lets you search over possible\n", "pipelines in just a few lines of code while remaining in control of\n", "your work." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Value Proposition\n", "\n", "When writing machine-learning pipelines, you have a lot of decisions\n", "to make, such as picking transformers, estimators, and\n", "hyperparameters. Since some of these decisions are tricky, you will\n", "likely find yourself searching over many possible pipelines.\n", "Machine-learning automation tools help with this search.\n", "Unfortunately, each of these tools has its own API, and the search\n", "spaces are not necessarily consistent or even correct. We have\n", "discovered that types (such as enum, float, or dictionary) can both\n", "check the correctness of, and help automatically search over,\n", "hyperparameters and pipeline configurations.\n", "\n", "To address this issue, we have released Lale, an open-source\n", "Python library for semi-automated data science. 
Lale is compatible\n", "with scikit-learn, adding a simple interface to existing\n", "machine-learning automation tools.\n", "Lale is designed to augment, but not replace, the data scientist.\n", "\n", "The **target user** of Lale is the working data scientist.\n", "The **scope** of Lale includes machine learning (both deep learning\n", "and non-DL) and data preparation. The **value** of Lale encompasses:\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Categorical + Continuous Dataset\n", "\n", "To demonstrate automated machine learning (AutoML), we first need\n", "a dataset. We will use a tabular dataset that has two kinds of\n", "features: categorical features (columns that can contain one of a\n", "small set of strings) and continuous features (numerical columns). In\n", "particular, we use the `credit-g` dataset from OpenML. After fetching\n", "the data, we display a few rows to get a better understanding of its\n", "labels and features." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: liac-arff>=2.4.0 in /usr/local/lib/python3.7/site-packages (2.4.0)\n", "\u001b[33mWARNING: You are using pip version 19.3.1; however, version 21.0.1 is available.\n", "You should consider upgrading via the 'pip install --upgrade pip' command.\u001b[0m\n" ] } ], "source": [ "!pip install 'liac-arff>=2.4.0'" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n",
"    <tr><th></th><th>class</th><th>checking_status</th><th>duration</th><th>credit_history</th><th>purpose</th><th>credit_amount</th><th>savings_status</th><th>employment</th><th>installment_commitment</th><th>personal_status</th><th>other_parties</th><th>residence_since</th><th>property_magnitude</th><th>age</th><th>other_payment_plans</th><th>housing</th><th>existing_credits</th><th>job</th><th>num_dependents</th><th>own_telephone</th><th>foreign_worker</th></tr>\n",
"  </thead>\n", "  <tbody>\n",
"    <tr><th>835</th><td>bad</td><td>&lt;0</td><td>12.0</td><td>no credits/all paid</td><td>new car</td><td>1082.0</td><td>&lt;100</td><td>1&lt;=X&lt;4</td><td>4.0</td><td>male single</td><td>none</td><td>4.0</td><td>car</td><td>48.0</td><td>bank</td><td>own</td><td>2.0</td><td>skilled</td><td>1.0</td><td>none</td><td>yes</td></tr>\n",
"    <tr><th>192</th><td>bad</td><td>0&lt;=X&lt;200</td><td>27.0</td><td>existing paid</td><td>business</td><td>3915.0</td><td>&lt;100</td><td>1&lt;=X&lt;4</td><td>4.0</td><td>male single</td><td>none</td><td>2.0</td><td>car</td><td>36.0</td><td>none</td><td>own</td><td>1.0</td><td>skilled</td><td>2.0</td><td>yes</td><td>yes</td></tr>\n",
"    <tr><th>629</th><td>good</td><td>no checking</td><td>9.0</td><td>existing paid</td><td>education</td><td>3832.0</td><td>no known savings</td><td>&gt;=7</td><td>1.0</td><td>male single</td><td>none</td><td>4.0</td><td>real estate</td><td>64.0</td><td>none</td><td>own</td><td>1.0</td><td>unskilled resident</td><td>1.0</td><td>none</td><td>yes</td></tr>\n",
"    <tr><th>559</th><td>bad</td><td>0&lt;=X&lt;200</td><td>18.0</td><td>critical/other existing credit</td><td>furniture/equipment</td><td>1928.0</td><td>&lt;100</td><td>&lt;1</td><td>2.0</td><td>male single</td><td>none</td><td>2.0</td><td>real estate</td><td>31.0</td><td>none</td><td>own</td><td>2.0</td><td>unskilled resident</td><td>1.0</td><td>none</td><td>yes</td></tr>\n",
"    <tr><th>684</th><td>good</td><td>0&lt;=X&lt;200</td><td>36.0</td><td>delayed previously</td><td>business</td><td>9857.0</td><td>100&lt;=X&lt;500</td><td>4&lt;=X&lt;7</td><td>1.0</td><td>male single</td><td>none</td><td>3.0</td><td>life insurance</td><td>31.0</td><td>none</td><td>own</td><td>2.0</td><td>unskilled resident</td><td>2.0</td><td>yes</td><td>yes</td></tr>\n",
"  </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " class checking_status duration credit_history \\\n", "835 bad <0 12.0 no credits/all paid \n", "192 bad 0<=X<200 27.0 existing paid \n", "629 good no checking 9.0 existing paid \n", "559 bad 0<=X<200 18.0 critical/other existing credit \n", "684 good 0<=X<200 36.0 delayed previously \n", "\n", " purpose credit_amount savings_status employment \\\n", "835 new car 1082.0 <100 1<=X<4 \n", "192 business 3915.0 <100 1<=X<4 \n", "629 education 3832.0 no known savings >=7 \n", "559 furniture/equipment 1928.0 <100 <1 \n", "684 business 9857.0 100<=X<500 4<=X<7 \n", "\n", " installment_commitment personal_status other_parties residence_since \\\n", "835 4.0 male single none 4.0 \n", "192 4.0 male single none 2.0 \n", "629 1.0 male single none 4.0 \n", "559 2.0 male single none 2.0 \n", "684 1.0 male single none 3.0 \n", "\n", " property_magnitude age other_payment_plans housing existing_credits \\\n", "835 car 48.0 bank own 2.0 \n", "192 car 36.0 none own 1.0 \n", "629 real estate 64.0 none own 1.0 \n", "559 real estate 31.0 none own 2.0 \n", "684 life insurance 31.0 none own 2.0 \n", "\n", " job num_dependents own_telephone foreign_worker \n", "835 skilled 1.0 none yes \n", "192 skilled 2.0 yes yes \n", "629 unskilled resident 1.0 none yes \n", "559 unskilled resident 1.0 none yes \n", "684 unskilled resident 2.0 yes yes " ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import lale.datasets.openml\n", "import pandas as pd\n", "(train_X, train_y), (test_X, test_y) = lale.datasets.openml.fetch(\n", " 'credit-g', 'classification', preprocess=False)\n", "# print last five rows of labels in train_y and features in train_X\n", "pd.options.display.max_columns = None\n", "pd.concat([train_y.tail(5), train_X.tail(5)], axis=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The labels `y` are either `good` or `bad`, which means that this is a binary\n", "classification task. 
The remaining 20 columns are features. Some of the\n", "features are categorical, such as `checking_status` or\n", "`credit_history`. Other features are continuous, such as `duration` or\n", "`credit_amount`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Manual Pipeline\n", "\n", "Lale is designed to support both manual and automated data science,\n", "with a consistent API that works for both of these cases as well as\n", "for the spectrum of semi-automated cases in between. A\n", "machine-learning *pipeline* is a computational graph, where each\n", "node is an *operator* (which transforms data or makes predictions)\n", "and each edge is an intermediary *dataset* (outputs from previous\n", "operators are piped as inputs to the next operators). We first import\n", "some operators from scikit-learn and ask Lale to wrap them for us." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "from sklearn.preprocessing import Normalizer as Norm\n", "from sklearn.preprocessing import OneHotEncoder as OneHot\n", "from sklearn.linear_model import LogisticRegression as LR\n", "import lale\n", "lale.wrap_imported_operators()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The next cell imports a couple of utility operators from Lale's\n", "standard library." 
] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/usr/local/lib/python3.7/site-packages/lightgbm/__init__.py:46: UserWarning: Starting from version 2.2.1, the library file in distribution wheels for macOS is built by the Apple Clang (Xcode_8.3.3) compiler.\n", "This means that in case of installing LightGBM from PyPI via the ``pip install lightgbm`` command, you don't need to install the gcc compiler anymore.\n", "Instead of that, you need to install the OpenMP library, which is required for running LightGBM on the system with the Apple Clang compiler.\n", "You can install the OpenMP library by the following command: ``brew install libomp``.\n", " \"You can install the OpenMP library by the following command: ``brew install libomp``.\", UserWarning)\n", "/usr/local/lib/python3.7/site-packages/pyparsing.py:3190: FutureWarning: Possible set intersection at position 3\n", " self.re = re.compile(self.reString)\n" ] } ], "source": [ "from lale.lib.lale import Project\n", "from lale.lib.lale import ConcatFeatures as Concat" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we have all the operators we need, we arrange them into a\n", "pipeline. The `Project` operator works like the corresponding\n", "relational algebra primitive, picking a subset of the columns of the\n", "dataset. A sub-pipeline of the form `op1 >> op2` *pipes* the output\n", "from `op1` into `op2`. A sub-pipeline of the form `op1 & op2` causes\n", "both `op1` *and* `op2` to execute on the same data. The overall pipeline\n", "preprocesses numbers with a normalizer and strings with a one-hot\n", "encoder, then concatenates the corresponding columns and pipes the\n", "result into an `LR` classifier." 
] }, { "cell_type": "code", "execution_count": 5, "metadata": { "scrolled": true }, "outputs": [ { "data": { "image/svg+xml": [ "[Figure: pipeline graph with edges project_0 -> norm, norm -> concat, project_1 -> one_hot, one_hot -> concat, concat -> lr]" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "manual_trainable = (\n", " ( Project(columns={'type': 'number'}) >> Norm()\n", " & Project(columns={'type': 'string'}) >> OneHot())\n", " >> Concat\n", " >> LR(LR.enum.penalty.l2, C=0.001))\n", "manual_trainable.visualize()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this example, we have manually chosen all operators and configured\n", "their hyperparameters. For the `LR` operator, we configured the\n", "penalty hyperparameter with `l2` and the regularization constant `C`\n", "with `0.001`. Depending on what tool you use to view this notebook,\n", "you can explore these hyperparameters by hovering over\n", "the visualization above and observing the tooltips that pop up.\n", "Furthermore, each node in the visualization is a hyperlink that takes\n", "you to the documentation of the corresponding operator. Calling `fit`\n", "on the trainable pipeline returns a trained pipeline, and calling\n", "`predict` on the trained pipeline returns predictions. 
We can use\n", "off-the-shelf scikit-learn metrics to evaluate the result. In this\n", "case, the accuracy is poor." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "accuracy 70.9%\n" ] } ], "source": [ "import sklearn.metrics\n", "manual_trained = manual_trainable.fit(train_X, train_y)\n", "manual_y = manual_trained.predict(test_X)\n", "print(f'accuracy {sklearn.metrics.accuracy_score(test_y, manual_y):.1%}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Pipeline Combinators\n", "\n", "As the previous example demonstrates, Lale provides combinators `>>`\n", "and `&` for arranging operators into a pipeline. These combinators are\n", "actually syntactic sugar for functions `make_pipeline` and\n", "`make_union`, which were inspired by the corresponding functions in\n", "scikit-learn. To support AutoML, Lale adds a third combinator `|`\n", "that specifies an algorithmic choice. Scikit-learn does not have a\n", "corresponding function. Various AutoML tools support algorithmic\n", "choice in some form or other, using their own tool-specific syntax.\n", "Lale's `|` combinator is syntactic sugar for Lale's `make_choice`\n", "function.\n", "\n", "| Lale feature | Name | Description | Scikit-learn feature |\n", "| ----------------------- | ---- | ------------ | ----------------------------------- |\n", "| `>>` or `make_pipeline` | pipe | feed to next | `make_pipeline` |\n", "| `&` or `make_union` | and | run both | `make_union` or `ColumnTransformer` |\n", "| `\\|` or `make_choice` | or | choose one | N/A (specific to given AutoML tool) |" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Automated Pipeline\n", "\n", "Next, we will use automation to search for a better pipeline for the\n", "`credit-g` dataset. Specifically, we will perform combined\n", "algorithm selection and hyperparameter optimization (CASH). 
To do\n", "this, we first import a couple more operators to serve as choices in\n", "the algorithm selection." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "from xgboost import XGBClassifier as XGBoost\n", "from sklearn.svm import LinearSVC\n", "lale.wrap_imported_operators()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we arrange the pipeline that specifies the search space for\n", "CASH. As promised, the look-and-feel for the automated case resembles\n", "that for the manual case from earlier. The main difference is the use\n", "of the `|` combinator to provide algorithmic choice\n", "(`LR | XGBoost | LinearSVC`). However, there is another difference: several of the\n", "operators are not configured with hyperparameters. For instance, this\n", "pipeline writes `LR` instead of `LR(LR.enum.penalty.l2, C=0.001)`.\n", "Omitting the arguments means that as the data scientist, we do not\n", "bind the hyperparameters by hand in this pipeline. Instead, we leave\n", "these bindings free for CASH to search over. Similarly, the code\n", "contains `Norm` instead of `Norm()` and `OneHot` instead of `OneHot()`.\n", "This illustrates that automated hyperparameter tuning can be used not\n", "only on the final classifier but also inside of nested preprocessing\n", "sub-pipelines." 
] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "image/svg+xml": [ "[Figure: pipeline graph with edges project_0 -> norm, norm -> concat, project_1 -> one_hot, one_hot -> concat, concat -> lr; a Choice cluster groups lr, xg_boost, and linear_svc]" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "auto_planned = (\n", " ( Project(columns={'type': 'number'}) >> Norm\n", " & Project(columns={'type': 'string'}) >> OneHot)\n", " >> Concat\n", " >> (LR | XGBoost | LinearSVC))\n", "auto_planned.visualize()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The colors in the visualization indicate the particular mix of manual\n", "vs. automated bindings and will be explained in more detail below.\n", "For now, we look at how to actually invoke an AutoML tool from Lale.\n", "Lale provides bindings for multiple such tools; here we use the\n", "popular hyperopt open-source library. Specifically,\n", "`Hyperopt` takes the `auto_planned` pipeline from above as\n", "an argument, along with optional specifications for the number of\n", "cross-validation folds and trials to run. 
Calling `fit` yields a\n", "trained pipeline. The code that uses that trained pipeline for\n", "prediction and evaluation is the same as in the manual use case we saw\n", "before." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[15:01:49] WARNING: /Users/travis/build/dmlc/xgboost/src/learner.cc:1061: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.\n", "100%|██████████| 10/10 [01:01<00:00, 6.12s/trial, best loss: -0.7537102284860132]\n", "accuracy 71.2%\n" ] } ], "source": [ "from lale.lib.lale.hyperopt import Hyperopt\n", "auto_optimizer = Hyperopt(estimator=auto_planned, cv=3, max_evals=10)\n", "auto_trained = auto_optimizer.fit(train_X, train_y)\n", "auto_y = auto_trained.predict(test_X)\n", "print(f'accuracy {sklearn.metrics.accuracy_score(test_y, auto_y):.1%}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The predictive performance is slightly better, and in fact, it is in the same\n", "ball-park as the state-of-the-art performance reported for this\n", "dataset in OpenML. Since we left various choices to automation, at\n", "this point, we probably want to inspect what the automation chose\n", "for us. One way to do that is by visualizing the pipeline as before.\n", "In the run shown here, this reveals that hyperopt chose `LR` over\n", "`XGBoost` and `LinearSVC`. Also, by hovering over the nodes in the visualization, we\n", "can explore how hyperopt configured the hyperparameters. 
Note that all\n", "nodes are visualized in white to indicate they are fully trained." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "image/svg+xml": [ "[Figure: pipeline graph with edges project_0 -> norm, norm -> concat, project_1 -> one_hot, one_hot -> concat, concat -> lr]" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "auto_trained.get_pipeline().visualize()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Another way to inspect the results of automation is to pretty-print\n", "the trained pipeline back as Python source code. This lets us look at\n", "the output of automation in the same syntax we used to specify\n", "the input to automation in the first place." 
] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/markdown": [ "```python\n", "project_0 = Project(columns={\"type\": \"number\"})\n", "norm = Norm(norm=\"max\")\n", "project_1 = Project(columns={\"type\": \"string\"})\n", "lr = LR(\n", " fit_intercept=False,\n", " intercept_scaling=0.38106476479749274,\n", " max_iter=422,\n", " multi_class=\"multinomial\",\n", " solver=\"saga\",\n", " tol=0.004851580781685862,\n", ")\n", "pipeline = ((project_0 >> norm) & (project_1 >> OneHot())) >> Concat() >> lr\n", "```" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "auto_trained.get_pipeline().pretty_print(ipython_display=True, show_imports=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Bindings as Lifecycle\n", "\n", "The previous example already alluded to the notion of bindings and\n", "that those bindings are reflected in node visualization colors.\n", "Bindings here relate to the mathematical notion of free or bound\n", "variables in formulas. *Bindings as lifecycle* is one of the fundamental\n", "novel concepts around which we designed Lale. Lale distinguishes two\n", "kinds of operators: *individual operators* are data-science primitives\n", "such as data preprocessors or predictive models, whereas *pipelines*\n", "are compositions of operators. Each operator, whether individual or\n", "composite, can be in one of four lifecycle states: meta-model,\n", "planned, trainable, and trained. The lifecycle state of an operator is\n", "defined by which bindings it has. Specifically:\n", "\n", "- *Meta-model*: Individual operators at the meta-model state have\n", " schemas and priors for their hyperparameters. Pipelines at the\n", " meta-model state have steps (which are other operators) and a\n", " grammar.\n", "\n", "- *Planned*: To get a planned pipeline, we need to *arrange* it by\n", " binding a specific graph topology. 
This topology is consistent with\n", " but more concrete than the steps and grammar from the meta-model\n", " state.\n", "\n", "- *Trainable*: To get a trainable individual operator, we need to\n", " initialize the concrete bindings for its hyperparameters.\n", " Scikit-learn does this using `__init__`; Lale emulates that syntax\n", " using `__call__`. To get a trainable pipeline, we need to bind\n", " concrete operator choices given by the `|` combinator.\n", "\n", "- *Trained*: To get a trained individual operator, we need to *fit* it\n", " to the data, thus binding its learnable coefficients. A trained\n", " pipeline is like a trainable pipeline where all steps are trained.\n", " More generally, the lifecycle state of a pipeline is limited by the\n", " least upper bound of the states of its steps.\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The interested reader can explore Lale's concept of bindings as\n", "lifecycle in more detail in our\n", "[paper](https://arxiv.org/pdf/1906.03957.pdf).\n", "\n", "The above diagram is actually a Venn diagram. Each lifecycle state is\n", "a subset of its predecessor. For instance, a trained operator has all\n", "the bindings of a planned operator and in addition binds learned\n", "coefficients. Since a trained operator has all bindings required for\n", "being a trainable operator, it can be used where a trainable operator\n", "is expected. Thus, the set of trained operators is a subset of the set\n", "of trainable operators. This relationship between the lifecycle states is\n", "essential for offering a flexible semi-automated data science\n", "experience." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Semi-Automated Data Science\n", "\n", "Why would data scientists prefer semi-automation over full automation?\n", "The main reason is to exert control over what kinds of pipelines they\n", "find. 
Below are several scenarios motivating semi-automated data\n", "science.\n", "\n", "| Manual control over automation | Examples |\n", "| ----------------------------------- | -------- |\n", "| Restrict available operator choices | Interpretable, or based on licenses, or based on GPU requirements, ... |\n", "| Tweak graph topology | Custom preprocessing, or multi-modal data, or fairness mitigation, ... |\n", "| Tweak hyperparameter schemas | Adjust range for continuous, or restrict choices for categorical, ... |\n", "| Expand available operator choices | Wrap existing library, or write your own operators, ... |\n", "\n", "In fact, we envision that this will often happen via trial-and-error,\n", "where you as the data scientist specify a first planned pipeline, let\n", "AutoML do its search, then inspect the results, edit your code, and\n", "try again. You already saw some of the Lale features for inspecting\n", "the results of automation above. Lale further supports this\n", "workflow by letting you specify *partial bindings* of some\n", "hyperparameters while allowing others to remain free, and by letting\n", "you *freeze* an operator at the trainable or trained state.\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 20 Newsgroups Dataset\n", "\n", "For the next examples in this notebook, we will need a different\n", "dataset. The following code fetches the 20 newsgroups data, using a\n", "function that comes with scikit-learn. Then, it displays the first\n", "five rows of the labels and features." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n",
"    <tr><th></th><th>y</th><th>X</th></tr>\n",
"  </thead>\n", "  <tbody>\n",
"    <tr><th>0</th><td>rec.autos</td><td>From: lerxst@wam.umd.edu (where's my thing)\\nS...</td></tr>\n",
"    <tr><th>1</th><td>comp.sys.mac.hardware</td><td>From: guykuo@carson.u.washington.edu (Guy Kuo)...</td></tr>\n",
"    <tr><th>2</th><td>comp.sys.mac.hardware</td><td>From: twillis@ec.ecn.purdue.edu (Thomas E Will...</td></tr>\n",
"    <tr><th>3</th><td>comp.graphics</td><td>From: jgreen@amber (Joe Green)\\nSubject: Re: W...</td></tr>\n",
"    <tr><th>4</th><td>sci.space</td><td>From: jcm@head-cfa.harvard.edu (Jonathan McDow...</td></tr>\n",
"  </tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " y X\n", "0 rec.autos From: lerxst@wam.umd.edu (where's my thing)\\nS...\n", "1 comp.sys.mac.hardware From: guykuo@carson.u.washington.edu (Guy Kuo)...\n", "2 comp.sys.mac.hardware From: twillis@ec.ecn.purdue.edu (Thomas E Will...\n", "3 comp.graphics From: jgreen@amber (Joe Green)\\nSubject: Re: W...\n", "4 sci.space From: jcm@head-cfa.harvard.edu (Jonathan McDow..." ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import sklearn.datasets\n", "news = sklearn.datasets.fetch_20newsgroups()\n", "news_X, news_y = news.data, news.target\n", "pd.DataFrame({'y': [news.target_names[i] for i in news_y], 'X': news_X}).head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The labels `y` are strings taken from a set of 20 different newsgroup\n", "names. That means that this is a 20-way classification dataset. The\n", "features `X` consist of a single string with a message posted to one\n", "of the newsgroups. The task is to use the message to predict which\n", "group it was posted on." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Constraints in Scikit-learn\n", "\n", "As implied by the title of this notebook and the corresponding talk,\n", "Lale is built around the concept of types. In programming languages, a\n", "*type* specifies a set of valid values for a variable (e.g., for a\n", "hyperparameter). While types are in the foreground of some\n", "programming languages, Python keeps types more in the background.\n", "Python is dynamically typed, and libraries such as\n", "scikit-learn rely more on exceptions than on types for their error\n", "checking. To demonstrate this, we first import some scikit-learn\n", "modules." 
] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "import sklearn.feature_extraction.text\n", "import sklearn.pipeline\n", "import sklearn.linear_model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we create a scikit-learn pipeline, using fully qualified names\n", "to ensure and clarify that we are not using any Lale facilities. The\n", "pipeline consists of just two operators, a TFIDF transformer for\n", "extracting features from the message text followed by a\n", "logistic regression for classifying the message by newsgroup." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "no error detected yet\n" ] } ], "source": [ "sklearn_misconfigured = sklearn.pipeline.make_pipeline(\n", " sklearn.feature_extraction.text.TfidfVectorizer(),\n", " sklearn.linear_model.LogisticRegression(solver='sag', penalty='l1'))\n", "print('no error detected yet')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The above code actually contains a mistake in the hyperparameters for\n", "`LogisticRegression`. Unfortunately, scikit-learn does not detect this\n", "mistake when the hyperparameters are configured in the code. Instead,\n", "it delays its error checking until we attempt to fit the pipeline to\n", "data." 
] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 3.45 s, sys: 92.9 ms, total: 3.54 s\n", "Wall time: 3.59 s\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "Solver sag supports only 'l2' or 'none' penalties, got l1 penalty.\n" ] } ], "source": [ "%%time\n", "import sys\n", "try:\n", " sklearn_misconfigured.fit(news_X, news_y)\n", "except ValueError as e:\n", " print(e, file=sys.stderr)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The above output demonstrates that scikit-learn takes a few seconds to\n", "report the error. This is because it first trains the TFIDF\n", "transformer on the data, and only when that is done, it attempts to\n", "train the `LogisticRegression`, triggering the exception. While a few\n", "seconds are no big deal, the 20 newsgroups dataset is small, and for larger\n", "datasets, the delay is larger. Furthermore, while in this case, the\n", "erroneous code is not far away from the code that triggers the error\n", "report, for larger code bases, that distance is larger too. The\n", "error message is nice and clear about the erroneous hyperparameters, but\n", "does not mention which operator was misconfigured (here,\n", "`LogisticRegression`)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Constraints in AutoML\n", "\n", "The above example illustrates an exception caused by manually\n", "misconfigured hyperparameters. 
Tools for automated machine learning\n", "usually run many trials with different hyperparameter configurations.\n", "Just as in the manual case, in general, some of these automated trials\n", "may raise exceptions.\n", "\n", "**Solution 1:** Unconstrained search space\n", "\n", "- {solver: \[liblinear, sag, lbfgs\], penalty: \[l1, l2\]}\n", "- catch exception (after some time)\n", "- return made-up loss `np.finfo(float).max`\n", "\n", "This first solution has the benefit of being simple, but the drawback\n", "that it may waste computational resources on training upstream\n", "operators before detecting erroneous downstream operators. It also\n", "leads to a larger-than-necessary search space. While we can use a high\n", "loss to steer the AutoML tool away from poorly performing points in\n", "the search space, in our experiments, we have encountered cases where\n", "this adversely affects convergence.\n", "\n", "**Solution 2:** Constrained search space\n", "\n", "- {solver: \[liblinear, sag, lbfgs\], penalty: \[l1, l2\]} **and** (**if** solver: [sag, lbfgs] **then** penalty: [l2])\n", "- no exceptions (no time wasted)\n", "- no made-up loss\n", "\n", "The second solution is the one we advocate in Lale. The Lale library\n", "contains hyperparameter schemas for a large number of operators. These\n", "schemas are type specifications that encode not just the valid values\n", "for each hyperparameter in isolation, but also constraints cutting\n", "across multiple hyperparameters such as `solver` and `penalty` in the\n", "example." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Constraints in Lale\n", "\n", "To demonstrate how Lale handles constraints, we first import Lale's\n", "wrapper for TFIDF. We need not import LogisticRegression here since we\n", "already imported it earlier in the notebook."
] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "from lale.lib.sklearn import TfidfVectorizer as Tfidf" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Lale uses JSON Schema to express types. JSON Schema is a widely\n", "supported and adopted standard proposal for specifying the valid\n", "structure of JSON documents. By using JSON Schema for hyperparameters,\n", "we avoided the need to implement any custom schema validation code.\n", "Instead, we simply put the concrete hyperparameters into a JSON\n", "document and then use an off-the-shelf validator to check that against\n", "the schema." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 23 ms, sys: 2.6 ms, total: 25.6 ms\n", "Wall time: 25.5 ms\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "Invalid configuration for LR(solver='sag', penalty='l1') due to constraint the newton-cg, sag, and lbfgs solvers support only l2 or no penalties.\n", "Schema of constraint 1: {\n", " \"description\": \"The newton-cg, sag, and lbfgs solvers support only l2 or no penalties.\",\n", " \"anyOf\": [\n", " {\n", " \"type\": \"object\",\n", " \"properties\": {\n", " \"solver\": {\"not\": {\"enum\": [\"newton-cg\", \"sag\", \"lbfgs\"]}}\n", " },\n", " },\n", " {\n", " \"type\": \"object\",\n", " \"properties\": {\"penalty\": {\"enum\": [\"l2\", \"none\"]}},\n", " },\n", " ],\n", "}\n", "Value: {'solver': 'sag', 'penalty': 'l1', 'dual': False, 'C': 1.0, 'tol': 0.0001, 'fit_intercept': True, 'intercept_scaling': 1.0, 'class_weight': None, 'random_state': None, 'max_iter': 100, 'multi_class': 'auto', 'verbose': 0, 'warm_start': False, 'n_jobs': None, 'l1_ratio': None}\n" ] } ], "source": [ "%%time\n", "import jsonschema\n", "try:\n", " lale_misconfigured = Tfidf >> LR(LR.enum.solver.sag, LR.enum.penalty.l1)\n", "except jsonschema.ValidationError as e:\n", " 
print(e.message, file=sys.stderr)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The output demonstrates that the error check happened more than 100\n", "times faster than in the previous example. This is because the code\n", "did not need to train TFIDF. The error check also happened closer to\n", "the root cause of the error, so the code distance is smaller.\n", "Furthermore, the error message indicates\n", "not just the wrong hyperparameters but also which operator was\n", "misconfigured (here, `LR`)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Schemas as Documentation\n", "\n", "The above code example demonstrates that Lale uses JSON Schema for\n", "error checking. But we can do better than just using schemas for one\n", "purpose. Since all Lale operators carry hyperparameter schemas,\n", "we can also use those same schemas for interactive documentation. The\n", "following code illustrates that by inspecting the schema of a\n", "continuous hyperparameter." ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'description': 'Number of trees to fit.',\n", " 'type': 'integer',\n", " 'default': 100,\n", " 'minimumForOptimizer': 50,\n", " 'maximumForOptimizer': 1000}" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "XGBoost.hyperparam_schema('n_estimators')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Of course, the same interactive documentation approach also works for\n", "categorical hyperparameters."
] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'description': 'Specify which booster to use.',\n", " 'enum': ['gbtree', 'gblinear', 'dart', None],\n", " 'default': None}" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "XGBoost.hyperparam_schema('booster')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using the same schemas for two purposes (validation and documentation)\n", "makes them stronger. As a user, you can be confident that the\n", "documentation you read is in sync with the error checks in the code.\n", "Furthermore, while schemas check hyperparameters, the reverse is also\n", "true: when tests reveal mistakes in the schemas themselves, they can\n", "be corrected in a single location, thus improving both validation and\n", "documentation. However, we can do even better than using the same\n", "schemas for two purposes: we can also use them for a third purpose\n", "(automated search)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Types as Search Spaces\n", "\n", "The following picture illustrates how Lale uses types for AutoML.\n", "The data scientist is shown on the left. They interact with a pipeline\n", "at either the planned or trainable lifecycle state, or at some mixed\n", "state as discussed above under semi-automated AutoML. Every\n", "individual operator in Lale carries types in the form of a schema.\n", "Users rarely have to write these schemas by hand, since the Lale\n", "library already includes schemas for, at last count, 133 individual\n", "operators. Given a planned pipeline, when the data scientist kicks\n", "off a search, Lale automatically generates a search space appropriate\n", "for the given combined algorithm selection and hyperparameter\n", "optimization tool, e.g., hyperopt. 
For each trial, the\n", "CASH tool then acquires a point in the search space, e.g., using an\n", "acquisition function in Bayesian optimization. Then, Lale\n", "automatically decodes that search point into a trainable Lale\n", "pipeline. By construction, this trainable pipeline validates against\n", "the schemas. We can train and score the pipeline, usually with\n", "cross-validation, to return a loss value to the CASH tool. If this\n", "is the best pipeline found so far, the CASH tool will remember it\n", "as the incumbent. When the search terminates, the CASH tool returns\n", "the final incumbent, minimizing the loss.\n", "*Types as search spaces* is another fundamental\n", "novel concept around which we designed Lale, described in more detail\n", "in our [paper](https://arxiv.org/pdf/1906.03957.pdf).\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Customizing Schemas\n", "\n", "While you can use Lale operators with their schemas as-is, you can\n", "also customize the schemas to exert more control over the automation.\n", "For instance, you might want to reduce the number of trees in an\n", "XGBoost forest to reduce memory consumption or to improve\n", "explainability. Or you might want to hand-pick one of the boosters\n", "to reduce the search space and thus hopefully speed up the search." ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "import lale.schemas as schemas\n", "Grove = XGBoost.customize_schema(\n", " n_estimators=schemas.Int(minimum=2, maximum=6),\n", " booster=schemas.Enum(['gbtree'], default='gbtree'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As this example demonstrates, Lale provides a simple Python API for\n", "writing schemas, which it then converts to JSON Schema internally.\n", "The result of customization is a new copy of the operator that can be\n", "used in the same way as any other operator in Lale. 
In particular,\n", "it can be part of a pipeline as before." ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "grove_planned = ( Project(columns={'type': 'number'}) >> Norm\n", " & Project(columns={'type': 'string'}) >> OneHot\n", " ) >> Concat >> Grove" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Given this new planned pipeline, we use hyperopt as before to search for a\n", "good trained pipeline." ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[15:02:56] WARNING: /Users/travis/build/dmlc/xgboost/src/learner.cc:1061: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.\n", "[15:02:57] WARNING: /Users/travis/build/dmlc/xgboost/src/learner.cc:1061: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.\n", "[15:02:58] WARNING: /Users/travis/build/dmlc/xgboost/src/learner.cc:1061: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.\n", "[15:02:59] WARNING: /Users/travis/build/dmlc/xgboost/src/learner.cc:1061: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.\n", "[15:02:59] WARNING: /Users/travis/build/dmlc/xgboost/src/learner.cc:1061: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. 
Explicitly set eval_metric if you'd like to restore the old behavior.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "[15:03:20] WARNING: /Users/travis/build/dmlc/xgboost/src/learner.cc:1061: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.\n", "100%|██████████| 10/10 [00:24<00:00, 2.49s/trial, best loss: -0.7313420884048686]\n", "[15:03:20] WARNING: /Users/travis/build/dmlc/xgboost/src/learner.cc:1061: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.\n", "accuracy 73.0%\n" ] } ], "source": [ "grove_optimizer = Hyperopt(estimator=grove_planned, cv=3, max_evals=10, verbose=True)\n", "grove_trained = grove_optimizer.fit(train_X, train_y)\n", "grove_y = grove_trained.predict(test_X)\n", "print(f'accuracy {sklearn.metrics.accuracy_score(test_y, grove_y):.1%}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The result shows that the predictive performance is not quite as good\n", "as before. As a data scientist, you can weigh that against other\n", "needs, and possibly experiment more. Of course, you can also display\n", "the result of automation, e.g., by pretty-printing it back as code." 
] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/markdown": [ "```python\n", "project_0 = Project(columns={\"type\": \"number\"})\n", "norm = Norm(norm=\"l1\")\n", "project_1 = Project(columns={\"type\": \"string\"})\n", "grove = Grove(\n", " gamma=0.8726303533099419,\n", " learning_rate=0.7606845221791637,\n", " max_depth=2,\n", " min_child_weight=11,\n", " n_estimators=5,\n", " reg_alpha=0.5980470775121279,\n", " reg_lambda=0.2546844052569046,\n", " subsample=0.8142720284737895,\n", ")\n", "pipeline = (\n", " ((project_0 >> norm) & (project_1 >> OneHot())) >> Concat() >> grove\n", ")\n", "```" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "grove_trained.get_pipeline().pretty_print(ipython_display=True, show_imports=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Scikit-learn Compatible Interoperability\n", "\n", "Scikit-learn has a big following, which is well-earned: it is one of the\n", "most complete and most usable machine-learning libraries available.\n", "In developing Lale, we spent a lot of effort on maintaining\n", "scikit-learn compatibility. Earlier on, you already saw that Lale\n", "pipelines work with off-the-shelf scikit-learn functions such as\n", "metrics, and with other libraries such as XGBoost. To demonstrate\n", "interoperability with additional libraries, we ran experiments with\n", "pipelines from various different data modalities. For the movies\n", "review text dataset, the best pipeline used a PyTorch implementation\n", "of the BERT embedding. For the car tabular dataset, the best pipeline\n", "used a Java implementation of the J48 decision tree. For the CIFAR-10\n", "images dataset, the best pipeline used a PyTorch implementation of a\n", "ResNet50 neural network. 
And finally, for the epilepsy time-series\n", "dataset, the best pipeline used a window transformer and voting\n", "operator pair that was written for the task and then made available in\n", "the Lale library. That demonstrates that Lale can use operators from\n", "libraries beyond scikit-learn and even from languages other than\n", "Python. It also demonstrates that for some tasks, this interoperability\n", "is necessary for better predictive performance.\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Summary\n", "\n", "Please check out the Lale GitHub repository for examples,\n", "documentation, and code: https://github.com/ibm/lale\n", "\n", "This notebook started by listing three values that Lale provides:\n", "automation, interoperability, and usability. Then, this notebook\n", "introduced three fundamental concepts around which Lale is designed:\n", "bindings as lifecycle, types as search spaces, and scikit-learn\n", "compatible interoperability. As the following picture illustrates, the\n", "three concepts sit at the intersection of the three values.\n", "\n", "" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.2" } }, "nbformat": 4, "nbformat_minor": 2 }