{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Lale's AutoPipeline Operator\n", "\n", "The `lale.lib.lale.AutoPipeline` operator automatically creates a pipeline for a dataset.\n", "It is designed for simplicity, requiring minimal configuration to get started.\n", "You can use it for an initial experiment with new data. See also the\n", "[API documentation](https://lale.readthedocs.io/en/latest/modules/lale.lib.lale.auto_pipeline.html)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Dataset\n", "\n", "This demonstration uses the [credit-g](https://www.openml.org/d/31) dataset from OpenML.\n", "The dataset has both categorical features, represented as strings, and\n", "numeric features. For illustration purposes, we also add some missing values,\n", "represented as `NaN`." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "train_X.shape (670, 20)\n" ] } ], "source": [ "import lale.datasets.openml\n", "import lale.helpers\n", "(orig_train_X, train_y), (test_X, test_y) = \\\n", "    lale.datasets.openml.fetch('credit-g', 'classification', preprocess=False)\n", "train_X = lale.helpers.add_missing_values(orig_train_X, seed=42)\n", "print(f'train_X.shape {train_X.shape}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Printing the last few samples of the training data reveals\n", "`credit_amount=NaN` for sample number 763." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
classchecking_statusdurationcredit_historypurposecredit_amountsavings_statusemploymentinstallment_commitmentpersonal_statusother_partiesresidence_sinceproperty_magnitudeageother_payment_planshousingexisting_creditsjobnum_dependentsown_telephoneforeign_worker
763badno checking21.0critical/other existing creditnew carNaNno known savings>=74.0male singlenone4.0no known property30.0nonefor free1.0high qualif/self emp/mgmt1.0yesyes
835bad<012.0no credits/all paidnew car1082.0<1001<=X<44.0male singlenone4.0car48.0bankown2.0skilled1.0noneyes
192bad0<=X<20027.0existing paidbusiness3915.0<1001<=X<44.0male singlenone2.0car36.0noneown1.0skilled2.0yesyes
629goodno checking9.0existing paideducation3832.0no known savings>=71.0male singlenone4.0real estate64.0noneown1.0unskilled resident1.0noneyes
559bad0<=X<20018.0critical/other existing creditfurniture/equipment1928.0<100<12.0male singlenone2.0real estate31.0noneown2.0unskilled resident1.0noneyes
684good0<=X<20036.0delayed previouslybusiness9857.0100<=X<5004<=X<71.0male singlenone3.0life insurance31.0noneown2.0unskilled resident2.0yesyes
\n", "
" ], "text/plain": [ " class checking_status duration credit_history \\\n", "763 bad no checking 21.0 critical/other existing credit \n", "835 bad <0 12.0 no credits/all paid \n", "192 bad 0<=X<200 27.0 existing paid \n", "629 good no checking 9.0 existing paid \n", "559 bad 0<=X<200 18.0 critical/other existing credit \n", "684 good 0<=X<200 36.0 delayed previously \n", "\n", " purpose credit_amount savings_status employment \\\n", "763 new car NaN no known savings >=7 \n", "835 new car 1082.0 <100 1<=X<4 \n", "192 business 3915.0 <100 1<=X<4 \n", "629 education 3832.0 no known savings >=7 \n", "559 furniture/equipment 1928.0 <100 <1 \n", "684 business 9857.0 100<=X<500 4<=X<7 \n", "\n", " installment_commitment personal_status other_parties residence_since \\\n", "763 4.0 male single none 4.0 \n", "835 4.0 male single none 4.0 \n", "192 4.0 male single none 2.0 \n", "629 1.0 male single none 4.0 \n", "559 2.0 male single none 2.0 \n", "684 1.0 male single none 3.0 \n", "\n", " property_magnitude age other_payment_plans housing existing_credits \\\n", "763 no known property 30.0 none for free 1.0 \n", "835 car 48.0 bank own 2.0 \n", "192 car 36.0 none own 1.0 \n", "629 real estate 64.0 none own 1.0 \n", "559 real estate 31.0 none own 2.0 \n", "684 life insurance 31.0 none own 2.0 \n", "\n", " job num_dependents own_telephone foreign_worker \n", "763 high qualif/self emp/mgmt 1.0 yes yes \n", "835 skilled 1.0 none yes \n", "192 skilled 2.0 yes yes \n", "629 unskilled resident 1.0 none yes \n", "559 unskilled resident 1.0 none yes \n", "684 unskilled resident 2.0 yes yes " ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "pd.options.display.max_columns = None\n", "pd.concat([train_y.tail(6), train_X.tail(6)], axis=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Sklearn Interface\n", "\n", "We designed the AutoPipeline operator to follow sklearn's init-fit-predict\n", "conventions 
to make it easy to use for anyone familiar with sklearn.\n", "\n", "- During initialization, you can configure the behavior of the operator.\n", "  Here, we set `prediction_type='classification'`; AutoPipeline also supports\n", "  `'regression'`. The `max_opt_time` argument is the timeout, in seconds, for\n", "  finding a pipeline for the dataset.\n", "- The call to `fit` initiates training, which tries out various pipelines\n", "  on the dataset.\n", "- Finally, after training, `predict` makes predictions using the best\n", "  pipeline found." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "from lale.lib.lale import AutoPipeline\n", "trainable = AutoPipeline(prediction_type='classification', max_opt_time=90)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 29.9 s, sys: 1.59 s, total: 31.5 s\n", "Wall time: 1min 39s\n" ] } ], "source": [ "%%time\n", "trained = trainable.fit(train_X, train_y)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "predicted = trained.predict(test_X)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The result of `predict` can then either be used directly (e.g., printed)\n", "or passed to other sklearn functions (e.g., for scoring)." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "last 6 predictions: ['bad' 'bad' 'good' 'bad' 'good' 'bad']\n", "accuracy 73.9%\n" ] } ], "source": [ "from sklearn.metrics import accuracy_score\n", "print(f'last 6 predictions: {predicted[-6:]}')\n", "print(f'accuracy {accuracy_score(test_y, predicted):.1%}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Inspecting Results\n", "\n", "After training, you can look at a leaderboard of all the pipelines tried\n", "during the search by calling the `summary` method."
] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
losstimelog_lossstatustid
name
gbt_all-0.7059702.4268310.721055okNaN
p2-0.7044782.6494560.561390ok2.0
p1-0.6985071.5914140.570718ok1.0
baseline-0.6955220.077895NaNokNaN
p4-0.6955222.758111NaNok4.0
p0-0.6865671.7580130.616135ok0.0
p3-0.6835821.7777910.765601ok3.0
gbt_num-0.6417911.1300190.768635okNaN
p5NaNNaNNaNnew5.0
\n", "
" ], "text/plain": [ " loss time log_loss status tid\n", "name \n", "gbt_all -0.705970 2.426831 0.721055 ok NaN\n", "p2 -0.704478 2.649456 0.561390 ok 2.0\n", "p1 -0.698507 1.591414 0.570718 ok 1.0\n", "baseline -0.695522 0.077895 NaN ok NaN\n", "p4 -0.695522 2.758111 NaN ok 4.0\n", "p0 -0.686567 1.758013 0.616135 ok 0.0\n", "p3 -0.683582 1.777791 0.765601 ok 3.0\n", "gbt_num -0.641791 1.130019 0.768635 ok NaN\n", "p5 NaN NaN NaN new 5.0" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "trained.summary()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `get_pipeline` method lets you retrieve any pipeline by name.\n", "By default, when no name is specified, it returns the best pipeline\n", "found, in other words, the pipeline with the lowest loss in the\n", "leaderboard.\n", "You can call `predict` on that pipeline in typical sklearn fashion.\n", "Furthermore, you can inspect that pipeline by calling `visualize`." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "cluster:(root)\n", "\n", "\n", "\n", "\n", "\n", "project_0\n", "\n", "\n", "Project\n", "\n", "\n", "\n", "\n", "simple_imputer_0\n", "\n", "\n", "Simple-\n", "Imputer\n", "\n", "\n", "\n", "\n", "project_0->simple_imputer_0\n", "\n", "\n", "\n", "\n", "concat_features\n", "\n", "\n", "Concat-\n", "Features\n", "\n", "\n", "\n", "\n", "simple_imputer_0->concat_features\n", "\n", "\n", "\n", "\n", "project_1\n", "\n", "\n", "Project\n", "\n", "\n", "\n", "\n", "simple_imputer_1\n", "\n", "\n", "Simple-\n", "Imputer\n", "\n", "\n", "\n", "\n", "project_1->simple_imputer_1\n", "\n", "\n", "\n", "\n", "one_hot_encoder\n", "\n", "\n", "One-\n", "Hot-\n", "Encoder\n", "\n", "\n", "\n", "\n", "simple_imputer_1->one_hot_encoder\n", "\n", "\n", "\n", "\n", "one_hot_encoder->concat_features\n", "\n", "\n", "\n", "\n", "xgb_classifier\n", "\n", "\n", "XGB-\n", 
"Classifier\n", "\n", "\n", "\n", "\n", "concat_features->xgb_classifier\n", "\n", "\n", "\n", "\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "best_found = trained.get_pipeline()\n", "best_found.visualize()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The visualization reveals the operators from which the pipeline is\n", "composed. Since our example dataset contains both numeric features and\n", "categorical features, the pipeline contains two preprocessing paths,\n", "using `Project` to keep only the relevant columns.\n", "When you hover the mouse pointer over an operator in the visualization,\n", "a tooltip shows how it is configured. Furthermore, if you click on an\n", "operator in the visualization, you get to a documentation page\n", "for that operator.\n", "You can also pretty-print any pipeline found during the search back as\n", "Python code to better understand how its operators were configured." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/markdown": [ "```python\n", "from lale.lib.lale import Project\n", "from sklearn.impute import SimpleImputer\n", "from sklearn.preprocessing import OneHotEncoder\n", "from lale.lib.lale import ConcatFeatures\n", "from xgboost import XGBClassifier\n", "import lale\n", "\n", "lale.wrap_imported_operators()\n", "project_0 = Project(\n", " columns={\"type\": \"number\"},\n", " drop_columns=lale.lib.lale.categorical(max_values=5),\n", ")\n", "project_1 = Project(columns=lale.lib.lale.categorical(max_values=5))\n", "simple_imputer_1 = SimpleImputer(strategy=\"most_frequent\")\n", "one_hot_encoder = OneHotEncoder(handle_unknown=\"ignore\")\n", "pipeline = (\n", " (\n", " (project_0 >> SimpleImputer())\n", " & (project_1 >> simple_imputer_1 >> one_hot_encoder)\n", " )\n", " >> ConcatFeatures\n", " >> XGBClassifier()\n", ")\n", "```" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } 
], "source": [ "best_found.pretty_print(ipython_display=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Next Steps\n", "\n", "You can try out the AutoPipeline operator on your own data.\n", "Alternatively, if you want more control, you can also plan your own pipelines,\n", "then use Lale to do automated algorithm selection and hyperparameter tuning\n", "on them. Check out the other [example\n", "notebooks](https://github.com/IBM/lale/tree/master/examples) for Lale,\n", "and in particular, the [introductory\n", "guide](https://nbviewer.jupyter.org/github/IBM/lale/blob/master/examples/docs_guide_for_sklearn_users.ipynb)." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.5" } }, "nbformat": 4, "nbformat_minor": 2 }