{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Lale's AutoPipeline Operator\n",
"\n",
"The `lale.lib.lale.AutoPipeline` operator automatically creates a pipeline for a dataset.\n",
"It is designed for simplicity and requires minimal configuration to get started.\n",
"You can use it for an initial experiment with new data. See also the\n",
"[API documentation](https://lale.readthedocs.io/en/latest/modules/lale.lib.lale.auto_pipeline.html)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Dataset\n",
"\n",
"This demonstration uses the [credit-g](https://www.openml.org/d/31) dataset from OpenML.\n",
"The dataset has both categorical features, represented as strings, and\n",
"numeric features. For illustration purposes, we also add some missing values,\n",
"represented as `NaN`. Printing the last few samples of the training data reveals\n",
"`credit_amount=NaN` for sample number 763."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" checking_status duration credit_history \\\n",
"763 no checking 21.0 critical/other existing credit \n",
"835 <0 12.0 no credits/all paid \n",
"192 0<=X<200 27.0 existing paid \n",
"629 no checking 9.0 existing paid \n",
"559 0<=X<200 18.0 critical/other existing credit \n",
"684 0<=X<200 36.0 delayed previously \n",
"\n",
" purpose credit_amount savings_status employment \\\n",
"763 new car NaN no known savings >=7 \n",
"835 new car 1082.0 <100 1<=X<4 \n",
"192 business 3915.0 <100 1<=X<4 \n",
"629 education 3832.0 no known savings >=7 \n",
"559 furniture/equipment 1928.0 <100 <1 \n",
"684 business 9857.0 100<=X<500 4<=X<7 \n",
"\n",
" installment_commitment personal_status other_parties residence_since \\\n",
"763 4.0 male single none 4.0 \n",
"835 4.0 male single none 4.0 \n",
"192 4.0 male single none 2.0 \n",
"629 1.0 male single none 4.0 \n",
"559 2.0 male single none 2.0 \n",
"684 1.0 male single none 3.0 \n",
"\n",
" property_magnitude age other_payment_plans housing existing_credits \\\n",
"763 no known property 30.0 none for free 1.0 \n",
"835 car 48.0 bank own 2.0 \n",
"192 car 36.0 none own 1.0 \n",
"629 real estate 64.0 none own 1.0 \n",
"559 real estate 31.0 none own 2.0 \n",
"684 life insurance 31.0 none own 2.0 \n",
"\n",
" job num_dependents own_telephone foreign_worker \n",
"763 high qualif/self emp/mgmt 1.0 yes yes \n",
"835 skilled 1.0 none yes \n",
"192 skilled 2.0 yes yes \n",
"629 unskilled resident 1.0 none yes \n",
"559 unskilled resident 1.0 none yes \n",
"684 unskilled resident 2.0 yes yes "
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import lale.datasets.openml\n",
"import lale.helpers\n",
"(orig_train_X, train_y), (test_X, test_y) = \\\n",
" lale.datasets.openml.fetch('credit-g', 'classification', preprocess=False)\n",
"train_X = lale.helpers.add_missing_values(orig_train_X, seed=42)\n",
"train_X.tail(6)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Sklearn Interface\n",
"\n",
"We designed the AutoPipeline operator to follow sklearn's init-fit-predict\n",
"conventions to make it easy to use for anyone familiar with sklearn.\n",
"\n",
"- During initialization, you can configure the behavior of the operator.\n",
" Here, we set `prediction_type='classification'`; AutoPipeline also supports\n",
" `'regression'`. The `max_opt_time` is the timeout in seconds for finding a\n",
" pipeline for the dataset.\n",
"- The call to `fit` initiates training, which tries out various pipelines\n",
" on the dataset.\n",
"- Finally, after training, `predict` makes predictions using the best found\n",
" pipeline."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"from lale.lib.lale import AutoPipeline\n",
"trainable = AutoPipeline(prediction_type='classification', max_opt_time=90)"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 21 s, sys: 1.34 s, total: 22.3 s\n",
"Wall time: 1min 37s\n"
]
}
],
"source": [
"%%time\n",
"trained = trainable.fit(train_X, train_y)"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"predicted = trained.predict(test_X)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The result of `predict` can then be either used directly (e.g., printed),\n",
"or passed to other sklearn functions (e.g., for scoring)."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"first 15 predictions: [1 1 1 1 1 1 1 1 1 0 0 1 1 1 1]\n",
"accuracy 76.1%\n"
]
}
],
"source": [
"from sklearn.metrics import accuracy_score\n",
"print(f'first 15 predictions: {predicted[:15]}')\n",
"print(f'accuracy {accuracy_score(test_y, predicted):.1%}')"
]
},
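{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the scoring step concrete, here is a minimal pure-Python sketch of what an\n",
"accuracy score computes: the fraction of samples whose predicted label equals the\n",
"true label. The helper `accuracy` and the toy labels are hypothetical, for\n",
"illustration only; they are not sklearn's implementation and not the credit-g data.\n",
"\n",
"```python\n",
"def accuracy(y_true, y_pred):\n",
"    # Fraction of positions where the prediction equals the true label.\n",
"    if len(y_true) != len(y_pred):\n",
"        raise ValueError('y_true and y_pred must have the same length')\n",
"    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)\n",
"    return matches / len(y_true)\n",
"\n",
"# Hypothetical labels for illustration only (not the credit-g test set).\n",
"toy_true = [1, 1, 0, 1, 0, 0, 1, 1]\n",
"toy_pred = [1, 1, 1, 1, 0, 0, 0, 1]\n",
"toy_accuracy = accuracy(toy_true, toy_pred)\n",
"print(f'accuracy {toy_accuracy:.1%}')\n",
"```\n",
"\n",
"Because the comparison is symmetric, swapping the two arguments gives the same\n",
"accuracy, although sklearn documents the order as `(y_true, y_pred)`."
]
},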
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Inspecting Results\n",
"\n",
"After training, you can look at a leaderboard of all the pipelines tried\n",
"during the search by calling the `summary` method."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" log_loss loss status tid time\n",
"name \n",
"gbt_all 0.537487 -0.735821 ok NaN 1.577811\n",
"p6 0.606116 -0.705970 ok 6.0 1.694068\n",
"p1 0.578674 -0.704478 ok 1.0 1.496361\n",
"baseline NaN -0.695522 ok NaN 0.106716\n",
"p4 NaN -0.695522 ok 4.0 1.499307\n",
"p0 0.616135 -0.686567 ok 0.0 1.432088\n",
"gbt_num 0.609914 -0.680597 ok NaN 0.710606\n",
"p5 NaN -0.674627 ok 5.0 2.432221\n",
"p2 0.878202 -0.664179 ok 2.0 2.304230\n",
"p3 0.675781 -0.664179 ok 3.0 1.317636\n",
"p7 NaN NaN new 7.0 NaN"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"trained.summary()"
]
},
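{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a rough sketch of how to read the leaderboard, the snippet below copies the\n",
"`loss` column from the summary above into a plain dictionary and picks the entry\n",
"with the lowest loss, skipping `p7`, which has no loss recorded. It assumes that\n",
"lower loss is better and that the values are a negated score; it is an\n",
"illustration, not Lale's own selection code.\n",
"\n",
"```python\n",
"import math\n",
"\n",
"# Loss values copied from the summary table; p7 has status 'new' and no loss yet.\n",
"losses = {\n",
"    'gbt_all': -0.735821,\n",
"    'p6': -0.705970,\n",
"    'p1': -0.704478,\n",
"    'baseline': -0.695522,\n",
"    'p4': -0.695522,\n",
"    'p0': -0.686567,\n",
"    'gbt_num': -0.680597,\n",
"    'p5': -0.674627,\n",
"    'p2': -0.664179,\n",
"    'p3': -0.664179,\n",
"    'p7': float('nan'),\n",
"}\n",
"\n",
"# Drop entries without a recorded loss, then take the lowest remaining loss.\n",
"finished = {name: loss for name, loss in losses.items() if not math.isnan(loss)}\n",
"best_name = min(finished, key=finished.get)\n",
"print(best_name, finished[best_name])\n",
"```\n",
"\n",
"On this run the lowest loss belongs to `gbt_all`, which matches the first row of\n",
"the leaderboard."
]
},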
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `get_pipeline` method lets you retrieve any pipeline by name.\n",
"By default, when no name is specified, it returns the best pipeline\n",
"found, in other words, the pipeline with the lowest loss in the\n",
"leaderboard.\n",
"You can call `predict` on that pipeline in typical sklearn fashion.\n",
"Furthermore, you can inspect that pipeline by calling `visualize`."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"best_found = trained.get_pipeline()\n",
"best_found.visualize()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The visualization reveals the operators from which the pipeline is\n",
"composed. Since our example dataset contains both numeric features and\n",
"categorical features, the pipeline contains two preprocessing paths,\n",
"using `Project` to keep only the relevant columns.\n",
"When you hover the mouse pointer over an operator in the visualization,\n",
"a tooltip shows how it is configured. Furthermore, if you click on an\n",
"operator in the visualization, you get to a documentation page\n",
"for that operator.\n",
"You can also pretty-print any pipeline found during the search back as\n",
"Python code to better understand how its operators were configured."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/markdown": [
"```python\n",
"from lale.lib.lale import Project\n",
"from sklearn.impute import SimpleImputer\n",
"from sklearn.preprocessing import OneHotEncoder\n",
"from lale.lib.lale import ConcatFeatures\n",
"from xgboost import XGBClassifier\n",
"import lale\n",
"\n",
"lale.wrap_imported_operators()\n",
"project_0 = Project(\n",
" columns={\"type\": \"number\"},\n",
" drop_columns=lale.lib.lale.categorical(max_values=5),\n",
")\n",
"project_1 = Project(columns=lale.lib.lale.categorical(max_values=5))\n",
"simple_imputer_1 = SimpleImputer(strategy=\"most_frequent\")\n",
"one_hot_encoder = OneHotEncoder(handle_unknown=\"ignore\")\n",
"pipeline = (\n",
" (\n",
" (project_0 >> SimpleImputer())\n",
" & (project_1 >> simple_imputer_1 >> one_hot_encoder)\n",
" )\n",
" >> ConcatFeatures\n",
" >> XGBClassifier()\n",
")\n",
"```"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"best_found.pretty_print(ipython_display=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Next Steps\n",
"\n",
"You can try out the AutoPipeline operator on your own data.\n",
"Alternatively, if you want more control, you can also plan your own pipelines,\n",
"then use Lale to do automated algorithm selection and hyperparameter tuning\n",
"on them. Check out the other [example\n",
"notebooks](https://github.com/IBM/lale/tree/master/examples) for Lale,\n",
"and in particular, the [introductory\n",
"guide](https://nbviewer.jupyter.org/github/IBM/lale/blob/master/examples/docs_guide_for_sklearn_users.ipynb)."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.5"
}
},
"nbformat": 4,
"nbformat_minor": 2
}