# OPTaaS Scikit-learn Pipelines

### Note: To run this notebook, you need an API Key. You can get one here.

Using the OPTaaS Python Client, you can optimize any scikit-learn pipeline. For each step or estimator in the pipeline, OPTaaS just needs to know what parameters to optimize and what constraints will apply to them.

Your pipeline can even include **optional** steps (such as feature selection), **choice** steps (such as choosing between a set of classifiers) and **nested** pipelines.

We have provided pre-defined parameters and constraints for some of the most widely used estimators, such as Random Forest and XGBoost. The example below demonstrates how to use them. See also our [tutorial on defining your own custom optimizable estimators](07.%20Custom%20Scikit-learn%20Estimators.ipynb).

## Load your dataset

We will run a classification pipeline using the German Credit Data available [here](https://newonlinecourses.science.psu.edu/stat857/node/215/). The data contains 1,000 rows, with 20 feature columns and one target column that takes two classes.

In [1]:
import pandas as pd

data = pd.read_csv('../../data/german_credit.csv')
features = data[data.columns.drop(['Creditability'])]
target = data['Creditability']

## Create your OptimizablePipeline

Our pipeline will include:

- An optional feature selection step using PCA

- A choice of classifier from: Random Forest, Extra Trees and Gradient Boosting

In [2]:
from mindfoundry.optaas.client.sklearn_pipelines.estimators.pca import PCA
from mindfoundry.optaas.client.sklearn_pipelines.estimators.ensemble import RandomForestClassifier, ExtraTreesClassifier, GradientBoostingClassifier
from mindfoundry.optaas.client.sklearn_pipelines.mixin import OptimizablePipeline, choice, optional_step

optimizable_pipeline = OptimizablePipeline([
    ('feature_selection', optional_step(PCA())),
    ('classification', choice(
        RandomForestClassifier(),
        ExtraTreesClassifier(),
        GradientBoostingClassifier()
    ))
])
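To make the structure of this search space concrete, the sketch below lists the pipeline shapes that an optional step combined with a three-way choice can produce. It uses plain strings in place of the real estimators, and the helper name `enumerate_structures` is made up for illustration; it is not part of the OPTaaS client.

```python
from itertools import product

# Names mirror the pipeline defined above; plain strings stand in for estimators.
feature_selection_options = [None, "PCA"]  # optional step: absent or present
classifier_options = [
    "RandomForestClassifier",
    "ExtraTreesClassifier",
    "GradientBoostingClassifier",
]

def enumerate_structures(optional_options, choice_options):
    """Return every (feature_selection, classifier) combination."""
    return list(product(optional_options, choice_options))

structures = enumerate_structures(feature_selection_options, classifier_options)
for feature_selection, classifier in structures:
    # Leave the optional step out of the printed pipeline when it is absent.
    steps = [classifier] if feature_selection is None else [feature_selection, classifier]
    print(" -> ".join(steps))
```

This gives two options for the optional step times three classifiers, so six structural variants, each with its own hyperparameters to tune on top.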

## Connect to the OPTaaS server using your API Key

We now create a client and connect to the web service that will perform our optimization. You will need to enter your personal API key. Keep your key private and do not commit it to version control.

In [3]:
from mindfoundry.optaas.client.client import OPTaaSClient

client = OPTaaSClient('https://optaas.mindfoundry.ai', '')  # replace '' with your API key
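One common way to keep a key out of version control is to read it from an environment variable instead of typing it into the notebook. The sketch below assumes a variable named `OPTAAS_API_KEY`; that name and the helper `load_api_key` are our own choices, not something the OPTaaS client defines.

```python
import os

def load_api_key(env_var="OPTAAS_API_KEY"):
    """Read the API key from the environment and fail loudly if it is missing.

    The variable name is a hypothetical convention chosen for this example.
    """
    api_key = os.environ.get(env_var, "").strip()
    if not api_key:
        raise RuntimeError(f"Set the {env_var} environment variable to your API key")
    return api_key
```

The returned value could then be passed as the second argument to `OPTaaSClient` in place of a hard-coded string.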

## Create your Sklearn Task

We don't need to worry about specifying all the parameters and constraints - they are generated based on our OptimizablePipeline. Sometimes we will need to provide additional kwargs, e.g. `feature_count` which is required by PCA.

If we do need to optimize any additional parameters that are outside of our pipeline, we can include them in `additional_parameters` and `additional_constraints`.

In [4]:
from mindfoundry.optaas.client.parameter import IntParameter
from mindfoundry.optaas.client.constraint import Constraint

my_extra_param = IntParameter('extra', id='extra', minimum=0, maximum=10)
my_extra_constraint = Constraint(my_extra_param != 7)

task = client.create_sklearn_task(
    title='My Sklearn Task',
    pipeline=optimizable_pipeline,
    feature_count=len(features.columns),
    additional_parameters=[my_extra_param],
    additional_constraints=[my_extra_constraint],
    min_known_score=0, max_known_score=1  # optional: define the min and max known score values
)

display(task.parameters)
display(task.constraints)

[{'id': 'pipeline',
 'name': 'pipeline',
 'type': 'group',
 'items': [{'id': 'pipeline__feature_selection',
 'name': 'feature_selection',
 'type': 'group',
 'optional': True,
 'items': [{'id': 'pipeline__feature_selection__n_components',
 'name': 'n_components',
 'type': 'integer',
 'minimum': 1,
 'maximum': 20},
 {'id': 'pipeline__feature_selection__whiten',
 'name': 'whiten',
 'type': 'boolean',
 'default': False}]},
 {'id': 'classification',
 'name': 'classification',
 'type': 'choice',
 'choices': [{'id': 'pipeline__classification__0',
 'name': '0',
 'type': 'group',
 'items': [{'id': 'pipeline__classification__0__max_features',
 'name': 'max_features',
 'type': 'categorical',
 'default': 'auto',
 'enum': ['auto', 'sqrt', 'log2']},
 {'id': 'pipeline__classification__0__min_samples_split',
 'name': 'min_samples_split',
 'type': 'integer',
 'default': 2,
 'minimum': 2,
 'maximum': 20,
 'distribution': 'Uniform'},
 {'id': 'pipeline__classification__0__min_samples_leaf',
 'name': 'min_

['#extra != 7']
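As a rough picture of what the extra parameter and its constraint allow, the sketch below lists the values of `extra` that remain valid. It is plain Python, not the OPTaaS constraint engine, and it assumes that `minimum` and `maximum` are both inclusive.

```python
# Bounds and the excluded value are taken from the cell above.
# Assumption: both bounds are inclusive.
EXTRA_MIN, EXTRA_MAX = 0, 10

def satisfies_constraint(extra):
    """Plain-Python equivalent of the constraint shown as '#extra != 7'."""
    return extra != 7

allowed_values = [v for v in range(EXTRA_MIN, EXTRA_MAX + 1) if satisfies_constraint(v)]
print(allowed_values)  # [0, 1, 2, 3, 4, 5, 6, 8, 9, 10]
```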

## Define your scoring function

We define a function that runs our pipeline with 5-fold cross-validation and returns the mean and variance of the fold scores:

In [5]:
from sklearn.model_selection import cross_val_score

def scoring_function(pipeline):
    scores = cross_val_score(pipeline, features, target, scoring='f1_micro', cv=5)
    return scores.mean(), scores.var()
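To show what the two returned numbers represent, here is a small standard-library sketch that computes the same summary for a list of fold scores. The fold scores are made-up values for illustration. NumPy's `var()` uses `ddof=0` by default, which is the population variance, so `statistics.pvariance` is the matching function here.

```python
from statistics import fmean, pvariance

def summarize_scores(scores):
    """Return (mean, population variance) for a list of fold scores."""
    return fmean(scores), pvariance(scores)

# Illustrative values only, not results from the German Credit dataset.
fold_scores = [0.70, 0.72, 0.74, 0.71, 0.73]
mean_score, variance = summarize_scores(fold_scores)
print(mean_score, variance)
```

A low variance indicates that the pipeline scored similarly on every fold, which makes the mean a more trustworthy estimate of its quality.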

## Run your task

We run the task for 20 iterations and review the results:

In [6]:
best_result = task.run(scoring_function, 20)
print("Best Result: ", best_result)

Running task "My Sklearn Task" for 20 iterations
(or until score is 1.0 or better)

Iteration: 0 Score: 0.7139774505043966 Variance: 0.0011199704714439302
Pipeline: Pipeline(memory=None,
 steps=[('feature_selection', PCA(copy=True, iterated_power='auto', n_components=10, random_state=None,
 svd_solver='auto', tol=0.0, whiten=False)), ('classification', RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
 max_depth=None, max_features='auto', max_lea...n_jobs=1,
 oob_score=False, random_state=None, verbose=0,
 warm_start=False))])

Iteration: 1 Score: 0.7429884974795155 Variance: 0.0013301807783011328
Pipeline: Pipeline(memory=None,
 steps=[('classification', GradientBoostingClassifier(criterion='friedman_mse', init=None,
 learning_rate=0.1, loss='deviance', max_depth=3,
 max_features='auto', max_leaf_nodes=None,
 min_impurity_decrease=0.0, min_impurity_split=None,
 min_samples_leaf=1, min_samples_split=2,
 min_weight_fraction_leaf=0.0, n_estimators=100,
 presort=

Iteration: 16 Score: 0.7339884794974615 Variance: 0.001510748319642625
Pipeline: Pipeline(memory=None,
 steps=[('classification', GradientBoostingClassifier(criterion='friedman_mse', init=None,
 learning_rate=0.1, loss='deviance', max_depth=3,
 max_features='sqrt', max_leaf_nodes=None,
 min_impurity_decrease=0.0, min_impurity_split=None,
 min_samples_leaf=1, min_samples_split=2,
 min_weight_fraction_leaf=0.0, n_estimators=100,
 presort='auto', random_state=None, subsample=1.0, verbose=0,
 warm_start=False))])

Iteration: 17 Score: 0.723010435585286 Variance: 0.0013184775096818067
Pipeline: Pipeline(memory=None,
 steps=[('feature_selection', PCA(copy=True, iterated_power='auto', n_components=10, random_state=None,
 svd_solver='auto', tol=0.0, whiten=False)), ('classification', ExtraTreesClassifier(bootstrap=False, class_weight=None, criterion='gini',
 max_depth=50, max_features='auto', max_leaf_no...timators=10, n_jobs=1,
 oob_score=False, random_state=None, verbose=0, warm_start=False)
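The stopping rule printed at the start of the run (20 iterations, or until the score is 1.0 or better) can be pictured with the toy loop below. This is only an illustration of the control flow and not how OPTaaS is implemented: a fixed list of candidate values stands in for the configurations the service would suggest, and `run_with_early_stop` and `toy_score` are hypothetical names.

```python
def run_with_early_stop(propose, score, max_iterations, target_score):
    """Evaluate up to max_iterations proposals, stopping once target_score is reached."""
    best_config, best_score, iterations_run = None, float("-inf"), 0
    for _ in range(max_iterations):
        config = propose()
        current_score = score(config)
        iterations_run += 1
        if current_score > best_score:
            best_config, best_score = config, current_score
        if best_score >= target_score:
            break  # the known maximum has been reached, so further runs are pointless
    return best_config, best_score, iterations_run

# A fixed sequence of candidates stands in for the optimizer's suggestions.
candidates = iter([1, 4, 9])

def toy_score(config):
    """Toy objective that peaks at 1.0 when config == 4."""
    return 1.0 - abs(config - 4) / 10

result = run_with_early_stop(lambda: next(candidates), toy_score,
                             max_iterations=3, target_score=1.0)
print(result)  # (4, 1.0, 2)
```

Here the loop stops after the second candidate because its score already equals the target, which mirrors the role of `max_known_score` in the task definition.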