{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Sklearn Pipeline Permuter Example" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " \n", "This example shows how to systematically evaluate different machine learning pipelines. \n", "\n", "This is, for instance, useful if combinations of different feature selection methods with different estimators want to be evaluated in one step.\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Imports and Helper Functions" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "from pathlib import Path\n", "from shutil import rmtree\n", "\n", "import pandas as pd\n", "import numpy as np\n", "\n", "# Utils\n", "from sklearn.datasets import load_breast_cancer, load_diabetes\n", "\n", "# Preprocessing & Feature Selection\n", "from sklearn.feature_selection import SelectKBest, RFE\n", "from sklearn.preprocessing import MinMaxScaler, StandardScaler\n", "\n", "# Classification\n", "from sklearn.neighbors import KNeighborsClassifier\n", "from sklearn.svm import SVC\n", "from sklearn.tree import DecisionTreeClassifier\n", "from sklearn.ensemble import AdaBoostClassifier\n", "\n", "# Regression\n", "from sklearn.neighbors import KNeighborsRegressor\n", "from sklearn.svm import SVR\n", "from sklearn.tree import DecisionTreeRegressor\n", "from sklearn.ensemble import AdaBoostRegressor\n", "\n", "\n", "# Cross-Validation\n", "from sklearn.model_selection import KFold\n", "\n", "from biopsykit.classification.model_selection import SklearnPipelinePermuter\n", "\n", "%load_ext autoreload\n", "%autoreload 2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Classification" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Create temporary directory" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "tmpdir = Path(\"tmpdir\")\n", "tmpdir.mkdir(exist_ok=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load Example Dataset" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "breast_cancer = load_breast_cancer()\n", "X = breast_cancer.data\n", "y = breast_cancer.target" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Specify Estimator Combinations and Parameters for Hyperparameter Search" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "model_dict = {\n", " \"scaler\": {\"StandardScaler\": StandardScaler(), \"MinMaxScaler\": MinMaxScaler()},\n", " \"reduce_dim\": {\"SelectKBest\": SelectKBest(), \"RFE\": RFE(SVC(kernel=\"linear\", C=1))},\n", " \"clf\": {\n", " \"KNeighborsClassifier\": KNeighborsClassifier(),\n", " \"DecisionTreeClassifier\": DecisionTreeClassifier(),\n", " # \"SVC\": SVC(),\n", " # \"AdaBoostClassifier\": AdaBoostClassifier(),\n", " },\n", "}" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "params_dict = {\n", " \"StandardScaler\": None,\n", " \"MinMaxScaler\": None,\n", " \"SelectKBest\": {\"k\": [2, 4, \"all\"]},\n", " \"RFE\": {\"n_features_to_select\": [2, 4, None]},\n", " \"KNeighborsClassifier\": {\"n_neighbors\": [2, 4], \"weights\": [\"uniform\", \"distance\"]},\n", " \"DecisionTreeClassifier\": {\"criterion\": [\"gini\", \"entropy\"], \"max_depth\": [2, 4]},\n", " # \"SVC\": [\n", " # {\n", " # \"kernel\": [\"linear\"],\n", " # \"C\": np.logspace(start=-2, stop=2, num=5)\n", " # },\n", " # {\n", " # \"kernel\": [\"rbf\"],\n", " # \"C\": np.logspace(start=-2, stop=2, num=5),\n", " # \"gamma\": np.logspace(start=-2, stop=2, num=5)\n", " # }\n", " # ],\n", " # \"AdaBoostClassifier\": {\n", " # \"base_estimator\": [DecisionTreeClassifier(max_depth=1)],\n", " # \"n_estimators\": np.arange(20, 110, 10),\n", " # \"learning_rate\": np.arange(0.6, 1.1, 0.1)\n", " # },\n", "}\n", "\n", "\n", "# use randomized-search for decision tree classifier, use grid-search (the default) for all other estimators\n", "hyper_search_dict = {\"DecisionTreeClassifier\": {\"search_method\": \"random\", \"n_iter\": 2}}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Setup PipelinePermuter and Cross-Validations for Model Evaluation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note: For further information please visit the documentation of [SklearnPipelinePermuter](https://biopsykit.readthedocs.io/en/latest/api/biopsykit.classification.model_selection.sklearn_pipeline_permuter.html#biopsykit.classification.model_selection.sklearn_pipeline_permuter.SklearnPipelinePermuter)." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "pipeline_permuter = SklearnPipelinePermuter(\n", " model_dict, params_dict, hyper_search_dict=hyper_search_dict, random_state=42\n", ")\n", "\n", "outer_cv = KFold(5)\n", "inner_cv = KFold(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Fit all Parameter Combinations" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "pipeline_permuter.fit(X=X, y=y, outer_cv=outer_cv, inner_cv=inner_cv)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Display Results" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Metric Summary for Classification Pipelines" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The summary of all relevant metrics (performance scores, confusion matrix, true and predicted labels) of the **best-performing pipelines** for each fold (i.e., the [best_pipeline()](https://biopsykit.readthedocs.io/en/latest/api/biopsykit.classification.model_selection.sklearn_pipeline_permuter.html#biopsykit.classification.model_selection.sklearn_pipeline_permuter.SklearnPipelinePermuter.best_pipeline) parameter of each inner `cv` object), evaluated for each evaluated pipeline combination." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "pipeline_permuter.metric_summary()" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "List of `Pipeline` objects for the **best pipeline** for each evaluated pipeline combination." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "pycharm": { "name": "#%%\n" }, "tags": [] }, "outputs": [], "source": [ "pipeline_permuter.best_estimator_summary()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Mean Performance Scores for Individual Hyperparameter Combinations" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The performance scores for each pipeline and parameter combinations, respectively, averaged over all outer CV folds using [SklearnPipelinePermuter.mean_pipeline_score_results()](https://biopsykit.readthedocs.io/en/latest/api/biopsykit.classification.model_selection.sklearn_pipeline_permuter.html#biopsykit.classification.model_selection.sklearn_pipeline_permuter.SklearnPipelinePermuter.mean_pipeline_score_results).\n", "\n", "**NOTE**:\n", "* The summary of these pipelines does not necessarily correspond to the best-performing pipeline as returned by\n", " [SklearnPipelinePermuter.metric_summary()](https://biopsykit.readthedocs.io/en/latest/api/biopsykit.classification.model_selection.sklearn_pipeline_permuter.html#biopsykit.classification.model_selection.sklearn_pipeline_permuter.SklearnPipelinePermuter.metric_summary) or \n", " [SklearnPipelinePermuter.best_estimator_summary()](https://biopsykit.readthedocs.io/en/latest/api/biopsykit.classification.model_selection.sklearn_pipeline_permuter.html#biopsykit.classification.model_selection.sklearn_pipeline_permuter.SklearnPipelinePermuter.best_estimator_summary) because the\n", " best-performing pipelines are determined by averaging the `best_estimator` instances, as determined by\n", " `scikit-learn`, over all folds. Hence, all `best_estimator` instances can have a **different** set of\n", " hyperparameters, whereas in this function, it is explicitely averaged over the **same** set of hyperparameters.\n", "* Thus, this function should only be used if you want to gain a deeper understanding of the different hyperparameter\n", " combinations and their performance. If you want to get the best-performing pipeline(s) to report in a paper,\n", " use [SklearnPipelinePermuter.metric_summary()](https://biopsykit.readthedocs.io/en/latest/api/biopsykit.classification.model_selection.sklearn_pipeline_permuter.html#biopsykit.classification.model_selection.sklearn_pipeline_permuter.SklearnPipelinePermuter.metric_summary) or \n", " [SklearnPipelinePermuter.best_estimator_summary()](https://biopsykit.readthedocs.io/en/latest/api/biopsykit.classification.model_selection.sklearn_pipeline_permuter.html#biopsykit.classification.model_selection.sklearn_pipeline_permuter.SklearnPipelinePermuter.best_estimator_summary) instead." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "pipeline_permuter.mean_pipeline_score_results()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Best Hyperparameter Pipeline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The pipeline with the hyperparameter combination which achieved the highest average test score over all outer CV folds (i.e., the parameter combination which represents the first row of [mean_pipeline_score_results()](https://biopsykit.readthedocs.io/en/latest/api/biopsykit.classification.model_selection.sklearn_pipeline_permuter.html#biopsykit.classification.model_selection.sklearn_pipeline_permuter.SklearnPipelinePermuter.mean_pipeline_score_results)).\n", "\n", "**NOTE**:\n", "* The summary of these pipelines does not necessarily correspond to the best-performing pipeline as returned by\n", " [SklearnPipelinePermuter.metric_summary()](https://biopsykit.readthedocs.io/en/latest/api/biopsykit.classification.model_selection.sklearn_pipeline_permuter.html#biopsykit.classification.model_selection.sklearn_pipeline_permuter.SklearnPipelinePermuter.metric_summary) or \n", " [SklearnPipelinePermuter.best_estimator_summary()](https://biopsykit.readthedocs.io/en/latest/api/biopsykit.classification.model_selection.sklearn_pipeline_permuter.html#biopsykit.classification.model_selection.sklearn_pipeline_permuter.SklearnPipelinePermuter.best_estimator_summary) because the\n", " best-performing pipelines are determined by averaging the `best_estimator` instances, as determined by\n", " `scikit-learn`, over all folds. Hence, all `best_estimator` instances can have a **different** set of\n", " hyperparameters, whereas in this function, it is explicitely averaged over the **same** set of hyperparameters.\n", "* Thus, this function should only be used if you want to gain a deeper understanding of the different hyperparameter\n", " combinations and their performance. If you want to get the best-performing pipeline(s) to report in a paper,\n", " use [SklearnPipelinePermuter.metric_summary()](https://biopsykit.readthedocs.io/en/latest/api/biopsykit.classification.model_selection.sklearn_pipeline_permuter.html#biopsykit.classification.model_selection.sklearn_pipeline_permuter.SklearnPipelinePermuter.metric_summary) or \n", " [SklearnPipelinePermuter.best_estimator_summary()](https://biopsykit.readthedocs.io/en/latest/api/biopsykit.classification.model_selection.sklearn_pipeline_permuter.html#biopsykit.classification.model_selection.sklearn_pipeline_permuter.SklearnPipelinePermuter.best_estimator_summary) instead." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "pipeline_permuter.best_hyperparameter_pipeline()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Regression" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load Example Dataset" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "diabetes_data = load_diabetes()\n", "X_reg = diabetes_data.data\n", "y_reg = diabetes_data.target" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Specify Estimator Combinations and Parameters for Hyperparameter Search" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model_dict_reg = {\n", " \"scaler\": {\"StandardScaler\": StandardScaler(), \"MinMaxScaler\": MinMaxScaler()},\n", " \"reduce_dim\": {\"SelectKBest\": SelectKBest(), \"RFE\": RFE(SVR(kernel=\"linear\", C=1))},\n", " \"clf\": {\n", " \"KNeighborsRegressor\": KNeighborsRegressor(),\n", " \"DecisionTreeRegressor\": DecisionTreeRegressor(),\n", " # \"SVR\": SVR(),\n", " # \"AdaBoostRegressor\": AdaBoostRegressor(),\n", " },\n", "}" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "params_dict_reg = {\n", " \"StandardScaler\": None,\n", " \"MinMaxScaler\": None,\n", " \"SelectKBest\": {\"k\": [2, 4, \"all\"]},\n", " \"RFE\": {\"n_features_to_select\": [2, 4]},\n", " \"KNeighborsRegressor\": {\"n_neighbors\": [2, 4], \"weights\": [\"uniform\", \"distance\"]},\n", " \"DecisionTreeRegressor\": {\"max_depth\": [2, 4]},\n", " # \"SVR\": [\n", " # {\n", " # \"kernel\": [\"linear\"],\n", " # \"C\": np.logspace(start=-2, stop=2, num=5)\n", " # },\n", " # {\n", " # \"kernel\": [\"rbf\"],\n", " # \"C\": np.logspace(start=-2, stop=2, num=5),\n", " # \"gamma\": np.logspace(start=-2, stop=2, num=5)\n", " # }\n", " # ],\n", " # \"AdaBoostRegressor\": {\n", " # \"base_estimator\": [DecisionTreeClassifier(max_depth=1)],\n", " # \"n_estimators\": np.arange(20, 110, 10),\n", " # \"learning_rate\": np.arange(0.6, 1.1, 0.1)\n", " # },\n", "}\n", "\n", "\n", "# use randomized-search for decision tree classifier, use grid-search (the default) for all other estimators\n", "hyper_search_dict_reg = {\"DecisionTreeRegressor\": {\"search_method\": \"random\", \"n_iter\": 2}}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Setup PipelinePermuter and Cross-Validations for Model Evaluation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note: For further information please visit the documentatin of [SklearnPipelinePermuter](https://biopsykit.readthedocs.io/en/latest/api/biopsykit.classification.model_selection.sklearn_pipeline_permuter.html#biopsykit.classification.model_selection.sklearn_pipeline_permuter.SklearnPipelinePermuter)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pipeline_permuter_regression = SklearnPipelinePermuter(\n", " model_dict_reg, params_dict_reg, hyper_search_dict=hyper_search_dict_reg\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "outer_cv = KFold(5)\n", "inner_cv = KFold(5)\n", "\n", "pipeline_permuter_regression.fit(X_reg, y_reg, outer_cv=outer_cv, inner_cv=inner_cv, scoring=\"r2\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Display Results" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This works analogously to the classification example." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Further Functions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Export Results as LaTeX Table" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "print(pipeline_permuter.metric_summary_to_latex())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Save and Load `PipelinePermuter` results" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Save to Pickle File" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "pipeline_permuter.to_pickle(tmpdir.joinpath(\"test.pkl\"))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Load from Pickle File" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "pipeline_permuter_load = SklearnPipelinePermuter.from_pickle(tmpdir.joinpath(\"test.pkl\"))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Fit pipeline combinations and save intermediate results" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This saves the current state after successfully evaluating one pipeline combination." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pipeline_permuter.fit_and_save_intermediate(\n", " X=X, y=y, outer_cv=outer_cv, inner_cv=inner_cv, file_path=tmpdir.joinpath(\"test.pkl\")\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Merge multiple `PipelinePermuter` instances" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the case the evaluation of different classification pipelines had to be split (e.g., due to runtime reasons), the `PipelinePermuter` instances can be saved separately and afterwards merged back into one joint `PipelinePermuter` instance.\n", "\n", "The following example provides a minimal working example, consisting of the steps: \n", "* Initializing, fitting, and saving different `PipelinePermuter` instances\n", "* Loading saved `PipelinePermuter` instances from disk\n", "* Merging multiple `PipelinePermuter` instances into one instance for joint evaluation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Load Example Dataset" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "breast_cancer = load_breast_cancer()\n", "X = breast_cancer.data\n", "y = breast_cancer.target" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Fit and Save Different `PipelinePermuter` instances" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "model_dict_01 = {\n", " \"scaler\": {\"StandardScaler\": StandardScaler()},\n", " \"reduce_dim\": {\"SelectKBest\": SelectKBest(), \"RFE\": RFE(SVC(kernel=\"linear\", C=1))},\n", " \"clf\": {\n", " \"KNeighborsClassifier\": KNeighborsClassifier(),\n", " },\n", "}\n", "params_dict_01 = {\n", " \"StandardScaler\": None,\n", " \"SelectKBest\": {\"k\": [2, 4, \"all\"]},\n", " \"RFE\": {\"n_features_to_select\": [2, 4, None]},\n", " \"KNeighborsClassifier\": {\"n_neighbors\": [2, 4], \"weights\": [\"uniform\", \"distance\"]},\n", "}\n", "\n", "pipeline_permuter_01 = SklearnPipelinePermuter(model_dict_01, params_dict_01, random_state=42)\n", "\n", "pipeline_permuter_01.fit(X, y, outer_cv=KFold(5), inner_cv=KFold(5), verbose=0)\n", "pipeline_permuter_01.to_pickle(tmpdir.joinpath(\"permuter_01.pkl\"))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "model_dict_02 = {\n", " \"scaler\": {\"MinMaxScaler\": MinMaxScaler()},\n", " \"reduce_dim\": {\"SelectKBest\": SelectKBest(), \"RFE\": RFE(SVC(kernel=\"linear\", C=1))},\n", " \"clf\": {\n", " \"KNeighborsClassifier\": KNeighborsClassifier(),\n", " },\n", "}\n", "params_dict_02 = {\n", " \"MinMaxScaler\": None,\n", " \"SelectKBest\": {\"k\": [2, 4, \"all\"]},\n", " \"RFE\": {\"n_features_to_select\": [2, 4, None]},\n", " \"KNeighborsClassifier\": {\"n_neighbors\": [2, 4], \"weights\": [\"uniform\", \"distance\"]},\n", "}\n", "\n", "pipeline_permuter_02 = SklearnPipelinePermuter(model_dict_02, params_dict_02, random_state=42)\n", "\n", "pipeline_permuter_02.fit(X, y, outer_cv=KFold(5), inner_cv=KFold(5), verbose=0)\n", "pipeline_permuter_02.to_pickle(tmpdir.joinpath(\"permuter_02.pkl\"))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "model_dict_03 = {\n", " \"scaler\": {\"StandardScaler\": StandardScaler(), \"MinMaxScaler\": MinMaxScaler()},\n", " \"reduce_dim\": {\"SelectKBest\": SelectKBest(), \"RFE\": RFE(SVC(kernel=\"linear\", C=1))},\n", " \"clf\": {\n", " \"DecisionTreeClassifier\": DecisionTreeClassifier(),\n", " },\n", "}\n", "params_dict_03 = {\n", " \"StandardScaler\": None,\n", " \"MinMaxScaler\": None,\n", " \"SelectKBest\": {\"k\": [2, 4, \"all\"]},\n", " \"RFE\": {\"n_features_to_select\": [2, 4, None]},\n", " \"DecisionTreeClassifier\": {\"criterion\": [\"gini\", \"entropy\"], \"max_depth\": [2, 4]},\n", "}\n", "\n", "pipeline_permuter_03 = SklearnPipelinePermuter(model_dict_03, params_dict_03, random_state=42)\n", "\n", "pipeline_permuter_03.fit(X, y, outer_cv=KFold(5), inner_cv=KFold(5), verbose=0)\n", "pipeline_permuter_03.to_pickle(tmpdir.joinpath(\"permuter_03.pkl\"))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Load and Merge `PipelinePermuter` instances" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "permuter_file_list = sorted(tmpdir.glob(\"permuter_*.pkl\"))\n", "print(permuter_file_list)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "permuter_list = [SklearnPipelinePermuter.from_pickle(p) for p in permuter_file_list]\n", "permuter_list" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "merged_permuter = SklearnPipelinePermuter.merge_permuter_instances(permuter_list)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Double-check if permuters were correcrtly merged:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "for p in permuter_list:\n", " display(p.best_estimator_summary())" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "merged_permuter.best_estimator_summary()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Updated partially fitted `SklearnPipelinePermuter` with additional Parameters" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For this example, we perform an experiment using a partial hyperparameter set. We save this object as pickle file, load it in the next step, update the parameter sets, and continue with our experiments. This is useful for incremental experiments without having to run multiple experiments and merge different `SklearnPipelinePermuter` instances." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Do Partial Fitting" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "model_dict_partial = {\n", " \"scaler\": {\"StandardScaler\": StandardScaler()},\n", " \"reduce_dim\": {\"SelectKBest\": SelectKBest(), \"RFE\": RFE(SVC(kernel=\"linear\", C=1))},\n", " \"clf\": {\n", " \"KNeighborsClassifier\": KNeighborsClassifier(),\n", " },\n", "}\n", "params_dict_partial = {\n", " \"StandardScaler\": None,\n", " \"SelectKBest\": {\"k\": [2, 4, \"all\"]},\n", " \"RFE\": {\"n_features_to_select\": [2, 4, None]},\n", " \"KNeighborsClassifier\": {\"n_neighbors\": [2, 4], \"weights\": [\"uniform\", \"distance\"]},\n", "}\n", "\n", "pipeline_permuter_partial = SklearnPipelinePermuter(model_dict_partial, params_dict_partial, random_state=42)\n", "\n", "pipeline_permuter_partial.fit(X, y, outer_cv=KFold(5), inner_cv=KFold(5))\n", "pipeline_permuter_partial.to_pickle(tmpdir.joinpath(\"permuter_partial.pkl\"))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Load Partially Fitted Model, Update with Total Parameter Dicts, and Fit the Remaining Combinations" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "model_dict_total = {\n", " \"scaler\": {\"StandardScaler\": StandardScaler(), \"MinMaxScaler\": MinMaxScaler()},\n", " \"reduce_dim\": {\"SelectKBest\": SelectKBest(), \"RFE\": RFE(SVC(kernel=\"linear\", C=1))},\n", " \"clf\": {\n", " \"KNeighborsClassifier\": KNeighborsClassifier(),\n", " \"DecisionTreeClassifier\": DecisionTreeClassifier(),\n", " },\n", "}\n", "\n", "params_dict_total = {\n", " \"StandardScaler\": None,\n", " \"MinMaxScaler\": None,\n", " \"SelectKBest\": {\"k\": [2, 4, \"all\"]},\n", " \"RFE\": {\"n_features_to_select\": [2, 4, None]},\n", " \"KNeighborsClassifier\": {\"n_neighbors\": [2, 4], \"weights\": [\"uniform\", \"distance\"]},\n", " \"DecisionTreeClassifier\": {\"criterion\": [\"gini\", \"entropy\"], \"max_depth\": [2, 4]},\n", "}" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "pipeline_permuter_total = SklearnPipelinePermuter.from_pickle(tmpdir.joinpath(\"permuter_partial.pkl\"))\n", "pipeline_permuter_total = pipeline_permuter_total.update_permuter(model_dict_total, params_dict_total)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "pipeline_permuter_total.fit(X, y, outer_cv=KFold(5), inner_cv=KFold(5))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Cleanup" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [] }, "outputs": [], "source": [ "rmtree(tmpdir)" ] } ], "metadata": { "kernelspec": { "display_name": "biopsykit", "language": "python", "name": "biopsykit" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.13" } }, "nbformat": 4, "nbformat_minor": 4 }