{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", " \n", "## [mlcourse.ai](https://mlcourse.ai) – Open Machine Learning Course \n", "###
Author: Aleksandr Korotkov, ODS Slack krotix\n", " \n", "##
Tutorial\n", "###
Yet another ensemble learning helper" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![title](https://static1.squarespace.com/static/57dc396a03596e8da9fe6b73/t/57eef283b3db2ba633355a07/1480477568336/UBC_Bands.jpg)\n", "
Image by Brian Hawkes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### What does ensemble learning mean?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Ensemble learning** is a method that uses multiple learning algorithms to obtain better predictive performance than any of the constituent learning algorithms could achieve alone (this is usually the case, but not always).\n", "\n", "The most common techniques are:\n", "* boosting\n", "* bagging\n", "* stacking\n", "\n", "We will look at some meta-algorithms for building an ensemble, which can improve a model's score on a metric, speed up experimentation, and simplify code.\n", "\n", "I would like to show several Python libraries for ensembling:\n", "* https://github.com/rasbt/mlxtend.git - a library of useful tools for day-to-day data science tasks.\n", "* https://github.com/flennerhag/mlens - a library for high-performance ensemble learning.\n", "* https://github.com/Menelau/DESlib - an easy-to-use ensemble learning library focused on implementing state-of-the-art techniques for dynamic classifier and ensemble selection.\n", "\n", "We will explore these libraries on a simple example and plot decision boundaries to visualize the differences." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "###
Installation" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pip install mlxtend\n", "!pip install mlens\n", "!pip install deslib" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Prepare the notebook for further experiments:\n", "* import all libraries\n", "* load example data and split it\n", "* make classifiers for comparison\n", "* create a utility function\n", "\n", "We'll use the Iris dataset as an example.\n", "\n", "Features:\n", "* Sepal length\n", "* Sepal width\n", "* Petal length\n", "* Petal width\n", "\n", "Number of samples: 150.\n", "\n", "Target variable (discrete): {50x Setosa, 50x Versicolor, 50x Virginica}" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import itertools\n", "import warnings\n", "\n", "import matplotlib.gridspec as gridspec\n", "import matplotlib.pyplot as plt\n", "# common libraries\n", "import numpy as np\n", "from deslib.dcs import MCB\n", "from deslib.des.knora_e import KNORAE\n", "from deslib.static import StaticSelection\n", "from mlens.ensemble import (BlendEnsemble, SequentialEnsemble, Subsemble,\n", "                            SuperLearner)\n", "from mlxtend.classifier import (EnsembleVoteClassifier, StackingClassifier,\n", "                                StackingCVClassifier)\n", "from mlxtend.data import iris_data\n", "from mlxtend.plotting import plot_decision_regions\n", "from sklearn.ensemble import RandomForestClassifier\n", "from sklearn.linear_model import LogisticRegression\n", "from sklearn.metrics import accuracy_score, classification_report\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.svm import SVC\n", "\n", "warnings.filterwarnings(\"ignore\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# random seed\n", "seed = 10\n", "\n", "# Loading example data\n", "X, y = iris_data()\n", "# keep only sepal length and petal length so decision regions can be plotted in 2D\n", "X = X[:, [0, 2]]\n", "\n", "# split the data into training and test data\n", "X_train, X_test, 
y_train, y_test = train_test_split(\n", "    X, y, test_size=0.25, random_state=seed\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Initializing several classifiers\n", "clf1 = LogisticRegression(random_state=seed)\n", "clf2 = RandomForestClassifier(random_state=seed)\n", "clf3 = SVC(random_state=seed, probability=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def compare(classifier, X_train=X_train, y_train=y_train, X_test=X_test, y_test=y_test):\n", "    # Plotting Decision Regions\n", "    gs = gridspec.GridSpec(2, 2)\n", "    fig = plt.figure(figsize=(10, 8))\n", "\n", "    # Label for our classifiers\n", "    labels = [\"Logistic Regression\", \"Random Forest\", \"RBF kernel SVM\", \"Ensemble\"]\n", "\n", "    classifiers = [clf1, clf2, clf3, classifier]\n", "    for clf, label, grid in zip(\n", "        classifiers, labels, itertools.product([0, 1], repeat=2)\n", "    ):\n", "        # fit on the training split only, so the test report below is not leaked\n", "        clf.fit(X_train, y_train)\n", "        ax = plt.subplot(gs[grid[0], grid[1]])\n", "        fig = plot_decision_regions(X=X, y=y, clf=clf, legend=2)\n", "        plt.title(label)\n", "\n", "    plt.show()\n", "\n", "    for clf, label in zip(classifiers, labels):\n", "        print(label)\n", "        # classification_report expects (y_true, y_pred)\n", "        print(classification_report(y_test, clf.predict(X_test)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "###
Mlxtend" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Mlxtend classes:\n", "* EnsembleVoteClassifier - a majority voting helper for classification\n", "* StackingClassifier - an ensemble-learning meta-classifier for stacking\n", "* StackingCVClassifier - an ensemble-learning meta-classifier for stacking that uses cross-validation to prepare the inputs for the level-2 classifier, which helps prevent overfitting\n", "\n", "Let's see how to use them with examples." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### EnsembleVoteClassifier\n", "\n", "To get more information, see https://rasbt.github.io/mlxtend/user_guide/classifier/EnsembleVoteClassifier/" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "eclf = EnsembleVoteClassifier(clfs=[clf1, clf2, clf3], weights=[2, 1, 1], voting=\"soft\")\n", "compare(eclf)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### StackingClassifier\n", "\n", "To get more information, see https://rasbt.github.io/mlxtend/user_guide/classifier/StackingClassifier/" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sclf = StackingClassifier(\n", "    classifiers=[clf1, clf2, clf3], meta_classifier=LogisticRegression()\n", ")\n", "compare(sclf)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### StackingCVClassifier\n", "\n", "To get more information, see https://rasbt.github.io/mlxtend/user_guide/classifier/StackingCVClassifier/" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "scvclf = StackingCVClassifier(\n", "    classifiers=[clf1, clf2, clf3], meta_classifier=LogisticRegression()\n", ")\n", "compare(scvclf)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##
mlens" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Mlens has several useful classes:\n", "* SuperLearner - a stacking ensemble\n", "* Subsemble - a supervised ensemble algorithm that uses subsets of the full data to fit a layer\n", "* BlendEnsemble - a supervised ensemble that uses a held-out part of the training data to estimate the prediction matrix for the meta-learner\n", "* SequentialEnsemble - a multi-layer ensemble that can combine different layer types" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### SuperLearner\n", "\n", "To get more information, follow the link http://ml-ensemble.com/info/start/ensembles.html#super-learner" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sl = SuperLearner(folds=5, random_state=seed, verbose=2)\n", "\n", "# Build the first layer\n", "sl.add([clf1, clf2, clf3])\n", "# Attach the final meta-estimator\n", "sl.add_meta(LogisticRegression())\n", "\n", "compare(sl)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Subsemble\n", "\n", "To get more information, follow the link http://ml-ensemble.com/info/start/ensembles.html#subsemble" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sub = Subsemble(partitions=3, random_state=seed, verbose=2, shuffle=True)\n", "\n", "# Build the first layer\n", "sub.add([clf1, clf2, clf3])\n", "# Attach the final meta-estimator\n", "sub.add_meta(SVC())\n", "\n", "compare(sub)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### BlendEnsemble\n", "\n", "To get more information, follow the link http://ml-ensemble.com/info/start/ensembles.html#blend-ensemble" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "be = BlendEnsemble(test_size=0.7, random_state=seed, verbose=2, shuffle=True)\n", "\n", "# Build the first layer\n", "be.add([clf1, clf2, clf3])\n", "# Attach the final meta-estimator\n", "be.add_meta(LogisticRegression())\n", "\n", "compare(be)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### SequentialEnsemble\n", "\n", 
"To get more information follow the link http://ml-ensemble.com/info/start/ensembles.html#sequential-ensemble" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "se = SequentialEnsemble(random_state=seed, shuffle=True)\n", "\n", "# The initial layer is a blended layer, same as a layer in the BlendEnsemble\n", "se.add(\"blend\", [clf1, clf2])\n", "\n", "# The second layer is a stacked layer, same as a layer of the SuperLearner\n", "se.add(\"stack\", [clf1, clf3])\n", "\n", "# The meta estimator is added as in any other ensemble\n", "se.add_meta(SVC())\n", "\n", "compare(se)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##
DESlib" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "DESlib implements 23 ensemble techniques, split into 3 groups:\n", "* Dynamic Ensemble Selection (DES)\n", "* Dynamic Classifier Selection (DCS)\n", "* Baseline methods (static)\n", "\n", "Let's try one from each group:\n", "* KNORAE - a Dynamic Ensemble Selection (DES) algorithm based on k-Nearest Oracle-Eliminate (KNORA-E)\n", "* MCB - a Dynamic Classifier Selection (DCS) algorithm based on Multiple Classifier Behaviour (MCB)\n", "* StaticSelection - a baseline (static) ensemble method that selects the N classifiers with the best performance\n", "\n", "To get more information, follow the link https://deslib.readthedocs.io/en/latest/api.html" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### KNORAE\n", "\n", "To get more information, follow the link https://deslib.readthedocs.io/en/latest/modules/des/knora_e.html" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "kne = KNORAE([clf1, clf2, clf3])\n", "compare(kne)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### MCB\n", "\n", "To get more information, follow the link https://deslib.readthedocs.io/en/latest/modules/dcs/mcb.html" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "mcb = MCB([clf1, clf2, clf3])\n", "compare(mcb)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### StaticSelection\n", "\n", "To get more information, follow the link https://deslib.readthedocs.io/en/latest/modules/static/static_selection.html" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ss = StaticSelection([clf1, clf2, clf3])\n", "compare(ss)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "###
Summary\n", " \n", "We looked at several algorithms and libraries that can save a lot of time when you need ensemble techniques. Ensemble modeling is a powerful way to improve the performance of your machine learning models. If you want to reach the top of the leaderboard in a machine learning competition or improve the models you are working on, ensembling is the way to go.\n", " \n", "P.S. There is no silver bullet... try different tools and algorithms." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Resources: \n", "* https://en.wikipedia.org/wiki/Ensemble_learning\n", "* https://rasbt.github.io/mlxtend/USER_GUIDE_INDEX/\n", "* https://www.dataquest.io/blog/introduction-to-ensembles/\n", "* https://www.slideshare.net/SessionsEvents/erin-ledell-machine-learning-scientist-h2oai-at-mlconf-atl-2016\n", "* https://medium.com/@rrfd/boosting-bagging-and-stacking-ensemble-methods-with-sklearn-and-mlens-a455c0c982de\n", "* https://www.analyticsvidhya.com/blog/2018/06/comprehensive-guide-for-ensemble-models/" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.7" } }, "nbformat": 4, "nbformat_minor": 2 }