{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", " \n", "## [mlcourse.ai](https://mlcourse.ai) – Open Machine Learning Course \n", "###
Author: Korgun Dmitry, @tbb\n", " \n", "##
Tutorial\n", "###
\"Something else about ensemble learning\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The goal behind ensemble methods is to combine different classifiers into a meta-classifier that has a better generalization performance than each individual classifier alone. For example, assuming that we collected prediction from 10 different kaggle-kernels, ensemble method would allow us to combine these predictions to come up with a prediction that is more accurate and robust than the prediction by one each kernel. There are several ways to create an ensemble of classifiers each aimed for own purpose:\n", "\n", "* **Bagging** - decrease the variance\n", "* **Boosting** - decrease the bias\n", "* **Stacking** - improve the predictive force\n", "\n", "What is \"bagging\" and \"boosting\" you already know from lectures, but let me remind you main ideas.\n", "\n", "**_Bagging_** - generate additional data for training from the original dataset using combinations with repetitions to produce multisets of the same size as the original dataset. By increasing the size of training set you can't improve the model predictive force, but just decrease the variance, narrowly tuning the prediction to the expected outcome.\n", "\n", "**_Boosting_** - two-step approach, where first uses subsets of the original data to produce a series of averagely performing models and then \"boosts\" their performance by combining them together using a particular cost function (e.g. majority vote). Unlike bagging, in the classical boosting the subset creation is not random and depends upon the performance of the previous models: every new subset contains the elements that were misclassified by the previous model.\n", "\n", "**_Stacking (Blending)_ ** - is similar to boosting: you also apply several models to your original data. The difference here is that you don't have an empirical formula for your weight function, rather you introduce a meta-level and use another model/approach to estimate the input together with outputs of every model to estimate the weights, in other words, to determine what models perform well and what badly given these input data.\n", "\n", "### Intro\n", "Before we start, I guess, we should see a graph that demonstrates the relationship between the ensemble and individual classifier error. In other words, this graph visualizes the Condorcet’s jury theorem." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import math\n", "\n", "import matplotlib.pyplot as plt\n", "import numpy as np\n", "\n", "%matplotlib inline\n", "\n", "from itertools import product\n", "\n", "from scipy.misc import comb" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# calculate ensemble error\n", "def ensemble_error(n_clf, error):\n", " k_start = math.ceil(n_clf / 2)\n", " probs = [\n", " comb(n_clf, k) * error ** k * (1 - error) ** (n_clf - k)\n", " for k in range(k_start, n_clf + 1)\n", " ]\n", " return sum(probs)\n", "\n", "\n", "error_range = np.arange(0.0, 1.01, 0.01)\n", "errors = [ensemble_error(n_clf=11, error=error) for error in error_range]\n", "\n", "plt.plot(error_range, errors, label=\"Ensemble error\", linewidth=2)\n", "plt.plot(error_range, error_range, linestyle=\"--\", label=\"Base error\", linewidth=2)\n", "plt.xlabel(\"Base error\")\n", "plt.ylabel(\"Base/Ensemble error\")\n", "plt.legend(loc=\"best\")\n", "plt.grid()\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we can see, the error probability of an ensemble is always better than the error of an individual classifier as long as the classifier performs better than random guessing.\n", "\n", "Let's start with a warm-up exercise and implement a simple ensemble classifier for majority voting like an example of simplest ensemble algorithm." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import warnings\n", "\n", "from sklearn import datasets\n", "# import some useful stuff\n", "from sklearn.base import BaseEstimator, ClassifierMixin, clone\n", "from sklearn.linear_model import LogisticRegression\n", "from sklearn.metrics import auc, roc_curve\n", "from sklearn.model_selection import (GridSearchCV, cross_val_score,\n", " train_test_split)\n", "from sklearn.neighbors import KNeighborsClassifier\n", "from sklearn.pipeline import Pipeline, _name_estimators\n", "from sklearn.preprocessing import LabelEncoder, StandardScaler\n", "from sklearn.tree import DecisionTreeClassifier\n", "\n", "warnings.filterwarnings(\"ignore\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# and make a small helper function to plot classifiers decision area\n", "def plot_clf_area(\n", " classifiers, labels, X, s_row=2, s_col=2, scaling=True, colors=None, markers=None\n", "):\n", " if not colors:\n", " colors = [\"green\", \"red\", \"blue\"]\n", "\n", " if not markers:\n", " markers = [\"^\", \"o\", \"x\"]\n", "\n", " if scaling:\n", " sc = StandardScaler()\n", " X_std = sc.fit_transform(X)\n", "\n", " # find plot boundaries\n", " x_min = X_std[:, 0].min() - 1\n", " x_max = X_std[:, 0].max() + 1\n", " y_min = X_std[:, 1].min() - 1\n", " y_max = X_std[:, 1].max() + 1\n", "\n", " xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1), np.arange(y_min, y_max, 0.1))\n", "\n", " f, axarr = plt.subplots(\n", " nrows=s_row, ncols=s_col, sharex=\"col\", sharey=\"row\", figsize=(12, 8)\n", " )\n", " for idx, clf, tt in zip(product(range(s_row), range(s_col)), classifiers, labels):\n", " clf.fit(X_std, y_train)\n", " Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])\n", " Z = Z.reshape(xx.shape)\n", " axarr[idx[0], idx[1]].contourf(xx, yy, Z, alpha=0.3)\n", "\n", " for label, color, marker in zip(np.unique(y_train), colors, markers):\n", " axarr[idx[0], idx[1]].scatter(\n", " X_std[y_train == label, 0],\n", " 
X_std[y_train == label, 1],\n", " c=color,\n", " marker=marker,\n", " s=50,\n", " )\n", " axarr[idx[0], idx[1]].set_title(tt)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Implementing a simple majority vote classifier" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "class MajorityVoteClassifier(BaseEstimator, ClassifierMixin):\n", " \"\"\"\n", " A Majority vote ensemble classifier\n", " \n", " Params\n", " -----\n", " classifiers : array, shape = [n_classifiers]\n", " Classifiers for the ensemble\n", " \n", " vote : str, {'label', 'probability'}\n", " Default: 'label'\n", " If 'label' the prediction based on the argmax\n", " of class labels. Else if 'probability', the\n", " argmax of the sum of probabilities is used to\n", " predict the class label.\n", " \n", " weights : array, shape = [n_classifiers]\n", " Optional, default: None\n", " If a list of 'int' or 'float' values are provided,\n", " the classifiers are weighted by importance;\n", " Uses uniform weights if 'None'\n", " \"\"\"\n", "\n", " def __init__(self, classifiers, vote=\"label\", weights=None):\n", " self.classifiers = classifiers\n", " self.named_classifiers = {\n", " key: value for key, value in _name_estimators(classifiers)\n", " }\n", " self.vote = vote\n", " self.weights = weights\n", "\n", " def fit(self, X, y):\n", " \"\"\"\n", " Fit classifiers.\n", " \n", " Params\n", " -----\n", " X : {array, matrix}\n", " shape = [n_samples, n_features]\n", " Matrix of training samples.\n", " \n", " y : array, shape = [n_samples]\n", " Vector of target labels.\n", " \"\"\"\n", "\n", " # Use LabelEncoder to ensure class labels start with 0\n", " # which is important for np.argmax call in self.predict\n", " self.le_ = LabelEncoder()\n", " self.le_.fit(y)\n", " self.classes_ = self.le_.classes_\n", " self.classifiers_ = []\n", " for clf in self.classifiers:\n", " fitted_clf = clone(clf).fit(X, self.le_.transform(y))\n", " self.classifiers_.append(fitted_clf)\n", " return self\n", "\n", " def predict(self, X):\n", " \"\"\"\n", " Predict class labels for X.\n", " \n", " Params\n", " -----\n", " X : {array, matrix}\n", " shape = [n_samples, n_features]\n", " Matrix of training samples.\n", " \n", " Returns\n", " -----\n", " maj_vote : array, shape = [n_samples]\n", " Predicted class labels.\n", " \"\"\"\n", "\n", " if self.vote == \"probability\":\n", " maj_vote = np.argmax(self.predict_proba(X), axis=1)\n", " else:\n", " predictions = np.asarray([clf.predict(X) for clf in self.classifiers_]).T\n", " maj_vote = np.apply_along_axis(\n", " lambda x: np.argmax(np.bincount(x, weights=self.weights)),\n", " axis=1,\n", " arr=predictions,\n", " )\n", "\n", " maj_vote = self.le_.inverse_transform(maj_vote)\n", " return maj_vote\n", "\n", " def predict_proba(self, X):\n", " \"\"\"\n", " Predict class probabilities for X.\n", "\n", " Params\n", " -----\n", " X : {array, matrix}\n", " shape = [n_samples, n_features]\n", " Training vectors, where n_samples is the number\n", " of samples and n_features the number of features.\n", "\n", " Returns\n", " -----\n", " avg_proba : array\n", " shape = [n_samples, n_classes]\n", " Weighted average probability for each class per sample\n", " \"\"\"\n", " probas = np.asarray([clf.predict_proba(X) for clf in self.classifiers_])\n", " avg_proba = np.average(probas, axis=0, weights=self.weights)\n", " return avg_proba" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Parent classes **_BaseEstimator_** and **_ClassifierMixin_** give 
some base functionality like *get_params* and *set_params* for free.\n", "\n", "Now it's time to test our classifier." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# load data\n", "wine = datasets.load_wine()\n", "wine.feature_names" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# use only two features: alcohol (column 0) and hue (column 10)\n", "X, y = wine.data[:, [0, 10]], wine.target\n", "\n", "X_train, X_test, y_train, y_test = train_test_split(\n", "    X, y, test_size=0.3, random_state=11\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# make base classifiers\n", "clf1 = LogisticRegression(penalty=\"l2\", C=0.001, random_state=11)\n", "clf2 = DecisionTreeClassifier(max_depth=2, criterion=\"entropy\", random_state=11)\n", "clf3 = KNeighborsClassifier(n_neighbors=1, p=2, metric=\"minkowski\")\n", "\n", "# LR and KNN are sensitive to feature scale, so standardize the data for them\n", "pipe1 = Pipeline([[\"sc\", StandardScaler()], [\"clf\", clf1]])\n", "pipe3 = Pipeline([[\"sc\", StandardScaler()], [\"clf\", clf3]])\n", "\n", "mv_clf = MajorityVoteClassifier(classifiers=[pipe1, clf2, pipe3])\n", "\n", "labels = [\"Logistic Regression\", \"Decision Tree\", \"KNN\", \"Majority Vote\"]\n", "\n", "all_clf = [pipe1, clf2, pipe3, mv_clf]\n", "for clf, label in zip(all_clf, labels):\n", "    scores = cross_val_score(estimator=clf, X=X_train, y=y_train, cv=10)\n", "    print(f\"Accuracy: {scores.mean():.2f} (+/- {scores.std():.2f}) [{label}]\")\n", "\n", "\n", "plot_clf_area(all_clf, labels, X_train)\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we can see, the performance of the MajorityVoteClassifier has substantially improved over the individual classifiers in the 10-fold cross-validation evaluation. Note that the decision regions of the ensemble classifier seem to be a hybrid of the decision regions from the individual classifiers. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Stacking\n", "The majority vote approach is similar to stacking. However, the stacking algorithm is used in combination with a model that predicts the final class label using the predictions of the individual classifiers in the ensemble as input.\n", "\n", "The basic idea behind stacked generalization is to use a pool of base classifiers and then another classifier, called a meta-classifier, to combine their predictions, with the aim of reducing the generalization error.\n", "\n", "Let's say you want to do 2-fold stacking:\n", "\n", "* Split the train set into 2 parts: train_a and train_b\n", "* Fit a first-stage model on train_a and create predictions for train_b\n", "* Fit the same model on train_b and create predictions for train_a\n", "* Finally, fit the model on the entire train set and create predictions for the test set.\n", "* Now train a second-stage stacker model on the out-of-fold predictions from the first-stage model(s).\n", "\n", "In the implementation below we will use only the meta-features and a single validation block; you can easily add the missing functionality if you need it.\n", "Let's implement Stacking based on the MajorityVoteClassifier class. 
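" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before the class itself, purely as an illustration, here is a minimal sketch of the 2-fold out-of-fold recipe described above. It reuses the wine train/test split and the estimators imported earlier; the variable names (`X_a`, `meta_train`, `stacker`) and the choice of first-stage models are ad hoc, so treat it as a sketch rather than a reference implementation." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# sketch of 2-fold stacking with out-of-fold meta-features (illustration only)\n", "X_a, X_b, y_a, y_b = train_test_split(X_train, y_train, test_size=0.5, random_state=11)\n", "\n", "first_stage = [\n", "    DecisionTreeClassifier(max_depth=3, random_state=11),\n", "    KNeighborsClassifier(n_neighbors=5),\n", "]\n", "\n", "# each half of the train set gets meta-features from models fitted on the other half\n", "meta_a = np.column_stack(\n", "    [clone(clf).fit(X_b, y_b).predict_proba(X_a) for clf in first_stage]\n", ")\n", "meta_b = np.column_stack(\n", "    [clone(clf).fit(X_a, y_a).predict_proba(X_b) for clf in first_stage]\n", ")\n", "meta_train = np.vstack([meta_a, meta_b])\n", "meta_y = np.concatenate([y_a, y_b])\n", "\n", "# refit the first-stage models on the entire train set to build test meta-features\n", "meta_test = np.column_stack(\n", "    [clone(clf).fit(X_train, y_train).predict_proba(X_test) for clf in first_stage]\n", ")\n", "\n", "# the second-stage (stacker) model is trained on the out-of-fold predictions only\n", "stacker = LogisticRegression(random_state=11)\n", "stacker.fit(meta_train, meta_y)\n", "print(f\"Out-of-fold stacking accuracy: {stacker.score(meta_test, y_test):.2f}\")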
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "class StackingClassifier(BaseEstimator, ClassifierMixin):\n", " \"\"\"A Stacking classifier for scikit-learn estimators for classification.\n", " \n", " Params\n", " -----\n", " classifiers : array, shape = [n_classifiers]\n", " A list of classifiers for stacking.\n", " meta_classifier : object\n", " The meta-classifier to be fitted on the ensemble of\n", " classifiers\n", " use_probas : bool (default: True)\n", " If True, trains meta-classifier based on predicted probabilities\n", " instead of class labels.\n", " average_probas : bool (default: True)\n", " Averages the probabilities as meta features if True.\n", "\n", " \"\"\"\n", "\n", " def __init__(\n", " self, classifiers, meta_classifier, use_probas=True, average_probas=True\n", " ):\n", "\n", " self.classifiers = classifiers\n", " self.meta_classifier = meta_classifier\n", " self.named_classifiers = {\n", " key: value for key, value in _name_estimators(classifiers)\n", " }\n", " self.named_meta_classifier = {\n", " f\"meta-{key}\": value for key, value in _name_estimators([meta_classifier])\n", " }\n", " self.use_probas = use_probas\n", " self.average_probas = average_probas\n", "\n", " def fit(self, X, y):\n", " \"\"\" Fit ensemble classifers and the meta-classifier.\n", " \n", " Params\n", " -----\n", " X : {array, matrix}, shape = [n_samples, n_features]\n", " Training vectors, where n_samples is the number of samples and\n", " n_features is the number of features.\n", " y : array, shape = [n_samples] or [n_samples, n_outputs]\n", " Target values.\n", " \"\"\"\n", " self.classifiers_ = [clone(clf) for clf in self.classifiers]\n", " self.meta_clf_ = clone(self.meta_classifier)\n", "\n", " for clf in self.classifiers_:\n", " clf.fit(X, y)\n", "\n", " meta_features = self.predict_meta_features(X)\n", " self.meta_clf_.fit(meta_features, y)\n", " return self\n", "\n", " def predict(self, X):\n", " \"\"\" Predict target values for X.\n", " Params\n", " -----\n", " X : {array, matrix}, shape = [n_samples, n_features]\n", " Training vectors, where n_samples is the number of samples and\n", " n_features is the number of features.\n", "\n", " Returns\n", " -----\n", " labels : array, shape = [n_samples] or [n_samples, n_outputs]\n", " Predicted class labels.\n", " \"\"\"\n", " meta_features = self.predict_meta_features(X)\n", "\n", " return self.meta_clf_.predict(meta_features)\n", "\n", " def predict_proba(self, X):\n", " \"\"\" Predict class probabilities for X.\n", " Params\n", " -----\n", " X : {array, matrix}, shape = [n_samples, n_features]\n", " Training vectors, where n_samples is the number of samples and\n", " n_features is the number of features.\n", "\n", " Returns\n", " -----\n", " proba : array, shape = [n_samples, n_classes] or a list of \\\n", " n_outputs of such arrays if n_outputs > 1.\n", " Probability for each class per sample.\n", " \"\"\"\n", " meta_features = self.predict_meta_features(X)\n", "\n", " return self.meta_clf_.predict_proba(meta_features)\n", "\n", " def predict_meta_features(self, X):\n", " \"\"\" Get meta-features of test-data.\n", " Params\n", " -----\n", " X : array, shape = [n_samples, n_features]\n", " Test vectors, where n_samples is the number of samples and\n", " n_features is the number of features.\n", "\n", " Returns\n", " -----\n", " meta-features : array, shape = [n_samples, n_classifiers]\n", " Returns the meta-features for test data.\n", " \"\"\"\n", " if self.use_probas:\n", " probas = 
np.asarray([clf.predict_proba(X) for clf in self.classifiers_])\n", "            if self.average_probas:\n", "                vals = np.average(probas, axis=0)\n", "            else:\n", "                vals = np.concatenate(probas, axis=1)\n", "        else:\n", "            vals = np.column_stack([clf.predict(X) for clf in self.classifiers_])\n", "        return vals\n", "\n", "    def get_params(self, deep=True):\n", "        \"\"\"Return estimator parameter names for GridSearch support.\"\"\"\n", "\n", "        if not deep:\n", "            return super(StackingClassifier, self).get_params(deep=False)\n", "        else:\n", "            out = self.named_classifiers.copy()\n", "            for name, step in self.named_classifiers.items():\n", "                for key, value in step.get_params(deep=True).items():\n", "                    out[f\"{name}__{key}\"] = value\n", "\n", "            out.update(self.named_meta_classifier.copy())\n", "            for name, step in self.named_meta_classifier.items():\n", "                for key, value in step.get_params(deep=True).items():\n", "                    out[f\"{name}__{key}\"] = value\n", "\n", "            for key, value in (\n", "                super(StackingClassifier, self).get_params(deep=False).items()\n", "            ):\n", "                out[key] = value\n", "\n", "            return out" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Usually, **_LogisticRegression_** is used as the meta-model, and we won't break with tradition here. Let's check our StackingClassifier." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# make base LR classifiers\n", "lr1 = LogisticRegression(C=0.1, random_state=11)\n", "lr2 = LogisticRegression(C=1, random_state=11)\n", "lr3 = LogisticRegression(C=10, random_state=11)\n", "\n", "# make base DT classifiers\n", "dt1 = DecisionTreeClassifier(max_depth=1, random_state=11)\n", "dt2 = DecisionTreeClassifier(max_depth=2, random_state=11)\n", "dt3 = DecisionTreeClassifier(max_depth=3, random_state=11)\n", "\n", "# make base KNN classifiers\n", "knn1 = KNeighborsClassifier(n_neighbors=1)\n", "knn2 = KNeighborsClassifier(n_neighbors=2)\n", "\n", "# scale the data for the scale-sensitive classifiers (LR and KNN)\n", "pipe1 = Pipeline([[\"sc\", StandardScaler()], [\"clf\", lr1]])\n", "pipe2 = Pipeline([[\"sc\", StandardScaler()], [\"clf\", lr2]])\n", "pipe3 = Pipeline([[\"sc\", StandardScaler()], [\"clf\", lr3]])\n", "pipe4 = Pipeline([[\"sc\", StandardScaler()], [\"clf\", knn1]])\n", "pipe5 = Pipeline([[\"sc\", StandardScaler()], [\"clf\", knn2]])\n", "clfs = [pipe1, pipe2, pipe3, dt1, dt2, dt3, pipe4, pipe5]\n", "\n", "# make the meta-classifier\n", "meta_clf = LogisticRegression(random_state=11)\n", "stacking = StackingClassifier(classifiers=clfs, meta_classifier=meta_clf)\n", "\n", "labels = [\n", "    \"Logistic Regression C=0.1\",\n", "    \"Logistic Regression C=1\",\n", "    \"Logistic Regression C=10\",\n", "    \"Decision Tree depth=1\",\n", "    \"Decision Tree depth=2\",\n", "    \"Decision Tree depth=3\",\n", "    \"KNN 1\",\n", "    \"KNN 2\",\n", "    \"Stacking\",\n", "]\n", "\n", "clfs = clfs + [stacking]\n", "\n", "for clf, label in zip(clfs, labels):\n", "    scores = cross_val_score(estimator=clf, X=X_train, y=y_train, cv=10)\n", "    print(f\"Accuracy: {scores.mean():.2f} (+/- {scores.std():.2f}) [{label}]\")\n", "\n", "plot_clf_area(clfs, labels, X_train, s_row=3, s_col=3)\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Blending\n", "Blending is a term introduced by the Netflix Prize winners. It is very close to stacked generalization, but a bit simpler and with less risk of an information leak. 
Some researchers use “stacked ensembling” and “blending” interchangeably.\n", "\n", "With blending, instead of creating out-of-fold predictions for the train set, you create a small holdout set of, say, 10% of the train set. The stacker model then trains on this holdout set only.\n", "\n", "Blending has a few benefits:\n", "\n", "* It is simpler than stacking.\n", "* It wards against an information leak: the generalizers and the stacker use different data.\n", "\n", "The cons are:\n", "\n", "* You use less data overall.\n", "* The final model may overfit to the holdout set." ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "### Summary\n", "\n", "Ensemble methods combine different classification models to cancel out their individual weaknesses, which often results in stable and well-performing models that are very attractive for machine learning competitions and sometimes for industrial applications too." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Resources\n", "\n", "1. [Ensemble Learning to Improve Machine Learning Results](https://blog.statsbot.co/ensemble-learning-d1dcd548e936)\n", "2. [Kaggle Ensembling Guide](https://mlwave.com/kaggle-ensembling-guide/)\n", "3. [The BigChaos Solution to the Netflix Grand Prize](https://www.netflixprize.com/assets/GrandPrize2009_BPC_BigChaos.pdf)\n", "4. [Feature-Weighted Linear Stacking](https://arxiv.org/pdf/0911.0460.pdf)\n", "5. [Stacking example](https://github.com/Dyakonov/ml_hacks/blob/master/dj_stacking.ipynb)\n", "6. [A Kaggler's Guide to Model Stacking in Practice](http://blog.kaggle.com/2016/12/27/a-kagglers-guide-to-model-stacking-in-practice/)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.3" } }, "nbformat": 4, "nbformat_minor": 2 }