{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n# Overview of multiclass training meta-estimators\n\nIn this example, we discuss the problem of classification when the target\nvariable is composed of more than two classes. This is called multiclass\nclassification.\n\nIn scikit-learn, all estimators support multiclass classification out of the\nbox: the most sensible strategy was implemented for the end-user. The\n:mod:`sklearn.multiclass` module implements various strategies that one can use\nfor experimenting or developing third-party estimators that only support binary\nclassification.\n\n:mod:`sklearn.multiclass` includes OvO/OvR strategies used to train a\nmulticlass classifier by fitting a set of binary classifiers (the\n:class:`~sklearn.multiclass.OneVsOneClassifier` and\n:class:`~sklearn.multiclass.OneVsRestClassifier` meta-estimators). This example\nwill review them.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Authors: The scikit-learn developers\n# SPDX-License-Identifier: BSD-3-Clause" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The Yeast UCI dataset\n\nIn this example, we use a UCI dataset [1]_, generally referred to as the Yeast\ndataset.
We use the :func:`sklearn.datasets.fetch_openml` function to load\nthe dataset from OpenML.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.datasets import fetch_openml\n\nX, y = fetch_openml(data_id=181, as_frame=True, return_X_y=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To know the type of data science problem we are dealing with, we can check\nthe target for which we want to build a predictive model.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "y.value_counts().sort_index()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We see that the target is discrete and composed of 10 classes. We therefore\ndeal with a multiclass classification problem.\n\n## Strategies comparison\n\nIn the following experiment, we use a\n:class:`~sklearn.tree.DecisionTreeClassifier` and a\n:class:`~sklearn.model_selection.RepeatedStratifiedKFold` cross-validation\nwith 3 splits and 5 repetitions.\n\nWe compare the following strategies:\n\n* :class:`~sklearn.tree.DecisionTreeClassifier` can handle multiclass\n classification without needing any special adjustments. It works by breaking\n down the training data into smaller subsets and focusing on the most common\n class in each subset.
By repeating this process, the model can accurately\n classify input data into multiple different classes.\n* :class:`~sklearn.multiclass.OneVsOneClassifier` trains a set of binary\n classifiers where each classifier is trained to distinguish between\n two classes.\n* :class:`~sklearn.multiclass.OneVsRestClassifier`: trains a set of binary\n classifiers where each classifier is trained to distinguish between\n one class and the rest of the classes.\n* :class:`~sklearn.multiclass.OutputCodeClassifier`: trains a set of binary\n classifiers where each classifier is trained to distinguish between\n a set of classes from the rest of the classes. The set of classes is\n defined by a codebook, which is randomly generated in scikit-learn. This\n method exposes a parameter `code_size` to control the size of the codebook.\n We set it above one since we are not interested in compressing the class\n representation.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import pandas as pd\n\nfrom sklearn.model_selection import RepeatedStratifiedKFold, cross_validate\nfrom sklearn.multiclass import (\n OneVsOneClassifier,\n OneVsRestClassifier,\n OutputCodeClassifier,\n)\nfrom sklearn.tree import DecisionTreeClassifier\n\ncv = RepeatedStratifiedKFold(n_splits=3, n_repeats=5, random_state=0)\n\ntree = DecisionTreeClassifier(random_state=0)\novo_tree = OneVsOneClassifier(tree)\novr_tree = OneVsRestClassifier(tree)\necoc = OutputCodeClassifier(tree, code_size=2)\n\ncv_results_tree = cross_validate(tree, X, y, cv=cv, n_jobs=2)\ncv_results_ovo = cross_validate(ovo_tree, X, y, cv=cv, n_jobs=2)\ncv_results_ovr = cross_validate(ovr_tree, X, y, cv=cv, n_jobs=2)\ncv_results_ecoc = cross_validate(ecoc, X, y, cv=cv, n_jobs=2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can now compare the statistical performance of the different strategies.\nWe plot the score distribution of the different 
strategies.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from matplotlib import pyplot as plt\n\nscores = pd.DataFrame(\n {\n \"DecisionTreeClassifier\": cv_results_tree[\"test_score\"],\n \"OneVsOneClassifier\": cv_results_ovo[\"test_score\"],\n \"OneVsRestClassifier\": cv_results_ovr[\"test_score\"],\n \"OutputCodeClassifier\": cv_results_ecoc[\"test_score\"],\n }\n)\nax = scores.plot.kde(legend=True)\nax.set_xlabel(\"Accuracy score\")\nax.set_xlim([0, 0.7])\n_ = ax.set_title(\n \"Density of the accuracy scores for the different multiclass strategies\"\n)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "At first glance, we can see that the built-in strategy of the decision\ntree classifier is working quite well. One-vs-one and the error-correcting\noutput code strategies are working even better. However, the\none-vs-rest strategy is not working as well as the other strategies.\n\nIndeed, these results reproduce something reported in the literature\nas in [2]_.
However, the story is not as simple as it seems.\n\n## The importance of hyperparameters search\n\nIt was later shown in [3]_ that the multiclass strategies would show similar\nscores if the hyperparameters of the base classifiers are first optimized.\n\nHere we try to reproduce such a result by at least optimizing the depth of the\nbase decision tree.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.model_selection import GridSearchCV\n\nparam_grid = {\"max_depth\": [3, 5, 8]}\ntree_optimized = GridSearchCV(tree, param_grid=param_grid, cv=3)\novo_tree = OneVsOneClassifier(tree_optimized)\novr_tree = OneVsRestClassifier(tree_optimized)\necoc = OutputCodeClassifier(tree_optimized, code_size=2)\n\ncv_results_tree = cross_validate(tree_optimized, X, y, cv=cv, n_jobs=2)\ncv_results_ovo = cross_validate(ovo_tree, X, y, cv=cv, n_jobs=2)\ncv_results_ovr = cross_validate(ovr_tree, X, y, cv=cv, n_jobs=2)\ncv_results_ecoc = cross_validate(ecoc, X, y, cv=cv, n_jobs=2)\n\nscores = pd.DataFrame(\n {\n \"DecisionTreeClassifier\": cv_results_tree[\"test_score\"],\n \"OneVsOneClassifier\": cv_results_ovo[\"test_score\"],\n \"OneVsRestClassifier\": cv_results_ovr[\"test_score\"],\n \"OutputCodeClassifier\": cv_results_ecoc[\"test_score\"],\n }\n)\nax = scores.plot.kde(legend=True)\nax.set_xlabel(\"Accuracy score\")\nax.set_xlim([0, 0.7])\n_ = ax.set_title(\n \"Density of the accuracy scores for the different multiclass strategies\"\n)\n\nplt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see that once the hyperparameters are optimized, all multiclass\nstrategies have similar performance as discussed in [3]_.\n\n## Conclusion\n\nWe can get some intuition behind those results.\n\nFirst, one-vs-one and error-correcting output code outperform the tree when\nthe hyperparameters are not optimized because they ensemble a larger number of
classifiers. The ensembling\nimproves the generalization performance. This is similar to why a bagging\nclassifier generally performs better than a single decision tree if no care\nis taken to optimize the hyperparameters.\n\nThen, we see the importance of optimizing the hyperparameters. Indeed, it\nshould be regularly explored when developing predictive models even if\ntechniques such as ensembling help reduce this impact.\n\nFinally, it is important to recall that the estimators in scikit-learn\nare developed with a specific strategy to handle multiclass classification\nout of the box. So for these estimators, it means that there is no need to\nuse different strategies. These strategies are mainly useful for third-party\nestimators supporting only binary classification. In all cases, we also show\nthat the hyperparameters should be optimized.\n\n## References\n\n.. [1] https://archive.ics.uci.edu/ml/datasets/Yeast\n\n.. [2] [\"Reducing multiclass to binary: A unifying approach for margin classifiers.\"\n Allwein, Erin L., Robert E. Schapire, and Yoram Singer.\n Journal of Machine Learning Research. 1 Dec (2000): 113-141.](https://www.jmlr.org/papers/volume1/allwein00a/allwein00a.pdf)\n\n.. [3] [\"In defense of one-vs-all classification.\"\n Rifkin, Ryan, and Aldebaran Klautau.\n Journal of Machine Learning Research. 5 Jan (2004): 101-141.](https://www.jmlr.org/papers/volume5/rifkin04a/rifkin04a.pdf)\n\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.21" } }, "nbformat": 4, "nbformat_minor": 0 }