{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n# Plotting Learning Curves and Checking Models' Scalability\n\nIn this example, we show how to use the class\n:class:`~sklearn.model_selection.LearningCurveDisplay` to easily plot learning\ncurves. In addition, we give an interpretation to the learning curves obtained\nfor a naive Bayes and SVM classifiers.\n\nThen, we explore and draw some conclusions about the scalability of these predictive\nmodels by looking at their computational cost and not only at their statistical\naccuracy.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Authors: The scikit-learn developers\n# SPDX-License-Identifier: BSD-3-Clause" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Learning Curve\n\nLearning curves show the effect of adding more samples during the training\nprocess. The effect is depicted by checking the statistical performance of\nthe model in terms of training score and testing score.\n\nHere, we compute the learning curve of a naive Bayes classifier and a SVM\nclassifier with a RBF kernel using the digits dataset.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.datasets import load_digits\nfrom sklearn.naive_bayes import GaussianNB\nfrom sklearn.svm import SVC\n\nX, y = load_digits(return_X_y=True)\nnaive_bayes = GaussianNB()\nsvc = SVC(kernel=\"rbf\", gamma=0.001)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The :meth:`~sklearn.model_selection.LearningCurveDisplay.from_estimator`\ndisplays the learning curve given the dataset and the predictive model to\nanalyze. To get an estimate of the scores uncertainty, this method uses\na cross-validation procedure.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import matplotlib.pyplot as plt\nimport numpy as np\n\nfrom sklearn.model_selection import LearningCurveDisplay, ShuffleSplit\n\nfig, ax = plt.subplots(nrows=1, ncols=2, figsize=(10, 6), sharey=True)\n\ncommon_params = {\n \"X\": X,\n \"y\": y,\n \"train_sizes\": np.linspace(0.1, 1.0, 5),\n \"cv\": ShuffleSplit(n_splits=50, test_size=0.2, random_state=0),\n \"score_type\": \"both\",\n \"n_jobs\": 4,\n \"line_kw\": {\"marker\": \"o\"},\n \"std_display_style\": \"fill_between\",\n \"score_name\": \"Accuracy\",\n}\n\nfor ax_idx, estimator in enumerate([naive_bayes, svc]):\n LearningCurveDisplay.from_estimator(estimator, **common_params, ax=ax[ax_idx])\n handles, label = ax[ax_idx].get_legend_handles_labels()\n ax[ax_idx].legend(handles[:2], [\"Training Score\", \"Test Score\"])\n ax[ax_idx].set_title(f\"Learning Curve for {estimator.__class__.__name__}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We first analyze the learning curve of the naive Bayes classifier. Its shape\ncan be found in more complex datasets very often: the training score is very\nhigh when using few samples for training and decreases when increasing the\nnumber of samples, whereas the test score is very low at the beginning and\nthen increases when adding samples. 
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Complexity analysis\n\nIn addition to these learning curves, it is also possible to look at the\nscalability of the predictive models in terms of training and scoring times.\n\nThe :class:`~sklearn.model_selection.LearningCurveDisplay` class does not\nprovide such information. We need to resort to the\n:func:`~sklearn.model_selection.learning_curve` function instead and make\nthe plot manually.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.model_selection import learning_curve\n\ncommon_params = {\n    \"X\": X,\n    \"y\": y,\n    \"train_sizes\": np.linspace(0.1, 1.0, 5),\n    \"cv\": ShuffleSplit(n_splits=50, test_size=0.2, random_state=0),\n    \"n_jobs\": 4,\n    \"return_times\": True,\n}\n\ntrain_sizes, _, test_scores_nb, fit_times_nb, score_times_nb = learning_curve(\n    naive_bayes, **common_params\n)\ntrain_sizes, _, test_scores_svm, fit_times_svm, score_times_svm = learning_curve(\n    svc, **common_params\n)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "fig, ax = plt.subplots(nrows=2, ncols=2, figsize=(16, 12), sharex=True)\n\nfor ax_idx, (fit_times, score_times, estimator) in enumerate(\n    zip(\n        [fit_times_nb, fit_times_svm],\n        [score_times_nb, score_times_svm],\n        [naive_bayes, svc],\n    )\n):\n    # scalability regarding the fit time\n    ax[0, ax_idx].plot(train_sizes, fit_times.mean(axis=1), \"o-\")\n    ax[0, ax_idx].fill_between(\n        train_sizes,\n        fit_times.mean(axis=1) - fit_times.std(axis=1),\n        fit_times.mean(axis=1) + fit_times.std(axis=1),\n        alpha=0.3,\n    )\n    ax[0, ax_idx].set_ylabel(\"Fit time (s)\")\n    ax[0, ax_idx].set_title(\n        f\"Scalability of the {estimator.__class__.__name__} classifier\"\n    )\n\n    # scalability regarding the score time\n    ax[1, ax_idx].plot(train_sizes, score_times.mean(axis=1), \"o-\")\n    ax[1, ax_idx].fill_between(\n        train_sizes,\n        score_times.mean(axis=1) - score_times.std(axis=1),\n        score_times.mean(axis=1) + score_times.std(axis=1),\n        alpha=0.3,\n    )\n    ax[1, ax_idx].set_ylabel(\"Score time (s)\")\n    ax[1, ax_idx].set_xlabel(\"Number of training samples\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We see that the scalability of the SVM and naive Bayes classifiers is very\ndifferent. The complexity of the SVM classifier at fit and score time\nincreases rapidly with the number of samples. Indeed, the fit time complexity\nof this classifier is known to be more than quadratic in the number of\nsamples, which makes it hard to scale to datasets with more than a few tens\nof thousands of samples. In contrast, the naive Bayes classifier scales much\nbetter, with a lower complexity at fit and score time.\n" ] },
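{ "cell_type": "markdown", "metadata": {}, "source": [ "As a rough, indicative check (the estimate is noisy with only five training\nsizes and a small dataset), the small sketch below fits a straight line to\nthe mean fit time as a function of the training set size on a log-log scale.\nThe slope approximates the exponent of a power-law scaling and should be\nmarkedly larger for the SVM classifier than for the naive Bayes classifier.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# A rough sketch: estimate the empirical growth rate of the fit time by\n# fitting a straight line to log(fit time) versus log(n_samples). The slope\n# approximates the exponent of a power-law scaling.\nfor name, fit_times in zip([\"GaussianNB\", \"SVC\"], [fit_times_nb, fit_times_svm]):\n    slope, _ = np.polyfit(np.log(train_sizes), np.log(fit_times.mean(axis=1)), deg=1)\n    print(f\"{name}: fit time grows roughly as n_samples^{slope:.2f}\")" ] },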
{ "cell_type": "markdown", "metadata": {}, "source": [ "Subsequently, we can check the trade-off between increased training time and\nthe cross-validation score.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(16, 6))\n\nfor ax_idx, (fit_times, test_scores, estimator) in enumerate(\n    zip(\n        [fit_times_nb, fit_times_svm],\n        [test_scores_nb, test_scores_svm],\n        [naive_bayes, svc],\n    )\n):\n    ax[ax_idx].plot(fit_times.mean(axis=1), test_scores.mean(axis=1), \"o-\")\n    ax[ax_idx].fill_between(\n        fit_times.mean(axis=1),\n        test_scores.mean(axis=1) - test_scores.std(axis=1),\n        test_scores.mean(axis=1) + test_scores.std(axis=1),\n        alpha=0.3,\n    )\n    ax[ax_idx].set_ylabel(\"Accuracy\")\n    ax[ax_idx].set_xlabel(\"Fit time (s)\")\n    ax[ax_idx].set_title(\n        f\"Performance of the {estimator.__class__.__name__} classifier\"\n    )\n\nplt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In these plots, we can look for the point after which the cross-validation\nscore no longer increases while the training time keeps growing.\n\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.21" } }, "nbformat": 4, "nbformat_minor": 0 }