{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n# Decision boundary of semi-supervised classifiers versus SVM on the Iris dataset\n\nThis example compares decision boundaries learned by two semi-supervised\nmethods, namely :class:`~sklearn.semi_supervised.LabelSpreading` and\n:class:`~sklearn.semi_supervised.SelfTrainingClassifier`, while varying the\nproportion of labeled training data from small fractions up to the full dataset.\n\nBoth methods rely on RBF kernels: :class:`~sklearn.semi_supervised.LabelSpreading` uses\nit by default, and :class:`~sklearn.semi_supervised.SelfTrainingClassifier` is paired\nhere with :class:`~sklearn.svm.SVC` as base estimator (also RBF-based by default) to\nallow a fair comparison. With 100% labeled data,\n:class:`~sklearn.semi_supervised.SelfTrainingClassifier` reduces to a fully supervised\n:class:`~sklearn.svm.SVC`, since there are no unlabeled points left to pseudo-label.\n\nIn a second section, we explain how `predict_proba` is computed in\n:class:`~sklearn.semi_supervised.LabelSpreading` and\n:class:`~sklearn.semi_supervised.SelfTrainingClassifier`.\n\nSee\n`sphx_glr_auto_examples_semi_supervised_plot_semi_supervised_newsgroups.py`\nfor a comparison of `LabelSpreading` and `SelfTrainingClassifier` in terms of\nperformance.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Authors: The scikit-learn developers\n# SPDX-License-Identifier: BSD-3-Clause" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import matplotlib.patches as mpatches\nimport matplotlib.pyplot as plt\nimport numpy as np\n\nfrom sklearn.calibration import CalibratedClassifierCV\nfrom sklearn.datasets import load_iris\nfrom sklearn.inspection import DecisionBoundaryDisplay\nfrom sklearn.semi_supervised import LabelSpreading, SelfTrainingClassifier\nfrom sklearn.svm import SVC\n\niris = load_iris()\nX = iris.data[:, :2]\ny = iris.target\n\nrng = np.random.RandomState(42)\ny_rand = rng.rand(y.shape[0])\ny_10 = np.copy(y)\ny_10[y_rand > 0.1] = -1 # set random samples to be unlabeled\ny_30 = np.copy(y)\ny_30[y_rand > 0.3] = -1\n\nls10 = (LabelSpreading().fit(X, y_10), y_10, \"LabelSpreading with 10% labeled data\")\nls30 = (LabelSpreading().fit(X, y_30), y_30, \"LabelSpreading with 30% labeled data\")\nls100 = (LabelSpreading().fit(X, y), y, \"LabelSpreading with 100% labeled data\")\n\nbase_classifier = CalibratedClassifierCV(SVC(gamma=0.5, random_state=42))\nst10 = (\n SelfTrainingClassifier(base_classifier).fit(X, y_10),\n y_10,\n \"Self-training with 10% labeled data\",\n)\nst30 = (\n SelfTrainingClassifier(base_classifier).fit(X, y_30),\n y_30,\n \"Self-training with 30% labeled data\",\n)\nrbf_svc = (\n base_classifier.fit(X, y),\n y,\n \"SVC with rbf kernel\\n(equivalent to Self-training with 100% labeled data)\",\n)\n\ntab10 = plt.get_cmap(\"tab10\")\ncolor_map = {cls: tab10(cls) for cls in np.unique(y)}\ncolor_map[-1] = (1, 1, 1)\nclassifiers = (ls10, st10, ls30, st30, ls100, rbf_svc)\n\nfig, axes = plt.subplots(nrows=3, ncols=2, sharex=\"col\", sharey=\"row\", figsize=(10, 12))\naxes = axes.ravel()\n\nhandles = [\n mpatches.Patch(facecolor=tab10(i), edgecolor=\"black\", label=iris.target_names[i])\n for i in np.unique(y)\n]\nhandles.append(mpatches.Patch(facecolor=\"white\", edgecolor=\"black\", label=\"Unlabeled\"))\n\nfor ax, (clf, y_train, title) in zip(axes, classifiers):\n DecisionBoundaryDisplay.from_estimator(\n 
clf,\n X,\n response_method=\"predict_proba\",\n plot_method=\"contourf\",\n ax=ax,\n )\n colors = [color_map[label] for label in y_train]\n ax.scatter(X[:, 0], X[:, 1], c=colors, edgecolor=\"black\")\n ax.set_title(title)\nfig.suptitle(\n \"Semi-supervised decision boundaries with varying fractions of labeled data\", y=1\n)\nfig.legend(\n handles=handles, loc=\"lower center\", ncol=len(handles), bbox_to_anchor=(0.5, 0.0)\n)\nfig.tight_layout(rect=[0, 0.03, 1, 1])\nplt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We observe that, even when only a small subset of the labels is available, the\ndecision boundaries are already quite similar to those obtained by training on\nthe fully labeled dataset.\n\n## Interpretation of `predict_proba`\n\n### `predict_proba` in `LabelSpreading`\n\n:class:`~sklearn.semi_supervised.LabelSpreading` constructs a similarity graph\nfrom the data, by default using an RBF kernel. This means each sample is\nconnected to every other sample with a weight that decays with their squared\nEuclidean distance, scaled by a parameter `gamma`.\n\nOnce we have that weighted graph, labels are propagated along the graph\nedges. Each sample gradually takes on a soft label distribution that reflects\na weighted average of the labels of its neighbors until the process converges.\nThese per-sample distributions are stored in `label_distributions_`.\n\n`predict_proba` computes the class probabilities for a new point by taking a\nweighted average of the rows in `label_distributions_`, where the weights come\nfrom the RBF kernel similarities between the new point and the training\nsamples. The averaged values are then renormalized so that they sum to one.\n\nKeep in mind that these \"probabilities\" are graph-based scores rather than\ncalibrated posteriors, so their absolute values should not be over-interpreted.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.metrics.pairwise import rbf_kernel\n\nls = ls100[0] # fitted LabelSpreading instance\nx_query = np.array([[3.5, 1.5]]) # point in the soft blue region\n\n# Step 1: similarities between query and all training samples\nW = rbf_kernel(x_query, X, gamma=ls.gamma) # `gamma=20` by default\n\n# Step 2: weighted average of label distributions\nprobs = np.dot(W, ls.label_distributions_)\n\n# Step 3: normalize to sum to 1\nprobs /= probs.sum(axis=1, keepdims=True)\n\nprint(\"Manual:\", probs)\nprint(\"API :\", ls.predict_proba(x_query))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### `predict_proba` in `SelfTrainingClassifier`\n\n:class:`~sklearn.semi_supervised.SelfTrainingClassifier` works by repeatedly\nfitting its base estimator on the currently labeled data (the original labels\nplus any pseudo-labels accepted so far), then adding pseudo-labels for\nunlabeled points whose predicted probabilities exceed a confidence threshold.\nThis process repeats until no new points can be labeled or the maximum number\nof iterations is reached, at which point the final fitted base estimator is\nstored in the attribute `estimator_`.\n" ] },
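{ "cell_type": "markdown", "metadata": {}, "source": [ "The cell below is a minimal sketch of this loop, assuming the default\n`threshold=0.75` and `max_iter=10` and ignoring implementation details such as\nthe alternative `k_best` selection criterion and the `labeled_iter_`\nbookkeeping, so its pseudo-labels are not guaranteed to match\n:class:`~sklearn.semi_supervised.SelfTrainingClassifier` exactly.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.base import clone\n\n# Work on a copy of the 30% labeled target; -1 marks unlabeled samples.\ny_sketch = np.copy(y_30)\nthreshold, max_iter = 0.75, 10 # defaults of SelfTrainingClassifier\n\nfor _ in range(max_iter):\n    labeled = y_sketch != -1\n    if labeled.all():\n        break\n    est = clone(base_classifier).fit(X[labeled], y_sketch[labeled])\n    proba = est.predict_proba(X[~labeled])\n    confident = proba.max(axis=1) > threshold\n    if not confident.any():\n        break # no remaining unlabeled point is predicted confidently enough\n    new_idx = np.flatnonzero(~labeled)[confident]\n    y_sketch[new_idx] = est.classes_[proba.argmax(axis=1)][confident]\n\n# Refit on everything labeled by the end (true labels plus pseudo-labels).\nlabeled = y_sketch != -1\nsketch_estimator = clone(base_classifier).fit(X[labeled], y_sketch[labeled])\n\nprint(\"Sketch :\", sketch_estimator.predict_proba(x_query))\nprint(\"SelfTrainingClassifier:\", st30[0].predict_proba(x_query))" ] },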
{ "cell_type": "markdown", "metadata": {}, "source": [ "When you call `predict_proba` on the `SelfTrainingClassifier`, it simply\ndelegates to the final fitted estimator stored in `estimator_`.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "st = st10[0]\nprint(\"Manual:\", st.estimator_.predict_proba(x_query))\nprint(\"API :\", st.predict_proba(x_query))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In both methods, semi-supervised learning can be understood as constructing a\ncategorical distribution over classes for each sample.\n:class:`~sklearn.semi_supervised.LabelSpreading` keeps these distributions soft and\nupdates them through graph-based propagation.\nPredictions (including `predict_proba`) remain tied to the training set, which\nmust be stored for inference.\n\n:class:`~sklearn.semi_supervised.SelfTrainingClassifier` instead uses these\ndistributions internally to decide which unlabeled points receive pseudo-labels\nduring training, but at prediction time the returned probabilities come directly from\nthe final fitted estimator, and therefore the decision rule does not require storing\nthe training data.\n\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.14" } }, "nbformat": 4, "nbformat_minor": 0 }