{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n# Label Propagation digits: Demonstrating performance\n\nThis example demonstrates the power of semisupervised learning by\ntraining a Label Spreading model to classify handwritten digits\nwith sets of very few labels.\n\nThe handwritten digit dataset has 1797 total points. The model will\nbe trained using all points, but only 30 will be labeled. Results\nin the form of a confusion matrix and a series of metrics over each\nclass will be very good.\n\nAt the end, the top 10 most uncertain predictions will be shown.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Authors: The scikit-learn developers\n# SPDX-License-Identifier: BSD-3-Clause" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data generation\n\nWe use the digits dataset. We only use a subset of randomly selected samples.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import numpy as np\n\nfrom sklearn import datasets\n\ndigits = datasets.load_digits()\nrng = np.random.RandomState(2)\nindices = np.arange(len(digits.data))\nrng.shuffle(indices)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We selected 340 samples of which only 40 will be associated with a known label.\nTherefore, we store the indices of the 300 other samples for which we are not\nsupposed to know their labels.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "X = digits.data[indices[:340]]\ny = digits.target[indices[:340]]\nimages = digits.images[indices[:340]]\n\nn_total_samples = len(y)\nn_labeled_points = 40\n\nindices = np.arange(n_total_samples)\n\nunlabeled_set = indices[n_labeled_points:]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Shuffle everything around\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "y_train = np.copy(y)\ny_train[unlabeled_set] = -1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Semi-supervised learning\n\nWe fit a :class:`~sklearn.semi_supervised.LabelSpreading` and use it to predict\nthe unknown labels.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.metrics import classification_report\nfrom sklearn.semi_supervised import LabelSpreading\n\nlp_model = LabelSpreading(gamma=0.25, max_iter=20)\nlp_model.fit(X, y_train)\npredicted_labels = lp_model.transduction_[unlabeled_set]\ntrue_labels = y[unlabeled_set]\n\nprint(\n \"Label Spreading model: %d labeled & %d unlabeled points (%d total)\"\n % (n_labeled_points, n_total_samples - n_labeled_points, n_total_samples)\n)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Classification report\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "print(classification_report(true_labels, predicted_labels))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Confusion matrix\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.metrics import ConfusionMatrixDisplay\n\nConfusionMatrixDisplay.from_predictions(\n true_labels, predicted_labels, labels=lp_model.classes_\n)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Plot the most uncertain predictions\n\nHere, we will pick and show the 10 most uncertain predictions.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from scipy import stats\n\npred_entropies = stats.distributions.entropy(lp_model.label_distributions_.T)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Pick the top 10 most uncertain labels\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "uncertainty_index = np.argsort(pred_entropies)[-10:]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Plot\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n\nf = plt.figure(figsize=(7, 5))\nfor index, image_index in enumerate(uncertainty_index):\n image = images[image_index]\n\n sub = f.add_subplot(2, 5, index + 1)\n sub.imshow(image, cmap=plt.cm.gray_r)\n plt.xticks([])\n plt.yticks([])\n sub.set_title(\n \"predict: %i\\ntrue: %i\" % (lp_model.transduction_[image_index], y[image_index])\n )\n\nf.suptitle(\"Learning with small amount of labeled data\")\nplt.show()" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.21" } }, "nbformat": 4, "nbformat_minor": 0 }