{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n# Semi-supervised Classification on a Text Dataset\n\nThis example demonstrates the effectiveness of semi-supervised learning\nfor text classification on :class:`TF-IDF\n` features when labeled data\nis scarce. For such purpose we compare four different approaches:\n\n1. Supervised learning using 100% of labels in the training set (best-case\n scenario)\n\n - Uses :class:`~sklearn.linear_model.SGDClassifier` with full supervision\n - Represents the best possible performance when labeled data is abundant\n\n2. Supervised learning using 20% of labels in the training set (baseline)\n\n - Same model as the best-case scenario but trained on a random 20% subset of\n the labeled training data\n - Shows the performance degradation of a fully supervised model due to\n limited labeled data\n\n3. :class:`~sklearn.semi_supervised.SelfTrainingClassifier` (semi-supervised)\n\n - Uses 20% labeled data + 80% unlabeled data for training\n - Iteratively predicts labels for unlabeled data\n - Demonstrates how self-training can improve performance\n\n4. 
:class:`~sklearn.semi_supervised.LabelSpreading` (semi-supervised)\n\n - Uses 20% labeled data + 80% unlabeled data for training\n - Propagates labels through the data manifold\n - Shows how graph-based methods can leverage unlabeled data\n\nThe example uses the 20 newsgroups dataset, focusing on five categories.\nThe results demonstrate how semi-supervised methods can achieve better\nperformance than supervised learning with limited labeled data by\neffectively utilizing unlabeled samples.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Authors: The scikit-learn developers\n# SPDX-License-Identifier: BSD-3-Clause" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.datasets import fetch_20newsgroups\nfrom sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer\nfrom sklearn.linear_model import SGDClassifier\nfrom sklearn.metrics import f1_score\nfrom sklearn.model_selection import train_test_split\nfrom sklearn.pipeline import Pipeline\nfrom sklearn.semi_supervised import LabelSpreading, SelfTrainingClassifier\n\n# Loading dataset containing first five categories\ndata = fetch_20newsgroups(\n subset=\"train\",\n categories=[\n \"alt.atheism\",\n \"comp.graphics\",\n \"comp.os.ms-windows.misc\",\n \"comp.sys.ibm.pc.hardware\",\n \"comp.sys.mac.hardware\",\n ],\n)\n\n# Parameters\nsdg_params = dict(alpha=1e-5, penalty=\"l2\", loss=\"log_loss\")\nvectorizer_params = dict(ngram_range=(1, 2), min_df=5, max_df=0.8)\n\n# Supervised Pipeline\npipeline = Pipeline(\n [\n (\"vect\", CountVectorizer(**vectorizer_params)),\n (\"tfidf\", TfidfTransformer()),\n (\"clf\", SGDClassifier(**sdg_params)),\n ]\n)\n# SelfTraining Pipeline\nst_pipeline = Pipeline(\n [\n (\"vect\", CountVectorizer(**vectorizer_params)),\n (\"tfidf\", TfidfTransformer()),\n (\"clf\", SelfTrainingClassifier(SGDClassifier(**sdg_params))),\n 
]\n)\n# LabelSpreading Pipeline\nls_pipeline = Pipeline(\n [\n (\"vect\", CountVectorizer(**vectorizer_params)),\n (\"tfidf\", TfidfTransformer()),\n (\"clf\", LabelSpreading()),\n ]\n)\n\n\ndef eval_and_get_f1(clf, X_train, y_train, X_test, y_test):\n \"\"\"Evaluate model performance and return F1 score\"\"\"\n print(f\" Number of training samples: {len(X_train)}\")\n print(f\" Unlabeled samples in training set: {sum(1 for x in y_train if x == -1)}\")\n clf.fit(X_train, y_train)\n y_pred = clf.predict(X_test)\n f1 = f1_score(y_test, y_pred, average=\"micro\")\n print(f\" Micro-averaged F1 score on test set: {f1:.3f}\")\n print(\"\\n\")\n return f1\n\n\nX, y = data.data, data.target\nX_train, X_test, y_train, y_test = train_test_split(X, y)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "1. Evaluate a supervised SGDClassifier using 100% of the (labeled) training set.\nThis represents the best-case performance when the model has full access to all\nlabeled examples.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "f1_scores = {}\nprint(\"1. Supervised SGDClassifier on 100% of the data:\")\nf1_scores[\"Supervised (100%)\"] = eval_and_get_f1(\n pipeline, X_train, y_train, X_test, y_test\n)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "2. Evaluate a supervised SGDClassifier trained on only 20% of the data.\nThis serves as a baseline to illustrate the performance drop caused by limiting\nthe training samples.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import numpy as np\n\nprint(\"2. 
Supervised SGDClassifier on 20% of the training data:\")\nrng = np.random.default_rng(42)\ny_mask = rng.random(len(y_train)) < 0.2\n# X_20 and y_20 are the subset of the train dataset indicated by the mask\nX_20, y_20 = map(list, zip(*((x, y) for x, y, m in zip(X_train, y_train, y_mask) if m)))\nf1_scores[\"Supervised (20%)\"] = eval_and_get_f1(pipeline, X_20, y_20, X_test, y_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "3. Evaluate a semi-supervised SelfTrainingClassifier using 20% labeled and 80%\nunlabeled data.\nThe remaining 80% of the training labels are masked as unlabeled (-1),\nallowing the model to iteratively label and learn from them.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "print(\n \"3. SelfTrainingClassifier (semi-supervised) using 20% labeled \"\n \"+ 80% unlabeled data:\"\n)\ny_train_semi = y_train.copy()\ny_train_semi[~y_mask] = -1\nf1_scores[\"SelfTraining\"] = eval_and_get_f1(\n st_pipeline, X_train, y_train_semi, X_test, y_test\n)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "4. Evaluate a semi-supervised LabelSpreading model using 20% labeled and 80%\nunlabeled data.\nLike SelfTraining, the model infers labels for the unlabeled portion of the data\nto enhance performance.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "print(\"4. 
LabelSpreading (semi-supervised) using 20% labeled + 80% unlabeled data:\")\nf1_scores[\"LabelSpreading\"] = eval_and_get_f1(\n ls_pipeline, X_train, y_train_semi, X_test, y_test\n)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Plot results\nVisualize the performance of different classification approaches using a bar chart.\nThis helps to compare how each method performs based on the\nmicro-averaged :func:`~sklearn.metrics.f1_score`.\nMicro-averaging computes metrics globally across all classes,\nwhich gives a single overall measure of performance and allows fair comparison\nbetween the different approaches, even in the presence of class imbalance.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n\nplt.figure(figsize=(10, 6))\n\nmodels = list(f1_scores.keys())\nscores = list(f1_scores.values())\n\ncolors = [\"royalblue\", \"royalblue\", \"forestgreen\", \"royalblue\"]\nbars = plt.bar(models, scores, color=colors)\n\nplt.title(\"Comparison of Classification Approaches\")\nplt.ylabel(\"Micro-averaged F1 Score on test set\")\nplt.xticks()\n\nfor bar in bars:\n height = bar.get_height()\n plt.text(\n bar.get_x() + bar.get_width() / 2.0,\n height,\n f\"{height:.2f}\",\n ha=\"center\",\n va=\"bottom\",\n )\n\nplt.figtext(\n 0.5,\n 0.02,\n \"SelfTraining classifier shows improved performance over \"\n \"supervised learning with limited data\",\n ha=\"center\",\n va=\"bottom\",\n fontsize=10,\n style=\"italic\",\n)\n\nplt.tight_layout()\nplt.subplots_adjust(bottom=0.15)\nplt.show()" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.14" } }, "nbformat": 4, 
"nbformat_minor": 0 }