{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "\n# A demo of the Spectral Biclustering algorithm\n\nThis example demonstrates how to generate a checkerboard dataset and bicluster\nit using the :class:`~sklearn.cluster.SpectralBiclustering` algorithm. The\nspectral biclustering algorithm is specifically designed to cluster data by\nsimultaneously considering both the rows (samples) and columns (features) of a\nmatrix. It aims to identify patterns not only between samples but also within\nsubsets of samples, allowing for the detection of localized structure within the\ndata. This makes spectral biclustering particularly well-suited for datasets\nwhere the order or arrangement of features is fixed, such as in images, time\nseries, or genomes.\n\nThe data is generated, then shuffled and passed to the spectral biclustering\nalgorithm. The rows and columns of the shuffled matrix are then rearranged to\nplot the biclusters found.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "# Authors: The scikit-learn developers\n# SPDX-License-Identifier: BSD-3-Clause"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Generate sample data\nWe generate the sample data using the\n:func:`~sklearn.datasets.make_checkerboard` function. The matrix has\n`shape=(300, 300)`, and the color of each pixel encodes its value. Each\nbicluster is filled with a constant drawn from a uniform distribution, and\nGaussian noise is then added, with `noise` setting its standard deviation.\n\nThe plot shows that the data is spread over 4 x 3 = 12 bicluster cells\nthat are relatively easy to distinguish.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from matplotlib import pyplot as plt\n\nfrom sklearn.datasets import make_checkerboard\n\nn_clusters = (4, 3)\ndata, rows, columns = make_checkerboard(\n    shape=(300, 300), n_clusters=n_clusters, noise=10, shuffle=False, random_state=42\n)\n\nplt.matshow(data, cmap=plt.cm.Blues)\nplt.title(\"Original dataset\")\nplt.show()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "We shuffle the rows and columns of the data; the goal is to recover the\nstructure afterwards using :class:`~sklearn.cluster.SpectralBiclustering`.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "import numpy as np\n\n# Create shuffled permutations of the row and column indices\nrng = np.random.RandomState(0)\nrow_idx_shuffled = rng.permutation(data.shape[0])\ncol_idx_shuffled = rng.permutation(data.shape[1])"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "We apply the shuffled indices to the data and plot the result. The structure\nof the original data matrix is no longer visible.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "data = data[row_idx_shuffled][:, col_idx_shuffled]\n\nplt.matshow(data, cmap=plt.cm.Blues)\nplt.title(\"Shuffled dataset\")\nplt.show()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Fitting `SpectralBiclustering`\nWe fit the model and compare the obtained clusters with the ground truth. Note\nthat when creating the model we specify the same number of clusters that we\nused to create the dataset (`n_clusters = (4, 3)`), which helps to obtain a\ngood result.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from sklearn.cluster import SpectralBiclustering\nfrom sklearn.metrics import consensus_score\n\nmodel = SpectralBiclustering(n_clusters=n_clusters, method=\"log\", random_state=0)\nmodel.fit(data)\n\n# Compute the similarity of two sets of biclusters\nscore = consensus_score(\n    model.biclusters_, (rows[:, row_idx_shuffled], columns[:, col_idx_shuffled])\n)\nprint(f\"consensus score: {score:.1f}\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The score ranges from 0 to 1, where 1 corresponds to a perfect match between\nthe biclusters found and the ground truth. It measures the quality of the\nbiclustering.\n\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Plotting results\nNow, we sort the rows and columns of the data in ascending order of the labels\nassigned by the :class:`~sklearn.cluster.SpectralBiclustering` model and plot\nagain. The `row_labels_` range from 0 to 3, while the `column_labels_` range\nfrom 0 to 2, corresponding to 4 row clusters and 3 column clusters.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "# Reordering first the rows and then the columns.\nreordered_rows = data[np.argsort(model.row_labels_)]\nreordered_data = reordered_rows[:, np.argsort(model.column_labels_)]\n\nplt.matshow(reordered_data, cmap=plt.cm.Blues)\nplt.title(\"After biclustering; rearranged to show biclusters\")\nplt.show()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "As a last step, we want to show the relationship between the row and column\nlabels assigned by the model. To do so, we sort `row_labels_` and\n`column_labels_`, add 1 to each so that the labels start from 1 instead of 0\n(a label of 0 would zero out its entire row or column block in the product),\nand build a grid from them with :func:`numpy.outer`.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "plt.matshow(\n    np.outer(np.sort(model.row_labels_) + 1, np.sort(model.column_labels_) + 1),\n    cmap=plt.cm.Blues,\n)\nplt.title(\"Checkerboard structure of rearranged data\")\nplt.show()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The outer product of the sorted row and column label vectors reproduces the\ncheckerboard structure: each combination of row and column labels forms a\nblock, and neighboring blocks appear in different shades of blue.\n\n"
      ]
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.10.18"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}