{ "cells": [ { "cell_type": "markdown", "id": "00f03585", "metadata": {}, "source": [ "# Artificial Intelligence and Modern Physics: a Two Way Connection\n", "\n", "The following is part of the hands-on sessions of the [AIPhy](https://aiphy.fisica.unimib.it/) school.\n", "The notebook aims at giving an **introduction to machine learning** methods in Python.\n", "Tutorials deal with different **unsupervised and supervised algorithms**.\n", "Students will learn how to build these algorithms from scratch using basic Python classes.\n", "They will then apply different techniques to test and evaluate them on toy and real world datasets." ] }, { "cell_type": "markdown", "id": "933ab00d", "metadata": {}, "source": [ "## Preliminary Actions\n", "\n", "I recommend you use a Python **virtual environment** to setup your work area:\n", "\n", "```bash\n", "python -m venv .venv\n", "```\n", "\n", "At the time of writing the Python version is `3.10.12`: you can change this either with your distribution package manager, or by installing a **Conda** environment (e.g. `conda create -y -n aiphy python\"==3.10.12\" && conda activate aiphy`).\n", "\n", "You can then activate it with:\n", "\n", "```bash\n", "source .venv/bin/activate\n", "```\n", "\n", "Before we begin, remember to install the requirements:\n", "\n", "```bash\n", "pip install -r requirements.txt\n", "```\n", "\n", "We shall use mainly the `numpy` library for the algorithms, and the `matplotlib` library for plots." ] }, { "cell_type": "markdown", "id": "ce10b261", "metadata": {}, "source": [ "## K-means Clustering\n", "\n", "The [K-means](https://en.wikipedia.org/wiki/K-means_clustering) algorithm is an **unsupervised** learning algorithm used for exploration and **clustering**.\n", "\n", "The intuitive idea is to **partition** a dataset $\\mathcal{D} = \\{ \\mathbf{x}_1, \\mathbf{x}_2, \\ldots, \\mathbf{x}_n \\}$ ($\\mathbf{X}_i \\in \\mathbb{R}^p~ \\forall i = 1, \\dots, n$) into $K$ clusters (or sets) $S_k$, such that $\\bigcup\\limits_{k = 1}^K S_k = \\mathcal{D}$.\n", "Each cluster is defined by **centroids** $\\{ \\mathbf{c}_1, \\mathbf{c}_2, \\ldots, \\mathbf{c}_K \\}$, where $\\mathbf{c}_k \\in \\mathbb{R}^p~ \\forall k = 1, \\dots, K$:\n", "\n", "$$\n", "\\mathbf{c}_i = \\frac{1}{\\left| S_i \\right|} \\sum\\limits_{j~|~\\mathbf{x}_j \\in S_i} \\mathbf{x}_j.\n", "$$\n", "\n", "The algorithm is trained to find the best centroids for each cluster as to minimise the distance between samples in the same cluster:\n", "\n", "$$\n", "\\argmin\\limits_{S_1, \\dots, S_k} \\sum\\limits_{k = 1}^K \\sum\\limits_{j~|~\\mathbf{x}_j \\in S_i} \\left|\\left| \\mathbf{x}_j - \\mathbf{c}_k \\right|\\right|^2.\n", "$$\n", "\n", "This objective function can be solved using different algorithms.\n", "For instance the widely known and used [Lloyd's algorithm](https://en.wikipedia.org/wiki/Lloyd%27s_algorithm):\n", "\n", "0. **pre-step** (only at first iteration): randomly choose the initial centroids $\\mathbf{c}_k$ using data points;\n", "1. **assignment** (loop over $k$): select samples $\\mathbf{x}_i$ closest to centroid $\\mathbf{c}_k$ and label it as belonging to cluster $k$;\n", "2. **update** (loop over $k$): recompute the centroid $\\mathbf{c}_k$ as $\\mathbf{c}_k = \\frac{1}{\\left| S_k \\right|} \\sum\\limits_{j~|~\\mathbf{x}_j \\in S_k} \\mathbf{x}_j$ using the new assignments;\n", "3. repeat 1. and 2. until convergence (e.g. 
until the change in the centroid positions falls below a tolerance).\n" ] }, { "cell_type": "markdown", "id": "983dff1b", "metadata": {}, "source": [ "## Coding K-Means Clustering\n", "\n", "In the following, we build the code for a complete K-Means algorithm, using Python classes.\n", "We first import the necessary libraries, and build an abstract clustering class, capable of fitting data and predicting cluster assignments." ] }, { "cell_type": "code", "execution_count": null, "id": "42950dd0-4383-445b-8444-1a7276ed0b5f", "metadata": {}, "outputs": [], "source": [ "%load_ext autoreload\n", "%autoreload 2\n", "%matplotlib inline" ] }, { "cell_type": "code", "execution_count": 150, "id": "e25ef088", "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import jdc # this is a Jupyter extension to write classes in multiple cells\n", "import matplotlib as mpl\n", "from typing import Tuple\n", "from numpy.typing import NDArray\n", "from matplotlib import pyplot as plt\n", "from matplotlib.colors import ListedColormap\n", "from sklearn.datasets import load_breast_cancer\n", "from sklearn.decomposition import PCA # let us use the \"real\" implementation\n", "from IPython.display import display, Image" ] }, { "cell_type": "markdown", "id": "c8d3975d", "metadata": {}, "source": [ "We then select a style for the plots: you can freely play with this parameter, but I like this one." ] }, { "cell_type": "code", "execution_count": 151, "id": "3a06f695-d611-4105-bdde-a434d470a238", "metadata": {}, "outputs": [], "source": [ "plt.style.use('grayscale')" ] }, { "cell_type": "markdown", "id": "ff506a6a", "metadata": {}, "source": [ "#### The Abstract Implementation\n", "\n", "We first consider the structure of the desired clustering algorithm.\n", "Ideally, it needs:\n", "\n", "1. an **initialisation step** to randomly initialise the centroids,\n", "2. a function to compute the **Euclidean distance** between sets of points,\n", "3. an **assignment** function to label each point as belonging to a cluster,\n", "4. a **centroid update** function to compute the new centroid positions,\n", "5. a **fitting** method to compute the cluster centroids,\n", "6. a **predict** method, capable of taking input points and assigning them to the fitted clusters.\n", "\n", "Here is an abstract implementation to be inherited by the actual implementation of the K-Means class:" ] }, { "cell_type": "code", "execution_count": 152, "id": "a442afd0", "metadata": {}, "outputs": [], "source": [ "class Clustering:\n", " \"\"\"An abstract clustering class.\"\"\"\n", "\n", " def __init__(self, n_clusters: int = 2) -> None:\n", " \"\"\"\n", " Parameters\n", " ----------\n", " n_clusters : int\n", " The number of clusters to be created.\n", "\n", " Raises\n", " ------\n", " ValueError\n", " If the number of clusters is less than 2.\n", " \"\"\"\n", " if n_clusters < 2:\n", " raise ValueError(\n", " 'The number of clusters must be at least 2! 
However, we found %d < 2 clusters.'\n", " % n_clusters)\n", " self.n_clusters = n_clusters\n", " self._fitted = False # save the fitted status of the clustering\n", "\n", " def _initialize_centres(self, X: NDArray) -> NDArray:\n", " \"\"\"\n", " Initialise the positions of the centroids.\n", "\n", " Paramenters\n", " -----------\n", " X : NDArray\n", " The data to be clustered.\n", "\n", " Returns\n", " -------\n", " NDArray\n", " The initial positions of the centroids.\n", " \"\"\"\n", " raise NotImplementedError(\n", " 'All subclasses of Clustering must implement the _initialize_centres method!'\n", " )\n", "\n", " def _euclidean_distance(self, A: NDArray, B: NDArray) -> NDArray:\n", " \"\"\"\n", " Compute the Euclidean distance between two sets of points.\n", "\n", " Parameters\n", " ----------\n", " A : NDArray\n", " The first set of points.\n", " B : NDArray\n", " The second set of points.\n", "\n", " Returns\n", " -------\n", " NDArray\n", " The Euclidean distance between the two sets of points.\n", " \"\"\"\n", " raise NotImplementedError(\n", " 'All subclasses of Clustering must implement the _euclidean_distance method!'\n", " )\n", "\n", " def _cluster_assignment(self, dist: NDArray) -> NDArray:\n", " \"\"\"\n", " Compute the assignment of each point to a cluster.\n", "\n", " Parameters\n", " ----------\n", " dist : NDArray\n", " The distance matrix.\n", "\n", " Returns\n", " -------\n", " NDArray\n", " The cluster assignment of each point.\n", " \"\"\"\n", " raise NotImplementedError(\n", " 'All subclasses of Clustering must implement the _cluster_assignment method!'\n", " )\n", "\n", " def _centre_positions(self, X: NDArray, clust: NDArray) -> NDArray:\n", " \"\"\"\n", " Compute the positions of the new centroids.\n", "\n", " Parameters\n", " ----------\n", " X : NDArray\n", " The data to be clustered.\n", " clust : NDArray\n", " The cluster assignment of each point.\n", "\n", " Returns\n", " -------\n", " NDArray\n", " The positions of the new centroids.\n", " \"\"\"\n", " raise NotImplementedError(\n", " 'All subclasses of Clustering must implement the _centre_positions method!'\n", " )\n", "\n", " def fit(self, X: NDArray) -> 'Clustering':\n", " \"\"\"\n", " Fit the clustering model.\n", "\n", " Parameters\n", " ----------\n", " X : NDArray\n", " The data to be clustered.\n", "\n", " Returns\n", " -------\n", " Clustering\n", " The fitted clustering model.\n", " \"\"\"\n", " raise NotImplementedError(\n", " 'All subclasses of Clustering must implement the fit method!')\n", "\n", " def predict(self, X: NDArray) -> Tuple[NDArray, NDArray]:\n", " \"\"\"\n", " Predict the cluster assignment of each point.\n", "\n", " Parameters\n", " ----------\n", " X : NDArray\n", " The data to be clustered.\n", "\n", " Returns\n", " -------\n", " Tuple[NDArray, NDArray]\n", " The cluster assignment of each point and the positions of the new centroids.\n", " \"\"\"\n", " raise NotImplementedError(\n", " 'All subclasses of Clustering must implement the predict method!')\n", "\n", " def fit_predict(self, X: NDArray) -> Tuple[NDArray, NDArray]:\n", " \"\"\"\n", " Fit and predict the labels of the data.\n", "\n", " This is a wrapper for the `fit` and `predict` methods.\n", "\n", " Parameters\n", " ----------\n", " X : NDArray\n", " The data to be clustered.\n", "\n", " Returns\n", " -------\n", " Tuple[NDArray, NDArray]\n", " The cluster assignment of each point and the positions of the new centroids.\n", " \"\"\"\n", " self.fit(X)\n", " return self.predict(X)" ] }, { "cell_type": "markdown", "id": 
"451761e5", "metadata": {}, "source": [ "#### The Actual Implementation\n", "\n", "We can now proceed to implement the K-Means class." ] }, { "cell_type": "markdown", "id": "1172774f", "metadata": {}, "source": [ "Let us start with the constructor.\n", "Clearly, a clustering method must specify the number of clusters we would like to produce (this is already taken into account in the abstract class).\n", "However, in K-Means, this is done iteratively: we should add a way to stop the iteration with a convergence criterion.\n", "Moreover, the initialisation is random: we should implement something for consistency (e.g. fixing a seed of a random number generator)." ] }, { "cell_type": "code", "execution_count": 153, "id": "688e493f-229a-4a92-beaa-fe93ee5810bb", "metadata": {}, "outputs": [], "source": [ "class KMeans(Clustering):\n", " \"\"\"K-Means Clustering using the Lloyd's algorithm.\"\"\"\n", "\n", " def __init__(self,\n", " n_clusters: int = 2,\n", " max_iter: int = 1000,\n", " tol: float = 0.0001,\n", " seed: int = 42) -> None:\n", " \"\"\"\n", " Parameters\n", " ----------\n", " n_clusters : int\n", " The number of clusters to be created.\n", " max_iter : int\n", " The maximum number of iterations allowed.\n", " tol : float\n", " The stopping criterion (based on tolerance) on the position of centroids between iterations.\n", " seed : float\n", " The random seed.\n", "\n", " Raises\n", " ------\n", " ValueError\n", " If the number of clusters is less than 2.\n", " \"\"\"\n", " # YOUR CODE HERE" ] }, { "cell_type": "markdown", "id": "0d950114", "metadata": {}, "source": [ "We then start by initializing the centroids: we shall choose random data points from the dataset to be the initial centres of the clusters.\n", "\n", "**HINT**: take a look at the `numpy` library for something to \"choose\" from a set of data points." ] }, { "cell_type": "code", "execution_count": 154, "id": "d208246f-97a8-4d2a-9687-b4b65201a415", "metadata": {}, "outputs": [], "source": [ "%%add_to KMeans\n", "def _initialize_centres(self, X: NDArray) -> NDArray:\n", " \"\"\"\n", " Initialise the positions of the centroids.\n", "\n", " Parameters\n", " ----------\n", " X : NDArray\n", " The data to be clustered.\n", "\n", " Returns\n", " -------\n", " NDArray\n", " The initial positions of the centroids.\n", " \"\"\"\n", " # YOUR CODE HERE" ] }, { "cell_type": "markdown", "id": "8d64ba85", "metadata": {}, "source": [ "To proceed, we need a function capable of computing the distance between two sets of points, namely:\n", "\n", "- the data poinsts (usually of shape `(n_samples, n_dimensions)`),\n", "- the centroids (usually of shape `(n_clusters, n_dimensions)`).\n", "\n", "The output of the function should be a distance matrix of shape `(n_samples, n_clusters)`, in order to have the distance of each piece of data with respect to each cluster.\n", "\n", "**HINT**: this is actually not that obvious, and it might require to know something about \"broadcasting\" `numpy` arrays.\n", "However, different implementations may have different ways of doing this: there is no \"right\" way to compute it, just do it (we shall not worry about computing time or optimisation here)!" 
] }, { "cell_type": "code", "execution_count": 155, "id": "945fead4-08fd-4ab2-8500-e7a172bc4dfa", "metadata": {}, "outputs": [], "source": [ "%%add_to KMeans\n", "def _euclidean_distance(self, A: NDArray, B: NDArray) -> NDArray:\n", " \"\"\"\n", " Compute the Euclidean distance between two sets of points.\n", "\n", " Parameters\n", " ----------\n", " A : NDArray\n", " The first set of points.\n", " B : NDArray\n", " The second set of points.\n", "\n", " Returns\n", " -------\n", " NDArray\n", " The Euclidean distance between the two sets of points.\n", " \"\"\"\n", " # YOUR CODE HERE" ] }, { "cell_type": "markdown", "id": "f685371b", "metadata": {}, "source": [ "Cluster assignments should be easy once we have the distance matrix of shape `(n_samples, n_clusters)`.\n", "Just find the cluster nearest to the data point!" ] }, { "cell_type": "code", "execution_count": 156, "id": "273ddb85-2ee7-45d8-9e12-5a1b895a7dfb", "metadata": {}, "outputs": [], "source": [ "%%add_to KMeans\n", "def _cluster_assignment(self, dist: NDArray) -> NDArray:\n", " \"\"\"\n", " Compute the assignment of each point to a cluster.\n", "\n", " Parameters\n", " ----------\n", " dist : NDArray\n", " The distance matrix.\n", "\n", " Returns\n", " -------\n", " NDArray\n", " The cluster assignment of each point.\n", " \"\"\"\n", " # YOUR CODE HERE" ] }, { "cell_type": "markdown", "id": "5beaa5c1", "metadata": {}, "source": [ "Next step is to compute the position of the new centroids: given the data points and their cluster assignments, we need to compute the baricentre of each cluster." ] }, { "cell_type": "code", "execution_count": 157, "id": "8181241b-9b2a-43ff-b01a-00deff18694a", "metadata": {}, "outputs": [], "source": [ "%%add_to KMeans\n", "def _centre_positions(self, X: NDArray, clust: NDArray) -> NDArray:\n", " \"\"\"\n", " Compute the positions of the new centroids.\n", "\n", " Parameters\n", " ----------\n", " X : NDArray\n", " The data to be clustered.\n", " clust : NDArray\n", " The cluster assignment of each point.\n", "\n", " Returns\n", " -------\n", " NDArray\n", " The positions of the centroids.\n", " \"\"\"\n", " # YOUR CODE HERE" ] }, { "cell_type": "markdown", "id": "8d36bc10", "metadata": {}, "source": [ "We can finally put all together in a \"training loop\": given some \"training data\", we need to compute the positions of the centroids until convergence is reached (here modelled as both a maximum number of iterations and a tolerance).\n", "\n", "**HINT**: we should take care of invalid inputs.\n", "Suppose the number of clusters is greater than the number of data points, does this make any sense?\n", "\n", "**BONUS**: it might be interesting to keep track of the history of the centroids during the training loop.\n", "Can you think of a way to do that? Maybe a list of positions might do the job..." 
] }, { "cell_type": "code", "execution_count": 158, "id": "1fda6070-77fa-4893-a2dd-6e3dd2271671", "metadata": {}, "outputs": [], "source": [ "%%add_to KMeans\n", "def fit(self, X: NDArray) -> 'KMeans':\n", " \"\"\"\n", " Fit the clustering model.\n", "\n", " Parameters\n", " ----------\n", " X : NDArray\n", " The data to be clustered.\n", "\n", " Returns\n", " -------\n", " KMeans\n", " The fitted clustering model.\n", " \"\"\"\n", " # Check that the number of clusters is consistent with the data\n", " if self.n_clusters > X.shape[0]:\n", "\n", " # YOUR CODE HERE\n", "\n", " # Initialise the centroids\n", " # YOUR CODE HERE\n", "\n", " # Keep track of the history of centroids\n", " centres_history = [centres]\n", "\n", " # Perform the clustering (implement stopping criterion based on iterations)\n", " # YOUR CODE HERE\n", "\n", " # Save the position of the centres\n", " old_centres = centres\n", "\n", " # Compute distance and assign the clusters\n", " # YOUR CODE HERE\n", "\n", " # Do not forget to update the clusters (and store the history)\n", " # YOUR CODE HERE\n", "\n", " # Apply the stopping criterion based on tolerance\n", " # YOUR CODE HERE\n", "\n", " # Save the results\n", " self.cluster_centres_ = centres\n", " self.cluster_centres_history_ = np.array(centres_history)\n", " self.n_iter = n\n", "\n", " # Update the fitted status\n", " self._fitted = # YOUR CODE HERE\n", " return self" ] }, { "cell_type": "markdown", "id": "c223b453", "metadata": {}, "source": [ "Finally, write the prediction loop: once we have the centroids, this amounts to assign a cluster to each piece of data independently.\n", "\n", "**N.B.**: if the fitting method has not been called, we should probably raise an error..." ] }, { "cell_type": "code", "execution_count": 159, "id": "777775ae-33c8-4d5d-bf0a-832c8970b78a", "metadata": {}, "outputs": [], "source": [ "%%add_to KMeans\n", "def predict(self, X: NDArray) -> Tuple[NDArray, NDArray]:\n", " \"\"\"\n", " Predict the closest cluster each sample in X belongs to.\n", "\n", " Parameters\n", " ----------\n", " X : NDArray\n", " The data to be clustered.\n", "\n", " Returns\n", " -------\n", " Tuple[NDArray, NDArray]\n", " The predicted cluster indices for each sample in X and the cluster centres.\n", " \"\"\"\n", " # Check that the model has been fitted\n", " if not self._fitted:\n", " raise RuntimeError('KMeans must be fitted before calling the transform method!')\n", "\n", " # Compute the cluster assignment\n", " X = X.copy()\n", " # YOUR CODE HERE\n", "\n", " # Return cluster assignments (labels), and the position of the centres\n", " return labels, self.cluster_centres_" ] }, { "cell_type": "markdown", "id": "1b7d5d3e", "metadata": {}, "source": [ "As usual, before moving on, let us test the implementation of the functions:" ] }, { "cell_type": "code", "execution_count": null, "id": "74eb1834", "metadata": {}, "outputs": [], "source": [ "# THIS IS A TEST CELL. DO NOT DELETE IT.\n", "\n", "# Generate some test data\n", "A = np.array([[2, 3], [4, 5], [5, 6]])\n", "B = np.array([[0, 1], [5, 0]])\n", "X = np.vstack([A, B])\n", "\n", "# Create the model\n", "kmeans = KMeans(n_clusters=2)\n", "\n", "# Check the centre initialisation method\n", "centres = kmeans._initialize_centres(X)\n", "if centres.shape != (2, 2):\n", " display(Image('img/allegri_giacca.gif', width=500))\n", " raise ValueError(\n", " 'The shape of the centroid list is incorrect! 
It should be (2, 2), but we received %s.'\n", " % centres.shape)\n", "for c in centres:\n", " # check that c coincides with one of the rows of A or B\n", " if not ((A == c).all(axis=1).any() or (B == c).all(axis=1).any()):\n", " display(Image('img/allegri_giacca.gif', width=500))\n", " raise ValueError(\n", " 'The centroid list is incorrect! The centroid %s is neither in the subset A nor in the subset B.'\n", " % c)\n", "\n", "# Check the euclidean distance method\n", "dist_gt = np.linalg.norm(X.reshape(-1, 1, 2) - centres.reshape(1, -1, 2),\n", " axis=-1)\n", "dist = kmeans._euclidean_distance(X, centres)\n", "if dist.shape != dist_gt.shape:\n", " display(Image('img/allegri_dipoco.gif', width=500))\n", " raise ValueError(\n", " 'The shape of the distance matrix is incorrect! It should be %s, but we received %s.'\n", " % (dist_gt.shape, dist.shape))\n", "if not np.allclose(dist, dist_gt):\n", " display(Image('img/allegri_dipoco.gif', width=500))\n", " raise ValueError(\n", " 'There is a problem with the euclidean distance method! The computed distances do not match.'\n", " )\n", "\n", "# Check the cluster assignment method\n", "clust_gt = np.argmin(dist_gt, axis=-1)\n", "clust = kmeans._cluster_assignment(dist)\n", "if (clust_gt != clust).any():\n", " display(Image('img/allegri_dipoco.gif', width=500))\n", " raise ValueError(\n", " 'There is a problem with the cluster assignment method! The computed cluster assignments do not match.'\n", " )\n", "\n", "# Compute the new centre positions\n", "centres_gt = [X[clust_gt == i].mean(axis=0) for i in range(2)]\n", "centres = kmeans._centre_positions(X, clust)\n", "if not np.allclose(centres, centres_gt):\n", " display(Image('img/allegri_dipoco.gif', width=500))\n", " raise ValueError(\n", " 'There is a problem with the new centre positions! The computed centre positions do not match.'\n", " )\n", "\n", "# Check the fit method\n", "kmeans = kmeans.fit(X)\n", "\n", "if (not hasattr(kmeans, 'cluster_centres_')) and (not hasattr(\n", " kmeans, 'cluster_centres')):\n", " display(Image('img/allegri_dipoco.gif', width=500))\n", " raise ValueError('The fit method does not return the cluster centres!')\n", "if not kmeans._fitted:\n", " display(Image('img/allegri_giacca.gif', width=500))\n", " raise RuntimeError(\n", " 'The K-Means class does not update the fitted state correctly!')\n", "\n", "# Everything passed\n", "print('All tests passed!')\n", "display(Image('img/allegri_calma.gif', width=500))" ] }, { "cell_type": "markdown", "id": "e6a64a11", "metadata": {}, "source": [ "## Some Simple Tests\n", "\n", "We proceed to test the unsupervised algorithm on some synthetic data.\n", "We generate data from three different bivariate normal distributions, with both easy and difficult cluster separations:\n", "\n", "$$\n", "\mathcal{N}(\mathbf{x} ~\mid~ \mathbf{\mu}, \mathbf{\Sigma})\n", "=\n", "\frac{1}{2 \pi \sqrt{\det{\mathbf{\Sigma}}}}\n", "\exp\left(-\frac{1}{2} \left(\mathbf{x} - \mathbf{\mu}\right)^T \mathbf{\Sigma}^{-1} \left(\mathbf{x} - \mathbf{\mu} \right)\right),\n", "$$\n", "\n", "where $\mathbf{\Sigma}$ is the population covariance matrix, and $\mathbf{\mu}$ is the population mean (the normalisation factor is written for the bivariate case, $p = 2$)."
] }, { "cell_type": "code", "execution_count": 161, "id": "ea6cb542-1813-4c75-aa8e-f86861c9a6cd", "metadata": {}, "outputs": [], "source": [ "gen = np.random.default_rng()\n", "A = gen.multivariate_normal(mean=[0, 0], cov=[[6, 3], [3, 4]], size=1000)\n", "A_labels = np.array([0] * 1000)\n", "B = gen.multivariate_normal(mean=[8, 9], cov=[[1, 0], [0, 1]], size=1000)\n", "B_labels = np.array([1] * 1000)\n", "C = gen.multivariate_normal(mean=[-5, 10], cov=[[3, 2], [2, 2]], size=1000)\n", "C_labels = np.array([2] * 1000)\n", "X = np.vstack([A, B, C])\n", "y = np.hstack([A_labels, B_labels, C_labels])" ] }, { "cell_type": "markdown", "id": "68683bd1", "metadata": {}, "source": [ "We then fit the `KMeans` clustering algorithm and predict the cluster assignments.\n", "As this is simply a test, we shall use 3 clusters.\n", "However, you are invited to experiment with different values for the number of clusters." ] }, { "cell_type": "code", "execution_count": 162, "id": "6c1e4234-add1-43cc-b578-fe926402d1f1", "metadata": {}, "outputs": [], "source": [ "kmeans = # YOUR CODE HERE\n", "y_pred, centres = # YOUR CODE HERE" ] }, { "cell_type": "markdown", "id": "ec741660", "metadata": {}, "source": [ "We then plot the data and the predicted clusters in two separate axes for visualisation purposes.\n", "Does the cluster separation look good?\n", "\n", "**HINT**: this is a scatter plot.\n", "You can plot separately different clusters (e.g. using a loop over the predicted labels) to better control the colours of the markers." ] }, { "cell_type": "code", "execution_count": null, "id": "b12fe65c-688b-42cb-9650-930360035b4f", "metadata": {}, "outputs": [], "source": [ "# Create a figure\n", "fig, ax = plt.subplots(ncols=2, figsize=(12, 5), layout='constrained')\n", "\n", "# Plot the original data\n", "ax[0].scatter( # YOUR CODE HERE,\n", " color='r', alpha=0.5, label='data (1)')\n", "ax[0].scatter( # YOUR CODE HERE,\n", " color='g', alpha=0.5, label='data (2)')\n", "ax[0].scatter( # YOUR CODE HERE,\n", " color='b', alpha=0.5, label='data (3)')\n", "ax[0].legend()\n", "ax[0].set_xlabel('x')\n", "ax[0].set_ylabel('y')\n", "ax[0].set_title('Original Data')\n", "\n", "# Plot the cluster assignments (loop over the labels)\n", "\n", "# YOUR CODE HERE\n", "\n", "# Plot the cluster centres for the sake of completeness\n", "ax[1].scatter(centres[..., 0], centres[..., 1], marker='x', label='centroids')\n", "ax[1].legend()\n", "ax[1].set_xlabel('x')\n", "ax[1].set_ylabel('y')\n", "ax[1].set_title('Cluster Assignments')\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "be21f2d4", "metadata": {}, "source": [ "### A Simple Visualisation\n", "\n", "Just to be sure, it has to be noted that clustering algorithms, as well as classification algorithms, do not simply assign a piece of data to a given class.\n", "The real effect of training such functions is to create a **partition** of the feature space: each data point falling into one of the subsets of the space is assigned to a different class.\n", "In other words, clustering algorithms and classification algorithms create **piece-wise constant** functions in feature space." 
] }, { "cell_type": "code", "execution_count": null, "id": "a4fc56a9", "metadata": {}, "outputs": [], "source": [ "# Create a grid of points\n", "xx = np.linspace(1.1 * X[..., 0].min(), 1.1 * X[..., 0].max(), 1000)\n", "yy = np.linspace(1.1 * X[..., 1].min(), 1.1 * X[..., 1].max(), 1000)\n", "\n", "# Create a meshgrid\n", "xx, yy = np.meshgrid(xx, yy)\n", "\n", "# Flatten the meshgrid\n", "X_mesh = np.stack([xx.ravel(), yy.ravel()], axis=-1)\n", "\n", "# Predict the cluster assignments\n", "y_pred_mesh, _ = kmeans.predict(X_mesh)\n", "\n", "# Build the contour plot\n", "fig, ax = plt.subplots(figsize=(6, 5), layout='constrained')\n", "cmap = ['r', 'g', 'b']\n", "cmap_mpl = ListedColormap(cmap)\n", "ax.contourf(xx, yy, y_pred_mesh.reshape(xx.shape), cmap=cmap_mpl, alpha=0.3)\n", "for i in np.unique(y_pred):\n", " ax.scatter(X[y_pred == i, 0],\n", " X[y_pred == i, 1],\n", " c=cmap[i],\n", " alpha=0.75,\n", " label=f'cluster ({i+1:d})')\n", "ax.scatter(centres[..., 0], centres[..., 1], marker='x', label='centroids')\n", "ax.legend()\n", "ax.set_xlabel('x')\n", "ax.set_ylabel('y')\n", "ax.set_title('Decision Boundaries')\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "d653baea", "metadata": {}, "source": [ "### Evaluation\n", "\n", "It starts to become interesting!\n", "We would like to evaluate the quality of the clustering, as we can see that not all data points have been correctly labelled.\n", "We will use an **information theoretical** approach to evaluate the [mutual information](https://en.wikipedia.org/wiki/Mutual_information) of the clusters.\n", "In other words, we will measure the \"amount of information\" (measured in _bits_) needed obtained by using the cluster assignments with respect to the ground truth:\n", "\n", "$$\n", "\\mathrm{MI}(y, \\widehat{y})\n", "=\n", "\\sum\\limits_{k, h = 1}^K\n", "\\mathrm{P}(Y = y_k \\wedge \\widehat{Y} = \\widehat{y}_h)\n", "\\log_2\n", "\\left(\n", " \\frac{\\mathrm{P}\\left(Y = y_k \\wedge \\widehat{Y} = \\widehat{y}_h \\right)}{\\mathrm{P}(Y = y_k) \\mathrm{P}(\\widehat{Y} = \\widehat{y}_h)}\n", "\\right)\n", "=\n", "\\mathrm{H}(Y) - H(Y | \\widehat{Y}),\n", "$$\n", "\n", "where $H(X)$ is the **entropy** of the variable $X$ and $H(X | Y)$ is the **conditional entropy** of $X$ given $Y$:\n", "\n", "$$\n", "H(Z) = - \\sum\\limits_{i = 1}^N \\mathrm{P}(Z = z_i) \\log_2 \\mathrm{P}(Z = z_i),\n", "$$\n", "\n", "$$\n", "H(Z | Y) = \\mathrm{H}(Z \\wedge Y) - \\mathrm{H}(Y).\n", "$$\n", "\n", "In our implementation, we will actually use a **normalized** version of the MI, as we shall use the **Normalised Mutual Information** (NMI)\n", "\n", "$$\n", "\\mathrm{NMI}(X, Y) = \\frac{MI(X, Y)}{\\frac{H(X) + H(Y)}{2}}\n", "$$\n", "\n", "which is more sensitive to extreme cases, such as unbalanced cluster assignemnts.\n", "Moreover, it is more directly comparable, as it is a normalised (\"adimensional\") quantity." ] }, { "cell_type": "markdown", "id": "a0cd0bdb", "metadata": {}, "source": [ "Before proceeding, let me suggest you compute the following functions:\n", "\n", "**N.B.**: use the $\\log_2$ function in your implementation." 
] }, { "cell_type": "code", "execution_count": 164, "id": "cd785bbd-58b0-40a8-b9cb-8f107b3d2f6f", "metadata": {}, "outputs": [], "source": [ "def entropy(X: NDArray) -> float:\n", " \"\"\"\n", " Entropy of the random variable X: H(X)\n", "\n", " Parameters\n", " ----------\n", " X : array_like\n", " The random variable\n", "\n", " Returns\n", " -------\n", " H : float\n", " The entropy\n", " \"\"\"\n", " # YOUR CODE HERE\n", "\n", "\n", "def entropy_joint(X: NDArray, Y: NDArray) -> float:\n", " \"\"\"\n", " Joint entropy of the random variables X and Y: H(X, Y)\n", "\n", " Parameters\n", " ----------\n", " X : array_like\n", " The first random variable\n", " Y : array_like\n", " The second random variable\n", "\n", " Returns\n", " -------\n", " H : float\n", " The joint entropy\n", " \"\"\"\n", " # YOUR CODE HERE\n", "\n", "\n", "def entropy_conditional(X: NDArray, Y: NDArray) -> float:\n", " \"\"\"\n", " Conditional entropy of the random variable Y given X: H(Y | X)\n", "\n", " Parameters\n", " ----------\n", " X : array_like\n", " The first random variable\n", " Y : array_like\n", " The second random variable\n", "\n", " Returns\n", " -------\n", " H : float\n", " The conditional entropy\n", " \"\"\"\n", " # YOUR CODE HERE\n", "\n", "\n", "def mutual_information(X: NDArray, Y: NDArray) -> float:\n", " \"\"\"\n", " Mutual information of the random variables X and Y: MI(X, Y)\n", "\n", " Parameters\n", " ----------\n", " X : array_like\n", " The first random variable\n", " Y : array_like\n", " The second random variable\n", "\n", " Returns\n", " -------\n", " MI : float\n", " The mutual information\n", " \"\"\"\n", " # YOUR CODE HERE\n", "\n", "\n", "def normalised_mutual_information(X: NDArray, Y: NDArray) -> float:\n", " \"\"\"\n", " Normalised mutual information of the random variables X and Y: NMI(X, Y)\n", "\n", " Parameters\n", " ----------\n", " X : array_like\n", " The first random variable\n", " Y : array_like\n", " The second random variable\n", "\n", " Returns\n", " -------\n", " NMI : float\n", " The normalised mutual information\n", " \"\"\"\n", " # YOUR CODE HERE" ] }, { "cell_type": "code", "execution_count": null, "id": "08c1eeba", "metadata": {}, "outputs": [], "source": [ "# THIS IS A TEST CELL. DO NOT MODIFY IT!\n", "\n", "# Generate some data\n", "P = np.array([0, 1, 2, 3, 2, 3, 1, 1, 1, 0, 2, 3, 0, 2])\n", "\n", "# Compute the entropy\n", "_, counts = np.unique(P, return_counts=True)\n", "prob = counts / sum(counts)\n", "H_gt = -float(sum(prob * np.log2(prob)))\n", "H = entropy(P)\n", "if not np.isclose(H, H_gt):\n", " display(Image('img/allegri_giacca.gif', width=500))\n", " raise ValueError('Entropy mismatch: %r != %r' % (H, H_gt))\n", "\n", "# All tests passed\n", "print('All tests passed!')\n", "display(Image('img/allegri_calma.gif', width=500))" ] }, { "cell_type": "markdown", "id": "43e15cc2", "metadata": {}, "source": [ "We finally move to the evaluation of the clustering results by modifying the number of clusters and computing the normalised mutual information.\n", "\n", "**N.B.**: this is literally just a test case.\n", "We do not worry about finding a good training/test split as this is just for illustration purposes." 
] }, { "cell_type": "code", "execution_count": 166, "id": "b7af5512-05ba-48a0-bf13-9e2c42447e51", "metadata": {}, "outputs": [], "source": [ "nmi = []\n", "\n", "# Loop over different numbers of clusters\n", "for k in range(2, 10):\n", "\n", " # Fit the k-means model\n", "\n", " # YOUR CODE HERE\n", "\n", " # Evaluate the k-means model\n", "\n", " # YOUR CODE HERE\n", "\n", "# Convert to array\n", "nmi = np.array(nmi)" ] }, { "cell_type": "markdown", "id": "a438db13", "metadata": {}, "source": [ "We plot the values of the normalised mutual information against the number of clusters.\n", "What can we reasonably expect to find?" ] }, { "cell_type": "code", "execution_count": null, "id": "5857f5bd-b2bf-4923-855f-7098e42ad51d", "metadata": {}, "outputs": [], "source": [ "# Create the figure\n", "fig, ax = plt.subplots(figsize=(6, 5), layout='constrained')\n", "\n", "# Plot the results\n", "\n", "# YOUR CODE HERE\n", "\n", "ax.set_xlabel('no. of clusters')\n", "ax.set_ylabel('normalised mutual information')\n", "ax.set_title('Clustering Evaluation')\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "e99fc0c0", "metadata": {}, "source": [ "Does the result look reasonable? Does it shock you?" ] }, { "cell_type": "markdown", "id": "0ec0aa1d", "metadata": {}, "source": [ "## K-Means Clustering and Biomedical Data\n", "\n", "We proceed to a different case study with real data.\n", "In particular, we use the [Wisconsin Breast Cancer dataset](https://archive.ics.uci.edu/dataset/17/breast+cancer+wisconsin+diagnostic) to try and use clustering to binary classify **malignant** ($Y = 1$) and **benign** ($Y = 0$) instances." ] }, { "cell_type": "code", "execution_count": 168, "id": "96bd4eca-5c4c-4b55-9830-40beb0ce30ec", "metadata": {}, "outputs": [], "source": [ "X, y = load_breast_cancer(return_X_y=True)\n", "gen = np.random.default_rng(42)\n", "\n", "# Divide the dataset into training and test sets\n", "\n", "# YOUR CODE HERE" ] }, { "cell_type": "markdown", "id": "c10b24ea", "metadata": {}, "source": [ "We perform a PCA to reduce the dimensionality of the data and visualise them.\n", "We shall use the `scikit-learn` implementation of PCA: it is slightly different from what we coded, but not that much...\n", "\n", "**N.B.**: main differences between our implementation and `scikit-learn`'s:\n", "\n", "- the `transform` method outputs a single value: the principal components,\n", "- the `loadings_` attribute is called `components_`." 
] }, { "cell_type": "code", "execution_count": 169, "id": "808c6546-4bbc-4733-9c66-82ae1278bce2", "metadata": {}, "outputs": [], "source": [ "# Perform the PCA\n", "pca = PCA(n_components=2)\n", "X_train_vis = # YOUR CODE HERE" ] }, { "cell_type": "markdown", "id": "d146a4ee", "metadata": {}, "source": [ "We then plot the data using the principal components:" ] }, { "cell_type": "code", "execution_count": null, "id": "655b6913-7515-4e03-a13b-30fd795110c7", "metadata": {}, "outputs": [], "source": [ "fig, ax = plt.subplots(figsize=(6, 5), layout='constrained')\n", "ax.scatter(# YOUR CODE HERE)\n", "ax.set_xlabel(\n", " f'1st principal component ({pca.explained_variance_ratio_[0]:.1%})')\n", "ax.set_ylabel(\n", " f'2nd principal component ({pca.explained_variance_ratio_[1]:.1%})')\n", "ax.set_title('Breast Cancer Wisconsin Dataset')\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "387ee3c8", "metadata": {}, "source": [ "Let us try to evaluate the quality of clustering by performing the computation for several values of number of clusters: " ] }, { "cell_type": "code", "execution_count": null, "id": "9e2b9f96-2f3f-4c1d-b1fa-4cac69a10b3b", "metadata": {}, "outputs": [], "source": [ "metric = []\n", "\n", "# Loop over the number of clusters\n", "for k in # YOUR CODE HERE:\n", "\n", " # Fit the k-means model\n", " # YOUR CODE HERE\n", "\n", " # Compute some metric\n", " # YOUR CODE HERE\n", "\n", "metric = np.array(metric)\n", "\n", "# Plot the metric\n", "fig, ax = plt.subplots(figsize=(6, 5), layout='constrained')\n", "ax.plot(# YOUR CODE HERE,\n", " metric, 'k-o')\n", "ax.set_xlabel('no. of clusters')\n", "ax.set_ylabel('normalised mutual information')\n", "ax.set_title('Clustering Evaluation')\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "ca5e6733", "metadata": {}, "source": [ "It seems that a low number of clusters is better than a high number of clusters.\n", "Let us try to use this information to predict the labels of the data:" ] }, { "cell_type": "code", "execution_count": 172, "id": "1bbd59b0-3c2e-4896-aff2-cd46d4fa3423", "metadata": {}, "outputs": [], "source": [ "kmeans = # YOUR CODE HERE\n", "y_train_pred, train_centres = # YOUR CODE HERE" ] }, { "cell_type": "markdown", "id": "627853b1", "metadata": {}, "source": [ "For visualisation purposes, let us apply the same PCA transformation to the centroids:" ] }, { "cell_type": "code", "execution_count": 173, "id": "0cce3f81-9f1e-4750-b066-44e59abe5c17", "metadata": {}, "outputs": [], "source": [ "train_centres_vis = # YOUR CODE HERE" ] }, { "cell_type": "markdown", "id": "c88fa346", "metadata": {}, "source": [ "Finally, let us visualise data (with their labels), and the cluster assignments.\n", "What can we say about the quality of the clustering?" 
] }, { "cell_type": "code", "execution_count": null, "id": "8406301d-0938-4130-9a7b-92cb55d6da6a", "metadata": {}, "outputs": [], "source": [ "# Create a figure\n", "fig, ax = plt.subplots(ncols=2, figsize=(12, 5), layout='constrained')\n", "fig.suptitle('Breast Cancer Wisconsin Dataset (training set)')\n", "\n", "# Plot the original data using the 2D PCA\n", "\n", "# YOUR CODE HERE\n", "\n", "ax[0].set_xlabel(\n", " f'1st principal component ({pca.explained_variance_ratio_[0]:.1%})')\n", "ax[0].set_ylabel(\n", " f'2nd principal component ({pca.explained_variance_ratio_[1]:.1%})')\n", "ax[0].set_title('Original Data (PCA 2D)')\n", "\n", "# Plot the cluster assignments using the 2D PCA\n", "\n", "# YOUR CODE HERE\n", "\n", "ax[1].scatter(train_centres_vis[..., 0], train_centres_vis[..., 1], marker='x')\n", "ax[1].set_xlabel(\n", " f'1st principal component ({pca.explained_variance_ratio_[0]:.1%})')\n", "ax[1].set_ylabel(\n", " f'2nd principal component ({pca.explained_variance_ratio_[1]:.1%})')\n", "ax[1].set_title(f'Clusters (PCA 2D) | NMI = {nmi[0]:.2f}')\n", "\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "429437a0", "metadata": {}, "source": [ "Let us compute the **NMI** on the test set, to see if the result is consistent:" ] }, { "cell_type": "code", "execution_count": null, "id": "ff6f68ef-6043-4b71-b275-68aafe53d793", "metadata": {}, "outputs": [], "source": [ "y_test_pred, test_centres = # YOUR CODE HERE\n", "nmi = # YOUR CODE HERE\n", "print(f'NMI = {nmi:.2%}')" ] }, { "cell_type": "markdown", "id": "d2f3120d", "metadata": {}, "source": [ "Finally, plot the test data and the cluster assignments:" ] }, { "cell_type": "code", "execution_count": null, "id": "dc131dd0-620b-41a3-a70f-41f51d6e479c", "metadata": {}, "outputs": [], "source": [ "# Compute the PCA\n", "X_test_vis = # YOUR CODE HERE\n", "test_centres_vis = # YOUR CODE HERE\n", "\n", "# Build a figure\n", "fig, ax = plt.subplots(ncols=2, figsize=(12, 5), layout='constrained')\n", "fig.suptitle('Breast Cancer Wisconsin Dataset (test set)')\n", "\n", "# Plot the test data using the 2D PCA\n", "\n", "# YOUR CODE HERE\n", "\n", "ax[0].set_xlabel(\n", " f'1st principal component ({pca.explained_variance_ratio_[0]:.1%})')\n", "ax[0].set_ylabel(\n", " f'2nd principal component ({pca.explained_variance_ratio_[1]:.1%})')\n", "ax[0].set_title('Original Data (PCA 2D)')\n", "\n", "# Plot the cluster assignments using the 2D PCA\n", "\n", "# YOUR CODE HERE\n", "\n", "ax[1].scatter(test_centres_vis[..., 0], test_centres_vis[..., 1], marker='x')\n", "ax[1].set_xlabel(\n", " f'1st principal component ({pca.explained_variance_ratio_[0]:.1%})')\n", "ax[1].set_ylabel(\n", " f'2nd principal component ({pca.explained_variance_ratio_[1]:.1%})')\n", "ax[1].set_title(f'Clusters (PCA 2D) | NMI = {nmi:.2f}')\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "f59001db", "metadata": {}, "source": [ "### An Attempt at Classification\n", "\n", "Using the K-Means clustering, we try to classify the data using a [Logistic Regression](https://en.wikipedia.org/wiki/Logistic_regression) classifier.\n", "In particular, we shall use the cluster assignments as **feature engineering** procedure to help the classifier." 
] }, { "cell_type": "code", "execution_count": 177, "id": "d3ab93ea", "metadata": {}, "outputs": [], "source": [ "from sklearn.linear_model import LogisticRegression\n", "from sklearn.preprocessing import StandardScaler\n", "from sklearn.metrics import (accuracy_score, precision_recall_fscore_support,\n", " ConfusionMatrixDisplay, RocCurveDisplay)" ] }, { "cell_type": "markdown", "id": "8b132c38", "metadata": {}, "source": [ "Let us first compute a baseline score, by performing a classification over the untouched dataset.\n", "As we are using `scikit-learn` for the classifier, we can also acces its `predict_proba` method to compute the probability estimates for the classes, instead of simply the thresholded result.\n", "\n", "**QUESTION**: as this is a binary classification problem, do we need to know the probability estimates for both negative and positive classes?" ] }, { "cell_type": "code", "execution_count": null, "id": "7bacfeac", "metadata": {}, "outputs": [], "source": [ "# Rescale the input data\n", "prep = StandardScaler()\n", "\n", "# YOUR CODE HERE\n", "\n", "# Perform the regression and predict the labels\n", "clf = LogisticRegression()\n", "\n", "# YOUR CODE HERE\n", "\n", "y_prob = clf.predict_proba(X_test)[:, 1] # positive class is enough\n", "\n", "# Print the precision, recall and F1 score\n", "acc = accuracy_score(y_test, y_pred)\n", "precision, recall, f1, _ = precision_recall_fscore_support(y_test,\n", " y_pred,\n", " average='binary')\n", "\n", "print(f'Accuracy = {acc:.2%}')\n", "print(f'Precision = {precision:.2%}')\n", "print(f'Recall = {recall:.2%}')\n", "print(f'F1 = {f1:.2%}')" ] }, { "cell_type": "markdown", "id": "07eb524c", "metadata": {}, "source": [ "Let us then visualise the confusion matrix and the [Receiver Operating Characteristic](https://en.wikipedia.org/wiki/Receiver_operating_characteristic) (ROC):" ] }, { "cell_type": "code", "execution_count": null, "id": "d34c9381", "metadata": {}, "outputs": [], "source": [ "ConfusionMatrixDisplay.from_predictions(y_test,\n", " y_pred,\n", " normalize='all',\n", " values_format='.1%',\n", " display_labels=['benign', 'malignant'])" ] }, { "cell_type": "code", "execution_count": null, "id": "64b5d65d", "metadata": {}, "outputs": [], "source": [ "RocCurveDisplay.from_predictions(y_test,\n", " y_prob,\n", " color='r',\n", " name='plain classifier',\n", " plot_chance_level=True)" ] }, { "cell_type": "markdown", "id": "e9d9d06e", "metadata": {}, "source": [ "### Feature Engineering\n", "\n", "In this last part of the tutorial, we perform a basic **feature engineering** of the dataset.\n", "In particular, we would like to add what we know on the structure of the data (i.e. clusters) to the dataset, in order to improve the performance of the classifier.\n", "\n", "\n", "We then compute different clustering labels to concatenate to the original features.\n", "Will they help to improve the performance of the classifier?" 
] }, { "cell_type": "code", "execution_count": null, "id": "f19bf123", "metadata": {}, "outputs": [], "source": [ "# Store the clustering labels in separate values\n", "X_train_clust = []\n", "X_test_clust = []\n", "\n", "# Perform clustering for different values (play with the parameters!)\n", "for n in range(2, 20):\n", "\n", " # Define a clustering model\n", "\n", " # YOUR CODE HERE\n", "\n", "# Transform the values to a 2D array\n", "X_train_clust = np.array(X_train_clust).T\n", "X_test_clust = np.array(X_test_clust).T" ] }, { "cell_type": "markdown", "id": "a029e279", "metadata": {}, "source": [ "Finally, try to refit a classification model with the new features." ] }, { "cell_type": "code", "execution_count": 182, "id": "a3dcf280", "metadata": {}, "outputs": [], "source": [ "# Concatenate the clustering features\n", "X_train = np.hstack([X_train, X_train_clust])\n", "X_test = np.hstack([X_test, X_test_clust])\n", "\n", "# Define a preprocessing scaler\n", "prep = StandardScaler()\n", "\n", "# YOUR CODE HERE\n", "\n", "# Define and train a classifier (compute the predictions and probabilities)\n", "clf = LogisticRegression()\n", "\n", "# YOUR CODE HERE" ] }, { "cell_type": "markdown", "id": "3d16103b", "metadata": {}, "source": [ "We can finally compute the different metrics:" ] }, { "cell_type": "code", "execution_count": null, "id": "30674075", "metadata": {}, "outputs": [], "source": [ "# YOUR CODE HERE\n", "\n", "print(f'Accuracy = {acc:.2%}')\n", "print(f'Precision = {prec:.2%}')\n", "print(f'Recall = {rec:.2%}')\n", "print(f'F1 = {f1:.2%}')" ] }, { "cell_type": "markdown", "id": "1d0c95a1", "metadata": {}, "source": [ "Let us take some time to appreciate the importance of the result:\n", "\n", "- what metric has improved (if any)? What metric has lost something (if any)?\n", "- how do we interpret the result?\n", "- is it a good or a bad model? Is it better than the baseline?\n", "\n", "And finally, think about the newly introduced **hyperparameter**: the maximum number of clusters.\n", "For this tutorial, we are not interested in optimising it.\n", "However, how would we have to modify our **pipeline** in order to include an optimisation step of the hyperparameters?" ] }, { "cell_type": "markdown", "id": "7a72b3fc", "metadata": {}, "source": [ "Finally, let us visualise the confusion matrix and the ROC curve:" ] }, { "cell_type": "code", "execution_count": null, "id": "6137b1df", "metadata": {}, "outputs": [], "source": [ "ConfusionMatrixDisplay.from_predictions(y_test,\n", " y_pred,\n", " normalize='all',\n", " values_format='.1%',\n", " display_labels=['benign', 'malignant'])" ] }, { "cell_type": "code", "execution_count": null, "id": "d7eaafde", "metadata": {}, "outputs": [], "source": [ "RocCurveDisplay.from_predictions(y_test,\n", " y_prob,\n", " color='r',\n", " name='feat. eng.',\n", " plot_chance_level=True)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.6" } }, "nbformat": 4, "nbformat_minor": 5 }