{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "view-in-github",
        "colab_type": "text"
      },
      "source": [
        "<a href=\"https://colab.research.google.com/github/programminghistorian/jekyll/blob/gh-pages/assets/clustering-visualizing-word-embeddings/clustering-visualizing-word-embeddings.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "tTWRVAGzlu8F"
      },
      "source": [
        "# Clustering and Visualising Documents using Word Embeddings\n",
        "\n",
        "This can be run in Google Colab (by clicking the _Open in Colab_ button above) or on your local machine. If you wish to run it locally, then you will need to ensure that you have the libraries listed in the [requirements.txt](requirements.txt) file installed *first*. THe most direct way to do this is `pip -r requirements.txt`, but I personally prefer to use Anaconda Python since it allows me to create 'virtual environments' (multiple 'versions' of Python that don't conflict with each other) as follows:\n",
        "\n",
        "```bash\n",
        "conda env create -n 'ph'\n",
        "conda env activate ph\n",
        "pip -r requirements.txt\n",
        "jupyter lab\n",
        "```\n",
        "\n",
        "This will install the required libraries into a new virtual environment called 'ph' (Programming Historian) by first creating the environment, then activating it, installing the libraries, then launching Jupyter Lab.\n",
        "\n",
        "<div class=\"alert alert-block alert-info\">\n",
        "    <p><b>Note</b> that we use a Parquet file for this work since it allows us to distribute a reasonably large and complex data set as a single highly-compressed file. If you would like to adapt this tutorial for use with a CSV or Excel file you have two choices: 1) simply replace <tt>df = pd.read_parquet(...)</tt> with <tt>df = pd.read_csv(...)</tt>; 2) convert your CSV/Excel file to Parquet first using <tt>pd.read_csv(...).to_parquet(...)</tt>.</p>\n",
        "    <p>The other big advantage of Parquet files is that they can contain lists and dictionaries, whereas as CSV has to 'serialise' these like this: <tt>\"['foo','bar',...,'baz']\"</tt>. To <em>deserialise</em> a literal value like this you need to use the built-in <tt>ast</tt> library: <tt>ast.literal_eval(&lt;string_that_should_be_list&gt;)</tt>. See <a href=\"https://stackoverflow.com/questions/1894269/how-to-convert-string-representation-of-list-to-a-list\">Stack Overflow for examples</a>.</p>\n",
        "</div>"
      ],
      "id": "tTWRVAGzlu8F"
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "g3KWCpPXlu8J"
      },
      "source": [
        "## Required Libraries"
      ],
      "id": "g3KWCpPXlu8J"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "VUEzTZF-lu8K",
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "outputId": "a88922a1-6cfa-4cd7-d0c5-87db24ffca38"
      },
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "The autoreload extension is already loaded. To reload it, use:\n",
            "  %reload_ext autoreload\n"
          ]
        }
      ],
      "source": [
        "%load_ext autoreload\n",
        "%autoreload 1"
      ],
      "id": "VUEzTZF-lu8K"
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "_L4NqlpYlu8L"
      },
      "source": [
        "Generally useful libraries."
      ],
      "id": "_L4NqlpYlu8L"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "09ELuND-lu8M"
      },
      "outputs": [],
      "source": [
        "import pandas as pd\n",
        "import numpy as np\n",
        "import pickle\n",
        "import math\n",
        "import os\n",
        "import re\n",
        "\n",
        "%matplotlib inline\n",
        "import matplotlib.pyplot as plt\n",
        "from matplotlib import cm\n",
        "\n",
        "import seaborn as sns"
      ],
      "id": "09ELuND-lu8M"
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "JCjgunhtlu8M"
      },
      "source": [
        "Needed for the dimensionality reduction stage."
      ],
      "id": "JCjgunhtlu8M"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "C4wjXwhClu8M"
      },
      "outputs": [],
      "source": [
        "try:\n",
        "    import umap\n",
        "except ModuleNotFoundError:\n",
        "    print(\"Module not found, will try to install...\")\n",
        "    !pip install umap-learn\n",
        "    import umap"
      ],
      "id": "C4wjXwhClu8M"
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "EH3nH6SIlu8P"
      },
      "source": [
        "Needed for hierarchical clustering stage."
      ],
      "id": "EH3nH6SIlu8P"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "AWODYRzclu8Q"
      },
      "outputs": [],
      "source": [
        "try:\n",
        "    from kneed import KneeLocator\n",
        "except ModuleNotFoundError:\n",
        "    print(\"Module not found, will try to install...\")\n",
        "    !pip install kneed\n",
        "    from kneed import KneeLocator"
      ],
      "id": "AWODYRzclu8Q"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "KeO12H6flu8S"
      },
      "outputs": [],
      "source": [
        "from scipy.cluster.hierarchy import dendrogram, linkage, fcluster, centroid\n",
        "from tabulate import tabulate"
      ],
      "id": "KeO12H6flu8S"
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "6lgZ4ifSlu8T"
      },
      "source": [
        "Needed for the validation and visualisation stage."
      ],
      "id": "6lgZ4ifSlu8T"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "cZrPUioFlu8U"
      },
      "outputs": [],
      "source": [
        "from sklearn.metrics import confusion_matrix, classification_report, silhouette_score\n",
        "from sklearn.feature_extraction.text import TfidfVectorizer\n",
        "from sklearn.feature_extraction.text import CountVectorizer\n",
        "\n",
        "from wordcloud import WordCloud"
      ],
      "id": "cZrPUioFlu8U"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "TnpN8bollu8V",
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "outputId": "6dbcd724-16e6-4c67-b699-5336c4d42aee"
      },
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Loaded Class-TF/IDF Vectorizer.\n"
          ]
        }
      ],
      "source": [
        "url      = 'https://github.com/MaartenGr/cTFIDF/archive/refs/tags/v0.1.1.tar.gz'\n",
        "version  = re.search(r'/v(.+?)\\.tar\\.gz', url).group(1)\n",
        "dir_name = f'cTFIDF-{version}'\n",
        "dir_path = os.path.join(os.getcwd(),dir_name)\n",
        "\n",
        "if not os.path.exists(dir_name):\n",
        "\n",
        "    print(\"Module not found, will try to download and prepare...\")\n",
        "\n",
        "    import requests, tarfile\n",
        "\n",
        "    r = requests.get(url, allow_redirects=True)\n",
        "    open(f'{dir_name}.tar.gz', 'wb').write(r.content)\n",
        "    print(\"  Downloaded\")\n",
        "\n",
        "    tarf = tarfile.open(f'{dir_name}.tar.gz', 'r')\n",
        "    for f in tarf.getnames():\n",
        "        if not (f.startswith('/') or f.startswith('.')):\n",
        "            tarf.extract(f)\n",
        "    tarf.close()\n",
        "    os.remove(f'{dir_name}.tar.gz')\n",
        "\n",
        "    print(f\"Downloaded and unpacked cTFIDF-{version}.\")\n",
        "\n",
        "import sys\n",
        "if sys.path[-1] != dir_path:\n",
        "    sys.path.append(dir_path)\n",
        "\n",
        "try:\n",
        "    from ctfidf import CTFIDFVectorizer\n",
        "    print(\"Loaded Class-TF/IDF Vectorizer.\")\n",
        "except ModuleNotFoundError:\n",
        "    print(\"Still can't load Class-TF/IDF Vectorizer.\")\n",
        "\n",
        "    print(\"=\"*25)\n",
        "    print(\"You should try restarting the kernel now.\\nFor some reason unpacking and loading\\nimmediately doesn't work.\")\n",
        "    print(\"=\"*25)\n"
      ],
      "id": "TnpN8bollu8V"
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "QnTGWu79lu8Y"
      },
      "source": [
        "<div class=\"alert alert-block alert-warning\">\n",
        "    <b>&#9888; Stop</b>: if you <i>still</i> have errors after running the <b>above code block</b> for the first time then you will probably have to Restart the Kernel at this point. This code is trying to download a new module for which no installer exists and <i>then</i> register it with Python, but the process doesn't seem bullet-proof in my testing. Sorry, but you <i>should</i> only need to restart the Kernel this first time.\n",
        "</div>"
      ],
      "id": "QnTGWu79lu8Y"
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "EMKgUcw1lu8Y"
      },
      "source": [
        "The code below tries to find a narrow sans-serif TTF font by path that is slightly nicer than the default for the WordCloud library. You would need to update this default for your own system. You can list available fonts using (/ht [imsc](https://stackoverflow.com/a/8755818/4041902)):\n",
        "```python\n",
        "import matplotlib.font_manager\n",
        "matplotlib.font_manager.findSystemFonts(fontpaths=None, fontext='ttf')\n",
        "```"
      ],
      "id": "EMKgUcw1lu8Y"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "xgZgvvz0lu8Y",
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "outputId": "597838c0-2d9e-4d47-fa7a-9962a862ac5c"
      },
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Possible fonts: Humor-Sans.ttf, LiberationSans-Bold.ttf, LiberationSans-BoldItalic.ttf, LiberationSans-Italic.ttf, LiberationSans-Regular.ttf, LiberationSansNarrow-Bold.ttf, LiberationSansNarrow-BoldItalic.ttf, LiberationSansNarrow-Italic.ttf, LiberationSansNarrow-Regular.ttf, \n"
          ]
        }
      ],
      "source": [
        "import matplotlib.font_manager\n",
        "fonts = matplotlib.font_manager.findSystemFonts(fontpaths=None, fontext='ttf')\n",
        "print(\"Possible fonts: \", end=\"\")\n",
        "for f in sorted(fonts):\n",
        "    if 'Narrow' in f:\n",
        "        print(f.split(os.path.sep)[-1], end=\", \")\n",
        "    elif 'Sans' in f:\n",
        "        print(f.split(os.path.sep)[-1], end=\", \")\n",
        "print()"
      ],
      "id": "xgZgvvz0lu8Y"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "U-qow9YIlu8Z",
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "outputId": "f90948cd-42d5-47d3-8927-0aa82271e5d4"
      },
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Using font: LiberationSansNarrow-Regular\n",
            "  Guessing at font name: Liberation Sans Narrow\n"
          ]
        }
      ],
      "source": [
        "fp = fonts[0] # Ensure at least _something_ is set here\n",
        "for f in fonts:\n",
        "    if 'LiberationSansNarrow-Regular' in f:\n",
        "        fp = f.split(os.path.sep)[-1].split('.')[0]\n",
        "        break\n",
        "    elif 'Arial Narrow.ttf' in f:\n",
        "        fp = f.split(os.path.sep)[-1].split('.')[0]\n",
        "        break\n",
        "    elif 'Narrow' in f:\n",
        "        fp = f.split(os.path.sep)[-1].split('.')[0]\n",
        "print(f\"Using font: {fp}\")\n",
        "\n",
        "fname = ''.join([f' {x}' if x==x.upper() else x for x in fp.split('-')[0]]).strip().replace('  ','')\n",
        "print(f\"  Guessing at font name: {fname}\")\n",
        "\n",
        "# These are font dictionaries for the 's'uper-title, 't'itle,\n",
        "# 'a'xis, and 'l'abels.\n",
        "sfont = {'fontname':fname, 'fontsize':16}\n",
        "tfont = {'fontname':fname, 'fontsize':12}\n",
        "afont = {'fontname':fname, 'fontsize':10}\n",
        "lfont = {'fontname':fname, 'fontsize':8}"
      ],
      "id": "U-qow9YIlu8Z"
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "j-Ru1Fj9lu8Z"
      },
      "source": [
        "## Configuration"
      ],
      "id": "j-Ru1Fj9lu8Z"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "i0wHzC7Flu8a"
      },
      "outputs": [],
      "source": [
        "# Random seed\n",
        "rs = 43\n",
        "\n",
        "# Which embeddings to use\n",
        "src_embeddings = 'doc_vec'"
      ],
      "id": "i0wHzC7Flu8a"
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "WZ9sMyoSlu8a"
      },
      "source": [
        "## Load the Data\n",
        "\n",
        "In this tutorial I make use of the modern Parquet format: it's highly-compressed and columnar, so it works very well (and quickly) with large data sets. The file can also be read directly in DuckDB if you use it, but the general idea is to minimise the volume of data transfered. The columnar orientation means that you can quickly load only the columns that you need for an analysis, and don't have to read in the entire data set each time (as you would with, say, CSV or most other common data formats)."
      ],
      "id": "WZ9sMyoSlu8a"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "BdKlC83wlu8a",
        "outputId": "e7af00dc-1d54-4519-a9a5-c786e47bc649",
        "colab": {
          "base_uri": "https://localhost:8080/"
        }
      },
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Extracting clustering-visualizing-word-embeddings/\n",
            "Extracting clustering-visualizing-word-embeddings/ph-tutorial-data-cleaned.csv.gz\n",
            "Extracting clustering-visualizing-word-embeddings/ph-tutorial-data-cleaned.parquet\n"
          ]
        }
      ],
      "source": [
        "# Adapted from https://stackoverflow.com/a/72503304\n",
        "import os\n",
        "import pandas as pd\n",
        "\n",
        "dn = 'data'\n",
        "fn = 'ph-tutorial-data-cleaned.parquet'\n",
        "\n",
        "if not os.path.exists(os.path.join(dn,fn)):\n",
        "    print(f\"Couldn't find {os.path.join('data',fn)}, downloading...\")\n",
        "    from io import BytesIO\n",
        "    from urllib.request import urlopen\n",
        "    from zipfile import ZipFile\n",
        "\n",
        "    # Where is the Zipfile stored on Zenodo?\n",
        "    zipfile = 'clustering-visualizing-word-embeddings.zip'\n",
        "    zipurl  = f'https://zenodo.org/records/7948908/files/{zipfile}?download=1'\n",
        "\n",
        "    # Open the remote Zipfile and read it directly into Python\n",
        "    with urlopen(zipurl) as zipresp:\n",
        "        with ZipFile(BytesIO(zipresp.read())) as zf:\n",
        "            for zfile in zf.namelist():\n",
        "                if not zfile.startswith('__'): # Don't unpack hidden MacOSX junk\n",
        "                    print(f\"Extracting {zfile}\") # Update the user\n",
        "                    zf.extract(zfile,'.')\n",
        "    print(\"  Downloaded.\")\n",
        "    # And rename the unzipped directory to 'data' --\n",
        "    # IMPORTANT: Note that if 'data' already exists it will (probably) be silently overwritten.\n",
        "    os.rename('clustering-visualizing-word-embeddings',dn)\n",
        "\n",
        "print(f\"Loading {fn}\")\n",
        "df = pd.read_parquet(os.path.join(dn,fn))"
      ],
      "id": "BdKlC83wlu8a"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "DPYGe_mglu8b"
      },
      "outputs": [],
      "source": [
        "print(f\"Loading columns: {', '.join(df.columns.tolist())}\")"
      ],
      "id": "DPYGe_mglu8b"
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "UzItwy-Vlu8b"
      },
      "source": [
        "## Dimensionality Reduction"
      ],
      "id": "UzItwy-Vlu8b"
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "8sm-qGKFlu8b"
      },
      "source": [
        "While I'm confident about the output from UMAP in a _general_ sense, I'm much less certain about the default _distance_ measure (see [sklearn docs](https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.pdist.html)). There is a distinct *possibility* that some configurations of this tutorial would obtain better results using the `cosine` metric. [This article](https://towardsdatascience.com/9-distance-measures-in-data-science-918109d069fa) appears to offer some help, but points on to a [longer discussion](https://stats.stackexchange.com/questions/99171/why-is-euclidean-distance-not-a-good-metric-in-high-dimensions) where `manhattan` is argued to be a good representation. I've rationalised my choice in the tutorial, but you should feel free to experiment!"
      ],
      "id": "8sm-qGKFlu8b"
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "EchWHCn5lu8b"
      },
      "source": [
        "UMAP offers a very wide range of distance metrics:\n",
        "\n",
        "- Minkowski style metrics\n",
        "  - euclidean\n",
        "  - manhattan\n",
        "  - chebyshev\n",
        "  - minkowski\n",
        "- Miscellaneous spatial metrics\n",
        "  - canberra\n",
        "  - braycurtis\n",
        "  - haversine\n",
        "- Normalized spatial metrics\n",
        "  - mahalanobis\n",
        "  - wminkowski\n",
        "  - seuclidean\n",
        "- Angular and correlation metrics\n",
        "  - cosine\n",
        "  - correlation"
      ],
      "id": "EchWHCn5lu8b"
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "jJc08hD2lu8b"
      },
      "source": [
        "### Configuring the process"
      ],
      "id": "jJc08hD2lu8b"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "AR_btLHHlu8b"
      },
      "outputs": [],
      "source": [
        "dmeasure = 'euclidean'\n",
        "rdims    = 4 # r-dims == Reduced dimensionality\n",
        "print(f\"UMAP dimensionality reduction to {rdims} dimensions with '{dmeasure}' distance measure.\")"
      ],
      "id": "AR_btLHHlu8b"
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "SF-L9XXPlu8c"
      },
      "source": [
        "### Reducing dimensionality\n",
        "\n",
        "This is where we apply the UMAP dimensionality reduction step. Expect this to take **about 45 seconds** (or **1 minute on Google Collab**)."
      ],
      "id": "SF-L9XXPlu8c"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "lX_Ah6Yolu8c"
      },
      "outputs": [],
      "source": [
        "# Assumes that there is a column that contains the\n",
        "# document embedding as an array/list that needs to be\n",
        "# extracted to a new data frame\n",
        "def x_from_df(df:pd.DataFrame, col:str='Embedding') -> pd.DataFrame:\n",
        "    cols = ['E'+str(x) for x in np.arange(0,len(df[col].iloc[0]))]\n",
        "    return pd.DataFrame(df[col].tolist(), columns=cols, index=df.index)"
      ],
      "id": "lX_Ah6Yolu8c"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "66TjWYavlu8c"
      },
      "outputs": [],
      "source": [
        "%%time\n",
        "X = x_from_df(df, col=src_embeddings)\n",
        "\n",
        "reducer = umap.UMAP(\n",
        "    n_neighbors=25,\n",
        "    min_dist=0.01,\n",
        "    n_components=rdims,\n",
        "    random_state=rs)\n",
        "\n",
        "# Basically reduces our feature vectors for each thesis, down to n dimensions\n",
        "X_embedded = reducer.fit_transform(X)"
      ],
      "id": "66TjWYavlu8c"
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "m3F4pO-zlu8c"
      },
      "source": [
        "This next block turns the output Numpy array into a data frame with one column for each reduced dimension."
      ],
      "id": "m3F4pO-zlu8c"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "3M7phFXxlu8d"
      },
      "outputs": [],
      "source": [
        "embedded_dict = {}\n",
        "for i in range(0,X_embedded.shape[1]):\n",
        "    embedded_dict[f\"Dim {i+1}\"] = X_embedded[:,i] # D{dimension_num} (Dim 1...Dim n)\n",
        "\n",
        "# dfe == df embedded\n",
        "dfe = pd.DataFrame(embedded_dict, index=df.index)\n",
        "del(embedded_dict)\n",
        "\n",
        "dfe.head(3)"
      ],
      "id": "3M7phFXxlu8d"
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "2CxpsbONlu8d"
      },
      "source": [
        "Merge the projection on to the main data frame so that we can easily explore the results."
      ],
      "id": "2CxpsbONlu8d"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "fIyTwZtglu8d"
      },
      "outputs": [],
      "source": [
        "projected = df.join(dfe).sort_values(by=['ddc1','ddc2'])\n",
        "print(projected.columns.values)\n",
        "\n",
        "projected.head(3)"
      ],
      "id": "fIyTwZtglu8d"
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "8w2F3aMHlu8e"
      },
      "source": [
        "### Visualising the results"
      ],
      "id": "8w2F3aMHlu8e"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "8Fjk8yTglu8e"
      },
      "outputs": [],
      "source": [
        "# Figure for fine-tuning Matplotlib output so that\n",
        "# we get nicer outputs.\n",
        "def tune_figure(ax, title:str='Title'):\n",
        "    ax.axis('off')\n",
        "    ax.set_title(title, **tfont)\n",
        "    ax.get_legend().set_title(\"\")\n",
        "    ax.get_legend().prop.set_family(lfont['fontname'])\n",
        "    ax.get_legend().prop.set_size(lfont['fontsize'])\n",
        "    ax.get_legend().get_frame().set_linewidth(0.0)"
      ],
      "id": "8Fjk8yTglu8e"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "_eNKKu1dlu8e"
      },
      "outputs": [],
      "source": [
        "f, axs = plt.subplots(1,2,figsize=(14,6))\n",
        "axs = axs.flatten()\n",
        "\n",
        "sns.scatterplot(data=projected, x='Dim 1', y='Dim 2', hue='ddc1', s=5, alpha=0.1, ax=axs[0]);\n",
        "tune_figure(axs[0], 'DDC1 Group')\n",
        "\n",
        "sns.scatterplot(data=projected, x='Dim 1', y='Dim 2', hue='ddc2', s=5, alpha=0.1, ax=axs[1]);\n",
        "tune_figure(axs[1], 'DDC2 Group')\n",
        "\n",
        "#plt.savefig('DDC_Plot.png', dpi=150)\n",
        "plt.show()"
      ],
      "id": "_eNKKu1dlu8e"
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "9n5W3NT8lu8f"
      },
      "source": [
        "## Hierarchical Clustering"
      ],
      "id": "9n5W3NT8lu8f"
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "nNM9V1emlu8k"
      },
      "source": [
        "Expect this next stage to take **about 4 seconds** (or **20 seconds on Google Collab**). If you consistently encounter Out-of-Memory errors in Google Colab while running this next block then you may need to sample the data instead:\n",
        "```python\n",
        "projected = projected.sample(frac=0.5)\n",
        "```\n",
        "When you perform the `join` later the unsampled records should fall out naturally though, obviously, your results will begin to differ substantially from the ones presented in the tutorial."
      ],
      "id": "nNM9V1emlu8k"
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "3R3mSv6Ilu8k"
      },
      "source": [
        "### Configuring the process\n",
        "\n",
        "Note the use of `x for x in...` code to dynamically select the columns that begin with `Dim ` from the 'projected' data set. If you were to specify a different number of dimensions when doing UMAP reduction above then this code does *not* need to change since it makes no assumptions about how many dimensions will match."
      ],
      "id": "3R3mSv6Ilu8k"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "4ZFwF7Uilu8l"
      },
      "outputs": [],
      "source": [
        "%%time\n",
        "# Z is the full record of the clustering process\n",
        "# and is what underpins the dendrogram you'll see below.\n",
        "dmeasure = 'euclidean'\n",
        "Z = linkage(projected[[x for x in projected.columns if x.startswith('Dim ')]],\n",
        "            method='ward', metric='euclidean')"
      ],
      "id": "4ZFwF7Uilu8l"
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "cn24NBlClu8l"
      },
      "source": [
        "Save the clustering result. Can be useful for later exploration offline."
      ],
      "id": "cn24NBlClu8l"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "QL5uPlh4lu8l"
      },
      "outputs": [],
      "source": [
        "pickle.dump(Z, open(os.path.join('data','Z.pickle'), 'wb'))"
      ],
      "id": "QL5uPlh4lu8l"
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "3wlc_TtVlu8l"
      },
      "source": [
        "### Visualising the results"
      ],
      "id": "3wlc_TtVlu8l"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "vmMsTiq2lu8l"
      },
      "outputs": [],
      "source": [
        "last_cls = 100 # The number of last clusters to show in the dendogram\n",
        "\n",
        "plt.title(f'Hierarchical Clustering Dendrogram (truncated at {last_cls} clusters)', **tfont)\n",
        "plt.xlabel('Sample Index (includes count of records in cluster)', **afont)\n",
        "plt.ylabel('Distance', **afont)\n",
        "fig = plt.gcf()\n",
        "fig.set_size_inches(20, 7)\n",
        "fig.set_dpi(150)\n",
        "\n",
        "dendrogram(\n",
        "    Z,\n",
        "    truncate_mode='lastp', # truncate dendrogram to the last p merged clusters\n",
        "    p=last_cls,            # and set a value for last p merged clusters\n",
        "    show_leaf_counts=True, # if parentheses then this is a count of observations, otherwise an id\n",
        "    leaf_rotation=90.,\n",
        "    leaf_font_size=8.,\n",
        "    show_contracted=False, # to get a distribution impression in truncated branches\n",
        ")\n",
        "#plt.savefig(f'Dendogram-{last_cls}.png')\n",
        "plt.show()"
      ],
      "id": "vmMsTiq2lu8l"
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "HthZzD8Olu8m"
      },
      "source": [
        "`Z` is a $(n-1)$ by 4 matrix. At the $i$-th iteration, clusters with indices $Z[i, 0]$ and $Z[i, 1]$ are combined to form cluster $n+i$. A cluster with an index less than $n$ corresponds to one of the $n$ original observations. The distance between clusters $Z[i, 0]$ and $Z[i, 1]$ is given by $Z[i, 2]$. The fourth value $Z[i, 3]$ represents the number of original observations in the newly formed cluster."
      ],
      "id": "HthZzD8Olu8m"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "DwyA2yKnlu8m"
      },
      "outputs": [],
      "source": [
        "table = []\n",
        "\n",
        "# Take the 1st, the 25th, 50th and 75th 'percentiles', and the last\n",
        "for i in [0, math.ceil(Z.shape[0]*0.25), math.ceil(Z.shape[0]*0.5), math.ceil(Z.shape[0]*0.75), -1]:\n",
        "    r = list(Z[i])\n",
        "    r.insert(0,(i if i >= 0 else len(Z)+i))\n",
        "    table.append(r)\n",
        "    table[-1][1] = int(table[-1][1])\n",
        "    table[-1][2] = int(table[-1][2])\n",
        "    table[-1][4] = int(table[-1][4])\n",
        "\n",
        "display(\n",
        "    tabulate(table,\n",
        "             headers=[\"Iteration\",\"$c_{i}$\",\"$c_{j}$\",\"$d_{ij}$\",\"$\\sum{c_{i},c_{j}}$\"],\n",
        "             floatfmt='0.3f', tablefmt='html'))"
      ],
      "id": "DwyA2yKnlu8m"
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "6ojiaAK9lu8m"
      },
      "source": [
        "### Silhouette Scoring\n",
        "\n",
        "<div class=\"alert alert-block alert-warning\">\n",
        "    This section is <i>not</i> shown in the tutorial as it's a support to decision-making, not a core part of the tutorial.\n",
        "</div>\n",
        "\n",
        "This process is slow and computationally intensive: it is calculating a silhouette score for every clustering option between `start` and `end` to give you a sense of how the silhouette score evolves with the number of clusters. A falling silhouette score is normal since, the smaller the number of observations in the cluster, the more you're likely to see some badly-bitted observations within a cluster... at least up until the point where you start having very small clusters indeed. What we're going to be looking for is the 'knee' where this process levels out.\n",
        "\n",
        "You're looking at **_about_ 1 minute** to repeatedly cluster the data."
      ],
      "id": "6ojiaAK9lu8m"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "_8uEsvTAlu8m"
      },
      "outputs": [],
      "source": [
        "%%time\n",
        "\n",
        "start_cl = 2\n",
        "end_cl   = 25\n",
        "\n",
        "sil_scores = []\n",
        "\n",
        "print(\"Scoring cluster levels: \", end=\"\")\n",
        "\n",
        "X_embedded = projected[[x for x in projected if x.startswith('Dim ')]]\n",
        "\n",
        "for i in range(start_cl,end_cl):\n",
        "    print(\".\", end=\"\")\n",
        "    clusterings = fcluster(Z, i, criterion='maxclust')\n",
        "\n",
        "    # Calculate silhouett average\n",
        "    sil_avg = silhouette_score(X_embedded, clusterings)\n",
        "\n",
        "    # Append silhouette scores\n",
        "    sil_scores.append(sil_avg)\n",
        "\n",
        "print(\"\\nDone.\")"
      ],
      "id": "_8uEsvTAlu8m"
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "_lOhAqWElu8n"
      },
      "source": [
        "Using the silhouette scores calculatee above we can now generate a scree plot."
      ],
      "id": "_lOhAqWElu8n"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "Q6obX7cWlu8n"
      },
      "outputs": [],
      "source": [
        "fig,ax = plt.subplots(1,1,figsize=(10,5));\n",
        "sns.lineplot(x=np.arange(start_cl,start_cl+len(sil_scores)), y=sil_scores, ax=ax)\n",
        "ax.set_title(\"\")\n",
        "ax.set_xlabel(\"Number of Clusters\", **afont)\n",
        "ax.set_ylabel(\"Average Silhouette Score\", **afont)\n",
        "\n",
        "plt.box(False)\n",
        "plt.xticks(range(0,end_cl+1,5))\n",
        "ax.tick_params(axis='both', length=0)\n",
        "for xt in range(5,end_cl+1,5):\n",
        "    plt.vlines(xt, np.min(sil_scores), np.max(sil_scores), colors=(0.4,0.4,0.4,0.4), linestyles='dashed')\n",
        "\n",
        "plt.suptitle(f\"Scree Plot for Hierarchical Clustering\", fontsize=14);\n",
        "#plt.savefig(f'Scree-plot.png', dpi=150)\n",
        "plt.show();"
      ],
      "id": "Q6obX7cWlu8n"
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "hROZ8Mnflu8n"
      },
      "source": [
        "### Knee Locator\n",
        "\n",
        "<div class=\"alert alert-block alert-warning\">\n",
        "    This is also not shown in the tutorial as it's part of decision-making, not a core element of the tutorial.\n",
        "</div>\n",
        "\n",
        "We can eyeball the scree plot, but a good sanity-check is to use the [kneed](https://pypi.org/project/kneed/) utility to automate the process. Depending on how your clusters score this may or may not be helpful: _e.g._ sometimes a relatively small deviation triggers the 'knee' at what is obviously only a 'blip'. By experimenting with the value of `S` you can fine-tune the sensitivity to these 'blips'."
      ],
      "id": "hROZ8Mnflu8n"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "pjynd3v3lu8o"
      },
      "outputs": [],
      "source": [
        "kn = KneeLocator(np.arange(3,3+len(sil_scores)), sil_scores,\n",
        "                 curve=\"convex\", direction=\"decreasing\", S=3,\n",
        "                 interp_method=\"polynomial\", polynomial_degree=4)\n",
        "print(f'Suggest knee at {kn.knee} clusters.')"
      ],
      "id": "pjynd3v3lu8o"
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "stBiQQ33lu8o"
      },
      "source": [
        "## Validation"
      ],
      "id": "stBiQQ33lu8o"
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "9GuQY4dElu8p"
      },
      "source": [
        "We're now going to investigate the clustering results in greater detail."
      ],
      "id": "9GuQY4dElu8p"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "KsA1lhRjlu8q"
      },
      "outputs": [],
      "source": [
        "# Assumes a data frame, a clustering result, and a DDC level (1, 2 or 3)\n",
        "# for mapping the clusters on to 'plain English' labels from the DDC\n",
        "def label_clusters(src_df:pd.DataFrame, clusterings:np.ndarray, ddc_level:int=1):\n",
        "\n",
        "    # How many clusters?\n",
        "    num_clusters = clusterings.max()\n",
        "\n",
        "    # Create a new data frame holding only the\n",
        "    # cluster results but indexed to the source\n",
        "    tmp = pd.DataFrame({f'Cluster_{num_clusters}':clusterings}, index=src_df.index)\n",
        "\n",
        "    # Now link them\n",
        "    joined_df = src_df.join(tmp, how='inner')\n",
        "\n",
        "    # Now get the dominant categories for each\n",
        "    labels = get_dominant_cat(joined_df, clusterings.max(), ddc_level)\n",
        "\n",
        "    # And map the labels for each cluster value\n",
        "    joined_df[f'Cluster_Name_{num_clusters}'] = joined_df[f'Cluster_{num_clusters}'].apply(lambda x: labels[x])\n",
        "\n",
        "    return joined_df\n",
        "\n",
        "# Find the dominan class for each cluster assuming a specified DDC level (1, 2 or 3)\n",
        "def get_dominant_cat(clustered_df:pd.DataFrame, num_clusters:int, ddc_level:int=1):\n",
        "    labels = {}\n",
        "    struct = {}\n",
        "\n",
        "    # First, work out the dominant group in each cluster\n",
        "    # and note that together with the cluster number --\n",
        "    # this gives us a dict with key==dominant group and\n",
        "    # then one or more cluster numbers from the output\n",
        "    # above.\n",
        "    for cl in range(1,num_clusters+1):\n",
        "\n",
        "        # Identify the dominant 'domain' (i.e. group by\n",
        "        # DDC description) using the value counts result.\n",
        "        dom     = clustered_df[clustered_df[f'Cluster_{num_clusters}']==cl][f'ddc{ddc_level}'].value_counts().index[0]\n",
        "        print(f\"Cluster {cl} dominated by {dom} theses.\")\n",
        "\n",
        "        if struct.get(dom) == None:\n",
        "            struct[dom] = []\n",
        "\n",
        "        struct[dom].append(cl)\n",
        "\n",
        "    # Next, flip this around so that we create useful\n",
        "    # cluster labels for each cluster. Since we can have\n",
        "    # more than one cluster dominated by the same group\n",
        "    # we have to increment them (e.g. History 1, History 2)\n",
        "    for g in struct.keys():\n",
        "        if len(struct[g])==1:\n",
        "            labels[struct[g][0]]=g\n",
        "            #print(f'{g} maps to Cluster {struct[g][0]}')\n",
        "        else:\n",
        "            for s in range(0,len(struct[g])):\n",
        "                labels[struct[g][s]]=f'{g} {s+1}'\n",
        "                #print(f'{g} {s+1} maps to Cluster {struct[g][s]}')\n",
        "    return labels"
      ],
      "id": "KsA1lhRjlu8q"
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "mDYgk35hlu8r"
      },
      "source": [
        "### 3 Clusters\n",
        "\n",
        "Note that we 'parameterise' the settings here so that you can easily change the number of clusters and DDC level and then re-run the code. You *could* move all of the code in the following code block to a function to streamline the code further (since we re-use the code for additional clusterings) but my feeling was that in this case it would be more confusing for the reader if I did so."
      ],
      "id": "mDYgk35hlu8r"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "wxu6Bgeolu8r"
      },
      "outputs": [],
      "source": [
        "num_clusters = 3\n",
        "ddc_level = 1"
      ],
      "id": "wxu6Bgeolu8r"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "GjvRb2Ailu8r"
      },
      "outputs": [],
      "source": [
        "# Extract clustering based on Z object\n",
        "clusterings  = fcluster(Z, num_clusters, criterion='maxclust')\n",
        "\n",
        "# Label clusters and add to df\n",
        "clustered_df = label_clusters(projected, clusterings, ddc_level=ddc_level)\n",
        "\n",
        "# Diagnostics\n",
        "print()\n",
        "\n",
        "# Classification report gives a (statistical) sense of power (TP/TN/FP/FN)\n",
        "print(classification_report(\n",
        "        clustered_df[f'ddc{ddc_level}'],\n",
        "        clustered_df[f'Cluster_Name_{num_clusters}'],\n",
        "        zero_division=0))\n",
        "\n",
        "# A confusion matrix is basically a cross-tab (without totals, which I think are nice to add)\n",
        "pd.crosstab(columns=clustered_df[f'Cluster_Name_{num_clusters}'],\n",
        "            index=clustered_df[f'ddc{ddc_level}'],\n",
        "            margins=True, margins_name='Total')\n"
      ],
      "id": "GjvRb2Ailu8r"
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "nVEQDo2Ulu8s"
      },
      "source": [
        "### 4 Clusters"
      ],
      "id": "nVEQDo2Ulu8s"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "hiIBCqkilu8s"
      },
      "outputs": [],
      "source": [
        "num_clusters = 4\n",
        "ddc_level = 2"
      ],
      "id": "hiIBCqkilu8s"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "WC8SgKRNlu8t"
      },
      "outputs": [],
      "source": [
        "# Extract clustering based on Z object\n",
        "clusterings  = fcluster(Z, num_clusters, criterion='maxclust')\n",
        "\n",
        "# Label clusters and add to df\n",
        "clustered_df = label_clusters(projected, clusterings, ddc_level=ddc_level)\n",
        "\n",
        "# Diagnostics\n",
        "print()\n",
        "\n",
        "# Classification report gives a (statistical) sense of power (TP/TN/FP/FN)\n",
        "print(classification_report(clustered_df[f'ddc{ddc_level}'], clustered_df[f'Cluster_Name_{num_clusters}']))\n",
        "\n",
        "# A confusion matrix is basically a cross-tab (without totals, which I think are nice to add)\n",
        "pd.crosstab(columns=clustered_df[f'Cluster_Name_{num_clusters}'],\n",
        "            index=clustered_df[f'ddc{ddc_level}'],\n",
        "            margins=True, margins_name='Total')\n"
      ],
      "id": "WC8SgKRNlu8t"
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "PdPedqh3lu8u"
      },
      "source": [
        "## Are the experts 'wrong'?\n",
        "\n",
        "Here we are trying to look in more detail at the PhDs that have (potentially!) been 'misclassified' by the experts--our clustering places them in a different group from the one specified by the DDC. Clearly, we'll have some false-positives in here as well, but the point is to examine the degree to which misclassification is both plausible and useful in terms of demonstrating the value of the NLP approach."
      ],
      "id": "PdPedqh3lu8u"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "9B6ahdR-lu8u"
      },
      "outputs": [],
      "source": [
        "num_clusters = 4\n",
        "ddc_level  = 2"
      ],
      "id": "9B6ahdR-lu8u"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "8dhd8A4tlu8u"
      },
      "outputs": [],
      "source": [
        "# Deal with long plot titles\n",
        "def break_title(title:str, target_len:int=40):\n",
        "    words = title.split(\" \")\n",
        "    fmt_title = ''\n",
        "    lines     = 1\n",
        "    for i in range(len(words)):\n",
        "        if (len(fmt_title) + len(words[i]))/target_len > lines:\n",
        "            fmt_title += \"\\n\"\n",
        "            lines += 1\n",
        "        fmt_title += words[i] + \" \"\n",
        "    return fmt_title"
      ],
      "id": "8dhd8A4tlu8u"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "FU1VLSuRlu8u"
      },
      "outputs": [],
      "source": [
        "projected = df.join(dfe).sort_values(by=['ddc1','ddc2'])\n",
        "\n",
        "# Extract clustering based on Z object\n",
        "clusterings  = fcluster(Z, num_clusters, criterion='maxclust')\n",
        "\n",
        "# Label clusters and add to df\n",
        "clustered_df = label_clusters(projected, clusterings, ddc_level=ddc_level)"
      ],
      "id": "FU1VLSuRlu8u"
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "tmsv08sclu8v"
      },
      "source": [
        "This approach to misclassification works well for level 1 and level 2 of the DDC, but it gets a lot more complex when you're looking at level 3 because there are _so_ many different groups and misclassifications (e.g. Economics vs Financial Economics) that the results become much harder to interpret meaningfully. The results also end up being highly unstable at that level and you'd probably want to think about validation very differently."
      ],
      "id": "tmsv08sclu8v"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "CppORx6Zlu8v"
      },
      "outputs": [],
      "source": [
        "fsize = (12,4)\n",
        "dpi   = 150\n",
        "nrows = 1\n",
        "\n",
        "# For each of the named clusters -- these will\n",
        "# have been assigned the name of the dominant\n",
        "# DDC in the cluster...\n",
        "for ddc_name in sorted(clustered_df[f'ddc{ddc_level}'].unique()):\n",
        "\n",
        "    print(f\"Processing {ddc_name} DDC...\")\n",
        "\n",
        "    # Here's the selected data sub-frame\n",
        "    sdf = clustered_df[clustered_df[f'ddc{ddc_level}']==ddc_name]\n",
        "\n",
        "    # Create one document per label (i.e. aggregate the individual documents and count)\n",
        "    docs = pd.DataFrame({'Document': sdf.tokens.apply(' '.join), 'Class': sdf[f'Cluster_Name_{num_clusters}']})\n",
        "    docs_per_class = docs.groupby(['Class'], as_index=False).agg({'Document': ' '.join})\n",
        "\n",
        "    cvec  = CountVectorizer().fit(docs_per_class.Document)\n",
        "    count = cvec.transform(docs_per_class.Document)\n",
        "    words = cvec.get_feature_names_out()\n",
        "\n",
        "    ctfidf = CTFIDFVectorizer().fit_transform(count, n_samples=len(docs))\n",
        "\n",
        "    ncols  = len(sdf[f'Cluster_Name_{num_clusters}'].unique())\n",
        "    nplots = nrows * ncols\n",
        "\n",
        "    axwidth  = math.floor((fsize[0]/ncols)*dpi)\n",
        "    axheight = math.floor(fsize[1]/nrows*dpi)\n",
        "\n",
        "    print(f\"Aiming for width x height of {axwidth} x {axheight}\")\n",
        "\n",
        "    # One image per DDC Category\n",
        "    f,axes = plt.subplots(nrows, ncols, figsize=fsize)\n",
        "\n",
        "    # Set up the word cloud\n",
        "    Cloud = WordCloud(background_color=None, mode='RGBA',\n",
        "                      max_words=50, relative_scaling=0.5, font_path=fp,\n",
        "                      height=axheight, width=axwidth)\n",
        "\n",
        "    for i, cl in enumerate(sorted(sdf[f'Cluster_Name_{num_clusters}'].unique())):\n",
        "        print(f\"Processing {cl} cluster ({i+1} of {nplots})\")\n",
        "\n",
        "        try:\n",
        "            ax = axes.flatten()[i]\n",
        "        except AttributeError:\n",
        "            ax = axes\n",
        "\n",
        "        tmp = pd.DataFrame({'words':words, 'weights':ctfidf.toarray()[i]}).set_index('words')\n",
        "\n",
        "        if ddc_name == cl:\n",
        "            ax.set_title(break_title(f\"{ddc_name} DDC Dominates Cluster ($n$={(sdf[f'Cluster_Name_{num_clusters}']==cl).sum():,})\"), **tfont)\n",
        "        else:\n",
        "            ax.set_title(break_title(f\"'Misclustered' into {cl} Cluster ($n$={(sdf[f'Cluster_Name_{num_clusters}']==cl).sum():,})\", 35), **tfont)\n",
        "        ax.imshow(Cloud.generate_from_frequencies({x:tmp.loc[x].weights for x in tmp.index.tolist()}))\n",
        "        ax.axis(\"off\")\n",
        "        del(tmp)\n",
        "\n",
        "    while i < len(axes.flatten())-1:\n",
        "        i += 1\n",
        "        axes.flatten()[i].axis('off')\n",
        "\n",
        "    # Set up a super-title and tweak the tight_layout\n",
        "    # in line with: https://stackoverflow.com/a/45161551/4041902\n",
        "    f.suptitle(f\"Clusters for {ddc_name} DDC\", **sfont)\n",
        "    f.tight_layout(rect=[0, 0.03, 1, 0.915])\n",
        "\n",
        "    #plt.savefig(f'c{num_clusters}-d{ddc_level}-class_tfidf-{ddc_name}.png', dpi=dpi)\n",
        "    print(\"Done.\")"
      ],
      "id": "CppORx6Zlu8v"
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "gz-HRQIflu8w"
      },
      "source": [
        "### 11 Clusters"
      ],
      "id": "gz-HRQIflu8w"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "LxP5JqsOlu8x"
      },
      "outputs": [],
      "source": [
        "num_clusters = 11\n",
        "ddc_level = 3"
      ],
      "id": "LxP5JqsOlu8x"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "DkmQR0iMlu8y"
      },
      "outputs": [],
      "source": [
        "projected = df.join(dfe).sort_values(by=['ddc1','ddc2'])\n",
        "\n",
        "# Extract clustering based on Z object\n",
        "clusterings  = fcluster(Z, num_clusters, criterion='maxclust')\n",
        "\n",
        "# Label clusters and add to df\n",
        "clustered_df = label_clusters(projected, clusterings, ddc_level=ddc_level)"
      ],
      "id": "DkmQR0iMlu8y"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "3tivUAhclu8y"
      },
      "outputs": [],
      "source": [
        "# A confusion matrix is basically a cross-tab (without totals, which I think are nice to add)\n",
        "# Here we transform the matrix to have the DDCs on the top so that it's easy to scan down and\n",
        "# see how they were clustered. That said, with this number of clusters and DDCs it's very hard\n",
        "# to make out meaningful patterns. We've also not bothered producing the statistical tests since\n",
        "# they don't map well between cluster names and DDCs.\n",
        "pd.crosstab(\n",
        "    columns=clustered_df[f'Cluster_Name_{num_clusters}'],\n",
        "    index=clustered_df[f'ddc{ddc_level}'],\n",
        "    margins=True, margins_name='Total').T"
      ],
      "id": "3tivUAhclu8y"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "jK7b619llu8y"
      },
      "outputs": [],
      "source": [
        "nplots = len(clustered_df[f'Cluster_Name_{num_clusters}'].unique())\n",
        "ncols  = 3\n",
        "nrows  = math.ceil(nplots/ncols)\n",
        "print(f\"Expecting {nplots} plots on {nrows} x {ncols} layout.\")"
      ],
      "id": "jK7b619llu8y"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "0W1cBr6clu8y"
      },
      "outputs": [],
      "source": [
        "dpi      = 150\n",
        "fsize    = (12,14)\n",
        "axwidth  = math.floor((fsize[0]/ncols)*dpi)\n",
        "axheight = math.floor(fsize[1]/nrows*dpi)\n",
        "\n",
        "print(f\"Aiming for width x height of {axwidth} x {axheight}\")"
      ],
      "id": "0W1cBr6clu8y"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "rwK6kQ4Llu8z"
      },
      "outputs": [],
      "source": [
        "# Create one document per label (i.e. aggregate the individual documents and count)\n",
        "docs = pd.DataFrame({'Document': clustered_df.tokens.apply(' '.join), 'Class': clustered_df[f'Cluster_Name_{num_clusters}']})\n",
        "docs_per_class = docs.groupby(['Class'], as_index=False).agg({'Document': ' '.join})\n",
        "\n",
        "cvec  = CountVectorizer().fit(docs_per_class.Document)\n",
        "count = cvec.transform(docs_per_class.Document)\n",
        "words = cvec.get_feature_names_out()\n",
        "\n",
        "ctfidf = CTFIDFVectorizer().fit_transform(count, n_samples=len(docs))"
      ],
      "id": "rwK6kQ4Llu8z"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "Me2M4hGilu8z"
      },
      "outputs": [],
      "source": [
        "# One image per DDC Category\n",
        "f,axes = plt.subplots(nrows, ncols, figsize=fsize)\n",
        "\n",
        "# Set up the word cloud\n",
        "Cloud = WordCloud(background_color=None, mode='RGBA',\n",
        "                  max_words=75, relative_scaling=0.5, font_path=fp,\n",
        "                  height=axheight, width=axwidth)\n",
        "\n",
        "for i, cl in enumerate(sorted(clustered_df[f'Cluster_Name_{num_clusters}'].unique())):\n",
        "\n",
        "    print(f\"Processing {cl} cluster ({i+1} of {nplots})\")\n",
        "\n",
        "    try:\n",
        "        ax = axes.flatten()[i]\n",
        "    except AttributeError:\n",
        "        ax = axes\n",
        "\n",
        "    tmp = pd.DataFrame({'words':words, 'weights':ctfidf.toarray()[i]}).set_index('words')\n",
        "\n",
        "    ax.set_title(break_title(f\"{cl} ($n$={(clustered_df[f'Cluster_Name_{num_clusters}']==cl).sum():,})\", 30), **tfont)\n",
        "    ax.imshow(Cloud.generate_from_frequencies({x:tmp.loc[x].weights for x in tmp.index.tolist()}))\n",
        "    ax.axis(\"off\")\n",
        "    del(tmp)\n",
        "\n",
        "while i < len(axes.flatten())-1:\n",
        "    i += 1\n",
        "    axes.flatten()[i].axis('off')\n",
        "\n",
        "# Set up a super-title and tweak the tight_layout\n",
        "# in line with: https://stackoverflow.com/a/45161551/4041902\n",
        "f.suptitle(f\"{num_clusters} Cluster Class-TF/IDF\", **sfont)\n",
        "f.tight_layout(rect=[0, 0.03, 1, 0.95])\n",
        "\n",
        "#plt.savefig(f'c{num_clusters}-d{ddc_level}-class_tfidf.png', dpi=dpi)\n",
        "\n",
        "print(\"Done.\")"
      ],
      "id": "Me2M4hGilu8z"
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "iJDBM2rGlu8z"
      },
      "source": [
        "### Comparing clustering algorithms"
      ],
      "id": "iJDBM2rGlu8z"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "Td4Ozdaelu8z"
      },
      "outputs": [],
      "source": [
        "#dfe.head()\n",
        "clustered_df.head(2)"
      ],
      "id": "Td4Ozdaelu8z"
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "J2FbuAe3lu8z"
      },
      "outputs": [],
      "source": [
        "from sklearn.cluster import KMeans, DBSCAN, SpectralClustering, AgglomerativeClustering\n",
        "\n",
        "clusterings = [\n",
        "    KMeans(n_clusters=5, n_init=20, random_state=42),\n",
        "    DBSCAN(eps=0.3, min_samples=20),\n",
        "    SpectralClustering(n_clusters=5, n_init=20),\n",
        "    AgglomerativeClustering(n_clusters=5, linkage='ward'),\n",
        "]\n",
        "ncols       = 2\n",
        "nrows       = math.ceil(len(clusterings)/ncols)\n",
        "\n",
        "dpi         = 150\n",
        "fsize       = (12,8)\n",
        "axwidth     = math.floor((fsize[0]/ncols)*dpi)\n",
        "axheight    = math.floor(fsize[1]/nrows*dpi)\n",
        "\n",
        "f, axs = plt.subplots(nrows,ncols,figsize=fsize)\n",
        "axes   = axs.flatten()\n",
        "\n",
        "for idx, algo in enumerate(clusterings):\n",
        "\n",
        "    clustered_df = clustered_df[clustered_df.columns[[x.startswith('Dim') for x in clustered_df.columns]].tolist()].copy()\n",
        "\n",
        "    rs = algo.fit(clustered_df)\n",
        "    s  = pd.Series(rs.labels_, index=clustered_df.index)\n",
        "    clustered_df['cluster'] = s\n",
        "\n",
        "    sns.scatterplot(data=clustered_df, x='Dim 1', y='Dim 2', hue='cluster', palette=sns.color_palette(\"husl\", len(s.unique())), s=3, alpha=0.2, ax=axes[idx]);\n",
        "    axes[idx].axis('off')\n",
        "    axes[idx].set_title(f'{algo.__class__.__name__.replace(\"Clustering\",\"\")} Clustering Result', **tfont)\n",
        "    axes[idx].get_legend().set_title(\"\")\n",
        "    axes[idx].get_legend().get_frame().set_linewidth(0.0)\n",
        "    print()\n",
        "\n",
        "#plt.savefig(os.path.join(c.outputs_dir,f'{c.which_embedding}-{c.embedding}-d{c.dimensions}-semantic_space-clustering_comparison.png'), dpi=150)\n",
        "plt.show()\n",
        "#print(os.path.join(c.outputs_dir,f'{c.which_embedding}-{c.embedding}-d{c.dimensions}-semantic_space-clustering_comparison.png'))"
      ],
      "id": "J2FbuAe3lu8z"
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.8.5"
    },
    "colab": {
      "provenance": [],
      "include_colab_link": true
    }
  },
  "nbformat": 4,
  "nbformat_minor": 5
}