{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Getting started with Mapper\n", "\n", "In this notebook we explore a few of the core features included in ``giotto-tda``'s implementation of the [Mapper algorithm](https://research.math.osu.edu/tgda/mapperPBG.pdf). \n", "\n", "If you are looking at a static version of this notebook and would like to run its contents, head over to [GitHub](https://github.com/giotto-ai/giotto-tda/blob/master/examples/mapper_quickstart.ipynb) and download the source.\n", "\n", "## Useful references\n", "\n", "* [An introduction to Topological Data Analysis: fundamental and practical aspects for data scientists](https://arxiv.org/abs/1710.04019)\n", "* [An Introduction to Topological Data Analysis for Physicists: From LGM to FRBs](https://arxiv.org/abs/1904.11044)\n", "\n", "**License: AGPLv3**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Import libraries" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Data wrangling\n", "import numpy as np\n", "import pandas as pd # Not a requirement of giotto-tda, but is compatible with the gtda.mapper module\n", "\n", "# Data viz\n", "from gtda.plotting import plot_point_cloud\n", "\n", "# TDA magic\n", "from gtda.mapper import (\n", " CubicalCover,\n", " make_mapper_pipeline,\n", " Projection,\n", " plot_static_mapper_graph,\n", " plot_interactive_mapper_graph,\n", " MapperInteractivePlotter\n", ")\n", "\n", "# ML tools\n", "from sklearn import datasets\n", "from sklearn.cluster import DBSCAN\n", "from sklearn.decomposition import PCA" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Generate and visualise data\n", "As a simple example, let's generate a two-dimensional point cloud of two concentric circles. The goal will be to examine how Mapper can be used to generate a topological graph that captures the salient features of the data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data, _ = datasets.make_circles(n_samples=5000, noise=0.05, factor=0.3, random_state=42)\n", "\n", "plot_point_cloud(data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Configure the Mapper pipeline\n", "Given a dataset ${\\cal D}$ of points $x \\in \\mathbb{R}^n$, the basic steps behind Mapper are as follows:\n", "\n", "1. Map ${\\cal D}$ to a lower-dimensional space using a **filter function** $f: \\mathbb{R}^n \\to \\mathbb{R}^m$. Common choices for the filter function include projection onto one or more axes via PCA or density-based methods. In ``giotto-tda``, you can import a variety of filter functions as follows:\n", "\n", "```python\n", "from gtda.mapper.filter import FilterFunctionName\n", "```\n", "\n", "2. Construct a cover of the filter values ${\\cal U} = (U_i)_{i\\in I}$, typically in the form of a set of overlapping intervals which have constant length. As with the filter, a choice of cover can be imported as follows:\n", "\n", "```python\n", "from gtda.mapper.cover import CoverName\n", "```\n", "\n", "3. For each interval $U_i \\in {\\cal U}$ cluster the points in the preimage $f^{-1}(U_i)$ into sets $C_{i,1}, \\ldots , C_{i,k_i}$. 
The choice of clustering algorithm can be any of ``scikit-learn``'s [clustering methods](https://scikit-learn.org/stable/modules/clustering.html) or an implementation of agglomerative clustering in ``giotto-tda``:\n", "\n", "```python\n", "# scikit-learn method\n", "from sklearn.cluster import ClusteringAlgorithm\n", "# giotto-tda method\n", "from gtda.mapper.cluster import FirstSimpleGap\n", "```\n", "\n", "4. Construct the topological graph whose vertices are the cluster sets $(C_{i,j})_{i\in I, j \in \{1,\ldots,k_i\}}$ and in which an edge joins two nodes whenever they have points in common: $C_{i,j} \cap C_{k,l} \neq \emptyset$. This step is handled automatically by ``giotto-tda``.\n", "\n",
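"For concreteness, here is one possible choice for each of the three user-specified ingredients above (filter function, cover, clustering algorithm). This is only a hedged, illustrative sketch: it assumes that ``Eccentricity`` and ``OneDimensionalCover`` are available in ``gtda.mapper.filter`` and ``gtda.mapper.cover`` respectively, alongside the ``FirstSimpleGap`` clusterer mentioned above.\n", "\n", "```python\n", "# Illustrative sketch (not part of the original quickstart): one concrete choice\n", "# per Mapper ingredient, assuming the imports below are available.\n", "from gtda.mapper.filter import Eccentricity\n", "from gtda.mapper.cover import OneDimensionalCover\n", "from gtda.mapper.cluster import FirstSimpleGap\n", "\n", "filter_func = Eccentricity()  # one filter value per data point\n", "cover = OneDimensionalCover(n_intervals=10, overlap_frac=0.3)  # overlapping intervals over the filter range\n", "clusterer = FirstSimpleGap()  # agglomerative clustering cut at the first large gap in the dendrogram\n", "```\n", "\n", "Any such combination can be swapped in for the default choices used in the rest of this notebook.\n", "\n",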
"These four steps are implemented in the ``MapperPipeline`` object that mimics the ``Pipeline`` class from ``scikit-learn``. We provide a convenience function ``make_mapper_pipeline`` that allows you to pass the choice of filter function, cover, and clustering algorithm as arguments. For example, to project our data onto the $x$- and $y$-axes, we could set up the pipeline as follows:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Define filter function – can be any scikit-learn transformer\n", "filter_func = Projection(columns=[0, 1])\n", "# Define cover\n", "cover = CubicalCover(n_intervals=10, overlap_frac=0.3)\n", "# Choose clustering algorithm – default is DBSCAN\n", "clusterer = DBSCAN()\n", "\n", "# Configure parallelism of clustering step\n", "n_jobs = 1\n", "\n", "# Initialise pipeline\n", "pipe = make_mapper_pipeline(\n", " filter_func=filter_func,\n", " cover=cover,\n", " clusterer=clusterer,\n", " verbose=False,\n", " n_jobs=n_jobs,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Visualise the Mapper graph" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With the Mapper pipeline at hand, it is now a simple matter to visualise it. To warm up, let's examine the graph in two dimensions using the default arguments of ``giotto-tda``'s plotting function:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig = plot_static_mapper_graph(pipe, data)\n", "fig.show(config={'scrollZoom': True})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By default, nodes are coloured according to the average row index of the data points they represent.\n", "\n", "From the figure we can see that we have captured the salient topological features of our underlying data, namely two holes!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Configure the colouring of the Mapper graph\n", "\n", "In this example, it is more instructive to colour by the average values of the $x$- and $y$-coordinates. This can be achieved by passing the input data again as the keyword argument ``color_data``. In general, any ``numpy`` array or ``pandas`` dataframe explicitly passed as ``color_data`` will be used to calculate one colouring per column present. A dropdown menu is automatically created if ``color_data`` has more than one column, making it easy to switch between column-based colourings.\n", "\n", "At the same time, let's configure the choice of colorscale:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plotly_params = {\"node_trace\": {\"marker_colorscale\": \"Blues\"}}\n", "fig = plot_static_mapper_graph(\n", " pipe, data, color_data=data, plotly_params=plotly_params\n", ")\n", "fig.show(config={'scrollZoom': True})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Even finer control over the colouring can be achieved with the additional keyword arguments ``color_features`` and ``node_color_statistic``. The former can be a ``scikit-learn`` transformer or a list of indices or column names to select from the data. For example, colouring by a PCA component can be neatly implemented as follows:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Initialise estimator to color graph by\n", "pca = PCA(n_components=1)\n", "\n", "fig = plot_static_mapper_graph(\n", " pipe, data, color_data=data, color_features=pca\n", ")\n", "fig.show(config={'scrollZoom': True})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "``node_color_statistic`` refers to the function used to extract a single colour value for each node, starting from the values of ``color_features`` at each data point. The default, as we have seen, is ``np.mean``, but any other callable which sends *column vectors* to scalars is acceptable:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig = plot_static_mapper_graph(\n", " pipe, data, color_data=data, color_features=pca, node_color_statistic=lambda x: np.mean(x) / 2\n", ")\n", "fig.show(config={'scrollZoom': True})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you prefer to input custom node colours directly, you can do so by passing a ``numpy`` array or ``pandas`` dataframe of the correct length as ``node_color_statistic``. For example (see \"Run the Mapper pipeline\" below), the nodes of this particular Mapper pipeline can be obtained as follows:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "graph = pipe.fit_transform(data)\n", "node_elements = graph.vs[\"node_elements\"]\n", "print(f\"There are {len(node_elements)} nodes.\\nThe first node consists of row indices {node_elements[0]}.\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let us try this:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig = plot_static_mapper_graph(\n", " pipe, data, node_color_statistic=np.arange(len(node_elements))\n", ")\n", "fig.show(config={'scrollZoom': True})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Pass a pandas DataFrame as input\n", "\n", "It is also possible to feed ``plot_static_mapper_graph`` a pandas DataFrame:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df = pd.DataFrame(data, columns=[\"x\", \"y\"])\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before plotting, we need to update the Mapper pipeline so that the filter function projects onto the DataFrame's column names rather than onto integer indices. 
This can be achieved using the ``set_params`` method as follows:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pipe.set_params(filter_func=Projection(columns=[\"x\", \"y\"]));" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig = plot_static_mapper_graph(pipe, df, color_data=df)\n", "fig.show(config={'scrollZoom': True})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Example: colour by proportions of categorical variables\n", "\n", "Often one has a dataset of observations, each belonging to a category (e.g. a country or region name). It can be very useful to visualise how each category is distributed among the nodes of the Mapper graph. As a trivial example, let us add a categorical column to our dataframe, with value equal to ``\"A\"`` for the outer circle, and ``\"B\"`` for the inner one: " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df[\"Circle\"] = df[\"x\"] ** 2 + df[\"y\"] ** 2 < 0.25\n", "df[\"Circle\"] = df[\"Circle\"].replace([False, True], [\"A\", \"B\"])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To visualise the proportions of data points in each Mapper node belonging to either circle, we can create a dataframe of one-hot encodings of the categorical variable ``\"Circle\"``, and pass it to ``plot_static_mapper_graph`` as ``color_data``:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "color_data = pd.get_dummies(df[\"Circle\"], prefix=\"Circle\")\n", "\n", "fig = plot_static_mapper_graph(pipe, df[[\"x\", \"y\"]], color_data=color_data)\n", "fig.show(config={'scrollZoom': True})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The dropdown menu allows us to switch quickly between the colourings for each category, without needing to recompute the underlying graph." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Change the layout algorithm\n", "\n", "By default, ``plot_static_mapper_graph`` uses the Kamada–Kawai algorithm for the layout; however, any of the layout algorithms defined in python-igraph are supported (see [here](https://igraph.org/python/doc/igraph.Graph-class.html) for a list of possible layouts). For example, we can switch to the Fruchterman–Reingold layout as follows:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Reset back to numpy projection\n", "pipe.set_params(filter_func=Projection(columns=[0, 1]));" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig = plot_static_mapper_graph(\n", " pipe, data, layout=\"fruchterman_reingold\", color_data=data\n", ")\n", "fig.show(config={'scrollZoom': True})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Change the layout dimension\n", "\n", "It is also possible to visualise the Mapper graph in 3 dimensions by configuring the ``layout_dim`` argument:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig = plot_static_mapper_graph(pipe, data, layout_dim=3, color_data=data)\n", "fig.show(config={'scrollZoom': True})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Change the node size scale\n", "\n", "In general, node sizes are proportional to the number of dataset elements contained in the nodes. 
Sometimes, however, the default scale leads to graphs which are difficult to decipher, for example because nodes are excessively small. The ``node_scale`` parameter can be used to configure this scale." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "node_scale = 30\n", "fig = plot_static_mapper_graph(pipe, data, layout_dim=3, node_scale=node_scale)\n", "fig.show(config={'scrollZoom': True})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Run the Mapper pipeline\n", "\n", "Behind the scenes of ``plot_static_mapper_graph`` is a ``MapperPipeline`` object ``pipe`` that can be used like a typical ``scikit-learn`` estimator. For example, to extract the underlying graph data structure we can do the following:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "graph = pipe.fit_transform(data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The resulting graph is a [python-igraph](https://igraph.org/python/) object which stores node metadata in the form of attributes. We can access this data as follows:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "graph.vs.attributes()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here ``'pullback_set_label'`` and ``'partial_cluster_label'`` refer to the interval and cluster sets described above. ``'node_elements'`` refers to the indices of our original data that belong to each node. For example, to find which points belong to the first node of the graph we can access the desired data as follows:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "node_id = 0\n", "node_elements = graph.vs[\"node_elements\"]\n", "\n", "print(f\"\"\"\n", "Node ID: {node_id}\n", "Node elements: {node_elements[node_id]}\n", "Data points: {data[node_elements[node_id]]}\n", "\"\"\")" ] },
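{ "cell_type": "markdown", "metadata": {}, "source": [ "As a quick sanity check (a hedged sketch rather than part of the original quickstart, relying only on standard ``python-igraph`` methods such as ``vcount``, ``ecount`` and ``clusters``), we can also quantify the two holes observed earlier by computing the circuit rank of ``graph``, i.e. its number of independent cycles:\n", "\n", "```python\n", "# Circuit rank = |E| - |V| + number of connected components.\n", "# For the two concentric circles we expect a value close to 2,\n", "# although noise and the choice of cover may shift it slightly.\n", "n_components = len(graph.clusters())\n", "circuit_rank = graph.ecount() - graph.vcount() + n_components\n", "print(f\"Circuit rank of the Mapper graph: {circuit_rank}\")\n", "```" ] },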
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Creating custom filter functions\n", "\n", "In some cases, the list of filter functions provided in ``gtda.mapper.filter`` or ``scikit-learn`` may not be sufficient for the task at hand. In such cases, one can pass any callable to the pipeline that acts **row-wise** on the input data. For example, we can project by taking the sum of the $(x,y)$ coordinates as follows:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "filter_func = np.sum\n", "\n", "pipe = make_mapper_pipeline(\n", " filter_func=filter_func,\n", " cover=cover,\n", " clusterer=clusterer,\n", " verbose=True,\n", " n_jobs=n_jobs,\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig = plot_static_mapper_graph(pipe, data)\n", "fig.show(config={'scrollZoom': True})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Visualise the Mapper graph interactively (Live Jupyter session needed)\n", "\n", "In general, building useful Mapper graphs requires some iteration through the various parameters in the cover and clustering algorithm. To simplify that process, ``giotto-tda`` provides an interactive figure that can be configured in real time by tweaking the pipeline hyperparameters. You can produce it in two ways, namely:\n", "\n", " 1. by using the ``plot_interactive_mapper_graph`` function in the same way as ``plot_static_mapper_graph``;" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pipe = make_mapper_pipeline()\n", "\n", "# Generate interactive widget\n", "plot_interactive_mapper_graph(pipe, data, color_data=data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " 2. (**recommended**, new in ``giotto-tda`` 0.5.0) in an object-oriented way, by instantiating a ``MapperInteractivePlotter`` object and then calling its ``plot`` method to create the widget." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Create the plotter object\n", "MIP = MapperInteractivePlotter(pipe, data)\n", "\n", "# Generate interactive widget\n", "MIP.plot(color_data=data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The advantage of using the class API with ``MapperInteractivePlotter`` is that, once you are done tweaking the hyperparameters, you can inspect the latest state of the objects (graph, colours, pipeline, inner figure) that were changed during the interactive session." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(\"Attributes created by `.plot` and updated during the interactive session:\\n\",\n", " [attr for attr in dir(MIP) if attr.endswith(\"_\") and attr[0] != \"_\"])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the widgets, if invalid parameters are selected, the _Show logs_ checkbox can be used to see what went wrong.\n", "\n", "To see the interactive outputs above, please **download** the notebook from [GitHub](https://github.com/giotto-ai/giotto-tda/blob/master/examples/mapper_quickstart.ipynb) and execute it locally." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.2" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 4 }