{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Mapper Tutorial - the Santa Cloud\n", "\n", "_Created by:_ Davide Burba, December 2019\n", "\n", "In this notebook, we apply the [Mapper algorithm](https://research.math.osu.edu/tgda/mapperPBG.pdf)\n", "to point cloud data by using [giotto-tda](https://giotto.ai),\n", "an open-source Python library for topological data analysis." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Import libraries" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%load_ext autoreload\n", "%autoreload 2" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# data wrangling\n", "import pandas as pd\n", "import numpy as np\n", "\n", "# tda magic\n", "from gtda.mapper.filter import Projection\n", "from gtda.mapper.cover import (\n", " OneDimensionalCover, \n", " CubicalCover\n", ")\n", "\n", "\n", "from gtda.mapper.pipeline import make_mapper_pipeline\n", "from gtda.mapper.visualization import (\n", " plot_static_mapper_graph,\n", " plot_interactive_mapper_graph\n", ")\n", "\n", "# data viz\n", "import plotly.graph_objects as go\n", "from matplotlib.colors import to_rgba_array, to_rgba\n", "\n", "# ml tools\n", "from sklearn.cluster import DBSCAN\n", "\n", "import warnings\n", "\n", "warnings.filterwarnings(\"ignore\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Load and visualise data\n", "\n", "The starting point for most Mapper analyses is a point cloud in an $n$-dimensional Euclidean space. For our purposes we'll use a dataset containing 20,000 points sampled from a Santa Claus shape in 3-dimensions.\n", "\n", "\n", "_**Note:**_ The dataset has been sampled from a mesh object available [here](https://free3d.com/3d-model/santa-clau-77751.html) through the usage of [CloudCompare](https://www.cloudcompare.org)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data = pd.read_csv(\"data.csv\")\n", "data.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we can see, the data consists of 3-dimensional coordinates $ v_i = (x_i, y_i, z_i)$ for each point, along with a label for it's colour. Let's have a look at this 3-dimensional point cloud:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": false }, "outputs": [], "source": [ "fig = go.Figure(\n", " data=[\n", " go.Scatter3d(\n", " x=data.x,\n", " y=data.y,\n", " z=data.z,\n", " mode=\"markers\",\n", " marker=dict(size=1, color=data.color,),\n", " )\n", " ]\n", ")\n", "fig.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# The Mapper algorithm\n", "\n", "Let's consider an $n$-dimensional point cloud, i.e. a dataset with $n$ numerical features. The Mapper algorithm follows three main steps:\n", "\n", "1. **_Filtering:_** map each data point $x \\in \\mathbb{R}^n$ to a low dimensional space $\\mathbb{R}^m$ through a filter function $f: \\mathbb{R}^n \\to \\mathbb{R}^m$\n", "2. **_Covering:_** cover the mapped values $y = f(x)$ with overlapping intervals.\n", "3. **_Clustering:_** for each interval, calculate the pre-image $x = f^{-1}(y)$ and apply a clustering algorithm on the set of points belonging to each pre-image.\n", "\n", "The topological graph is then composed of:\n", "- **nodes**: clusters\n", "- **edges**: non-empty intersections between clusters\n", "\n", "Even if its formulation could seem a bit intimidating, the best way to intuitively understand how the Mapper algorithm works is to use it and \"play\" with its parameters. The API of [giotto-tda](https://giotto.ai) is scikit-learn compatible and provides a very convenient way to fit the Mapper through a pipeline object.\n", "\n", "We will apply the Mapper algorithm to the Santa Claus point cloud (the so-called \"Santa Cloud\"). Our goal is to get a minimal graph which preserves the topological properties of the original figure. \n", "\n", "First, let's define an array of coordinates to act as the point cloud to analyse:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "coords = data[[\"x\",\"y\",\"z\"]].values" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Identity as *filter*\n", "\n", "The easiest way to produce a meaningful graph from a dataset is to use the identity as *filter* function. \n", "\n", "Formally, we map each vector of coordinates $\\vec{v}_i$ to itself via $f: \\vec{v} \\to \\vec{v}$. Since the mapped values lie in $\\mathbb{R}^3$, we use a 3-dimensional cover." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def plot_mapper_graph_from_pipeline(pipe):\n", " \"\"\" A helper function, that plots the graph\n", " resulting from applying the pipeline `pipe`\n", " on the data.\n", " Parameters\n", " -----------\n", " pipe : \n", " \"\"\"\n", " # generate topological graph from point cloud\n", " graph = pipeline.fit_transform(coords)\n", "\n", " # get cluster member indices\n", " node_elements = graph[\"node_metadata\"][\"node_elements\"]\n", "\n", " # configure choice of layout (x,z values)\n", " layout = np.array([np.mean(coords[el], axis=0)[[0,2]] for el in node_elements])\n", "\n", " # define node coloring\n", " node_colors = np.array([data.loc[el, \"color\"].value_counts().index[0] for el in node_elements])\n", "\n", " plotly_kwargs = {\n", " 'node_trace_marker_colorscale': None,\n", " 'node_trace_marker_showscale': False\n", " }\n", "\n", " # initialise and display figure\n", " fig = plot_static_mapper_graph(pipeline, coords,\n", " layout, plotly_kwargs=plotly_kwargs)\n", " fig.update_traces(patch={'hoverlabel_bgcolor': node_colors,\n", " 'marker_color': node_colors})\n", " fig.show(config={'scrollZoom':True})\n", " \n", " return fig" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# configure Mapper pipeline\n", "pipeline = make_mapper_pipeline(\n", " filter_func=None,\n", " cover=CubicalCover(),\n", " clusterer=DBSCAN(),\n", ")\n", "\n", "# plot the graph arising from applying the pipeline\n", "fig = plot_mapper_graph_from_pipeline(pipeline)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see that the Mapper output preserved the geometrical structure of Santa Claus. \n", "However, using the identity as *filter* function could be prohibitive in high dimensional spaces!\n", "\n", "Let's try to build a more realistic example by using a 2D filter projection.\n", "\n", "## 2D Projection\n", "\n", "We map each vector of coordinates $\\vec{v}_i = (x_i, y_i, z_i)$ to the $x$ and $z$ axis via $f: \\vec{v} \\to (x_i, z_i)$. Since the mapped values lie in $\\mathbb{R}^2$, we use a 2-dimensional cover.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# configure Mapper pipeline\n", "pipeline = make_mapper_pipeline(\n", " filter_func=Projection(columns=[0,2]),\n", " cover=CubicalCover(),\n", " clusterer=DBSCAN(),\n", ")\n", "\n", "# plot the graph arising from applying the pipeline\n", "fig = plot_mapper_graph_from_pipeline(pipeline)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see that we lost information on the Santa Claus depth, but we can clearly distinguish the figure anyway. Let's try to get an even more minimal graph with a 1-dimensional projection.\n", "\n", "\n", "## Height as *filter* \n", "\n", "Let use the \"height\" of the Santa cloud as a filter function, i.e. we map each vector of coordinates $\\vec{v}_i = (x_i, y_i, z_i)$ to the $z$-axis via $f: \\vec{v} \\to z$. Since the mapped values lie in $\\mathbb{R}$, we use a 1-dimensional cover." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# configure Mapper pipeline\n", "pipeline = make_mapper_pipeline(\n", " filter_func=Projection(columns=2),\n", " cover=OneDimensionalCover(),\n", " clusterer=DBSCAN(),\n", ")\n", "\n", "# plot the graph arising from applying the pipeline\n", "fig = plot_mapper_graph_from_pipeline(pipeline)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The graph is not very satisfactory as we cannot distinguish any part of Santa Claus' body. Since the graph corresponds to a line of connected points, it means that the clustering algorithm always returned a unique cluster for each interval in the cover. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Tweak clustering parameters" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's try to use more **restrictive clustering parameters**." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# configure Mapper pipeline\n", "pipeline = make_mapper_pipeline(\n", " filter_func=Projection(columns=2),\n", " cover=OneDimensionalCover(),\n", " clusterer=DBSCAN(eps=0.06),\n", ")\n", "\n", "# plot the graph arising from applying the pipeline\n", "fig = plot_mapper_graph_from_pipeline(pipeline)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can now distinguish the two legs, but if you zoom in you can see that Santa now has 4 arms! " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Tweak overlap fraction of cover intervals" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We need to **increase the overlapping fraction** of the intervals." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# configure Mapper pipeline\n", "pipeline = make_mapper_pipeline(\n", " filter_func=Projection(columns=2),\n", " cover=OneDimensionalCover(overlap_frac=0.35),\n", " clusterer=DBSCAN(eps=0.06),\n", ")\n", "\n", "# plot the graph arising from applying the pipeline\n", "fig = plot_mapper_graph_from_pipeline(pipeline)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can clearly distinguish the legs, the arms and the head." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Tweak number of intervals in cover" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's see what happens if we **increase the number of intervals** in the cover to 20 (default is 10)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# configure Mapper pipeline\n", "pipeline = make_mapper_pipeline(\n", " filter_func=Projection(columns=2),\n", " cover=OneDimensionalCover(n_intervals=20, overlap_frac=0.35),\n", " clusterer=DBSCAN(eps=0.06),\n", ")\n", "\n", "# plot the graph arising from applying the pipeline\n", "fig = plot_mapper_graph_from_pipeline(pipeline)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we increase the number of bins in the cover to 20, we see the appearance of a node representing the hat and a branch separating the chest from the beard" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Interactive mode\n", "\n", "As we saw above, it takes several steps of interation to obtain a topological Mapper graph that accurately captures the underlying structure of the point cloud. To help speedup the iterations, [giotto-tda](https://giotto.ai) provides the option to dynamically change the Mapper parameters and see how this affects the output!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# configure Mapper pipeline\n", "pipeline = make_mapper_pipeline(\n", " filter_func=Projection(columns=2),\n", " cover=OneDimensionalCover(),\n", " clusterer=DBSCAN(),\n", ")\n", "\n", "# generate figure\n", "plot_interactive_mapper_graph(pipeline, coords)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" } }, "nbformat": 4, "nbformat_minor": 4 }