{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Mapper Tutorial - the Santa Cloud\n",
    "\n",
    "_Created by:_ Davide Burba, December 2019\n",
    "\n",
    "In this notebook, we apply the [Mapper algorithm](https://research.math.osu.edu/tgda/mapperPBG.pdf)\n",
    "to point cloud data by using [giotto-tda](https://giotto.ai),\n",
    "an open-source Python library for topological data analysis."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Import libraries"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%load_ext autoreload\n",
    "%autoreload 2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# data wrangling\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "\n",
    "# tda magic\n",
    "from gtda.mapper.filter import Projection\n",
    "from gtda.mapper.cover import (\n",
    "    OneDimensionalCover, \n",
    "    CubicalCover\n",
    ")\n",
    "\n",
    "\n",
    "from gtda.mapper.pipeline import make_mapper_pipeline\n",
    "from gtda.mapper.visualization import (\n",
    "    plot_static_mapper_graph,\n",
    "    plot_interactive_mapper_graph\n",
    ")\n",
    "\n",
    "# data viz\n",
    "import plotly.graph_objects as go\n",
    "from matplotlib.colors import to_rgba_array, to_rgba\n",
    "\n",
    "# ml tools\n",
    "from sklearn.cluster import DBSCAN\n",
    "\n",
    "import warnings\n",
    "\n",
    "warnings.filterwarnings(\"ignore\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Load and visualise data\n",
    "\n",
    "The starting point for most Mapper analyses is a point cloud in an $n$-dimensional Euclidean space. For our purposes we'll use a dataset containing 20,000 points sampled from a Santa Claus shape in 3-dimensions.\n",
    "\n",
    "\n",
    "_**Note:**_ The dataset has been sampled from a mesh object available [here](https://free3d.com/3d-model/santa-clau-77751.html) through the usage of [CloudCompare](https://www.cloudcompare.org)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data = pd.read_csv(\"data.csv\")\n",
    "data.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As we can see, the data consists of 3-dimensional coordinates $ v_i = (x_i, y_i, z_i)$ for each point, along with a label for it's colour. Let's have a look at this 3-dimensional point cloud:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": false
   },
   "outputs": [],
   "source": [
    "fig = go.Figure(\n",
    "    data=[\n",
    "        go.Scatter3d(\n",
    "            x=data.x,\n",
    "            y=data.y,\n",
    "            z=data.z,\n",
    "            mode=\"markers\",\n",
    "            marker=dict(size=1, color=data.color,),\n",
    "        )\n",
    "    ]\n",
    ")\n",
    "fig.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# The Mapper algorithm\n",
    "\n",
    "Let's consider an $n$-dimensional point cloud, i.e. a dataset with $n$ numerical features. The Mapper algorithm follows three main steps:\n",
    "\n",
    "1. **_Filtering:_** map each data point $x \\in \\mathbb{R}^n$ to a low dimensional space $\\mathbb{R}^m$ through a filter function $f: \\mathbb{R}^n \\to \\mathbb{R}^m$\n",
    "2. **_Covering:_** cover the mapped values $y = f(x)$ with overlapping intervals.\n",
    "3. **_Clustering:_** for each interval, calculate the pre-image $x = f^{-1}(y)$ and apply a clustering algorithm on the set of points belonging to each pre-image.\n",
    "\n",
    "The topological graph is then composed of:\n",
    "- **nodes**: clusters\n",
    "- **edges**: non-empty intersections between clusters\n",
    "\n",
    "Even if its formulation could seem a bit intimidating, the best way to intuitively understand how the Mapper algorithm works is to use it and \"play\" with its parameters. The API of [giotto-tda](https://giotto.ai) is scikit-learn compatible and provides a very convenient way to fit the Mapper through a pipeline object.\n",
    "\n",
    "We will apply the Mapper algorithm to the Santa Claus point cloud (the so-called \"Santa Cloud\"). Our goal is to get a minimal graph which preserves the topological properties of the original figure. \n",
    "\n",
    "First, let's define an array of coordinates to act as the point cloud to analyse:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "coords = data[[\"x\",\"y\",\"z\"]].values"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Identity as *filter*\n",
    "\n",
    "The easiest way to produce a meaningful graph from a dataset is to use the identity as *filter* function. \n",
    "\n",
    "Formally, we map each vector of coordinates $\\vec{v}_i$ to itself via $f: \\vec{v} \\to \\vec{v}$. Since the mapped values lie in $\\mathbb{R}^3$, we use a 3-dimensional cover."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def plot_mapper_graph_from_pipeline(pipe):\n",
    "    \"\"\" A helper function, that plots the graph\n",
    "    resulting from applying the pipeline `pipe`\n",
    "    on the data.\n",
    "    Parameters\n",
    "    -----------\n",
    "    pipe : \n",
    "    \"\"\"\n",
    "    # generate topological graph from point cloud\n",
    "    graph = pipeline.fit_transform(coords)\n",
    "\n",
    "    # get cluster member indices\n",
    "    node_elements = graph[\"node_metadata\"][\"node_elements\"]\n",
    "\n",
    "    # configure choice of layout (x,z values)\n",
    "    layout = np.array([np.mean(coords[el], axis=0)[[0,2]] for el in node_elements])\n",
    "\n",
    "    # define node coloring\n",
    "    node_colors = np.array([data.loc[el, \"color\"].value_counts().index[0] for el in node_elements])\n",
    "\n",
    "    plotly_kwargs = {\n",
    "        'node_trace_marker_colorscale': None,\n",
    "        'node_trace_marker_showscale': False\n",
    "    }\n",
    "\n",
    "    # initialise and display figure\n",
    "    fig = plot_static_mapper_graph(pipeline, coords,\n",
    "                                   layout, plotly_kwargs=plotly_kwargs)\n",
    "    fig.update_traces(patch={'hoverlabel_bgcolor': node_colors,\n",
    "                       'marker_color': node_colors})\n",
    "    fig.show(config={'scrollZoom':True})\n",
    "    \n",
    "    return fig"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# configure Mapper pipeline\n",
    "pipeline = make_mapper_pipeline(\n",
    "    filter_func=None,\n",
    "    cover=CubicalCover(),\n",
    "    clusterer=DBSCAN(),\n",
    ")\n",
    "\n",
    "# plot the graph arising from applying the pipeline\n",
    "fig = plot_mapper_graph_from_pipeline(pipeline)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can see that the Mapper output preserved the geometrical structure of Santa Claus. \n",
    "However, using the identity as *filter* function could be prohibitive in high dimensional spaces!\n",
    "\n",
    "Let's try to build a more realistic example by using a 2D filter projection.\n",
    "\n",
    "## 2D Projection\n",
    "\n",
    "We map each vector of coordinates $\\vec{v}_i = (x_i, y_i, z_i)$ to the $x$ and $z$ axis via $f: \\vec{v} \\to (x_i, z_i)$. Since the mapped values lie in $\\mathbb{R}^2$, we use a 2-dimensional cover.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# configure Mapper pipeline\n",
    "pipeline = make_mapper_pipeline(\n",
    "    filter_func=Projection(columns=[0,2]),\n",
    "    cover=CubicalCover(),\n",
    "    clusterer=DBSCAN(),\n",
    ")\n",
    "\n",
    "# plot the graph arising from applying the pipeline\n",
    "fig = plot_mapper_graph_from_pipeline(pipeline)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can see that we lost information on the Santa Claus depth, but we can clearly distinguish the figure anyway. Let's try to get an even more minimal graph with a 1-dimensional projection.\n",
    "\n",
    "\n",
    "## Height as *filter* \n",
    "\n",
    "Let use the \"height\" of the Santa cloud as a filter function, i.e. we map each vector of coordinates $\\vec{v}_i = (x_i, y_i, z_i)$ to the $z$-axis via $f: \\vec{v} \\to z$. Since the mapped values lie in $\\mathbb{R}$, we use a 1-dimensional cover."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# configure Mapper pipeline\n",
    "pipeline = make_mapper_pipeline(\n",
    "    filter_func=Projection(columns=2),\n",
    "    cover=OneDimensionalCover(),\n",
    "    clusterer=DBSCAN(),\n",
    ")\n",
    "\n",
    "# plot the graph arising from applying the pipeline\n",
    "fig = plot_mapper_graph_from_pipeline(pipeline)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The graph is not very satisfactory as we cannot distinguish any part of Santa Claus' body. Since the graph corresponds to a line of connected points, it means that the clustering algorithm always returned a unique cluster for each interval in the cover. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Tweak clustering parameters"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's try to use more **restrictive clustering parameters**."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# configure Mapper pipeline\n",
    "pipeline = make_mapper_pipeline(\n",
    "    filter_func=Projection(columns=2),\n",
    "    cover=OneDimensionalCover(),\n",
    "    clusterer=DBSCAN(eps=0.06),\n",
    ")\n",
    "\n",
    "# plot the graph arising from applying the pipeline\n",
    "fig = plot_mapper_graph_from_pipeline(pipeline)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can now distinguish the two legs, but if you zoom in you can see that Santa now has 4 arms! "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Tweak overlap fraction of cover intervals"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We need to **increase the overlapping fraction** of the intervals."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# configure Mapper pipeline\n",
    "pipeline = make_mapper_pipeline(\n",
    "    filter_func=Projection(columns=2),\n",
    "    cover=OneDimensionalCover(overlap_frac=0.35),\n",
    "    clusterer=DBSCAN(eps=0.06),\n",
    ")\n",
    "\n",
    "# plot the graph arising from applying the pipeline\n",
    "fig = plot_mapper_graph_from_pipeline(pipeline)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we can clearly distinguish the legs, the arms and the head."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Tweak number of intervals in cover"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's see what happens if we **increase the number of intervals** in the cover to 20 (default is 10)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# configure Mapper pipeline\n",
    "pipeline = make_mapper_pipeline(\n",
    "    filter_func=Projection(columns=2),\n",
    "    cover=OneDimensionalCover(n_intervals=20, overlap_frac=0.35),\n",
    "    clusterer=DBSCAN(eps=0.06),\n",
    ")\n",
    "\n",
    "# plot the graph arising from applying the pipeline\n",
    "fig = plot_mapper_graph_from_pipeline(pipeline)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If we increase the number of bins in the cover to 20, we see the appearance of a node representing the hat and a branch separating the chest from the beard"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Interactive mode\n",
    "\n",
    "As we saw above, it takes several steps of interation to obtain a topological Mapper graph that accurately captures the underlying structure of the point cloud. To help speedup the iterations, [giotto-tda](https://giotto.ai) provides the option to dynamically change the Mapper parameters and see how this affects the output!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# configure Mapper pipeline\n",
    "pipeline = make_mapper_pipeline(\n",
    "    filter_func=Projection(columns=2),\n",
    "    cover=OneDimensionalCover(),\n",
    "    clusterer=DBSCAN(),\n",
    ")\n",
    "\n",
    "# generate figure\n",
    "plot_interactive_mapper_graph(pipeline, coords)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}