{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Topological feature extraction using ``VietorisRipsPersistence`` and ``PersistenceEntropy``\n", "\n", "In this notebook, we showcase the ease of use of one of the core components of ``giotto-tda``: ``VietorisRipsPersistence``, along with vectorization methods. We first list steps in a typical, topological-feature extraction routine and then show to encapsulate them with a standard ``scikit-learn``–like pipeline.\n", "\n", "If you are looking at a static version of this notebook and would like to run its contents, head over to [GitHub](https://github.com/giotto-ai/giotto-tda/blob/master/examples/vietoris_rips_quickstart.ipynb) and download the source.\n", "\n", "**License: AGPLv3**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Generate data\n", "\n", "Let's begin by generating 3D point clouds of spheres and tori, along with a label of 0 (1) for each sphere (torus). We also add noise to each point cloud, whose effect is to displace the points sampling the surfaces by a random amount in a random direction. **Note**: You will need the auxiliary module [generate_datasets.py](https://github.com/giotto-ai/giotto-tda/blob/master/examples/data/generate_datasets.py) to run this cell. You can change the second argument of ``generate_point_clouds`` to obtain a finer or coarser sampling, or tune the level of noise via the third." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from data.generate_datasets import make_point_clouds\n", "n_samples_per_class = 10\n", "point_clouds, labels = make_point_clouds(n_samples_per_class, 10, 0.1)\n", "point_clouds.shape\n", "print(f\"There are {point_clouds.shape[0]} point clouds in {point_clouds.shape[2]} dimensions, \"\n", " f\"each with {point_clouds.shape[1]} points.\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Calculate persistent homology\n", "\n", "Instantiate a ``VietorisRipsPersistence`` transformer and calculate so-called **persistence diagrams** for this collection of point clouds." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from gtda.homology import VietorisRipsPersistence\n", "\n", "VR = VietorisRipsPersistence(homology_dimensions=[0, 1, 2]) # Parameter explained in the text\n", "diagrams = VR.fit_transform(point_clouds)\n", "diagrams.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Important note**: ``VietorisRipsPersistence``, and all other \"persistent homology\" transformers in ``gtda.homology``, expect input in the form of a 3D array or, in some cases, a list of 2D arrays. For each entry in the input (here, for each point cloud in ``point_clouds``) they compute a topological summary which is also a 2D array, and then stack all these summaries into a single output 3D array. So, in our case, ``diagrams[i]`` represents the topology of ``point_clouds[i]``. ``diagrams[i]`` is interpreted as follows:\n", "- Each row is a triplet describing a single topological feature found in ``point_clouds[i]``.\n", "- The first and second entries (respectively) in the triplet denote the values of the \"filtration parameter\" at which the feature appears or disappears respectively. They are referred to as the \"birth\" and \"death\" values of the feature (respectively). The meaning of \"filtration parameter\" depends on the specific transformer, but in the case of ``VietorisRipsPersistence`` on point clouds it has the interpretation of a length scale.\n", "- A topological feature can be a connected component, 1D hole/loop, 2D cavity, or more generally $d$-dimensional \"void\" which exists in the data at scales between its birth and death values. The integer $d$ is the *homology dimension* (or degree) of the feature and is stored as the third entry in the triplet. In this example, the shapes should have 2D cavities so we explicitly tell ``VietorisRipsPersistence`` to look for these by using the ``homology_dimensions`` parameter!\n", "\n", "If we make one scatter plot per available homology dimension, and plot births and deaths as x- and y-coordinates of points in 2D, we end up with a 2D representation of ``diagrams[i]``, and the reason why it is called a persistence *diagram*:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from gtda.plotting import plot_diagram\n", "\n", "i = 0\n", "plot_diagram(diagrams[i])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The notebook [Plotting in giotto-tda](https://giotto-ai.github.io/gtda-docs/latest/notebooks/plotting_api.html) delves deeper into the plotting functions and class methods available in ``giotto-tda``." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Extract features\n", "\n", "Instantiate a ``PersistenceEntropy`` transformer and extract scalar features from the persistence diagrams." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from gtda.diagrams import PersistenceEntropy\n", "\n", "PE = PersistenceEntropy()\n", "features = PE.fit_transform(diagrams)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Use the new features in a standard classifier\n", "\n", "Leverage the compatibility with ``scikit-learn`` to perform a train-test split and score the features." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.ensemble import RandomForestClassifier\n", "from sklearn.model_selection import train_test_split\n", "\n", "X_train, X_valid, y_train, y_valid = train_test_split(features, labels)\n", "model = RandomForestClassifier()\n", "model.fit(X_train, y_train)\n", "model.score(X_valid, y_valid)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Encapsulate the steps above in a pipeline\n", "\n", "1. Define an end-to-end pipeline by chaining transformers from ``giotto-tda`` with ``scikit-learn`` ones\n", "2. Train-test split the input point cloud data and labels.\n", "3. Fir the pipeline on the training data.\n", "4. Score the fitted pipeline on the test data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.pipeline import make_pipeline\n", "\n", "steps = [VietorisRipsPersistence(homology_dimensions=[0, 1, 2]),\n", " PersistenceEntropy(),\n", " RandomForestClassifier()]\n", "pipeline = make_pipeline(*steps)\n", "\n", "pcs_train, pcs_valid, labels_train, labels_valid = train_test_split(point_clouds, labels)\n", "\n", "pipeline.fit(pcs_train, labels_train)\n", "\n", "pipeline.score(pcs_valid, labels_valid)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 4 }