## Overview

This notebook demonstrates how to use Monet to cluster scRNA-Seq data with the [**Galapagos** (Wagner, 2019)](https://www.biorxiv.org/content/10.1101/770388v3) method. Galapagos performs density-based clustering (using DBSCAN) directly on the t-SNE result. It's a straightforward approach based on the notion that t-SNE plots provide a good starting point for defining cell populations. The approach is limited in its ability to resolve closely related cell types that don't separate well in t-SNE plots, but it is simple and transparent, and it tends to avoid overclustering.

### Setting up the notebook

In [1]:
# change notebook width and font
from IPython.display import HTML, display
display(HTML("""<style>
    /* source: http://stackoverflow.com/a/24207353 */
    .container { width:95% !important; }
    div.prompt, div.CodeMirror pre, div.output_area pre { font-family:'Hack', monospace; font-size: 10.5pt; }
    </style>"""))

from monet import util
_LOGGER = util.configure_logger()

# the following is to allow embedding of plotly figures
from plotly.offline import init_notebook_mode
import plotly.graph_objs as go
init_notebook_mode(connected=True)

### Step 1: Perform t-SNE

Here, we perform t-SNE as in the previous tutorial.

In [2]:
import gc

from monet import ExpMatrix
from monet import visualize

expression_file = 'data/v3_human_pbmc_10k_expression.npz'

matrix = ExpMatrix.load_npz(expression_file)

fig, tsne_scores = visualize.tsne_plot(matrix, title='PBMC data')
# by default, tsne_plot() performs PCA with 30 principal components
# this can be changed, e.g. to 50, using tsne_plot(..., num_components=50)

fig.show()

# free up memory
del matrix; gc.collect()

[2020-06-17 14:30:28] (monet.core.exp_matrix) INFO: Loaded expression matrix with 10681 cells and 16319 genes -- .npz format, 36.7 MB (hash: f9d7fac20f4de6184ff55388c267699a).
[2020-06-17 14:30:28] (root) INFO: No Monet model provided, performing PCA to determine first 30principal components...
[2020-06-17 14:30:28] (monet.latent.pca_model) INFO: Converted matrix to float32 data type.
[2020-06-17 14:30:34] (monet.latent.pca_model) INFO: The PCA took 1.5 s.
[2020-06-17 14:30:34] (monet.latent.pca_model) INFO: The fraction of variance explained by the 30 selected PCs is 33.4 %.
[2020-06-17 14:30:34] (root) INFO: Performing t-SNE...
[2020-06-17 14:31:10] (root) INFO: t-SNE took 35.6 s.


3541

### Step 2: Clustering with *DBSCAN*

We'll now apply DBSCAN ([Ester et al., 1996](https://www.aaai.org/Papers/KDD/1996/KDD96-037.pdf)), a density-based clustering algorithm, to the t-SNE result. DBSCAN has two parameters, called *Eps* and *MinPts* (`eps` and `min_samples` in scikit-learn). *Eps* defines a radius for finding neighbors, and *MinPts* defines the minimum number of points (here: cells) that need to fall within that radius for a cluster to be formed. Cells that don't end up in any cluster are considered "outliers". You can read more about the DBSCAN algorithm in the [scikit-learn User Guide](https://scikit-learn.org/stable/modules/clustering.html#dbscan), and visit Naftali Harris' website to [look at some nice demonstrations of DBSCAN on various datasets](https://www.naftaliharris.com/blog/visualizing-dbscan-clustering/).
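To make the roles of *Eps* and *MinPts* concrete, here is a minimal pure-Python sketch of the DBSCAN idea on a handful of 2D points. It is a teaching example only, not the scikit-learn or Monet implementation; the function name `dbscan` and the toy data are made up for illustration. As in scikit-learn, a point counts as its own neighbor, and outliers get the label `-1`.

```python
import math

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN sketch; returns one label per point (-1 = outlier)."""
    n = len(points)
    # All points within distance eps of each point (including the point itself).
    neighbors = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_pts:
            # Not a core point: mark as outlier for now.
            # It may still be picked up later as a border point of a cluster.
            labels[i] = -1
            continue
        # Core point: start a new cluster and grow it through its neighbors.
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # former outlier becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_pts:
                queue.extend(neighbors[j])  # only core points expand the cluster
    return labels

# Toy data: two tight groups of four points and one isolated point.
points = [
    (0, 0), (0, 1), (1, 0), (1, 1),
    (10, 10), (10, 11), (11, 10), (11, 11),
    (50, 50),
]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)  # two clusters (0 and 1) and one outlier (-1)
```

The isolated point has no neighbors within *Eps* other than itself, so it never reaches *MinPts* and stays an outlier, which is the behavior described above for cells that don't fall into any cluster.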

In Monet, you specify *Eps* as a fraction of the diameter of the t-SNE plot, using the `eps_frac` parameter. Here, I use the term "diameter" to refer to the distance from the top-left corner to the bottom-right corner of the t-SNE plot. For example, setting `eps_frac=0.03` (the default) means that *Eps* will be set to 3% of the diameter. Furthermore, you specify *MinPts* as a fraction of the total number of cells available, using the `min_cells_frac` parameter. So setting `min_cells_frac=0.01` (the default) means that *MinPts* will be set to 1% of the total number of cells (rounded up to the next integer).
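The conversion from fractions to absolute DBSCAN parameters can be sketched as follows. This is an illustration of the description above, not Monet's actual code: the helper name `dbscan_params` is hypothetical, and it assumes that the "diameter" is the diagonal of the bounding box of the t-SNE coordinates.

```python
import math

def dbscan_params(coords, eps_frac=0.03, min_cells_frac=0.01):
    """Hypothetical helper: turn the two fractions into Eps and MinPts."""
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    # Assumed definition of "diameter": corner-to-corner distance of the plot.
    diameter = math.hypot(max(xs) - min(xs), max(ys) - min(ys))
    eps = eps_frac * diameter
    # Fraction of all cells, rounded up to the next integer.
    min_pts = math.ceil(min_cells_frac * len(coords))
    return eps, min_pts

# Made-up coordinates: 10681 points spanning a 300 x 400 bounding box.
coords = [(0.0, 0.0), (300.0, 400.0)] + [(150.0, 200.0)] * 10679
eps, min_pts = dbscan_params(coords)
print(f"Eps = {eps:.2f}, MinPts = {min_pts}")
```

For the 10681 cells in this dataset, `min_cells_frac=0.01` gives 106.81, which rounds up to *MinPts* = 107. This agrees with the `minPts` value that Monet reports in its log output when run with the default settings.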

In [3]:
from monet.visualize import plot_cells
from monet.cluster import cluster_cells_dbscan

eps_frac = 0.03
min_cells_frac = 0.01

cell_labels, clusters = cluster_cells_dbscan(
    tsne_scores, eps_frac=eps_frac, min_cells_frac=min_cells_frac)

cluster_colors = {
    'Outliers': 'lightgray',
}

fig = plot_cells(
    tsne_scores,
    cell_labels=cell_labels,
    cluster_order=clusters,
    cluster_colors=cluster_colors,
    width=850)

fig.show()

[2020-06-17 14:31:10] (monet.cluster.galapagos) INFO: Performing DBSCAN with minPts=107 and eps=6.57.
[2020-06-17 14:31:11] (monet.cluster.galapagos) INFO: Clustering with DBSCAN took 0.9 s.


These clusters seem a little bit too broad. By tweaking the DBSCAN parameters, we can increase the clustering resolution.

In [4]:
from monet.visualize import plot_cells
from monet.cluster import cluster_cells_dbscan

# previous (default) values: eps_frac = 0.03, min_cells_frac = 0.01

eps_frac = 0.023
min_cells_frac = 0.007

cell_labels, clusters = cluster_cells_dbscan(
    tsne_scores, eps_frac=eps_frac, min_cells_frac=min_cells_frac)

cluster_colors = {
    'Outliers': 'lightgray',
}

fig = plot_cells(
    tsne_scores,
    cell_labels=cell_labels,
    cluster_order=clusters,
    cluster_colors=cluster_colors,
    width=850)

fig.show()

[2020-06-17 14:31:11] (monet.cluster.galapagos) INFO: Performing DBSCAN with minPts=75 and eps=5.04.
[2020-06-17 14:31:12] (monet.cluster.galapagos) INFO: Clustering with DBSCAN took 1.0 s.


Now that we are happy with the clustering result, we can save it to disk.

In [5]:
from monet import util

util.save_cell_labels(cell_labels, 'output/v3_human_pbmc_10k_clustering.tsv')

[2020-06-17 14:31:13] (monet.util.files) INFO: Saved labels for 10681 cells to tab-delimited plain-text file.
