## Overview

This notebook demonstrates how to train a Monet model on an scRNA-Seq dataset. Monet models are encapsulated by `MonetModel` objects, and are described in the [Monet paper (Wagner, 2020)](https://www.biorxiv.org/content/10.1101/2020.06.08.140673v2). After training the model on one PBMC dataset, we will see how this model can serve as the basis for t-SNE analyses of arbitrary PBMC datasets. More generally, Monet models are useful for analyses that aim to integrate data from multiple scRNA-Seq datasets from the same tissue type, which will be demonstrated in the following tutorials.

### Setting up the notebook

In [1]:
# change notebook width and font
from IPython.display import HTML, display
display(HTML("""<style>
    /* source: http://stackoverflow.com/a/24207353 */
    .container { width:95% !important; }
    div.prompt, div.CodeMirror pre, div.output_area pre { font-family:'Hack', monospace; font-size: 10.5pt; }
    </style>"""))

from monet import util

_LOGGER = util.configure_logger()

# the following is to allow embedding of plotly figures
from plotly.offline import init_notebook_mode
import plotly.graph_objs as go
init_notebook_mode(connected=True)

## Train a Monet model

Training the model consists of two steps. The first step performs [molecular cross-validation (MCV; Batson et al., 2019)](https://www.biorxiv.org/content/10.1101/786269v1) to infer the dimensionality of the data. Monet performs a grid search using 5-fold MCV, which is somewhat time-consuming. As the log output below shows, for this dataset of 10,681 cells, this step took approximately 20 minutes to complete. The second step is a nearest-neighbor aggregation step, which is quite fast.
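To make the idea behind MCV more concrete, here is a minimal sketch using only NumPy. It is not Monet's implementation: the function names `mcv_split` and `mcv_loss`, the plain truncated SVD, the squared-error loss, and the toy data are all assumptions chosen for illustration. The sketch splits UMI counts into a training and a validation matrix by binomial thinning, denoises the training matrix with `k` components, and measures how well the rescaled result predicts the held-out counts.

```python
import numpy as np

def mcv_split(counts, train_frac=0.9, seed=0):
    """Randomly assign each observed molecule to the training or validation set."""
    rng = np.random.default_rng(seed)
    train = rng.binomial(counts, train_frac)  # binomial thinning of each count
    return train, counts - train

def mcv_loss(train, val, num_components, train_frac=0.9):
    """Mean squared error between a rank-k denoised training matrix and the validation counts."""
    u, s, vt = np.linalg.svd(train.astype(float), full_matrices=False)
    k = num_components
    denoised = (u[:, :k] * s[:k]) @ vt[:k]
    # rescale from training depth to validation depth before comparing
    predicted_val = denoised * (1.0 - train_frac) / train_frac
    return float(np.mean((predicted_val - val) ** 2))

# toy data (hypothetical): 200 cells x 50 genes with a rank-3 structure
rng = np.random.default_rng(42)
rates = rng.gamma(2.0, 1.0, size=(200, 3)) @ rng.gamma(2.0, 1.0, size=(3, 50))
counts = rng.poisson(rates)

train, val = mcv_split(counts)
losses = {k: mcv_loss(train, val, k) for k in (1, 3, 10, 30)}
best_k = min(losses, key=losses.get)
print(losses, best_k)
```

With too few components the model misses real structure, and with too many it starts fitting sampling noise that is not shared with the validation counts, so the validation loss is typically lowest at an intermediate number of components. A grid search like the one Monet performs repeats this evaluation for several candidate values and several random splits.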

After training is complete, we save the trained model to disk using the `save_pickle()` method.

In [2]:
import gc

from monet import ExpMatrix
from monet import MonetModel

expression_file = 'data/v3_human_pbmc_10k_expression.npz'
monet_model_file = 'output/v3_human_pbmc_10k_monet_model.pickle'

matrix = ExpMatrix.load_npz(expression_file)

# initialize and train the model
monet_model = MonetModel()
monet_model.fit(matrix)

# save the model to disk
monet_model.save_pickle(monet_model_file)

# free up memory
del matrix; gc.collect()

[2020-06-17 10:35:53] (monet.core.exp_matrix) INFO: Loaded expression matrix with 10681 cells and 16319 genes -- .npz format, 36.7 MB (hash: f9d7fac20f4de6184ff55388c267699a).
[2020-06-17 10:35:53] (monet.latent.monet_model) INFO: Beginning of Phase I (Estimate dimensionality)...
[2020-06-17 10:35:53] (monet.latent.monet_model) INFO: Using molecular cross-validation to determine the number of PCs...
[2020-06-17 10:35:53] (monet.latent.monet_model) INFO: Testing coarse grid of num_component values...
[2020-06-17 10:35:53] (monet.latent.monet_model) INFO: Testing grid of 10 num_component values...
[2020-06-17 10:35:53] (monet.latent.monet_model) INFO: Now processing split 1/5...
[2020-06-17 10:35:53] (monet.latent.util) INFO: Data will be split into datasets containing 90.4% and 10.0% of transcripts, respectively.
[2020-06-17 10:36:00] (monet.latent.util) INFO: Done splitting data!
[2020-06-17 10:36:07] (monet.latent.pca_model) INFO: The PCA took 2.3 s.
[2020-06-17 10:36:08] (monet.laten

[2020-06-17 10:48:03] (monet.latent.monet_model) INFO: Now processing split 2/5...
[2020-06-17 10:48:03] (monet.latent.util) INFO: Data will be split into datasets containing 90.4% and 10.0% of transcripts, respectively.
[2020-06-17 10:48:11] (monet.latent.util) INFO: Done splitting data!
[2020-06-17 10:48:19] (monet.latent.pca_model) INFO: The PCA took 2.3 s.
[2020-06-17 10:48:19] (monet.latent.pca_model) INFO: The fraction of variance explained by the 100 selected PCs is 35.8 %.
[2020-06-17 10:48:25] (monet.latent.util) INFO: Testing value 1/6 (23 PCs)...
[2020-06-17 10:48:35] (monet.latent.util) INFO: Testing value 2/6 (26 PCs)...
[2020-06-17 10:48:46] (monet.latent.util) INFO: Testing value 3/6 (29 PCs)...
[2020-06-17 10:48:56] (monet.latent.util) INFO: Testing value 4/6 (31 PCs)...
[2020-06-17 10:49:07] (monet.latent.util) INFO: Testing value 5/6 (34 PCs)...
[2020-06-17 10:49:17] (monet.latent.util) INFO: Testing value 6/6 (37 PCs)...
[2020-06-17 10:49:28] (monet.latent.monet_mode

[2020-06-17 10:55:56] (monet.latent.monet_model) INFO: Fitting the Monet model took 1203.6 s (20.1 min).
[2020-06-17 10:55:56] (monet.latent.monet_model) INFO: Saved Monet model to pickle file "output/v3_human_pbmc_10k_monet_model.pickle".


0

We can inspect the MCV results using the `plot_mcv_results()` method. First, we load the Monet model using the `MonetModel.load_pickle()` method.
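As a rough guide to reading such a plot, the sketch below summarizes MCV losses across folds and picks the grid value with the lowest mean loss. The helper `summarize_mcv` and the loss values are hypothetical and synthetic; they are not taken from the Monet API or from the run above.

```python
import numpy as np

def summarize_mcv(num_components, fold_losses):
    """Return the grid value with the lowest mean loss, plus mean and standard error per value."""
    fold_losses = np.asarray(fold_losses, dtype=float)  # shape: (folds, grid values)
    mean_loss = fold_losses.mean(axis=0)
    sem = fold_losses.std(axis=0, ddof=1) / np.sqrt(fold_losses.shape[0])
    best = int(np.argmin(mean_loss))
    return num_components[best], mean_loss, sem

# synthetic losses (made up for illustration): a convex curve plus small noise, 5 folds
grid = [10, 20, 30, 40, 50]
rng = np.random.default_rng(0)
base = (np.array(grid) - 30.0) ** 2 / 100.0 + 1.0
fold_losses = base + rng.normal(0.0, 0.01, size=(5, len(grid)))

best_k, mean_loss, sem = summarize_mcv(grid, fold_losses)
print(best_k)
```

The standard error across folds gives a sense of how clearly the minimum is resolved; a flat curve with overlapping error bars suggests that neighboring values would perform similarly.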

In [8]:
from monet import MonetModel

monet_model_file = 'data/v3_human_pbmc_10k_monet_model.pickle'
#monet_model_file = 'output/v3_human_pbmc_10k_monet_model.pickle'

monet_model = MonetModel.load_pickle(monet_model_file)
fig = monet_model.plot_mcv_results()
fig.layout.title = 'MCV result'
fig.show()

[2020-06-17 10:57:54] (monet.latent.monet_model) INFO: Loaded Monet model from pickle file "data/v3_human_pbmc_10k_monet_model.pickle".
30


## Performing t-SNE for arbitrary* PBMC datasets based on the trained Monet model

The idea behind Monet models is that they represent latent spaces for a particular tissue, and can form the basis for analyses of arbitrary other scRNA-Seq datasets. First, we will perform a t-SNE using the same dataset that the model was trained on. Then, we'll perform another t-SNE with a PBMC dataset generated using an earlier version of the 10x Genomics Chromium chemistry (v2).

*\*Limited to datasets containing UMI counts. The theoretical framework underlying Monet does not extend to scRNA-Seq technologies that do not incorporate UMIs.*
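The following sketch illustrates what projecting a new dataset onto a previously learned latent space can look like. It is a simplified stand-in, not Monet's code: the functions `fit_pca` and `project`, the scaling to the median training transcript count, and the toy data are assumptions, and the sketch presumes that both matrices have cells as rows and the same genes in the same order. The Freeman-Tukey transform, sqrt(x) + sqrt(x + 1), is used here because the log output refers to "FT-transformed data".

```python
import numpy as np

def freeman_tukey(x):
    """Freeman-Tukey variance-stabilizing transform for count data."""
    return np.sqrt(x) + np.sqrt(x + 1.0)

def fit_pca(train_counts, num_components):
    """Learn a latent space from training counts (cells x genes)."""
    totals = train_counts.sum(axis=1)
    target = np.median(totals)  # common transcript count per cell
    scaled = train_counts * (target / totals)[:, None]
    transformed = freeman_tukey(scaled)
    mean = transformed.mean(axis=0)
    _, _, vt = np.linalg.svd(transformed - mean, full_matrices=False)
    return {"target": target, "mean": mean, "components": vt[:num_components]}

def project(model, counts):
    """Scale, transform, and project counts onto the learned components."""
    totals = counts.sum(axis=1)
    scaled = counts * (model["target"] / totals)[:, None]
    transformed = freeman_tukey(scaled)
    return (transformed - model["mean"]) @ model["components"].T

# toy data (hypothetical): the new dataset has lower sequencing depth than the training data
rng = np.random.default_rng(1)
train_counts = rng.poisson(5.0, size=(100, 40))
new_counts = rng.poisson(3.0, size=(60, 40))

model = fit_pca(train_counts, num_components=5)
train_scores = project(model, train_counts)
new_scores = project(model, new_counts)
print(train_scores.shape, new_scores.shape)
```

Because the new cells are rescaled to the depth the model was trained at, datasets with different sequencing depths end up in a comparable coordinate system. The resulting scores could then be handed to any t-SNE implementation, for example `sklearn.manifold.TSNE`.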



In [9]:
import gc

from monet import ExpMatrix
from monet import MonetModel
from monet import visualize

expression_file = 'data/v3_human_pbmc_10k_expression.npz'
monet_model_file = 'data/v3_human_pbmc_10k_monet_model.pickle'
#monet_model_file = 'output/v3_human_pbmc_10k_monet_model.pickle'

matrix = ExpMatrix.load_npz(expression_file)

monet_model = MonetModel.load_pickle(monet_model_file)

fig, tsne_scores = visualize.tsne_plot(
    matrix, monet_model,
    title='Training data (PBMC v3)')
fig.show()

# free up memory
del matrix; gc.collect()

[2020-06-17 10:59:12] (monet.core.exp_matrix) INFO: Loaded expression matrix with 10681 cells and 16319 genes -- .npz format, 36.7 MB (hash: f9d7fac20f4de6184ff55388c267699a).
[2020-06-17 10:59:12] (monet.latent.monet_model) INFO: Loaded Monet model from pickle file "data/v3_human_pbmc_10k_monet_model.pickle".
[2020-06-17 10:59:12] (root) INFO: Using Monet model to project data onto a 30-dimensional latent space...
[2020-06-17 10:59:14] (monet.latent.pca_model) INFO: Expression profiles will be scaled 1.00x (on average).
[2020-06-17 10:59:18] (monet.latent.pca_model) INFO: Projection onto 30 PCs retained 32.1 % of the total variance in the scaled and FT-transformed data.
[2020-06-17 10:59:18] (root) INFO: Performing t-SNE...
[2020-06-17 10:59:40] (root) INFO: t-SNE took 21.8 s.


18569

In [10]:
import gc

from monet import ExpMatrix
from monet import MonetModel
from monet import visualize

expression_file = 'data/v2_human_pbmc_8k_expression.npz'
monet_model_file = 'data/v3_human_pbmc_10k_monet_model.pickle'
#monet_model_file = 'output/v3_human_pbmc_10k_monet_model.pickle'

matrix = ExpMatrix.load_npz(expression_file)

monet_model = MonetModel.load_pickle(monet_model_file)

fig, tsne_scores = visualize.tsne_plot(
    matrix, monet_model,
    title='New data (PBMC v2)')
fig.show()

# free up memory
del matrix; gc.collect()

[2020-06-17 10:59:48] (monet.core.exp_matrix) INFO: Loaded expression matrix with 8381 cells and 15510 genes -- .npz format, 19.9 MB (hash: c299645ab748c9dbe4030fc4cace369b).
[2020-06-17 10:59:48] (monet.latent.monet_model) INFO: Loaded Monet model from pickle file "data/v3_human_pbmc_10k_monet_model.pickle".
[2020-06-17 10:59:48] (root) INFO: Using Monet model to project data onto a 30-dimensional latent space...
[2020-06-17 10:59:49] (monet.latent.pca_model) INFO: Expression profiles will be scaled 1.57x (on average).
[2020-06-17 10:59:52] (monet.latent.pca_model) INFO: Projection onto 30 PCs retained 20.8 % of the total variance in the scaled and FT-transformed data.
[2020-06-17 10:59:52] (root) INFO: Performing t-SNE...
[2020-06-17 11:00:16] (root) INFO: t-SNE took 23.9 s.


18471