{ "cells": [ { "cell_type": "markdown", "id": "04fa53a2", "metadata": {}, "source": [ "# Synthetic expression data from asynchronous random walks on star network\n", "\n", "In this series of notebooks, we demonstrate how scBoolSeq can be employed to generate synthetic scRNA-Seq datasets from Boolean states of trajectories of mechanistic Boolean models.\n", "\n", "This notebook focuses on a toy model where a transcription factor progressively activates its target genes." ] }, { "cell_type": "code", "execution_count": 1, "id": "19cbcd2e", "metadata": {}, "outputs": [ { "data": { "text/markdown": [ "This notebook has been executed using the docker image `bnediction/scboolseq:v0`" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "import pandas as pd\n", "import numpy as np\n", "import random\n", "from colomoto.minibn import * # for Boolean network manipulation\n", "from scboolseq import scBoolSeq\n", "\n", "# set seed for reproducible results\n", "_rng_seed = 19834650\n", "# use a Generator instead of numpy's singleton\n", "_rng = np.random.default_rng(_rng_seed)\n", "random.seed(_rng_seed)" ] }, { "cell_type": "markdown", "id": "a4b051c2", "metadata": {}, "source": [ "## Load Boolean network model" ] }, { "cell_type": "code", "execution_count": 2, "id": "4a9e3bae", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "gene1 <- tf\n", "gene10 <- tf\n", "gene2 <- tf\n", "gene3 <- tf\n", "gene4 <- tf\n", "gene5 <- tf\n", "gene6 <- tf\n", "gene7 <- tf\n", "gene8 <- tf\n", "gene9 <- tf\n", "tf <- 1" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bn = BooleanNetwork.load(\"models/star.bnet\")\n", "bn" ] }, { "cell_type": "code", "execution_count": 3, "id": "09653812", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "# computing graph layout...\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "tf\n", "\n", "tf\n", "\n", "\n", "\n", "gene1\n", "\n", "gene1\n", "\n", "\n", "\n", "tf->gene1\n", "\n", "\n", "+\n", "\n", "\n", "\n", "gene2\n", "\n", "gene2\n", "\n", "\n", "\n", "tf->gene2\n", "\n", "\n", "+\n", "\n", "\n", "\n", "gene3\n", "\n", "gene3\n", "\n", "\n", "\n", "tf->gene3\n", "\n", "\n", "+\n", "\n", "\n", "\n", "gene4\n", "\n", "gene4\n", "\n", "\n", "\n", "tf->gene4\n", "\n", "\n", "+\n", "\n", "\n", "\n", "gene5\n", "\n", "gene5\n", "\n", "\n", "\n", "tf->gene5\n", "\n", "\n", "+\n", "\n", "\n", "\n", "gene6\n", "\n", "gene6\n", "\n", "\n", "\n", "tf->gene6\n", "\n", "\n", "+\n", "\n", "\n", "\n", "gene7\n", "\n", "gene7\n", "\n", "\n", "\n", "tf->gene7\n", "\n", "\n", "+\n", "\n", "\n", "\n", "gene8\n", "\n", "gene8\n", "\n", "\n", "\n", "tf->gene8\n", "\n", "\n", "+\n", "\n", "\n", "\n", "gene9\n", "\n", "gene9\n", "\n", "\n", "\n", "tf->gene9\n", "\n", "\n", "+\n", "\n", "\n", "\n", "gene10\n", "\n", "gene10\n", "\n", "\n", "\n", "tf->gene10\n", "\n", "\n", "+\n", "\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bn.influence_graph()" ] }, { "cell_type": "markdown", "id": "77abe8f3", "metadata": {}, "source": [ "## Simulation with random walk\n", "\n", "With the asynchronous update mode, the activation of the genes can be made in any order. Here, we randomly sample one trajectory of this model, which essentially boils down to selecting a random ordering of genes that get activated.\n", "\n", "Let us first specify the initial state of the network:" ] }, { "cell_type": "code", "execution_count": 4, "id": "ee4a6403", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'tf': 1,\n", " 'gene1': 0,\n", " 'gene2': 0,\n", " 'gene3': 0,\n", " 'gene4': 0,\n", " 'gene5': 0,\n", " 'gene6': 0,\n", " 'gene7': 0,\n", " 'gene8': 0,\n", " 'gene9': 0,\n", " 'gene10': 0}" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "initial_state = bn.zero()\n", "initial_state[\"tf\"] = 1\n", "initial_state" ] }, { "cell_type": "markdown", "id": "60cedafe", "metadata": {}, "source": [ "Then, we use `minibn` to generate a random walk in the asynchronous dynamics of the Boolean network from the given initial state:" ] }, { "cell_type": "code", "execution_count": 5, "id": "05646298", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
tfgene1gene2gene3gene4gene5gene6gene7gene8gene9gene10
010000000000
110000000100
211000000100
311010000100
411010010100
511010010110
611010010111
711011010111
811111010111
911111011111
1011111111111
\n", "
" ], "text/plain": [ " tf gene1 gene2 gene3 gene4 gene5 gene6 gene7 gene8 gene9 gene10\n", "0 1 0 0 0 0 0 0 0 0 0 0\n", "1 1 0 0 0 0 0 0 0 1 0 0\n", "2 1 1 0 0 0 0 0 0 1 0 0\n", "3 1 1 0 1 0 0 0 0 1 0 0\n", "4 1 1 0 1 0 0 1 0 1 0 0\n", "5 1 1 0 1 0 0 1 0 1 1 0\n", "6 1 1 0 1 0 0 1 0 1 1 1\n", "7 1 1 0 1 1 0 1 0 1 1 1\n", "8 1 1 1 1 1 0 1 0 1 1 1\n", "9 1 1 1 1 1 0 1 1 1 1 1\n", "10 1 1 1 1 1 1 1 1 1 1 1" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dynamics = FullyAsynchronousDynamics(bn)\n", "random_walk_df = pd.DataFrame(dynamics.random_walk(initial_state, steps=10))\n", "random_walk_df" ] }, { "cell_type": "markdown", "id": "a835af87", "metadata": {}, "source": [ "## Retrieve statistics of real expression datasets\n", "\n", "In order to generate synthetic RNA counts, scBoolSeq relies on statistical criteria learnt from real scRNA-Seq datasets. Then, the nodes of the Boolean model used to generate the Boolean states need to be associated with real gene names: scBoolSeq will then generate RNA counts from the corresponding distribution, biased by the Boolean state of the gene.\n", "\n", "In this example, we re-use the simulation criteria learnt from the Nestorowa dataset in the [1 - Binarization and synthetic data generation](1%20-%20Binarization%20and%20synthetic%20data%20generation.ipynb) notebook:" ] }, { "cell_type": "code", "execution_count": 6, "id": "7e1f185f", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
DipBIKurtosisDropOutRateMeanNZDenPeakAmplitudegaussian_prob1gaussian_prob2gaussian_mean1gaussian_mean2gaussian_variancemeanvarianceunimodal_margin_quantileunimodal_low_quantileunimodal_high_quantileIQRq50bim_thresh_downbim_thresh_upCategorydor_threshold
Clec1b9.948487e-011.6356986.1667110.8762081.520978-0.0072498.8521810.9861400.0138600.1112915.6664900.1576491.5209782.6667600.250.6672711.5552900.8880200.9687762.7857403.094168UnimodalNaN
Kdm3a0.000000e+002.407548-0.7840190.3260873.8479400.20923910.1266760.7145200.2854800.8726436.8994491.2782472.5931778.6925860.250.0000005.2589845.2589841.2680403.4322514.748643Bimodal0.95
Coro2b4.684039e-032.3200600.3270600.6582132.3838190.0045979.4755770.9195080.0804920.3355466.2890790.4873722.3838195.3705210.250.8277402.9129442.0852051.2906663.1835963.879537BimodalNaN
8430408G22Rik7.236739e-083.121069-0.9939790.8840582.9834720.0056639.0678570.9649620.0350380.0988987.1488080.1725062.9834728.1546470.250.8252986.4650745.6397761.4497793.6120614.175572BimodalNaN
Clec9a1.000000e+002.081717140.0892850.9655802.280293-0.0093619.6142330.9939610.0060390.0355997.1380990.0698700.0784880.3728780.250.0000000.0000000.0000000.0000003.1134104.607253Discarded0.95
\n", "
" ], "text/plain": [ " Dip BI Kurtosis DropOutRate MeanNZ \\\n", "Clec1b 9.948487e-01 1.635698 6.166711 0.876208 1.520978 \n", "Kdm3a 0.000000e+00 2.407548 -0.784019 0.326087 3.847940 \n", "Coro2b 4.684039e-03 2.320060 0.327060 0.658213 2.383819 \n", "8430408G22Rik 7.236739e-08 3.121069 -0.993979 0.884058 2.983472 \n", "Clec9a 1.000000e+00 2.081717 140.089285 0.965580 2.280293 \n", "\n", " DenPeak Amplitude gaussian_prob1 gaussian_prob2 \\\n", "Clec1b -0.007249 8.852181 0.986140 0.013860 \n", "Kdm3a 0.209239 10.126676 0.714520 0.285480 \n", "Coro2b 0.004597 9.475577 0.919508 0.080492 \n", "8430408G22Rik 0.005663 9.067857 0.964962 0.035038 \n", "Clec9a -0.009361 9.614233 0.993961 0.006039 \n", "\n", " gaussian_mean1 gaussian_mean2 gaussian_variance mean \\\n", "Clec1b 0.111291 5.666490 0.157649 1.520978 \n", "Kdm3a 0.872643 6.899449 1.278247 2.593177 \n", "Coro2b 0.335546 6.289079 0.487372 2.383819 \n", "8430408G22Rik 0.098898 7.148808 0.172506 2.983472 \n", "Clec9a 0.035599 7.138099 0.069870 0.078488 \n", "\n", " variance unimodal_margin_quantile unimodal_low_quantile \\\n", "Clec1b 2.666760 0.25 0.667271 \n", "Kdm3a 8.692586 0.25 0.000000 \n", "Coro2b 5.370521 0.25 0.827740 \n", "8430408G22Rik 8.154647 0.25 0.825298 \n", "Clec9a 0.372878 0.25 0.000000 \n", "\n", " unimodal_high_quantile IQR q50 bim_thresh_down \\\n", "Clec1b 1.555290 0.888020 0.968776 2.785740 \n", "Kdm3a 5.258984 5.258984 1.268040 3.432251 \n", "Coro2b 2.912944 2.085205 1.290666 3.183596 \n", "8430408G22Rik 6.465074 5.639776 1.449779 3.612061 \n", "Clec9a 0.000000 0.000000 0.000000 3.113410 \n", "\n", " bim_thresh_up Category dor_threshold \n", "Clec1b 3.094168 Unimodal NaN \n", "Kdm3a 4.748643 Bimodal 0.95 \n", "Coro2b 3.879537 Bimodal NaN \n", "8430408G22Rik 4.175572 Bimodal NaN \n", "Clec9a 4.607253 Discarded 0.95 " ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "criteria = pd.read_csv(\"cache_scBoolSeq_Nestorowa_simulation_criteria.csv\", index_col=0)\n", "criteria.head()" ] }, { "cell_type": "code", "execution_count": 7, "id": "c0d064c8", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Bimodal 2987\n", "Unimodal 1580\n", "Discarded 201\n", "Name: Category, dtype: int64" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "criteria.Category.value_counts()" ] }, { "cell_type": "markdown", "id": "8027f897", "metadata": {}, "source": [ "We randomly select bimodal genes for each of the nodes of the Boolean model to obtain simulation criteria:" ] }, { "cell_type": "code", "execution_count": 8, "id": "9e9b9663", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
DipBIKurtosisDropOutRateMeanNZDenPeakAmplitudegaussian_prob1gaussian_prob2gaussian_mean1gaussian_mean2gaussian_variancemeanvarianceunimodal_margin_quantileunimodal_low_quantileunimodal_high_quantileIQRq50bim_thresh_downbim_thresh_upCategorydor_threshold
tf0.0002501.982430-1.1183900.0277784.4548482.23150910.4584650.6343210.3656792.4496887.5946801.5623654.3311027.7071620.252.0703916.9205024.8501123.3214804.2773516.081385Bimodal0.95
gene10.0000002.171544-1.1483140.0211356.2230478.24320411.1931840.3639060.6360942.5794398.1007601.4964396.0915218.5582190.253.1839708.4838655.2998957.2828844.3847965.990926Bimodal0.95
gene20.0000001.957530-1.2732490.0440825.6410878.02981810.7948510.4090650.5909352.2582467.5619911.7745185.3924168.5795170.252.5855107.9949015.4093916.1291233.8027625.780924Bimodal0.95
gene30.0000002.123630-1.3150390.0368365.7248708.07633410.8113160.4003380.5996622.2371267.7016441.5895705.5139908.7635080.252.5367128.1067695.5700576.4526763.9639045.713590Bimodal0.95
gene40.0000002.161170-1.3446870.0181166.5594009.54372812.2605580.4032310.5967692.7343728.9448111.9871336.44057011.2751560.253.1094839.4398446.3303617.4710174.7719866.660780Bimodal0.95
gene50.0000002.167416-1.3865050.0386475.6616398.15457910.4985230.4362310.5637692.3221337.8575541.6041125.4428329.1452570.252.4802698.1417195.6614516.2724764.1567335.870370Bimodal0.95
gene60.0000002.355317-1.5057610.0344205.7451178.72236011.4012880.4335510.5664492.0157948.2503791.7207485.54736811.2734380.252.0687128.5818356.5131236.5333174.2268845.879037Bimodal0.95
gene70.0000002.307954-0.8619030.0054356.8003398.65503510.9310650.3031850.6968152.8750438.4552041.2349936.7633817.8180940.254.1084458.9195684.8111237.8558444.8170056.135923Bimodal0.95
gene80.0000002.296122-1.4833510.0434785.8213948.96117711.4371980.4894540.5105462.4223838.5842361.7996145.56829011.2943180.252.4383928.7633616.3249685.8363084.6318916.354470Bimodal0.95
gene90.0010381.5938950.2463880.0356284.1160753.38209610.3217750.8139530.1860473.1185727.6919191.2467233.9694284.4166920.252.6836854.5604551.8767703.5063174.9939696.611602Bimodal0.95
gene100.0000002.252029-1.2165980.0060396.2814438.38387610.8912270.3654520.6345482.9637478.1324131.2215296.2435117.4211620.253.4646648.5087205.0440567.2986774.7059176.140867Bimodal0.95
\n", "
" ], "text/plain": [ " Dip BI Kurtosis DropOutRate MeanNZ DenPeak \\\n", "tf 0.000250 1.982430 -1.118390 0.027778 4.454848 2.231509 \n", "gene1 0.000000 2.171544 -1.148314 0.021135 6.223047 8.243204 \n", "gene2 0.000000 1.957530 -1.273249 0.044082 5.641087 8.029818 \n", "gene3 0.000000 2.123630 -1.315039 0.036836 5.724870 8.076334 \n", "gene4 0.000000 2.161170 -1.344687 0.018116 6.559400 9.543728 \n", "gene5 0.000000 2.167416 -1.386505 0.038647 5.661639 8.154579 \n", "gene6 0.000000 2.355317 -1.505761 0.034420 5.745117 8.722360 \n", "gene7 0.000000 2.307954 -0.861903 0.005435 6.800339 8.655035 \n", "gene8 0.000000 2.296122 -1.483351 0.043478 5.821394 8.961177 \n", "gene9 0.001038 1.593895 0.246388 0.035628 4.116075 3.382096 \n", "gene10 0.000000 2.252029 -1.216598 0.006039 6.281443 8.383876 \n", "\n", " Amplitude gaussian_prob1 gaussian_prob2 gaussian_mean1 \\\n", "tf 10.458465 0.634321 0.365679 2.449688 \n", "gene1 11.193184 0.363906 0.636094 2.579439 \n", "gene2 10.794851 0.409065 0.590935 2.258246 \n", "gene3 10.811316 0.400338 0.599662 2.237126 \n", "gene4 12.260558 0.403231 0.596769 2.734372 \n", "gene5 10.498523 0.436231 0.563769 2.322133 \n", "gene6 11.401288 0.433551 0.566449 2.015794 \n", "gene7 10.931065 0.303185 0.696815 2.875043 \n", "gene8 11.437198 0.489454 0.510546 2.422383 \n", "gene9 10.321775 0.813953 0.186047 3.118572 \n", "gene10 10.891227 0.365452 0.634548 2.963747 \n", "\n", " gaussian_mean2 gaussian_variance mean variance \\\n", "tf 7.594680 1.562365 4.331102 7.707162 \n", "gene1 8.100760 1.496439 6.091521 8.558219 \n", "gene2 7.561991 1.774518 5.392416 8.579517 \n", "gene3 7.701644 1.589570 5.513990 8.763508 \n", "gene4 8.944811 1.987133 6.440570 11.275156 \n", "gene5 7.857554 1.604112 5.442832 9.145257 \n", "gene6 8.250379 1.720748 5.547368 11.273438 \n", "gene7 8.455204 1.234993 6.763381 7.818094 \n", "gene8 8.584236 1.799614 5.568290 11.294318 \n", "gene9 7.691919 1.246723 3.969428 4.416692 \n", "gene10 8.132413 1.221529 6.243511 7.421162 \n", "\n", " unimodal_margin_quantile unimodal_low_quantile \\\n", "tf 0.25 2.070391 \n", "gene1 0.25 3.183970 \n", "gene2 0.25 2.585510 \n", "gene3 0.25 2.536712 \n", "gene4 0.25 3.109483 \n", "gene5 0.25 2.480269 \n", "gene6 0.25 2.068712 \n", "gene7 0.25 4.108445 \n", "gene8 0.25 2.438392 \n", "gene9 0.25 2.683685 \n", "gene10 0.25 3.464664 \n", "\n", " unimodal_high_quantile IQR q50 bim_thresh_down \\\n", "tf 6.920502 4.850112 3.321480 4.277351 \n", "gene1 8.483865 5.299895 7.282884 4.384796 \n", "gene2 7.994901 5.409391 6.129123 3.802762 \n", "gene3 8.106769 5.570057 6.452676 3.963904 \n", "gene4 9.439844 6.330361 7.471017 4.771986 \n", "gene5 8.141719 5.661451 6.272476 4.156733 \n", "gene6 8.581835 6.513123 6.533317 4.226884 \n", "gene7 8.919568 4.811123 7.855844 4.817005 \n", "gene8 8.763361 6.324968 5.836308 4.631891 \n", "gene9 4.560455 1.876770 3.506317 4.993969 \n", "gene10 8.508720 5.044056 7.298677 4.705917 \n", "\n", " bim_thresh_up Category dor_threshold \n", "tf 6.081385 Bimodal 0.95 \n", "gene1 5.990926 Bimodal 0.95 \n", "gene2 5.780924 Bimodal 0.95 \n", "gene3 5.713590 Bimodal 0.95 \n", "gene4 6.660780 Bimodal 0.95 \n", "gene5 5.870370 Bimodal 0.95 \n", "gene6 5.879037 Bimodal 0.95 \n", "gene7 6.135923 Bimodal 0.95 \n", "gene8 6.354470 Bimodal 0.95 \n", "gene9 6.611602 Bimodal 0.95 \n", "gene10 6.140867 Bimodal 0.95 " ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "random_criteria = criteria[\n", " (criteria.Category == \"Bimodal\") &\n", " (criteria.DropOutRate < 0.05)\n", "].sample(11, random_state=_rng_seed)\n", "random_criteria.set_index(random_walk_df.columns, inplace=True)\n", "random_criteria" ] }, { "cell_type": "markdown", "id": "4ac21fef", "metadata": {}, "source": [ "## Generate synthetic RNA-Seq data\n", "\n", "We instantiate scBoolSeq with the simulation criteria having the name matching with the column of the generated Boolean matrix." ] }, { "cell_type": "code", "execution_count": 9, "id": "1239447f", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "scBoolSeq(has_data=False, can_binarize=False, can_simulate=True)" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "scbool = scBoolSeq(simulation_criteria=random_criteria)\n", "scbool" ] }, { "cell_type": "markdown", "id": "18e159e5", "metadata": {}, "source": [ "Then, we generate 300 samples per Boolean states using the `.simulate` method:" ] }, { "cell_type": "code", "execution_count": 10, "id": "f627ea46", "metadata": {}, "outputs": [], "source": [ "n_samples = 300 # number of samples per row" ] }, { "cell_type": "code", "execution_count": 11, "id": "163cdddb", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
tfgene1gene2gene3gene4gene5gene6gene7gene8gene9gene10
07.3439974.4237481.4557642.6662663.6224632.3391833.1455572.3555204.2268172.4507004.954449
16.5034594.8695160.9979022.4908263.8319142.9047711.3553202.1217289.0141414.1126132.857409
27.1945688.0607261.8033313.3958022.9642982.0390392.8562743.07840210.7932652.7103451.965232
35.2653557.6462481.6370696.7125793.3821252.3342640.0000001.1652219.3657782.6654266.474557
47.5194438.7548642.4716897.5917013.1097032.2089827.9968781.13710810.4565072.6030114.516670
\n", "
" ], "text/plain": [ " tf gene1 gene2 gene3 gene4 gene5 gene6 \\\n", "0 7.343997 4.423748 1.455764 2.666266 3.622463 2.339183 3.145557 \n", "1 6.503459 4.869516 0.997902 2.490826 3.831914 2.904771 1.355320 \n", "2 7.194568 8.060726 1.803331 3.395802 2.964298 2.039039 2.856274 \n", "3 5.265355 7.646248 1.637069 6.712579 3.382125 2.334264 0.000000 \n", "4 7.519443 8.754864 2.471689 7.591701 3.109703 2.208982 7.996878 \n", "\n", " gene7 gene8 gene9 gene10 \n", "0 2.355520 4.226817 2.450700 4.954449 \n", "1 2.121728 9.014141 4.112613 2.857409 \n", "2 3.078402 10.793265 2.710345 1.965232 \n", "3 1.165221 9.365778 2.665426 6.474557 \n", "4 1.137108 10.456507 2.603011 4.516670 " ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "counts = scbool.simulate(random_walk_df, n_samples=n_samples)\n", "counts.head()" ] }, { "cell_type": "markdown", "id": "a95920eb", "metadata": {}, "source": [ "To ease post-analysis with STREAM, we generate unique identifiers for each simulated row (cell):" ] }, { "cell_type": "code", "execution_count": 12, "id": "9ace2312", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
tfgene1gene2gene3gene4gene5gene6gene7gene8gene9gene10
cellID
step0_07.3439974.4237481.4557642.6662663.6224632.3391833.1455572.3555204.2268172.4507004.954449
step1_06.5034594.8695160.9979022.4908263.8319142.9047711.3553202.1217289.0141414.1126132.857409
step2_07.1945688.0607261.8033313.3958022.9642982.0390392.8562743.07840210.7932652.7103451.965232
step3_05.2653557.6462481.6370696.7125793.3821252.3342640.0000001.1652219.3657782.6654266.474557
step4_07.5194438.7548642.4716897.5917013.1097032.2089827.9968781.13710810.4565072.6030114.516670
....................................
step6_2997.01949811.8330264.1916777.6368804.5695412.1781498.8122442.6815198.1035038.3502288.577708
step7_2996.8693628.4949980.9875286.1915698.5651681.7421427.8308210.22138010.3016374.2301317.762939
step8_2999.3185268.2633075.6603656.1020518.7170694.5437806.4951202.9182228.3886956.1719414.913572
step9_2997.7684778.4152408.0410825.9576559.1192772.50709110.0579158.0739528.2945285.6352968.548555
step10_2999.3631837.6107906.5056886.6940606.8726597.6331217.1011267.1066778.1663037.8565778.372333
\n", "

3300 rows × 11 columns

\n", "
" ], "text/plain": [ " tf gene1 gene2 gene3 gene4 gene5 \\\n", "cellID \n", "step0_0 7.343997 4.423748 1.455764 2.666266 3.622463 2.339183 \n", "step1_0 6.503459 4.869516 0.997902 2.490826 3.831914 2.904771 \n", "step2_0 7.194568 8.060726 1.803331 3.395802 2.964298 2.039039 \n", "step3_0 5.265355 7.646248 1.637069 6.712579 3.382125 2.334264 \n", "step4_0 7.519443 8.754864 2.471689 7.591701 3.109703 2.208982 \n", "... ... ... ... ... ... ... \n", "step6_299 7.019498 11.833026 4.191677 7.636880 4.569541 2.178149 \n", "step7_299 6.869362 8.494998 0.987528 6.191569 8.565168 1.742142 \n", "step8_299 9.318526 8.263307 5.660365 6.102051 8.717069 4.543780 \n", "step9_299 7.768477 8.415240 8.041082 5.957655 9.119277 2.507091 \n", "step10_299 9.363183 7.610790 6.505688 6.694060 6.872659 7.633121 \n", "\n", " gene6 gene7 gene8 gene9 gene10 \n", "cellID \n", "step0_0 3.145557 2.355520 4.226817 2.450700 4.954449 \n", "step1_0 1.355320 2.121728 9.014141 4.112613 2.857409 \n", "step2_0 2.856274 3.078402 10.793265 2.710345 1.965232 \n", "step3_0 0.000000 1.165221 9.365778 2.665426 6.474557 \n", "step4_0 7.996878 1.137108 10.456507 2.603011 4.516670 \n", "... ... ... ... ... ... \n", "step6_299 8.812244 2.681519 8.103503 8.350228 8.577708 \n", "step7_299 7.830821 0.221380 10.301637 4.230131 7.762939 \n", "step8_299 6.495120 2.918222 8.388695 6.171941 4.913572 \n", "step9_299 10.057915 8.073952 8.294528 5.635296 8.548555 \n", "step10_299 7.101126 7.106677 8.166303 7.856577 8.372333 \n", "\n", "[3300 rows x 11 columns]" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ids = [f\"step{x}_{y}\" for y in range(n_samples) for x in random_walk_df.index]\n", "counts.index = ids\n", "counts.index.name = \"cellID\"\n", "counts" ] }, { "cell_type": "markdown", "id": "56f449c4", "metadata": {}, "source": [ "We write the result as a TSV file:" ] }, { "cell_type": "code", "execution_count": 13, "id": "edb10f46", "metadata": {}, "outputs": [], "source": [ "counts.T.to_csv(\"synthetic_data_star_counts.tsv\", sep=\"\\t\")" ] }, { "cell_type": "markdown", "id": "962ad85f", "metadata": {}, "source": [ "The, we generate metadata to validate the trajectory reconstruction with STREAM:" ] }, { "cell_type": "code", "execution_count": 14, "id": "5ec64c76", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{0: '#A2F37E',\n", " 1: '#36873F',\n", " 2: '#81D278',\n", " 3: '#8CA2D4',\n", " 4: '#D0327B',\n", " 5: '#CDEF47',\n", " 6: '#CB6896',\n", " 7: '#590605',\n", " 8: '#3C27AB',\n", " 9: '#4A7BBC',\n", " 10: '#F0F94B'}" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nb_active_genes = [cfg.sum()-1 for _, cfg in random_walk_df.iterrows()]\n", "_RGB_values = list(\"0123456789ABCDEF\")\n", "color_map = {nb: \"#\"+''.join([_rng.choice(_RGB_values) for _ in range(6)]) for nb in set(nb_active_genes)}\n", "color_map" ] }, { "cell_type": "code", "execution_count": 15, "id": "e5827936", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
labellabel_color
cellID
step0_00#A2F37E
step1_01#36873F
step2_02#81D278
step3_03#8CA2D4
step4_04#D0327B
.........
step6_2996#CB6896
step7_2997#590605
step8_2998#3C27AB
step9_2999#4A7BBC
step10_29910#F0F94B
\n", "

3300 rows × 2 columns

\n", "
" ], "text/plain": [ " label label_color\n", "cellID \n", "step0_0 0 #A2F37E\n", "step1_0 1 #36873F\n", "step2_0 2 #81D278\n", "step3_0 3 #8CA2D4\n", "step4_0 4 #D0327B\n", "... ... ...\n", "step6_299 6 #CB6896\n", "step7_299 7 #590605\n", "step8_299 8 #3C27AB\n", "step9_299 9 #4A7BBC\n", "step10_299 10 #F0F94B\n", "\n", "[3300 rows x 2 columns]" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "metadata = [[nb, color_map[nb]] for nb in nb_active_genes]*n_samples\n", "metadata = pd.DataFrame(metadata, columns=[\"label\", \"label_color\"])\n", "metadata.index = counts.index\n", "metadata" ] }, { "cell_type": "code", "execution_count": 16, "id": "8887d91d", "metadata": {}, "outputs": [], "source": [ "metadata.to_csv(\"synthetic_data_star_metadata.tsv\", sep=\"\\t\")" ] }, { "cell_type": "markdown", "id": "76d0f0c6", "metadata": {}, "source": [ "STREAM analysis is performed in a separate notebook: [3.1 - STREAM - Trajectory reconstruction for star network synthetic scRNA data](3.1%20-%20STREAM%20-%20Trajectory%20reconstruction%20for%20star%20network%20synthetic%20scRNA%20data.ipynb). Note that its execution should be performed in the adequate software environment (e.g., STREAM Docker image)" ] }, { "cell_type": "code", "execution_count": null, "id": "f81a6be2", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.12" } }, "nbformat": 4, "nbformat_minor": 5 }