--- name: "muon-multiomics-singlecell" description: "Multi-modal single-cell analysis with muon/MuData. Joint RNA+ATAC (10x Multiome), CITE-seq (RNA+protein), other multi-omics. MuData holds per-modality AnnData with shared obs. WNN joint embedding, per-modality preprocessing, MOFA factor analysis. Use scanpy-scrna-seq for single-modality RNA; use muon when combining 2+ omics from the same cells." license: "BSD-3-Clause" --- # muon — Multi-Modal Single-Cell Analysis ## Overview muon is a Python framework for multi-modal single-cell data analysis that extends the AnnData ecosystem. Its core data structure, `MuData`, holds multiple `AnnData` objects (one per modality: RNA, ATAC, protein, etc.) with shared observation and variable axes, enabling coordinated operations across all modalities. muon provides modality-specific preprocessing routines (TF-IDF and LSI for ATAC, CLR normalization for surface proteins), Weighted Nearest Neighbor (WNN) graph construction for joint dimensionality reduction, and cross-modal analysis tools. It integrates directly with scanpy, scvi-tools, and MOFA+ for a complete multi-omics single-cell workflow. ## When to Use - Analyzing 10x Genomics Multiome data (simultaneous RNA + ATAC from the same nuclei) - Processing CITE-seq experiments (RNA + surface protein from the same cells) - Building joint UMAP embeddings that integrate signals from two or more modalities via WNN - Preprocessing ATAC-seq modalities (TF-IDF normalization, LSI dimensionality reduction) - Normalizing surface protein data with centered log-ratio (CLR) normalization - Performing cross-modal feature linkage (associating ATAC peaks with nearby gene expression) - Applying MOFA+ factor analysis across multiple omics layers within a unified container - Use **scanpy-scrna-seq** instead when analyzing a single RNA-seq modality without any co-measured omics - Use **scvi-tools (MultiVI / totalVI)** when you need probabilistic deep generative batch correction across modalities ## Prerequisites - **Python packages**: `muon>=0.1.6`, `scanpy>=1.10`, `anndata>=0.10`, `numpy`, `scipy`, `pandas`, `matplotlib`, `leidenalg` - **Data requirements**: 10x Multiome h5 or h5mu files, or per-modality AnnData objects (cells x features) sharing the same `obs_names` - **Environment**: Python 3.9+; 16 GB+ RAM for datasets >50k cells; optional `mofapy2` for MOFA factor analysis ```bash pip install "muon[all]" "scanpy[leiden]" anndata # Optional: for MOFA+ integration pip install mofapy2 ``` ## Quick Start ```python import muon as mu import scanpy as sc import numpy as np import anndata as ad import pandas as pd # Simulate a small RNA + ATAC MuData object (200 cells) np.random.seed(42) n_cells = 200 rna = ad.AnnData(np.abs(np.random.negative_binomial(5, 0.3, (n_cells, 2000))).astype(float), obs=pd.DataFrame(index=[f"cell_{i}" for i in range(n_cells)]), var=pd.DataFrame(index=[f"gene_{j}" for j in range(2000)])) atac = ad.AnnData(np.abs(np.random.negative_binomial(2, 0.5, (n_cells, 5000))).astype(float), obs=rna.obs.copy(), var=pd.DataFrame(index=[f"peak_{j}" for j in range(5000)])) mdata = mu.MuData({"rna": rna, "atac": atac}) print(mdata) # MuData object with n_obs × n_vars = 200 × 7000 # 2 modalities # rna: 200 × 2000 # atac: 200 × 5000 # Minimal preprocessing sc.pp.normalize_total(mdata["rna"], target_sum=1e4) sc.pp.log1p(mdata["rna"]) sc.pp.highly_variable_genes(mdata["rna"], n_top_genes=500) sc.pp.pca(mdata["rna"]) mu.atac.pp.tfidf(mdata["atac"]) mu.atac.tl.lsi(mdata["atac"]) # WNN joint embedding mu.pp.neighbors(mdata, key_added="wnn", use_rep={"rna": "X_pca", "atac": "X_lsi"}) sc.tl.umap(mdata, neighbors_key="wnn") sc.tl.leiden(mdata, neighbors_key="wnn", key_added="leiden_wnn") print(f"Clusters: {mdata.obs['leiden_wnn'].nunique()}") ``` ## Core API ### Module 1: MuData Creation and I/O `MuData` is the central container: a dictionary of `AnnData` modalities plus shared `obs` and `var` slots. The `mdata.obs` DataFrame merges per-modality observation metadata with a modality prefix for ambiguous columns. ```python import muon as mu import anndata as ad import numpy as np import pandas as pd # --- Build from per-modality AnnData objects --- n_cells = 300 rna = ad.AnnData(np.abs(np.random.negative_binomial(5, 0.3, (n_cells, 3000))).astype(float), obs=pd.DataFrame(index=[f"cell_{i}" for i in range(n_cells)]), var=pd.DataFrame(index=[f"gene_{j}" for j in range(3000)])) prot = ad.AnnData(np.abs(np.random.randn(n_cells, 30)) + 3, obs=rna.obs.copy(), var=pd.DataFrame(index=[f"protein_{j}" for j in range(30)])) mdata = mu.MuData({"rna": rna, "protein": prot}) print(mdata) # Access individual modalities print(mdata.mod["rna"]) # AnnData: 300 × 3000 print(mdata["rna"]) # same shorthand print(mdata.obs.head()) # shared observation metadata print(mdata.obs_names[:5]) # cell barcodes # Save and load mdata.write("multiome_data.h5mu") mdata2 = mu.read("multiome_data.h5mu") print(f"Loaded: {mdata2.n_obs} cells, {mdata2.n_mod} modalities") ``` ```python # --- Load from 10x Multiome h5 file --- # mdata = mu.read_10x_h5("filtered_feature_bc_matrix.h5") # Produces MuData with mdata["rna"] and mdata["atac"] modalities # --- Subsetting by cells or features --- # Select high-quality cells (e.g., after QC) mask = np.ones(mdata.n_obs, dtype=bool) # replace with actual QC mask mdata_filtered = mdata[mask].copy() print(f"After filtering: {mdata_filtered.n_obs} cells") # Propagate obs mask to each modality mu.pp.intersect_obs(mdata) # ensures obs consistency across modalities ``` ### Module 2: RNA Modality Preprocessing Standard scRNA-seq preprocessing applied to `mdata["rna"]` using scanpy functions. The RNA modality is preprocessed identically to a standalone scanpy workflow but operates on the slice of the MuData container. ```python import scanpy as sc import numpy as np # Assume mdata["rna"] has raw integer counts rna = mdata["rna"] # QC metrics sc.pp.calculate_qc_metrics(rna, percent_top=None, log1p=False, inplace=True) rna.obs["pct_counts_mt"] = rna[:, rna.var_names.str.startswith("MT-")].X.sum(axis=1).A1 / rna.obs["total_counts"] * 100 # Filter cells and genes min_genes, max_genes, max_mt = 200, 5000, 20 sc.pp.filter_cells(rna, min_genes=min_genes) sc.pp.filter_genes(rna, min_cells=5) rna = rna[(rna.obs["n_genes_by_counts"] < max_genes) & (rna.obs["pct_counts_mt"] < max_mt)].copy() print(f"RNA after QC: {rna.n_obs} cells × {rna.n_vars} genes") # Normalize and log-transform sc.pp.normalize_total(rna, target_sum=1e4) sc.pp.log1p(rna) # Highly variable genes sc.pp.highly_variable_genes(rna, n_top_genes=3000, flavor="seurat_v3", batch_key=None) print(f"HVGs: {rna.var['highly_variable'].sum()}") # PCA on HVGs sc.pp.pca(rna, n_comps=50, use_highly_variable=True) print(f"PCA embedding shape: {rna.obsm['X_pca'].shape}") # Update slice in MuData mdata.mod["rna"] = rna ``` ### Module 3: ATAC Modality Preprocessing ATAC-seq modalities require a different normalization strategy. TF-IDF (term frequency–inverse document frequency) normalizes peak accessibility across cells and peaks; LSI (latent semantic indexing, equivalent to truncated SVD after TF-IDF) produces a low-dimensional embedding. The first LSI component typically captures sequencing depth rather than biology and is excluded. ```python # Assume mdata["atac"] contains raw binary or integer peak accessibility counts atac = mdata["atac"] # Basic QC: filter low-coverage cells and low-frequency peaks sc.pp.calculate_qc_metrics(atac, percent_top=None, log1p=False, inplace=True) sc.pp.filter_cells(atac, min_genes=200) # min peaks detected sc.pp.filter_genes(atac, min_cells=10) # min cells a peak appears in print(f"ATAC after QC: {atac.n_obs} cells × {atac.n_vars} peaks") # TF-IDF normalization (log(TF) * log(IDF) scaling) mu.atac.pp.tfidf(atac, scale_factor=1e4) print("TF-IDF normalization complete") print(f"ATAC data range: [{atac.X.min():.2f}, {atac.X.max():.2f}]") # LSI dimensionality reduction (truncated SVD on TF-IDF matrix) mu.atac.tl.lsi(atac, n_comps=50, use_highly_variable=False) # LSI component 1 correlates with sequencing depth — exclude it # mu.atac.tl.lsi sets X_lsi starting from component 2 by default print(f"LSI embedding shape: {atac.obsm['X_lsi'].shape}") # Update modality in MuData mdata.mod["atac"] = atac ``` ### Module 4: WNN Graph and Joint Embedding Weighted Nearest Neighbor (WNN) integrates multiple modality embeddings by learning per-cell, per-modality weights. Cells with high-quality RNA signal receive higher RNA weight; cells with cleaner ATAC signal receive higher ATAC weight. The resulting WNN graph is used for UMAP layout and Leiden clustering. ```python # Compute per-modality neighbor graphs first (optional but enables modality-specific UMAPs) mu.pp.neighbors(mdata["rna"], use_rep="X_pca", n_neighbors=30, key_added="neighbors") mu.pp.neighbors(mdata["atac"], use_rep="X_lsi", n_neighbors=30, key_added="neighbors") # WNN joint neighbor graph across modalities mu.pp.neighbors( mdata, key_added="wnn", use_rep={"rna": "X_pca", "atac": "X_lsi"}, n_neighbors=30, random_state=42, ) print("WNN graph built. Keys:", list(mdata.obsp.keys())) # Expected: ['wnn_connectivities', 'wnn_distances'] # UMAP from WNN graph sc.tl.umap(mdata, neighbors_key="wnn", random_state=42) print(f"UMAP embedding shape: {mdata.obsm['X_umap'].shape}") # Leiden clustering from WNN graph sc.tl.leiden(mdata, neighbors_key="wnn", resolution=0.5, key_added="leiden_wnn") n_clusters = mdata.obs["leiden_wnn"].nunique() print(f"Leiden WNN clustering: {n_clusters} clusters at resolution 0.5") ``` ### Module 5: Visualization muon extends scanpy's plotting interface with modality-aware functions. `mu.pl.embedding()` colors joint UMAP embeddings by features from any modality; `sc.pl.umap()` with `color` pointing to modality-prefixed feature names (e.g., `"rna:CD3E"`) is also supported. ```python import matplotlib.pyplot as plt # Joint UMAP colored by cluster assignment sc.pl.umap(mdata, color="leiden_wnn", title="WNN Leiden clusters", legend_loc="on data", show=False) plt.savefig("wnn_umap_clusters.png", dpi=150, bbox_inches="tight") plt.close() print("Saved wnn_umap_clusters.png") # Color by RNA gene expression on joint UMAP sc.pl.umap(mdata, color=["rna:CD3E", "rna:CD19", "rna:CD14"], use_raw=False, vmax="p99", show=False, ncols=3) plt.savefig("wnn_umap_markers.png", dpi=150, bbox_inches="tight") plt.close() print("Saved wnn_umap_markers.png") ``` ```python # Per-modality scatter / embedding plots mu.pl.embedding(mdata, basis="X_umap", color="leiden_wnn", show=False) plt.savefig("embedding_clusters.png", dpi=150, bbox_inches="tight") plt.close() # Violin plot: RNA QC metrics per cluster sc.pl.violin(mdata["rna"], keys=["n_genes_by_counts", "total_counts"], groupby=mdata.obs.loc[mdata["rna"].obs_names, "leiden_wnn"], rotation=90, show=False) plt.savefig("rna_qc_by_cluster.png", dpi=150, bbox_inches="tight") plt.close() print("Saved rna_qc_by_cluster.png") ``` ### Module 6: Cross-Modal Analysis Cross-modal analysis links features across modalities — for example, associating ATAC peak accessibility near a gene's promoter with its RNA expression, or applying MOFA+ to find shared latent factors. ```python # --- Peak-to-gene distance annotation --- # Annotate ATAC peaks with nearest gene (requires genomic coordinates in var) # mdata["atac"].var should contain columns: chrom, chromStart, chromEnd # mu.atac.tl.rank_peaks_groups(mdata, groupby="leiden_wnn") # differential peaks # --- Differentially accessible peaks per cluster --- atac = mdata["atac"] # Transfer cluster labels from MuData obs to ATAC obs atac.obs["cluster"] = mdata.obs.loc[atac.obs_names, "leiden_wnn"].values sc.tl.rank_genes_groups(atac, groupby="cluster", method="wilcoxon", use_raw=False) top_peaks = sc.get.rank_genes_groups_df(atac, group="0").head(5) print("Top differential peaks in cluster 0:") print(top_peaks[["names", "logfoldchanges", "pvals_adj"]]) ``` ```python # --- MOFA+ factor analysis on MuData --- # Requires: pip install mofapy2 try: mu.tl.mofa(mdata, n_factors=10, seed=42, use_obs="all", outfile="mofa_model.hdf5") # Factor scores stored in mdata.obsm["X_mofa"] print(f"MOFA factors: {mdata.obsm['X_mofa'].shape}") # Variance explained per factor per modality # Inspect with mdata.uns["mofa"] except ImportError: print("Install mofapy2 to enable MOFA factor analysis") ``` ## Common Workflows ### Workflow 1: Full 10x Multiome (RNA + ATAC) Joint WNN Clustering **Goal**: Process paired RNA and ATAC data from 10x Genomics Multiome, build a WNN joint embedding, cluster cells, and identify RNA marker genes per cluster. ```python import muon as mu import scanpy as sc import numpy as np import anndata as ad import pandas as pd import matplotlib.pyplot as plt # --- 1. Simulate 10x Multiome-like data (replace with mu.read_10x_h5()) --- np.random.seed(0) n_cells = 400 cell_ids = [f"AAACGAATC{i:05d}-1" for i in range(n_cells)] # Simulate two cell types with distinct expression/accessibility labels = np.array(["typeA"] * 200 + ["typeB"] * 200) rna_counts = np.abs(np.random.negative_binomial(6, 0.35, (n_cells, 2000))).astype(float) rna_counts[:200, :100] += 10 # typeA marker genes rna_counts[200:, 100:200] += 10 # typeB marker genes atac_counts = np.abs(np.random.binomial(1, 0.05, (n_cells, 50000))).astype(float) atac_counts[:200, :5000] += 1 # typeA accessible peaks atac_counts[200:, 5000:10000] += 1 # typeB accessible peaks rna = ad.AnnData(rna_counts, obs=pd.DataFrame({"cell_type": labels}, index=cell_ids), var=pd.DataFrame(index=[f"gene_{j}" for j in range(2000)])) atac = ad.AnnData(atac_counts, obs=pd.DataFrame({"cell_type": labels}, index=cell_ids), var=pd.DataFrame(index=[f"peak_{j}" for j in range(50000)])) mdata = mu.MuData({"rna": rna, "atac": atac}) print(f"Loaded: {mdata}") # --- 2. RNA preprocessing --- sc.pp.filter_cells(mdata["rna"], min_genes=50) sc.pp.filter_genes(mdata["rna"], min_cells=5) sc.pp.normalize_total(mdata["rna"], target_sum=1e4) sc.pp.log1p(mdata["rna"]) sc.pp.highly_variable_genes(mdata["rna"], n_top_genes=500) sc.pp.pca(mdata["rna"], n_comps=30, use_highly_variable=True) print(f"RNA PCA: {mdata['rna'].obsm['X_pca'].shape}") # --- 3. ATAC preprocessing: TF-IDF + LSI --- sc.pp.filter_cells(mdata["atac"], min_genes=100) sc.pp.filter_genes(mdata["atac"], min_cells=5) mu.atac.pp.tfidf(mdata["atac"]) mu.atac.tl.lsi(mdata["atac"], n_comps=30) print(f"ATAC LSI: {mdata['atac'].obsm['X_lsi'].shape}") # --- 4. Per-modality neighbors (optional for modality-specific plots) --- mu.pp.neighbors(mdata["rna"], use_rep="X_pca", n_neighbors=15, key_added="neighbors") mu.pp.neighbors(mdata["atac"], use_rep="X_lsi", n_neighbors=15, key_added="neighbors") # --- 5. WNN joint neighbor graph --- mu.pp.intersect_obs(mdata) # align obs after per-modality filtering mu.pp.neighbors(mdata, key_added="wnn", use_rep={"rna": "X_pca", "atac": "X_lsi"}, n_neighbors=15, random_state=42) # --- 6. UMAP + Leiden clustering --- sc.tl.umap(mdata, neighbors_key="wnn", random_state=42) sc.tl.leiden(mdata, neighbors_key="wnn", resolution=0.5, key_added="leiden_wnn") print(f"Clusters: {mdata.obs['leiden_wnn'].value_counts().to_dict()}") # --- 7. Marker genes per cluster --- sc.tl.rank_genes_groups(mdata["rna"], groupby=mdata.obs.loc[mdata["rna"].obs_names, "leiden_wnn"], method="wilcoxon", use_raw=False) top_markers = sc.get.rank_genes_groups_df(mdata["rna"], group="0").head(5) print("Top RNA markers for cluster 0:") print(top_markers[["names", "logfoldchanges", "pvals_adj"]]) # --- 8. Visualization --- fig, axes = plt.subplots(1, 2, figsize=(12, 5)) sc.pl.umap(mdata, color="leiden_wnn", ax=axes[0], show=False, title="WNN clusters") sc.pl.umap(mdata, color="rna:gene_0", ax=axes[1], show=False, title="gene_0 expression", use_raw=False, vmax="p99") plt.tight_layout() plt.savefig("multiome_wnn_pipeline.png", dpi=150, bbox_inches="tight") plt.close() print("Saved multiome_wnn_pipeline.png") ``` ### Workflow 2: CITE-seq (RNA + Surface Protein) Analysis **Goal**: Process CITE-seq data with paired RNA and antibody-derived tag (ADT/protein) counts. Normalize proteins with CLR, annotate cell types from protein markers, and build a joint UMAP. ```python import muon as mu import scanpy as sc import numpy as np import anndata as ad import pandas as pd import matplotlib.pyplot as plt from scipy.sparse import csr_matrix np.random.seed(1) n_cells = 350 cell_ids = [f"CITESEQ_{i:04d}" for i in range(n_cells)] # Simulate RNA + protein modalities # Three cell types: T-cells (CD3+CD4+), B-cells (CD19+CD20+), Monocytes (CD14+CD16+) n_t, n_b, n_mono = 120, 130, 100 labels = ["Tcell"] * n_t + ["Bcell"] * n_b + ["Monocyte"] * n_mono rna_mat = np.abs(np.random.negative_binomial(4, 0.4, (n_cells, 1500))).astype(float) rna_mat[:n_t, :50] += 15 # T-cell RNA markers rna_mat[n_t:n_t+n_b, 50:100] += 15 # B-cell RNA markers rna_mat[n_t+n_b:, 100:150] += 15 # Monocyte RNA markers # 20 surface proteins: first 5 T-cell, next 5 B-cell, next 5 Monocyte, rest baseline prot_mat = np.abs(np.random.normal(2, 0.5, (n_cells, 20))) prot_mat[:n_t, :5] += 8 # CD3, CD4, CD5, CD7, CD8 for T-cells prot_mat[n_t:n_t+n_b, 5:10] += 8 # CD19, CD20, CD22, CD24, CD79 for B-cells prot_mat[n_t+n_b:, 10:15] += 8 # CD14, CD16, CD64, CD11b, HLA-DR for Monocytes protein_names = ["CD3", "CD4", "CD5", "CD7", "CD8", "CD19", "CD20", "CD22", "CD24", "CD79a", "CD14", "CD16", "CD64", "CD11b", "HLA-DR", "CD25", "CD56", "CD45RA", "CD45RO", "IgG-ctrl"] rna = ad.AnnData(csr_matrix(rna_mat), obs=pd.DataFrame({"cell_type": labels}, index=cell_ids), var=pd.DataFrame(index=[f"gene_{j}" for j in range(1500)])) prot = ad.AnnData(prot_mat, obs=pd.DataFrame({"cell_type": labels}, index=cell_ids), var=pd.DataFrame(index=protein_names)) mdata = mu.MuData({"rna": rna, "protein": prot}) print(mdata) # --- RNA preprocessing --- sc.pp.normalize_total(mdata["rna"], target_sum=1e4) sc.pp.log1p(mdata["rna"]) sc.pp.highly_variable_genes(mdata["rna"], n_top_genes=500) sc.pp.pca(mdata["rna"], n_comps=30, use_highly_variable=True) # --- Protein CLR normalization (Centered Log-Ratio) --- # CLR normalizes each protein across cells: log(x / geometric_mean(x)) mu.prot.pp.clr(mdata["protein"]) print("Protein CLR normalization applied") print(f"Protein data range: [{mdata['protein'].X.min():.2f}, {mdata['protein'].X.max():.2f}]") # PCA on protein modality sc.pp.pca(mdata["protein"], n_comps=min(15, prot.n_vars - 1)) # --- WNN joint embedding --- mu.pp.neighbors(mdata, key_added="wnn", use_rep={"rna": "X_pca", "protein": "X_pca"}, n_neighbors=20, random_state=0) sc.tl.umap(mdata, neighbors_key="wnn", random_state=0) sc.tl.leiden(mdata, neighbors_key="wnn", resolution=0.4, key_added="leiden_wnn") # --- Protein-based cell type annotation --- # Compute mean CLR protein per cluster prot_df = pd.DataFrame( mdata["protein"].X, index=mdata["protein"].obs_names, columns=protein_names ) prot_df["cluster"] = mdata.obs.loc[mdata["protein"].obs_names, "leiden_wnn"].values cluster_means = prot_df.groupby("cluster").mean() print("\nMean CLR protein per cluster:") print(cluster_means[["CD3", "CD4", "CD19", "CD14"]].round(2)) # --- Visualization --- fig, axes = plt.subplots(1, 3, figsize=(15, 4)) sc.pl.umap(mdata, color="leiden_wnn", ax=axes[0], show=False, title="WNN Leiden") sc.pl.umap(mdata, color="protein:CD3", ax=axes[1], show=False, title="CD3 (CLR)", use_raw=False, vmax="p99") sc.pl.umap(mdata, color="protein:CD19", ax=axes[2], show=False, title="CD19 (CLR)", use_raw=False, vmax="p99") plt.tight_layout() plt.savefig("citeseq_joint_umap.png", dpi=150, bbox_inches="tight") plt.close() print("Saved citeseq_joint_umap.png") # --- Dot plot: protein markers per cluster --- sc.pl.dotplot(mdata["protein"], var_names=["CD3", "CD4", "CD19", "CD20", "CD14", "CD16"], groupby=mdata.obs.loc[mdata["protein"].obs_names, "leiden_wnn"], show=False) plt.savefig("citeseq_protein_dotplot.png", dpi=150, bbox_inches="tight") plt.close() print("Saved citeseq_protein_dotplot.png") ``` ## Key Parameters | Parameter | Module / Function | Default | Range / Options | Effect | |-----------|-------------------|---------|-----------------|--------| | `n_neighbors` | `mu.pp.neighbors()` | `30` | `10`–`100` | Neighborhood size for graph; lower = finer local structure | | `use_rep` | `mu.pp.neighbors()` | `None` (auto) | dict of `{modality: embedding_key}` | Which embedding per modality to use for WNN | | `key_added` | `mu.pp.neighbors()` | `"neighbors"` | any string | Key under which graph is stored; use `"wnn"` for joint graphs | | `n_comps` | `sc.pp.pca()`, `mu.atac.tl.lsi()` | `50` | `10`–`100` | Number of reduced dimensions; 30–50 typical | | `resolution` | `sc.tl.leiden()` | `1.0` | `0.1`–`2.0` | Clustering granularity; lower = fewer, larger clusters | | `scale_factor` | `mu.atac.pp.tfidf()` | `1e4` | `1e3`–`1e5` | TF scaling constant before log transform | | `n_factors` | `mu.tl.mofa()` | `10` | `5`–`50` | Number of MOFA latent factors; use elbow on variance explained | | `target_sum` | `sc.pp.normalize_total()` | `1e4` | `1e3`–`1e6` | Library size normalization target per cell (RNA) | | `n_top_genes` | `sc.pp.highly_variable_genes()` | varies | `1000`–`5000` | Number of highly variable genes to retain for PCA | ## Best Practices 1. **Align obs before WNN**: After per-modality QC filtering, cell counts across modalities may differ. Always call `mu.pp.intersect_obs(mdata)` before WNN to ensure every modality has the same cell set. ```python mu.pp.intersect_obs(mdata) print(f"Shared cells after intersect: {mdata.n_obs}") ``` 2. **Drop LSI component 1 from ATAC**: The first LSI component captures sequencing depth (total counts per cell) rather than biological signal. `mu.atac.tl.lsi()` stores all components in `X_lsi`, so verify that component 1 correlates with `log_total_counts` before using `X_lsi` in WNN; if so, restrict to `X_lsi[:, 1:]`. ```python import numpy as np, pandas as pd atac = mdata["atac"] corr = np.corrcoef(atac.obsm["X_lsi"][:, 0], np.log1p(atac.obs["total_counts"]))[0, 1] print(f"LSI1 vs log_counts correlation: {corr:.3f}") # If |corr| > 0.9, exclude component 1: # atac.obsm["X_lsi"] = atac.obsm["X_lsi"][:, 1:] ``` 3. **Use `key_added="wnn"` consistently**: Always name the joint graph `"wnn"` and pass `neighbors_key="wnn"` to all downstream `sc.tl.umap()`, `sc.tl.leiden()`, and `sc.tl.paga()` calls. Mixing `neighbors_key` values silently uses the wrong graph. 4. **CLR normalization for proteins, not log-normalization**: Antibody-derived tag (ADT) counts from CITE-seq have a different noise model than RNA. Use `mu.prot.pp.clr()` for per-protein CLR normalization. Do NOT apply `sc.pp.normalize_total()` + `sc.pp.log1p()` to the protein modality. 5. **Store raw RNA counts before normalization**: Before any normalization, store the raw RNA counts in `mdata["rna"].layers["counts"]` and set `mdata["rna"].raw = mdata["rna"]`. This is required for downstream differential expression tests and scvi-tools integration. ```python import scipy.sparse as sp mdata["rna"].layers["counts"] = mdata["rna"].X.copy() # Store pre-normalization snapshot mdata["rna"].raw = mdata["rna"] ``` 6. **Match WNN embedding dimensions**: The RNA PCA and ATAC LSI embeddings passed to WNN should have comparable dimensionality (e.g., both 30 or 50 components). Mismatched dimensions do not cause errors but can down-weight the smaller modality. ## Common Recipes ### Recipe: Compute Per-Cluster Modality Weights from WNN When to use: Inspect which cells rely more on RNA vs ATAC signal in the WNN graph. High RNA weight in a cluster suggests cleaner RNA data; high ATAC weight suggests stronger chromatin accessibility signal. ```python # WNN stores per-cell modality weights in mdata.obsm after mu.pp.neighbors() # Key name: "{key_added}_weights" — check available keys print([k for k in mdata.obsm.keys() if "wnn" in k.lower()]) # If weights are stored (depends on muon version): if "wnn_weights" in mdata.obsm: weights_df = pd.DataFrame( mdata.obsm["wnn_weights"], index=mdata.obs_names, columns=["rna_weight", "atac_weight"] ) weights_df["cluster"] = mdata.obs["leiden_wnn"].values print(weights_df.groupby("cluster").mean().round(3)) else: # Proxy: compare RNA vs ATAC PCA explained variance per cell rna_var = np.var(mdata["rna"].obsm["X_pca"], axis=1) atac_var = np.var(mdata["atac"].obsm["X_lsi"], axis=1) mdata.obs["rna_signal_proxy"] = rna_var / (rna_var + atac_var) mdata.obs["atac_signal_proxy"] = atac_var / (rna_var + atac_var) print(mdata.obs[["rna_signal_proxy", "atac_signal_proxy", "leiden_wnn"]].groupby("leiden_wnn").mean().round(3)) ``` ### Recipe: Export Cluster Labels Back to Per-Modality AnnData When to use: After joint WNN clustering, copy shared cluster labels back into each modality's `.obs` for modality-specific downstream analyses (e.g., differential peaks within clusters). ```python # Copy joint cluster labels into each modality's obs for mod_name in mdata.mod: mod_adata = mdata[mod_name] shared_cells = mod_adata.obs_names mod_adata.obs["leiden_wnn"] = mdata.obs.loc[shared_cells, "leiden_wnn"].values print(f"{mod_name}: added leiden_wnn, {mod_adata.obs['leiden_wnn'].nunique()} clusters") # Now run modality-specific differential analysis per cluster sc.tl.rank_genes_groups(mdata["rna"], groupby="leiden_wnn", method="wilcoxon", use_raw=False) sc.tl.rank_genes_groups(mdata["atac"], groupby="leiden_wnn", method="wilcoxon", use_raw=False) print("Differential features computed for both modalities") ``` ### Recipe: Convert MuData to Concatenated AnnData for scvi-tools When to use: MultiVI and totalVI in scvi-tools require a concatenated AnnData with modality identity tracked in `var`. This recipe prepares the input for these models. ```python # Concatenate RNA and ATAC into a single AnnData with modality column in var rna_adata = mdata["rna"].copy() atac_adata = mdata["atac"].copy() rna_adata.var["modality"] = "Gene Expression" atac_adata.var["modality"] = "Peaks" # Restore raw counts for scvi-tools (requires integer counts) # Ensure mdata["rna"].layers["counts"] and mdata["atac"].X are integer import anndata as ad combined = ad.concat([rna_adata, atac_adata], axis=1, merge="unique") combined.obs = mdata.obs.loc[combined.obs_names].copy() print(f"Combined AnnData: {combined.n_obs} cells × {combined.n_vars} features") print(combined.var["modality"].value_counts()) # Use combined with scvi.model.MULTIVI.setup_anndata(combined, ...) ``` ## Troubleshooting | Problem | Cause | Solution | |---------|-------|----------| | `KeyError: 'rna'` when building MuData | Modality key not set or loaded from file without expected names | Check `mdata.mod.keys()`. When loading 10x h5 files, use `mu.read_10x_h5()` which auto-names modalities `"rna"` and `"atac"` | | `ValueError: obs_names mismatch` during WNN | Cells filtered differently per modality leaving mismatched obs | Call `mu.pp.intersect_obs(mdata)` after per-modality QC to harmonize cell sets | | UMAP/Leiden uses wrong graph | Multiple neighbor graphs exist; function uses default key `"neighbors"` | Always pass `neighbors_key="wnn"` explicitly to `sc.tl.umap()` and `sc.tl.leiden()` | | LSI component 1 dominates ATAC embedding | First component captures sequencing depth, not biology | Compute correlation of `X_lsi[:, 0]` with `log_total_counts`; if |r| > 0.9, use `X_lsi[:, 1:]` | | `mu.atac.pp.tfidf` raises sparse matrix error | Input matrix is dense or wrong dtype | Convert first: `mdata["atac"].X = scipy.sparse.csr_matrix(mdata["atac"].X.astype(float))` | | CLR normalization produces NaN for proteins | Zero counts in a cell for all proteins | Filter cells with `sc.pp.filter_cells(mdata["protein"], min_genes=1)` before CLR | | `mu.tl.mofa()` fails with missing mofapy2 | mofapy2 not installed | `pip install mofapy2`; ensure `muon` version ≥ 0.1.5 for MOFA integration | | Memory error for large ATAC peak matrices | ATAC matrices (50k+ peaks × 10k+ cells) exceed RAM | Use `sc.pp.highly_variable_genes()` to select top 50k peaks first, or load in chunks | ## Related Skills - **scanpy-scrna-seq** — single-modality RNA analysis; use as the foundation for the RNA preprocessing steps within muon - **scvi-tools-single-cell** — deep generative multi-modal integration (MultiVI for RNA+ATAC, totalVI for CITE-seq); use when probabilistic batch correction across samples is needed - **mofaplus-multi-omics** — multi-omics factor analysis; `mu.tl.mofa()` calls mofapy2 internally and stores factors in MuData - **anndata-data-structure** — AnnData fundamentals; each MuData modality is a standard AnnData object - **deeptools-ngs-analysis** — upstream ATAC-seq BAM → bigWig normalization before peak calling - **macs3-peak-calling** — produces the peak BED files used to generate the ATAC AnnData count matrix ## References - [muon documentation](https://muon.readthedocs.io/) — official docs, tutorials, API reference - [muon GitHub (scverse)](https://github.com/scverse/muon) — source code, issues, examples - [Bredikhin et al. Genome Biology 2022](https://doi.org/10.1186/s13059-021-02577-8) — muon/MuData publication - [10x Genomics Multiome analysis guide](https://muon-tutorials.readthedocs.io/en/latest/single-cell-rna-atac/pbmc10k/1-Gene-Expression-Processing.html) — step-by-step Multiome tutorial in muon - [WNN method (Hao et al. Cell 2021)](https://doi.org/10.1016/j.cell.2021.04.048) — original Weighted Nearest Neighbor paper from Seurat v4