# Single-Cell Atlas Construction

Full single-cell RNA-seq pipeline from raw counts to biological interpretation. Covers QC, normalization, batch integration, clustering, annotation, pseudobulk DE, trajectory inference, cell-cell communication, and TF activity. Dual-language: Seurat v5 (R) and scanpy (Python).

## When to Use This Skill

Activate when the user requests:

- Single-cell RNA-seq quality control and preprocessing
- 10x Chromium data processing (Cell Ranger output)
- Doublet detection and removal
- Normalization of UMI count matrices
- Feature selection for dimensionality reduction
- Processing scRNA-seq data before integration, clustering, or annotation
- Batch integration across samples or experiments
- Clustering and resolution selection
- Automated or marker-based cell type annotation
- UMAP visualization
- Differential expression between conditions (pseudobulk)
- Trajectory inference and RNA velocity
- Cell-cell communication analysis
- Transcription factor activity and gene regulatory networks

## Inputs

| Data Type | Format | Source |
|-----------|--------|--------|
| Count matrix | 10x HDF5 (`.h5`), MEX (matrix.mtx + barcodes + features) | Cell Ranger, STARsolo |
| Count matrix | AnnData (`.h5ad`) for scanpy, Seurat object (`.rds`) for R | Processed datasets |
| Reference | PBMC 3k, Tabula Muris, HCA datasets | 10x Genomics, HCA |

---

## Loading Data

### Seurat v5 (R)

```r
library(Seurat)  # v5.4+

# From Cell Ranger output (filtered_feature_bc_matrix/)
obj <- Read10X("path/to/filtered_feature_bc_matrix/") |>
  CreateSeuratObject(project = "sample1", min.cells = 3, min.features = 200)

# From HDF5
obj <- Read10X_h5("path/to/filtered_feature_bc_matrix.h5") |>
  CreateSeuratObject(project = "sample1", min.cells = 3, min.features = 200)

# Seurat v5 uses Assay5 (layers: counts, data, scale.data)
# Access counts: obj[["RNA"]]$counts or LayerData(obj, layer = "counts")
```

### scanpy (Python)

```python
import scanpy as sc  # v1.12+

# From Cell Ranger output
adata = sc.read_10x_mtx("path/to/filtered_feature_bc_matrix/", var_names="gene_symbols")

# From HDF5
adata = sc.read_10x_h5("path/to/filtered_feature_bc_matrix.h5")

# Store raw counts for later
adata.raw = adata.copy()
```

---

## Quality Control

### Computing QC Metrics

#### Seurat

```r
obj[["percent.mt"]] <- PercentageFeatureSet(obj, pattern = "^MT-")
# Mouse: pattern = "^mt-"
# Ribosomal: pattern = "^RP[SL]" (optional, rarely used for filtering)

# QC stats are in obj@meta.data:
#   nCount_RNA   = total UMI per cell
#   nFeature_RNA = genes detected per cell
#   percent.mt   = mitochondrial fraction
```

#### scanpy

```python
adata.var["mt"] = adata.var_names.str.startswith("MT-")
sc.pp.calculate_qc_metrics(adata, qc_vars=["mt"], percent_top=None, log1p=True, inplace=True)

# Adds to adata.obs:
#   total_counts, n_genes_by_counts, pct_counts_mt
#   log1p_total_counts, log1p_n_genes_by_counts
```

### MAD-Based Filtering

```
Why MAD over fixed thresholds:
  Fixed thresholds (>200 genes, <5% MT) are arbitrary and dataset-specific.
  A threshold that works for PBMCs fails for tumor tissues (higher MT baseline)
  or deep-sequenced datasets (more genes detected per cell).
  MAD (median absolute deviation) adapts to the actual data distribution.

Flag cells that deviate > N MADs from the median in the problematic direction.
  Default: 3 MADs (~99% of values under normality).
  Permissive: 5 MADs (for heterogeneous tissues).
```
#### Seurat (manual MAD implementation)

```r
# Compute MAD thresholds per metric
mad_outlier <- function(x, nmads = 3, direction = "both") {
  med <- median(x, na.rm = TRUE)
  mad_val <- mad(x, na.rm = TRUE)
  if (direction == "lower") return(x < med - nmads * mad_val)
  if (direction == "higher") return(x > med + nmads * mad_val)
  (x < med - nmads * mad_val) | (x > med + nmads * mad_val)
}

obj$outlier_mt     <- mad_outlier(obj$percent.mt, nmads = 3, direction = "higher")
obj$outlier_ncount <- mad_outlier(log1p(obj$nCount_RNA), nmads = 3, direction = "both")
obj$outlier_nfeat  <- mad_outlier(log1p(obj$nFeature_RNA), nmads = 3, direction = "lower")

obj$discard <- obj$outlier_mt | obj$outlier_ncount | obj$outlier_nfeat
cat(sprintf("Cells flagged: %d / %d (%.1f%%)\n",
            sum(obj$discard), ncol(obj), 100 * mean(obj$discard)))

obj <- subset(obj, subset = discard == FALSE)
```

#### scanpy (MAD filtering)

```python
import numpy as np

def mad_outlier(adata, metric, nmads=3, direction="higher"):
    """Flag outliers beyond N MADs from median."""
    vals = adata.obs[metric].values
    med = np.median(vals)
    mad = np.median(np.abs(vals - med)) * 1.4826  # scale to match SD
    if direction == "higher":
        return vals > med + nmads * mad
    elif direction == "lower":
        return vals < med - nmads * mad
    else:
        return (vals < med - nmads * mad) | (vals > med + nmads * mad)

adata.obs["outlier_mt"] = mad_outlier(adata, "pct_counts_mt", nmads=3, direction="higher")
adata.obs["outlier_counts"] = mad_outlier(adata, "log1p_total_counts", nmads=3, direction="both")
adata.obs["outlier_genes"] = mad_outlier(adata, "log1p_n_genes_by_counts", nmads=3, direction="lower")

adata.obs["discard"] = adata.obs["outlier_mt"] | adata.obs["outlier_counts"] | adata.obs["outlier_genes"]
print(f"Cells flagged: {adata.obs['discard'].sum()} / {adata.n_obs} "
      f"({100 * adata.obs['discard'].mean():.1f}%)")

adata = adata[~adata.obs["discard"]].copy()
```

#### Bioconductor (scater — dedicated function)

```r
library(scater)  # Bioconductor

sce <- as.SingleCellExperiment(obj)
sce <- addPerCellQCMetrics(sce, subsets = list(mt = grep("^MT-", rownames(sce))))

# isOutlier handles log-transformation and batch-aware thresholds
discard <- isOutlier(sce$subsets_mt_percent, nmads = 3, type = "higher") |
  isOutlier(sce$detected, nmads = 3, type = "lower", log = TRUE) |
  isOutlier(sce$sum, nmads = 3, type = "both", log = TRUE)

sce <- sce[, !discard]
```
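The scater comment above mentions batch-aware thresholds (`isOutlier(..., batch = ...)`); the scanpy helper above is global. When samples differ in quality, a per-sample variant keeps one poor sample from shifting the shared median (see also pitfall 3 under Common Pitfalls). A minimal sketch, assuming a `sample_id` column in `adata.obs` (the column name is an assumption to adapt):

```python
import numpy as np

def mad_outlier_per_sample(adata, metric, sample_key="sample_id",
                           nmads=3, direction="higher"):
    """Flag outliers per sample so one low-quality sample cannot hide outliers in others."""
    flags = np.zeros(adata.n_obs, dtype=bool)
    for sample in adata.obs[sample_key].unique():
        mask = (adata.obs[sample_key] == sample).values
        vals = adata.obs.loc[mask, metric].values
        med = np.median(vals)
        mad = np.median(np.abs(vals - med)) * 1.4826  # scale to match SD
        if direction == "higher":
            flags[mask] = vals > med + nmads * mad
        elif direction == "lower":
            flags[mask] = vals < med - nmads * mad
        else:
            flags[mask] = (vals < med - nmads * mad) | (vals > med + nmads * mad)
    return flags

adata.obs["outlier_mt"] = mad_outlier_per_sample(adata, "pct_counts_mt", nmads=3)
```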
### Gene Filtering

```r
# Remove genes detected in very few cells
# min.cells = 3 in CreateSeuratObject handles this for Seurat
# For scanpy:
#   sc.pp.filter_genes(adata, min_cells=3)

# Also consider removing:
#   Mitochondrial genes (after using them for QC)
#   Ribosomal genes (optional, sometimes informative)
#   Sex-linked genes (if confounding — XIST, DDX3Y, etc.)
```

---

## Doublet Detection

### scDblFinder (R — top performer in benchmarks)

```r
library(scDblFinder)  # Bioconductor

sce <- as.SingleCellExperiment(obj)
sce <- scDblFinder(sce)

# Results in colData(sce)$scDblFinder.class: "singlet" or "doublet"
# Also: scDblFinder.score (continuous), scDblFinder.cxds_score
table(sce$scDblFinder.class)

# Expected doublet rate for 10x Chromium: ~0.8% per 1000 cells captured
#   5000 cells -> ~4% doublets, 10000 cells -> ~8% doublets

# Transfer back to Seurat
obj$doublet_class <- sce$scDblFinder.class
obj$doublet_score <- sce$scDblFinder.score
obj <- subset(obj, subset = doublet_class == "singlet")
```

### Scrublet (Python — integrated in scanpy)

```python
# Scrublet is now in the core scanpy API (not external)
sc.pp.scrublet(adata, expected_doublet_rate=0.06)

# Adds to adata.obs:
#   predicted_doublet (bool), doublet_score (float)
print(f"Detected doublets: {adata.obs['predicted_doublet'].sum()}")

adata = adata[~adata.obs["predicted_doublet"]].copy()
```

```
Doublet detection order:
  Run BEFORE normalization and feature selection. Doublets with mixed
  transcriptomes distort HVG selection and normalization size factors.

For multi-sample experiments:
  Run doublet detection PER SAMPLE, not on merged data. Cross-sample doublets
  don't exist in droplet-based protocols.
```
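For the multi-sample case above, a minimal scanpy sketch that calls Scrublet one sample at a time and then recombines the object. It assumes raw counts in `adata.X` and a `sample_id` column in `adata.obs` (both assumptions to adapt); scDblFinder users can achieve the same with its `samples` argument.

```python
import anndata as ad
import scanpy as sc

pieces = []
for sample in adata.obs["sample_id"].unique():
    sub = adata[adata.obs["sample_id"] == sample].copy()
    # Doublets are simulated from this sample's cells only
    sc.pp.scrublet(sub, expected_doublet_rate=0.06)
    pieces.append(sub)

adata = ad.concat(pieces, join="outer", merge="first")
print(adata.obs.groupby("sample_id")["predicted_doublet"].mean())

adata = adata[~adata.obs["predicted_doublet"]].copy()
```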
---

## Normalization

### Decision Tree

```
Which normalization method?

Seurat v5 workflow, plan to integrate multiple samples?
  -> SCTransform(obj) — v2 is default in Seurat v5
  -> Replaces NormalizeData + FindVariableFeatures + ScaleData in one call
  -> Returns 3000 variable features by default
  -> Requires glmGamPoi package for speed

Seurat v5, single sample or exploratory analysis?
  -> NormalizeData(obj) + FindVariableFeatures(obj) + ScaleData(obj)
  -> Log-normalization: counts / total * 10000 -> log1p
  -> Fast, simple, adequate for most single-sample analyses

scanpy workflow?
  -> sc.pp.normalize_total(adata, target_sum=1e4) + sc.pp.log1p(adata)
  -> Standard log-normalization, analogous to Seurat's LogNormalize

Need principled size factor estimation (composition bias)?
  -> scran::computeSumFactors() on a SingleCellExperiment
  -> Requires pre-clustering with scran::quickCluster()
  -> Slowest method but handles composition bias best

Dealing with very large datasets (>500k cells)?
  -> SCTransform with BPCells on-disk storage (Seurat v5)
  -> Or scanpy with backed mode: sc.read_h5ad("data.h5ad", backed="r")
```

### SCTransform v2 (Seurat)

```r
# SCTransform v2 is the default in Seurat v5
# Uses glmGamPoi for fast regularized negative binomial regression
# install.packages("glmGamPoi")  # or BiocManager::install("glmGamPoi")

obj <- SCTransform(obj, verbose = FALSE)

# Replaces NormalizeData + FindVariableFeatures + ScaleData
# Sets default assay to "SCT"
# Variable features stored in VariableFeatures(obj)

# For multiple samples, run SCTransform on each before integration
# Then use IntegrateLayers() or SelectIntegrationFeatures()
```

### Log-Normalization (Seurat)

```r
obj <- NormalizeData(obj)  # LogNormalize, scale.factor = 10000
obj <- FindVariableFeatures(obj, selection.method = "vst", nfeatures = 2000)
obj <- ScaleData(obj)  # centers and scales, optionally regress out variables
# ScaleData(obj, vars.to.regress = "percent.mt")  # regress out MT% if needed
```

### scanpy Normalization

```python
# Store raw counts first
adata.layers["counts"] = adata.X.copy()

sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)

# adata.X now contains log-normalized data
# adata.layers["counts"] retains raw counts for DE analysis later
```

### scran Deconvolution (Bioconductor)

```r
library(scran)

sce <- as.SingleCellExperiment(obj)

# Pre-cluster to avoid violating the non-DE majority assumption
clusters <- quickCluster(sce)
sce <- computeSumFactors(sce, clusters = clusters)

# Size factors in sizeFactors(sce), centered around 1
sce <- logNormCounts(sce)
# logcounts(sce) now contains normalized, log-transformed values
```

---

## Feature Selection

### Seurat (VST)

```r
# After NormalizeData:
obj <- FindVariableFeatures(obj, selection.method = "vst", nfeatures = 2000)

# After SCTransform: already done (default 3000 features)

# Inspect top variable genes
head(VariableFeatures(obj), 20)

# Visualize variance-mean relationship
VariableFeaturePlot(obj)
```

### scanpy

```python
# Flavor depends on input: seurat_v3 needs raw counts, seurat needs log-normalized
# Point seurat_v3 at the raw counts layer saved before normalization
sc.pp.highly_variable_genes(adata, flavor="seurat_v3", n_top_genes=2000, layer="counts")

# Adds adata.var["highly_variable"], adata.var["means"], adata.var["variances"]

# Subset to HVGs for PCA (keep all genes in adata.raw)
adata = adata[:, adata.var["highly_variable"]].copy()
```

### Deviance-Based (scry — more principled alternative)

```r
library(scry)  # Bioconductor

sce <- as.SingleCellExperiment(obj)

# Compute binomial deviance per gene on raw counts
sce <- devianceFeatureSelection(sce, assay = "counts", sorted = TRUE)

# Top genes by deviance are most informative
# Select top 2000-4000 genes
hvgs <- head(rownames(sce), 2000)

# Pairs with GLM-PCA instead of standard PCA
# glmpca <- GLMPCA(counts(sce[hvgs, ]), L = 30, fam = "poi")
```

```
Feature selection guidelines:
  2000-3000 HVGs works for most datasets.
  More HVGs (4000-5000) can help for heterogeneous tissues or rare cell types.
  No single method consistently outperforms others (2025 Nature Methods benchmark).
  Deviance-based selection (scry) is more principled than variance-based
  but less widely adopted.
```
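A quick sanity check on the selected HVGs (echoed in the Validation Checks section below): canonical markers should be present and ubiquitous housekeeping genes should not dominate. A minimal sketch assuming a PBMC-like dataset and the scanpy `highly_variable` flag computed above; swap the marker list for your tissue.

```python
# Run before subsetting adata to the HVGs
expected_markers = ["IL7R", "CD14", "MS4A1", "CD8A", "GNLY", "NKG7"]  # PBMC examples
housekeeping = ["ACTB", "GAPDH"]

hvgs = set(adata.var_names[adata.var["highly_variable"]])
print("Markers missing from HVGs:", [g for g in expected_markers if g not in hvgs])
print("Housekeeping genes in HVGs:", [g for g in housekeeping if g in hvgs])
```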
---

## Dimensionality Reduction

### PCA

#### Seurat

```r
obj <- RunPCA(obj, npcs = 50, verbose = FALSE)

# Determine how many PCs to use downstream
ElbowPlot(obj, ndims = 50)
# Pick the elbow — typically 15-30 PCs for most datasets
# Alternatively: use JackStraw (slow) or molecular cross-validation
```

#### scanpy

```python
sc.tl.pca(adata, n_comps=50, svd_solver="arpack")
sc.pl.pca_variance_ratio(adata, n_pcs=50, log=True)
# Pick PCs where variance ratio flattens — typically 15-30
```

### UMAP

```
UMAP is for visualization only.
  Never cluster on UMAP coordinates.
  Distances between clusters in UMAP do NOT reflect biological similarity.
  Cluster size does not reflect heterogeneity.

Parameters that matter:
  n_neighbors: 10-30 (default 15). Lower = more local detail, higher = broader topology.
  min_dist: 0.1-0.5 (default 0.1). Lower = tighter clusters, higher = more spread.

For cell type identification: min_dist = 0.1-0.3
For trajectory visualization: min_dist = 0.3-0.5
```

#### Seurat

```r
obj <- FindNeighbors(obj, dims = 1:30)
obj <- RunUMAP(obj, dims = 1:30, min.dist = 0.3, n.neighbors = 30)
DimPlot(obj, reduction = "umap", group.by = "cell_type")
```

#### scanpy

```python
sc.pp.neighbors(adata, n_neighbors=15, n_pcs=30)
sc.tl.umap(adata, min_dist=0.3)
sc.pl.umap(adata, color=["cell_type", "batch"])
```

---

## Batch Integration

### Method Selection

```
Which integration method?

Few batches (2-5), similar cell compositions?
  -> Harmony: fast, works in PCA space, no GPU needed.
  -> RPCA (Seurat): conservative, preserves biology well.

Many batches (>5), different cell compositions across batches?
  -> scVI: deep generative model, handles complex batch effects.
  -> CCA (Seurat): finds shared correlation structure across batches.

Have partial cell type labels?
  -> scANVI: semi-supervised, uses labels for better integration.
  -> Best overall performer when labels are available (scIB benchmark).

Very large dataset (>500k cells)?
  -> Harmony: scales linearly, runs in seconds on PCA.
  -> Seurat v5 sketch-based: subsamples, integrates sketch, projects back.

Simple/fast first pass?
  -> Harmony. Works well for most cases and takes < 1 minute.

Evaluate integration quality?
  -> scib-metrics: batch ASW, graph iLISI (mixing), cell type ASW, NMI (conservation).
  -> Scoring: 40% batch removal + 60% biological conservation.
```
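The decision tree above points to scib-metrics for judging integration quality once one or more corrected embeddings exist. A minimal sketch with the `scib-metrics` Benchmarker; the embedding keys listed (`X_pca_harmony`, `X_scVI`, produced by the methods below) and the `cell_type` label column are assumptions to adapt to your object.

```python
from scib_metrics.benchmark import Benchmarker

bm = Benchmarker(
    adata,
    batch_key="batch",
    label_key="cell_type",
    embedding_obsm_keys=["X_pca", "X_pca_harmony", "X_scVI"],  # unintegrated + candidates
    n_jobs=4,
)
bm.benchmark()

# Aggregate score weights batch correction 40% and bio conservation 60% (scIB convention)
bm.plot_results_table(min_max_scale=False)
results = bm.get_results(min_max_scale=False)
```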
### Harmony (R)

```r
library(harmony)  # v1.2+

# Run on PCA embeddings — fast, no GPU needed
obj <- RunHarmony(obj, group.by.vars = "batch", dims.use = 1:30)
# theta = 2 (default): diversity penalty. Increase to 3-4 for stubborn batch effects.
# lambda = 1 (default): ridge regression penalty. Decrease for stronger correction.

# Use harmony reduction for downstream steps
obj <- FindNeighbors(obj, reduction = "harmony", dims = 1:30)
obj <- RunUMAP(obj, reduction = "harmony", dims = 1:30)
obj <- FindClusters(obj, resolution = 0.5)
```

### Harmony (scanpy)

```python
import scanpy.external as sce

sce.pp.harmony_integrate(adata, key="batch", basis="X_pca", adjusted_basis="X_pca_harmony")

# Use the corrected embedding for neighbors/UMAP
sc.pp.neighbors(adata, use_rep="X_pca_harmony", n_pcs=30)
sc.tl.umap(adata)
```

### Seurat v5 IntegrateLayers

```r
# Split layers by batch (if not already split after merge)
obj[["RNA"]] <- split(obj[["RNA"]], f = obj$batch)

# Standard workflow: normalize, HVGs, scale, PCA, then integrate
obj <- NormalizeData(obj)
obj <- FindVariableFeatures(obj)
obj <- ScaleData(obj)
obj <- RunPCA(obj)

# Integration — pick one method:
obj <- IntegrateLayers(obj, method = HarmonyIntegration,
                       orig.reduction = "pca", new.reduction = "harmony")
# Or CCA/RPCA:
# obj <- IntegrateLayers(obj, method = CCAIntegration,
#                        orig.reduction = "pca", new.reduction = "integrated.cca")

# Rejoin layers before DE analysis
obj <- JoinLayers(obj)
```

### scVI (Python)

```python
import scvi  # v1.4+

# Setup — requires raw counts in a specific layer
scvi.model.SCVI.setup_anndata(adata, batch_key="batch", layer="counts")

model = scvi.model.SCVI(adata, n_latent=30, n_layers=2, gene_likelihood="nb")
model.train(max_epochs=400, accelerator="auto")
# GPU speeds training ~10x but CPU works for < 100k cells

# Get latent representation
adata.obsm["X_scVI"] = model.get_latent_representation()

# Use for neighbors/UMAP
sc.pp.neighbors(adata, use_rep="X_scVI")
sc.tl.umap(adata)
```

### scANVI (semi-supervised, when partial labels available)

```python
# Start from a trained scVI model
scvi.model.SCANVI.setup_anndata(adata, batch_key="batch", labels_key="cell_type",
                                unlabeled_category="Unknown")

scanvi_model = scvi.model.SCANVI.from_scvi_model(model, labels_key="cell_type",
                                                 unlabeled_category="Unknown")
scanvi_model.train(max_epochs=20)

adata.obsm["X_scANVI"] = scanvi_model.get_latent_representation()
adata.obs["predicted_label"] = scanvi_model.predict()
```

---

## Clustering

### Leiden (preferred over Louvain)

```
Why Leiden over Louvain:
  Louvain can produce badly connected communities (internal disconnections).
  Leiden guarantees well-connected communities.
  The louvain package is no longer actively maintained.
  Leiden is the default in scanpy; Seurat still defaults to Louvain,
  so request Leiden explicitly with algorithm = 4.
```

#### Seurat

```r
obj <- FindNeighbors(obj, reduction = "harmony", dims = 1:30)
obj <- FindClusters(obj, algorithm = 4, resolution = 0.5)
# algorithm = 4 is Leiden. algorithm = 1 is Louvain (the default, legacy).
# resolution: 0.3-0.5 = coarse, 0.8-1.0 = fine, 1.2+ = very fine
```

#### scanpy

```python
sc.tl.leiden(adata, resolution=0.5)
# Requires the leidenalg package
# sc.tl.louvain() is legacy — prefer leiden
```

### Resolution Selection

```
Picking resolution is one of the hardest choices in scRNA-seq analysis.
  Too low: biologically distinct cell types merged.
  Too high: single cell types split into meaningless subclusters.

Approaches:

1. Clustree (R): visualize cluster stability across resolutions 0.1-1.5
     library(clustree)
     clustree(obj, prefix = "RNA_snn_res.")
   Pick the resolution where clusters stabilize (no excessive splitting).

2. Known biology: if you expect ~9 cell types (PBMCs), resolution 0.4-0.6
   gives ~8-12 clusters. If you expect ~20 (complex tissue), go higher.

3. Marker gene validation: run FindAllMarkers at each resolution.
   Clusters that lack distinct markers at high resolution were over-split.

4. Silhouette scores: compute per-cluster silhouette on the PCA embedding.
   Low silhouette = poorly separated cluster = likely over-split.
```

```r
# Clustree: test multiple resolutions
library(clustree)

obj <- FindClusters(obj, algorithm = 4, resolution = seq(0.1, 1.5, 0.1))
clustree(obj, prefix = "RNA_snn_res.")
# Pick the highest resolution with stable cluster assignments
```
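For approach 4 above (silhouette scores), a minimal sketch using scikit-learn on the PCA or Harmony embedding; the embedding key, number of PCs, and cluster column are assumptions to match your object.

```python
from sklearn.metrics import silhouette_samples

emb = adata.obsm["X_pca"][:, :30]            # or adata.obsm["X_pca_harmony"]
labels = adata.obs["leiden"].astype(str).values

# Silhouette is O(n^2): subsample cells first for very large datasets
adata.obs["silhouette"] = silhouette_samples(emb, labels)

# Mean silhouette per cluster: values near 0 or negative suggest over-splitting
per_cluster = adata.obs.groupby("leiden")["silhouette"].mean().sort_values()
print(per_cluster)
```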
---

## Cell Type Annotation

### Decision Tree

```
Which annotation method?

Have a reference atlas for the same tissue?
  -> CellTypist (Python): pre-trained models for immune, lung, gut, brain, etc.
  -> scANVI: transfer labels from reference during integration.

Working in Seurat (R) with known tissue type?
  -> scType: marker-gene database, scores clusters directly.
  -> Fast, customizable, works on Seurat scale.data.

No reference, no database matches your tissue?
  -> Manual annotation with FindAllMarkers + known canonical markers.
  -> Still the gold standard for novel tissues or rare cell types.

Want a quick first pass?
  -> CellTypist with majority_voting=True. Then validate with markers.
```

### CellTypist (Python)

```python
import celltypist
from celltypist import models

# Download models (run once)
models.download_models(force_update=True)

# Input must be log-normalized to 10,000 counts per cell
# If adata.X is already log-normalized, proceed directly
predictions = celltypist.annotate(
    adata,
    model="Immune_All_Low.pkl",  # 90 immune subtypes across 20 tissues
    majority_voting=True         # refines via local subcluster consensus
)
adata = predictions.to_adata()

# Adds: adata.obs["predicted_labels"], adata.obs["over_clustering"],
#       adata.obs["majority_voting"], adata.obs["conf_score"]
```

### scType (R)

```r
# Source scType functions from GitHub
source("https://raw.githubusercontent.com/IanevskiAleksandr/sc-type/master/R/gene_sets_prepare.R")
source("https://raw.githubusercontent.com/IanevskiAleksandr/sc-type/master/R/sctype_score_.R")

# Prepare marker gene database
db <- "https://raw.githubusercontent.com/IanevskiAleksandr/sc-type/master/ScTypeDB_full.xlsx"
gs_list <- gene_sets_prepare(db, "Immune system")

# Score each cell against all cell types
es <- sctype_score(
  scRNAseqData = obj[["RNA"]]$scale.data,  # Seurat v5 layer access
  gs = gs_list$gs_positive,
  gs2 = gs_list$gs_negative
)

# Assign cell types to clusters
cl_results <- do.call("rbind", lapply(unique(obj$seurat_clusters), function(cl) {
  es_cl <- sort(rowSums(es[, rownames(obj@meta.data[obj$seurat_clusters == cl, ])]),
                decreasing = TRUE)
  data.frame(cluster = cl, type = names(es_cl[1]), score = es_cl[1])
}))

obj$sctype_annotation <- cl_results$type[match(obj$seurat_clusters, cl_results$cluster)]
```

### Manual Annotation with Markers

```r
library(dplyr)

# Find markers for each cluster
markers <- FindAllMarkers(obj, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.5)

# Top 5 markers per cluster
top5 <- markers |>
  group_by(cluster) |>
  top_n(5, wt = avg_log2FC)

# Known PBMC markers for reference:
#   CD4+ T:      IL7R, CCR7 (naive), S100A4 (memory)
#   CD8+ T:      CD8A, CD8B
#   B cells:     MS4A1 (CD20), CD79A
#   NK:          GNLY, NKG7, KLRD1
#   CD14 Mono:   CD14, LYZ
#   FCGR3A Mono: FCGR3A, MS4A7
#   DC:          FCER1A, CST3
#   Platelets:   PPBP

# Dot plot of markers across clusters
DotPlot(obj, features = c("IL7R", "CD8A", "MS4A1", "GNLY", "CD14",
                          "FCGR3A", "FCER1A", "PPBP"),
        group.by = "seurat_clusters") + RotatedAxis()
```
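Whichever route is used, the final step is mapping cluster IDs to labels. A minimal scanpy sketch (the mapping below is illustrative, not a real annotation); the Seurat equivalent is `RenameIdents()` followed by storing the result in a metadata column.

```python
cluster_to_type = {            # example mapping — fill in from your marker review
    "0": "CD4 T", "1": "CD14 Mono", "2": "B", "3": "CD8 T",
    "4": "NK", "5": "FCGR3A Mono", "6": "DC", "7": "Platelet",
}
adata.obs["cell_type"] = (
    adata.obs["leiden"].astype(str).map(cluster_to_type).fillna("Unknown")
)
sc.pl.umap(adata, color="cell_type")
```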
---

## Differential Expression (Pseudobulk)

```
Why pseudobulk and not Wilcoxon/t-test on single cells:
  Cells from the same donor are not independent.
  Cell-level tests treat 5000 cells from 3 donors as n=5000, not n=3.
  This inflates significance: thousands of false positives.
  Squair et al. 2021 and Zimmerman et al. 2021 both showed pseudobulk methods
  outperform cell-level tests for cross-condition comparisons.

When cell-level tests are OK:
  Marker gene discovery within a single sample (no cross-sample comparison).
  FindAllMarkers for cluster markers is fine — it's within-sample.

When pseudobulk is required:
  Any comparison across conditions, treatments, or donors.
```

### Pseudobulk with DESeq2 (R)

```r
library(DESeq2)

# Aggregate raw counts: sum per gene, grouped by sample + cell_type
# Seurat v5: JoinLayers first, then extract counts
obj <- JoinLayers(obj)
counts <- obj[["RNA"]]$counts

pb <- aggregate(t(as.matrix(counts)),
                by = list(sample = obj$sample_id, cell_type = obj$cell_type),
                FUN = sum)
# Reshape to genes x pseudobulk_samples matrix
# Or use scuttle::aggregateAcrossCells (cleaner)

# With scuttle (recommended)
library(scuttle)
sce <- as.SingleCellExperiment(obj)
pb_sce <- aggregateAcrossCells(sce,
                               ids = DataFrame(sample = sce$sample_id,
                                               cell_type = sce$cell_type))

# Run DESeq2 on one cell type
ct <- "CD8+ T"
pb_ct <- pb_sce[, pb_sce$cell_type == ct]

dds <- DESeqDataSetFromMatrix(
  countData = counts(pb_ct),
  colData = colData(pb_ct),
  design = ~ condition  # e.g., treatment vs control
)
dds <- DESeq(dds)
res <- results(dds, alpha = 0.05)
```

### Pseudobulk with decoupleR + PyDESeq2 (Python)

```python
import decoupler as dc

# Aggregate counts by sample + cell_type
pdata = dc.get_pseudobulk(
    adata,
    sample_col="sample_id",
    groups_col="cell_type",
    layer="counts",
    min_cells=10,
    min_counts=1000
)

# Test one cell type with PyDESeq2 (decoupler aggregates; it does not run DESeq2 itself)
# Note: depending on the pydeseq2 version, the design is passed as
# design_factors="condition" or design="~condition"
from pydeseq2.dds import DeseqDataSet
from pydeseq2.ds import DeseqStats

pdata_ct = pdata[pdata.obs["cell_type"] == "CD8+ T"].copy()
dds = DeseqDataSet(adata=pdata_ct, design_factors="condition")
dds.deseq2()

# Contrast levels ("treated", "control") are placeholders — use your condition labels
stat_res = DeseqStats(dds, contrast=["condition", "treated", "control"])
stat_res.summary()
res_df = stat_res.results_df  # log2FoldChange, padj per gene
```

---

## Trajectory Inference

### Method Selection

```
Which trajectory method?

Want cluster-level connectivity (which cell types connect)?
  -> PAGA (scanpy, fast, no assumptions about topology)
  -> Run first to understand the coarse structure.

Want continuous pseudotime along a lineage?
  -> Diffusion pseudotime (scanpy sc.tl.dpt) — needs a root cell
  -> Monocle3 order_cells() — learns a graph, picks root interactively

Want RNA velocity (transcriptional dynamics)?
  -> scVelo dynamical mode — requires spliced/unspliced counts from velocyto/STARsolo
  -> Caveat: assumes active transcription. Fails for stable/mature populations.
  -> Validate with CellRank 2 for fate probabilities.

Branching trajectories?
  -> Monocle3: learns principal graph with automatic branch point detection.
  -> PAGA + DPT: use PAGA for topology, DPT for pseudotime per branch.
```
### PAGA + Diffusion Pseudotime (scanpy)

```python
import numpy as np

# PAGA: cluster-level trajectory graph
sc.tl.paga(adata, groups="leiden")
sc.pl.paga(adata, threshold=0.03, show=True)
# Edges = statistically supported connections between clusters
# threshold: remove weak edges (default 0.03)

# Initialize UMAP with PAGA for better trajectory layout
sc.tl.umap(adata, init_pos="paga")

# Diffusion pseudotime (requires a root cell)
adata.uns["iroot"] = int(np.where(adata.obs["cell_type"] == "HSC")[0][0])
sc.tl.diffmap(adata)
sc.tl.dpt(adata)

sc.pl.umap(adata, color=["dpt_pseudotime", "cell_type"])
```

### Monocle3 (R)

```r
library(monocle3)

# Convert from Seurat
cds <- as.cell_data_set(obj)  # requires SeuratWrappers

cds <- preprocess_cds(cds, num_dim = 30)
cds <- reduce_dimension(cds)
cds <- cluster_cells(cds)

# Learn trajectory graph
cds <- learn_graph(cds)

# Order cells (select root interactively or programmatically)
cds <- order_cells(cds, root_cells = root_cell_ids)
# root_cell_ids: barcodes of cells at the trajectory start (e.g., stem cells)

plot_cells(cds, color_cells_by = "pseudotime", cell_size = 0.5)

# Find genes that change along the trajectory
trajectory_genes <- graph_test(cds, neighbor_graph = "principal_graph")
sig_genes <- trajectory_genes[trajectory_genes$q_value < 0.05, ]
```

### RNA Velocity (scVelo)

```python
import scvelo as scv  # v0.3+

# Requires spliced/unspliced counts (from velocyto or STARsolo)
scv.pp.filter_and_normalize(adata, min_shared_counts=20, n_top_genes=2000)
scv.pp.moments(adata, n_pcs=30, n_neighbors=30)

# Dynamical mode — fits full kinetic ODE per gene
scv.tl.recover_dynamics(adata)
scv.tl.velocity(adata, mode="dynamical")
scv.tl.velocity_graph(adata)

scv.pl.velocity_embedding_stream(adata, basis="umap", color="cell_type")

# Velocity confidence: low confidence = unreliable direction
scv.tl.velocity_confidence(adata)
# adata.obs["velocity_confidence"] — filter genes/cells with low values
```

```
RNA velocity caveats:
  Requires active transcription dynamics (splicing timescale ~ hours).
  Fails for: PBMCs (mature, stable), slow diseases, steady-state populations.
  Contradictory results reported on some datasets (opposite directions).
  Always validate with known biology before interpreting velocity arrows.
  Consider CellRank 2 for fate analysis — it works with or without velocity.
```
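The caveats above point to CellRank 2 for fate analysis. A minimal sketch of its kernel/estimator API, assuming scVelo velocities as computed above; the kernel weights and `n_states` are illustrative assumptions, not recommendations.

```python
import cellrank as cr

# Combine a velocity kernel with a connectivity kernel to stabilize directions
vk = cr.kernels.VelocityKernel(adata).compute_transition_matrix()
ck = cr.kernels.ConnectivityKernel(adata).compute_transition_matrix()
combined = 0.8 * vk + 0.2 * ck

g = cr.estimators.GPCCA(combined)
g.compute_macrostates(n_states=5, cluster_key="cell_type")
g.predict_terminal_states()
g.compute_fate_probabilities()
g.plot_fate_probabilities(same_plot=False)
```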
---

## Cell-Cell Communication

### CellChat v2 (R)

```r
library(CellChat)  # v2.1+, from jinworks/CellChat

# Create CellChat object from Seurat
cellchat <- createCellChat(obj, group.by = "cell_type")

# Set ligand-receptor database
CellChatDB <- CellChatDB.human  # or CellChatDB.mouse
cellchat@DB <- CellChatDB

# Inference pipeline
cellchat <- subsetData(cellchat)
cellchat <- identifyOverExpressedGenes(cellchat)
cellchat <- identifyOverExpressedInteractions(cellchat)
cellchat <- computeCommunProb(cellchat)
cellchat <- computeCommunProbPathway(cellchat)
cellchat <- aggregateNet(cellchat)

# Visualization
netVisual_aggregate(cellchat, signaling = "CXCL")
netVisual_bubble(cellchat, sources.use = c("CD8+ T"), targets.use = c("Macrophage"))
cellchat <- netAnalysis_computeCentrality(cellchat, slot.name = "netP")
```

### LIANA+ (Python — consensus across methods)

```python
import liana as li

# LIANA wraps 7 methods and computes consensus rankings
li.mt.rank_aggregate(
    adata,
    groupby="cell_type",
    resource_name="consensus",  # OmniPath consensus resource
    verbose=True
)

# Results in adata.uns["liana_res"]
liana_res = adata.uns["liana_res"]
# Columns: source, target, ligand_complex, receptor_complex,
#          magnitude_rank, specificity_rank (consensus scores)

# Filter top interactions
top = liana_res[liana_res["magnitude_rank"] < 0.01]

# Visualization
li.pl.dotplot(adata, colour="magnitude_rank", size="specificity_rank",
              source_labels=["CD8+ T"], target_labels=["Macrophage"])
```

```
CellChat vs LIANA:
  CellChat: single method, rich curated database (1000+ interactions), R only,
            built-in spatial support (v2), best visualization.
  LIANA:    consensus across 7 methods, flexible databases
            (OmniPath, CellPhoneDB, etc.), Python + R,
            higher confidence (cross-method agreement).

For publication: run both and report concordant interactions.
```
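The note above recommends reporting interactions supported by both tools. A minimal pandas sketch of that intersection; the CellChat export file, its column names, and the rank cutoff are assumptions (e.g., export via CellChat's `subsetCommunication()` and `write.csv`), and ligand/receptor complex naming may need harmonizing between the two tools before merging.

```python
import pandas as pd

liana_res = adata.uns["liana_res"]
liana_top = liana_res[liana_res["magnitude_rank"] < 0.01]

# Hypothetical CellChat export, columns renamed to match LIANA's
cellchat = pd.read_csv("cellchat_interactions.csv").rename(
    columns={"ligand": "ligand_complex", "receptor": "receptor_complex"}
)

concordant = liana_top.merge(
    cellchat,
    on=["source", "target", "ligand_complex", "receptor_complex"],
    how="inner",
)
print(f"{len(concordant)} interactions supported by both tools")
```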
---

## Transcription Factor Activity

```
For TF activity scoring, decoupleR + CollecTRI is now preferred over pySCENIC
for most use cases. It is faster, less prone to overfitting (it uses pre-defined
regulons rather than inferring them from the same data), and benchmarks favorably.
Use pySCENIC only when you need to discover novel TF-target relationships
specific to your dataset.
```

### decoupleR (Python — recommended for TF activity)

```python
import decoupler as dc

# Get CollecTRI regulons (curated TF-target interactions)
net = dc.get_collectri(organism="human", split_complexes=False)

# Score TF activities per cell
dc.run_ulm(adata, net=net, source="source", target="target", weight="weight")

# Results in adata.obsm["ulm_estimate"] (activity scores)
# and adata.obsm["ulm_pvals"]

# Visualize TF activities on UMAP
acts = dc.get_acts(adata, obsm_key="ulm_estimate")
sc.pl.umap(acts, color=["PAX5", "SPI1", "GATA1"], vcenter=0, cmap="RdBu_r")
# PAX5 high in B cells, SPI1 in myeloid, GATA1 in erythroid
```

### decoupleR (R)

```r
library(decoupleR)

net <- get_collectri(organism = "human", split_complexes = FALSE)

# Run ULM (univariate linear model) on the expression matrix
acts <- run_ulm(mat = obj[["RNA"]]$data, net = net,
                .source = "source", .target = "target", .mor = "mor",
                minsize = 5)
```

### pySCENIC (for novel GRN discovery)

```python
# Only use when discovering dataset-specific regulatory networks
# Requires: pyscenic, arboreto, ctxcore
from pyscenic.utils import modules_from_adjacencies
from pyscenic.prune import prune2df, df2regulons
from pyscenic.aucell import aucell
from arboreto.algo import grnboost2

# Step 1: infer co-expression modules (slow — hours for large datasets)
adjacencies = grnboost2(expression_data, tf_names=tf_list, seed=42)

# Step 2: prune with cis-regulatory motif analysis
modules = modules_from_adjacencies(adjacencies, expression_data)
# Requires ranking databases (hg38 feather files from cisTarget)
# dbs = list of cisTarget ranking database objects loaded from those feather files
df = prune2df(dbs, modules, "motifs-v10-nr.hgnc-m0.001-o0.0.tbl")
regulons = df2regulons(df)

# Step 3: score regulon activity per cell
auc_mtx = aucell(expression_data, regulons)
```

## Output Specification

| Output | Format | Description |
|--------|--------|-------------|
| `qc_metrics.csv` | CSV | Per-cell QC: total counts, genes detected, MT%, outlier flags |
| `doublet_scores.csv` | CSV | Doublet probability per cell, singlet/doublet classification |
| `filtered_object.rds` | RDS | Seurat object after QC, doublet removal, normalization |
| `filtered_adata.h5ad` | H5AD | AnnData after QC, doublet removal, normalization |
| `hvg_list.csv` | CSV | Selected highly variable genes |
| `integrated_object.rds` | RDS | Seurat object after batch integration |
| `integrated_adata.h5ad` | H5AD | AnnData after integration with latent embedding |
| `cluster_markers.csv` | CSV | Top markers per cluster from FindAllMarkers |
| `cell_annotations.csv` | CSV | Per-cell type assignments (automated + manual) |
| `umap_plot.pdf` | PDF | UMAP colored by cluster, batch, cell type |
| `pseudobulk_de.csv` | CSV | DESeq2 results per cell type (log2FC, padj) |
| `trajectory_pseudotime.csv` | CSV | Per-cell pseudotime values |
| `velocity_embedding.pdf` | PDF | RNA velocity stream plot on UMAP |
| `cellchat_network.rds` | RDS | CellChat object with communication probabilities |
| `liana_results.csv` | CSV | Consensus L-R interaction rankings |
| `tf_activities.csv` | CSV | decoupleR TF activity scores per cell/cluster |
| `qc_violin.pdf` | PDF | Violin plots of nCount, nFeature, percent.mt |
## Validation Checks

```
After preprocessing PBMC 3k (10x reference dataset), verify:

Cell counts:
  Starting cells: ~2,700
  After QC (MAD 3): ~2,500-2,650 (lose ~2-7%)
  After doublet removal: ~2,400-2,600 (lose ~2-5%)

QC distributions:
  Median genes/cell: 800-1200
  Median UMI/cell: 2000-4000
  Median MT%: 2-5%
  Cells with > 10% MT: < 5% of total (PBMCs have low MT content)

Normalization:
  After log-normalization, expression values should fall in the 0-8 range (log scale)
  SCTransform residuals centered near 0 with SD ~1

Feature selection:
  2000 HVGs should include known markers: IL7R, CD14, MS4A1, CD8A, GNLY, NKG7
  Housekeeping genes (ACTB, GAPDH) should NOT be in top HVGs

Doublet detection:
  Expected doublets for 3k cells: 20-60 (~1-2%)
  If > 10% flagged: check the threshold or consider that the data has quality issues

Integration (multi-sample):
  After Harmony/scVI, UMAP should show batches interleaved, not separated
  Cell types should remain distinct (batch correction ≠ biological erasure)
  If all cell types collapse into one blob: over-correction (reduce theta or lambda)

Clustering (PBMC 3k at resolution 0.5):
  Expect 8-12 clusters
  CD4 T, CD8 T, B cells, NK, CD14 Mono, FCGR3A Mono, DC, Platelets should separate
  If only 3-4 clusters: resolution too low
  If > 20 clusters: resolution too high, check with clustree

Annotation:
  CellTypist with Immune_All_Low on PBMCs should identify all major types
  scType with "Immune system" database should score CD14 Mono, B cells, NK correctly
  Manual markers: IL7R for CD4 T, CD8A for CD8 T, MS4A1 for B, CD14 for Mono, GNLY for NK
```

## Common Pitfalls

### QC

1. **Fixed MT% threshold on all datasets**: A 5% cutoff works for PBMCs but removes real cardiomyocytes (30-40% MT) or hepatocytes (15-20% MT). Use MAD-based filtering or at least check the distribution before setting a threshold.

2. **Not filtering genes**: Removing cells but keeping genes detected in 0-2 cells. These contribute noise to normalization and feature selection. Filter with `min.cells = 3` or `sc.pp.filter_genes(adata, min_cells=3)`.

3. **QC on merged data without batch awareness**: MAD thresholds should ideally be computed per sample/batch. A sample with lower quality shifts the median and hides outliers in better samples. Use `scater::isOutlier(..., batch = sample_id)` or compute MAD per sample.

### Doublets

4. **Doublet detection on merged multi-sample data**: Droplet-based doublets only form within a sample. Running scDblFinder or Scrublet on merged data creates false doublets between cell types that were never in the same droplet.

5. **Trusting doublet scores blindly for rare populations**: Rare cell types with intermediate profiles can be flagged as doublets. Cross-check flagged cells against known markers before removal.

### Normalization

6. **Using TPM/FPKM normalization for scRNA-seq**: Single-cell UMI counts are not length-biased. Standard bulk normalization (TPM, FPKM) is wrong for UMI data. Use library-size normalization or SCTransform.

7. **Running SCTransform without glmGamPoi**: SCTransform v2 requires glmGamPoi for fast GLM fitting. Without it, the function falls back to a much slower implementation. Install it from Bioconductor.

8. **ScaleData on all genes**: `ScaleData()` only needs to run on variable features for PCA. Scaling all genes wastes memory and time. Use `ScaleData(obj, features = VariableFeatures(obj))`.

### Feature Selection

9. **Running HVG selection on normalized data with seurat_v3 flavor**: The `seurat_v3` flavor in scanpy expects raw counts, not log-normalized data. Using the wrong flavor silently produces bad HVG rankings.

### Integration

10. **Integrating before QC and doublet removal**: Batch integration amplifies artifacts from low-quality cells and doublets. Always QC and remove doublets per sample first, then merge and integrate.
11. **Over-correction erasing biology**: If cell types that should be distinct merge after integration, the method is too aggressive. Reduce Harmony `theta`, increase scVI `n_latent`, or try a more conservative method (RPCA).

12. **Not joining layers before DE in Seurat v5**: After `IntegrateLayers()`, each sample's counts are still in separate layers. Call `JoinLayers()` before `FindAllMarkers()` or DE analysis will fail or use only one sample's data.

### Clustering and Annotation

13. **Clustering on UMAP coordinates**: UMAP distorts distances and densities. Always cluster on PCA or integrated embeddings (Harmony, scVI latent space), never on UMAP dimensions.

14. **Single resolution without validation**: Picking resolution = 0.5 without checking is arbitrary. Use clustree, silhouette scores, or marker gene validation to support the choice.

15. **Automated annotation without marker validation**: CellTypist and scType provide a starting point, not ground truth. Always check top markers per cluster against known biology for the tissue. Automated tools can mislabel rare or novel cell types.

### Differential Expression

16. **Wilcoxon/t-test for cross-condition DE**: Treating cells as independent replicates inflates significance massively. Use pseudobulk (aggregate by sample + cell type, then DESeq2). Cell-level tests are only appropriate for marker discovery within a single sample.

17. **Pseudobulk on normalized data**: Aggregate raw counts, not normalized values. DESeq2 needs integers and handles normalization internally.

### Trajectory and Communication

18. **RNA velocity on stable populations**: Velocity requires active transcription dynamics. PBMCs, terminally differentiated cells, and chronic disease states produce meaningless velocity arrows. Check velocity confidence scores and validate against known biology.

19. **Over-interpreting cell communication rankings**: Both CellChat and LIANA rank L-R pairs by statistical enrichment, not by biological activity. High-ranked interactions are hypotheses for validation, not confirmed communication events.

20. **pySCENIC for TF activity scoring**: pySCENIC infers GRNs from the same data it scores, creating overfitting risk. For TF activity scoring alone (not GRN discovery), decoupleR with CollecTRI regulons is faster and more reproducible.

## Related Skills

- `cancer-multiomics`: Bulk RNA-seq analysis, TCGA data retrieval
- `immune-deconvolution`: Bulk-level immune estimation (compare with single-cell ground truth)

## Public Datasets for Testing

| Dataset | Cells | Use Case |
|---------|-------|----------|
| PBMC 3k (10x) | 2,700 | Standard tutorial dataset, 9 cell types, fast processing |
| PBMC 10k (10x) | 10,000 | Larger PBMC, tests scalability |
| Tabula Muris | 100k+ | Multi-organ mouse atlas, tests batch integration |
| Human Cell Atlas (HCA) | varies | Reference atlases for annotation |