--- name: "pydeseq2-differential-expression" description: "Bulk RNA-seq DE with PyDESeq2: load counts, normalize, fit negative binomial models, Wald test (BH-FDR), LFC shrinkage, volcano/MA plots. Use for two-group comparisons, multi-factor designs with batch correction, multiple contrasts." license: "CC-BY-4.0" --- # PyDESeq2 Differential Expression Analysis ## Overview PyDESeq2 is a Python reimplementation of the R DESeq2 package for differential gene expression analysis from bulk RNA-seq count data. It fits negative binomial generalized linear models per gene, estimates dispersion with empirical Bayes shrinkage, and performs Wald tests with Benjamini-Hochberg FDR correction. This skill covers the full pipeline from raw counts to publication-ready result tables and visualizations. ## When to Use - Identifying differentially expressed genes between two or more experimental conditions from bulk RNA-seq - Performing two-group comparisons (e.g., treated vs control) with proper statistical testing - Running multi-factor designs that account for batch effects or covariates (e.g., `~batch + condition`) - Applying log2 fold change shrinkage (apeGLM) for ranking and visualization - Generating volcano plots, MA plots, and heatmaps from differential expression results - Converting R-based DESeq2 workflows to a pure Python environment - Integrating DE analysis into larger Python bioinformatics pipelines (e.g., with scanpy, pandas) - Use **DESeq2** (R/Bioconductor) or **edgeR** instead for the reference R implementations with the broadest method support and community validation ## Prerequisites - **Python packages**: `pydeseq2>=0.4`, `pandas>=1.4`, `numpy>=1.23`, `scipy>=1.11`, `scikit-learn>=1.1`, `anndata>=0.8` - **Data requirements**: Raw (unnormalized) integer count matrix (samples x genes) + sample metadata DataFrame - **Environment**: Python 3.10+; optional `matplotlib`, `seaborn` for visualization ```bash pip install pydeseq2 matplotlib seaborn ``` ## Workflow ### Step 1: Data Loading and Validation Load the count matrix and metadata. PyDESeq2 expects counts as a samples x genes DataFrame with non-negative integers, and metadata as a samples x variables DataFrame with matching indices. ```python import pandas as pd # Load data — typical CSV has genes as rows, samples as columns counts_raw = pd.read_csv("counts.csv", index_col=0) metadata = pd.read_csv("metadata.csv", index_col=0) # Transpose if needed: PyDESeq2 requires samples x genes if counts_raw.shape[0] > counts_raw.shape[1]: counts_df = counts_raw.T # genes x samples → samples x genes else: counts_df = counts_raw # Validate alignment common_samples = counts_df.index.intersection(metadata.index) counts_df = counts_df.loc[common_samples] metadata = metadata.loc[common_samples] print(f"Samples: {counts_df.shape[0]}, Genes: {counts_df.shape[1]}") print(f"Metadata columns: {list(metadata.columns)}") print(f"Condition counts:\n{metadata['condition'].value_counts()}") ``` ### Step 2: Gene Filtering Remove lowly expressed genes to improve statistical power and reduce multiple testing burden. ```python # Filter genes with total counts below threshold min_total_counts = 10 gene_counts = counts_df.sum(axis=0) genes_to_keep = gene_counts[gene_counts >= min_total_counts].index counts_df = counts_df[genes_to_keep] # Optional: require minimum counts in a minimum number of samples min_count_per_sample = 5 min_samples = 3 genes_expressed = (counts_df >= min_count_per_sample).sum(axis=0) >= min_samples counts_df = counts_df.loc[:, genes_expressed] print(f"Genes after filtering: {counts_df.shape[1]}") ``` ### Step 3: DeseqDataSet Initialization and Fitting Create the DESeq dataset object, specify the design formula, and run the full pipeline (size factor estimation, dispersion estimation, model fitting). ```python from pydeseq2.dds import DeseqDataSet dds = DeseqDataSet( counts=counts_df, metadata=metadata, design="~condition", # Wilkinson-style formula refit_cooks=True, # Refit after Cook's outlier removal n_cpus=4 # Parallel threads ) # Run: size factors → dispersions → trend → MAP shrinkage → LFC fitting dds.deseq2() # Inspect normalization print(f"Size factors (first 5): {dds.obsm['size_factors'][:5]}") print(f"Size factor range: {dds.obsm['size_factors'].min():.2f} - {dds.obsm['size_factors'].max():.2f}") ``` ### Step 4: Statistical Testing (Wald Test) Perform Wald tests to identify differentially expressed genes. Specify the contrast as `[variable, test_level, reference_level]`. ```python from pydeseq2.ds import DeseqStats ds = DeseqStats( dds, contrast=["condition", "treated", "control"], alpha=0.05, # FDR threshold cooks_filter=True, # Filter Cook's outliers independent_filter=True # Independent filtering for power ) ds.summary() # Access full results results = ds.results_df print(f"Total genes tested: {len(results)}") print(f"Significant (padj < 0.05): {(results.padj < 0.05).sum()}") ``` ### Step 5: LFC Shrinkage (Optional) Apply apeGLM shrinkage to reduce noise in log2 fold change estimates. Use shrunk values for visualization and ranking, not for significance calls. ```python # Apply shrinkage — modifies results_df.log2FoldChange in place ds.lfc_shrink() # Compare pre/post shrinkage effect print(f"Max |LFC| after shrinkage: {results.log2FoldChange.abs().max():.2f}") print(f"Genes with |LFC| > 2: {(results.log2FoldChange.abs() > 2).sum()}") ``` ### Step 6: Result Filtering and Export Filter significant genes and export results for downstream analysis. ```python import numpy as np # Significance + effect size filter significant = results[ (results.padj < 0.05) & (results.log2FoldChange.abs() > 1.0) ].copy() # Separate up/down-regulated up = significant[significant.log2FoldChange > 0].sort_values("padj") down = significant[significant.log2FoldChange < 0].sort_values("padj") print(f"Upregulated: {len(up)}, Downregulated: {len(down)}") # Export results.to_csv("deseq2_all_results.csv") significant.to_csv("deseq2_significant.csv") up.to_csv("deseq2_upregulated.csv") down.to_csv("deseq2_downregulated.csv") print("Results exported to CSV files") ``` ### Step 7: Visualization — Volcano Plot ```python import matplotlib.pyplot as plt import numpy as np fig, ax = plt.subplots(figsize=(10, 7)) res = results.dropna(subset=["padj"]).copy() res["-log10_padj"] = -np.log10(res.padj) # Color categories is_sig = (res.padj < 0.05) & (res.log2FoldChange.abs() > 1.0) is_up = is_sig & (res.log2FoldChange > 0) is_down = is_sig & (res.log2FoldChange < 0) ax.scatter(res.loc[~is_sig, "log2FoldChange"], res.loc[~is_sig, "-log10_padj"], c="grey", alpha=0.3, s=8, label="Not significant") ax.scatter(res.loc[is_up, "log2FoldChange"], res.loc[is_up, "-log10_padj"], c="firebrick", alpha=0.6, s=12, label=f"Up ({is_up.sum()})") ax.scatter(res.loc[is_down, "log2FoldChange"], res.loc[is_down, "-log10_padj"], c="steelblue", alpha=0.6, s=12, label=f"Down ({is_down.sum()})") ax.axhline(-np.log10(0.05), ls="--", c="black", alpha=0.4) ax.axvline(-1, ls="--", c="black", alpha=0.4) ax.axvline(1, ls="--", c="black", alpha=0.4) ax.set_xlabel("Log2 Fold Change") ax.set_ylabel("-Log10(Adjusted P-value)") ax.set_title("Volcano Plot — Treated vs Control") ax.legend() plt.tight_layout() plt.savefig("volcano_plot.png", dpi=300) print("Saved volcano_plot.png") ``` ### Step 8: Visualization — MA Plot ```python fig, ax = plt.subplots(figsize=(10, 7)) ax.scatter(np.log10(res.loc[~is_sig, "baseMean"] + 1), res.loc[~is_sig, "log2FoldChange"], c="grey", alpha=0.3, s=8) ax.scatter(np.log10(res.loc[is_sig, "baseMean"] + 1), res.loc[is_sig, "log2FoldChange"], c="firebrick", alpha=0.6, s=12) ax.axhline(0, ls="--", c="black", alpha=0.5) ax.set_xlabel("Log10(Mean Normalized Count + 1)") ax.set_ylabel("Log2 Fold Change") ax.set_title("MA Plot") plt.tight_layout() plt.savefig("ma_plot.png", dpi=300) print("Saved ma_plot.png") ``` ## Key Parameters | Parameter | Default | Range / Options | Effect | |-----------|---------|-----------------|--------| | `design` | (required) | Wilkinson formula | Model formula; put covariates before variable of interest | | `contrast` | `None` | `[var, test, ref]` | Which comparison to test; `None` uses last coefficient | | `alpha` | `0.05` | `0.01`–`0.10` | FDR threshold for significance calling | | `refit_cooks` | `True` | `True`/`False` | Refit model after removing Cook's distance outliers | | `cooks_filter` | `True` | `True`/`False` | Apply Cook's distance filtering during testing | | `independent_filter` | `True` | `True`/`False` | Independent filtering to optimize detection power | | `n_cpus` | `1` | `1`–`N` | Number of parallel threads for dispersion fitting | | `min_total_counts` (user) | `10` | `5`–`50` | Gene filtering: minimum total reads across all samples | | `lfc_shrink()` | off | call after `summary()` | apeGLM shrinkage; reduces noisy LFC estimates | ## Common Recipes ### Recipe: Multiple Contrasts from One Model When comparing multiple treatment groups against a shared control. ```python dds = DeseqDataSet(counts=counts_df, metadata=metadata, design="~condition") dds.deseq2() contrasts = { "A_vs_ctrl": ["condition", "treatment_A", "control"], "B_vs_ctrl": ["condition", "treatment_B", "control"], "C_vs_ctrl": ["condition", "treatment_C", "control"], } for name, contrast in contrasts.items(): ds = DeseqStats(dds, contrast=contrast, alpha=0.05) ds.summary() n_sig = (ds.results_df.padj < 0.05).sum() ds.results_df.to_csv(f"results_{name}.csv") print(f"{name}: {n_sig} significant genes") ``` ### Recipe: Batch-Corrected Analysis When samples come from multiple batches or sequencing runs. ```python # Verify batch is not confounded with condition print(pd.crosstab(metadata["batch"], metadata["condition"])) dds = DeseqDataSet( counts=counts_df, metadata=metadata, design="~batch + condition" # batch first, then variable of interest ) dds.deseq2() ds = DeseqStats(dds, contrast=["condition", "treated", "control"]) ds.summary() print(f"Significant genes (batch-corrected): {(ds.results_df.padj < 0.05).sum()}") ``` ### Recipe: P-value Distribution QC Diagnostic check — a healthy analysis shows a flat histogram with a spike near 0. ```python import matplotlib.pyplot as plt fig, axes = plt.subplots(1, 2, figsize=(12, 5)) # P-value histogram axes[0].hist(ds.results_df.pvalue.dropna(), bins=50, edgecolor="black") axes[0].set_xlabel("P-value") axes[0].set_ylabel("Frequency") axes[0].set_title("P-value Distribution") # Dispersion plot axes[1].scatter( np.log10(dds.varm["_normed_means"] + 1), np.log10(dds.varm["dispersions"]), alpha=0.3, s=5 ) axes[1].set_xlabel("Log10(Mean Expression + 1)") axes[1].set_ylabel("Log10(Dispersion)") axes[1].set_title("Dispersion vs Mean") plt.tight_layout() plt.savefig("qc_diagnostics.png", dpi=300) print("Saved qc_diagnostics.png") ``` ### Recipe: Save and Reload DESeq Dataset For resuming analysis without re-running the expensive fitting step. ```python import pickle # Save with open("dds_fitted.pkl", "wb") as f: pickle.dump(dds.to_picklable_anndata(), f) print("Saved fitted DESeqDataSet") # Reload with open("dds_fitted.pkl", "rb") as f: adata = pickle.load(f) # Note: reload requires re-constructing DeseqDataSet from the AnnData ``` ## Expected Outputs - `deseq2_all_results.csv` — Full results table (baseMean, log2FoldChange, lfcSE, stat, pvalue, padj) for all tested genes - `deseq2_significant.csv` — Filtered results (padj < 0.05 and |LFC| > 1) - `deseq2_upregulated.csv` — Significant upregulated genes sorted by padj - `deseq2_downregulated.csv` — Significant downregulated genes sorted by padj - `volcano_plot.png` — Volcano plot with significance and fold change thresholds - `ma_plot.png` — MA plot showing fold change vs mean expression - `qc_diagnostics.png` — P-value distribution and dispersion plot ## Troubleshooting | Problem | Cause | Solution | |---------|-------|----------| | `ValueError: Index mismatch` | Sample names differ between counts and metadata | Use `counts_df.index.intersection(metadata.index)` to align | | All genes have `padj = NaN` | Genes have zero variance or all zero counts | Apply stricter gene filtering (increase `min_total_counts`) | | `Design matrix is not full rank` | Confounded variables (e.g., all treated in one batch) | Check with `pd.crosstab()`; simplify design or remove confounded variable | | No significant genes found | Small effect size, high variability, or low sample size | Check p-value distribution; relax alpha or |LFC| threshold; inspect QC plots | | `MemoryError` during fitting | Too many genes or very large dataset | Pre-filter more aggressively; reduce `n_cpus`; use machine with more RAM | | Very large size factors (>5) | Extreme library size differences | Verify raw counts are unnormalized; check for contamination or failed libraries | | Shrinkage produces unexpected LFCs | Calling `lfc_shrink()` before `summary()` | Always call `ds.summary()` first, then `ds.lfc_shrink()` | ## References - [PyDESeq2 Documentation](https://pydeseq2.readthedocs.io/) — Official API reference and tutorials - [Muzellec et al. (2023)](https://doi.org/10.1093/bioinformatics/btad547) — PyDESeq2: a Python package for bulk differential expression analysis with DESeq2. Bioinformatics. - [Love et al. (2014)](https://doi.org/10.1186/s13059-014-0550-8) — Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology. - [DESeq2 Vignette (Bioconductor)](https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html) — Comprehensive R DESeq2 guide with design formula examples