--- name: bulk-rna-seq-differential-expression-with-omicverse title: Bulk RNA-seq differential expression with omicverse description: "Bulk RNA-seq DEG pipeline: gene ID mapping, DESeq2 normalization, statistical testing, volcano plots, and pathway enrichment in OmicVerse." --- # Bulk RNA-seq differential expression with omicverse ## Overview Follow this skill to run the end-to-end differential expression (DEG) workflow showcased in [`t_deg.ipynb`](../../omicverse_guide/docs/Tutorials-bulk/t_deg.ipynb). It assumes the user provides a raw gene-level count matrix (e.g., from featureCounts) and wants to analyse bulk RNA-seq cohorts inside omicverse. ## Instructions 1. **Set up the session** - Import `omicverse as ov`, `scanpy as sc`, and `matplotlib.pyplot as plt`. - Call `ov.plot_set()` so downstream plots adopt omicverse styling. 2. **Prepare ID mapping assets** - When gene IDs must be converted to gene symbols, instruct the user to download mapping pairs via `ov.utils.download_geneid_annotation_pair()` and store them under `genesets/`. - Mention the available prebuilt genomes (T2T-CHM13, GRCh38, GRCh37, GRCm39, danRer7, danRer11) and that users can generate their own mapping from GTF files if needed. 3. **Load the raw counts** - Read tab-delimited featureCounts output with `ov.pd.read_csv(..., sep='\t', header=1, index_col=0)`. - Strip trailing `.bam` segments from column names using list comprehension so sample IDs are clean. 4. **Map gene identifiers** - Run `ov.bulk.Matrix_ID_mapping(counts_df, 'genesets/pair_.tsv')` to replace `gene_id` entries with gene symbols. 5. **Initialise the DEG object** - Create `dds = ov.bulk.pyDEG(mapped_counts)`. - Handle duplicate gene symbols with `dds.drop_duplicates_index()` to keep the highest expressed version. 6. **Normalise and estimate size factors** - Execute `dds.normalize()` to calculate DESeq2 size factors, correcting for library size and batch differences. 7. **Run differential testing** - Collect treatment and control replicate labels into lists. - Call `dds.deg_analysis(treatment_groups, control_groups, method='ttest')` for the default Welch t-test. - Offer optional alternatives: `method='edgepy'` for edgeR-like tests and `method='limma'` for limma-style modelling. 8. **Filter and threshold results** - Note that lowly expressed genes are retained by default; filter using `dds.result.loc[dds.result['log2(BaseMean)'] > 1]` when needed. - Set dynamic fold-change and significance cutoffs via `dds.foldchange_set(fc_threshold=-1, pval_threshold=0.05, logp_max=6)` (`fc_threshold=-1` auto-selects based on log2FC distribution). 9. **Visualise differential expression** - Produce volcano plots with `dds.plot_volcano(title=..., figsize=..., plot_genes=... or plot_genes_num=...)` to highlight key genes. - Generate per-gene boxplots using `dds.plot_boxplot(genes=[...], treatment_groups=..., control_groups=..., figsize=..., legend_bbox=...)`; adjust y-axis tick labels if required. 10. **Perform pathway enrichment (optional)** - Download curated pathway libraries through `ov.utils.download_pathway_database()`. - Load genesets with `ov.utils.geneset_prepare(, organism='Mouse'|'Human'|...)`. - Build the DEG gene list from `dds.result.loc[dds.result['sig'] != 'normal'].index`. - Run enrichment with `ov.bulk.geneset_enrichment(gene_list=deg_genes, pathways_dict=..., pvalue_type='auto', organism=...)`. Encourage users without internet access to provide a `background` gene list. - Visualise single-library results via `ov.bulk.geneset_plot(...)` and combine multiple ontologies using `ov.bulk.geneset_plot_multi(enr_dict, colors_dict, num=...)`. 11. **Document outputs** - Suggest exporting `dds.result` and enrichment tables to CSV for downstream reporting. - Encourage users to save figures generated by matplotlib (`plt.savefig(...)`) when running outside notebooks. 12. **Defensive validation** ```python # Before DEG: verify treatment/control groups exist as column names all_cols = set(dds.result.columns) if hasattr(dds, 'result') else set(counts_df.columns) for g in treatment_groups + control_groups: assert g in all_cols, f"Sample '{g}' not found in count matrix columns" # Verify groups don't overlap assert not set(treatment_groups) & set(control_groups), "Treatment and control groups must not overlap" ``` 13. **Troubleshooting tips** - Ensure sample labels in `treatment_groups`/`control_groups` exactly match column names post-cleanup. - Verify required packages (`omicverse`, `pyComplexHeatmap`, `gseapy`) are installed for enrichment visualisations. - Remind users that internet access is required the first time they download gene mappings or pathway databases. ## Examples - "I have a featureCounts matrix for mouse tumour samples—normalize it with DESeq2, run t-test DEG, and highlight the top 8 genes in a volcano plot." - "Use omicverse to compute edgeR-style differential expression between treated and control replicates, then run GO enrichment on significant genes." - "Guide me through converting Ensembl IDs to symbols, performing limma DEG, and plotting boxplots for Krtap9-5 and Lef1." ## References - Detailed walkthrough notebook: [`t_deg.ipynb`](../../omicverse_guide/docs/Tutorials-bulk/t_deg.ipynb) - Sample count matrix for testing: [`sample/counts.txt`](../../sample/counts.txt) - Quick copy/paste commands: [`reference.md`](reference.md)