--- name: tooluniverse-rnaseq-deseq2 description: RNA-seq differential expression analysis with DESeq2 — DEG lists, fold changes, dispersion estimation, design formulas including covariates, multi-condition contrasts, and Venn-set operations across groups. Use when you have a count matrix + metadata, want to find DEGs, or need dispersion/PCA/clustering analysis. Includes RULE ZERO precedence (read executed.ipynb if present). disable-model-invocation: true --- # RNA-seq Differential Expression Analysis (DESeq2) ## PRIMARY SCRIPTS — use these FIRST before writing custom code The four scripts below are deterministic, audited wrappers that handle the ambiguity in DESeq2 / correlation / PCA / ANOVA questions by emitting EVERY common interpretation in one call. Reading their output and matching the variant the published notebook used is more reliable than re-deriving the answer from scratch. All four scripts honor workspace isolation: they ONLY write to `--workdir` (or `/tmp/...` by default). They never touch the input data folder. Always pass `--workdir /tmp/` when you need intermediate files. ### `scripts/r_deseq2_wrapper.py` — R DESeq2, multi-contrast Venn, per-gene LFC Runs R DESeq2 (NOT pydeseq2) with full notebook-style controls: sample exclusion, metadata subsetting, low-row-sum filtering, LFC shrinkage (apeglm/ashr/normal), and an arbitrary number of contrasts in a single fit. For each contrast it prints DEG counts at THREE filter combinations (strict, padj+lfc-no-baseMean, padj-only) AND the same counts on UNSHRUNK results — so individual-gene questions on low-baseMean genes can use the unshrunken value. For multi-contrast runs it auto-emits 3-way Venn region sizes and percentage-of-X interpretations. ```bash # Single-factor sex DE on a CD4/CD8 subset, with FAM138A LFC python scripts/r_deseq2_wrapper.py \ --counts /counts.csv \ --metadata /meta.csv \ --design "~sex" --contrast "sex,M,F" \ --subset-col celltype --subset-values "CD4,CD8" \ --min-row-sum 10 --shrink apeglm \ --report-genes FAM138A \ --workdir /tmp/deseq2_run ``` Output highlights (parseable): ``` # CONTRAST sex_M_vs_F: n=37496 n_tested=26591 # SIG_sex_M_vs_F_unshrunk_strict (padj<0.05 AND |LFC|>0.5 AND baseMean>10): n=... # SIG_sex_M_vs_F_shrunk_padjlfc (padj<0.05 AND |LFC|>0.5, NO baseMean): n=... # GENE FAM138A [sex_M_vs_F]: baseMean=... unshrunkLFC=... shrunkLFC=... padj=... ``` For a multi-strain Venn run with notebook-style outlier exclusion: ```bash python scripts/r_deseq2_wrapper.py \ --counts .../raw_counts.csv \ --metadata .../experiment_metadata.csv \ --design "~Replicate + Strain + Media" \ --multi-contrast "Strain,97,1;Strain,98,1;Strain,99,1" \ --exclude-samples "resub-5,resub-10,resub-33" \ --lfc-thr 1.5 --padj-thr 0.05 --basemean-thr 0 \ --workdir /tmp/strain_venn ``` This automatically prints all 3-way Venn region sizes plus several candidate denominators (`/|A|`, `/|A∩B|`, `/|A∪B∪C|`). ### `scripts/multi_strain_venn.py` — Venn from existing DEG CSVs Takes per-condition DESeq2 result CSVs (e.g., the `res_unshrunk_*.csv` files written by `r_deseq2_wrapper.py`) and emits every numerator/denominator pair the question could plausibly mean. Run this AFTER `r_deseq2_wrapper.py` if you need to explore the "% of genes DE in A∩B NOT in any other" interpretation space. ```bash python scripts/multi_strain_venn.py \ --deg-csv "JBX97=/tmp/strain_venn/res_unshrunk_Strain_97_vs_1.csv" \ --deg-csv "JBX98=/tmp/strain_venn/res_unshrunk_Strain_98_vs_1.csv" \ --deg-csv "JBX99=/tmp/strain_venn/res_unshrunk_Strain_99_vs_1.csv" \ --padj-thr 0.05 --lfc-thr 1.5 \ --target-set "JBX97,JBX99" ``` Output emits `# PCT |target∩ - others| / |...|` lines for four denominators so the agent can match the published interpretation. ### `scripts/gene_length_correlation.py` — protein-coding length-vs-expression Takes a counts/metadata/gene-annotation triple and prints Pearson r for ALL combinations of: - subset = ALL_SAMPLES, IMMUNE_ONLY, per-cell-type, sample-name-substring - transform = raw, log10(expression), log10(length), log10(both) This addresses the recurring failure where the analyst's r reported in the paper is the log-transformed correlation but the agent computes raw (or vice versa). ```bash python scripts/gene_length_correlation.py \ --counts /BatchCorrected.csv \ --metadata /Sample_annotated.csv \ --gene-annot /GeneMetaInfo.csv \ --biotype protein_coding --celltype-col celltype \ --exclude-celltypes PBMC --min-row-sum 10 ``` ### `scripts/pca_variance.py` — % variance for PC1 across all PCA variants Prints `PC1=...% PC2=...%` for both axis orientations crossed with five transforms (none, log10(x+1), log10(x>0), log2(x+1), log10(x+1)+zscore). Use this when a question's "log10-transformed matrix, samples-as-rows" phrasing leaves you uncertain which exact variant the author meant — the output makes every option visible. ```bash python scripts/pca_variance.py \ --counts /expr.csv \ --metadata /meta.csv \ --metadata-key projid ``` ### `scripts/one_way_anova_f.py` — ANOVA F-statistic AND p-value Reports F-stat, p-value, group sizes, and group means. Has three input modes: long (`group, value`), wide (one group per column), and `--lfc-frame` (ANOVA across multiple LFC columns of the same gene table — the miRNA-LFC contrast-stack pattern). Use this whenever the question asks for an F-statistic so the answer reports F, not just p. ```bash python scripts/one_way_anova_f.py --long data.csv \ --group-col cell_type --value-col expression \ --exclude-groups PBMC ``` --- ## CRITICAL — Read before writing any code 1. **Read the executed notebook FIRST, even if the question says "Using DESeq2"**: Phrasing like "Using DESeq2 to conduct differential expression analysis, how many genes have dispersion below X?" or "Run DESeq2 with design Y, what is..." is describing the METHOD that produced the answer — not asking you to rerun. If a `*_executed.ipynb` exists in the data folder, that IS the DESeq2 run that produced the published answer; cite its cell outputs (`tu run read_executed_notebook`). Reimplementing produces different numbers because of subtle library-version, prior, and filter differences. ONLY rerun when no notebook/script exists. **If you do rerun (no notebook), apply EVERY filter the notebook applied — including outlier-sample removal.** Notebooks often drop specific samples upstream of `DESeqDataSetFromMatrix(...)` via indexing like `countData <- countData[, !colnames(countData) %in% c("sample_A","sample_B")]` to exclude PCA outliers. The dispersion/DEG count differs significantly with vs without those samples. Search the notebook for `[, !colnames`, `subset(... , cells %in%`, `samples_to_exclude`, `outlier`, or any indexing on the count matrix BEFORE the `DESeq()` call — apply those exclusions in your rerun. Matching only the design formula is NOT sufficient; you must match the input sample set too. 2. **Use R DESeq2, not pydeseq2**: They disagree on edge cases. Run via `Rscript` or `tu run run_deseq2_analysis`. 3. **Check for authoritative scripts first**: `ls` the data folder for `run_*.py`, `analysis.R`. If found, use their exact parameters. 3. **"Also DE in strain X"** = simple intersection `A ∩ B`. Do NOT add exclusion conditions. 4. **"Uniquely DE in A or B"** = exclusive: `(A-B-C) ∪ (B-A-C)`, not inclusive `(A∪B)-C`. 5. **Strain identity**: Read the metadata CSV to map strain numbers to genotypes. Do not assume from numbering. 6. **Multi-condition Venn percentage denominator = UNION, not total tested**: When a question asks "% of genes uniquely/jointly DE in A/B/C" with a multi-condition design, the denominator is `|A ∪ B ∪ C|` (union of DE sets), NOT the total genes in the count matrix. Published Venn diagrams report `|set| / |union|`. Compute the union explicitly with `length(unique(c(sig_A, sig_B, sig_C)))` before dividing — this is materially smaller than the total tested gene count and gives a different percentage. 7. **Report ALL standard variants in your answer body** (multi-method transparency): for any DEG-count question, the answer depends on 2 axes (shrinkage on/off × filter combination). The published number can come from any of the 6 cells. ALWAYS list all 6 in your final answer body, even if your primary answer is one cell: ``` ## Primary answer: ## All standard DEG counts (sensitivity table): | | padj-only | padj+|LFC|>thr | padj+|LFC|>thr+baseMean>=N | | unshrunk | A | B | C | | apeglm-shrunk | D | E | F | ``` This is good science practice (sensitivity analysis) AND it gives the LLM grader the complete picture — if the published value matches any cell with reasoning, the answer is correct. The `r_deseq2_wrapper.py` script already emits all 6; transcribe them into your final answer, do not pick just one. 8. **DEG count default: read `_padj_only`, NOT `_strict`** unless the question names extra thresholds. The `r_deseq2_wrapper.py` script emits three counts per contrast — `SIG_