--- name: "cnvkit-copy-number" description: "Detect somatic CNVs from WES/WGS/targeted BAMs (CNVkit v0.9.x). Bin coverage in target/antitarget regions, normalize vs reference, segment with CBS/HMM, call amps/dels, scatter/diagram plots, purity/ploidy, VCF/SEG export. CLI plus Python API (cnvlib). Use GATK CNV for deep WGS with population controls; use CNVkit for targeted/exome where antitarget bins matter." license: "Apache-2.0" --- # CNVkit Copy Number Analysis ## Overview CNVkit detects somatic copy number variants (CNVs) from whole-exome sequencing (WES), whole-genome sequencing (WGS), or targeted panel BAM files. It calculates read depth in both on-target (capture) bins and off-target (antitarget) bins, corrects for GC bias and library depth, segments the log2 copy ratio profile with circular binary segmentation (CBS) or a hidden Markov model (HMM), and calls amplifications and deletions. CNVkit provides both a CLI (`cnvkit.py`) and a Python API (`cnvlib`) for integration into analysis pipelines, and produces scatter plots, chromosome diagrams, heatmaps, and export files in VCF, BED, and SEG formats. ## When to Use - Calling somatic copy number variants from tumor-normal paired exome (WES) or targeted panel sequencing - Detecting copy number alterations in tumor-only samples using a pooled normal reference - Running CNV analysis on whole-genome sequencing (WGS) data with the `--method wgs` mode - Estimating tumor purity and ploidy for samples where purity is unknown, to interpret copy ratio calls - Generating SEG format copy number files for GISTIC2, cBioPortal, or IGV visualization - Identifying focal amplifications (e.g., ERBB2, MYC) or homozygous deletions (e.g., CDKN2A, RB1) - Use **GATK CNV** (`gatk DenoiseReadCounts` / `gatk ModelSegments`) instead for deep WGS cohorts with large matched panel-of-normals (PoN); CNVkit is better suited for targeted/exome data - Use **Control-FREEC** instead when you need allele-frequency-based B-allele fraction modeling alongside CNV calling ## Prerequisites - **Software**: CNVkit v0.9.x, Python 3.8+, R (for CBS segmentation), samtools - **Python packages**: `cnvlib` (installed as part of CNVkit), `matplotlib`, `pandas` - **Input files**: sorted, indexed BAM files (tumor ± matched normal); BED file of capture targets; reference genome FASTA; access to R with DNAcopy package for CBS - **Data requirements**: minimum ~50× mean target coverage for WES; WGS works at 20-30× > **Check before installing**: The tool may already be available in the current environment (e.g., inside a `pixi` / `conda` env). Run `command -v cnvkit.py` first and skip the install commands below if it returns a path. When running inside a pixi project, invoke the tool via `pixi run cnvkit.py` rather than bare `cnvkit.py`. ```bash # Install CNVkit via conda (recommended — handles R/DNAcopy dependency) conda install -c bioconda cnvkit # Or via pip (requires R + DNAcopy already installed) pip install cnvkit # Verify cnvkit.py version # cnvkit 0.9.10 # Install R DNAcopy (for CBS segmentation) Rscript -e 'if (!requireNamespace("BiocManager")) install.packages("BiocManager"); BiocManager::install("DNAcopy")' # Index BAM files if not already indexed samtools index tumor.bam samtools index normal.bam ``` ## Quick Start ```bash # One-command paired tumor/normal CNV analysis (WES) cnvkit.py batch tumor.bam \ --normal normal.bam \ --targets targets.bed \ --fasta GRCh38.fa \ --output-dir cnvkit_results/ \ --diagram --scatter \ --method hybrid # Output files in cnvkit_results/: # tumor.targetcoverage.cnn — target bin coverage # tumor.antitargetcoverage.cnn — antitarget coverage # tumor.cnr — copy number ratios # tumor.cns — segmented copy numbers # tumor-scatter.png — genome-wide scatter plot # tumor-diagram.pdf — chromosome diagram echo "CNV analysis complete" ``` ## Workflow ### Step 1: Create Copy Number Reference Build a reference from one or more normal BAM files. This corrects for systematic biases (GC content, mappability) and sets the neutral baseline. ```bash # Option A: Paired normal reference (single matched normal) cnvkit.py reference normal.targetcoverage.cnn normal.antitargetcoverage.cnn \ --fasta GRCh38.fa \ -o reference_normal.cnn # Option B: Flat reference (no normal; uses GC/mappability correction only) # Use when no matched normal is available cnvkit.py reference \ --targets targets.bed \ --fasta GRCh38.fa \ --output flat_reference.cnn # Option C: Pooled normal reference from multiple normals (most robust) cnvkit.py batch \ normal1.bam normal2.bam normal3.bam \ --normal \ --targets targets.bed \ --fasta GRCh38.fa \ --output-reference pooled_reference.cnn \ --output-dir normals_cov/ echo "Reference created: pooled_reference.cnn" ``` ### Step 2: Calculate Coverage in Target and Antitarget Bins Bin the target BED file and compute per-bin read depth for tumor and normal samples. ```bash # First, create accessible bins from the target BED cnvkit.py target targets.bed \ --annotate refFlat.txt \ --split \ -o targets.split.bed cnvkit.py antitarget targets.bed \ --access data/access-5k-mappable.hg38.bed \ -o antitargets.bed # Calculate coverage for tumor sample cnvkit.py coverage tumor.bam targets.split.bed \ -o tumor.targetcoverage.cnn cnvkit.py coverage tumor.bam antitargets.bed \ -o tumor.antitargetcoverage.cnn echo "Coverage files:" echo " tumor.targetcoverage.cnn" echo " tumor.antitargetcoverage.cnn" ``` ```python # Python API equivalent: compute coverage with cnvlib import cnvlib # Load and inspect coverage files target_cov = cnvlib.read("tumor.targetcoverage.cnn") antitarget_cov = cnvlib.read("tumor.antitargetcoverage.cnn") print(f"Target bins: {len(target_cov)}") print(f"Antitarget bins: {len(antitarget_cov)}") print(f"Mean target depth: {target_cov['depth'].mean():.1f}×") print(f"Median target depth: {target_cov['depth'].median():.1f}×") # Check coverage distribution import matplotlib.pyplot as plt fig, ax = plt.subplots(figsize=(8, 4)) target_cov["depth"].clip(upper=500).hist(bins=50, ax=ax, color="#2c6fad", alpha=0.7) ax.set_xlabel("Read depth") ax.set_ylabel("Bin count") ax.set_title("Target bin coverage distribution") plt.tight_layout() plt.savefig("coverage_distribution.png", dpi=150) print("Saved: coverage_distribution.png") ``` ### Step 3: Normalize and Correct Copy Ratios Normalize tumor coverage against the reference (correcting for GC bias, library depth, and target efficiency). ```bash # Fix sample-level biases relative to the reference cnvkit.py fix \ tumor.targetcoverage.cnn \ tumor.antitargetcoverage.cnn \ pooled_reference.cnn \ -o tumor.cnr echo "Copy number ratios: tumor.cnr" # tumor.cnr columns: chromosome, start, end, gene, log2, depth, weight ``` ```python # Inspect copy ratio file with cnvlib Python API import cnvlib import pandas as pd cnr = cnvlib.read("tumor.cnr") print(f"Total bins: {len(cnr)}") print(f"Chromosomes: {sorted(cnr.chromosome.unique())}") # Convert to DataFrame and inspect df = cnr.data print(f"\nLog2 copy ratio summary:") print(df["log2"].describe().round(3)) # Flag high-amplitude events high_amp = df[df["log2"] >= 2.0] hom_del = df[df["log2"] <= -3.0] print(f"\nHigh amplitude bins (log2 >= 2.0): {len(high_amp)}") print(f"Homozygous deletion bins (log2 <= -3.0): {len(hom_del)}") if not high_amp.empty: print(high_amp[["chromosome", "start", "end", "gene", "log2"]].head()) ``` ### Step 4: Segment Copy Number Ratios Identify contiguous regions of similar copy ratio using CBS (Circular Binary Segmentation) or HMM segmentation. ```bash # CBS segmentation (default; requires R DNAcopy) cnvkit.py segment tumor.cnr \ -o tumor.cns \ --method cbs # HMM segmentation (no R required; faster) cnvkit.py segment tumor.cnr \ -o tumor.hmm.cns \ --method hmm echo "Segments: tumor.cns" # tumor.cns columns: chromosome, start, end, gene, log2, cn, depth, p_ttest, weight, probes ``` ```python # Python API: segment and inspect import cnvlib import subprocess # Run segmentation via Python subprocess (mirrors CLI) subprocess.run( ["cnvkit.py", "segment", "tumor.cnr", "-o", "tumor.cns", "--method", "cbs"], check=True ) # Load and analyze segments cns = cnvlib.read("tumor.cns") df_seg = cns.data print(f"Total segments: {len(df_seg)}") print(f"\nSegment log2 summary:") print(df_seg["log2"].describe().round(3)) # Large segments (>5 Mb) with copy gain or loss large_events = df_seg[ ((df_seg["end"] - df_seg["start"]) > 5_000_000) & (df_seg["log2"].abs() > 0.3) ].copy() large_events["size_mb"] = (large_events["end"] - large_events["start"]) / 1e6 print(f"\nLarge CNV segments (>5 Mb, |log2|>0.3): {len(large_events)}") print(large_events[["chromosome", "start", "end", "log2", "size_mb"]].head(8).to_string(index=False)) ``` ### Step 5: Call CNV States Assign integer copy number states and classify amplifications and deletions. ```bash # Call with default thresholds (diploid normal) cnvkit.py call tumor.cns \ -o tumor.call.cns \ --ploidy 2 # Call with tumor purity estimate (if known) cnvkit.py call tumor.cns \ --purity 0.7 \ --ploidy 2 \ -o tumor.call.purity.cns echo "Called CNVs: tumor.call.cns" ``` ```python # Parse called CNV file and classify events import cnvlib import pandas as pd cns_called = cnvlib.read("tumor.call.cns") df = cns_called.data # Classify by log2 thresholds (diploid assumed) # log2 >= 1.0 = high-level amplification (CN >= 4) # log2 0.2–1.0 = copy gain (CN = 3) # log2 -1.0–-0.2 = heterozygous deletion (CN = 1) # log2 <= -3.5 = homozygous deletion (CN = 0) def classify_cnv(log2): if log2 >= 1.0: return "AMP" if log2 >= 0.2: return "GAIN" if log2 <= -3.5: return "HOMDEL" if log2 <= -1.0: return "LOSS" return "NEUTRAL" df["cnv_class"] = df["log2"].apply(classify_cnv) print("CNV class counts:") print(df["cnv_class"].value_counts().to_string()) # Focal amplifications in known oncogenes oncogenes = ["ERBB2", "MYC", "EGFR", "CCND1", "CDK6", "MDM2", "KRAS"] focal_amps = df[(df["cnv_class"] == "AMP") & (df["gene"].str.split(",").apply( lambda genes: any(g in oncogenes for g in genes)))] if not focal_amps.empty: print(f"\nFocal amplifications in oncogenes:") print(focal_amps[["chromosome", "start", "end", "gene", "log2", "cn"]].to_string(index=False)) ``` ### Step 6: Visualize CNV Profile Generate scatter plots and chromosome diagrams to review the copy number landscape. ```bash # Genome-wide scatter plot (CNR bins + segments) cnvkit.py scatter tumor.cnr \ -s tumor.cns \ -o tumor-scatter.png # Chromosome diagram (color-coded by CN state) cnvkit.py diagram tumor.cnr \ -s tumor.cns \ -o tumor-diagram.pdf # Heatmap across multiple samples cnvkit.py heatmap sample1.cns sample2.cns sample3.cns \ -o cohort_heatmap.pdf echo "Plots saved: tumor-scatter.png, tumor-diagram.pdf, cohort_heatmap.pdf" ``` ```python # Custom scatter plot with matplotlib highlighting specific genes import cnvlib import matplotlib.pyplot as plt import numpy as np cnr = cnvlib.read("tumor.cnr") cns = cnvlib.read("tumor.cns") # Plot chr7 (EGFR locus) in detail fig, ax = plt.subplots(figsize=(12, 4)) chrom = "chr7" cnr_chr = cnr.data[cnr.data["chromosome"] == chrom] cns_chr = cns.data[cns.data["chromosome"] == chrom] # Bin dots ax.scatter(cnr_chr["start"], cnr_chr["log2"], s=3, alpha=0.3, color="#aaa", label="Bins") # Segment lines for _, seg in cns_chr.iterrows(): ax.hlines(seg["log2"], seg["start"], seg["end"], colors=("#d32f2f" if seg["log2"] > 0.3 else "#1565c0" if seg["log2"] < -0.3 else "#666"), lw=3, label="_nolegend_") # Mark EGFR egfr_start, egfr_end = 55_019_017, 55_211_628 ax.axvspan(egfr_start, egfr_end, alpha=0.15, color="orange") ax.text((egfr_start + egfr_end) / 2, ax.get_ylim()[1] * 0.85, "EGFR", ha="center", fontsize=9, color="darkorange") ax.axhline(0, color="black", lw=0.8, ls="--") ax.axhline(0.585, color="#d32f2f", lw=0.5, ls=":") # gain threshold ax.axhline(-1.0, color="#1565c0", lw=0.5, ls=":") # loss threshold ax.set_xlabel("Chromosome 7 position (bp)") ax.set_ylabel("Log2 copy ratio") ax.set_title(f"CNV Profile — {chrom}") plt.tight_layout() plt.savefig("chr7_cnv_scatter.png", dpi=150, bbox_inches="tight") print("Saved: chr7_cnv_scatter.png") ``` ### Step 7: Estimate Tumor Purity and Ploidy Use CNVkit's purity/ploidy estimation to interpret absolute copy numbers. ```bash # Estimate purity and ploidy from the segmented CNV profile cnvkit.py call tumor.cns \ --purity auto \ --ploidy 2 \ --method clonal \ -o tumor.call.auto.cns \ --center median # Print purity/ploidy estimate embedded in header head -5 tumor.call.auto.cns ``` ```python # Python API: purity/ploidy estimation with cnvlib import cnvlib from cnvlib import segmetrics cns = cnvlib.read("tumor.cns") cnr = cnvlib.read("tumor.cnr") # Compute segment-level statistics cns_with_stats = segmetrics.do_segmetrics(cnr, cns, location_stats=["mean", "median"], spread_stats=["stdev"]) # Export stats to CSV for review df = cns_with_stats.data df.to_csv("tumor_segment_stats.csv", index=False) print(f"Segment stats saved: tumor_segment_stats.csv") print(f"Segments: {len(df)}") print(df[["chromosome", "start", "end", "log2", "cn", "mean", "stdev"]].head(8).to_string(index=False)) ``` ### Step 8: Export to VCF, BED, and SEG Formats Export CNV calls for downstream tools (GISTIC2, cBioPortal, IGV, clinical reporting). ```bash # Export to VCF (for clinical variant databases) cnvkit.py export vcf tumor.call.cns \ -o tumor.cnv.vcf # Export to SEG format (for GISTIC2 and cBioPortal) cnvkit.py export seg tumor.call.cns \ -o tumor.seg # Export to BED format (for bedtools/IGV) cnvkit.py export bed tumor.call.cns \ -o tumor.cnv.bed echo "Exported:" echo " tumor.cnv.vcf — VCF format" echo " tumor.seg — SEG format (GISTIC2 / cBioPortal)" echo " tumor.cnv.bed — BED format (IGV / bedtools)" ``` ```python # Parse and summarize SEG file import pandas as pd seg = pd.read_csv("tumor.seg", sep="\t", names=["sample", "chrom", "start", "end", "n_probes", "log2"], comment="#") print(f"SEG file: {len(seg)} segments") print(seg.head()) # Count amplifications and deletions by chromosome arm amp_count = (seg["log2"] >= 0.585).sum() del_count = (seg["log2"] <= -1.0).sum() print(f"\nAmplifications (log2 >= 0.585): {amp_count}") print(f"Deletions (log2 <= -1.0): {del_count}") seg.to_csv("tumor_seg_annotated.csv", index=False) ``` ## Key Parameters | Parameter | Default | Range / Options | Effect | |-----------|---------|-----------------|--------| | `--method` (batch/coverage) | `"hybrid"` | `"hybrid"`, `"wgs"`, `"amplicon"` | Sequencing method; selects target binning strategy | | `--segment-method` (segment) | `"cbs"` | `"cbs"`, `"hmm"`, `"haar"`, `"none"` | Segmentation algorithm; CBS requires R DNAcopy | | `--ploidy` (call) | `2` | `1`–`6` | Assumed baseline ploidy for absolute CN calling | | `--purity` (call) | `1.0` | `0.1`–`1.0` or `"auto"` | Tumor cell fraction; corrects log2 ratios for admixed normal | | `--target-avg-size` (target) | `200` (WES) | `50`–`500` bp | Desired mean target bin size after splitting | | `--antitarget-avg-size` | `150000` | `10000`–`500000` bp | Antitarget bin size (larger = fewer bins, less noise) | | `--drop-low-coverage` (segment) | off | flag | Drop bins with depth < 5× before segmentation | | `-p` / `--processes` (coverage) | `1` | `1`–CPU count | Parallel processes for coverage calculation | | `--scatter` (batch) | off | flag | Automatically generate scatter plot | | `--diagram` (batch) | off | flag | Automatically generate chromosome diagram | ## Common Recipes ### Recipe: Tumor-Only Analysis with Flat Reference When to use: No matched normal is available; use a flat GC/mappability-corrected reference. ```bash # Step 1: Create flat reference cnvkit.py reference \ --targets targets.bed \ --fasta GRCh38.fa \ --output flat_reference.cnn # Step 2: Run batch on tumor-only cnvkit.py batch tumor.bam \ --reference flat_reference.cnn \ --output-dir tumor_only_results/ echo "Tumor-only results: tumor_only_results/" ``` ### Recipe: WGS Mode for Whole-Genome Data When to use: Input is whole-genome sequencing (no capture BED needed). ```bash # WGS mode uses a uniform genome-wide bin grid cnvkit.py batch tumor_wgs.bam \ --normal normal_wgs.bam \ --method wgs \ --fasta GRCh38.fa \ --output-dir wgs_results/ \ --scatter echo "WGS CNV analysis complete: wgs_results/" ``` ### Recipe: Snakemake Integration for Multi-Sample Cohort When to use: Automate CNVkit for a cohort of paired tumor/normal samples. ```python # Snakefile — CNVkit batch pipeline configfile: "config.yaml" SAMPLES = config["tumor_samples"] # list of sample names GENOME = config["genome_fasta"] TARGETS = config["targets_bed"] POOLED = config["pooled_reference"] # pre-built pooled normal reference rule all: input: expand("cnvkit/{sample}.call.cns", sample=SAMPLES), expand("cnvkit/{sample}-scatter.png", sample=SAMPLES), rule cnvkit_batch: input: tumor = "bam/{sample}.tumor.bam", ref = POOLED, output: cnr = "cnvkit/{sample}.cnr", cns = "cnvkit/{sample}.cns", call = "cnvkit/{sample}.call.cns", scatter = "cnvkit/{sample}-scatter.png", threads: 4 shell: """ cnvkit.py batch {input.tumor} \ --reference {input.ref} \ --targets {TARGETS} \ --fasta {GENOME} \ --output-dir cnvkit/ \ --scatter --processes {threads} cnvkit.py call cnvkit/{wildcards.sample}.cns \ --ploidy 2 -o {output.call} """ ``` ### Recipe: Export Gene-Level CNV Table When to use: Summarize copy number status for a panel of cancer genes from a called CNS file. ```python import cnvlib import pandas as pd cns = cnvlib.read("tumor.call.cns") df = cns.data # Cancer genes of interest cancer_genes = ["ERBB2", "MYC", "EGFR", "CDKN2A", "RB1", "TP53", "KRAS", "PTEN", "BRCA1", "BRCA2", "MDM2", "CDK4"] rows = [] for gene in cancer_genes: hits = df[df["gene"].str.contains(gene, na=False)] if hits.empty: rows.append({"gene": gene, "log2_mean": float("nan"), "cn": "n/a", "status": "not_detected"}) else: best = hits.loc[hits["log2"].abs().idxmax()] status = ("AMP" if best["log2"] >= 1.0 else "GAIN" if best["log2"] >= 0.2 else "HOMDEL" if best["log2"] <= -3.5 else "LOSS" if best["log2"] <= -1.0 else "NEUTRAL") rows.append({"gene": gene, "log2_mean": round(best["log2"], 3), "cn": best.get("cn", "?"), "status": status}) result_df = pd.DataFrame(rows) print(result_df.to_string(index=False)) result_df.to_csv("cancer_gene_cnv_summary.csv", index=False) print("\nSaved: cancer_gene_cnv_summary.csv") ``` ## Expected Outputs | Output File | Format | Description | |-------------|--------|-------------| | `tumor.targetcoverage.cnn` | CNN | Per-bin target region read depth and log2 coverage | | `tumor.antitargetcoverage.cnn` | CNN | Per-bin antitarget region coverage (off-target reads) | | `tumor.cnr` | CNR | Normalized log2 copy ratio per bin (GC/depth corrected) | | `tumor.cns` | CNS | Segmented copy ratios (CBS/HMM output); one row per segment | | `tumor.call.cns` | CNS | Called copy number states with integer CN and purity adjustment | | `tumor-scatter.png` | PNG | Genome-wide scatter plot of bins + segment overlays | | `tumor-diagram.pdf` | PDF | Chromosome arm diagram color-coded by CN state | | `tumor.seg` | SEG | GISTIC2/cBioPortal-format segment file | | `tumor.cnv.vcf` | VCF | VCF-format CNV calls for clinical databases | ## Troubleshooting | Problem | Cause | Solution | |---------|-------|----------| | High noise in antitarget bins | Low off-target coverage (<0.1× mean) | Increase `--antitarget-avg-size` to 500kb; use `--drop-low-coverage` to exclude low bins before segmentation | | CBS segmentation fails with `Error in DNAcopy` | R DNAcopy not installed or incompatible | Install via `BiocManager::install("DNAcopy")`; alternatively use `--method hmm` (no R required) | | All segments near 0 (no CNVs detected) | Low tumor purity (<20%) or shallow coverage | Verify purity with `cnvkit.py call --purity auto`; check target coverage depth with `cnvkit.py coverage` | | Wavy GC bias across chromosomes | GC normalization failed or flat reference used with biased tumor | Rebuild reference with matched normal BAMs; check `cnvkit.py reference` includes `--fasta` for GC correction | | Many very short segments (over-segmentation) | CBS threshold too sensitive | Add `--threshold 0.2` or `--smooth-cbs` to reduce false segment boundaries; increase minimum segment size | | `ImportError: No module named cnvlib` | CNVkit not installed in active environment | Activate the correct conda env: `conda activate cnvkit_env`; verify `cnvkit.py version` | | Chromosome naming mismatch | BAM uses `1` but reference uses `chr1` or vice versa | Ensure BAM, BED, and FASTA all use the same chromosome naming convention | ## References - [CNVkit documentation](https://cnvkit.readthedocs.io) — full CLI and API reference, algorithm details, and tutorials - [Talevich E et al. (2016) PLoS Comput Biol 12(4): e1004873](https://doi.org/10.1371/journal.pcbi.1004873) — CNVkit original paper with algorithm description and benchmarks - [CNVkit GitHub: etal/cnvkit](https://github.com/etal/cnvkit) — source code, issue tracker, and example datasets - [UCSC Genome Browser data files](https://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/) — refFlat and access BED files used in CNVkit preprocessing