--- name: pysam-genomic-files description: "Read/write SAM/BAM/CRAM, VCF/BCF, FASTA/FASTQ. Region queries, pileup, variant filtering, read groups. Python htslib wrapper exposing samtools/bcftools CLI. Use STAR/BWA for alignment; GATK/DeepVariant for variant calling." license: MIT --- # Pysam — Genomic File Toolkit ## Overview Pysam provides a Pythonic interface to htslib for reading, manipulating, and writing genomic data files. It handles SAM/BAM/CRAM alignments, VCF/BCF variants, and FASTA/FASTQ sequences with efficient region-based random access. Also exposes samtools and bcftools as callable Python functions. ## When to Use - Reading and querying BAM/CRAM alignment files (region extraction, read filtering) - Analyzing VCF/BCF variant files (genotype access, variant filtering, annotation) - Extracting reference sequences from indexed FASTA files - Calculating per-base coverage and pileup statistics - Building custom bioinformatics pipelines that combine alignment + variant + sequence data - Quality control of NGS data (mapping quality, flag filtering, coverage) - For **alignment from FASTQ** (read mapping), use STAR, BWA, or minimap2 instead - For **variant calling from BAM**, use GATK or DeepVariant instead ## Prerequisites ```bash pip install pysam ``` **Note**: Requires htslib C library (bundled with pip install on most platforms). On some Linux systems, may need `libhts-dev` or equivalent. Index files (`.bai`, `.tbi`, `.fai`) required for random access — create with `pysam.index()`, `pysam.tabix_index()`, or `pysam.faidx()`. ## Quick Start ```python import pysam # Read BAM file, fetch reads in a region with pysam.AlignmentFile("sample.bam", "rb") as bam: for read in bam.fetch("chr1", 1000, 2000): print(f"{read.query_name}: pos={read.reference_start}, mapq={read.mapping_quality}") print(f"Total reads in region: {bam.count('chr1', 1000, 2000)}") ``` ## Core API ### 1. Alignment Files (SAM/BAM/CRAM) Read, query, and write aligned sequencing reads. ```python import pysam # Open BAM (binary) or SAM (text) file bam = pysam.AlignmentFile("sample.bam", "rb") # rb=read BAM, r=read SAM, rc=read CRAM # Fetch reads overlapping a region (requires .bai index) for read in bam.fetch("chr1", 10000, 20000): print(f"Name: {read.query_name}") print(f" Position: {read.reference_start}-{read.reference_end}") print(f" MAPQ: {read.mapping_quality}") print(f" CIGAR: {read.cigarstring}") print(f" Sequence: {read.query_sequence[:30]}...") break # Count reads in region (fast, no iteration needed) n_reads = bam.count("chr1", 10000, 20000) print(f"Reads in region: {n_reads}") # Filter reads by quality and flags for read in bam.fetch("chr1", 10000, 20000): if read.mapping_quality >= 30 and not read.is_unmapped and not read.is_duplicate: pass # Process high-quality, mapped, non-duplicate reads bam.close() ``` ```python # Write filtered reads to a new BAM file with pysam.AlignmentFile("input.bam", "rb") as inbam: with pysam.AlignmentFile("filtered.bam", "wb", header=inbam.header) as outbam: for read in inbam.fetch("chr1", 10000, 20000): if read.mapping_quality >= 30: outbam.write(read) # Index the output pysam.index("filtered.bam") print("Created filtered.bam + filtered.bam.bai") ``` ### 2. Coverage and Pileup Analysis Calculate per-base coverage statistics. ```python import pysam import numpy as np bam = pysam.AlignmentFile("sample.bam", "rb") # Pileup: per-base coverage with read-level detail for pileup_col in bam.pileup("chr1", 10000, 10100, min_mapping_quality=30): bases = [p.alignment.query_sequence[p.query_position] for p in pileup_col.pileups if not p.is_del and p.query_position is not None] print(f"Pos {pileup_col.reference_pos}: depth={pileup_col.nsegments}, bases={''.join(bases[:5])}") # Quick coverage count per region (faster than pileup) coverage = bam.count_coverage("chr1", 10000, 10100, quality_threshold=20) # Returns tuple of 4 arrays (A, C, G, T counts per position) total_cov = np.array(coverage).sum(axis=0) print(f"Mean coverage: {total_cov.mean():.1f}x") bam.close() ``` ### 3. Variant Files (VCF/BCF) Read, query, and filter genetic variants. ```python import pysam # Open VCF/BCF file vcf = pysam.VariantFile("variants.vcf.gz") # Iterate all variants for record in vcf.fetch("chr1", 10000, 50000): print(f"{record.chrom}:{record.pos} {record.ref}>{','.join(record.alts or [])}") print(f" QUAL={record.qual}, FILTER={list(record.filter)}") print(f" INFO: {dict(record.info)}") # Access genotypes per sample for sample in record.samples: gt = record.samples[sample]["GT"] print(f" {sample}: GT={gt}") break vcf.close() ``` ```python # Filter variants and write to new VCF with pysam.VariantFile("variants.vcf.gz") as vcf_in: with pysam.VariantFile("filtered.vcf.gz", "wz", header=vcf_in.header) as vcf_out: for record in vcf_in: if record.qual and record.qual >= 30 and "PASS" in record.filter: vcf_out.write(record) pysam.tabix_index("filtered.vcf.gz", preset="vcf") print("Created filtered.vcf.gz + filtered.vcf.gz.tbi") ``` ### 4. Sequence Files (FASTA/FASTQ) Random access to reference sequences and sequential reading of raw reads. ```python import pysam # FASTA: random access (requires .fai index) fasta = pysam.FastaFile("reference.fasta") seq = fasta.fetch("chr1", 10000, 10050) print(f"Sequence ({len(seq)} bp): {seq}") print(f"Available contigs: {fasta.references[:5]}") print(f"Contig lengths: {dict(zip(fasta.references[:3], fasta.lengths[:3]))}") fasta.close() # Create FASTA index if needed # pysam.faidx("reference.fasta") ``` ```python # FASTQ: sequential reading with pysam.FastxFile("reads.fastq.gz") as fq: for i, entry in enumerate(fq): print(f"Read {entry.name}: {len(entry.sequence)} bp, mean_qual={sum(entry.get_quality_array())/len(entry.sequence):.1f}") if i >= 2: break ``` ### 5. Read Groups and Sample Information Extract and filter reads by read group (essential for multi-sample BAM files). ```python import pysam bam = pysam.AlignmentFile("multisample.bam", "rb") # Access read group information from BAM header print("Read groups in file:") for rg_dict in bam.header.get("RG", []): print(f" ID: {rg_dict['ID']}, Sample: {rg_dict.get('SM', 'N/A')}, Library: {rg_dict.get('LB', 'N/A')}, Platform: {rg_dict.get('PL', 'N/A')}") # Get all samples in the BAM (from RG headers) samples = set() for rg_dict in bam.header.get("RG", []): if "SM" in rg_dict: samples.add(rg_dict["SM"]) print(f"Samples in BAM: {sorted(samples)}") bam.close() ``` ```python # Filter reads by read group ID def extract_reads_by_rg(bam_path, rg_id, output_path): """Extract all reads from a specific read group. WARNING: Uses fetch(until_eof=True), which scans the entire BAM sequentially. Multi-sample BAMs can be tens to hundreds of GB — this may be slow. For large files, prefer region-based filtering: for read in bam.fetch("chr1", start, end): ... Or use the samtools CLI equivalent (faster for one-off extractions): samtools view -b -r input.bam -o output.bam """ with pysam.AlignmentFile(bam_path, "rb") as bam_in: with pysam.AlignmentFile(output_path, "wb", header=bam_in.header) as bam_out: for read in bam_in.fetch(until_eof=True): if read.has_tag("RG") and read.get_tag("RG") == rg_id: bam_out.write(read) pysam.index(output_path) print(f"Extracted reads from RG:{rg_id} → {output_path}") extract_reads_by_rg("multisample.bam", "SAMPLE_001_LaneA", "sample001_laneA.bam") ``` ```python from collections import defaultdict import pysam # Count reads per sample def reads_per_sample(bam_path): """Count reads per sample from read group information. Two distinct "unknown" cases are tracked separately: - "no_sm_field": RG header entry exists but is missing the SM (sample name) field. - "undefined_rg": A read carries an RG tag not declared in the BAM header. """ counts = defaultdict(int) rg_to_sample = {} with pysam.AlignmentFile(bam_path, "rb") as bam: # Build RG → sample mapping from header for rg_dict in bam.header.get("RG", []): rg_id = rg_dict["ID"] # (a) RG header entry lacks SM field rg_to_sample[rg_id] = rg_dict.get("SM", "no_sm_field") # Count reads per resolved sample name for read in bam.fetch(until_eof=True): if read.has_tag("RG"): rg_id = read.get_tag("RG") # (b) Read's RG tag is not declared in the header sample = rg_to_sample.get(rg_id, "undefined_rg") counts[sample] += 1 return dict(counts) sample_counts = reads_per_sample("multisample.bam") for sample, count in sorted(sample_counts.items()): print(f" {sample}: {count:,} reads") ``` ### 6. Samtools/Bcftools CLI Access Call samtools and bcftools commands from Python. ```python import pysam # Sort BAM file pysam.sort("-o", "sorted.bam", "input.bam") # Index BAM pysam.index("sorted.bam") # View region as BAM pysam.view("-b", "-o", "region.bam", "sorted.bam", "chr1:1000-2000") # BCFtools: compress and index VCF pysam.bcftools.view("-O", "z", "-o", "output.vcf.gz", "input.vcf") pysam.tabix_index("output.vcf.gz", preset="vcf") # Error handling try: pysam.sort("-o", "output.bam", "nonexistent.bam") except pysam.SamtoolsError as e: print(f"samtools error: {e}") ``` **CLI equivalents** (for reference — use Python API in automated pipelines): ```bash # These are equivalent to the Python calls above: samtools sort -o sorted.bam input.bam samtools index sorted.bam samtools view -b -o region.bam sorted.bam chr1:1000-2000 bcftools view -O z -o output.vcf.gz input.vcf ``` ## Key Concepts ### Coordinate Systems **Critical**: pysam uses **0-based, half-open** coordinates (Python convention): | System | Start | End | Example: "bases 1000-2000" | |--------|-------|-----|---------------------------| | pysam Python API | 0-based | exclusive | `fetch("chr1", 999, 2000)` | | samtools region string | 1-based | inclusive | `fetch("chr1:1000-2000")` | | VCF file format | 1-based | — | `record.pos` = 1-based, `record.start` = 0-based | | BED format | 0-based | exclusive | `chr1\t999\t2000` | ### Index File Requirements | File Type | Index Extension | Create With | |-----------|----------------|-------------| | BAM | `.bai` | `pysam.index("file.bam")` | | CRAM | `.crai` | `pysam.index("file.cram")` | | FASTA | `.fai` | `pysam.faidx("file.fasta")` | | VCF.gz | `.tbi` | `pysam.tabix_index("file.vcf.gz", preset="vcf")` | | BCF | `.csi` | `pysam.tabix_index("file.bcf", preset="bcf")` | Without an index, use `fetch(until_eof=True)` for sequential reading. ### File Mode Strings | Mode | Format | Direction | |------|--------|-----------| | `"rb"` | BAM (binary) | Read | | `"r"` | SAM (text) | Read | | `"rc"` | CRAM | Read | | `"wb"` | BAM | Write | | `"w"` | SAM | Write | | `"wz"` | VCF.gz (compressed) | Write | ## Common Workflows ### Workflow 1: Coverage Analysis for Target Regions **Goal**: Calculate coverage statistics for a set of target regions (e.g., exome capture targets). ```python import pysam import numpy as np def coverage_for_regions(bam_path, regions, min_mapq=30): """Calculate coverage stats for a list of (chrom, start, end) regions.""" results = [] with pysam.AlignmentFile(bam_path, "rb") as bam: for chrom, start, end in regions: cov = np.array(bam.count_coverage(chrom, start, end, quality_threshold=min_mapq)) total = cov.sum(axis=0) results.append({ "region": f"{chrom}:{start}-{end}", "mean_cov": total.mean(), "min_cov": total.min(), "pct_above_20x": (total >= 20).mean() * 100, }) return results regions = [("chr1", 10000, 10500), ("chr1", 20000, 20500), ("chr2", 5000, 5500)] stats = coverage_for_regions("sample.bam", regions) for s in stats: print(f"{s['region']}: mean={s['mean_cov']:.1f}x, min={s['min_cov']}x, ≥20x={s['pct_above_20x']:.1f}%") ``` ### Workflow 2: Variant Annotation with Read Support **Goal**: For each variant in a VCF, count supporting reads from the BAM. ```python import pysam def annotate_variants_with_reads(vcf_path, bam_path, output_path): """Add read support counts to each variant.""" with pysam.VariantFile(vcf_path) as vcf_in: # Add INFO field to header vcf_in.header.add_line( '##INFO=' ) with pysam.VariantFile(output_path, "w", header=vcf_in.header) as vcf_out: with pysam.AlignmentFile(bam_path, "rb") as bam: for record in vcf_in: alt_count = 0 for col in bam.pileup(record.chrom, record.start, record.stop, min_mapping_quality=30, truncate=True): if col.reference_pos == record.start: for p in col.pileups: if (not p.is_del and p.query_position is not None and p.alignment.query_sequence[p.query_position] in (record.alts or [])): alt_count += 1 record.info["READ_SUPPORT"] = alt_count vcf_out.write(record) annotate_variants_with_reads("variants.vcf", "sample.bam", "annotated.vcf") print("Created annotated.vcf with READ_SUPPORT field") ``` ## Key Parameters | Parameter | Module | Default | Range / Options | Effect | |-----------|--------|---------|-----------------|--------| | mode string | `AlignmentFile`, `VariantFile` | — | `"rb"`, `"r"`, `"rc"`, `"wb"`, `"w"`, `"wz"` | File format and read/write direction | | `min_mapping_quality` | `pileup()` | 0 | 0–60 | Filter reads below this MAPQ | | `quality_threshold` | `count_coverage()` | 15 | 0–40 | Minimum base quality to count | | `truncate` | `pileup()` | False | True/False | Truncate pileup to exact region (True) vs include overlapping reads (False) | | `until_eof` | `fetch()` | False | True/False | Read all records sequentially without index | | `multiple_iterators` | `fetch()` | False | True/False | Allow multiple simultaneous iterators (slight overhead) | | `preset` | `tabix_index()` | — | `"vcf"`, `"bed"`, `"gff"`, `"sam"` | File format for tabix indexing | ## Best Practices 1. **Always use context managers** (`with` statement) for automatic file cleanup. Unclosed files can leak file descriptors. 2. **Create and verify index files first**: Most random-access operations fail silently or raise cryptic errors without indexes. Check for `.bai`/`.tbi`/`.fai` files before queries. 3. **Use `count()` instead of iterating to count reads**: `bam.count("chr1", 1000, 2000)` is much faster than `sum(1 for _ in bam.fetch(...))`. 4. **Use `count_coverage()` for coverage, `pileup()` for base-level detail**: `count_coverage()` is faster when you only need depth numbers. Use `pileup()` only when you need per-read, per-base information. 5. **Anti-pattern — mixing 0-based and 1-based coordinates**: Always double-check coordinate systems when combining pysam (0-based) with VCF files (1-based POS), BED files (0-based), or region strings (1-based). See Key Concepts table. 6. **Anti-pattern — forgetting `truncate=True` in pileup**: Without `truncate=True`, `pileup()` extends to the full extent of overlapping reads, which can be much larger than the requested region. ## Common Recipes ### Recipe: Extract Gene Sequences from Reference ```python import pysam def get_gene_sequence(fasta_path, chrom, start, end, strand="+"): """Extract gene sequence, reverse-complement if on minus strand.""" with pysam.FastaFile(fasta_path) as fasta: seq = fasta.fetch(chrom, start, end) if strand == "-": complement = str.maketrans("ACGTacgt", "TGCAtgca") seq = seq.translate(complement)[::-1] return seq seq = get_gene_sequence("reference.fasta", "chr1", 10000, 11000, strand="-") print(f"Gene sequence ({len(seq)} bp): {seq[:50]}...") ``` ### Recipe: BAM Statistics Summary ```python import pysam def bam_summary(bam_path): """Quick summary statistics for a BAM file.""" with pysam.AlignmentFile(bam_path, "rb") as bam: stats = {"total": 0, "mapped": 0, "unmapped": 0, "duplicates": 0, "mapq_ge30": 0} for read in bam.fetch(until_eof=True): stats["total"] += 1 if read.is_unmapped: stats["unmapped"] += 1 else: stats["mapped"] += 1 if read.is_duplicate: stats["duplicates"] += 1 if read.mapping_quality >= 30: stats["mapq_ge30"] += 1 return stats summary = bam_summary("sample.bam") for k, v in summary.items(): print(f" {k}: {v:,}") ``` ## Troubleshooting | Problem | Cause | Solution | |---------|-------|----------| | `ValueError: could not open alignment file` | Missing file or wrong mode string | Check file path; use `"rb"` for BAM, `"r"` for SAM | | `ValueError: fetch called on bamfile without index` | No `.bai` index file | Run `pysam.index("file.bam")` first | | Region returns unexpected reads | Reads overlapping boundaries are included | Use `truncate=True` in `pileup()` or filter by `read.reference_start >= start` | | Coordinate off-by-one errors | Mixing 0-based (pysam) with 1-based (VCF, samtools) | See Key Concepts coordinate table; `record.pos` is 1-based, `record.start` is 0-based | | `PileupProxy accessed after iterator finished` | Pileup iterator went out of scope | Store needed data from pileup columns immediately, don't save PileupProxy references | | `SamtoolsError` from CLI calls | Invalid arguments or missing input | Wrap in `try/except pysam.SamtoolsError`; check samtools docs for argument syntax | | Very slow iteration | Iterating all reads without region query | Use `fetch("chr1", start, end)` for targeted queries; use indexed files | | Read group filter returns 0 reads | RG tag missing or wrong ID specified | Verify RG tag exists: `read.has_tag("RG")`; list available RGs from `bam.header.get("RG", [])` | ## Related Skills - **biopython-molecular-biology** — sequence I/O and alignment; complementary for non-BAM sequence formats - **pydeseq2-differential-expression** — downstream analysis of read counts from BAM coverage data - **scanpy-scrna-seq** — single-cell analysis; pysam handles the upstream BAM processing ## References - [Pysam documentation](https://pysam.readthedocs.io/) — official API reference - [htslib](https://github.com/samtools/htslib) — underlying C library for genomic file formats - [SAM format specification](https://samtools.github.io/hts-specs/SAMv1.pdf) — SAM/BAM format details - Li et al. (2009) "The Sequence Alignment/Map format and SAMtools" — [Bioinformatics](https://doi.org/10.1093/bioinformatics/btp352)