--- name: "gtars" description: "Rust-backed Python library for fast genomic token arithmetic and BED processing. High-performance BED I/O, interval set ops (intersect, merge, complement, subtract), region tokenization against a universe, universe construction. Use for preprocessing large BED collections and ML token vocabularies." license: "MIT" --- # GTARS: Fast Genomic Token Arithmetic and BED File Processing ## Overview GTARS is a Python library with a Rust-backed core for high-performance genomic interval operations. It provides BED file I/O, set-theoretic interval operations (intersection, union, merge, complement, subtract), genomic region tokenization against a reference universe, and utilities for building consensus universe BED files. GTARS is designed for workflows that process hundreds to thousands of BED files efficiently, serving as a preprocessing engine for ML pipelines (including geniml) and general bioinformatics pipelines. ## When to Use - Read and write large BED files efficiently, leveraging Rust-backed parsing for speed over pure Python alternatives - Compute genomic interval intersections, merges, complements, or subtracts between BED file pairs or sets - Tokenize a collection of genomic regions against a fixed universe vocabulary for ML input preparation - Build consensus universe BED files from a collection of sample BED files - Count overlap statistics between two BED files without launching bedtools processes - Preprocess ATAC-seq, ChIP-seq, or ENCODE peak files before feeding into geniml or other ML tools - For full BED/BAM/SAM reading with CIGAR-level detail, use `pysam-genomic-files` instead ## Prerequisites - **Python packages**: `gtars`, `numpy` (optional, for array conversion) - **Data requirements**: BED format files (3+ columns: chr, start, end); optionally a universe BED file for tokenization - **Environment**: Python 3.8+; Rust toolchain not required (pre-built wheels available on PyPI) ```bash pip install gtars ``` ## Quick Start ```python from gtars import GenomicIntervalSet # Load a BED file and inspect gis = GenomicIntervalSet("peaks.bed") print(f"Loaded {len(gis)} intervals") print(f"First interval: {gis[0]}") # First interval: chr1:1000-2000 # Intersect two BED files gis2 = GenomicIntervalSet("other_peaks.bed") overlap = gis.intersect(gis2) print(f"Overlapping intervals: {len(overlap)}") ``` ## Core API ### Module 1: GenomicIntervalSet — BED File I/O Primary data structure for loading, indexing, and writing genomic intervals. ```python from gtars import GenomicIntervalSet # Load from file gis = GenomicIntervalSet("peaks.bed") print(f"Intervals loaded: {len(gis)}") print(f"Chromosomes present: {gis.chromosomes}") # Access by index interval = gis[0] print(f"chr={interval.chr}, start={interval.start}, end={interval.end}") # Iterate all intervals for iv in gis: width = iv.end - iv.start if width > 5000: print(f"Wide interval: {iv.chr}:{iv.start}-{iv.end} ({width} bp)") break ``` ```python from gtars import GenomicIntervalSet, GenomicInterval # Create from a list of GenomicInterval objects intervals = [ GenomicInterval("chr1", 100, 500), GenomicInterval("chr1", 600, 1200), GenomicInterval("chr2", 300, 800), ] gis = GenomicIntervalSet(intervals) print(f"Created GenomicIntervalSet with {len(gis)} intervals") # Write to BED file gis.to_bed("output_intervals.bed") print("Saved output_intervals.bed") ``` ### Module 2: Interval Set Operations Compute intersections, unions, merges, subtracts, and complements between BED sets. 
```python
from gtars import GenomicIntervalSet

peaks_a = GenomicIntervalSet("condition_A.bed")
peaks_b = GenomicIntervalSet("condition_B.bed")

# Intersection: intervals present in both sets
shared = peaks_a.intersect(peaks_b)
print(f"Shared intervals: {len(shared)}")

# Subtraction: intervals in A not overlapping B
a_only = peaks_a.subtract(peaks_b)
print(f"A-specific intervals: {len(a_only)}")

# Union (merge both sets, then merge overlapping)
combined = peaks_a.union(peaks_b)
print(f"Union intervals: {len(combined)}")
```

```python
from gtars import GenomicIntervalSet

# Merge overlapping/adjacent intervals within a single set
gis = GenomicIntervalSet("fragmented_peaks.bed")
print(f"Before merge: {len(gis)} intervals")
merged = gis.merge()
print(f"After merge: {len(merged)} intervals")

# Complement: genome-wide intervals NOT covered by peaks
# Requires chromosome sizes
chrom_sizes = {"chr1": 248956422, "chr2": 242193529, "chrX": 156040895}
complement = gis.complement(chrom_sizes)
print(f"Complement (uncovered) intervals: {len(complement)}")
```

### Module 3: Tokenization

Convert genomic intervals to integer token IDs against a reference universe vocabulary.

```python
from gtars import Tokenizer, GenomicIntervalSet

# Initialize tokenizer with a universe BED file
tokenizer = Tokenizer("universe.bed")
print(f"Universe vocabulary size: {len(tokenizer)}")

# Tokenize a single BED file → list of token IDs
gis = GenomicIntervalSet("sample_peaks.bed")
tokens = tokenizer.tokenize(gis)
print(f"Token IDs: {tokens[:10]} ...")
print(f"Total tokens: {len(tokens)}")
```

```python
import numpy as np

from gtars import Tokenizer, GenomicIntervalSet

tokenizer = Tokenizer("universe.bed")

# Tokenize and convert to numpy array for ML
gis = GenomicIntervalSet("sample_peaks.bed")
tokens = tokenizer.tokenize(gis)
token_array = np.array(tokens)
print(f"Token array shape: {token_array.shape}, dtype: {token_array.dtype}")

# Build a binary presence/absence vector over the full universe
vocab_size = len(tokenizer)
presence_vector = np.zeros(vocab_size, dtype=np.float32)
presence_vector[token_array] = 1.0
print(f"Presence vector shape: {presence_vector.shape}")
print(f"Fraction of universe covered: {presence_vector.mean():.4f}")
```

### Module 4: Universe Building

Construct a consensus non-overlapping universe from a collection of BED files.

```python
from gtars import UniverseBuilder

# Build universe from multiple BED files
bed_files = ["sample_1.bed", "sample_2.bed", "sample_3.bed", "sample_4.bed"]
builder = UniverseBuilder()
universe = builder.build(bed_files)
print(f"Universe regions: {len(universe)}")

universe.to_bed("consensus_universe.bed")
print("Saved consensus_universe.bed")
```

```python
from gtars import UniverseBuilder

# Build with coverage threshold (region must appear in >= N% of samples)
bed_files = [f"chip_{i}.bed" for i in range(1, 21)]
builder = UniverseBuilder(fraction=0.25)  # present in >= 25% of samples
universe = builder.build(bed_files)
print(f"Universe (fraction>=0.25): {len(universe)} regions")

# Compare thresholds
for frac in [0.1, 0.25, 0.5]:
    b = UniverseBuilder(fraction=frac)
    u = b.build(bed_files)
    print(f"  fraction={frac}: {len(u)} regions")
```

### Module 5: Interval Statistics and Utilities

Compute coverage, overlap counts, and basic statistics over BED files.
```python
from collections import Counter

import numpy as np

from gtars import GenomicIntervalSet

gis = GenomicIntervalSet("peaks.bed")

# Basic statistics
widths = [iv.end - iv.start for iv in gis]
print(f"Interval count: {len(gis)}")
print(f"Mean width (bp): {np.mean(widths):.1f}")
print(f"Median width (bp): {np.median(widths):.1f}")
print(f"Total coverage (bp): {sum(widths):,}")

# Per-chromosome counts
chrom_counts = Counter(iv.chr for iv in gis)
for chrom, count in sorted(chrom_counts.items())[:5]:
    print(f"  {chrom}: {count} intervals")
```

```python
from gtars import GenomicIntervalSet

# Overlap counts between two sets
peaks_a = GenomicIntervalSet("condition_A.bed")
peaks_b = GenomicIntervalSet("condition_B.bed")

# Count how many intervals in A overlap at least one interval in B
overlapping_a = peaks_a.intersect(peaks_b)
pct_overlap = len(overlapping_a) / len(peaks_a) * 100
print(f"A intervals overlapping B: {len(overlapping_a)}/{len(peaks_a)} ({pct_overlap:.1f}%)")

# Filter peaks by minimum width
min_width = 200
filtered = GenomicIntervalSet([iv for iv in peaks_a if (iv.end - iv.start) >= min_width])
print(f"Peaks >= {min_width} bp: {len(filtered)}/{len(peaks_a)}")
filtered.to_bed("filtered_peaks.bed")
```

## Key Concepts

### Tokenization and Universe Vocabulary

GTARS tokenization maps genomic regions to integer indices by finding the universe region that best overlaps each query interval. If a query interval does not overlap any universe region, it receives a special out-of-vocabulary (OOV) token. This vocabulary approach is analogous to NLP tokenization and enables treating genomic intervals as discrete tokens for transformer- and embedding-based models.

```python
from gtars import Tokenizer, GenomicIntervalSet

tokenizer = Tokenizer("universe.bed")

# OOV token index is typically 0 or len(tokenizer)
print(f"OOV token ID: {tokenizer.unknown_token}")

gis = GenomicIntervalSet("test.bed")
tokens = tokenizer.tokenize(gis)
oov_count = sum(1 for t in tokens if t == tokenizer.unknown_token)
print(f"OOV rate: {oov_count}/{len(tokens)} ({oov_count/len(tokens)*100:.1f}%)")
```
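To make the mapping concrete, the sketch below reimplements the assignment rule described above in plain Python: each query interval is assigned the index of the universe region it overlaps most, or an OOV value otherwise. This is an illustration only; use the Rust-backed `Tokenizer` in practice, and note that its exact tie-breaking and OOV conventions may differ.

```python
# Conceptual sketch of overlap-based token assignment (pure Python, for
# illustration only; the real Tokenizer is far faster and its tie-breaking
# and OOV conventions may differ).
OOV = -1  # hypothetical OOV id used only in this sketch

def assign_token(query, universe):
    """Return the index of the universe region with the largest overlap,
    or OOV if the query overlaps nothing.
    query: (chr, start, end); universe: list of (chr, start, end)."""
    best_idx, best_overlap = OOV, 0
    q_chr, q_start, q_end = query
    for idx, (u_chr, u_start, u_end) in enumerate(universe):
        if u_chr != q_chr:
            continue
        overlap = min(q_end, u_end) - max(q_start, u_start)
        if overlap > best_overlap:
            best_idx, best_overlap = idx, overlap
    return best_idx

universe = [("chr1", 0, 1000), ("chr1", 1500, 2500), ("chr2", 0, 500)]
print(assign_token(("chr1", 1600, 1700), universe))  # 1 (falls in the second region)
print(assign_token(("chr3", 100, 200), universe))    # -1 (OOV: no overlapping region)
```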
## Common Workflows

### Workflow 1: BED File Preprocessing Pipeline

**Goal**: Load raw ATAC-seq peaks, filter by width, merge overlaps, and tokenize for ML.

```python
import numpy as np

from gtars import GenomicIntervalSet, Tokenizer

# Step 1: Load peaks
peaks = GenomicIntervalSet("atac_peaks_raw.bed")
print(f"Raw peaks: {len(peaks)}")

# Step 2: Filter by minimum width (remove very short noise peaks)
min_width = 150
filtered_intervals = [iv for iv in peaks if (iv.end - iv.start) >= min_width]
peaks_filtered = GenomicIntervalSet(filtered_intervals)
print(f"After width filter (>={min_width} bp): {len(peaks_filtered)}")

# Step 3: Merge overlapping peaks
peaks_merged = peaks_filtered.merge()
print(f"After merge: {len(peaks_merged)}")
peaks_merged.to_bed("atac_peaks_processed.bed")

# Step 4: Tokenize against universe
tokenizer = Tokenizer("universe.bed")
tokens = tokenizer.tokenize(peaks_merged)
token_array = np.array(tokens)
print(f"Token array: {token_array.shape}, OOV rate: {(token_array == tokenizer.unknown_token).mean():.3f}")

# Step 5: Build presence vector for ML input
vocab_size = len(tokenizer)
presence = np.zeros(vocab_size, dtype=np.float32)
valid_tokens = token_array[token_array != tokenizer.unknown_token]
presence[valid_tokens] = 1.0
np.save("sample_presence_vector.npy", presence)
print(f"Saved presence vector: {presence.shape}, coverage={presence.mean():.4f}")
```

### Workflow 2: Batch BED File Statistics

**Goal**: Compute overlap and coverage statistics across a collection of BED files.

```python
from pathlib import Path

import numpy as np
import pandas as pd

from gtars import GenomicIntervalSet

bed_dir = Path("chip_peaks/")
bed_files = sorted(bed_dir.glob("*.bed"))
reference = GenomicIntervalSet("reference_regions.bed")

records = []
for bed_file in bed_files:
    gis = GenomicIntervalSet(str(bed_file))
    widths = [iv.end - iv.start for iv in gis]
    overlap = gis.intersect(reference)
    records.append({
        "sample": bed_file.stem,
        "n_peaks": len(gis),
        "mean_width_bp": np.mean(widths),
        "total_coverage_bp": sum(widths),
        "overlap_with_ref": len(overlap),
        "pct_overlap": len(overlap) / len(gis) * 100 if len(gis) > 0 else 0,
    })

df = pd.DataFrame(records)
print(df.to_string(index=False))
df.to_csv("bed_file_statistics.csv", index=False)
print(f"\nSaved bed_file_statistics.csv ({len(df)} samples)")
```
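A natural follow-up is to scan the saved table for samples with unusually low reference overlap. The sketch below reads the `bed_file_statistics.csv` produced above; the 20% cutoff is an arbitrary illustration, not a GTARS default.

```python
import pandas as pd

# Load the per-sample statistics written by the workflow above
df = pd.read_csv("bed_file_statistics.csv")

# Flag samples whose overlap with the reference falls below an arbitrary 20% cutoff
low_overlap = df[df["pct_overlap"] < 20].sort_values("pct_overlap")
if len(low_overlap):
    print("Samples with < 20% reference overlap (possible QC outliers):")
    print(low_overlap[["sample", "n_peaks", "pct_overlap"]].to_string(index=False))
else:
    print("All samples overlap the reference by at least 20%.")
```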
### Workflow 3: Build Universe and Tokenize Corpus

**Goal**: Starting from scratch, build a universe from a BED corpus and tokenize all files.

```python
from pathlib import Path

import numpy as np

from gtars import UniverseBuilder, Tokenizer, GenomicIntervalSet

bed_dir = Path("atac_peaks/")
bed_files = [str(f) for f in sorted(bed_dir.glob("*.bed"))]
print(f"Building universe from {len(bed_files)} BED files...")

# Build universe
builder = UniverseBuilder(fraction=0.2)
universe = builder.build(bed_files)
universe.to_bed("corpus_universe.bed")
print(f"Universe: {len(universe)} regions → corpus_universe.bed")

# Tokenize all BED files
tokenizer = Tokenizer("corpus_universe.bed")
vocab_size = len(tokenizer)
print(f"Tokenizer vocab size: {vocab_size}")

presence_matrix = []
for f in bed_files:
    gis = GenomicIntervalSet(f)
    tokens = np.array(tokenizer.tokenize(gis))
    presence = np.zeros(vocab_size, dtype=np.float32)
    valid = tokens[tokens != tokenizer.unknown_token]
    presence[valid] = 1.0
    presence_matrix.append(presence)

X = np.stack(presence_matrix)
print(f"Presence matrix: {X.shape} (samples × universe_regions)")
np.save("corpus_presence_matrix.npy", X)
print("Saved corpus_presence_matrix.npy")
```

## Key Parameters

| Parameter | Module | Default | Range / Options | Effect |
|-----------|--------|---------|-----------------|--------|
| `fraction` | UniverseBuilder | `0.5` | `0.0`–`1.0` | Minimum fraction of samples a region must appear in to enter the universe |
| `merge_dist` | merge() | `0` | `≥0` bp | Merge intervals within this distance; `0` merges only overlapping/touching intervals |
| `min_overlap` | intersect() | `1` bp | `1`–region width | Minimum overlap length required to count as an intersection |
| OOV token | Tokenizer | `tokenizer.unknown_token` | — | Token ID assigned to regions not overlapping any universe region |
| Chromosome sizes | complement() | required dict | `{chr: length}` | Genome-wide sizes used to compute uncovered complement intervals |
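As a rough illustration of how these knobs might be passed: the parameter names below follow the table above, but the exact keyword signatures are not shown elsewhere in this document, so treat them as assumptions and verify against your installed gtars version.

```python
from gtars import GenomicIntervalSet, UniverseBuilder

peaks = GenomicIntervalSet("peaks.bed")
other = GenomicIntervalSet("other_peaks.bed")

# merge_dist: also merge intervals separated by up to 100 bp (keyword name assumed from the table)
merged_relaxed = peaks.merge(merge_dist=100)
print(f"Merged with 100 bp gap tolerance: {len(merged_relaxed)} intervals")

# min_overlap: require at least 50 bp of overlap to count an intersection (keyword name assumed)
strict = peaks.intersect(other, min_overlap=50)
print(f"Intersections with >= 50 bp overlap: {len(strict)}")

# fraction: a region must appear in at least half of the input samples to enter the universe
builder = UniverseBuilder(fraction=0.5)
```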
## Common Recipes

### Recipe: Filter BED File by Chromosome List

When to use: Remove non-canonical chromosomes (patches, alternative contigs) before analysis.

```python
from gtars import GenomicIntervalSet

gis = GenomicIntervalSet("raw_peaks.bed")

canonical_chroms = {f"chr{i}" for i in list(range(1, 23)) + ["X", "Y", "M"]}
filtered = GenomicIntervalSet([iv for iv in gis if iv.chr in canonical_chroms])
filtered.to_bed("canonical_peaks.bed")
print(f"Canonical peaks: {len(filtered)}/{len(gis)}")
```

### Recipe: Compute Pairwise Jaccard Similarity Between BED Sets

When to use: Measure similarity between two peak sets without external tools.

```python
from gtars import GenomicIntervalSet

def jaccard(a: GenomicIntervalSet, b: GenomicIntervalSet) -> float:
    # Interval-count Jaccard: overlapping intervals / union intervals (not base pairs)
    intersection = len(a.intersect(b))
    union = len(a.union(b))
    return intersection / union if union > 0 else 0.0

peaks_a = GenomicIntervalSet("sample_a.bed")
peaks_b = GenomicIntervalSet("sample_b.bed")
j = jaccard(peaks_a, peaks_b)
print(f"Jaccard similarity: {j:.4f}")
```

## Expected Outputs

- **GenomicIntervalSet**: iterable of `GenomicInterval` objects with `.chr`, `.start`, `.end`
- **BED files**: written via `.to_bed()` — 3-column tab-separated (chr, start, end)
- **Token arrays**: `list[int]` from `tokenizer.tokenize()` — convert with `np.array(tokens)`
- **Presence vectors**: `np.ndarray` of shape `(vocab_size,)` built from token IDs
- **Universe BED**: non-overlapping consensus regions written via `universe.to_bed()`

## Troubleshooting

| Problem | Cause | Solution |
|---------|-------|----------|
| `ImportError: gtars` | Package not installed | `pip install gtars`; check Python version ≥ 3.8 |
| All intervals report the OOV token | BED coordinate system mismatch | Verify both the BED file and the universe use the same genome assembly and `chr` prefix convention |
| `intersect` returns an empty set | Non-overlapping chromosomes or shifted coordinates | Check that chromosome names match exactly (e.g., `chr1` vs `1`) |
| Very large universe (>5M regions) | Low `fraction` threshold with many short samples | Increase `fraction` to 0.3–0.5; pre-filter BED files to remove noise peaks |
| `to_bed()` produces unsorted output | Intervals added in arbitrary order | Sort output: `sorted_gis = GenomicIntervalSet(sorted(gis, key=lambda r: (r.chr, r.start)))` |
| Memory error on large BED collections | Holding all intervals in memory | Process files in batches; use streaming/chunked loading if available |

## Related Skills

- **geniml** — uses GTARS-built universes and tokenizers for region2vec training and BEDSpace indexing
- **pysam-genomic-files** — for BAM/SAM/VCF-level access with CIGAR and read-level detail

## References

- [GTARS GitHub repository](https://github.com/databio/gtars) — source, issue tracker, and changelog
- [PyPI: gtars](https://pypi.org/project/gtars/) — installation instructions and version history
- [Geniml documentation](https://geniml.databio.org/) — ecosystem context; GTARS is the preprocessing engine for geniml