--- name: bioinformatics-fundamentals description: Core bioinformatics concepts including SAM/BAM format, AGP genome assembly format, sequencing technologies (Hi-C, HiFi, Illumina), quality metrics, and common data processing patterns. Essential for debugging alignment, filtering, pairing issues, and AGP coordinate validation. version: 1.1.0 --- # Bioinformatics Fundamentals Foundation knowledge for genomics and bioinformatics workflows. Provides essential understanding of file formats, sequencing technologies, and common data processing patterns. ## When to Use This Skill - Working with sequencing data (PacBio HiFi, Hi-C, Illumina) - Debugging SAM/BAM alignment or filtering issues - Processing AGP files for genome assembly curation - Validating AGP coordinate systems and unloc assignments - Understanding paired-end vs single-end data - Interpreting quality metrics (MAPQ, PHRED scores) - Troubleshooting empty outputs or broken read pairs - General bioinformatics data analysis ## SAM/BAM Format Essentials ### SAM Flags (Bitwise) Flags are **additive** - a read can have multiple flags set simultaneously. **Common Flags:** - `0x0001` (1): Read is paired in sequencing - `0x0002` (2): **Each segment properly aligned** (proper pair) - `0x0004` (4): Read unmapped - `0x0008` (8): Mate unmapped - `0x0010` (16): Read mapped to reverse strand - `0x0020` (32): Mate mapped to reverse strand - `0x0040` (64): First in pair (R1/forward) - `0x0080` (128): Second in pair (R2/reverse) - `0x0100` (256): Secondary alignment - `0x0400` (1024): PCR or optical duplicate - `0x0800` (2048): Supplementary alignment **Flag Combinations:** - Properly paired R1: `99` (0x63 = 1 + 2 + 32 + 64) - Properly paired R2: `147` (0x93 = 1 + 2 + 16 + 128) - Unmapped read: `4` - Mate unmapped: `8` ### Proper Pair Flag (0x0002) **What "proper pair" means:** - Both R1 and R2 are mapped - Mapping orientations are correct (typically R1 forward, R2 reverse) - Insert size is reasonable for the library - Pair conforms to aligner's expectations **Important:** Different aligners have different criteria for proper pairs! ### MAPQ (Mapping Quality) **Formula:** `MAPQ = -10 * log10(P(mapping is wrong))` **Common Thresholds:** - `MAPQ >= 60`: High confidence (error probability < 0.0001%) - `MAPQ >= 30`: Good quality (error probability < 0.1%) - `MAPQ >= 20`: Acceptable (error probability < 1%) - `MAPQ >= 10`: Low confidence (error probability < 10%) - `MAPQ = 0`: Multi-mapper or unmapped **Note:** MAPQ=0 can mean either unmapped OR equally good multiple mappings. ### CIGAR String Represents alignment between read and reference: - `M`: Match or mismatch (alignment match) - `I`: Insertion in read vs reference - `D`: Deletion in read vs reference - `S`: Soft clipping (bases in read not aligned) - `H`: Hard clipping (bases not in read sequence) - `N`: Skipped region (for RNA-seq splicing) **Example:** `100M` = perfect 100bp match **Example:** `50M5I45M` = 50bp match, 5bp insertion, 45bp match ## Sequencing Technologies ### PacBio HiFi (High Fidelity) **Characteristics:** - Long reads: 10-25 kb typical - High accuracy: >99.9% (Q20+) - Circular Consensus Sequencing (CCS) - Single-end data (though from circular molecules) - Excellent for de novo assembly **Best Mappers:** - minimap2 presets: `map-pb`, `map-hifi` - BWA-MEM2 can work but optimized for short reads **Typical Use Cases:** - De novo genome assembly - Structural variant detection - Isoform sequencing (Iso-Seq) - Haplotype phasing ### Hi-C (Chromatin Conformation Capture) **Characteristics:** - Paired-end short reads (typically 100-150 bp) - Read pairs capture chromatin interactions - **R1 and R2 often map to different scaffolds/chromosomes** - Requires careful proper pair handling - Used for scaffolding and 3D genome structure **Best Mappers:** - BWA-MEM2 (paired-end mode) - BWA-MEM (paired-end mode) **Critical Concept:** Hi-C read pairs **intentionally** map to distant loci. Region filtering can easily break pairs! **Typical Use Cases:** - Genome scaffolding (connecting contigs) - 3D chromatin structure analysis - Haplotype phasing - Assembly quality assessment ### Illumina Short Reads **Characteristics:** - Short reads: 50-300 bp - Paired-end or single-end - High throughput - Well-established quality scores **Best Mappers:** - BWA-MEM2, BWA-MEM (general purpose) - Bowtie2 (fast, local alignment) - STAR (RNA-seq spliced alignment) ## Common Tools and Their Behaviors ### samtools view **Purpose:** Filter, convert, and view SAM/BAM files **Key Flags:** - `-b`: Output BAM format - `-h`: Include header - `-f INT`: Require flags (keep reads WITH these flags) - `-F INT`: Filter flags (remove reads WITH these flags) - `-q INT`: Minimum MAPQ threshold - `-L FILE`: Keep reads overlapping regions in BED file **Important Behavior:** - `-L` (region filtering) checks **each read individually**, not pairs - Can break read pairs if mates map to different regions - Flag filters (`-f`, `-F`) are applied **before** region filters (`-L`) **Example - Filter for proper pairs:** ```bash samtools view -b -f 2 input.bam > proper_pairs.bam ``` **Example - Filter by region (may break pairs):** ```bash samtools view -b -L regions.bed input.bam > filtered.bam ``` **Example - Proper pairs in regions (correct order):** ```bash samtools view -b -f 2 -L regions.bed input.bam > proper_pairs_in_regions.bam ``` ### bamtools filter **Purpose:** Advanced filtering with complex criteria **Key Features:** - Can filter on multiple properties simultaneously - More strict about pair validation than samtools - Supports JSON filter rules **Common Filters:** - `isPaired: true` - Read is from paired-end sequencing - `isProperPair: true` - Read is part of proper pair - `isMapped: true` - Read is mapped - `mapQuality: >=30` - Mapping quality threshold **Important Difference from samtools:** - `isProperPair` is more strict than samtools `-f 2` - Checks pair validity more thoroughly - Better for ensuring R1/R2 match correctly ### samtools fastx **Purpose:** Convert SAM/BAM to FASTQ/FASTA **Output Modes:** - `outputs: ["r1", "r2"]` - Separate forward and reverse for paired-end - `outputs: ["other"]` - Single output for single-end data - `outputs: ["r0"]` - All reads (mixed paired/unpaired) **Filtering Options:** - `inclusive_filter: ["2"]` - Require proper pair flag - `exclusive_filter: ["4", "8"]` - Exclude unmapped or mate unmapped - `exclusive_filter_all: ["8"]` - Exclude if mate unmapped **Critical:** Use appropriate filters to ensure R1/R2 files match! ## Common Patterns and Best Practices ### Pattern 1: Filtering Paired-End Data by Regions **WRONG WAY (breaks pairs):** ```bash # Region filter first → breaks pairs when mates are in different regions samtools view -b -L regions.bed input.bam | bamtools filter -isPaired -isProperPair # Result: Empty output (all pairs broken) ``` **RIGHT WAY (preserves pairs):** ```bash # Proper pair filter FIRST, then region filter samtools view -b -f 2 -L regions.bed input.bam > output.bam # Result: Pairs where both mates are in regions (or one mate in region, other anywhere) ``` **BEST WAY (both mates in regions):** ```bash # Filter for proper pairs, then use paired-aware region filtering samtools view -b -f 2 input.bam | \ # Custom script to keep pairs where both mates in regions ``` ### Pattern 2: Extracting FASTQ from Filtered BAM **For Paired-End:** ```bash # Ensure proper pairs before extraction samtools fastx -1 R1.fq.gz -2 R2.fq.gz \ --i1-flags 2 \ # Require proper pair --i2-flags 64,128 \ # Separate R1/R2 input.bam ``` **For Single-End:** ```bash # Simple extraction samtools fastx -0 output.fq.gz input.bam ``` ### Pattern 3: Quality Filtering **Conservative (high quality):** ```bash samtools view -b -q 30 -f 2 -F 256 -F 2048 input.bam # MAPQ >= 30, proper pairs, no secondary/supplementary ``` **Permissive (for low-coverage data):** ```bash samtools view -b -q 10 -F 4 input.bam # MAPQ >= 10, mapped reads ``` ## Common Issues and Solutions ### Issue 1: Empty Output After Region Filtering (Hi-C Data) **Symptom:** - BAM file non-empty before filtering - Empty after region filtering + proper pair filtering - Happens with paired-end data (especially Hi-C) **Cause:** - Region filter (`samtools view -L`) breaks read pairs - One mate in region, other mate outside region - Proper pair flag (0x2) is lost - Subsequent `isProperPair` filter removes all reads **Solution:** ```bash # Apply proper pair filter BEFORE region filtering samtools view -b -f 2 -L regions.bed input.bam > output.bam ``` **See Also:** `common-issues.md` for detailed troubleshooting ### Issue 2: R1 and R2 Files Have Different Read Counts **Symptom:** - Forward and reverse FASTQ files have different numbers of reads - Downstream tools fail expecting matched pairs **Cause:** - Improper filtering broke some pairs - One mate filtered out, other kept - Extraction didn't require proper pairing **Solution:** ```bash # Require proper pairs during extraction samtools fastx -1 R1.fq -2 R2.fq --i1-flags 2 input.bam ``` ### Issue 3: Low Mapping Rate for Hi-C Data **Symptom:** - Many Hi-C reads unmapped or low MAPQ - Expected for Hi-C due to chimeric reads **Not Actually a Problem:** - Hi-C involves ligation of distant DNA fragments - Creates chimeric molecules - Mappers may mark these as low quality or unmapped - This is **normal** for Hi-C data **Solution:** - Use Hi-C-specific pipelines (e.g., HiC-Pro, Juicer) - Don't filter too aggressively on MAPQ - Accept lower mapping rates than DNA-seq ### Issue 4: Proper Pairs Lost After Mapping **Symptom:** - Few reads marked as proper pairs (flag 0x2) - Expected paired-end data **Possible Causes:** 1. Insert size distribution wrong (check aligner parameters) 2. Reference mismatch (reads from different assembly) 3. Poor library quality 4. Incorrect orientation flags passed to aligner **Solution:** ```bash # Check insert size distribution samtools stats input.bam | grep "insert size" # Check pairing flags samtools flagstat input.bam ``` ## Quality Metrics ### N50 and Related Metrics **N50:** Length of the shortest contig at which 50% of total assembly is contained in contigs of that length or longer **How to interpret:** - Higher N50 = better contiguity - Compare to expected chromosome/scaffold sizes - Use with caution - can be misleading for fragmented assemblies **Related Metrics:** - **L50:** Number of contigs needed to reach N50 - **N90:** More stringent than N50 (90% coverage) - **NG50:** N50 relative to genome size (better for comparisons) ### Coverage and Depth **Coverage:** Percentage of reference bases covered by at least one read **Depth:** Average number of reads covering each base **Recommended Depths:** - Genome assembly (HiFi): 30-50x - Variant calling: 30x minimum - RNA-seq: 20-40 million reads - Hi-C scaffolding: 50-100x genomic coverage ## File Format Quick Reference ### FASTA ``` >sequence_id description ATCGATCGATCG ATCGATCG ``` - Header line starts with `>` - Can span multiple lines - No quality scores ### FASTQ ``` @read_id ATCGATCGATCG + IIIIIIIIIIII ``` - Four lines per read - Quality scores (Phred+33 encoding typical) - Can be gzipped (.fastq.gz) ### BED ``` chr1 1000 2000 feature_name score + ``` - 0-based coordinates - Used for regions, features, intervals - Minimum 3 columns (chrom, start, end) ### AGP ``` chr1 1 5000 1 W contig_1 1 5000 + chr1 5001 5100 2 U 100 scaffold yes proximity_ligation ``` - Tab-delimited genome assembly format - 1-based closed coordinates [start, end] - Describes construction of objects from components - Object and component lengths must match - See AGP Format section for complete specification ## Best Practices ### General 1. **Always check data type:** Paired-end vs single-end determines filtering strategy 2. **Understand your sequencing technology:** Hi-C behaves differently than HiFi 3. **Filter in the right order:** Proper pairs BEFORE region filtering 4. **Validate outputs:** Check file sizes, read counts, flagstat 5. **Use appropriate MAPQ thresholds:** Too stringent = lost data, too permissive = noise ### For Hi-C Data 1. **Expect distant read pairs:** Don't be surprised by different scaffolds 2. **Preserve proper pairs:** Critical for downstream scaffolding 3. **Use paired-aware tools:** Standard filters may break pairs 4. **Don't over-filter on MAPQ:** Hi-C often has lower MAPQ than DNA-seq ### For HiFi Data 1. **Single-end processing:** No pair concerns 2. **High quality expected:** Can use strict filters 3. **Use appropriate presets:** minimap2 `map-hifi` or `map-pb` 4. **Consider read length distribution:** HiFi reads vary in length ### For Tool Testing 1. **Create self-contained datasets:** Both mates in selected region 2. **Maintain proper pairs:** Essential for realistic testing 3. **Use representative data:** Subsample proportionally, not randomly 4. **Verify file sizes:** Too small = overly filtered ## Genome Ark Data Retrieval ### Overview Genome Ark (s3://genomeark/) is a public AWS S3 bucket containing VGP assemblies and QC data. Access requires no credentials using `--no-sign-request`. ### Directory Structure ``` s3://genomeark/species/{SPECIES_NAME}/{TOLID}/ ├── assembly_vgp_HiC_2.0/ # Standard HiC assembly ├── assembly_vgp_standard_2.1/ # Standard assembly ├── assembly_vgp_trio_1.0/ # Trio assembly ├── assembly_curated/ # Manually curated ├── assembly_{METHOD}_{VERSION}/ # Method-specific (e.g., assembly_verkko_1.4) └── genomic_data/ # Raw sequencing data ``` ### GenomeScope Data Locations GenomeScope summaries can be in multiple locations - search comprehensively: **Common paths** (priority order): 1. `{assembly}/evaluation/genomescope/{TOLID}_genomescope__Summary.txt` 2. `{assembly}/evaluation/genomescope2/{TOLID}_genomescope__Summary.txt` 3. `{assembly}/qc/genomescope/{TOLID}_genomescope__Summary.txt` 4. `{assembly}/intermediates/genomescope/{TOLID}_genomescope__Summary.txt` **Search strategy**: 1. Dynamically discover assembly folders (not all species use same structure) 2. Search multiple subfolder locations per assembly 3. Try standard assemblies first (HiC_2.0, standard_2.1) 4. Fall back to method-specific assemblies ### Python Implementation Pattern ```python def search_genomescope_recursive(species_name, tolid): """Search for GenomeScope data across all assembly folders.""" species_s3 = species_name.replace(' ', '_') # 1. Discover assembly folders result = subprocess.run( ['aws', 's3', 'ls', f's3://genomeark/species/{species_s3}/{tolid}/', '--no-sign-request'], capture_output=True, text=True ) assembly_folders = [] for line in result.stdout.strip().split('\n'): if 'PRE' in line: folder = line.split('PRE')[1].strip().rstrip('/') if folder.startswith('assembly'): assembly_folders.append(folder) # 2. Search patterns patterns = [ 'evaluation/genomescope/{tolid}_genomescope__Summary.txt', 'evaluation/genomescope2/{tolid}_genomescope__Summary.txt', 'qc/genomescope/{tolid}_genomescope__Summary.txt', 'intermediates/genomescope/{tolid}_genomescope__Summary.txt', ] # 3. Try each combination for assembly in assembly_folders: for pattern in patterns: s3_path = f's3://genomeark/species/{species_s3}/{tolid}/{assembly}/{pattern.format(tolid=tolid)}' result = subprocess.run( ['aws', 's3', 'cp', s3_path, '-', '--no-sign-request'], capture_output=True, text=True, timeout=10 ) if result.returncode == 0 and result.stdout: return result.stdout return None ``` ### GenomeScope Summary Parsing ```python def parse_genomescope_summary(content): """Extract genome characteristics from GenomeScope2 summary.""" data = {} # Genome Haploid Length (max value - second column) match = re.search(r'Genome Haploid Length\s+[\d,]+\s*bp\s+([\d,]+)\s*bp', content) if match: data['genome_size_genomescope'] = int(match.group(1).replace(',', '')) # Heterozygosity percentage (max value) match = re.search(r'Heterozygous \(ab\)\s+[\d.]+%\s+([\d.]+)%', content) if match: data['heterozygosity_percent'] = float(match.group(1)) # Genome Unique Length (max value) match = re.search(r'Genome Unique Length\s+[\d,]+\s*bp\s+([\d,]+)\s*bp', content) if match: data['unique_length'] = int(match.group(1).replace(',', '')) # Genome Repeat Length (max value) match = re.search(r'Genome Repeat Length\s+[\d,]+\s*bp\s+([\d,]+)\s*bp', content) if match: data['repeat_length'] = int(match.group(1).replace(',', '')) # Calculate repeat content percentage if 'repeat_length' in data and 'unique_length' in data: total = data['repeat_length'] + data['unique_length'] if total > 0: data['repeat_content_percent'] = (data['repeat_length'] / total) * 100 return data ``` ### Best Practices 1. **Rate limiting**: Add 0.2s delay between successful fetches to be respectful to AWS 2. **Caching**: Check if file exists locally before fetching 3. **Timeout**: Use 10-15s timeout per request 4. **Assembly discovery**: Always discover assemblies dynamically - don't assume structure 5. **Multiple locations**: GenomeScope data may be in evaluation/, qc/, or intermediates/ 6. **Expected coverage**: ~15-20% of VGP assemblies have GenomeScope data available ### Common Issues **Species not found**: Some species use different naming (check exact spacing/underscores) ```bash # Search for species aws s3 ls s3://genomeark/species/ --no-sign-request | grep -i "species_name" ``` **Assembly folder variations**: Not all use standard names ```bash # List all assemblies for a species aws s3 ls s3://genomeark/species/{SPECIES}/{TOLID}/ --no-sign-request ``` ### AWS CLI Setup ```bash # Install AWS CLI (no credentials needed for Genome Ark) conda install -c conda-forge awscli # Test access aws s3 ls s3://genomeark/species/ --no-sign-request | head ``` ## Karyotype Data Curation and Literature Search ### Overview Karyotype data (diploid 2n and haploid n chromosome numbers) is critical for genome assembly validation but rarely available via APIs. Manual literature curation is required. ### Search Strategy #### Effective Search Terms ``` "{species_name} karyotype chromosome 2n" "{species_name} diploid number karyotype" "{genus} karyotype evolution" "cytogenetic analysis {family_name}" "{species_name} chromosome number diploid" ``` #### Best Reference Sources 1. **PubMed/PMC**: Primary cytogenetic studies 2. **ResearchGate**: Karyotype descriptions and figures 3. **Specialized databases**: - Bird Chromosome Database: https://sites.unipampa.edu.br/birdchromosomedatabase/ - Animal Genome Size Database: http://www.genomesize.com/ 4. **Genome assembly papers**: Often mention expected karyotype 5. **Comparative cytogenetic studies**: Family-level analyses #### Search Time Estimates - **Model organisms, domestic species**: 2-3 minutes - **Well-studied taxonomic groups**: 5-10 minutes - **Rare/uncommon species**: 10-20 minutes or not found ### Taxonomic Conservation Patterns #### Mammals - **Cetaceans**: Highly conserved 2n = 44, n = 22 (exceptions: pygmy sperm whale, right whale, beaked whales = 2n = 42) - **Felidae**: Conserved 2n = 38, n = 19 - **Canidae**: Conserved 2n = 78, n = 39 - **Primates**: Variable (great apes 2n = 48, macaques 2n = 42, marmosets 2n = 46) #### Birds - **Anatidae (waterfowl)**: Highly conserved 2n = 80, n = 40 across ducks, geese, swans - **Galliformes (game birds)**: Typically 2n = 78, n = 39 (chicken, quail, grouse) - **Passerines**: Variable 2n = 78-82, most common 2n = 80 - **Ancestral avian karyotype**: Putative 2n = 80 - **General pattern**: 50.7% of birds have 2n = 78-82; 21.7% have exactly 2n = 80 #### Reptiles - **Lacertidae (wall lizards)**: Often 2n = 38, n = 19 ### Genome Assembly Interpretation ⚠️ **Warning**: Chromosome-level assemblies often report fewer chromosomes than actual diploid number **Why**: Assemblies typically capture only: - Macrochromosomes (large chromosomes) - Larger microchromosomes - Small microchromosomes remain unassembled **Example**: Waterfowl with 2n = 80 often have genome assemblies with 34-42 "chromosomes" - True karyotype: 10 macro pairs + 30 micro pairs = 80 - Assembly: ~34-42 scaffolds (only macro + larger micros) ### Using Conservation for Inference When specific karyotype data is unavailable but genus/family patterns are strong: 1. **High confidence inference** (acceptable for publication): - Multiple congeneric species confirmed - Family-level conservation documented - No known exceptions in genus 2. **Document inference clearly**: ```csv accession,taxid,species,2n,n,notes,reference GCA_XXX,123,Species name,80,40,Inferred from Anatidae conservation,https://family-level-study.url ``` 3. **Priority for direct confirmation**: - Species with conservation exceptions - Type specimens or reference species - Phylogenetically divergent lineages ### VGP-Specific: Sex Chromosome Adjustment When both sex chromosomes are in main haplotype (common in VGP assemblies): - **Expected scaffolds = n + 1** (not n) - **Reason**: X+Y or Z+W = two distinct chromosomes - **Check**: VGP metadata column "Sex chromosomes main haplotype" - **Patterns**: "Has X and Y", "Has Z and W", "Has X1, X2, and Y" ### Data Recording Format **CSV Structure**: ```csv accession,taxid,species_name,diploid_2n,haploid_n,notes,reference GCA_XXXXXX,12345,Species name,80,40,Brief description,https://doi.org/... ``` **Notes field examples**: - "Standard {family} karyotype" - "Conserved {genus} karyotype" - "Inferred from {family} conservation" - "Unusual karyotype for family" - "Geographic variation reported" ### Prioritization for Literature Searches **TIER 1** (>90% success rate): - Model organisms (zebrafish, mouse, medaka) - Domestic species (chicken, goat, sheep) - Game animals (waterfowl, deer) - Laboratory species (fruit fly, nematode) **TIER 2** (70-90% success rate): - Well-studied taxonomic groups (Podarcis lizards, corvids) - Conservation focus species (raptors, large mammals) - Commercial species (salmonids, oysters) **TIER 3** (50-70% success rate): - Common but not economically important - Widespread distribution - Recent phylogenetic interest **Low priority** (<50% success rate): - Deep-sea species - Rare/endangered without conservation genetics - Recently described species - Cryptic species complexes ## AGP Format (A Golden Path) ### Overview AGP (A Golden Path) format describes how assembled sequences (chromosomes, scaffolds) are constructed from component sequences (contigs, scaffolds) and gaps. Critical for genome assembly curation and submission to NCBI/EBI. ### When to Use This Knowledge - Processing genome assemblies for submission to databases - Curating chromosome-level assemblies - Splitting haplotype assemblies - Assigning unlocalized scaffolds (unlocs) - Debugging AGP validation errors - Converting between assembly representations ### AGP Format Structure **Tab-delimited format with 9 columns for sequence lines (type W) or 8+ columns for gap lines (type U/N)** **Sequence Lines (Column 5 = 'W'):** ``` object obj_beg obj_end part_num W component_id comp_beg comp_end orientation ``` **Gap Lines (Column 5 = 'U' or 'N'):** ``` object obj_beg obj_end part_num gap_type gap_length gap_type linkage linkage_evidence ``` ### Critical Coordinate Rules **Rule 1: Object and Component Lengths MUST Match** For sequence lines, the span in the object MUST equal the span in the component: ``` (obj_end - obj_beg + 1) == (comp_end - comp_beg + 1) ``` **Example - CORRECT:** ``` Scaffold_47_unloc_1 1 54360 1 W scaffold_23.hap1 19274039 19328398 - # Object length: 54360 - 1 + 1 = 54,360 bp # Component length: 19328398 - 19274039 + 1 = 54,360 bp ✓ ``` **Example - INCORRECT:** ``` Scaffold_47_unloc_1 1 19328398 1 W scaffold_23.hap1 19274039 19328398 - # Object length: 19328398 - 1 + 1 = 19,328,398 bp # Component length: 19328398 - 19274039 + 1 = 54,360 bp ✗ # ERROR: Lengths don't match! ``` **Rule 2: Component Numbering Restarts for New Objects** Each object (column 1) has its own component numbering (column 4) starting at 1: ``` Scaffold_10 1 30578279 1 W scaffold_4.hap2 1 30578279 - Scaffold_10_unloc_1 1 65764 1 W scaffold_74.hap2 1 65764 + # ← Starts at 1, not 3! ``` **Rule 3: Sequential Component Numbering Within Objects** Component numbers increment sequentially (gaps and sequences both count): ``` Scaffold_2 1 1731008 1 W scaffold_25.hap1 1 1731008 - Scaffold_2 1731009 1731108 2 U 100 scaffold yes proximity_ligation Scaffold_2 1731109 1956041 3 W scaffold_70.hap1 1 224933 - ``` ### Common AGP Processing Issues #### Issue 1: Incorrect Object Coordinates When Creating Unlocs **Symptom:** ``` ERROR: object coordinates (1, 19328398) and component coordinates (19274039, 19328398) do not have the same length ``` **Cause:** When converting a region of a scaffold into an unlocalized sequence (unloc), the object coordinates must represent the **length** of the extracted region, not the original component end coordinate. **Wrong Approach:** ```python # Setting object end to component end coordinate agp_df.loc[index, 'chr_end'] = agp_df.loc[index, 'scaff_end'] # ✗ WRONG ``` **Correct Approach:** ```python # Calculate actual length from component coordinates agp_df.loc[index, 'chr_end'] = int(agp_df.loc[index, 'scaff_end']) - int(agp_df.loc[index, 'scaff_start']) + 1 # ✓ CORRECT ``` #### Issue 2: Component Numbering Not Reset for New Objects **Symptom:** Unloc scaffolds have component numbers > 1 when they should start at 1. **Cause:** When creating a new object (unloc scaffold), component numbering wasn't reset. **Solution:** ```python # When creating unlocs, reset component number agp_df.loc[index, '#_scaffs'] = 1 # Column 4: component number ``` #### Issue 3: AGPcorrect Accumulating Coordinates **Symptom:** Unloc sequences inherit cumulative coordinates from parent scaffolds. **Cause:** AGPcorrect adjusts coordinates based on sequence length corrections. When scaffolds are later split into unlocs, the accumulated corrections need to be recalculated based on actual component spans. **Solution:** Always recalculate object coordinates from component spans when creating new objects (unlocs). ### AGP Processing Best Practices #### 1. Coordinate System Understanding - **Object coordinates (columns 2-3):** Position within the assembled object (1-based, inclusive) - **Component coordinates (columns 7-8):** Position within the source sequence (1-based, inclusive) - Both use **1-based closed intervals** [start, end] - Length calculation: `end - start + 1` #### 2. Creating Unlocalized Sequences (Unlocs) ```python # When extracting a region to create an unloc: # 1. Calculate the actual length of the region length = int(comp_end) - int(comp_start) + 1 # 2. Set object coordinates for the new unloc obj_start = 1 # Always starts at 1 obj_end = length # Equals the length # 3. Reset component number component_num = 1 # New object, new numbering # 4. Rename the object new_object_name = f"{parent_scaffold}_unloc_{unloc_number}" ``` #### 3. Validating AGP Files **Use NCBI's AGP validator:** ```bash agp_validate assembly.agp ``` **Common validation checks:** - Object/component length match - Sequential component numbering - No coordinate overlaps - Gap specifications valid - Orientation values (+, -, ?, 0, na) #### 4. Handling Haplotype-Split Assemblies When splitting diploid assemblies into haplotypes: 1. Identify haplotype markers in sequence names (H1/hap1, H2/hap2) 2. Maintain proper pairing information 3. Process unlocs separately per haplotype 4. Remove haplotig duplications 5. Track gaps appropriately (especially proximity ligation gaps) ### AGP Coordinate Debugging Pattern When encountering coordinate errors: ```python # For each AGP line, verify: obj_length = int(obj_end) - int(obj_beg) + 1 comp_length = int(comp_end) - int(comp_beg) + 1 assert obj_length == comp_length, f"Length mismatch: obj={obj_length}, comp={comp_length}" # For sequential component numbers: assert comp_num == expected_num, f"Component number gap: got {comp_num}, expected {expected_num}" ``` ### AGP File Structure by Assembly Stage **1. Raw Assembly AGP:** - Direct representation from assembler - May have incorrect sequence lengths - Needs coordinate correction (AGPcorrect) **2. Corrected AGP:** - Sequence lengths match actual FASTA - Coordinates adjusted for length discrepancies - Ready for haplotype splitting **3. Haplotype-Split AGP:** - Separate files per haplotype - Unlocs identified but not separated - Haplotigs marked but not removed **4. Final Curated AGP:** - Unlocs separated into individual objects - Haplotigs removed to separate file - Proximity ligation gaps cleaned - Ready for database submission ## BED File Processing and Telomere Analysis ### Pattern: Classifying Scaffolds by Telomere Types When analyzing telomere data from BED files to classify scaffolds: **File Structure**: - Terminal telomeres BED: columns include scaffold, start, end, orientation (p/q), accession - Interstitial telomeres BED: similar structure with position markers (p/q/u for internal) **Best Practice - Use Python CSV Module**: ```python import csv from collections import defaultdict # Use defaultdict for automatic initialization telomere_counts = defaultdict(lambda: {'terminal': 0, 'interstitial': 0}) # Process with csv.reader (more portable than pandas) with open('telomeres.bed', 'r') as f: reader = csv.reader(f, delimiter='\t') for row in reader: scaffold = row[0] accession = row[10] # GCA accession key = (accession, scaffold) telomere_counts[key]['terminal'] += 1 ``` **Why CSV over pandas**: - No external dependencies (pandas may not be installed) - Faster for simple tabular operations - Lower memory footprint for large files - Better portability across environments **Classification Categories**: 1. Category 1: 2 terminal telomeres, 0 interstitial (complete chromosomes) 2. Category 2: 1 terminal telomere, 0 interstitial (partial) 3. Category 3: Has interstitial telomeres (likely assembly issues) ## NCBI Data Integration Strategies ### Check Existing Data Sources Before API Calls **Problem**: Need chromosome counts for 400+ assemblies from NCBI. **Anti-pattern**: Query NCBI datasets API for each accession ```python # DON'T: Query 400+ times for accession in missing_data: result = subprocess.run(['datasets', 'summary', 'genome', 'accession', accession]) # Takes 10+ minutes, hits API rate limits ``` **Better Pattern**: Check if data already exists in compiled tables ```python # DO: Look for existing compiled data first # VGP table has multiple chromosome count columns: # - num_chromosomes (column 54) # - total_number_of_chromosomes (column 106) # - num_chromosomes_haploid (column 122) # Read from existing comprehensive table with open('VGP-table.csv') as f: reader = csv.reader(f) header = next(reader) for row in reader: num_chr = row[53] if row[53] else row[105] # Fallback strategy ``` **Results**: Filled 392/417 missing values instantly vs 10+ minutes of API calls. **Fallback Strategy for Multiple Columns**: ```python # Try multiple sources in order of preference num_chromosomes = row[53] if (len(row) > 53 and row[53]) else '' if not num_chromosomes and len(row) > 105: num_chromosomes = row[105] # Alternative column ``` **When to use NCBI API**: - Data not in existing tables - Need real-time/latest data - Fetching assembly reports or sequence data - Small number of queries (<20) **API Best Practices** (when necessary): - Use full path to datasets command (may be aliased) - Add delays between calls (`time.sleep(0.5)`) - Set reasonable timeouts - Handle errors gracefully ## Related Skills - **vgp-pipeline** - VGP workflows process Hi-C and HiFi data - **galaxy-tool-wrapping** - Galaxy tools work with SAM/BAM and sequencing data formats - **galaxy-workflow-development** - Workflows process sequencing data ## Supporting Documentation - **reference.md:** Detailed format specifications and tool documentation - **common-issues.md:** Comprehensive troubleshooting guide with examples ## Version History - **v1.1.1:** Added BED file processing patterns for telomere analysis and NCBI data integration strategies - **v1.1.0:** Added comprehensive AGP format documentation including coordinate validation, unloc processing, and common error patterns - **v1.0.0:** Initial release with SAM/BAM, Hi-C, HiFi, common filtering patterns