--- name: liftover-coordinates description: Convert genomic coordinates between assembly versions (GRCh37/hg19 to GRCh38/hg38, mm9 to mm10). Guides UCSC liftOver for BED files, CrossMap for VCF/bigWig, and handles unmapped regions with provenance logging. --- # Convert Genomic Coordinates Between Assembly Versions ## When to Use - User needs to convert genomic coordinates between assemblies (hg19↔hg38, mm9↔mm10) - User asks about "liftover", "coordinate conversion", "assembly mismatch", or "CrossMap" - User has data in hg19/GRCh37 that needs conversion to GRCh38 (or vice versa) before integration - User wants to use UCSC liftOver or CrossMap for BED, VCF, bigWig, or BAM files - Example queries: "convert my hg19 peaks to hg38", "liftover coordinates for integration with ENCODE", "my data is in mm9, how do I convert to mm10?" Guide coordinate liftover between genome assemblies using UCSC liftOver, CrossMap, Ensembl REST API, and rtracklayer. Assembly conversion is one of the most common pitfalls in genomics — this skill provides the definitive workflow for safe, reproducible liftover with full provenance tracking. ## Scientific Rationale **The question**: "How do I safely convert my genomic coordinates from one assembly to another without losing data or introducing errors?" Assembly conversion is referenced as a critical step in 10+ other ENCODE Toolkit skills because ENCODE spans multiple data releases: some experiments were processed against hg19/GRCh37, while most current data uses GRCh38/hg38. Combining data across assemblies without proper liftover is one of the most common and most dangerous errors in computational genomics — coordinates that look valid in both assemblies may refer to completely different genomic locations. ### The Core Problem Genome assemblies are updated to fix errors, fill gaps, add alternative haplotypes, and improve centromeric/telomeric sequence. Between hg19 and hg38, approximately 1,000 sequence gaps were closed, 8% of the genome was modified, and several regions were rearranged. A coordinate like chr17:41,197,694 in hg19 (BRCA1) maps to chr17:43,044,295 in GRCh38 — a shift of nearly 2 Mb. Using the wrong assembly silently produces incorrect results. ### When to Liftover Common scenarios requiring coordinate conversion: - **Combining ENCODE data from different releases**: Some hg19, some GRCh38 — must unify before intersection - **Integrating GWAS Catalog results**: Many GWAS hits are still reported in hg19/GRCh37 coordinates - **Using gnomAD**: gnomAD v4 uses GRCh38; older v2 datasets use GRCh37 - **Cross-species comparison**: Mouse data across mm9/mm10/GRCm39 - **Legacy datasets**: Published supplementary files often use older assemblies - **ClinVar integration**: Some ClinVar entries reference GRCh37 positions - **GTEx cross-reference**: GTEx v8 uses GRCh38, earlier versions used GRCh37 ### Literature Support - **Kent et al. 2002** (Genome Research, ~5,000 citations): UCSC Genome Browser and the liftOver tool. The original chain/net alignment framework for coordinate conversion between genome assemblies. [DOI](https://doi.org/10.1101/gr.229102) - **Zhao et al. 2014** (Bioinformatics, ~800 citations): CrossMap — a versatile tool for coordinate conversion between genome assemblies. Handles VCF, BAM, bigWig, GFF, and Wiggle formats that UCSC liftOver cannot process natively. [DOI](https://doi.org/10.1093/bioinformatics/btt730) - **Hinrichs et al. 2006** (Nucleic Acids Research, ~1,200 citations): UCSC genome browser chain/net alignment methodology. Defines the reciprocal-best chain alignment that underpins coordinate conversion. [DOI](https://doi.org/10.1093/nar/gkj144) - **Kuhn et al. 2013** (Nucleic Acids Research, ~600 citations): Assembly updates and the implications for re-annotation. Documents the biological impact of assembly changes on gene models and regulatory element coordinates. [DOI](https://doi.org/10.1093/nar/gks1195) - **Schneider et al. 2017** (Genome Research, ~400 citations): GRCh38 improvements over GRCh37 — gap closures, centromere models, alternative haplotypes. Quantifies what changed and why liftover is necessary. [DOI](https://doi.org/10.1101/gr.213611.116) - **Amemiya et al. 2019** (Scientific Reports, ~1,372 citations): ENCODE Blacklist regions — some blacklisted regions are assembly-specific. Liftover of blacklist files must use the correct version. [DOI](https://doi.org/10.1038/s41598-019-45839-z) ## Assembly Version Mapping | Common Name | UCSC Name | NCBI/GRC Name | Species | Release Year | |------------|-----------|---------------|---------|-------------| | hg19 | hg19 | GRCh37 | Human | 2009 | | hg38 | hg38 | GRCh38 | Human | 2013 | | mm9 | mm9 | MGSCv37 | Mouse | 2007 | | mm10 | mm10 | GRCm38 | Mouse | 2012 | | mm39 | mm39 | GRCm39 | Mouse | 2020 | ### Naming Convention Alert The same assembly has different names depending on the source: - **UCSC convention**: `hg19`, `hg38`, `mm10` — used in filenames, chromosome prefixes (`chr1`) - **NCBI/GRC convention**: `GRCh37`, `GRCh38`, `GRCm38` — used in publications, Ensembl - **Ensembl convention**: Chromosomes without `chr` prefix (`1` instead of `chr1`) Always verify which naming convention your data uses. Mixing `chr1` (UCSC) with `1` (Ensembl) causes silent failures in bedtools intersection and peak overlap analysis. ## Chain Files Chain files encode the alignment between assemblies and are the essential input for liftover. ### Source: UCSC (Recommended) ``` https://hgdownload.soe.ucsc.edu/goldenPath/{from}/liftOver/{from}To{To}.over.chain.gz ``` Common chain files: | Conversion | Chain File | URL | |-----------|-----------|-----| | hg19 to hg38 | hg19ToHg38.over.chain.gz | `https://hgdownload.soe.ucsc.edu/goldenPath/hg19/liftOver/hg19ToHg38.over.chain.gz` | | hg38 to hg19 | hg38ToHg19.over.chain.gz | `https://hgdownload.soe.ucsc.edu/goldenPath/hg38/liftOver/hg38ToHg19.over.chain.gz` | | mm9 to mm10 | mm9ToMm10.over.chain.gz | `https://hgdownload.soe.ucsc.edu/goldenPath/mm9/liftOver/mm9ToMm10.over.chain.gz` | | mm10 to mm39 | mm10ToMm39.over.chain.gz | `https://hgdownload.soe.ucsc.edu/goldenPath/mm10/liftOver/mm10ToMm39.over.chain.gz` | | mm10 to hg38 | mm10ToHg38.over.chain.gz | `https://hgdownload.soe.ucsc.edu/goldenPath/mm10/liftOver/mm10ToHg38.over.chain.gz` | ### Source: Ensembl ``` ftp://ftp.ensembl.org/pub/assembly_mapping/ ``` Ensembl provides chain files for their coordinate system (without `chr` prefix). Useful when working with Ensembl VEP output or Ensembl gene annotations. ### Source: NCBI Remap NCBI Genome Remapping Service: `https://www.ncbi.nlm.nih.gov/genome/tools/remap` - Web-based and API access - Handles complex remapping with alignment-based and annotation-based methods - Useful for non-standard assemblies or patch-level conversions ### Chain File Verification Always verify chain file integrity after download: ```bash wget https://hgdownload.soe.ucsc.edu/goldenPath/hg19/liftOver/hg19ToHg38.over.chain.gz md5sum hg19ToHg38.over.chain.gz # Verify against UCSC md5sum.txt in the same directory gunzip -t hg19ToHg38.over.chain.gz # Test archive integrity ``` ## Tool Guide: UCSC liftOver (BED Files) The standard tool for BED-format coordinate conversion. ### Basic Usage ```bash liftOver input.bed hg19ToHg38.over.chain.gz output.bed unmapped.bed ``` ### Parameters | Parameter | Default | Description | |----------|---------|-------------| | `-minMatch` | 0.95 | Minimum ratio of bases that must remap (0.0–1.0) | | `-minBlocks` | 1 | Minimum number of alignment blocks | | `-fudgeThick` | off | If thickStart/thickEnd not mapped, use mapped region | | `-multiple` | off | Allow mapping to multiple output regions | | `-minChainT` | 0 | Minimum chain target coverage | | `-minChainQ` | 0 | Minimum chain query coverage | ### Recommended Settings by Data Type | Data Type | `-minMatch` | Notes | |----------|-------------|-------| | SNP positions (1bp) | 0.95 (default) | Point coordinates almost always map cleanly | | Narrow peaks (100–500bp) | 0.95 (default) | Short regions map well | | Broad peaks (1–50kb) | 0.50–0.80 | Large regions may partially overlap rearrangements | | Regulatory elements | 0.90 | Balance between completeness and accuracy | | TAD boundaries (5–50kb) | 0.50 | Large-scale organization is approximate anyway | ### Handling narrowPeak Format UCSC liftOver expects standard BED (3–12 columns). For narrowPeak files (BED6+4): ```bash # Step 1: Extract BED6 columns + preserve extra columns as name awk 'BEGIN{OFS="\t"} {print $1, $2, $3, $4, $5, $6, $7, $8, $9, $10}' input.narrowPeak > input_full.bed # Step 2: Liftover (liftOver handles extra columns) liftOver input_full.bed hg19ToHg38.over.chain.gz output.bed unmapped.bed # Step 3: Verify column count is preserved awk '{print NF}' output.bed | sort -u ``` **Peak summit recalculation**: After liftover, the summit position (column 10 in narrowPeak = offset from start) may no longer accurately represent the signal maximum. For critical analyses, re-calculate summits from signal data in the new assembly rather than relying on lifted summit positions. ### Checking Unmapped Regions ```bash # Count unmapped regions wc -l unmapped.bed # Note: comment lines start with # # Calculate loss rate total=$(wc -l < input.bed) unmapped=$(grep -v '^#' unmapped.bed | wc -l) loss_pct=$(echo "scale=2; $unmapped * 100 / $total" | bc) echo "Lost $unmapped of $total regions ($loss_pct%)" # Investigate reasons for unmapping grep '^#' unmapped.bed | sort | uniq -c | sort -rn # Common reasons: # "Partially deleted in new" — region spans a deletion # "Deleted in new" — region fully removed # "Split in new" — region maps to multiple locations ``` ## Tool Guide: CrossMap (VCF, bigWig, BAM, GFF) CrossMap (Zhao et al. 2014) handles file formats that UCSC liftOver cannot process natively. ### VCF Conversion ```bash CrossMap vcf hg19ToHg38.over.chain.gz input.vcf hg38.fa output.vcf ``` **Critical VCF considerations**: - CrossMap updates coordinates AND checks REF alleles against the new reference - Variants where the REF allele changes between assemblies are flagged - Always re-validate variant calls after liftover using `bcftools norm` - Multi-allelic variants may need special handling ```bash # Post-liftover VCF validation bcftools norm -f hg38.fa -c ws output.vcf -o output.normalized.vcf 2> norm_warnings.log # -c ws: warn about and set incorrect REF alleles ``` ### bigWig Conversion ```bash CrossMap bigwig hg19ToHg38.over.chain.gz input.bw output.bw ``` **Signal track caveats**: - Resolution is reduced during conversion (interpolation at boundaries) - Regions that split during liftover lose signal accuracy - For quantitative analysis, re-generate signal tracks from re-aligned reads when possible ### BAM Conversion ```bash CrossMap bam hg19ToHg38.over.chain.gz input.bam output.bam ``` **BAM liftover is generally NOT recommended**: - Read mapping quality is meaningless after coordinate shifting - Paired-end relationships may break - Duplicate marking becomes invalid - **Best practice**: Re-align from FASTQ to the new reference genome ### GFF/GTF Conversion ```bash CrossMap gff hg19ToHg38.over.chain.gz input.gff output.gff ``` Useful for lifting gene annotations, but prefer downloading the native annotation for the target assembly from GENCODE or Ensembl. ## Tool Guide: Ensembl REST API (Single Coordinates) For programmatic conversion of individual coordinates without installing local tools. ### API Endpoint ``` GET https://rest.ensembl.org/map/human/GRCh37/{region}/GRCh38?content-type=application/json ``` ### Example ```python import requests def liftover_ensembl(chrom, start, end, source="GRCh37", target="GRCh38", species="human"): """Convert coordinates using Ensembl REST API.""" region = f"{chrom}:{start}..{end}:1" url = f"https://rest.ensembl.org/map/{species}/{source}/{region}/{target}" headers = {"Content-Type": "application/json"} response = requests.get(url, headers=headers) if response.status_code == 200: mappings = response.json()["mappings"] return mappings return None # Example: BRCA1 region mappings = liftover_ensembl("17", 41197694, 41276113) for m in mappings: mapped = m["mapped"] print(f" {mapped['seq_region_name']}:{mapped['start']}-{mapped['end']}") ``` ### Rate Limits - 15 requests per second (without registered email) - 50 requests per second (with registered email in User-Agent header) - For batch conversion, use local liftOver or CrossMap instead ### Ensembl Chromosome Naming Ensembl uses chromosomes WITHOUT `chr` prefix: - Ensembl: `17:41197694-41276113` - UCSC: `chr17:41197694-41276113` Convert between conventions: ```bash # Add 'chr' prefix (Ensembl to UCSC) sed 's/^/chr/' input.bed > input_ucsc.bed # Remove 'chr' prefix (UCSC to Ensembl) sed 's/^chr//' input.bed > input_ensembl.bed ``` ## Tool Guide: R (rtracklayer) For R-based workflows, rtracklayer provides native liftover support. ```r library(rtracklayer) library(GenomicRanges) # Import chain file chain <- import.chain("hg19ToHg38.over.chain") # Create GRanges object from your coordinates gr <- GRanges( seqnames = c("chr17", "chr7", "chr1"), ranges = IRanges( start = c(41197694, 55086725, 11873), end = c(41276113, 55275031, 14409) ), name = c("BRCA1", "EGFR", "DDX11L1") ) # Perform liftover lifted <- liftOver(gr, chain) # liftOver returns a GRangesList (1:many mapping possible) # Convert to GRanges (keeping only 1:1 mappings) lifted_1to1 <- unlist(lifted[elementNROWS(lifted) == 1]) # Check for unmapped n_unmapped <- sum(elementNROWS(lifted) == 0) n_multimapped <- sum(elementNROWS(lifted) > 1) cat(sprintf("Mapped: %d, Unmapped: %d, Multi-mapped: %d\n", length(lifted_1to1), n_unmapped, n_multimapped)) ``` ### Bioconductor Packages for Liftover | Package | Purpose | |---------|---------| | `rtracklayer` | Core liftover functionality | | `liftOver` (AnnotationHub) | Pre-packaged chain files | | `GenomicRanges` | GRanges manipulation pre/post liftover | | `VariantAnnotation` | VCF-aware liftover | ## Expected Loss Rates | Conversion | Typical Loss | High-Loss Regions | Notes | |-----------|-------------|------------------|-------| | hg19 to hg38 | 1–3% | Centromeric, telomeric, segmental duplications | Most reliable conversion | | hg38 to hg19 | 2–5% | New alt haplotypes, gap-filled regions in hg38 | Higher loss due to new hg38 sequences | | mm9 to mm10 | 3–5% | Significant rearrangements on multiple chromosomes | Document chromosome-level changes | | mm10 to mm39 | 1–2% | Minor scaffold updates | Relatively clean conversion | | mm10 to hg38 | N/A | Cross-species: use synteny, not liftover | Requires different approach (e.g., UCSC synteny maps) | ### When Loss Rates Are Concerning - **<2% loss**: Normal, proceed with analysis - **2–5% loss**: Acceptable for most analyses, document in methods - **5–10% loss**: Investigate — may indicate problematic input regions (many centromeric/repeat-rich regions) - **>10% loss**: Something is wrong — check assembly mismatch, chromosome naming, or chain file version ## Pitfalls & Edge Cases - **Unmapped regions are expected**: 1-5% of coordinates typically fail to lift over. Regions near centromeres, telomeres, and assembly gaps are most affected. Always check the unmapped file and report the loss rate. - **Many-to-one mapping**: Some hg19 regions map to multiple hg38 locations due to assembly improvements. UCSC liftOver reports only one mapping by default — use `-multiple` flag to detect split mappings. - **Peak coordinates may shift asymmetrically**: Peak summits can shift by different amounts than peak boundaries after liftover. Re-center peaks on summits after conversion rather than trusting the lifted boundaries. - **Chain file source matters**: Only use chain files from UCSC or Ensembl. Third-party chain files may have different coordinate conventions or incomplete mappings. Verify chain file checksums. - **VCF liftover requires reference allele check**: After lifting VCF coordinates, the reference allele may no longer match the new assembly. CrossMap handles this with `--refgenome` but UCSC liftOver does not — always validate. - **Assembly detection is unreliable from filenames**: File names like "peaks.bed" give no assembly hint. Check the actual coordinate ranges against known chromosome sizes. chrM length differs between hg19 (16571) and hg38 (16569). ## Provenance Integration Log every liftover operation with `encode_log_derived_file` for full reproducibility: ``` encode_log_derived_file( file_path="/path/to/lifted_peaks_hg38.bed", source_accessions=["ENCSR...", "ENCFF..."], description="Lifted [N] narrowPeak regions from hg19 to GRCh38. [X] unmapped ([Y]% loss). Original: [source description]", file_type="lifted_coordinates", tool_used="UCSC liftOver v377", parameters="minMatch=0.95, chain=hg19ToHg38.over.chain.gz (MD5: abc123...), unmapped=[X]/[N] ([Y]%)" ) ``` ### Provenance Checklist Every liftover log entry should include: - Source file path and assembly - Chain file used with MD5 checksum - Tool name and version (e.g., liftOver v377, CrossMap v0.6.4) - `-minMatch` or equivalent parameter - Total input regions/variants - Successfully mapped count - Unmapped count and percentage - Multi-mapped count (if applicable) - Output file path and assembly ## Workflow Summary ``` 1. Identify assemblies: Check input assembly → check target assembly 2. Get chain file: Download from UCSC → verify MD5 3. Select tool: BED → liftOver | VCF → CrossMap | single → Ensembl API | R → rtracklayer 4. Convert: Run liftover with appropriate parameters 5. Check loss: Count unmapped, flag if >5% 6. Validate: Verify output assembly, check chromosome names 7. Post-process: Re-center peaks, normalize VCF, re-annotate genes 8. Log provenance: Record all parameters, tools, loss rates ``` ## Walkthrough: Converting ENCODE Peak Coordinates Between Genome Assemblies **Goal**: Convert ENCODE BED peak files from hg19 to GRCh38 (or vice versa) for cross-study integration when experiments use different genome builds. **Context**: Older ENCODE experiments may be aligned to hg19, while newer ones use GRCh38. LiftOver enables coordinate conversion for combined analysis. ### Step 1: Identify experiments needing liftover ``` encode_search_experiments(assay_title="Histone ChIP-seq", organ="liver", target="H3K27ac", organism="Homo sapiens") ``` Expected output: ```json { "total": 8, "results": [ {"accession": "ENCSR100OLD", "assay_title": "Histone ChIP-seq", "assembly": "hg19"}, {"accession": "ENCSR200NEW", "assay_title": "Histone ChIP-seq", "assembly": "GRCh38"} ] } ``` **Interpretation**: ENCSR100OLD uses hg19 — needs liftover before merging with ENCSR200NEW (GRCh38). ### Step 2: Download the hg19 peak file ``` encode_list_files(accession="ENCSR100OLD", file_format="bed", assembly="hg19") ``` ### Step 3: Run UCSC liftOver ```bash liftOver ENCSR100OLD_peaks.bed hg19ToHg38.over.chain.gz peaks_GRCh38.bed unmapped.bed ``` ### Step 4: Check conversion results Count converted vs. unmapped: - If >95% convert successfully → proceed with analysis - If >5% unmapped → investigate (regions may be in assembly-specific contigs) ### Step 5: Log the conversion provenance ``` encode_log_derived_file( source_accessions=["ENCFF100OLD"], derived_file="/data/peaks_GRCh38.bed", description="Lifted from hg19 to GRCh38 using UCSC liftOver", tool="liftOver (UCSC, chain: hg19ToHg38.over.chain.gz)" ) ``` ### Integration with downstream skills - Lifted peaks feed into → **histone-aggregation** for cross-assembly union merge - Conversion provenance logged by → **data-provenance** - UCSC chain files accessed via → **ucsc-browser** REST API - Lifted coordinates used by → **variant-annotation** for position-dependent annotation ## Code Examples ### 1. Check file assembly before liftover ``` encode_get_file_info(accession="ENCFF100OLD") ``` Expected output: ```json { "accession": "ENCFF100OLD", "assembly": "hg19", "file_format": "bed narrowPeak" } ``` ### 2. Find GRCh38 version of same experiment ``` encode_list_files(accession="ENCSR100OLD", file_format="bed", assembly="GRCh38") ``` Expected output: ```json { "files": [] } ``` **Interpretation**: No GRCh38 files available — liftover is required. ### 3. Log liftover provenance ``` encode_log_derived_file( source_accessions=["ENCFF100OLD"], derived_file="/data/peaks_GRCh38.bed", description="hg19→GRCh38 liftOver", tool="UCSC liftOver" ) ``` Expected output: ```json {"status": "logged", "derived_file": "/data/peaks_GRCh38.bed", "source_count": 1} ``` ## Related Skills - `variant-annotation` — Variants often need liftover before annotation with ENCODE data (hg19 GWAS variants to GRCh38) - `gwas-catalog` — GWAS Catalog coordinates may be in GRCh37; liftover needed for ENCODE GRCh38 integration - `ensembl-annotation` — Ensembl REST API provides coordinate mapping; Ensembl uses non-chr chromosome naming - `ucsc-browser` — UCSC provides chain files and the liftOver tool; retrieve assembly-specific tracks - `gnomad-variants` — gnomAD v4 uses GRCh38; v2 uses GRCh37; liftover needed for cross-version analysis - `histone-aggregation` — Aggregating peaks across samples requires all peaks in the same assembly - `accessibility-aggregation` — ATAC-seq/DNase-seq peak union requires assembly-consistent coordinates - `data-provenance` — Every liftover operation must be logged with chain file, tool version, and loss rate - `publication-trust` — Verify literature claims backing analytical decisions ## Presenting Results When reporting liftover results, always present: - **Input summary**: Number of regions/variants, source assembly - **Output summary**: Number successfully mapped, target assembly - **Loss report**: Unmapped count and percentage, with breakdown by reason if available - **Multi-mapping report**: Number of regions mapping to multiple locations and how they were handled - **Assembly confirmation**: Explicit statement of output assembly (e.g., "All coordinates are now in GRCh38/hg38") - **Flag if loss >5%**: Warn the user and investigate the cause (centromeric regions, assembly-specific sequences, or input errors) - **Chain file version**: Which chain file was used and its source Example output summary: ``` Liftover: hg19 -> GRCh38 Input: 45,231 narrowPeak regions Mapped: 44,012 (97.3%) Unmapped: 1,219 (2.7%) — 847 partially deleted, 312 split, 60 fully deleted Chain: hg19ToHg38.over.chain.gz (UCSC, MD5: 7a42e...) Tool: UCSC liftOver v377, minMatch=0.95 ``` ## For the request: "$ARGUMENTS"