TIMESTAMP: 2024-10-01
/usr/local/lib/python3.9/site-packages/hailtop/aiocloud/aiogoogle/user_config.py:28: UserWarning: You have specified the GCS requester pays configuration in both your spark-defaults.conf (/usr/local/lib/python3.9/site-packages/pyspark/conf/spark-defaults.conf) and either an explicit argument or through `hailctl config`. For GCS requester pays configuration, Hail first checks explicit arguments, then `hailctl config`, then spark-defaults.conf.
  warnings.warn(
gsutil -m cp gs://tandem-repeat-catalog/v1.0/vcs_v1.0.bed.gz .
Unlike pure gsutil, this shim won't run composite uploads and sliced downloads in parallel by default. Use the -m flag to enable parallelism (i.e. "gsutil -m cp ...").
Copying gs://tandem-repeat-catalog/v1.0/vcs_v1.0.bed.gz to file://./vcs_v1.0.bed.gz
.............
Average throughput: 31.1MiB/s
gsutil -m cp gs://tandem-repeat-catalog/v1.0/HPRC_100_LongestPureSegmentQuantiles.txt.gz .
Unlike pure gsutil, this shim won't run composite uploads and sliced downloads in parallel by default. Use the -m flag to enable parallelism (i.e. "gsutil -m cp ...").
Copying gs://tandem-repeat-catalog/v1.0/HPRC_100_LongestPureSegmentQuantiles.txt.gz to file://./HPRC_100_LongestPureSegmentQuantiles.txt.gz
...........................
Average throughput: 29.4MiB/s mkdir -p /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01 cd /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01 mkdir -p /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/release_draft_2024-10-01 wget -O variant_catalog_without_offtargets.GRCh38.json.tmp -qnc https://raw.githubusercontent.com/broadinstitute/str-analysis/main/str_analysis/variant_catalogs/variant_catalog_without_offtargets.GRCh38.json && mv variant_catalog_without_offtargets.GRCh38.json.tmp variant_catalog_without_offtargets.GRCh38.json wget -O illumina_variant_catalog.sorted.bed.gz.tmp -qnc https://storage.googleapis.com/str-truth-set/hg38/ref/other/illumina_variant_catalog.sorted.bed.gz && mv illumina_variant_catalog.sorted.bed.gz.tmp illumina_variant_catalog.sorted.bed.gz wget -O hg38_repeats.motifs_1_to_1000bp.repeats_3x_and_spans_9bp.bed.gz.tmp -qnc https://storage.googleapis.com/str-truth-set/hg38/ref/other/colab-repeat-finder/hg38_repeats.motifs_1_to_1000bp.repeats_3x_and_spans_9bp/hg38_repeats.motifs_1_to_1000bp.repeats_3x_and_spans_9bp.bed.gz && mv hg38_repeats.motifs_1_to_1000bp.repeats_3x_and_spans_9bp.bed.gz.tmp hg38_repeats.motifs_1_to_1000bp.repeats_3x_and_spans_9bp.bed.gz wget -O merged_expansion_hunter_catalog.78_samples.json.gz.tmp -qnc https://storage.googleapis.com/str-truth-set-v2/filter_vcf/all_repeats_including_homopolymers_keeping_loci_that_have_overlapping_variants/combined/merged_expansion_hunter_catalog.78_samples.json.gz && mv merged_expansion_hunter_catalog.78_samples.json.gz.tmp merged_expansion_hunter_catalog.78_samples.json.gz STEP #0: python3 -u -m str_analysis.split_adjacent_loci_in_expansion_hunter_catalog /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/variant_catalog_without_offtargets.GRCh38.json Parsing /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/variant_catalog_without_offtargets.GRCh38.json Split 9 loci into 19 output records Wrote 83 total records to 
variant_catalog_without_offtargets.GRCh38.split.json STEP #0: sed -i 's/AARRG/AAAAG/g' /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/variant_catalog_without_offtargets.GRCh38.split.json STEP #0: gzip -f /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/variant_catalog_without_offtargets.GRCh38.split.json STEP #1: python3 -u -m str_analysis.annotate_and_filter_str_catalog --verbose --reference-fasta /Users/weisburd/code/tandem-repeat-catalogs/hg38.fa --min-interval-size-bp 1 --skip-gene-annotations --skip-mappability-annotations --skip-disease-loci-annotations --discard-loci-with-non-ACGT-bases-in-reference --discard-loci-with-non-ACGTN-bases-in-motif --output-path /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/variant_catalog_without_offtargets.GRCh38.split.primary_disease_associated_loci.json.gz /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/variant_catalog_without_offtargets.GRCh38.split.primary_disease_associated_loci.json.gz Args: reference_fasta = /Users/weisburd/code/tandem-repeat-catalogs/hg38.fa output_path = /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/variant_catalog_without_offtargets.GRCh38.split.primary_disease_associated_loci.json.gz variant_catalog_json_or_bed = /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/variant_catalog_without_offtargets.GRCh38.split.primary_disease_associated_loci.json.gz min_interval_size_bp = 1 output_tsv = False output_bed = False output_stats = False show_progress_bar = False set_locus_id = False add_gene_region_to_locus_id = False add_canonical_motif_to_locus_id = False only_known_disease_associated_loci = False exclude_known_disease_associated_loci = False only_known_disease_associated_motifs = False discard_overlapping_intervals_with_similar_motifs = False discard_loci_with_non_ACGT_bases_in_motif = False discard_loci_with_non_ACGTN_bases_in_reference = False dont_simplify_motifs = False trim_loci = False verbose = True 
skip_gene_annotations = True skip_disease_loci_annotations = True skip_mappability_annotations = True discard_loci_with_non_ACGTN_bases_in_motif = True discard_loci_with_non_ACGT_bases_in_reference = True mappability_track_bigwig = gs://tgg-viewer/ref/GRCh38/mappability/GRCh38_no_alt_analysis_set_GCA_000001405.15-k36_m2.bw Parsing /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/variant_catalog_without_offtargets.GRCh38.split.primary_disease_associated_loci.json.gz Wrote 63 records to /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/variant_catalog_without_offtargets.GRCh38.split.primary_disease_associated_loci.json.gz Filter stats: 63 total input rows 63 out of 63 (100%) passed all filters STEP #2: python3 -m str_analysis.compute_catalog_stats --verbose /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/variant_catalog_without_offtargets.GRCh38.split.primary_disease_associated_loci.json.gz -------------------------------------------------- Parsing /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/variant_catalog_without_offtargets.GRCh38.split.primary_disease_associated_loci.json.gz Stats for variant_catalog_without_offtargets.GRCh38.split.primary_disease_associated_loci.json.gz: 63 total loci 3,204 base pairs spanned by all loci (0.000% of the genome) 0 out of 63 ( 0.0%) loci define adjacent repeats 63 total repeat intervals 62 out of 63 ( 98.4%) repeat interval size is an integer multiple of the motif size (aka. 
trimmed) 0 out of 63 ( 0.0%) repeat intervals are homopolymers 0 out of 63 ( 0.0%) repeat intervals overlap each other by at least two motif lengths 11 out of 63 ( 17.5%) repeat intervals have non-ACGT motifs Ranges: Motif size range: 3-24bp Locus size range: 15-150bp Num repeats range: 2-50x repeats Max locus size = 150bp @ chr13:102161574-102161724 (AAG) Min reference repeat purity = 0.60 @ chr8:118366812-118366918 (AAAAT) Base-level purity median: 1.000, mean: 0.927 chrX: 7 out of 63 ( 11.1%) repeat intervals chrY: 0 out of 63 ( 0.0%) repeat intervals chrM: 0 out of 63 ( 0.0%) repeat intervals alt contigs: 0 out of 63 ( 0.0%) repeat intervals Motif size distribution: 1bp: 0 out of 63 ( 0.0%) repeat intervals 2bp: 0 out of 63 ( 0.0%) repeat intervals 3bp: 46 out of 63 ( 73.0%) repeat intervals 4bp: 1 out of 63 ( 1.6%) repeat intervals 5bp: 10 out of 63 ( 15.9%) repeat intervals 6bp: 2 out of 63 ( 3.2%) repeat intervals 7-24bp: 4 out of 63 ( 6.3%) repeat intervals 25+bp: 0 out of 63 ( 0.0%) repeat intervals Num repeats in reference: 1x: 0 out of 63 ( 0.0%) repeat intervals 2x: 1 out of 63 ( 1.6%) repeat intervals 3x: 2 out of 63 ( 3.2%) repeat intervals 4x: 3 out of 63 ( 4.8%) repeat intervals 5x: 2 out of 63 ( 3.2%) repeat intervals 6x: 2 out of 63 ( 3.2%) repeat intervals 7x: 4 out of 63 ( 6.3%) repeat intervals 8x: 1 out of 63 ( 1.6%) repeat intervals 9x: 1 out of 63 ( 1.6%) repeat intervals 10-15x: 26 out of 63 ( 41.3%) repeat intervals 16-25x: 17 out of 63 ( 27.0%) repeat intervals 26-35x: 2 out of 63 ( 3.2%) repeat intervals 36-50x: 2 out of 63 ( 3.2%) repeat intervals 51+x: 0 out of 63 ( 0.0%) repeat intervals Reference repeat purity distribution: 0.0: 0 out of 63 ( 0.0%) repeat intervals 0.1: 0 out of 63 ( 0.0%) repeat intervals 0.2: 0 out of 63 ( 0.0%) repeat intervals 0.3: 0 out of 63 ( 0.0%) repeat intervals 0.4: 0 out of 63 ( 0.0%) repeat intervals 0.5: 0 out of 63 ( 0.0%) repeat intervals 0.6: 12 out of 63 ( 19.0%) repeat intervals 0.7: 0 out of 63 ( 
0.0%) repeat intervals
  0.8: 0 out of 63 ( 0.0%) repeat intervals
  0.9: 15 out of 63 ( 23.8%) repeat intervals
  1.0: 36 out of 63 ( 57.1%) repeat intervals
Locus sizes at each motif size:
  3bp motifs: locus size range: 15 bp to 150 bp (median: 42 bp) based on 46 loci. Mean base purity: 0.91.
  4bp motifs: locus size range: 80 bp to 80 bp (median: 80 bp) based on 1 loci. Mean base purity: 0.96.
  5bp motifs: locus size range: 35 bp to 106 bp (median: 67 bp) based on 10 loci. Mean base purity: 0.96.
  6bp motifs: locus size range: 18 bp to 24 bp (median: 21 bp) based on 2 loci. Mean base purity: 1.00.
  10bp motifs: locus size range: 20 bp to 20 bp (median: 20 bp) based on 1 loci. Mean base purity: 1.00.
  12bp motifs: locus size range: 36 bp to 36 bp (median: 36 bp) based on 1 loci. Mean base purity: 1.00.
  20bp motifs: locus size range: 80 bp to 80 bp (median: 80 bp) based on 1 loci. Mean base purity: 1.00.
  24bp motifs: locus size range: 96 bp to 96 bp (median: 96 bp) based on 1 loci. Mean base purity: 0.96.
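NOTE: The stats block above reports a locus size, a repeat count, and whether each interval is "trimmed" (an integer multiple of the motif size). A minimal sketch of that arithmetic follows, assuming locus size is simply end minus start, as the reported "Max locus size = 150bp @ chr13:102161574-102161724" suggests. The helper names `parse_locus` and `describe_locus` are hypothetical and are not part of str_analysis.

```python
def parse_locus(locus):
    """Split a 'chrom:start-end' string into (chrom, start, end)."""
    chrom, span = locus.split(":")
    start, end = (int(x) for x in span.split("-"))
    return chrom, start, end


def describe_locus(locus, motif):
    """Return size, whole-repeat count, and trimmed status for one interval.

    Assumption: size = end - start, which matches the sizes printed in the log.
    """
    _, start, end = parse_locus(locus)
    size_bp = end - start
    return {
        "size_bp": size_bp,
        "whole_repeats": size_bp // len(motif),
        "is_trimmed": size_bp % len(motif) == 0,
    }


# The two loci named in the stats block above.
print(describe_locus("chr13:102161574-102161724", "AAG"))
print(describe_locus("chr8:118366812-118366918", "AAAAT"))
```

Under these assumptions the chr13 locus spans 150 bp, which is 50 whole AAG repeats, in line with the reported "2-50x" repeat range. The chr8 locus spans 106 bp, which is not a multiple of 5, so it would count as untrimmed.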
Wrote 1 rows to variant_catalog_without_offtargets.GRCh38.split.primary_disease_associated_loci.catalog_stats.tsv ======================================================================================================================================================================================================== cd /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01 mkdir -p 1_to_1000bp_motifs cd 1_to_1000bp_motifs STEP #3: python3 -u -m str_analysis.annotate_and_filter_str_catalog --verbose --known-disease-associated-loci /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/variant_catalog_without_offtargets.GRCh38.split.primary_disease_associated_loci.json.gz --reference-fasta /Users/weisburd/code/tandem-repeat-catalogs/hg38.fa --min-motif-size 1 --max-motif-size 1000 --min-interval-size-bp 1 --skip-gene-annotations --skip-mappability-annotations --skip-disease-loci-annotations --set-locus-id --discard-loci-with-non-ACGT-bases-in-reference --discard-loci-with-non-ACGTN-bases-in-motif --output-path /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/variant_catalog_without_offtargets.GRCh38.split.filtered.json.gz /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/variant_catalog_without_offtargets.GRCh38.split.json.gz Args: reference_fasta = /Users/weisburd/code/tandem-repeat-catalogs/hg38.fa output_path = /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/variant_catalog_without_offtargets.GRCh38.split.filtered.json.gz variant_catalog_json_or_bed = /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/variant_catalog_without_offtargets.GRCh38.split.json.gz known_disease_associated_loci = /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/variant_catalog_without_offtargets.GRCh38.split.primary_disease_associated_loci.json.gz min_motif_size = 1 min_interval_size_bp = 1 max_motif_size = 1000 output_tsv = False output_bed = False output_stats = 
False show_progress_bar = False add_gene_region_to_locus_id = False add_canonical_motif_to_locus_id = False only_known_disease_associated_loci = False exclude_known_disease_associated_loci = False only_known_disease_associated_motifs = False discard_overlapping_intervals_with_similar_motifs = False discard_loci_with_non_ACGT_bases_in_motif = False discard_loci_with_non_ACGTN_bases_in_reference = False dont_simplify_motifs = False trim_loci = False verbose = True skip_gene_annotations = True skip_disease_loci_annotations = True skip_mappability_annotations = True set_locus_id = True discard_loci_with_non_ACGTN_bases_in_motif = True discard_loci_with_non_ACGT_bases_in_reference = True mappability_track_bigwig = gs://tgg-viewer/ref/GRCh38/mappability/GRCh38_no_alt_analysis_set_GCA_000001405.15-k36_m2.bw Parsing /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/variant_catalog_without_offtargets.GRCh38.split.json.gz Wrote 83 records to /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/variant_catalog_without_offtargets.GRCh38.split.filtered.json.gz Filter stats: 83 total input rows 83 out of 83 (100%) passed all filters Stats for /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/variant_catalog_without_offtargets.GRCh38.split.json.gz STEP #3: python3 -m str_analysis.compute_catalog_stats --verbose /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/variant_catalog_without_offtargets.GRCh38.split.filtered.json.gz -------------------------------------------------- Parsing /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/variant_catalog_without_offtargets.GRCh38.split.filtered.json.gz Stats for variant_catalog_without_offtargets.GRCh38.split.filtered.json.gz: 83 total loci 3,770 base pairs spanned by all loci (0.000% of the genome) 0 out of 83 ( 0.0%) loci define adjacent repeats 83 total repeat intervals 82 out of 83 ( 98.8%) repeat interval size 
is an integer multiple of the motif size (aka. trimmed) 1 out of 83 ( 1.2%) repeat intervals are homopolymers 0 out of 83 ( 0.0%) repeat intervals overlap each other by at least two motif lengths 11 out of 83 ( 13.3%) repeat intervals have non-ACGT motifs Ranges: Motif size range: 1-27bp Locus size range: 6-150bp Num repeats range: 1-50x repeats Max locus size = 150bp @ chr13:102161574-102161724 (AAG) Min reference repeat purity = 0.60 @ chr8:118366812-118366918 (AAAAT) Base-level purity median: 1.000, mean: 0.942 chrX: 11 out of 83 ( 13.3%) repeat intervals chrY: 0 out of 83 ( 0.0%) repeat intervals chrM: 0 out of 83 ( 0.0%) repeat intervals alt contigs: 0 out of 83 ( 0.0%) repeat intervals Motif size distribution: 1bp: 1 out of 83 ( 1.2%) repeat intervals 2bp: 1 out of 83 ( 1.2%) repeat intervals 3bp: 60 out of 83 ( 72.3%) repeat intervals 4bp: 2 out of 83 ( 2.4%) repeat intervals 5bp: 10 out of 83 ( 12.0%) repeat intervals 6bp: 4 out of 83 ( 4.8%) repeat intervals 7-24bp: 4 out of 83 ( 4.8%) repeat intervals 25+bp: 1 out of 83 ( 1.2%) repeat intervals Num repeats in reference: 1x: 1 out of 83 ( 1.2%) repeat intervals 2x: 3 out of 83 ( 3.6%) repeat intervals 3x: 3 out of 83 ( 3.6%) repeat intervals 4x: 4 out of 83 ( 4.8%) repeat intervals 5x: 2 out of 83 ( 2.4%) repeat intervals 6x: 2 out of 83 ( 2.4%) repeat intervals 7x: 5 out of 83 ( 6.0%) repeat intervals 8x: 5 out of 83 ( 6.0%) repeat intervals 9x: 2 out of 83 ( 2.4%) repeat intervals 10-15x: 31 out of 83 ( 37.3%) repeat intervals 16-25x: 21 out of 83 ( 25.3%) repeat intervals 26-35x: 2 out of 83 ( 2.4%) repeat intervals 36-50x: 2 out of 83 ( 2.4%) repeat intervals 51+x: 0 out of 83 ( 0.0%) repeat intervals Reference repeat purity distribution: 0.0: 0 out of 83 ( 0.0%) repeat intervals 0.1: 0 out of 83 ( 0.0%) repeat intervals 0.2: 0 out of 83 ( 0.0%) repeat intervals 0.3: 0 out of 83 ( 0.0%) repeat intervals 0.4: 0 out of 83 ( 0.0%) repeat intervals 0.5: 0 out of 83 ( 0.0%) repeat intervals 0.6: 12 out of 
83 ( 14.5%) repeat intervals
  0.7: 0 out of 83 ( 0.0%) repeat intervals
  0.8: 0 out of 83 ( 0.0%) repeat intervals
  0.9: 19 out of 83 ( 22.9%) repeat intervals
  1.0: 52 out of 83 ( 62.7%) repeat intervals
Locus sizes at each motif size:
  1bp motifs: locus size range: 25 bp to 25 bp (median: 25 bp) based on 1 loci. Mean base purity: 0.92.
  2bp motifs: locus size range: 36 bp to 36 bp (median: 36 bp) based on 1 loci. Mean base purity: 1.00.
  3bp motifs: locus size range: 6 bp to 150 bp (median: 39 bp) based on 60 loci. Mean base purity: 0.93.
  4bp motifs: locus size range: 40 bp to 80 bp (median: 60 bp) based on 2 loci. Mean base purity: 0.98.
  5bp motifs: locus size range: 35 bp to 106 bp (median: 67 bp) based on 10 loci. Mean base purity: 0.96.
  6bp motifs: locus size range: 12 bp to 24 bp (median: 18 bp) based on 4 loci. Mean base purity: 1.00.
  10bp motifs: locus size range: 20 bp to 20 bp (median: 20 bp) based on 1 loci. Mean base purity: 1.00.
  12bp motifs: locus size range: 36 bp to 36 bp (median: 36 bp) based on 1 loci. Mean base purity: 1.00.
  20bp motifs: locus size range: 80 bp to 80 bp (median: 80 bp) based on 1 loci. Mean base purity: 1.00.
  24bp motifs: locus size range: 96 bp to 96 bp (median: 96 bp) based on 1 loci. Mean base purity: 0.96.
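NOTE: The "N out of M ( P%)" entries in the stats blocks above can be pulled out of the log text and cross-checked. The sketch below is one possible way to do that with a regular expression; the pattern and the `parse_counts` helper are assumptions made for illustration, not something the pipeline itself provides.

```python
import re

# Matches e.g. "60 out of 83 ( 72.3%)" or "174,286 out of 174,286 (100%)".
COUNT_RE = re.compile(r"([\d,]+) out of ([\d,]+) \(\s*([\d.]+)%\)")


def parse_counts(text):
    """Return (count, total, reported_percent) tuples found in a chunk of log text."""
    results = []
    for match in COUNT_RE.finditer(text):
        count = int(match.group(1).replace(",", ""))
        total = int(match.group(2).replace(",", ""))
        results.append((count, total, float(match.group(3))))
    return results


# Two entries copied from the motif size distribution above.
sample = (
    "3bp: 60 out of 83 ( 72.3%) repeat intervals "
    "4bp: 2 out of 83 ( 2.4%) repeat intervals"
)
for count, total, reported in parse_counts(sample):
    recomputed = round(100 * count / total, 1)
    print(f"{count}/{total}: reported {reported}%, recomputed {recomputed}%")
```

Recomputing the percentage from the two counts is a cheap sanity check when comparing stats across catalog versions.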
Wrote 1 rows to variant_catalog_without_offtargets.GRCh38.split.filtered.catalog_stats.tsv STEP #3: python3 -u -m str_analysis.annotate_and_filter_str_catalog --verbose --known-disease-associated-loci /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/variant_catalog_without_offtargets.GRCh38.split.primary_disease_associated_loci.json.gz --reference-fasta /Users/weisburd/code/tandem-repeat-catalogs/hg38.fa --min-motif-size 1 --max-motif-size 1000 --min-interval-size-bp 1 --skip-gene-annotations --skip-mappability-annotations --skip-disease-loci-annotations --set-locus-id --discard-loci-with-non-ACGT-bases-in-reference --discard-loci-with-non-ACGTN-bases-in-motif --output-path /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/illumina_variant_catalog.sorted.filtered.json.gz /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/illumina_variant_catalog.sorted.bed.gz Args: reference_fasta = /Users/weisburd/code/tandem-repeat-catalogs/hg38.fa output_path = /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/illumina_variant_catalog.sorted.filtered.json.gz variant_catalog_json_or_bed = /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/illumina_variant_catalog.sorted.bed.gz known_disease_associated_loci = /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/variant_catalog_without_offtargets.GRCh38.split.primary_disease_associated_loci.json.gz min_motif_size = 1 min_interval_size_bp = 1 max_motif_size = 1000 output_tsv = False output_bed = False output_stats = False show_progress_bar = False add_gene_region_to_locus_id = False add_canonical_motif_to_locus_id = False only_known_disease_associated_loci = False exclude_known_disease_associated_loci = False only_known_disease_associated_motifs = False discard_overlapping_intervals_with_similar_motifs = False discard_loci_with_non_ACGT_bases_in_motif = False discard_loci_with_non_ACGTN_bases_in_reference = False 
dont_simplify_motifs = False
trim_loci = False
verbose = True
skip_gene_annotations = True
skip_disease_loci_annotations = True
skip_mappability_annotations = True
set_locus_id = True
discard_loci_with_non_ACGTN_bases_in_motif = True
discard_loci_with_non_ACGT_bases_in_reference = True
mappability_track_bigwig = gs://tgg-viewer/ref/GRCh38/mappability/GRCh38_no_alt_analysis_set_GCA_000001405.15-k36_m2.bw
Parsing /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/illumina_variant_catalog.sorted.bed.gz
WARNING: skipping line with invalid motif: chr3 63912684 63912714 (GCA)*(GCC)+ .
WARNING: skipping line with invalid motif: chr3 63912714 63912726 (GCA)*(GCC)+ .
WARNING: skipping line with invalid motif: chr3 129172576 129172656 (CAGG)*(CAGA)*(CA)* .
WARNING: skipping line with invalid motif: chr3 129172656 129172696 (CAGG)*(CAGA)*(CA)* .
WARNING: skipping line with invalid motif: chr3 129172696 129172732 (CAGG)*(CAGA)*(CA)* .
WARNING: skipping line with invalid motif: chr4 3074876 3074933 (CAG)*CAACAG(CCG)* .
WARNING: skipping line with invalid motif: chr4 3074939 3074966 (CAG)*CAACAG(CCG)* .
WARNING: skipping line with invalid motif: chr4 39348424 39348479 (AARRG)* .
WARNING: skipping line with invalid motif: chr9 69037261 69037286 (A)*(GAA)* .
WARNING: skipping line with invalid motif: chr9 69037286 69037304 (A)*(GAA)* .
WARNING: skipping line with invalid motif: chr13 70139353 70139383 (CTA)*(CTG)* .
WARNING: skipping line with invalid motif: chr13 70139383 70139428 (CTA)*(CTG)* .
WARNING: skipping line with invalid motif: chr20 2652733 2652757 (GGCCTG)*(CGCCTG)* .
WARNING: skipping line with invalid motif: chr20 2652757 2652775 (GGCCTG)*(CGCCTG)* .
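NOTE: The skipped lines above all carry either a multi-motif structure such as (GCA)*(GCC)+ or a motif with non-ACGTN bases such as (AARRG)*. The sketch below shows one rule that would reject exactly those entries. It is an assumption about what "invalid motif" means here, not the actual check used by str_analysis, and `extract_single_motif` is a hypothetical helper.

```python
import re

# Assumed rule: accept either a bare motif of A/C/G/T/N bases, or a single
# motif wrapped as "(MOTIF)*" or "(MOTIF)+". Anything else is treated as invalid.
SINGLE_MOTIF_RE = re.compile(r"^(?:\(([ACGTN]+)\)[*+]|([ACGTN]+))$")


def extract_single_motif(field):
    """Return the motif if the field holds exactly one plain motif, else None."""
    match = SINGLE_MOTIF_RE.match(field)
    if match is None:
        return None
    return match.group(1) or match.group(2)


# Motif fields copied from the warnings above, plus two single-motif examples.
for field in ["(GCA)*(GCC)+", "(CAG)*CAACAG(CCG)*", "(AARRG)*", "(CAG)*", "TATC"]:
    print(field, "->", extract_single_motif(field))
```

With this rule the first three fields come back as None, while "(CAG)*" and "TATC" yield "CAG" and "TATC".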
Wrote 174,286 records to /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/illumina_variant_catalog.sorted.filtered.json.gz Filter stats: 174,286 total input rows 174,286 out of 174,286 (100%) passed all filters Modification stats: 349 out of 174,286 ( 0%) replaced 2bp motif with a simplified 1bp motif 8 out of 174,286 ( 0%) replaced 3bp motif with a simplified 1bp motif 6 out of 174,286 ( 0%) replaced 4bp motif with a simplified 1bp motif 4 out of 174,286 ( 0%) replaced 4bp motif with a simplified 2bp motif 2 out of 174,286 ( 0%) replaced 5bp motif with a simplified 1bp motif Stats for /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/illumina_variant_catalog.sorted.bed.gz STEP #3: python3 -m str_analysis.compute_catalog_stats --verbose /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/illumina_variant_catalog.sorted.filtered.json.gz -------------------------------------------------- Parsing /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/illumina_variant_catalog.sorted.filtered.json.gz Stats for illumina_variant_catalog.sorted.filtered.json.gz: 174,286 total loci 4,623,031 base pairs spanned by all loci (0.150% of the genome) 0 out of 174,286 ( 0.0%) loci define adjacent repeats 174,286 total repeat intervals 174,262 out of 174,286 (100.0%) repeat interval size is an integer multiple of the motif size (aka. 
trimmed) 365 out of 174,286 ( 0.2%) repeat intervals are homopolymers 0 out of 174,286 ( 0.0%) repeat intervals overlap each other by at least two motif lengths 1 out of 174,286 ( 0.0%) repeat intervals have non-ACGT motifs Ranges: Motif size range: 1-24bp Locus size range: 4-532bp Num repeats range: 2-133x repeats Max locus size = 532bp @ chr7:138621782-138622314 (TATC) Min reference repeat purity = 0.43 @ chr3:112804380-112804514 (TCT) Base-level purity median: 1.000, mean: 0.998 chrX: 8,561 out of 174,286 ( 4.9%) repeat intervals chrY: 0 out of 174,286 ( 0.0%) repeat intervals chrM: 0 out of 174,286 ( 0.0%) repeat intervals alt contigs: 0 out of 174,286 ( 0.0%) repeat intervals Motif size distribution: 1bp: 365 out of 174,286 ( 0.2%) repeat intervals 2bp: 98,531 out of 174,286 ( 56.5%) repeat intervals 3bp: 14,442 out of 174,286 ( 8.3%) repeat intervals 4bp: 43,263 out of 174,286 ( 24.8%) repeat intervals 5bp: 12,551 out of 174,286 ( 7.2%) repeat intervals 6bp: 3,931 out of 174,286 ( 2.3%) repeat intervals 7-24bp: 1,203 out of 174,286 ( 0.7%) repeat intervals 25+bp: 0 out of 174,286 ( 0.0%) repeat intervals Num repeats in reference: 1x: 0 out of 174,286 ( 0.0%) repeat intervals 2x: 8,780 out of 174,286 ( 5.0%) repeat intervals 3x: 6,543 out of 174,286 ( 3.8%) repeat intervals 4x: 8,882 out of 174,286 ( 5.1%) repeat intervals 5x: 11,884 out of 174,286 ( 6.8%) repeat intervals 6x: 13,900 out of 174,286 ( 8.0%) repeat intervals 7x: 14,915 out of 174,286 ( 8.6%) repeat intervals 8x: 13,902 out of 174,286 ( 8.0%) repeat intervals 9x: 12,532 out of 174,286 ( 7.2%) repeat intervals 10-15x: 49,952 out of 174,286 ( 28.7%) repeat intervals 16-25x: 31,237 out of 174,286 ( 17.9%) repeat intervals 26-35x: 1,729 out of 174,286 ( 1.0%) repeat intervals 36-50x: 24 out of 174,286 ( 0.0%) repeat intervals 51+x: 6 out of 174,286 ( 0.0%) repeat intervals Reference repeat purity distribution: 0.0: 0 out of 174,286 ( 0.0%) repeat intervals 0.1: 0 out of 174,286 ( 0.0%) repeat 
intervals 0.2: 0 out of 174,286 ( 0.0%) repeat intervals 0.3: 0 out of 174,286 ( 0.0%) repeat intervals 0.4: 3 out of 174,286 ( 0.0%) repeat intervals 0.5: 14 out of 174,286 ( 0.0%) repeat intervals 0.6: 5 out of 174,286 ( 0.0%) repeat intervals 0.7: 2 out of 174,286 ( 0.0%) repeat intervals 0.8: 2 out of 174,286 ( 0.0%) repeat intervals 0.9: 7,874 out of 174,286 ( 4.5%) repeat intervals 1.0: 166,386 out of 174,286 ( 95.5%) repeat intervals Locus sizes at each motif size: 1bp motifs: locus size range: 4 bp to 36 bp (median: 22 bp) based on 365 loci. Mean base purity: 1.00. 2bp motifs: locus size range: 4 bp to 82 bp (median: 24 bp) based on 98,531 loci. Mean base purity: 1.00. 3bp motifs: locus size range: 6 bp to 240 bp (median: 27 bp) based on 14,442 loci. Mean base purity: 0.99. 4bp motifs: locus size range: 8 bp to 532 bp (median: 24 bp) based on 43,263 loci. Mean base purity: 1.00. 5bp motifs: locus size range: 10 bp to 220 bp (median: 25 bp) based on 12,551 loci. Mean base purity: 1.00. 6bp motifs: locus size range: 12 bp to 106 bp (median: 24 bp) based on 3,931 loci. Mean base purity: 1.00. 7bp motifs: locus size range: 14 bp to 70 bp (median: 14 bp) based on 630 loci. Mean base purity: 1.00. 8bp motifs: locus size range: 16 bp to 72 bp (median: 16 bp) based on 428 loci. Mean base purity: 1.00. 9bp motifs: locus size range: 18 bp to 90 bp (median: 27 bp) based on 34 loci. Mean base purity: 1.00. 10bp motifs: locus size range: 20 bp to 60 bp (median: 30 bp) based on 15 loci. Mean base purity: 1.00. 
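NOTE: The modification stats for this catalog report motifs being replaced with a simplified, shorter motif (for example, a 2bp motif replaced with a 1bp motif). One common way to do that is to find the shortest unit that, repeated a whole number of times, reproduces the motif. The sketch below illustrates that idea only; `simplify_motif` is a hypothetical helper and may differ from what str_analysis actually does.

```python
def simplify_motif(motif):
    """Return the shortest repeating unit of a motif, or the motif itself.

    Example: "AA" collapses to "A" and "ATAT" collapses to "AT", while "CAG"
    has no shorter repeating unit and is returned unchanged.
    """
    length = len(motif)
    for unit_size in range(1, length):
        if length % unit_size != 0:
            continue  # the unit must tile the motif exactly
        unit = motif[:unit_size]
        if unit * (length // unit_size) == motif:
            return unit
    return motif


for motif in ["AA", "TTT", "ATAT", "AAAAA", "CAG", "AAAAG"]:
    print(motif, "->", simplify_motif(motif))
```

Checking candidate unit sizes from shortest to longest means the first match is the minimal unit, so "AAAA" collapses straight to "A" rather than stopping at "AA".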
Wrote 1 rows to illumina_variant_catalog.sorted.filtered.catalog_stats.tsv STEP #3: python3 -u -m str_analysis.annotate_and_filter_str_catalog --verbose --known-disease-associated-loci /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/variant_catalog_without_offtargets.GRCh38.split.primary_disease_associated_loci.json.gz --reference-fasta /Users/weisburd/code/tandem-repeat-catalogs/hg38.fa --min-motif-size 1 --max-motif-size 1000 --min-interval-size-bp 1 --skip-gene-annotations --skip-mappability-annotations --skip-disease-loci-annotations --set-locus-id --discard-loci-with-non-ACGT-bases-in-reference --discard-loci-with-non-ACGTN-bases-in-motif --output-path /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/hg38_repeats.motifs_1_to_1000bp.repeats_3x_and_spans_9bp.filtered.json.gz /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/hg38_repeats.motifs_1_to_1000bp.repeats_3x_and_spans_9bp.bed.gz Args: reference_fasta = /Users/weisburd/code/tandem-repeat-catalogs/hg38.fa output_path = /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/hg38_repeats.motifs_1_to_1000bp.repeats_3x_and_spans_9bp.filtered.json.gz variant_catalog_json_or_bed = /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/hg38_repeats.motifs_1_to_1000bp.repeats_3x_and_spans_9bp.bed.gz known_disease_associated_loci = /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/variant_catalog_without_offtargets.GRCh38.split.primary_disease_associated_loci.json.gz min_motif_size = 1 min_interval_size_bp = 1 max_motif_size = 1000 output_tsv = False output_bed = False output_stats = False show_progress_bar = False add_gene_region_to_locus_id = False add_canonical_motif_to_locus_id = False only_known_disease_associated_loci = False exclude_known_disease_associated_loci = False only_known_disease_associated_motifs = False discard_overlapping_intervals_with_similar_motifs = False 
discard_loci_with_non_ACGT_bases_in_motif = False discard_loci_with_non_ACGTN_bases_in_reference = False dont_simplify_motifs = False trim_loci = False verbose = True skip_gene_annotations = True skip_disease_loci_annotations = True skip_mappability_annotations = True set_locus_id = True discard_loci_with_non_ACGTN_bases_in_motif = True discard_loci_with_non_ACGT_bases_in_reference = True mappability_track_bigwig = gs://tgg-viewer/ref/GRCh38/mappability/GRCh38_no_alt_analysis_set_GCA_000001405.15-k36_m2.bw Parsing /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/hg38_repeats.motifs_1_to_1000bp.repeats_3x_and_spans_9bp.bed.gz Wrote 4,558,281 records to /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/hg38_repeats.motifs_1_to_1000bp.repeats_3x_and_spans_9bp.filtered.json.gz Filter stats: 4,558,281 total input rows 4,558,281 out of 4,558,281 (100%) passed all filters Stats for /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/hg38_repeats.motifs_1_to_1000bp.repeats_3x_and_spans_9bp.bed.gz STEP #3: python3 -m str_analysis.compute_catalog_stats --verbose /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/hg38_repeats.motifs_1_to_1000bp.repeats_3x_and_spans_9bp.filtered.json.gz -------------------------------------------------- Parsing /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/hg38_repeats.motifs_1_to_1000bp.repeats_3x_and_spans_9bp.filtered.json.gz Stats for hg38_repeats.motifs_1_to_1000bp.repeats_3x_and_spans_9bp.filtered.json.gz: 4,558,281 total loci 62,504,037 base pairs spanned by all loci (2.024% of the genome) 0 out of 4,558,281 ( 0.0%) loci define adjacent repeats 4,558,281 total repeat intervals 2,799,051 out of 4,558,281 ( 61.4%) repeat interval size is an integer multiple of the motif size (aka. 
trimmed) 1,337,567 out of 4,558,281 ( 29.3%) repeat intervals are homopolymers 0 out of 4,558,281 ( 0.0%) repeat intervals overlap each other by at least two motif lengths Ranges: Motif size range: 1-833bp Locus size range: 9-2523bp Num repeats range: 3-300x repeats Max locus size = 2,523bp @ chrX:71520430-71522953 (CCAGCACTTTGGGAGGCCGAGGCAGGCTGATCACTAGGTCAGGAGTTCAAGACCAGCCTGGCCAACATGGTGAAACCCCCGTCTCTACTAAAAATACAAAAATTACCTGGGTGTGGGGGTGGGCACCTGTAATCCCAGCTACTCGGGAGGCTGGGGAGGCAGGAGAATTGCCTGAACCTGAGAGGCAGAGGCTGCAGTGAGCTGAGATTGTGCCACTGCACTCCAGCCTGGGCGACAGAGTGAGACTCAGTCTCAAAACAAAAAAAAAAAAAGATTTTAGTAACTTTTATCCTGTTTTAATAATACTGACTCAGAAACTATAATGTGTACTTTATAATTTACTTCCTAGATGACACTTGATTTTCTTCAAGAGCAAGATAGCTGCCCTGTGCAGTTGGTCTCCTTGAAAACTATTTTAGTTCTATCATAATTTCCTGTGATAAATATTTTGACCTTCTAAAATTTCAGAATATTGCACCAAGTAGAAAGAAAATAGGTTTTTTCTCTTTTCTTCTTCTTCCTTTTTTTTTTCTGAGAAAGAGGGAATGAGAACTTTAGTGTTCTTTCAATAGCGTTCTTATTTGTAGAAATGCATAATAGTGTCCTAGTAAGGCTTGACAATAACTCTGGTCTTCATCATATTTTGTGATAAAACTTTTGATTTAAAAAAACCTCTGATCTATTTATCATGGCAAATGGATAGAGCTTTCCTGCCTGTTTTCTTTCTTTTCTTTTTTCTTTCTTTCCTTTTTTTTCCTTTGAGCTTAGATTTTTAGAAGCACATATTTAAAAATCAGGTATAAGACTGGATGCAGTGGCTCACGCCTGTAATC) Min reference repeat purity = 1.00 @ chrY:56887882-56887891 (TGA) Base-level purity median: 1.000, mean: 1.000 chrX: 232,977 out of 4,558,281 ( 5.1%) repeat intervals chrY: 38,867 out of 4,558,281 ( 0.9%) repeat intervals chrM: 14 out of 4,558,281 ( 0.0%) repeat intervals alt contigs: 0 out of 4,558,281 ( 0.0%) repeat intervals Motif size distribution: 1bp: 1,337,567 out of 4,558,281 ( 29.3%) repeat intervals 2bp: 967,389 out of 4,558,281 ( 21.2%) repeat intervals 3bp: 1,418,341 out of 4,558,281 ( 31.1%) repeat intervals 4bp: 569,907 out of 4,558,281 ( 12.5%) repeat intervals 5bp: 172,187 out of 4,558,281 ( 3.8%) repeat intervals 6bp: 52,023 out of 4,558,281 ( 1.1%) repeat intervals 7-24bp: 31,185 out of 4,558,281 ( 0.7%) repeat intervals 25+bp: 9,682 out of 4,558,281 ( 0.2%) repeat intervals Num repeats in reference: 1x: 0 out of 
4,558,281 ( 0.0%) repeat intervals 2x: 0 out of 4,558,281 ( 0.0%) repeat intervals 3x: 1,787,359 out of 4,558,281 ( 39.2%) repeat intervals 4x: 641,281 out of 4,558,281 ( 14.1%) repeat intervals 5x: 353,723 out of 4,558,281 ( 7.8%) repeat intervals 6x: 148,753 out of 4,558,281 ( 3.3%) repeat intervals 7x: 76,841 out of 4,558,281 ( 1.7%) repeat intervals 8x: 43,343 out of 4,558,281 ( 1.0%) repeat intervals 9x: 352,151 out of 4,558,281 ( 7.7%) repeat intervals 10-15x: 756,034 out of 4,558,281 ( 16.6%) repeat intervals 16-25x: 347,635 out of 4,558,281 ( 7.6%) repeat intervals 26-35x: 45,324 out of 4,558,281 ( 1.0%) repeat intervals 36-50x: 5,550 out of 4,558,281 ( 0.1%) repeat intervals 51+x: 287 out of 4,558,281 ( 0.0%) repeat intervals Reference repeat purity distribution: 0.0: 0 out of 4,558,281 ( 0.0%) repeat intervals 0.1: 0 out of 4,558,281 ( 0.0%) repeat intervals 0.2: 0 out of 4,558,281 ( 0.0%) repeat intervals 0.3: 0 out of 4,558,281 ( 0.0%) repeat intervals 0.4: 0 out of 4,558,281 ( 0.0%) repeat intervals 0.5: 0 out of 4,558,281 ( 0.0%) repeat intervals 0.6: 0 out of 4,558,281 ( 0.0%) repeat intervals 0.7: 0 out of 4,558,281 ( 0.0%) repeat intervals 0.8: 0 out of 4,558,281 ( 0.0%) repeat intervals 0.9: 0 out of 4,558,281 ( 0.0%) repeat intervals 1.0: 4,558,281 out of 4,558,281 (100.0%) repeat intervals Locus sizes at each motif size: 1bp motifs: locus size range: 9 bp to 90 bp (median: 12 bp) based on 1,337,567 loci. Mean base purity: 1.00. 2bp motifs: locus size range: 9 bp to 600 bp (median: 10 bp) based on 967,389 loci. Mean base purity: 1.00. 3bp motifs: locus size range: 9 bp to 632 bp (median: 9 bp) based on 1,418,341 loci. Mean base purity: 1.00. 4bp motifs: locus size range: 12 bp to 533 bp (median: 15 bp) based on 569,907 loci. Mean base purity: 1.00. 5bp motifs: locus size range: 15 bp to 341 bp (median: 18 bp) based on 172,187 loci. Mean base purity: 1.00. 6bp motifs: locus size range: 18 bp to 1,103 bp (median: 21 bp) based on 52,023 loci. 
Mean base purity: 1.00. 7bp motifs: locus size range: 21 bp to 151 bp (median: 23 bp) based on 13,079 loci. Mean base purity: 1.00. 8bp motifs: locus size range: 24 bp to 149 bp (median: 26 bp) based on 5,531 loci. Mean base purity: 1.00. 9bp motifs: locus size range: 27 bp to 116 bp (median: 29 bp) based on 2,284 loci. Mean base purity: 1.00. 10bp motifs: locus size range: 30 bp to 111 bp (median: 34 bp) based on 1,693 loci. Mean base purity: 1.00. Wrote 1 rows to hg38_repeats.motifs_1_to_1000bp.repeats_3x_and_spans_9bp.filtered.catalog_stats.tsv STEP #3: python3 -u -m str_analysis.annotate_and_filter_str_catalog --verbose --known-disease-associated-loci /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/variant_catalog_without_offtargets.GRCh38.split.primary_disease_associated_loci.json.gz --reference-fasta /Users/weisburd/code/tandem-repeat-catalogs/hg38.fa --min-motif-size 1 --max-motif-size 1000 --min-interval-size-bp 1 --skip-gene-annotations --skip-mappability-annotations --skip-disease-loci-annotations --set-locus-id --discard-loci-with-non-ACGT-bases-in-reference --discard-loci-with-non-ACGTN-bases-in-motif --output-path /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/merged_expansion_hunter_catalog.78_samples.filtered.json.gz /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/merged_expansion_hunter_catalog.78_samples.json.gz Args: reference_fasta = /Users/weisburd/code/tandem-repeat-catalogs/hg38.fa output_path = /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/merged_expansion_hunter_catalog.78_samples.filtered.json.gz variant_catalog_json_or_bed = /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/merged_expansion_hunter_catalog.78_samples.json.gz known_disease_associated_loci = /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/variant_catalog_without_offtargets.GRCh38.split.primary_disease_associated_loci.json.gz min_motif_size = 1 
min_interval_size_bp = 1 max_motif_size = 1000 output_tsv = False output_bed = False output_stats = False show_progress_bar = False add_gene_region_to_locus_id = False add_canonical_motif_to_locus_id = False only_known_disease_associated_loci = False exclude_known_disease_associated_loci = False only_known_disease_associated_motifs = False discard_overlapping_intervals_with_similar_motifs = False discard_loci_with_non_ACGT_bases_in_motif = False discard_loci_with_non_ACGTN_bases_in_reference = False dont_simplify_motifs = False trim_loci = False verbose = True skip_gene_annotations = True skip_disease_loci_annotations = True skip_mappability_annotations = True set_locus_id = True discard_loci_with_non_ACGTN_bases_in_motif = True discard_loci_with_non_ACGT_bases_in_reference = True mappability_track_bigwig = gs://tgg-viewer/ref/GRCh38/mappability/GRCh38_no_alt_analysis_set_GCA_000001405.15-k36_m2.bw Parsing /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/merged_expansion_hunter_catalog.78_samples.json.gz Wrote 1,937,160 records to /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/merged_expansion_hunter_catalog.78_samples.filtered.json.gz Filter stats: 1,937,805 total input rows 1,937,160 out of 1,937,805 (100%) passed all filters 645 out of 1,937,805 ( 0%) row reference sequence has invalid bases Modification stats: 4,879 out of 1,937,805 ( 0%) replaced 4bp motif with a simplified 2bp motif 235 out of 1,937,805 ( 0%) replaced 6bp motif with a simplified 2bp motif 79 out of 1,937,805 ( 0%) replaced 6bp motif with a simplified 3bp motif 69 out of 1,937,805 ( 0%) replaced 8bp motif with a simplified 4bp motif 4 out of 1,937,805 ( 0%) replaced 9bp motif with a simplified 3bp motif 2 out of 1,937,805 ( 0%) replaced 10bp motif with a simplified 2bp motif 2 out of 1,937,805 ( 0%) replaced 12bp motif with a simplified 4bp motif 2 out of 1,937,805 ( 0%) replaced 10bp motif with a simplified 5bp motif 1 out of 1,937,805 ( 0%) 
replaced 15bp motif with a simplified 5bp motif Stats for /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/merged_expansion_hunter_catalog.78_samples.json.gz STEP #3: python3 -m str_analysis.compute_catalog_stats --verbose /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/merged_expansion_hunter_catalog.78_samples.filtered.json.gz -------------------------------------------------- Parsing /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/merged_expansion_hunter_catalog.78_samples.filtered.json.gz Stats for merged_expansion_hunter_catalog.78_samples.filtered.json.gz: 1,937,160 total loci 31,350,936 base pairs spanned by all loci (1.015% of the genome) 0 out of 1,937,160 ( 0.0%) loci define adjacent repeats 1,937,160 total repeat intervals 1,937,160 out of 1,937,160 (100.0%) repeat interval size is an integer multiple of the motif size (aka. trimmed) 1,400,310 out of 1,937,160 ( 72.3%) repeat intervals are homopolymers 20,352 out of 1,937,160 ( 1.1%) repeat intervals overlap each other by at least two motif lengths Examples of overlapping repeats: chr2:50551002-50551074, chr9:115136253-115136275, chr1:8243248-8243324, chr14:54228655-54228699, chrX:67452380-67452404, chr7:132561814-132561832, chr15:92296062-92296098, chr12:132463497-132463650, chr2:44429273-44429309, chr18:67642068-67642086 Ranges: Motif size range: 1-833bp Locus size range: 1-2499bp Num repeats range: 1-300x repeats Max locus size = 2,499bp @ chrX:71520430-71522929 
(CCAGCACTTTGGGAGGCCGAGGCAGGCTGATCACTAGGTCAGGAGTTCAAGACCAGCCTGGCCAACATGGTGAAACCCCCGTCTCTACTAAAAATACAAAAATTACCTGGGTGTGGGGGTGGGCACCTGTAATCCCAGCTACTCGGGAGGCTGGGGAGGCAGGAGAATTGCCTGAACCTGAGAGGCAGAGGCTGCAGTGAGCTGAGATTGTGCCACTGCACTCCAGCCTGGGCGACAGAGTGAGACTCAGTCTCAAAACAAAAAAAAAAAAAGATTTTAGTAACTTTTATCCTGTTTTAATAATACTGACTCAGAAACTATAATGTGTACTTTATAATTTACTTCCTAGATGACACTTGATTTTCTTCAAGAGCAAGATAGCTGCCCTGTGCAGTTGGTCTCCTTGAAAACTATTTTAGTTCTATCATAATTTCCTGTGATAAATATTTTGACCTTCTAAAATTTCAGAATATTGCACCAAGTAGAAAGAAAATAGGTTTTTTCTCTTTTCTTCTTCTTCCTTTTTTTTTTCTGAGAAAGAGGGAATGAGAACTTTAGTGTTCTTTCAATAGCGTTCTTATTTGTAGAAATGCATAATAGTGTCCTAGTAAGGCTTGACAATAACTCTGGTCTTCATCATATTTTGTGATAAAACTTTTGATTTAAAAAAACCTCTGATCTATTTATCATGGCAAATGGATAGAGCTTTCCTGCCTGTTTTCTTTCTTTTCTTTTTTCTTTCTTTCCTTTTTTTTCCTTTGAGCTTAGATTTTTAGAAGCACATATTTAAAAATCAGGTATAAGACTGGATGCAGTGGCTCACGCCTGTAATC) Min reference repeat purity = 0.68 @ chr20:22538965-22539037 (CCT) Base-level purity median: 1.000, mean: 0.998 chrX: 83,664 out of 1,937,160 ( 4.3%) repeat intervals chrY: 5,755 out of 1,937,160 ( 0.3%) repeat intervals chrM: 0 out of 1,937,160 ( 0.0%) repeat intervals alt contigs: 0 out of 1,937,160 ( 0.0%) repeat intervals Motif size distribution: 1bp: 1,400,310 out of 1,937,160 ( 72.3%) repeat intervals 2bp: 249,968 out of 1,937,160 ( 12.9%) repeat intervals 3bp: 68,910 out of 1,937,160 ( 3.6%) repeat intervals 4bp: 131,029 out of 1,937,160 ( 6.8%) repeat intervals 5bp: 35,605 out of 1,937,160 ( 1.8%) repeat intervals 6bp: 14,601 out of 1,937,160 ( 0.8%) repeat intervals 7-24bp: 24,757 out of 1,937,160 ( 1.3%) repeat intervals 25+bp: 11,980 out of 1,937,160 ( 0.6%) repeat intervals Num repeats in reference: 1x: 10,584 out of 1,937,160 ( 0.5%) repeat intervals 2x: 33,359 out of 1,937,160 ( 1.7%) repeat intervals 3x: 72,960 out of 1,937,160 ( 3.8%) repeat intervals 4x: 58,053 out of 1,937,160 ( 3.0%) repeat intervals 5x: 66,750 out of 1,937,160 ( 3.4%) repeat intervals 6x: 57,218 out of 1,937,160 ( 3.0%) repeat intervals 7x: 51,632 out of 
1,937,160 ( 2.7%) repeat intervals 8x: 246,878 out of 1,937,160 ( 12.7%) repeat intervals 9x: 240,065 out of 1,937,160 ( 12.4%) repeat intervals 10-15x: 700,636 out of 1,937,160 ( 36.2%) repeat intervals 16-25x: 346,599 out of 1,937,160 ( 17.9%) repeat intervals 26-35x: 46,087 out of 1,937,160 ( 2.4%) repeat intervals 36-50x: 5,925 out of 1,937,160 ( 0.3%) repeat intervals 51+x: 414 out of 1,937,160 ( 0.0%) repeat intervals Reference repeat purity distribution: 0.0: 0 out of 1,937,160 ( 0.0%) repeat intervals 0.1: 0 out of 1,937,160 ( 0.0%) repeat intervals 0.2: 0 out of 1,937,160 ( 0.0%) repeat intervals 0.3: 0 out of 1,937,160 ( 0.0%) repeat intervals 0.4: 0 out of 1,937,160 ( 0.0%) repeat intervals 0.5: 0 out of 1,937,160 ( 0.0%) repeat intervals 0.6: 28 out of 1,937,160 ( 0.0%) repeat intervals 0.7: 1,665 out of 1,937,160 ( 0.1%) repeat intervals 0.8: 13,884 out of 1,937,160 ( 0.7%) repeat intervals 0.9: 31,193 out of 1,937,160 ( 1.6%) repeat intervals 1.0: 1,890,390 out of 1,937,160 ( 97.6%) repeat intervals Locus sizes at each motif size: 1bp motifs: locus size range: 1 bp to 90 bp (median: 11 bp) based on 1,400,310 loci. Mean base purity: 1.00. 2bp motifs: locus size range: 2 bp to 600 bp (median: 18 bp) based on 249,968 loci. Mean base purity: 1.00. 3bp motifs: locus size range: 3 bp to 561 bp (median: 12 bp) based on 68,910 loci. Mean base purity: 0.98. 4bp motifs: locus size range: 4 bp to 472 bp (median: 20 bp) based on 131,029 loci. Mean base purity: 0.99. 5bp motifs: locus size range: 5 bp to 400 bp (median: 20 bp) based on 35,605 loci. Mean base purity: 1.00. 6bp motifs: locus size range: 6 bp to 498 bp (median: 18 bp) based on 14,601 loci. Mean base purity: 0.99. 7bp motifs: locus size range: 7 bp to 546 bp (median: 21 bp) based on 4,996 loci. Mean base purity: 1.00. 8bp motifs: locus size range: 8 bp to 312 bp (median: 24 bp) based on 3,579 loci. Mean base purity: 1.00. 
9bp motifs: locus size range: 9 bp to 243 bp (median: 27 bp) based on 1,880 loci. Mean base purity: 0.99. 10bp motifs: locus size range: 10 bp to 260 bp (median: 30 bp) based on 1,886 loci. Mean base purity: 0.99. Wrote 1 rows to merged_expansion_hunter_catalog.78_samples.filtered.catalog_stats.tsv STEP #5: python3 -u -m str_analysis.merge_loci --verbose --add-found-in-fields --output-format JSON --discard-extra-fields-from-input-catalogs --overlapping-loci-action keep-first --write-merge-stats-tsv --write-outer-join-table --write-bed-files-with-unique-loci --outer-join-overlap-table-min-sources 1 --output-prefix /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/repeat_catalog_v1.hg38.1_to_1000bp_motifs.merged KnownDiseaseAssociatedLoci:/Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/variant_catalog_without_offtargets.GRCh38.split.filtered.json.gz Illumina174kPolymorphicTRs:/Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/illumina_variant_catalog.sorted.filtered.json.gz PerfectRepeatsInReference:/Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/hg38_repeats.motifs_1_to_1000bp.repeats_3x_and_spans_9bp.filtered.json.gz PolymorphicTRsInT2TAssemblies:/Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/merged_expansion_hunter_catalog.78_samples.filtered.json.gz - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Parsing /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/variant_catalog_without_offtargets.GRCh38.split.filtered.json.gz Kept 83 out of 83 (100.0%) records from variant_catalog_without_offtargets.GRCh38.split.filtered.json.gz - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Parsing 
/Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/illumina_variant_catalog.sorted.filtered.json.gz Kept 174,244 out of 174,286 (100.0%) records from illumina_variant_catalog.sorted.filtered.json.gz Discarded 42 out of 174,286 ( 0.0%) records since they overlapped an existing locus by at least 66.0% and had the same canonical motif Wrote 174,244 unique loci from illumina_variant_catalog.sorted.filtered.json.gz to /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/repeat_catalog_v1.hg38.1_to_1000bp_motifs.merged.illumina_variant_catalog.sorted.filtered.unique_loci.bed.gz - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Parsing /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/hg38_repeats.motifs_1_to_1000bp.repeats_3x_and_spans_9bp.filtered.json.gz Kept 4,391,197 out of 4,558,281 ( 96.3%) records from hg38_repeats.motifs_1_to_1000bp.repeats_3x_and_spans_9bp.filtered.json.gz Discarded 167,084 out of 4,558,281 ( 3.7%) records since they overlapped an existing locus by at least 66.0% and had the same canonical motif Wrote 4,391,197 unique loci from hg38_repeats.motifs_1_to_1000bp.repeats_3x_and_spans_9bp.filtered.json.gz to /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/repeat_catalog_v1.hg38.1_to_1000bp_motifs.merged.hg38_repeats.motifs_1_to_1000bp.repeats_3x_and_spans_9bp.filtered.unique_loci.bed.gz - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Parsing /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/merged_expansion_hunter_catalog.78_samples.filtered.json.gz Kept 297,517 out of 1,937,160 ( 15.4%) records from merged_expansion_hunter_catalog.78_samples.filtered.json.gz Discarded 1,639,643 out of 1,937,160 ( 84.6%) records since they overlapped an existing locus 
by at least 66.0% and had the same canonical motif Wrote 297,517 unique loci from merged_expansion_hunter_catalog.78_samples.filtered.json.gz to /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/repeat_catalog_v1.hg38.1_to_1000bp_motifs.merged.merged_expansion_hunter_catalog.78_samples.filtered.unique_loci.bed.gz Writing combined catalog to /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/repeat_catalog_v1.hg38.1_to_1000bp_motifs.merged.json.gz Wrote 4,863,041 output records to /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/repeat_catalog_v1.hg38.1_to_1000bp_motifs.merged.json.gz Source of loci in output catalog: 4,391,197 out of 4,863,041 (90.3%) PerfectRepeatsInReference 297,517 out of 4,863,041 ( 6.1%) PolymorphicTRsInT2TAssemblies 174,244 out of 4,863,041 ( 3.6%) Illumina174kPolymorphicTRs 83 out of 4,863,041 ( 0.0%) KnownDiseaseAssociatedLoci Motif sizes: 1,567,337 out of 4,863,041 (32.2%) 1bp 1,432,117 out of 4,863,041 (29.4%) 3bp 978,972 out of 4,863,041 (20.1%) 2bp 590,787 out of 4,863,041 (12.1%) 4bp 177,422 out of 4,863,041 ( 3.6%) 5bp 59,675 out of 4,863,041 ( 1.2%) 7+bp 56,731 out of 4,863,041 ( 1.2%) 6bp Chromsomes: 4,579,579 out of 4,863,041 (94.2%) are on autosomes 244,191 out of 4,863,041 ( 5.0%) are on X 39,257 out of 4,863,041 ( 0.8%) are on Y 14 out of 4,863,041 ( 0.0%) are on M Wrote 5,192,974 rows to /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/repeat_catalog_v1.hg38.1_to_1000bp_motifs.merged.outer_join_overlap_table.tsv.gz Wrote 4 rows to /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/repeat_catalog_v1.hg38.1_to_1000bp_motifs.merged.merge_stats.tsv STEP #6: python3 -u -m str_analysis.annotate_and_filter_str_catalog --verbose --show-progress-bar --reference-fasta /Users/weisburd/code/tandem-repeat-catalogs/hg38.fa --gene-models-source gencode --gene-models-source refseq 
--gene-models-source mane --known-disease-associated-loci /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/variant_catalog_without_offtargets.GRCh38.split.primary_disease_associated_loci.json.gz --min-motif-size 1 --max-motif-size 1000 --min-interval-size-bp 1 --discard-overlapping-intervals-with-similar-motifs --output-path /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/repeat_catalog_v1.hg38.1_to_1000bp_motifs.EH.with_annotations.json.gz /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/repeat_catalog_v1.hg38.1_to_1000bp_motifs.merged.json.gz Args: reference_fasta = /Users/weisburd/code/tandem-repeat-catalogs/hg38.fa output_path = /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/repeat_catalog_v1.hg38.1_to_1000bp_motifs.EH.with_annotations.json.gz variant_catalog_json_or_bed = /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/repeat_catalog_v1.hg38.1_to_1000bp_motifs.merged.json.gz known_disease_associated_loci = /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/variant_catalog_without_offtargets.GRCh38.split.primary_disease_associated_loci.json.gz min_motif_size = 1 min_interval_size_bp = 1 max_motif_size = 1000 output_tsv = False output_bed = False output_stats = False skip_gene_annotations = False skip_disease_loci_annotations = False skip_mappability_annotations = False set_locus_id = False add_gene_region_to_locus_id = False add_canonical_motif_to_locus_id = False only_known_disease_associated_loci = False exclude_known_disease_associated_loci = False only_known_disease_associated_motifs = False discard_loci_with_non_ACGT_bases_in_motif = False discard_loci_with_non_ACGTN_bases_in_motif = False discard_loci_with_non_ACGT_bases_in_reference = False discard_loci_with_non_ACGTN_bases_in_reference = False dont_simplify_motifs = False trim_loci = False verbose = True show_progress_bar = True 
discard_overlapping_intervals_with_similar_motifs = True
gene_models_source = gencode refseq mane
mappability_track_bigwig = gs://tgg-viewer/ref/GRCh38/mappability/GRCh38_no_alt_analysis_set_GCA_000001405.15-k36_m2.bw
Parsing /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/repeat_catalog_v1.hg38.1_to_1000bp_motifs.merged.json.gz
Parsing gs://str-truth-set/hg38/ref/other/gencode.v46.basic.annotation.gtf.gz...
1,949,574 total
865,643 exon
664,522 CDS
182,159 UTR
118,625 transcript
118,625 promoter
Parsing gs://str-truth-set/hg38/ref/other/hg38.ncbiRefSeq.gtf.gz...
Wrote 4,863,041 records to /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/repeat_catalog_v1.hg38.1_to_1000bp_motifs.EH.with_annotations.json.gz
Filter stats:
4,863,041 total input rows
4,863,041 out of 4,863,041 (100%) passed all filters
STEP #7: python3 << EOF
import gzip, ijson, json
f = gzip.open("/Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/repeat_catalog_v1.hg38.1_to_1000bp_motifs.EH.with_annotations.json.gz", "rt")
out = gzip.open("/Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/repeat_catalog_v1.hg38.1_to_1000bp_motifs.EH.json.gz", "wt")
i = 0
out.write("[")
for record in ijson.items(f, "item", use_float=True):
    # skip chrM loci because ExpansionHunter prints an error like: 'Unable to extract chrM:-793-207 from hg38.fa'
    is_chrM = record["LocusId"].startswith("M-") or record["LocusId"].startswith("chrM-")
    if is_chrM: print(f"Skipping chrM locus: {record['LocusId']}")
    if is_chrM: continue
    if i > 0:
        out.write(", ")
    i += 1
    out.write(json.dumps({
        k: v for k, v in record.items() if k in {"LocusId", "ReferenceRegion", "VariantType", "LocusStructure"}
    }, indent=4))
out.write("]")
EOF
Skipping chrM locus: M-207-220-TTAA
Skipping chrM locus: M-231-241-ATA
Skipping chrM locus: M-513-524-CA
Skipping chrM locus: M-5150-5159-CTA
Skipping chrM locus: M-5735-5744-CCG
Skipping chrM locus: M-6572-6582-GGA
Skipping chrM locus: M-9914-9923-GCC
Skipping chrM locus: M-10340-10349-ATC
Skipping chrM locus: M-10893-10903-CAA
Skipping chrM locus: M-12980-12989-CCT
Skipping chrM locus: M-12989-13000-AGC
Skipping chrM locus: M-14106-14115-TTC
Skipping chrM locus: M-14209-14219-ACT
Skipping chrM locus: M-14332-14341-ACC
STEP #8: python3 -m str_analysis.filter_out_loci_with_Ns_in_flanks -R /Users/weisburd/code/tandem-repeat-catalogs/hg38.fa -o /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/repeat_catalog_v1.hg38.1_to_1000bp_motifs.EH.without_loci_with_Ns_in_flanks.json.gz --output-list-of-filtered-loci /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/repeat_catalog_v1.hg38.1_to_1000bp_motifs.loci_with_Ns_in_flanks.txt /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/repeat_catalog_v1.hg38.1_to_1000bp_motifs.EH.json.gz
Parsing /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/repeat_catalog_v1.hg38.1_to_1000bp_motifs.EH.json.gz
Wrote 4,859,281 records to /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/repeat_catalog_v1.hg38.1_to_1000bp_motifs.EH.without_loci_with_Ns_in_flanks.json.gz
Wrote 3,746 loci to /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/repeat_catalog_v1.hg38.1_to_1000bp_motifs.loci_with_Ns_in_flanks.txt
STEP #8: mv /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/repeat_catalog_v1.hg38.1_to_1000bp_motifs.EH.without_loci_with_Ns_in_flanks.json.gz /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/repeat_catalog_v1.hg38.1_to_1000bp_motifs.EH.json.gz
STEP #9: python3 /Users/weisburd/code/tandem-repeat-catalogs/scripts/add_variation_cluster_annotations_to_catalog.py --verbose --output-catalog-json-path
/Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/repeat_catalog_v1.hg38.1_to_1000bp_motifs.EH.with_annotations.with_variation_clusters.json.gz --known-pathogenic-loci-json-path /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/variant_catalog_without_offtargets.GRCh38.split.json.gz /Users/weisburd/code/tandem-repeat-catalogs/vcs_v1.0.bed.gz /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/repeat_catalog_v1.hg38.1_to_1000bp_motifs.EH.with_annotations.json.gz Parsing /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/variant_catalog_without_offtargets.GRCh38.split.json.gz Parsing /Users/weisburd/code/tandem-repeat-catalogs/vcs_v1.0.bed.gz Parsed 273,112 variation clusters that contained 593,325 simple TR ids Found 273,112 out of 273,112 (100.0%) variation clusters differed from simple TRs by at least 6bp Annotating /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/repeat_catalog_v1.hg38.1_to_1000bp_motifs.EH.with_annotations.json.gz with variation cluster annotations STEP #9: mv /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/repeat_catalog_v1.hg38.1_to_1000bp_motifs.EH.with_annotations.with_variation_clusters.json.gz /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/repeat_catalog_v1.hg38.1_to_1000bp_motifs.EH.with_annotations.json.gz STEP #9: cp /Users/weisburd/code/tandem-repeat-catalogs/vcs_v1.0.bed.gz variation_clusters_v1.hg38.TRGT.bed.gz STEP #9: python3 /Users/weisburd/code/tandem-repeat-catalogs/scripts/add_isolated_loci_to_variation_cluster_catalog.py --known-pathogenic-loci-json-path /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/variant_catalog_without_offtargets.GRCh38.split.json.gz -o variation_clusters_and_isolated_TRs_v1.hg38.TRGT.bed.gz /Users/weisburd/code/tandem-repeat-catalogs/vcs_v1.0.bed.gz 
/Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/repeat_catalog_v1.hg38.1_to_1000bp_motifs.EH.with_annotations.json.gz Parsed 273,112 variation clusters from /Users/weisburd/code/tandem-repeat-catalogs/vcs_v1.0.bed.gz Parsed 4,863,041 TRs from /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/repeat_catalog_v1.hg38.1_to_1000bp_motifs.EH.with_annotations.json.gz 593,325 out of 4,863,041 (12.20%) TRs were in variation clusters Added 4,269,716 isolated TRs to variation_clusters_and_isolated_TRs_v1.hg38.TRGT.bed.gz Wrote 4,542,828 rows to variation_clusters_and_isolated_TRs_v1.hg38.TRGT.bed.gz STEP #10: python3 /Users/weisburd/code/tandem-repeat-catalogs/scripts/convert_trgt_catalog_to_longtr_format.py --known-pathogenic-loci-json-path /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/variant_catalog_without_offtargets.GRCh38.split.json.gz variation_clusters_v1.hg38.TRGT.bed.gz Wrote 273,112 rows to variation_clusters_v1.hg38.LongTR.bed.gz STEP #10: python3 /Users/weisburd/code/tandem-repeat-catalogs/scripts/convert_trgt_catalog_to_longtr_format.py --known-pathogenic-loci-json-path /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/variant_catalog_without_offtargets.GRCh38.split.json.gz variation_clusters_and_isolated_TRs_v1.hg38.TRGT.bed.gz Wrote 4,542,828 rows to variation_clusters_and_isolated_TRs_v1.hg38.LongTR.bed.gz STEP #11: python3 -u /Users/weisburd/code/tandem-repeat-catalogs/scripts/add_allele_frequency_annotations.py --add-t2t-assembly-frequencies-to-overlapping-loci -o /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/repeat_catalog_v1.hg38.1_to_1000bp_motifs.EH.with_annotations.json.gz.with_allele_frequencies.json.gz /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/repeat_catalog_v1.hg38.1_to_1000bp_motifs.EH.with_annotations.json.gz Loading allele frequencies for the Illumina 174k catalog from 
https://github.com/Illumina/RepeatCatalogs/raw/master/hg38/genotype/1000genomes/1kg.gt.hist.tsv.gz Parsed 174,262 rows Computing histograms for Illumina 174k Processed allele frequency histograms for 174,262 rows and computed 174,262 records Loading allele frequencies for the catalog of polymorphic loci in T2T assemblies from gs://str-truth-set-v2/filter_vcf/all_repeats_including_homopolymers_keeping_loci_that_have_overlapping_variants/combined/joined.78_samples.variants.tsv.gz Parsed 2,038,349 rows Computing histograms for T2T assemblies Processed allele frequency histograms from 2,038,349 rows and computed 2,033,191 records Parsing and annotating /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/repeat_catalog_v1.hg38.1_to_1000bp_motifs.EH.with_annotations.json.gz Annotated 173,885 out of 4,863,041 loci in the Illumina 174k allele frequency catalog Annotated 1,722,623 out of 4,863,041 loci in the T2T assemblies allele frequency catalog Annotated 221,053 out of 4,863,041 loci in the T2T assemblies allele frequency catalog based on overlap Wrote 4,863,041 records to /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/repeat_catalog_v1.hg38.1_to_1000bp_motifs.EH.with_annotations.json.gz.with_allele_frequencies.json.gz STEP #11: mv /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/repeat_catalog_v1.hg38.1_to_1000bp_motifs.EH.with_annotations.json.gz.with_allele_frequencies.json.gz /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/repeat_catalog_v1.hg38.1_to_1000bp_motifs.EH.with_annotations.json.gz STEP #12: python3 /Users/weisburd/code/tandem-repeat-catalogs/scripts/add_LPS_stdev_annotations_to_catalog.py --output-catalog-json-path /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/repeat_catalog_v1.hg38.1_to_1000bp_motifs.EH.with_annotations.with_LPS_annotations.json.gz --known-pathogenic-loci-json-path 
/Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/variant_catalog_without_offtargets.GRCh38.split.json.gz /Users/weisburd/code/tandem-repeat-catalogs/HPRC_100_LongestPureSegmentQuantiles.txt.gz /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/repeat_catalog_v1.hg38.1_to_1000bp_motifs.EH.with_annotations.json.gz Parsing /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/variant_catalog_without_offtargets.GRCh38.split.json.gz Parsed 83 known pathogenic loci Parsing /Users/weisburd/code/tandem-repeat-catalogs/HPRC_100_LongestPureSegmentQuantiles.txt.gz Filtered out 164,764 out of 4,907,968 (3.4%) records with missing values Adding LPS annotations to /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/repeat_catalog_v1.hg38.1_to_1000bp_motifs.EH.with_annotations.json.gz Annotated 4,633,521 out of 4,863,041 loci Wrote annotated catalog to /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/repeat_catalog_v1.hg38.1_to_1000bp_motifs.EH.with_annotations.with_LPS_annotations.json.gz STEP #12: mv /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/repeat_catalog_v1.hg38.1_to_1000bp_motifs.EH.with_annotations.with_LPS_annotations.json.gz /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/repeat_catalog_v1.hg38.1_to_1000bp_motifs.EH.with_annotations.json.gz STEP #13: python3 -m str_analysis.convert_expansion_hunter_catalog_to_bed --split-adjacent-repeats /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/repeat_catalog_v1.hg38.1_to_1000bp_motifs.EH.with_annotations.json.gz --output-file /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/repeat_catalog_v1.hg38.1_to_1000bp_motifs.bed.gz Parsing 
/Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/repeat_catalog_v1.hg38.1_to_1000bp_motifs.EH.with_annotations.json.gz Wrote 4,863,041 rows to /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/repeat_catalog_v1.hg38.1_to_1000bp_motifs.bed.gz Added /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/repeat_catalog_v1.hg38.1_to_1000bp_motifs.bed.gz.tbi index STEP #14: python3 -m str_analysis.add_adjacent_loci_to_expansion_hunter_catalog --ref-fasta /Users/weisburd/code/tandem-repeat-catalogs/hg38.fa --source-of-adjacent-loci /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/repeat_catalog_v1.hg38.1_to_1000bp_motifs.bed.gz --add-extra-field TRsInRegion --only-add-extra-fields -o /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/repeat_catalog_v1.hg38.1_to_1000bp_motifs.EH.with_annotations.json.gz.with_adjacent_loci_annotation.json.gz /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/repeat_catalog_v1.hg38.1_to_1000bp_motifs.EH.with_annotations.json.gz Loading input catalog /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/repeat_catalog_v1.hg38.1_to_1000bp_motifs.EH.with_annotations.json.gz Loaded 4,863,041 records from /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/repeat_catalog_v1.hg38.1_to_1000bp_motifs.EH.with_annotations.json.gz Processing 1 input variant catalog: /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/repeat_catalog_v1.hg38.1_to_1000bp_motifs.EH.with_annotations.json.gz Parsing potential adjacent repeats from a 248,938,407bp region (chr1:9,000-248,947,406) in /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/repeat_catalog_v1.hg38.1_to_1000bp_motifs.bed.gz... 
Parsing potential adjacent repeats from a 133,779,423bp region (chr10:9,000-133,788,422) in /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/repeat_catalog_v1.hg38.1_to_1000bp_motifs.bed.gz...
Parsing potential adjacent repeats from a 135,017,946bp region (chr11:59,677-135,077,622) in /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/repeat_catalog_v1.hg38.1_to_1000bp_motifs.bed.gz...
Parsing potential adjacent repeats from a 133,257,310bp region (chr12:9,000-133,266,309) in /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/repeat_catalog_v1.hg38.1_to_1000bp_motifs.bed.gz...
Parsing potential adjacent repeats from a 98,351,024bp region (chr13:16,004,305-114,355,328) in /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/repeat_catalog_v1.hg38.1_to_1000bp_motifs.bed.gz...
Parsing potential adjacent repeats from a 90,862,168bp region (chr14:16,021,663-106,883,830) in /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/repeat_catalog_v1.hg38.1_to_1000bp_motifs.bed.gz...
Parsing potential adjacent repeats from a 84,972,514bp region (chr15:17,009,676-101,982,189) in /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/repeat_catalog_v1.hg38.1_to_1000bp_motifs.bed.gz...
Parsing potential adjacent repeats from a 90,218,717bp region (chr16:9,000-90,227,716) in /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/repeat_catalog_v1.hg38.1_to_1000bp_motifs.bed.gz...
Parsing potential adjacent repeats from a 83,189,077bp region (chr17:59,364-83,248,440) in /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/repeat_catalog_v1.hg38.1_to_1000bp_motifs.bed.gz...
Parsing potential adjacent repeats from a 80,255,286bp region (chr18:9,000-80,264,285) in /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/repeat_catalog_v1.hg38.1_to_1000bp_motifs.bed.gz...
Parsing potential adjacent repeats from a 58,548,834bp region (chr19:59,783-58,608,616) in /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/repeat_catalog_v1.hg38.1_to_1000bp_motifs.bed.gz...
Parsing potential adjacent repeats from a 242,175,497bp region (chr2:9,033-242,184,529) in /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/repeat_catalog_v1.hg38.1_to_1000bp_motifs.bed.gz...
Parsing potential adjacent repeats from a 64,275,021bp region (chr20:59,291-64,334,311) in /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/repeat_catalog_v1.hg38.1_to_1000bp_motifs.bed.gz...
Parsing potential adjacent repeats from a 41,691,415bp region (chr21:5,009,569-46,700,983) in /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/repeat_catalog_v1.hg38.1_to_1000bp_motifs.bed.gz...
Parsing potential adjacent repeats from a 40,300,126bp region (chr22:10,509,343-50,809,468) in /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/repeat_catalog_v1.hg38.1_to_1000bp_motifs.bed.gz...
Parsing potential adjacent repeats from a 198,227,522bp region (chr3:9,000-198,236,521) in /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/repeat_catalog_v1.hg38.1_to_1000bp_motifs.bed.gz...
Parsing potential adjacent repeats from a 190,196,556bp region (chr4:9,000-190,205,555) in /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/repeat_catalog_v1.hg38.1_to_1000bp_motifs.bed.gz...
Parsing potential adjacent repeats from a 181,469,869bp region (chr5:9,000-181,478,868) in /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/repeat_catalog_v1.hg38.1_to_1000bp_motifs.bed.gz...
Parsing potential adjacent repeats from a 170,686,146bp region (chr6:59,344-170,745,489) in /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/repeat_catalog_v1.hg38.1_to_1000bp_motifs.bed.gz...
Parsing potential adjacent repeats from a 159,327,974bp region (chr7:9,000-159,336,973) in /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/repeat_catalog_v1.hg38.1_to_1000bp_motifs.bed.gz...
Parsing potential adjacent repeats from a 145,018,545bp region (chr8:59,386-145,077,930) in /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/repeat_catalog_v1.hg38.1_to_1000bp_motifs.bed.gz...
Parsing potential adjacent repeats from a 138,326,499bp region (chr9:9,000-138,335,498) in /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/repeat_catalog_v1.hg38.1_to_1000bp_motifs.bed.gz...
Parsing potential adjacent repeats from a 15,342bp region (chrM:0-15,341) in /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/repeat_catalog_v1.hg38.1_to_1000bp_motifs.bed.gz...
Parsing potential adjacent repeats from a 156,022,873bp region (chrX:9,000-156,031,872) in /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/repeat_catalog_v1.hg38.1_to_1000bp_motifs.bed.gz...
Parsing potential adjacent repeats from a 54,108,131bp region (chrY:2,780,761-56,888,891) in /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/repeat_catalog_v1.hg38.1_to_1000bp_motifs.bed.gz...
Processing /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/repeat_catalog_v1.hg38.1_to_1000bp_motifs.EH.with_annotations.json.gz The input catalog(s) did not have any adjacent loci specified Added adjacent loci to 0 out of 4,863,041 records (0.0%): Wrote 4,863,041 total records to /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/repeat_catalog_v1.hg38.1_to_1000bp_motifs.EH.with_annotations.json.gz.with_adjacent_loci_annotation.json.gz STEP #14: mv /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/repeat_catalog_v1.hg38.1_to_1000bp_motifs.EH.with_annotations.json.gz.with_adjacent_loci_annotation.json.gz /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/repeat_catalog_v1.hg38.1_to_1000bp_motifs.EH.with_annotations.json.gz STEP #15: python3 << EOF import gzip import json import pandas as pd from pprint import pformat with gzip.open("/Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/repeat_catalog_v1.hg38.1_to_1000bp_motifs.EH.with_annotations.json.gz", "rt") as f: data = json.load(f) print(f"Loaded {len(data):,d} records from /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/repeat_catalog_v1.hg38.1_to_1000bp_motifs.EH.with_annotations.json.gz") print("Writing it to TSV file...") df = pd.DataFrame(data) core_columns = [ 'LocusId', 'ReferenceRegion', 'LocusStructure', 'CanonicalMotif', 'TRsInRegion', 'Source', 'GencodeGeneRegion', 'GencodeGeneId', 'GencodeGeneName', 'GencodeTranscriptId', 'RefseqGeneRegion', 'RefseqGeneId', 'RefseqGeneName', 'RefseqTranscriptId', 'ManeGeneRegion', 'ManeGeneId', 'ManeGeneName', 'ManeTranscriptId', 'KnownDiseaseAssociatedMotif', 'KnownDiseaseAssociatedLocus', 'NsInFlanks', 'LeftFlankMappability', 'FlanksAndLocusMappability', 'RightFlankMappability', 'FoundInKnownDiseaseAssociatedLoci', 'FoundInIllumina174kPolymorphicTRs', 
'FoundInPerfectRepeatsInReference', 'FoundInPolymorphicTRsInT2TAssemblies', 'NumRepeatsInReference', 'ReferenceRepeatPurity', 'AlleleFrequenciesFromIllumina174k', 'StdevFromIllumina174k', 'AlleleFrequenciesFromT2TAssemblies', 'StdevFromT2TAssemblies', 'VariationCluster', 'VariationClusterSizeDiff', 'LPSLengthStdevFromHPRC100', 'LPSMotifFractionFromHPRC100', ] drop_columns = ['VariantType', ] for c in set(core_columns) - set(df.columns): df[c] = None df = df[core_columns + [c for c in df.columns if c not in (core_columns + drop_columns)]] output_tsv_path = "/Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/repeat_catalog_v1.hg38.1_to_1000bp_motifs.EH.with_annotations.tsv.gz" df.to_csv(output_tsv_path, sep="\t", index=False) print(f"Wrote {len(df):,d} rows to {output_tsv_path} with columns: {pformat(list(df.columns))}") EOF Loaded 4,863,041 records from /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/repeat_catalog_v1.hg38.1_to_1000bp_motifs.EH.with_annotations.json.gz Writing it to TSV file... 
Wrote 4,863,041 rows to /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/repeat_catalog_v1.hg38.1_to_1000bp_motifs.EH.with_annotations.tsv.gz with columns: ['LocusId', 'ReferenceRegion', 'LocusStructure', 'CanonicalMotif', 'TRsInRegion', 'Source', 'GencodeGeneRegion', 'GencodeGeneId', 'GencodeGeneName', 'GencodeTranscriptId', 'RefseqGeneRegion', 'RefseqGeneId', 'RefseqGeneName', 'RefseqTranscriptId', 'ManeGeneRegion', 'ManeGeneId', 'ManeGeneName', 'ManeTranscriptId', 'KnownDiseaseAssociatedMotif', 'KnownDiseaseAssociatedLocus', 'NsInFlanks', 'LeftFlankMappability', 'FlanksAndLocusMappability', 'RightFlankMappability', 'FoundInKnownDiseaseAssociatedLoci', 'FoundInIllumina174kPolymorphicTRs', 'FoundInPerfectRepeatsInReference', 'FoundInPolymorphicTRsInT2TAssemblies', 'NumRepeatsInReference', 'ReferenceRepeatPurity', 'AlleleFrequenciesFromIllumina174k', 'StdevFromIllumina174k', 'AlleleFrequenciesFromT2TAssemblies', 'StdevFromT2TAssemblies', 'VariationCluster', 'VariationClusterSizeDiff', 'LPSLengthStdevFromHPRC100', 'LPSMotifFractionFromHPRC100'] STEP #16: python3 -m str_analysis.convert_expansion_hunter_catalog_to_trgt_catalog --split-adjacent-repeats /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/repeat_catalog_v1.hg38.1_to_1000bp_motifs.EH.with_annotations.json.gz --output-file /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/repeat_catalog_v1.hg38.1_to_1000bp_motifs.TRGT.bed Parsing /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/repeat_catalog_v1.hg38.1_to_1000bp_motifs.EH.with_annotations.json.gz Wrote 4,863,041 out of 4,863,041 rows to /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/repeat_catalog_v1.hg38.1_to_1000bp_motifs.TRGT.bed STEP #17: python3 -m str_analysis.convert_expansion_hunter_catalog_to_longtr_format 
/Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/repeat_catalog_v1.hg38.1_to_1000bp_motifs.EH.with_annotations.json.gz --output-file /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/repeat_catalog_v1.hg38.1_to_1000bp_motifs.LongTR.bed Parsing /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/repeat_catalog_v1.hg38.1_to_1000bp_motifs.EH.with_annotations.json.gz Wrote out /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/repeat_catalog_v1.hg38.1_to_1000bp_motifs.LongTR.bed STEP #18: python3 -m str_analysis.convert_expansion_hunter_catalog_to_hipstr_format /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/repeat_catalog_v1.hg38.1_to_1000bp_motifs.EH.with_annotations.json.gz --output-file /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/repeat_catalog_v1.hg38.1_to_1000bp_motifs.HipSTR.bed Parsing /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/repeat_catalog_v1.hg38.1_to_1000bp_motifs.EH.with_annotations.json.gz Wrote out /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/repeat_catalog_v1.hg38.1_to_1000bp_motifs.HipSTR.bed STEP #19: python3 -m str_analysis.convert_expansion_hunter_catalog_to_gangstr_spec /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/repeat_catalog_v1.hg38.1_to_1000bp_motifs.EH.with_annotations.json.gz --output-file /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/repeat_catalog_v1.hg38.1_to_1000bp_motifs.GangSTR.bed Parsing /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/repeat_catalog_v1.hg38.1_to_1000bp_motifs.EH.with_annotations.json.gz Wrote out /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/repeat_catalog_v1.hg38.1_to_1000bp_motifs.GangSTR.bed 
STEP #20: trgt validate --genome /Users/weisburd/code/tandem-repeat-catalogs/hg38.fa --repeats /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/repeat_catalog_v1.hg38.1_to_1000bp_motifs.TRGT.bed STEP #21: python3 /Users/weisburd/code/tandem-repeat-catalogs/scripts/validate_catalog.py --known-pathogenic-loci-json-path /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/variant_catalog_without_offtargets.GRCh38.split.json.gz --check-for-presence-of-annotations --check-for-presence-of-all-known-loci /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/repeat_catalog_v1.hg38.1_to_1000bp_motifs.EH.with_annotations.json.gz Done. Catalog passed validation. STEP #22: python3 /Users/weisburd/code/tandem-repeat-catalogs/scripts/validate_json.py -k LocusId -k LocusStructure -k ReferenceRegion -k VariantType /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/repeat_catalog_v1.hg38.1_to_1000bp_motifs.EH.with_annotations.json.gz All 4,863,041 rows PASSED validation STEP #22: python3 /Users/weisburd/code/tandem-repeat-catalogs/scripts/validate_json.py -k LocusId -k LocusStructure -k ReferenceRegion -k VariantType /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/repeat_catalog_v1.hg38.1_to_1000bp_motifs.EH.json.gz All 4,859,281 rows PASSED validation STEP #22: gzip -f /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/repeat_catalog_v1.hg38.1_to_1000bp_motifs.TRGT.bed STEP #22: bgzip -f /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/repeat_catalog_v1.hg38.1_to_1000bp_motifs.LongTR.bed STEP #22: bgzip -f /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/repeat_catalog_v1.hg38.1_to_1000bp_motifs.HipSTR.bed STEP #22: bgzip -f 
/Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/repeat_catalog_v1.hg38.1_to_1000bp_motifs.GangSTR.bed STEP #22: cp /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/repeat_catalog_v1.hg38.1_to_1000bp_motifs.bed.gz /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/release_draft_2024-10-01 STEP #22: cp /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/repeat_catalog_v1.hg38.1_to_1000bp_motifs.bed.gz.tbi /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/release_draft_2024-10-01 STEP #22: cp /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/repeat_catalog_v1.hg38.1_to_1000bp_motifs.EH.with_annotations.json.gz /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/release_draft_2024-10-01 STEP #22: cp /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/repeat_catalog_v1.hg38.1_to_1000bp_motifs.EH.json.gz /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/release_draft_2024-10-01 STEP #22: cp /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/repeat_catalog_v1.hg38.1_to_1000bp_motifs.TRGT.bed.gz /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/release_draft_2024-10-01 STEP #22: cp /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/repeat_catalog_v1.hg38.1_to_1000bp_motifs.LongTR.bed.gz /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/release_draft_2024-10-01 STEP #22: cp /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/repeat_catalog_v1.hg38.1_to_1000bp_motifs.HipSTR.bed.gz /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/release_draft_2024-10-01 STEP #22: cp /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/repeat_catalog_v1.hg38.1_to_1000bp_motifs.GangSTR.bed.gz 
/Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/release_draft_2024-10-01 STEP #22: cp variation_clusters_v1.hg38.TRGT.bed.gz /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/release_draft_2024-10-01 STEP #22: cp variation_clusters_and_isolated_TRs_v1.hg38.TRGT.bed.gz /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/release_draft_2024-10-01 STEP #22: cp variation_clusters_v1.hg38.LongTR.bed.gz /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/release_draft_2024-10-01 STEP #22: cp variation_clusters_and_isolated_TRs_v1.hg38.LongTR.bed.gz /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/release_draft_2024-10-01 STEP #23: python3 -m str_analysis.compute_catalog_stats --verbose /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/repeat_catalog_v1.hg38.1_to_1000bp_motifs.EH.with_annotations.json.gz -------------------------------------------------- Parsing /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/repeat_catalog_v1.hg38.1_to_1000bp_motifs.EH.with_annotations.json.gz Stats for repeat_catalog_v1.hg38.1_to_1000bp_motifs.EH.with_annotations.json.gz: 4,863,041 total loci 65,678,112 base pairs spanned by all loci (2.127% of the genome) 0 out of 4,863,041 ( 0.0%) loci define adjacent repeats 4,863,041 total repeat intervals 3,210,115 out of 4,863,041 ( 66.0%) repeat interval size is an integer multiple of the motif size (aka. 
trimmed) 1,567,337 out of 4,863,041 ( 32.2%) repeat intervals are homopolymers 18,340 out of 4,863,041 ( 0.4%) repeat intervals overlap each other by at least two motif lengths 11 out of 4,863,041 ( 0.0%) repeat intervals have non-ACGT motifs Examples of overlapping repeats: chr1:82008141-82008152, chr3:78937990-78938032, chr4:1046750-1046794, chr5:52437153-52437201, chr6:34683425-34683453, chr4:107646310-107646327, chr18:52413438-52413450, chr6:150149299-150149323, chr7:40295582-40295597, chr9:35561915-35561931 Ranges: Motif size range: 1-833bp Locus size range: 1-2523bp Num repeats range: 1-300x repeats Max locus size = 2,523bp @ chrX:71520430-71522953 (CCAGCACTTTGGGAGGCCGAGGCAGGCTGATCACTAGGTCAGGAGTTCAAGACCAGCCTGGCCAACATGGTGAAACCCCCGTCTCTACTAAAAATACAAAAATTACCTGGGTGTGGGGGTGGGCACCTGTAATCCCAGCTACTCGGGAGGCTGGGGAGGCAGGAGAATTGCCTGAACCTGAGAGGCAGAGGCTGCAGTGAGCTGAGATTGTGCCACTGCACTCCAGCCTGGGCGACAGAGTGAGACTCAGTCTCAAAACAAAAAAAAAAAAAGATTTTAGTAACTTTTATCCTGTTTTAATAATACTGACTCAGAAACTATAATGTGTACTTTATAATTTACTTCCTAGATGACACTTGATTTTCTTCAAGAGCAAGATAGCTGCCCTGTGCAGTTGGTCTCCTTGAAAACTATTTTAGTTCTATCATAATTTCCTGTGATAAATATTTTGACCTTCTAAAATTTCAGAATATTGCACCAAGTAGAAAGAAAATAGGTTTTTTCTCTTTTCTTCTTCTTCCTTTTTTTTTTCTGAGAAAGAGGGAATGAGAACTTTAGTGTTCTTTCAATAGCGTTCTTATTTGTAGAAATGCATAATAGTGTCCTAGTAAGGCTTGACAATAACTCTGGTCTTCATCATATTTTGTGATAAAACTTTTGATTTAAAAAAACCTCTGATCTATTTATCATGGCAAATGGATAGAGCTTTCCTGCCTGTTTTCTTTCTTTTCTTTTTTCTTTCTTTCCTTTTTTTTCCTTTGAGCTTAGATTTTTAGAAGCACATATTTAAAAATCAGGTATAAGACTGGATGCAGTGGCTCACGCCTGTAATC) Min reference repeat purity = 0.43 @ chr3:112804380-112804514 (TCT) Min overall mappability = 0.00 @ chrY:56887882-56887891 (TGA) Base-level purity median: 1.000, mean: 0.999 chrX: 244,191 out of 4,863,041 ( 5.0%) repeat intervals chrY: 39,257 out of 4,863,041 ( 0.8%) repeat intervals chrM: 14 out of 4,863,041 ( 0.0%) repeat intervals alt contigs: 0 out of 4,863,041 ( 0.0%) repeat intervals Motif size distribution: 1bp: 1,567,337 out of 4,863,041 ( 32.2%) repeat intervals 2bp: 978,972 out of 
4,863,041 ( 20.1%) repeat intervals 3bp: 1,432,117 out of 4,863,041 ( 29.4%) repeat intervals 4bp: 590,787 out of 4,863,041 ( 12.1%) repeat intervals 5bp: 177,422 out of 4,863,041 ( 3.6%) repeat intervals 6bp: 56,731 out of 4,863,041 ( 1.2%) repeat intervals 7-24bp: 43,996 out of 4,863,041 ( 0.9%) repeat intervals 25+bp: 15,679 out of 4,863,041 ( 0.3%) repeat intervals Num repeats in reference: 1x: 10,443 out of 4,863,041 ( 0.2%) repeat intervals 2x: 38,922 out of 4,863,041 ( 0.8%) repeat intervals 3x: 1,799,189 out of 4,863,041 ( 37.0%) repeat intervals 4x: 650,397 out of 4,863,041 ( 13.4%) repeat intervals 5x: 356,525 out of 4,863,041 ( 7.3%) repeat intervals 6x: 151,893 out of 4,863,041 ( 3.1%) repeat intervals 7x: 85,760 out of 4,863,041 ( 1.8%) repeat intervals 8x: 257,475 out of 4,863,041 ( 5.3%) repeat intervals 9x: 352,993 out of 4,863,041 ( 7.3%) repeat intervals 10-15x: 759,188 out of 4,863,041 ( 15.6%) repeat intervals 16-25x: 348,837 out of 4,863,041 ( 7.2%) repeat intervals 26-35x: 45,478 out of 4,863,041 ( 0.9%) repeat intervals 36-50x: 5,610 out of 4,863,041 ( 0.1%) repeat intervals 51+x: 331 out of 4,863,041 ( 0.0%) repeat intervals Reference repeat purity distribution: 0.0: 0 out of 4,863,041 ( 0.0%) repeat intervals 0.1: 0 out of 4,863,041 ( 0.0%) repeat intervals 0.2: 0 out of 4,863,041 ( 0.0%) repeat intervals 0.3: 0 out of 4,863,041 ( 0.0%) repeat intervals 0.4: 3 out of 4,863,041 ( 0.0%) repeat intervals 0.5: 14 out of 4,863,041 ( 0.0%) repeat intervals 0.6: 44 out of 4,863,041 ( 0.0%) repeat intervals 0.7: 1,570 out of 4,863,041 ( 0.0%) repeat intervals 0.8: 12,336 out of 4,863,041 ( 0.3%) repeat intervals 0.9: 21,126 out of 4,863,041 ( 0.4%) repeat intervals 1.0: 4,827,948 out of 4,863,041 ( 99.3%) repeat intervals Mappability distribution: 0.0: 154,279 out of 4,863,041 ( 3.2%) loci 0.1: 214,471 out of 4,863,041 ( 4.4%) loci 0.2: 246,877 out of 4,863,041 ( 5.1%) loci 0.3: 236,856 out of 4,863,041 ( 4.9%) loci 0.4: 391,388 out of 4,863,041 ( 
8.0%) loci 0.5: 561,639 out of 4,863,041 ( 11.5%) loci 0.6: 352,273 out of 4,863,041 ( 7.2%) loci 0.7: 306,208 out of 4,863,041 ( 6.3%) loci 0.8: 337,715 out of 4,863,041 ( 6.9%) loci 0.9: 626,047 out of 4,863,041 ( 12.9%) loci 1.0: 1,435,288 out of 4,863,041 ( 29.5%) loci Locus sizes at each motif size: 1bp motifs: locus size range: 1 bp to 90 bp (median: 11 bp) based on 1,567,337 loci. Mean base purity: 1.00. Mean mappability: 0.66 2bp motifs: locus size range: 2 bp to 600 bp (median: 10 bp) based on 978,972 loci. Mean base purity: 1.00. Mean mappability: 0.76 3bp motifs: locus size range: 3 bp to 632 bp (median: 9 bp) based on 1,432,117 loci. Mean base purity: 1.00. Mean mappability: 0.75 4bp motifs: locus size range: 4 bp to 533 bp (median: 14 bp) based on 590,787 loci. Mean base purity: 1.00. Mean mappability: 0.68 5bp motifs: locus size range: 5 bp to 400 bp (median: 18 bp) based on 177,422 loci. Mean base purity: 1.00. Mean mappability: 0.61 6bp motifs: locus size range: 6 bp to 1,103 bp (median: 20 bp) based on 56,731 loci. Mean base purity: 1.00. Mean mappability: 0.62 7bp motifs: locus size range: 7 bp to 151 bp (median: 22 bp) based on 15,083 loci. Mean base purity: 1.00. Mean mappability: 0.58 8bp motifs: locus size range: 8 bp to 312 bp (median: 25 bp) based on 7,107 loci. Mean base purity: 1.00. Mean mappability: 0.57 9bp motifs: locus size range: 9 bp to 153 bp (median: 28 bp) based on 3,231 loci. Mean base purity: 1.00. Mean mappability: 0.51 10bp motifs: locus size range: 10 bp to 150 bp (median: 31 bp) based on 2,713 loci. Mean base purity: 1.00. Mean mappability: 0.50 Wrote 1 rows to repeat_catalog_v1.hg38.1_to_1000bp_motifs.EH.with_annotations.catalog_stats.tsv Done generating /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/repeat_catalog_v1.hg38.1_to_1000bp_motifs catalog. 
Took 4h, 43m, 28s wget -O hg38_ver17.bed.gz.tmp -qnc https://s3.amazonaws.com/gangstr/hg38/genomewide/hg38_ver17.bed.gz && mv hg38_ver17.bed.gz.tmp hg38_ver17.bed.gz STEP #30: python3 -u -m str_analysis.convert_gangstr_spec_to_expansion_hunter_catalog --verbose /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/hg38_ver17.bed.gz -o /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/hg38_ver17.json.gz Parsing /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/hg38_ver17.bed.gz Writing records to /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/hg38_ver17.json.gz Stats: {'total input loci': 1340266, 'trimmed locus': 1340266} Wrote 1,340,266 records to /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/hg38_ver17.json.gz STEP #31: python3 -m str_analysis.annotate_and_filter_str_catalog --reference-fasta /Users/weisburd/code/tandem-repeat-catalogs/hg38.fa --skip-gene-annotations --skip-mappability-annotations --skip-disease-loci-annotations --min-motif-size 1 --max-motif-size 1000 --output-path /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/hg38_ver17.filtered.json.gz --verbose /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/hg38_ver17.json.gz Args: reference_fasta = /Users/weisburd/code/tandem-repeat-catalogs/hg38.fa output_path = /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/hg38_ver17.filtered.json.gz variant_catalog_json_or_bed = /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/hg38_ver17.json.gz min_motif_size = 1 max_motif_size = 1000 output_tsv = False output_bed = False output_stats = False show_progress_bar = False set_locus_id = False add_gene_region_to_locus_id = False add_canonical_motif_to_locus_id = False only_known_disease_associated_loci = False 
exclude_known_disease_associated_loci = False only_known_disease_associated_motifs = False discard_overlapping_intervals_with_similar_motifs = False discard_loci_with_non_ACGT_bases_in_motif = False discard_loci_with_non_ACGTN_bases_in_motif = False discard_loci_with_non_ACGT_bases_in_reference = False discard_loci_with_non_ACGTN_bases_in_reference = False dont_simplify_motifs = False trim_loci = False verbose = True skip_gene_annotations = True skip_disease_loci_annotations = True skip_mappability_annotations = True mappability_track_bigwig = gs://tgg-viewer/ref/GRCh38/mappability/GRCh38_no_alt_analysis_set_GCA_000001405.15-k36_m2.bw Parsing /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/hg38_ver17.json.gz Wrote 1,340,266 records to /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/hg38_ver17.filtered.json.gz Filter stats: 1,340,266 total input rows 1,340,266 out of 1,340,266 (100%) passed all filters STEP #32: python3 -m str_analysis.compute_catalog_stats --verbose /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/hg38_ver17.filtered.json.gz -------------------------------------------------- Parsing /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/hg38_ver17.filtered.json.gz Stats for hg38_ver17.filtered.json.gz: 1,340,266 total loci 22,952,364 base pairs spanned by all loci (0.743% of the genome) 0 out of 1,340,266 ( 0.0%) loci define adjacent repeats 1,340,266 total repeat intervals 1,340,266 out of 1,340,266 (100.0%) repeat interval size is an integer multiple of the motif size (aka. 
trimmed) 488,668 out of 1,340,266 ( 36.5%) repeat intervals are homopolymers 16 out of 1,340,266 ( 0.0%) repeat intervals overlap each other by at least two motif lengths Examples of overlapping repeats: chr11:2171085-2171113, chr15:96831014-96831039, chr11:2171087-2171115, chr20:2652733-2652757, chr2:1489650-1489682, chr9:27573523-27573541, chr20:2652732-2652756, chr15:96831011-96831036, chr5:123775555-123775599, chr5:146878727-146878757 Ranges: Motif size range: 1-20bp Locus size range: 10-612bp Num repeats range: 3-248x repeats Max locus size = 612bp @ chr4:10134483-10135095 (ATCCCAATTGATGGAGA) Min reference repeat purity = 0.95 @ chr6:170561907-170562021 (CAG) Base-level purity median: 1.000, mean: 1.000 chrX: 65,719 out of 1,340,266 ( 4.9%) repeat intervals chrY: 8,979 out of 1,340,266 ( 0.7%) repeat intervals chrM: 0 out of 1,340,266 ( 0.0%) repeat intervals alt contigs: 0 out of 1,340,266 ( 0.0%) repeat intervals Motif size distribution: 1bp: 488,668 out of 1,340,266 ( 36.5%) repeat intervals 2bp: 179,518 out of 1,340,266 ( 13.4%) repeat intervals 3bp: 150,463 out of 1,340,266 ( 11.2%) repeat intervals 4bp: 377,644 out of 1,340,266 ( 28.2%) repeat intervals 5bp: 106,135 out of 1,340,266 ( 7.9%) repeat intervals 6bp: 27,654 out of 1,340,266 ( 2.1%) repeat intervals 7-24bp: 10,184 out of 1,340,266 ( 0.8%) repeat intervals 25+bp: 0 out of 1,340,266 ( 0.0%) repeat intervals Num repeats in reference: 1x: 0 out of 1,340,266 ( 0.0%) repeat intervals 2x: 0 out of 1,340,266 ( 0.0%) repeat intervals 3x: 365,043 out of 1,340,266 ( 27.2%) repeat intervals 4x: 178,164 out of 1,340,266 ( 13.3%) repeat intervals 5x: 62,042 out of 1,340,266 ( 4.6%) repeat intervals 6x: 81,945 out of 1,340,266 ( 6.1%) repeat intervals 7x: 41,195 out of 1,340,266 ( 3.1%) repeat intervals 8x: 22,799 out of 1,340,266 ( 1.7%) repeat intervals 9x: 15,022 out of 1,340,266 ( 1.1%) repeat intervals 10-15x: 270,718 out of 1,340,266 ( 20.2%) repeat intervals 16-25x: 260,685 out of 1,340,266 ( 19.5%) 
repeat intervals 26-35x: 37,641 out of 1,340,266 ( 2.8%) repeat intervals 36-50x: 4,785 out of 1,340,266 ( 0.4%) repeat intervals 51+x: 227 out of 1,340,266 ( 0.0%) repeat intervals Reference repeat purity distribution: 0.0: 0 out of 1,340,266 ( 0.0%) repeat intervals 0.1: 0 out of 1,340,266 ( 0.0%) repeat intervals 0.2: 0 out of 1,340,266 ( 0.0%) repeat intervals 0.3: 0 out of 1,340,266 ( 0.0%) repeat intervals 0.4: 0 out of 1,340,266 ( 0.0%) repeat intervals 0.5: 0 out of 1,340,266 ( 0.0%) repeat intervals 0.6: 0 out of 1,340,266 ( 0.0%) repeat intervals 0.7: 0 out of 1,340,266 ( 0.0%) repeat intervals 0.8: 0 out of 1,340,266 ( 0.0%) repeat intervals 0.9: 4 out of 1,340,266 ( 0.0%) repeat intervals 1.0: 1,340,262 out of 1,340,266 (100.0%) repeat intervals Locus sizes at each motif size: 1bp motifs: locus size range: 12 bp to 90 bp (median: 16 bp) based on 488,668 loci. Mean base purity: 1.00. 2bp motifs: locus size range: 10 bp to 496 bp (median: 16 bp) based on 179,518 loci. Mean base purity: 1.00. 3bp motifs: locus size range: 12 bp to 180 bp (median: 12 bp) based on 150,463 loci. Mean base purity: 1.00. 4bp motifs: locus size range: 12 bp to 332 bp (median: 12 bp) based on 377,644 loci. Mean base purity: 1.00. 5bp motifs: locus size range: 15 bp to 305 bp (median: 15 bp) based on 106,135 loci. Mean base purity: 1.00. 6bp motifs: locus size range: 18 bp to 372 bp (median: 18 bp) based on 27,654 loci. Mean base purity: 1.00. 7bp motifs: locus size range: 21 bp to 119 bp (median: 21 bp) based on 5,708 loci. Mean base purity: 1.00. 8bp motifs: locus size range: 24 bp to 144 bp (median: 24 bp) based on 1,707 loci. Mean base purity: 1.00. 9bp motifs: locus size range: 27 bp to 108 bp (median: 27 bp) based on 760 loci. Mean base purity: 1.00. 10bp motifs: locus size range: 30 bp to 110 bp (median: 30 bp) based on 344 loci. Mean base purity: 1.00. 
Wrote 1 rows to hg38_ver17.filtered.catalog_stats.tsv STEP #33: python3 -u -m str_analysis.merge_loci --output-prefix GangSTR_v17 --output-format JSON --overlapping-loci-action keep-first --verbose --write-merge-stats-tsv /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/repeat_catalog_v1.hg38.1_to_1000bp_motifs.EH.with_annotations.json.gz /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/hg38_ver17.filtered.json.gz - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Parsing /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/repeat_catalog_v1.hg38.1_to_1000bp_motifs.EH.with_annotations.json.gz Kept 4,863,041 out of 4,863,041 (100.0%) records from repeat_catalog_v1.hg38.1_to_1000bp_motifs.EH.with_annotations.json.gz - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Parsing /Users/weisburd/code/tandem-repeat-catalogs/results__2024-10-01/1_to_1000bp_motifs/hg38_ver17.filtered.json.gz Kept 0 out of 1,340,266 ( 0.0%) records from hg38_ver17.filtered.json.gz Discarded 1,340,266 out of 1,340,266 (100.0%) records since they overlapped an existing locus by at least 66.0% and had the same canonical motif Writing combined catalog to GangSTR_v17.json.gz Wrote 4,863,041 output records to GangSTR_v17.json.gz Source of loci in output catalog: 4,863,041 out of 4,863,041 (100.0%) repeat_catalog_v1.hg38.1_to_1000bp_motifs.EH.with_annotations.json.gz Motif sizes: 1,567,337 out of 4,863,041 (32.2%) 1bp 1,432,117 out of 4,863,041 (29.4%) 3bp 978,972 out of 4,863,041 (20.1%) 2bp 590,787 out of 4,863,041 (12.1%) 4bp 177,422 out of 4,863,041 ( 3.6%) 5bp 59,675 out of 4,863,041 ( 1.2%) 7+bp 56,731 out of 4,863,041 ( 1.2%) 6bp Chromsomes: 4,579,579 out of 4,863,041 (94.2%) are on autosomes 244,191 out of 4,863,041 ( 5.0%) are on X 39,257 out of 4,863,041 ( 
0.8%) are on Y 14 out of 4,863,041 ( 0.0%) are on M Wrote 2 rows to GangSTR_v17.merge_stats.tsv Done with comparisons. Took 0h, 55m, 22s