{ "cells": [ { "cell_type": "markdown", "id": "2949abe4-dfa6-4c4a-a395-a3a1db92b5e7", "metadata": {}, "source": [ "## Create _C.virginia_ gene BED file which _includes_ pseudogenes.\n", "\n", "### Resulting gene BED file will be used for [_C.virginica_ RNAseq/DML sex/OA project](https://github.com/epigeneticstoocean/2018_L18-adult-methylation) (GitHub repo)\n", "\n", "This notebook relies on [GFFutils](https://gffutils.readthedocs.io/en/v0.12.0/index.html) to be installed and available in your `$PATH`." ] }, { "cell_type": "markdown", "id": "ee0ebae6-d54c-4d18-88bd-3d3456a8b1e6", "metadata": {}, "source": [ "### List computer specs" ] }, { "cell_type": "code", "execution_count": 1, "id": "9f60016c-d6b6-4b6d-86d3-5ee68b55464c", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "TODAY'S DATE:\n", "Mon Sep 26 01:12:01 PM PDT 2022\n", "------------\n", "\n", "Distributor ID:\tUbuntu\n", "Description:\tUbuntu 22.04.1 LTS\n", "Release:\t22.04\n", "Codename:\tjammy\n", "\n", "------------\n", "HOSTNAME: \n", "computer\n", "\n", "------------\n", "Computer Specs:\n", "\n", "Architecture: x86_64\n", "CPU op-mode(s): 32-bit, 64-bit\n", "Address sizes: 45 bits physical, 48 bits virtual\n", "Byte Order: Little Endian\n", "CPU(s): 8\n", "On-line CPU(s) list: 0-7\n", "Vendor ID: GenuineIntel\n", "Model name: Intel(R) Core(TM) i9-10885H CPU @ 2.40GHz\n", "CPU family: 6\n", "Model: 165\n", "Thread(s) per core: 1\n", "Core(s) per socket: 1\n", "Socket(s): 8\n", "Stepping: 2\n", "BogoMIPS: 4800.01\n", "Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon nopl xtopology tsc_reliable nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 avx2 smep bmi2 invpcid rdseed adx smap clflushopt xsaveopt xsavec xgetbv1 xsaves arat flush_l1d arch_capabilities\n", "Hypervisor vendor: VMware\n", "Virtualization type: full\n", "L1d cache: 256 KiB (8 instances)\n", "L1i cache: 256 KiB (8 instances)\n", "L2 cache: 2 MiB (8 instances)\n", "L3 cache: 128 MiB (8 instances)\n", "NUMA node(s): 1\n", "NUMA node0 CPU(s): 0-7\n", "Vulnerability Itlb multihit: KVM: Mitigation: VMX unsupported\n", "Vulnerability L1tf: Mitigation; PTE Inversion\n", "Vulnerability Mds: Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown\n", "Vulnerability Meltdown: Mitigation; PTI\n", "Vulnerability Mmio stale data: Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown\n", "Vulnerability Retbleed: Mitigation; IBRS\n", "Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\n", "Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\n", "Vulnerability Spectre v2: Mitigation; IBRS, IBPB conditional, RSB filling\n", "Vulnerability Srbds: Unknown: Dependent on hypervisor status\n", "Vulnerability Tsx async abort: Not affected\n", "\n", "------------\n", "\n", "Memory Specs\n", "\n", " total used free shared buff/cache available\n", "Mem: 54Gi 6.4Gi 43Gi 201Mi 5.2Gi 47Gi\n", "Swap: 2.0Gi 0B 2.0Gi\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "No LSB modules are available.\n" ] } ], "source": [ "%%bash\n", "echo \"TODAY'S DATE:\"\n", "date\n", "echo \"------------\"\n", "echo \"\"\n", "#Display operating system info\n", "lsb_release -a\n", "echo \"\"\n", "echo \"------------\"\n", "echo \"HOSTNAME: \"; hostname \n", "echo \"\"\n", "echo \"------------\"\n", "echo \"Computer Specs:\"\n", "echo \"\"\n", "lscpu\n", "echo \"\"\n", "echo \"------------\"\n", "echo \"\"\n", "echo \"Memory Specs\"\n", "echo \"\"\n", "free -mh" ] }, { "cell_type": "markdown", "id": "19866f68-bfad-4a83-adb5-24e271e29d06", "metadata": {}, "source": [ "### Set variables\n", "- `%env` indicates a bash variable\n", "\n", "- without `%env` is Python variable" ] }, { "cell_type": "code", "execution_count": 2, "id": "7293bcb0-581c-4ad2-8f1e-09dd98352aaf", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "env: data_dir=/home/sam/data/C_virginica/igv_tracks\n", "env: analysis_dir=/home/sam/analyses/20220926-cvir-gff-to-bed-genes_and_pseudogenes\n", "env: orig_gff=GCF_002022765.2_C_virginica-3.0_genomic.gff\n", "env: orig_gff_gz=GCF_002022765.2_C_virginica-3.0_genomic.gff.gz\n", "env: orig_gff_url=https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/002/022/765/GCF_002022765.2_C_virginica-3.0\n", "env: gtf_extractor_output_genes=20220926-cvir-GCF_002022765.2-genes.bed\n", "env: gtf_extractor_output_pseudogenes=20220926-cvir-GCF_002022765.2-pseudogenes.bed\n", "env: genes_and_psuedogenes=20220926-cvir-GCF_002022765.2-genes-and-pseudogenes.bed\n" ] } ], "source": [ "# Set directories, input/output files\n", "%env data_dir=/home/sam/data/C_virginica/igv_tracks\n", "%env analysis_dir=/home/sam/analyses/20220926-cvir-gff-to-bed-genes_and_pseudogenes\n", "analysis_dir=\"20220926-cvir-gff-to-bed-genes_and_pseudogenes\"\n", "\n", "# Input GFF (from NCBI)\n", "%env orig_gff=GCF_002022765.2_C_virginica-3.0_genomic.gff\n", "%env orig_gff_gz=GCF_002022765.2_C_virginica-3.0_genomic.gff.gz\n", "%env orig_gff_url=https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/002/022/765/GCF_002022765.2_C_virginica-3.0\n", "\n", "# GTF extractor output\n", "%env gtf_extractor_output_genes=20220926-cvir-GCF_002022765.2-genes.bed\n", "%env gtf_extractor_output_pseudogenes=20220926-cvir-GCF_002022765.2-pseudogenes.bed\n", "%env genes_and_psuedogenes=20220926-cvir-GCF_002022765.2-genes-and-pseudogenes.bed" ] }, { "cell_type": "markdown", "id": "56052f6d-441a-4048-8a6f-39d58552283d", "metadata": {}, "source": [ "### Download GFF" ] }, { "cell_type": "code", "execution_count": 3, "id": "951fc8e9-b821-4f54-848f-f9573daadc83", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "-rw-rw-r-- 1 sam sam 412M Dec 10 2019 GCF_002022765.2_C_virginica-3.0_genomic.gff\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "gzip: GCF_002022765.2_C_virginica-3.0_genomic.gff already exists;\tnot overwritten\n" ] } ], "source": [ "%%bash\n", "cd \"${data_dir}\"\n", "\n", "# Download with wget.\n", "# Use --quiet option to prevent wget output from printing too many lines to notebook\n", "# Use --continue to prevent re-downloading fie if it's already been downloaded.\n", "wget --quiet \\\n", "--continue \\\n", "${orig_gff_url}/${orig_gff_gz}\n", "\n", "# Unzip downloaded GFF, if it exists\n", "if [[ -f \"${orig_gff_gz}\" ]]; then\n", " gunzip \"${orig_gff_gz}\"\n", "fi\n", "\n", "ls -ltrh \"${orig_gff}\"" ] }, { "cell_type": "markdown", "id": "7eb8b7c0-5927-44c5-ba79-cbee1d5a77fb", "metadata": {}, "source": [ "### Examine GFF" ] }, { "cell_type": "code", "execution_count": 4, "id": "18862291-d1ec-4b62-8b22-959404538a7f", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "##gff-version 3\n", "#!gff-spec-version 1.21\n", "#!processor NCBI annotwriter\n", "#!genome-build C_virginica-3.0\n", "#!genome-build-accession NCBI_Assembly:GCF_002022765.2\n", "#!annotation-source NCBI Crassostrea virginica Annotation Release 100\n", "##sequence-region NC_035780.1 1 65668440\n", "##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=6565\n", "NC_035780.1\tRefSeq\tregion\t1\t65668440\t.\t+\t.\tID=NC_035780.1:1..65668440;Dbxref=taxon:6565;Name=1;chromosome=1;collection-date=22-Mar-2015;country=USA;gbkey=Src;genome=chromosome;isolate=RU13XGHG1-28;isolation-source=Rutgers Haskin Shellfish Research Laboratory inbred lines (NJ);mol_type=genomic DNA;tissue-type=whole sample\n", "NC_035780.1\tGnomon\tgene\t13578\t14594\t.\t+\t.\tID=gene-LOC111116054;Dbxref=GeneID:111116054;Name=LOC111116054;gbkey=Gene;gene=LOC111116054;gene_biotype=lncRNA\n", "NC_035780.1\tGnomon\tlnc_RNA\t13578\t14594\t.\t+\t.\tID=rna-XR_002636969.1;Parent=gene-LOC111116054;Dbxref=GeneID:111116054,Genbank:XR_002636969.1;Name=XR_002636969.1;gbkey=ncRNA;gene=LOC111116054;model_evidence=Supporting evidence includes similarity to: 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 1 sample with support for all annotated introns;product=uncharacterized LOC111116054;transcript_id=XR_002636969.1\n", "NC_035780.1\tGnomon\texon\t13578\t13603\t.\t+\t.\tID=exon-XR_002636969.1-1;Parent=rna-XR_002636969.1;Dbxref=GeneID:111116054,Genbank:XR_002636969.1;gbkey=ncRNA;gene=LOC111116054;product=uncharacterized LOC111116054;transcript_id=XR_002636969.1\n", "NC_035780.1\tGnomon\texon\t14237\t14290\t.\t+\t.\tID=exon-XR_002636969.1-2;Parent=rna-XR_002636969.1;Dbxref=GeneID:111116054,Genbank:XR_002636969.1;gbkey=ncRNA;gene=LOC111116054;product=uncharacterized LOC111116054;transcript_id=XR_002636969.1\n", "NC_035780.1\tGnomon\texon\t14557\t14594\t.\t+\t.\tID=exon-XR_002636969.1-3;Parent=rna-XR_002636969.1;Dbxref=GeneID:111116054,Genbank:XR_002636969.1;gbkey=ncRNA;gene=LOC111116054;product=uncharacterized LOC111116054;transcript_id=XR_002636969.1\n", "NC_035780.1\tGnomon\tgene\t28961\t33324\t.\t+\t.\tID=gene-LOC111126949;Dbxref=GeneID:111126949;Name=LOC111126949;gbkey=Gene;gene=LOC111126949;gene_biotype=protein_coding\n", "NC_035780.1\tGnomon\tmRNA\t28961\t33324\t.\t+\t.\tID=rna-XM_022471938.1;Parent=gene-LOC111126949;Dbxref=GeneID:111126949,Genbank:XM_022471938.1;Name=XM_022471938.1;gbkey=mRNA;gene=LOC111126949;model_evidence=Supporting evidence includes similarity to: 3 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 21 samples with support for all annotated introns;product=UNC5C-like protein;transcript_id=XM_022471938.1\n", "NC_035780.1\tGnomon\texon\t28961\t29073\t.\t+\t.\tID=exon-XM_022471938.1-1;Parent=rna-XM_022471938.1;Dbxref=GeneID:111126949,Genbank:XM_022471938.1;gbkey=mRNA;gene=LOC111126949;product=UNC5C-like protein;transcript_id=XM_022471938.1\n", "NC_035780.1\tGnomon\texon\t30524\t31557\t.\t+\t.\tID=exon-XM_022471938.1-2;Parent=rna-XM_022471938.1;Dbxref=GeneID:111126949,Genbank:XM_022471938.1;gbkey=mRNA;gene=LOC111126949;product=UNC5C-like protein;transcript_id=XM_022471938.1\n", "NC_035780.1\tGnomon\texon\t31736\t31887\t.\t+\t.\tID=exon-XM_022471938.1-3;Parent=rna-XM_022471938.1;Dbxref=GeneID:111126949,Genbank:XM_022471938.1;gbkey=mRNA;gene=LOC111126949;product=UNC5C-like protein;transcript_id=XM_022471938.1\n", "NC_035780.1\tGnomon\texon\t31977\t32565\t.\t+\t.\tID=exon-XM_022471938.1-4;Parent=rna-XM_022471938.1;Dbxref=GeneID:111126949,Genbank:XM_022471938.1;gbkey=mRNA;gene=LOC111126949;product=UNC5C-like protein;transcript_id=XM_022471938.1\n" ] } ], "source": [ "%%bash\n", "head -n 20 \"${data_dir}\"/\"${orig_gff}\"" ] }, { "cell_type": "markdown", "id": "5f389b71-582f-41e5-b9a0-bd2491d968b0", "metadata": {}, "source": [ "### Use [GFFutils](https://gffutils.readthedocs.io/en/v0.12.0/index.html) to extract gene and pseudo gene features" ] }, { "cell_type": "code", "execution_count": 5, "id": "3dc274bf-228a-4d08-bd30-de6872a9ecc0", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "\n", "real\t0m59.836s\n", "user\t0m59.627s\n", "sys\t0m0.208s\n", "\n", "real\t0m59.366s\n", "user\t0m59.243s\n", "sys\t0m0.112s\n" ] } ], "source": [ "%%bash\n", "# Make analysis directory, if it doesn't exist\n", "mkdir --parents \"${analysis_dir}\"\n", "\n", "# Extract gene or pseudogene features\n", "# Extract chrom,start,end,gene=,and strand fields\n", "# \"gene=\" is the NCBI gene name, in this particular instance\n", "# Specify input as GFF\n", "# Use awk to to insert a \"score\" column before the strand column ($5)\n", "# and fill new \"score\" column with arbitrary value of \"0\"\n", "time \\\n", "gtf_extract \\\n", "--feature gene \\\n", "--fields=chrom,start,end,ID,strand \\\n", "--gff ${data_dir}/${orig_gff} \\\n", "| awk 'BEGIN{FS=OFS=\"\\t\"}{$5 = 0 OFS $5}1' \\\n", "> ${analysis_dir}/${gtf_extractor_output_genes}\n", "\n", "time \\\n", "gtf_extract \\\n", "--feature pseudogene \\\n", "--fields=chrom,start,end,ID,strand \\\n", "--gff ${data_dir}/${orig_gff} \\\n", "| awk 'BEGIN{FS=OFS=\"\\t\"}{$5 = 0 OFS $5}1' \\\n", "> ${analysis_dir}/${gtf_extractor_output_pseudogenes}" ] }, { "cell_type": "markdown", "id": "a718c81a-4fb3-4bab-8dc5-1466ba2452af", "metadata": {}, "source": [ "#### Check results" ] }, { "cell_type": "code", "execution_count": 6, "id": "5e368eb2-8d11-414f-be44-f369dc3c3c6d", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "-rw-rw-r-- 1 sam sam 2.0M Sep 26 13:13 20220926-cvir-GCF_002022765.2-genes.bed\n", "-rw-rw-r-- 1 sam sam 34K Sep 26 13:14 20220926-cvir-GCF_002022765.2-pseudogenes.bed\n", "\n", "==> 20220926-cvir-GCF_002022765.2-genes.bed <==\n", "NC_035780.1\t13578\t14594\tgene-LOC111116054\t0\t+\n", "NC_035780.1\t28961\t33324\tgene-LOC111126949\t0\t+\n", "NC_035780.1\t43111\t66897\tgene-LOC111110729\t0\t-\n", "NC_035780.1\t85606\t95254\tgene-LOC111112434\t0\t-\n", "NC_035780.1\t99840\t106460\tgene-LOC111120752\t0\t+\n", "NC_035780.1\t108305\t110077\tgene-LOC111128944\t0\t-\n", "NC_035780.1\t151859\t157536\tgene-LOC111128953\t0\t+\n", "NC_035780.1\t163809\t183798\tgene-LOC111105691\t0\t-\n", "NC_035780.1\t164820\t166793\tgene-LOC111105685\t0\t+\n", "NC_035780.1\t169468\t170178\tgene-LOC111105702\t0\t-\n", "\n", "==> 20220926-cvir-GCF_002022765.2-pseudogenes.bed <==\n", "NC_035780.1\t2215606\t2223571\tgene-LOC111129349\t0\t-\n", "NC_035780.1\t2997762\t3000443\tgene-LOC111101388\t0\t+\n", "NC_035780.1\t3070311\t3078520\tgene-LOC111129757\t0\t+\n", "NC_035780.1\t3376106\t3379058\tgene-LOC111129887\t0\t-\n", "NC_035780.1\t3379153\t3381662\tgene-LOC111129920\t0\t-\n", "NC_035780.1\t3850672\t3852462\tgene-LOC111125334\t0\t+\n", "NC_035780.1\t4397181\t4401310\tgene-LOC111130254\t0\t-\n", "NC_035780.1\t6721106\t6745518\tgene-LOC111130754\t0\t+\n", "NC_035780.1\t7313912\t7327414\tgene-LOC111128791\t0\t+\n", "NC_035780.1\t8003747\t8006376\tgene-LOC111120605\t0\t-\n" ] } ], "source": [ "%%bash\n", "cd \"${analysis_dir}\"\n", "ls -ltrh *.bed\n", "\n", "echo \"\"\n", "\n", "head *.bed" ] }, { "cell_type": "markdown", "id": "d40dadc7-7dbd-40ca-9e2c-80f8d93568e4", "metadata": {}, "source": [ "#### Check that [GFFutils](https://gffutils.readthedocs.io/en/v0.12.0/index.html) output seem okay" ] }, { "cell_type": "code", "execution_count": 7, "id": "c24d1cd0-ac07-42c4-810f-008cf42f7fd5", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "GFFutils number of extracted genes:\n", "38838\n", "\n", "awk number of extracted genes:\n", "38838\n" ] } ], "source": [ "%%bash\n", "# Count gene features via GFFutils\n", "echo \"GFFutils number of extracted genes:\"\n", "gtf_extract --feature gene --fields=ID --gff ${data_dir}/${orig_gff} | wc -l\n", "\n", "echo \"\"\n", "\n", "# Count gene features via awk\n", "echo \"awk number of extracted genes:\"\n", "awk '$3 == \"gene\" { print $0 }' ${data_dir}/${orig_gff} | wc -l\n" ] }, { "cell_type": "markdown", "id": "d7431052-7168-416d-b773-20f2d0cf2c08", "metadata": {}, "source": [ "### Combine and sort BED files" ] }, { "cell_type": "code", "execution_count": 8, "id": "81259255-8a8e-43d1-bbff-ca233a994be8", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "39505 /home/sam/analyses/20220926-cvir-gff-to-bed-genes_and_pseudogenes/20220926-cvir-GCF_002022765.2-genes-and-pseudogenes.bed\n", "NC_007175.2\t1\t1623\tgene-COX1\t0\t+\n", "NC_007175.2\t2558\t3429\tgene-COX3\t0\t+\n", "NC_007175.2\t3647\t4859\tgene-CYTB\t0\t+\n", "NC_007175.2\t4897\t5589\tgene-COX2\t0\t+\n", "NC_007175.2\t9518\t10192\tgene-ATP6\t0\t+\n", "NC_007175.2\t10295\t11290\tgene-ND2\t0\t+\n", "NC_007175.2\t11515\t12864\tgene-ND4\t0\t+\n", "NC_007175.2\t13126\t14793\tgene-ND5\t0\t+\n", "NC_007175.2\t14807\t15268\tgene-ND6\t0\t+\n", "NC_007175.2\t15352\t15705\tgene-ND3\t0\t+\n" ] } ], "source": [ "%%bash\n", "# Write to temp file \n", "cat ${analysis_dir}/${gtf_extractor_output_genes} \\\n", "${analysis_dir}/${gtf_extractor_output_pseudogenes} \\\n", "> ${analysis_dir}/tmp.txt\n", "\n", "# Sort the file by chromosome, then\n", "sort --version-sort -k1,1 -k2,2 ${analysis_dir}/tmp.txt \\\n", "> ${analysis_dir}/${genes_and_psuedogenes}\n", "\n", "# \n", "wc -l ${analysis_dir}/${genes_and_psuedogenes}\n", "\n", "# Check out combined file\n", "head ${analysis_dir}/*genes-and-pseudogenes.bed" ] }, { "cell_type": "markdown", "id": "24e1a558-e35d-4afc-ae7f-f1a8282b4c06", "metadata": {}, "source": [ "#### Clean up files" ] }, { "cell_type": "code", "execution_count": 9, "id": "71a6471d-a602-47ed-a616-53d5482fd95b", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "total 2.0M\n", "-rw-rw-r-- 1 sam sam 2.0M Sep 26 13:15 20220926-cvir-GCF_002022765.2-genes-and-pseudogenes.bed\n" ] } ], "source": [ "%%bash\n", "rm ${analysis_dir}/tmp.txt ${analysis_dir}/${gtf_extractor_output_genes} ${analysis_dir}/${gtf_extractor_output_pseudogenes}\n", "\n", "ls -ltrh ${analysis_dir}" ] }, { "cell_type": "markdown", "id": "c0b01c2f-2c68-4821-ae0e-f19c43231119", "metadata": {}, "source": [ "### Generate checksums" ] }, { "cell_type": "code", "execution_count": 10, "id": "9a8e9dcb-eb74-4af4-8924-0553fe6f3596", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "101b49158ac0d7efe103670094118ecc 20220926-cvir-GCF_002022765.2-genes-and-pseudogenes.bed\n" ] } ], "source": [ "%%bash\n", "cd \"${analysis_dir}\"\n", "\n", "for file in *\n", "do\n", " md5sum \"${file}\" | tee --append checksums.md5\n", "done" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.7" } }, "nbformat": 4, "nbformat_minor": 5 }