{ "cells": [ { "cell_type": "markdown", "id": "2949abe4-dfa6-4c4a-a395-a3a1db92b5e7", "metadata": {}, "source": [ "## Create _C.virginia_ gene BED file\n", "\n", "### Resulting gene BED file will be used for [_C.virginica_ RNAseq/DML sex/OA project](https://github.com/epigeneticstoocean/2018_L18-adult-methylation) (GitHub repo)\n", "\n", "This notebook relies on [GFFutils](https://gffutils.readthedocs.io/en/v0.12.0/index.html) to be installed and available in your `$PATH`." ] }, { "cell_type": "markdown", "id": "ee0ebae6-d54c-4d18-88bd-3d3456a8b1e6", "metadata": {}, "source": [ "### List computer specs" ] }, { "cell_type": "code", "execution_count": 1, "id": "9f60016c-d6b6-4b6d-86d3-5ee68b55464c", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "TODAY'S DATE:\n", "Thu 09 Dec 2021 01:47:25 PM PST\n", "------------\n", "\n", "Distributor ID:\tUbuntu\n", "Description:\tUbuntu 20.04.3 LTS\n", "Release:\t20.04\n", "Codename:\tfocal\n", "\n", "------------\n", "HOSTNAME: \n", "computer\n", "\n", "------------\n", "Computer Specs:\n", "\n", "Architecture: x86_64\n", "CPU op-mode(s): 32-bit, 64-bit\n", "Byte Order: Little Endian\n", "Address sizes: 45 bits physical, 48 bits virtual\n", "CPU(s): 2\n", "On-line CPU(s) list: 0,1\n", "Thread(s) per core: 1\n", "Core(s) per socket: 1\n", "Socket(s): 2\n", "NUMA node(s): 1\n", "Vendor ID: GenuineIntel\n", "CPU family: 6\n", "Model: 165\n", "Model name: Intel(R) Core(TM) i9-10885H CPU @ 2.40GHz\n", "Stepping: 2\n", "CPU MHz: 2400.007\n", "BogoMIPS: 4800.01\n", "Hypervisor vendor: VMware\n", "Virtualization type: full\n", "L1d cache: 64 KiB\n", "L1i cache: 64 KiB\n", "L2 cache: 512 KiB\n", "L3 cache: 32 MiB\n", "NUMA node0 CPU(s): 0,1\n", "Vulnerability Itlb multihit: KVM: Mitigation: VMX unsupported\n", "Vulnerability L1tf: Not affected\n", "Vulnerability Mds: Not affected\n", "Vulnerability Meltdown: Not affected\n", "Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\n", "Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\n", "Vulnerability Spectre v2: Mitigation; Full generic retpoline, IBPB conditional, IBRS_FW, STIBP disabled, RSB filling\n", "Vulnerability Srbds: Not affected\n", "Vulnerability Tsx async abort: Not affected\n", "Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon nopl xtopology tsc_reliable nonstop_tsc cpuid pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 avx2 smep bmi2 invpcid rdseed adx smap clflushopt xsaveopt xsavec xgetbv1 xsaves arat flush_l1d arch_capabilities\n", "\n", "------------\n", "\n", "Memory Specs\n", "\n", " total used free shared buff/cache available\n", "Mem: 54Gi 27Gi 20Gi 109Mi 5.9Gi 26Gi\n", "Swap: 2.0Gi 0B 2.0Gi\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "No LSB modules are available.\n" ] } ], "source": [ "%%bash\n", "echo \"TODAY'S DATE:\"\n", "date\n", "echo \"------------\"\n", "echo \"\"\n", "#Display operating system info\n", "lsb_release -a\n", "echo \"\"\n", "echo \"------------\"\n", "echo \"HOSTNAME: \"; hostname \n", "echo \"\"\n", "echo \"------------\"\n", "echo \"Computer Specs:\"\n", "echo \"\"\n", "lscpu\n", "echo \"\"\n", "echo \"------------\"\n", "echo \"\"\n", "echo \"Memory Specs\"\n", "echo \"\"\n", "free -mh" ] }, { "cell_type": "markdown", "id": "19866f68-bfad-4a83-adb5-24e271e29d06", "metadata": {}, "source": [ "### Set variables\n", "- `%env` indicates a bash variable\n", "\n", "- without `%env` is Python variable" ] }, { "cell_type": "code", "execution_count": 1, "id": "7293bcb0-581c-4ad2-8f1e-09dd98352aaf", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "env: data_dir=/home/sam/data/C_virginica/igv_tracks\n", "env: analysis_dir=/home/sam/analyses/20211209_cvir_gff-to-bed\n", "env: orig_gff=GCF_002022765.2_C_virginica-3.0_genomic.gff\n", "env: orig_gff_gz=GCF_002022765.2_C_virginica-3.0_genomic.gff.gz\n", "env: orig_gff_url=https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/002/022/765/GCF_002022765.2_C_virginica-3.0\n", "env: gtf_extractor_output=20211209_cvir_GCF_002022765.2_genes.bed\n" ] } ], "source": [ "# Set directories, input/output files\n", "%env data_dir=/home/sam/data/C_virginica/igv_tracks\n", "%env analysis_dir=/home/sam/analyses/20211209_cvir_gff-to-bed\n", "analysis_dir=\"/home/sam/analyses/20211209_cvir_gff-to-bed\"\n", "\n", "# Input GFF (from NCBI)\n", "%env orig_gff=GCF_002022765.2_C_virginica-3.0_genomic.gff\n", "%env orig_gff_gz=GCF_002022765.2_C_virginica-3.0_genomic.gff.gz\n", "%env orig_gff_url=https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/002/022/765/GCF_002022765.2_C_virginica-3.0\n", "\n", "# GTF extractor output\n", "%env gtf_extractor_output=20211209_cvir_GCF_002022765.2_genes.bed" ] }, { "cell_type": "markdown", "id": "56052f6d-441a-4048-8a6f-39d58552283d", "metadata": {}, "source": [ "### Download GFF" ] }, { "cell_type": "code", "execution_count": 3, "id": "951fc8e9-b821-4f54-848f-f9573daadc83", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "-rw-rw-r-- 1 sam sam 412M Dec 10 2019 GCF_002022765.2_C_virginica-3.0_genomic.gff\n" ] } ], "source": [ "%%bash\n", "cd \"${data_dir}\"\n", "\n", "# Download with wget.\n", "# Use --quiet option to prevent wget output from printing too many lines to notebook\n", "# Use --continue to prevent re-downloading fie if it's already been downloaded.\n", "wget --quiet \\\n", "--continue \\\n", "${orig_gff_url}/${orig_gff_gz}\n", "\n", "# Unzip download GFF\n", "gunzip \"${orig_gff_gz}\"\n", "\n", "ls -ltrh \"${orig_gff}\"" ] }, { "cell_type": "markdown", "id": "7eb8b7c0-5927-44c5-ba79-cbee1d5a77fb", "metadata": {}, "source": [ "### Examine GFF" ] }, { "cell_type": "code", "execution_count": 4, "id": "18862291-d1ec-4b62-8b22-959404538a7f", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "##gff-version 3\n", "#!gff-spec-version 1.21\n", "#!processor NCBI annotwriter\n", "#!genome-build C_virginica-3.0\n", "#!genome-build-accession NCBI_Assembly:GCF_002022765.2\n", "#!annotation-source NCBI Crassostrea virginica Annotation Release 100\n", "##sequence-region NC_035780.1 1 65668440\n", "##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=6565\n", "NC_035780.1\tRefSeq\tregion\t1\t65668440\t.\t+\t.\tID=NC_035780.1:1..65668440;Dbxref=taxon:6565;Name=1;chromosome=1;collection-date=22-Mar-2015;country=USA;gbkey=Src;genome=chromosome;isolate=RU13XGHG1-28;isolation-source=Rutgers Haskin Shellfish Research Laboratory inbred lines (NJ);mol_type=genomic DNA;tissue-type=whole sample\n", "NC_035780.1\tGnomon\tgene\t13578\t14594\t.\t+\t.\tID=gene-LOC111116054;Dbxref=GeneID:111116054;Name=LOC111116054;gbkey=Gene;gene=LOC111116054;gene_biotype=lncRNA\n", "NC_035780.1\tGnomon\tlnc_RNA\t13578\t14594\t.\t+\t.\tID=rna-XR_002636969.1;Parent=gene-LOC111116054;Dbxref=GeneID:111116054,Genbank:XR_002636969.1;Name=XR_002636969.1;gbkey=ncRNA;gene=LOC111116054;model_evidence=Supporting evidence includes similarity to: 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 1 sample with support for all annotated introns;product=uncharacterized LOC111116054;transcript_id=XR_002636969.1\n", "NC_035780.1\tGnomon\texon\t13578\t13603\t.\t+\t.\tID=exon-XR_002636969.1-1;Parent=rna-XR_002636969.1;Dbxref=GeneID:111116054,Genbank:XR_002636969.1;gbkey=ncRNA;gene=LOC111116054;product=uncharacterized LOC111116054;transcript_id=XR_002636969.1\n", "NC_035780.1\tGnomon\texon\t14237\t14290\t.\t+\t.\tID=exon-XR_002636969.1-2;Parent=rna-XR_002636969.1;Dbxref=GeneID:111116054,Genbank:XR_002636969.1;gbkey=ncRNA;gene=LOC111116054;product=uncharacterized LOC111116054;transcript_id=XR_002636969.1\n", "NC_035780.1\tGnomon\texon\t14557\t14594\t.\t+\t.\tID=exon-XR_002636969.1-3;Parent=rna-XR_002636969.1;Dbxref=GeneID:111116054,Genbank:XR_002636969.1;gbkey=ncRNA;gene=LOC111116054;product=uncharacterized LOC111116054;transcript_id=XR_002636969.1\n", "NC_035780.1\tGnomon\tgene\t28961\t33324\t.\t+\t.\tID=gene-LOC111126949;Dbxref=GeneID:111126949;Name=LOC111126949;gbkey=Gene;gene=LOC111126949;gene_biotype=protein_coding\n", "NC_035780.1\tGnomon\tmRNA\t28961\t33324\t.\t+\t.\tID=rna-XM_022471938.1;Parent=gene-LOC111126949;Dbxref=GeneID:111126949,Genbank:XM_022471938.1;Name=XM_022471938.1;gbkey=mRNA;gene=LOC111126949;model_evidence=Supporting evidence includes similarity to: 3 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 21 samples with support for all annotated introns;product=UNC5C-like protein;transcript_id=XM_022471938.1\n", "NC_035780.1\tGnomon\texon\t28961\t29073\t.\t+\t.\tID=exon-XM_022471938.1-1;Parent=rna-XM_022471938.1;Dbxref=GeneID:111126949,Genbank:XM_022471938.1;gbkey=mRNA;gene=LOC111126949;product=UNC5C-like protein;transcript_id=XM_022471938.1\n", "NC_035780.1\tGnomon\texon\t30524\t31557\t.\t+\t.\tID=exon-XM_022471938.1-2;Parent=rna-XM_022471938.1;Dbxref=GeneID:111126949,Genbank:XM_022471938.1;gbkey=mRNA;gene=LOC111126949;product=UNC5C-like protein;transcript_id=XM_022471938.1\n", "NC_035780.1\tGnomon\texon\t31736\t31887\t.\t+\t.\tID=exon-XM_022471938.1-3;Parent=rna-XM_022471938.1;Dbxref=GeneID:111126949,Genbank:XM_022471938.1;gbkey=mRNA;gene=LOC111126949;product=UNC5C-like protein;transcript_id=XM_022471938.1\n", "NC_035780.1\tGnomon\texon\t31977\t32565\t.\t+\t.\tID=exon-XM_022471938.1-4;Parent=rna-XM_022471938.1;Dbxref=GeneID:111126949,Genbank:XM_022471938.1;gbkey=mRNA;gene=LOC111126949;product=UNC5C-like protein;transcript_id=XM_022471938.1\n" ] } ], "source": [ "%%bash\n", "head -n 20 \"${data_dir}\"/\"${orig_gff}\"" ] }, { "cell_type": "markdown", "id": "5f389b71-582f-41e5-b9a0-bd2491d968b0", "metadata": {}, "source": [ "### Use [GFFutils](https://gffutils.readthedocs.io/en/v0.12.0/index.html) to extract gene features" ] }, { "cell_type": "code", "execution_count": 5, "id": "3dc274bf-228a-4d08-bd30-de6872a9ecc0", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "\n", "real\t1m3.584s\n", "user\t1m2.504s\n", "sys\t0m0.333s\n" ] } ], "source": [ "%%bash\n", "# Make analysis directory, if it doesn't exist\n", "mkdir --parents \"${analysis_dir}\"\n", "\n", "# Extract just gene features\n", "# Extract chrom,start,end,gene=,and strand fields\n", "# \"gene=\" is the NCBI gene name, in this particular instance\n", "# Specify input as GFF\n", "# Use awk to to insert a \"score\" column before the strand column ($5)\n", "# and fill new \"score\" column with arbitrary value of \"0\"\n", "time \\\n", "gtf_extract \\\n", "--feature gene \\\n", "--fields=chrom,start,end,ID,strand \\\n", "--gff ${data_dir}/${orig_gff} \\\n", "| awk 'BEGIN{FS=OFS=\"\\t\"}{$5 = 0 OFS $5}1' \\\n", "> ${analysis_dir}/${gtf_extractor_output}" ] }, { "cell_type": "markdown", "id": "a718c81a-4fb3-4bab-8dc5-1466ba2452af", "metadata": {}, "source": [ "#### Check results" ] }, { "cell_type": "code", "execution_count": 6, "id": "5e368eb2-8d11-414f-be44-f369dc3c3c6d", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "-rw-rw-r-- 1 sam sam 2.0M Dec 10 07:02 20211209_cvir_GCF_002022765.2_genes.bed\n", "\n", "NC_035780.1\t13578\t14594\tgene-LOC111116054\t0\t+\n", "NC_035780.1\t28961\t33324\tgene-LOC111126949\t0\t+\n", "NC_035780.1\t43111\t66897\tgene-LOC111110729\t0\t-\n", "NC_035780.1\t85606\t95254\tgene-LOC111112434\t0\t-\n", "NC_035780.1\t99840\t106460\tgene-LOC111120752\t0\t+\n", "NC_035780.1\t108305\t110077\tgene-LOC111128944\t0\t-\n", "NC_035780.1\t151859\t157536\tgene-LOC111128953\t0\t+\n", "NC_035780.1\t163809\t183798\tgene-LOC111105691\t0\t-\n", "NC_035780.1\t164820\t166793\tgene-LOC111105685\t0\t+\n", "NC_035780.1\t169468\t170178\tgene-LOC111105702\t0\t-\n" ] } ], "source": [ "%%bash\n", "cd \"${analysis_dir}\"\n", "ls -ltrh ${gtf_extractor_output}\n", "\n", "echo \"\"\n", "\n", "head ${gtf_extractor_output}" ] }, { "cell_type": "markdown", "id": "d40dadc7-7dbd-40ca-9e2c-80f8d93568e4", "metadata": {}, "source": [ "#### Confirm that [GFFutils](https://gffutils.readthedocs.io/en/v0.12.0/index.html) output seem okay" ] }, { "cell_type": "code", "execution_count": 7, "id": "c24d1cd0-ac07-42c4-810f-008cf42f7fd5", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "GFFutils number of extracted genes:\n", "38838\n", "\n", "awk number of extracted genes:\n", "38838\n" ] } ], "source": [ "%%bash\n", "# Count gene features via GFFutils\n", "echo \"GFFutils number of extracted genes:\"\n", "gtf_extract --feature gene --fields=ID --gff ${data_dir}/${orig_gff} | wc -l\n", "\n", "echo \"\"\n", "\n", "# Count gene features via awk\n", "echo \"awk number of extracted genes:\"\n", "awk '$3 == \"gene\" { print $0 }' ${data_dir}/${orig_gff} | wc -l\n" ] }, { "cell_type": "markdown", "id": "c0b01c2f-2c68-4821-ae0e-f19c43231119", "metadata": {}, "source": [ "### Generate checksums" ] }, { "cell_type": "code", "execution_count": 8, "id": "9a8e9dcb-eb74-4af4-8924-0553fe6f3596", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "c8f203de591c0608b96f4299c0f847dc 20211209_cvir_GCF_002022765.2_genes.bed\n" ] } ], "source": [ "%%bash\n", "cd \"${analysis_dir}\"\n", "\n", "for file in *\n", "do\n", " md5sum \"${file}\" | tee --append checksums.md5\n", "done" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.7" } }, "nbformat": 4, "nbformat_minor": 5 }