{ "cells": [ { "cell_type": "markdown", "id": "2949abe4-dfa6-4c4a-a395-a3a1db92b5e7", "metadata": {}, "source": [ "## Create _S.namaycush_ gene BED file.\n", "\n", "### Resulting gene BED file will be used for [lake trout Ballgown isoform identifcation](https://github.com/RobertsLab/project-lake-trout) (GitHub repo)\n", "\n", "This notebook relies on [GFFutils](https://gffutils.readthedocs.io/en/v0.12.0/index.html) to be installed and available in your `$PATH`." ] }, { "cell_type": "markdown", "id": "ee0ebae6-d54c-4d18-88bd-3d3456a8b1e6", "metadata": {}, "source": [ "### List computer specs" ] }, { "cell_type": "code", "execution_count": 1, "id": "9f60016c-d6b6-4b6d-86d3-5ee68b55464c", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "TODAY'S DATE:\n", "Thu 18 Aug 2022 07:34:19 AM PDT\n", "------------\n", "\n", "Distributor ID:\tUbuntu\n", "Description:\tUbuntu 20.04.4 LTS\n", "Release:\t20.04\n", "Codename:\tfocal\n", "\n", "------------\n", "HOSTNAME: \n", "computer\n", "\n", "------------\n", "Computer Specs:\n", "\n", "Architecture: x86_64\n", "CPU op-mode(s): 32-bit, 64-bit\n", "Byte Order: Little Endian\n", "Address sizes: 45 bits physical, 48 bits virtual\n", "CPU(s): 2\n", "On-line CPU(s) list: 0,1\n", "Thread(s) per core: 1\n", "Core(s) per socket: 1\n", "Socket(s): 2\n", "NUMA node(s): 1\n", "Vendor ID: GenuineIntel\n", "CPU family: 6\n", "Model: 165\n", "Model name: Intel(R) Core(TM) i9-10885H CPU @ 2.40GHz\n", "Stepping: 2\n", "CPU MHz: 2400.007\n", "BogoMIPS: 4800.01\n", "Hypervisor vendor: VMware\n", "Virtualization type: full\n", "L1d cache: 64 KiB\n", "L1i cache: 64 KiB\n", "L2 cache: 512 KiB\n", "L3 cache: 32 MiB\n", "NUMA node0 CPU(s): 0,1\n", "Vulnerability Itlb multihit: KVM: Mitigation: VMX unsupported\n", "Vulnerability L1tf: Mitigation; PTE Inversion\n", "Vulnerability Mds: Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown\n", "Vulnerability Meltdown: Mitigation; PTI\n", "Vulnerability Mmio stale data: Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown\n", "Vulnerability Retbleed: Mitigation; IBRS\n", "Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\n", "Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\n", "Vulnerability Spectre v2: Mitigation; IBRS, IBPB conditional, RSB filling\n", "Vulnerability Srbds: Unknown: Dependent on hypervisor status\n", "Vulnerability Tsx async abort: Not affected\n", "Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon nopl xtopology tsc_reliable nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 avx2 smep bmi2 invpcid rdseed adx smap clflushopt xsaveopt xsavec xgetbv1 xsaves arat flush_l1d arch_capabilities\n", "\n", "------------\n", "\n", "Memory Specs\n", "\n", " total used free shared buff/cache available\n", "Mem: 54Gi 31Gi 19Gi 100Mi 4.2Gi 22Gi\n", "Swap: 2.0Gi 25Mi 2.0Gi\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "No LSB modules are available.\n" ] } ], "source": [ "%%bash\n", "echo \"TODAY'S DATE:\"\n", "date\n", "echo \"------------\"\n", "echo \"\"\n", "#Display operating system info\n", "lsb_release -a\n", "echo \"\"\n", "echo \"------------\"\n", "echo \"HOSTNAME: \"; hostname \n", "echo \"\"\n", "echo \"------------\"\n", "echo \"Computer Specs:\"\n", "echo \"\"\n", "lscpu\n", "echo \"\"\n", "echo \"------------\"\n", "echo \"\"\n", "echo \"Memory Specs\"\n", "echo \"\"\n", "free -mh" ] }, { "cell_type": "markdown", "id": "19866f68-bfad-4a83-adb5-24e271e29d06", "metadata": {}, "source": [ "### Set variables\n", "- `%env` indicates a bash variable\n", "\n", "- without `%env` is Python variable" ] }, { "cell_type": "code", "execution_count": 2, "id": "7293bcb0-581c-4ad2-8f1e-09dd98352aaf", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "env: data_dir=/home/sam/data/S_namaycush/genomes\n", "env: analysis_dir=/home/sam/analyses/20220818-snam-gff_to_bed-genes\n", "env: orig_gff=GCF_016432855.1_SaNama_1.0_genomic.gff\n", "env: orig_gff_gz=GCF_016432855.1_SaNama_1.0_genomic.gff.gz\n", "env: orig_gff_url=https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/016/432/855/GCF_016432855.1_SaNama_1.0\n", "env: gtf_extractor_output=20220818-snam-GCF_016432855.1_SaNama_1.0_genes.bed\n" ] } ], "source": [ "# Set directories, input/output files\n", "%env data_dir=/home/sam/data/S_namaycush/genomes\n", "%env analysis_dir=/home/sam/analyses/20220818-snam-gff_to_bed-genes\n", "analysis_dir=\"/home/sam/analyses/20220818-snam-gff_to_bed-genes\"\n", "\n", "# Input GFF (from NCBI)\n", "%env orig_gff=GCF_016432855.1_SaNama_1.0_genomic.gff\n", "%env orig_gff_gz=GCF_016432855.1_SaNama_1.0_genomic.gff.gz\n", "%env orig_gff_url=https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/016/432/855/GCF_016432855.1_SaNama_1.0\n", "\n", "# GTF extractor output\n", "%env gtf_extractor_output=20220818-snam-GCF_016432855.1_SaNama_1.0_genes.bed" ] }, { "cell_type": "markdown", "id": "56052f6d-441a-4048-8a6f-39d58552283d", "metadata": {}, "source": [ "### Download GFF" ] }, { "cell_type": "code", "execution_count": 3, "id": "951fc8e9-b821-4f54-848f-f9573daadc83", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "-rw-rw-r-- 1 sam sam 373M Jan 13 2021 GCF_016432855.1_SaNama_1.0_genomic.gff\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "gzip: GCF_016432855.1_SaNama_1.0_genomic.gff already exists;\tnot overwritten\n" ] } ], "source": [ "%%bash\n", "cd \"${data_dir}\"\n", "\n", "# Download with wget.\n", "# Use --quiet option to prevent wget output from printing too many lines to notebook\n", "# Use --continue to prevent re-downloading fie if it's already been downloaded.\n", "wget --quiet \\\n", "--continue \\\n", "${orig_gff_url}/${orig_gff_gz}\n", "\n", "# Unzip download GFF\n", "gunzip \"${orig_gff_gz}\"\n", "\n", "ls -ltrh \"${orig_gff}\"" ] }, { "cell_type": "markdown", "id": "7eb8b7c0-5927-44c5-ba79-cbee1d5a77fb", "metadata": {}, "source": [ "### Examine GFF" ] }, { "cell_type": "code", "execution_count": 4, "id": "18862291-d1ec-4b62-8b22-959404538a7f", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "##gff-version 3\n", "#!gff-spec-version 1.21\n", "#!processor NCBI annotwriter\n", "#!genome-build SaNama_1.0\n", "#!genome-build-accession NCBI_Assembly:GCF_016432855.1\n", "#!annotation-source NCBI Salvelinus namaycush Annotation Release 100\n", "##sequence-region NC_052307.1 1 84126519\n", "##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=8040\n", "NC_052307.1\tRefSeq\tregion\t1\t84126519\t.\t+\t.\tID=NC_052307.1:1..84126519;Dbxref=taxon:8040;Name=1;chromosome=1;gbkey=Src;genome=chromosome;isolate=Seneca;mol_type=genomic DNA;sex=female;tissue-type=white muscle\n", "NC_052307.1\tGnomon\tgene\t13938\t48855\t.\t+\t.\tID=gene-LOC120017344;Dbxref=GeneID:120017344;Name=LOC120017344;gbkey=Gene;gene=LOC120017344;gene_biotype=protein_coding\n", "NC_052307.1\tGnomon\tmRNA\t13938\t48855\t.\t+\t.\tID=rna-XM_038988747.1;Parent=gene-LOC120017344;Dbxref=GeneID:120017344,Genbank:XM_038988747.1;Name=XM_038988747.1;gbkey=mRNA;gene=LOC120017344;model_evidence=Supporting evidence includes similarity to: 6 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 53 samples with support for all annotated introns;product=threonine--tRNA ligase 1%2C cytoplasmic-like;transcript_id=XM_038988747.1\n", "NC_052307.1\tGnomon\texon\t13938\t14098\t.\t+\t.\tID=exon-XM_038988747.1-1;Parent=rna-XM_038988747.1;Dbxref=GeneID:120017344,Genbank:XM_038988747.1;gbkey=mRNA;gene=LOC120017344;product=threonine--tRNA ligase 1%2C cytoplasmic-like;transcript_id=XM_038988747.1\n", "NC_052307.1\tGnomon\texon\t15382\t15453\t.\t+\t.\tID=exon-XM_038988747.1-2;Parent=rna-XM_038988747.1;Dbxref=GeneID:120017344,Genbank:XM_038988747.1;gbkey=mRNA;gene=LOC120017344;product=threonine--tRNA ligase 1%2C cytoplasmic-like;transcript_id=XM_038988747.1\n", "NC_052307.1\tGnomon\texon\t15794\t15984\t.\t+\t.\tID=exon-XM_038988747.1-3;Parent=rna-XM_038988747.1;Dbxref=GeneID:120017344,Genbank:XM_038988747.1;gbkey=mRNA;gene=LOC120017344;product=threonine--tRNA ligase 1%2C cytoplasmic-like;transcript_id=XM_038988747.1\n", "NC_052307.1\tGnomon\texon\t20383\t20506\t.\t+\t.\tID=exon-XM_038988747.1-4;Parent=rna-XM_038988747.1;Dbxref=GeneID:120017344,Genbank:XM_038988747.1;gbkey=mRNA;gene=LOC120017344;product=threonine--tRNA ligase 1%2C cytoplasmic-like;transcript_id=XM_038988747.1\n", "NC_052307.1\tGnomon\texon\t22557\t22678\t.\t+\t.\tID=exon-XM_038988747.1-5;Parent=rna-XM_038988747.1;Dbxref=GeneID:120017344,Genbank:XM_038988747.1;gbkey=mRNA;gene=LOC120017344;product=threonine--tRNA ligase 1%2C cytoplasmic-like;transcript_id=XM_038988747.1\n", "NC_052307.1\tGnomon\texon\t24602\t24719\t.\t+\t.\tID=exon-XM_038988747.1-6;Parent=rna-XM_038988747.1;Dbxref=GeneID:120017344,Genbank:XM_038988747.1;gbkey=mRNA;gene=LOC120017344;product=threonine--tRNA ligase 1%2C cytoplasmic-like;transcript_id=XM_038988747.1\n", "NC_052307.1\tGnomon\texon\t27866\t27930\t.\t+\t.\tID=exon-XM_038988747.1-7;Parent=rna-XM_038988747.1;Dbxref=GeneID:120017344,Genbank:XM_038988747.1;gbkey=mRNA;gene=LOC120017344;product=threonine--tRNA ligase 1%2C cytoplasmic-like;transcript_id=XM_038988747.1\n", "NC_052307.1\tGnomon\texon\t28019\t28097\t.\t+\t.\tID=exon-XM_038988747.1-8;Parent=rna-XM_038988747.1;Dbxref=GeneID:120017344,Genbank:XM_038988747.1;gbkey=mRNA;gene=LOC120017344;product=threonine--tRNA ligase 1%2C cytoplasmic-like;transcript_id=XM_038988747.1\n", "NC_052307.1\tGnomon\texon\t30973\t31119\t.\t+\t.\tID=exon-XM_038988747.1-9;Parent=rna-XM_038988747.1;Dbxref=GeneID:120017344,Genbank:XM_038988747.1;gbkey=mRNA;gene=LOC120017344;product=threonine--tRNA ligase 1%2C cytoplasmic-like;transcript_id=XM_038988747.1\n" ] } ], "source": [ "%%bash\n", "head -n 20 \"${data_dir}\"/\"${orig_gff}\"" ] }, { "cell_type": "markdown", "id": "5f389b71-582f-41e5-b9a0-bd2491d968b0", "metadata": {}, "source": [ "### Use [GFFutils](https://gffutils.readthedocs.io/en/v0.12.0/index.html) to extract gene features" ] }, { "cell_type": "code", "execution_count": 5, "id": "3dc274bf-228a-4d08-bd30-de6872a9ecc0", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "\n", "real\t1m12.914s\n", "user\t1m2.161s\n", "sys\t0m5.052s\n" ] } ], "source": [ "%%bash\n", "# Make analysis directory, if it doesn't exist\n", "mkdir --parents \"${analysis_dir}\"\n", "\n", "# Extract just gene features\n", "# Extract chrom,start,end,gene=,and strand fields\n", "# \"gene=\" is the NCBI gene name, in this particular instance\n", "# Specify input as GFF\n", "# Use awk to to insert a \"score\" column before the strand column ($5)\n", "# and fill new \"score\" column with arbitrary value of \"0\"\n", "time \\\n", "gtf_extract \\\n", "--feature gene \\\n", "--fields=chrom,start,end,ID,strand \\\n", "--gff ${data_dir}/${orig_gff} \\\n", "| awk 'BEGIN{FS=OFS=\"\\t\"}{$5 = 0 OFS $5}1' \\\n", "> ${analysis_dir}/${gtf_extractor_output}" ] }, { "cell_type": "markdown", "id": "a718c81a-4fb3-4bab-8dc5-1466ba2452af", "metadata": {}, "source": [ "#### Check results" ] }, { "cell_type": "code", "execution_count": 6, "id": "5e368eb2-8d11-414f-be44-f369dc3c3c6d", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "-rw-rw-r-- 1 sam sam 2.3M Aug 18 07:36 20220818-snam-GCF_016432855.1_SaNama_1.0_genes.bed\n", "\n", "NC_052307.1\t13938\t48855\tgene-LOC120017344\t0\t+\n", "NC_052307.1\t243315\t251581\tgene-LOC120053455\t0\t+\n", "NC_052307.1\t265811\t273478\tgene-LOC120050024\t0\t+\n", "NC_052307.1\t289722\t315386\tgene-LOC120050032\t0\t-\n", "NC_052307.1\t428889\t450578\tgene-LOC120049777\t0\t-\n", "NC_052307.1\t606253\t621424\tgene-LOC120058477\t0\t+\n", "NC_052307.1\t629244\t678079\tgene-pdzd2\t0\t+\n", "NC_052307.1\t684892\t686121\tgene-pmaip1\t0\t+\n", "NC_052307.1\t741348\t859581\tgene-LOC120053472\t0\t+\n", "NC_052307.1\t1052957\t1054465\tgene-LOC120050046\t0\t+\n" ] } ], "source": [ "%%bash\n", "cd \"${analysis_dir}\"\n", "ls -ltrh ${gtf_extractor_output}\n", "\n", "echo \"\"\n", "\n", "head ${gtf_extractor_output}" ] }, { "cell_type": "markdown", "id": "d40dadc7-7dbd-40ca-9e2c-80f8d93568e4", "metadata": {}, "source": [ "#### Confirm that [GFFutils](https://gffutils.readthedocs.io/en/v0.12.0/index.html) output seem okay\n", "\n", "Compare line counts from awk command and GFFutils match." ] }, { "cell_type": "code", "execution_count": 7, "id": "c24d1cd0-ac07-42c4-810f-008cf42f7fd5", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "GFFutils number of extracted genes:\n", "46359\n", "\n", "awk number of extracted genes:\n", "46359\n" ] } ], "source": [ "%%bash\n", "# Count gene features via GFFutils\n", "echo \"GFFutils number of extracted genes:\"\n", "gtf_extract --feature gene --fields=ID --gff ${data_dir}/${orig_gff} | wc -l\n", "\n", "echo \"\"\n", "\n", "# Count gene features via awk\n", "echo \"awk number of extracted genes:\"\n", "awk '$3 == \"gene\" { print $0 }' ${data_dir}/${orig_gff} | wc -l\n" ] }, { "cell_type": "markdown", "id": "c0b01c2f-2c68-4821-ae0e-f19c43231119", "metadata": {}, "source": [ "### Generate checksums" ] }, { "cell_type": "code", "execution_count": 8, "id": "9a8e9dcb-eb74-4af4-8924-0553fe6f3596", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "440d09ac4bd225a6585d69ef623fd812 20220818-snam-GCF_016432855.1_SaNama_1.0_genes.bed\n" ] } ], "source": [ "%%bash\n", "cd \"${analysis_dir}\"\n", "\n", "for file in *\n", "do\n", " md5sum \"${file}\" | tee --append checksums.md5\n", "done" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.7" } }, "nbformat": 4, "nbformat_minor": 5 }