{ "cells": [ { "cell_type": "markdown", "id": "2949abe4-dfa6-4c4a-a395-a3a1db92b5e7", "metadata": {}, "source": [ "## Create _S.salar_ UniProt annotations file for Shelly\n", "\n", "See this [GitHub Issue](https://github.com/RobertsLab/resources/issues/1220)\n", "\n", "This notebook relies on [GFFutils](https://gffutils.readthedocs.io/en/v0.12.0/index.html) to be installed and available in your `$PATH`." ] }, { "cell_type": "markdown", "id": "ee0ebae6-d54c-4d18-88bd-3d3456a8b1e6", "metadata": {}, "source": [ "### List computer specs" ] }, { "cell_type": "code", "execution_count": 1, "id": "9f60016c-d6b6-4b6d-86d3-5ee68b55464c", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "TODAY'S DATE:\n", "Tue 01 Jun 2021 07:37:57 AM PDT\n", "------------\n", "\n", "Distributor ID:\tUbuntu\n", "Description:\tUbuntu 20.04.2 LTS\n", "Release:\t20.04\n", "Codename:\tfocal\n", "\n", "------------\n", "HOSTNAME: \n", "computer\n", "\n", "------------\n", "Computer Specs:\n", "\n", "Architecture: x86_64\n", "CPU op-mode(s): 32-bit, 64-bit\n", "Byte Order: Little Endian\n", "Address sizes: 45 bits physical, 48 bits virtual\n", "CPU(s): 8\n", "On-line CPU(s) list: 0-7\n", "Thread(s) per core: 1\n", "Core(s) per socket: 1\n", "Socket(s): 8\n", "NUMA node(s): 1\n", "Vendor ID: GenuineIntel\n", "CPU family: 6\n", "Model: 165\n", "Model name: Intel(R) Core(TM) i9-10885H CPU @ 2.40GHz\n", "Stepping: 2\n", "CPU MHz: 2400.000\n", "BogoMIPS: 4800.00\n", "Hypervisor vendor: VMware\n", "Virtualization type: full\n", "L1d cache: 256 KiB\n", "L1i cache: 256 KiB\n", "L2 cache: 2 MiB\n", "L3 cache: 128 MiB\n", "NUMA node0 CPU(s): 0-7\n", "Vulnerability Itlb multihit: KVM: Mitigation: VMX unsupported\n", "Vulnerability L1tf: Not affected\n", "Vulnerability Mds: Not affected\n", "Vulnerability Meltdown: Not affected\n", "Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\n", "Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\n", "Vulnerability Spectre v2: Mitigation; Enhanced IBRS, IBPB conditional, RSB filling\n", "Vulnerability Srbds: Not affected\n", "Vulnerability Tsx async abort: Not affected\n", "Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon nopl xtopology tsc_reliable nonstop_tsc cpuid pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 invpcid rdseed adx smap clflushopt xsaveopt xsavec xgetbv1 xsaves arat pku ospke md_clear flush_l1d arch_capabilities\n", "\n", "------------\n", "\n", "Memory Specs\n", "\n", " total used free shared buff/cache available\n", "Mem: 53Gi 3.6Gi 45Gi 428Mi 4.1Gi 48Gi\n", "Swap: 2.0Gi 0B 2.0Gi\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "No LSB modules are available.\n" ] } ], "source": [ "%%bash\n", "echo \"TODAY'S DATE:\"\n", "date\n", "echo \"------------\"\n", "echo \"\"\n", "#Display operating system info\n", "lsb_release -a\n", "echo \"\"\n", "echo \"------------\"\n", "echo \"HOSTNAME: \"; hostname \n", "echo \"\"\n", "echo \"------------\"\n", "echo \"Computer Specs:\"\n", "echo \"\"\n", "lscpu\n", "echo \"\"\n", "echo \"------------\"\n", "echo \"\"\n", "echo \"Memory Specs\"\n", "echo \"\"\n", "free -mh" ] }, { "cell_type": "markdown", "id": "19866f68-bfad-4a83-adb5-24e271e29d06", "metadata": {}, "source": [ "### Set variables\n", "- `%env` indicates a bash variable; without `%env` is Python variable" ] }, { "cell_type": "code", "execution_count": 1, "id": "7293bcb0-581c-4ad2-8f1e-09dd98352aaf", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "env: data_dir=/home/samb/data/S_salar/genomes\n", "env: analysis_dir=/home/samb/analyses/20210601_ssal_gff-annotations\n", "env: orig_gff=GCF_000233375.1_ICSASG_v2_genomic.gff\n", "env: orig_gff_url=https://gannet.fish.washington.edu/metacarcinus/Salmo_Calig/GENOMES/v2/RefSeq\n", "env: perl_output=20210601_ssal_uniprot_batch_results.txt\n", "env: gtf_extractor_output=20210601_ssal_chrom-start-end-Dbxref.csv\n", "env: gene_list=20210601_ssal_gene-list.txt\n", "env: parsed_uniprot=20210601_ssal_accession-gene_id-gene-gene_description-go_ids.csv\n", "env: joined_output=20210601_ssal_chrom-gene_id_start-end-acc-gene-gene_description-go_ids.csv\n" ] } ], "source": [ "# Set directories, input/output files\n", "%env data_dir=/home/samb/data/S_salar/genomes\n", "%env analysis_dir=/home/samb/analyses/20210601_ssal_gff-annotations\n", "analysis_dir=\"/home/samb/analyses/20210601_ssal_gff-annotations\"\n", "# Input GFF\n", "%env orig_gff=GCF_000233375.1_ICSASG_v2_genomic.gff\n", "%env orig_gff_url=https://gannet.fish.washington.edu/metacarcinus/Salmo_Calig/GENOMES/v2/RefSeq\n", "\n", "# UniProt batch output\n", "%env perl_output=20210601_ssal_uniprot_batch_results.txt\n", "\n", "# GTF extractor output\n", "%env gtf_extractor_output=20210601_ssal_chrom-start-end-Dbxref.csv\n", "\n", "# Gene name list for UniProt batch submission\n", "%env gene_list=20210601_ssal_gene-list.txt\n", "\n", "# Parsed UniProt\n", "%env parsed_uniprot=20210601_ssal_accession-gene_id-gene-gene_description-go_ids.csv\n", "\n", "# Final output\n", "%env joined_output=20210601_ssal_chrom-gene_id_start-end-acc-gene-gene_description-go_ids.csv" ] }, { "cell_type": "markdown", "id": "56052f6d-441a-4048-8a6f-39d58552283d", "metadata": {}, "source": [ "### Download GFF" ] }, { "cell_type": "code", "execution_count": 3, "id": "951fc8e9-b821-4f54-848f-f9573daadc83", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "-rw-rw-r-- 1 samb samb 828M Sep 30 2020 GCF_000233375.1_ICSASG_v2_genomic.gff\n" ] } ], "source": [ "%%bash\n", "cd \"${data_dir}\"\n", "\n", "# Download with wget. Use --no-check-certificate to avoid issues with Gannet certificate\n", "# Use --quiet option to prevent wget output from printing too many lines to notebook\n", "wget --no-check-certificate --quiet ${orig_gff_url}/${orig_gff}\n", "\n", "ls -ltrh \"${orig_gff}\"" ] }, { "cell_type": "markdown", "id": "7eb8b7c0-5927-44c5-ba79-cbee1d5a77fb", "metadata": {}, "source": [ "### Examine GFF" ] }, { "cell_type": "code", "execution_count": 4, "id": "18862291-d1ec-4b62-8b22-959404538a7f", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "##gff-version 3\n", "#!gff-spec-version 1.21\n", "#!processor NCBI annotwriter\n", "#!genome-build ICSASG_v2\n", "#!genome-build-accession NCBI_Assembly:GCF_000233375.1\n", "#!annotation-source NCBI Salmo salar Annotation Release 100\n", "##sequence-region NC_027300.1 1 159038749\n", "##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=8030\n", "NC_027300.1\tRefSeq\tregion\t1\t159038749\t.\t+\t.\tID=NC_027300.1:1..159038749;Dbxref=taxon:8030;Name=ssa01;breed=double haploid;chromosome=ssa01;dev-stage=adult;gbkey=Src;genome=chromosome;isolate=Sally;mol_type=genomic DNA;sex=female;tissue-type=muscle\n", "NC_027300.1\tGnomon\tgene\t5501\t62139\t.\t-\t.\tID=gene-LOC106560212;Dbxref=GeneID:106560212;Name=LOC106560212;gbkey=Gene;gene=LOC106560212;gene_biotype=protein_coding\n", "NC_027300.1\tGnomon\tmRNA\t5501\t62139\t.\t-\t.\tID=rna-XM_014160784.1;Parent=gene-LOC106560212;Dbxref=GeneID:106560212,Genbank:XM_014160784.1;Name=XM_014160784.1;gbkey=mRNA;gene=LOC106560212;model_evidence=Supporting evidence includes similarity to: 99%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 8 samples with support for all annotated introns;product=fibroblast growth factor receptor 3-like;transcript_id=XM_014160784.1\n", "NC_027300.1\tGnomon\texon\t61647\t62139\t.\t-\t.\tID=exon-XM_014160784.1-1;Parent=rna-XM_014160784.1;Dbxref=GeneID:106560212,Genbank:XM_014160784.1;gbkey=mRNA;gene=LOC106560212;product=fibroblast growth factor receptor 3-like;transcript_id=XM_014160784.1\n", "NC_027300.1\tGnomon\texon\t43486\t43714\t.\t-\t.\tID=exon-XM_014160784.1-2;Parent=rna-XM_014160784.1;Dbxref=GeneID:106560212,Genbank:XM_014160784.1;gbkey=mRNA;gene=LOC106560212;product=fibroblast growth factor receptor 3-like;transcript_id=XM_014160784.1\n", "NC_027300.1\tGnomon\texon\t23978\t24241\t.\t-\t.\tID=exon-XM_014160784.1-3;Parent=rna-XM_014160784.1;Dbxref=GeneID:106560212,Genbank:XM_014160784.1;gbkey=mRNA;gene=LOC106560212;product=fibroblast growth factor receptor 3-like;transcript_id=XM_014160784.1\n", "NC_027300.1\tGnomon\texon\t16966\t17019\t.\t-\t.\tID=exon-XM_014160784.1-4;Parent=rna-XM_014160784.1;Dbxref=GeneID:106560212,Genbank:XM_014160784.1;gbkey=mRNA;gene=LOC106560212;product=fibroblast growth factor receptor 3-like;transcript_id=XM_014160784.1\n", "NC_027300.1\tGnomon\texon\t5501\t5691\t.\t-\t.\tID=exon-XM_014160784.1-5;Parent=rna-XM_014160784.1;Dbxref=GeneID:106560212,Genbank:XM_014160784.1;gbkey=mRNA;gene=LOC106560212;product=fibroblast growth factor receptor 3-like;transcript_id=XM_014160784.1\n", "NC_027300.1\tGnomon\tCDS\t43486\t43633\t.\t-\t0\tID=cds-XP_014016259.1;Parent=rna-XM_014160784.1;Dbxref=GeneID:106560212,Genbank:XP_014016259.1;Name=XP_014016259.1;gbkey=CDS;gene=LOC106560212;product=fibroblast growth factor receptor 3-like;protein_id=XP_014016259.1\n", "NC_027300.1\tGnomon\tCDS\t23978\t24241\t.\t-\t2\tID=cds-XP_014016259.1;Parent=rna-XM_014160784.1;Dbxref=GeneID:106560212,Genbank:XP_014016259.1;Name=XP_014016259.1;gbkey=CDS;gene=LOC106560212;product=fibroblast growth factor receptor 3-like;protein_id=XP_014016259.1\n", "NC_027300.1\tGnomon\tCDS\t16966\t17019\t.\t-\t2\tID=cds-XP_014016259.1;Parent=rna-XM_014160784.1;Dbxref=GeneID:106560212,Genbank:XP_014016259.1;Name=XP_014016259.1;gbkey=CDS;gene=LOC106560212;product=fibroblast growth factor receptor 3-like;protein_id=XP_014016259.1\n", "NC_027300.1\tGnomon\tCDS\t5501\t5691\t.\t-\t2\tID=cds-XP_014016259.1;Parent=rna-XM_014160784.1;Dbxref=GeneID:106560212,Genbank:XP_014016259.1;Name=XP_014016259.1;gbkey=CDS;gene=LOC106560212;product=fibroblast growth factor receptor 3-like;protein_id=XP_014016259.1\n" ] } ], "source": [ "%%bash\n", "head -n 20 \"${data_dir}\"/\"${orig_gff}\"" ] }, { "cell_type": "markdown", "id": "5f389b71-582f-41e5-b9a0-bd2491d968b0", "metadata": {}, "source": [ "### Use GFFutils to extract gene features" ] }, { "cell_type": "code", "execution_count": 7, "id": "3dc274bf-228a-4d08-bd30-de6872a9ecc0", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "\n", "real\t2m26.807s\n", "user\t2m26.279s\n", "sys\t0m0.468s\n" ] } ], "source": [ "%%bash\n", "# Extract just gene features\n", "# Extract chromosome name, start, end, and Dbxref fields\n", "# Dbxref is the NCBI gene name, in this particular instance\n", "# Specify input as GFF\n", "# Use awk to format as comma-delimited output to help with downstream parsing/joining\n", "time \\\n", "gtf_extract \\\n", "--feature gene \\\n", "--fields=chrom,start,end,Dbxref \\\n", "--gff ${data_dir}/${orig_gff} \\\n", "| awk 'BEGIN { OFS = \",\"; FS=\"[\\t:]\"} {print $1, $2, $3, $5}' \\\n", "> ${analysis_dir}/${gtf_extractor_output}" ] }, { "cell_type": "markdown", "id": "a718c81a-4fb3-4bab-8dc5-1466ba2452af", "metadata": {}, "source": [ "#### Check results" ] }, { "cell_type": "code", "execution_count": 8, "id": "5e368eb2-8d11-414f-be44-f369dc3c3c6d", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "-rw-rw-r-- 1 samb samb 2.9M Jun 1 09:49 20210601_ssal_chrom-start-end-Dbxref.csv\n", "\n", "NC_027300.1,5501,62139,106560212\n", "NC_027300.1,160437,198815,106607996\n", "NC_027300.1,228330,231471,106601976\n", "NC_027300.1,296031,297111,106560213\n", "NC_027300.1,306942,310878,106566220\n", "NC_027300.1,331369,346454,106571988\n", "NC_027300.1,355675,362950,106578259\n", "NC_027300.1,401623,416794,106583877\n", "NC_027300.1,431662,432555,106589664\n", "NC_027300.1,449112,490663,106596642\n" ] } ], "source": [ "%%bash\n", "cd \"${analysis_dir}\"\n", "ls -ltrh ${gtf_extractor_output}\n", "\n", "echo \"\"\n", "\n", "head ${gtf_extractor_output}" ] }, { "cell_type": "markdown", "id": "d40dadc7-7dbd-40ca-9e2c-80f8d93568e4", "metadata": {}, "source": [ "#### Confirm that GFFutils output seem okay" ] }, { "cell_type": "code", "execution_count": 10, "id": "c24d1cd0-ac07-42c4-810f-008cf42f7fd5", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "GFFutils number of extracted genes:\n", "79030\n", "\n", "awk number of extracted genes:\n", "79030\n" ] } ], "source": [ "%%bash\n", "# Count gene features via GFFutils\n", "echo \"GFFutils number of extracted genes:\"\n", "gtf_extract -f gene --fields=Dbxref --gff ${data_dir}/${orig_gff} | wc -l\n", "\n", "echo \"\"\n", "\n", "# Count gene features via awk\n", "echo \"awk number of extracted genes:\"\n", "awk '$3 == \"gene\" { print $0 }' ${data_dir}/${orig_gff} | wc -l\n" ] }, { "cell_type": "markdown", "id": "cb7b736d-0cbb-42e1-8f25-2dd185d20df7", "metadata": {}, "source": [ "### Extract gene ids for batch submission to UniProt" ] }, { "cell_type": "code", "execution_count": 11, "id": "65d1da85-bce0-45aa-be5b-248e025aff28", "metadata": {}, "outputs": [], "source": [ "%%bash\n", "cd \"${analysis_dir}\"\n", "awk -F\",\" '{print $4}' \"${gtf_extractor_output}\" > \"${gene_list}\"" ] }, { "cell_type": "markdown", "id": "2cda29e9-937c-49d4-9c14-1af9bc9eaf3d", "metadata": {}, "source": [ "#### Check gene list" ] }, { "cell_type": "code", "execution_count": 12, "id": "e92ee93a-5699-4310-b7a3-efd2ccd3b4c7", "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "106560212\n", "106607996\n", "106601976\n", "106560213\n", "106566220\n", "106571988\n", "106578259\n", "106583877\n", "106589664\n", "106596642\n" ] } ], "source": [ "%%bash\n", "cd \"${analysis_dir}\"\n", "head \"${gene_list}\"" ] }, { "cell_type": "markdown", "id": "87c0a52f-8e16-4cad-acaf-cf656f5ffe78", "metadata": {}, "source": [ "### Batch submission to UniProt\n", "\n", "Perl script obtained from UniProt: https://www.uniprot.org/help/api_batch_retrieval\n", "\n", "Modified to map NCIB gene ID to UniProt accession." ] }, { "cell_type": "code", "execution_count": 13, "id": "930d79e1-d513-4e78-9783-2757f0f22c31", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "use strict;\n", "use warnings;\n", "use LWP::UserAgent;\n", "\n", "my $list = $ARGV[0]; # File containg list of UniProt identifiers.\n", "\n", "my $base = 'https://www.uniprot.org';\n", "my $tool = 'uploadlists';\n", "\n", "my $contact = 'samwhite@uw.edu'; # Please set a contact email address here to help us debug in case of problems (see https://www.uniprot.org/help/privacy).\n", "my $agent = LWP::UserAgent->new(agent => \"libwww-perl $contact\");\n", "push @{$agent->requests_redirectable}, 'POST';\n", "\n", "my $response = $agent->post(\"$base/$tool/\",\n", " [ 'file' => [$list],\n", " 'format' => 'txt',\n", " 'from' => 'P_ENTREZGENEID',\n", " 'to' => 'ACC',\n", " ],\n", " 'Content_Type' => 'form-data');\n", "\n", "while (my $wait = $response->header('Retry-After')) {\n", " print STDERR \"Waiting ($wait)...\\n\";\n", " sleep $wait;\n", " $response = $agent->get($response->base);\n", "}\n", "\n", "$response->is_success ?\n", " print $response->content :\n", " die 'Failed, got ' . $response->status_line .\n", " ' for ' . $response->request->uri . \"\\n\";\n" ] } ], "source": [ "%%bash\n", "# Print the script for viewing\n", "cat /home/samb/programs/uniprot_mapping.pl" ] }, { "cell_type": "code", "execution_count": 16, "id": "c15fe95c-9d2d-40f6-92d8-61f1b866966f", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "total 354M\n", "-rw-rw-r-- 1 samb samb 2.9M Jun 1 09:49 20210601_ssal_chrom-start-end-Dbxref.csv\n", "-rw-rw-r-- 1 samb samb 772K Jun 1 09:58 20210601_ssal_gene-list.txt\n", "-rw-rw-r-- 1 samb samb 350M Jun 1 10:04 20210601_ssal_uniprot_batch_results.txt\n", "\n", "\n", "--------------------------------------------------\n", "\n", "Line count:\n", "7273462 20210601_ssal_uniprot_batch_results.txt\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\n", "real\t3m32.207s\n", "user\t0m3.099s\n", "sys\t0m9.446s\n" ] } ], "source": [ "%%bash\n", "cd \"${analysis_dir}\"\n", "time \\\n", "perl /home/samb/programs/uniprot_mapping.pl \"${gene_list}\" > \"${perl_output}\"\n", "ls -ltrh\n", "\n", "echo \"\"\n", "echo \"\"\n", "echo \"--------------------------------------------------\"\n", "echo \"\"\n", "echo \"Line count:\"\n", "wc -l \"${perl_output}\"" ] }, { "cell_type": "markdown", "id": "a3d6548e-44b1-4f84-8211-7466d35314e4", "metadata": {}, "source": [ "#### Check mapping output" ] }, { "cell_type": "code", "execution_count": 17, "id": "adf8f956-eb73-479c-8180-ef88b3a33dda", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "ID A0A1S3NLH1_SALSA Unreviewed; 218 AA.\n", "AC A0A1S3NLH1;\n", "DT 12-APR-2017, integrated into UniProtKB/TrEMBL.\n", "DT 12-APR-2017, sequence version 1.\n", "DT 07-APR-2021, entry version 14.\n", "DE SubName: Full=fibroblast growth factor receptor 3-like {ECO:0000313|RefSeq:XP_014016259.1};\n", "GN Name=LOC106560212 {ECO:0000313|RefSeq:XP_014016259.1};\n", "OS Salmo salar (Atlantic salmon).\n", "OC Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;\n", "OC Actinopterygii; Neopterygii; Teleostei; Protacanthopterygii; Salmoniformes;\n", "OC Salmonidae; Salmoninae; Salmo.\n", "OX NCBI_TaxID=8030 {ECO:0000313|Proteomes:UP000087266, ECO:0000313|RefSeq:XP_014016259.1};\n", "RN [1] {ECO:0000313|RefSeq:XP_014016259.1}\n", "RP IDENTIFICATION.\n", "RC TISSUE=Muscle {ECO:0000313|RefSeq:XP_014016259.1};\n", "RG RefSeq;\n", "RL Submitted (OCT-2020) to UniProtKB.\n", "CC ---------------------------------------------------------------------------\n", "CC Copyrighted by the UniProt Consortium, see https://www.uniprot.org/terms\n", "CC Distributed under the Creative Commons Attribution (CC BY 4.0) License\n", "CC ---------------------------------------------------------------------------\n", "DR RefSeq; XP_014016259.1; XM_014160784.1.\n", "DR GeneID; 106560212; -.\n", "DR KEGG; sasa:106560212; -.\n", "DR OrthoDB; 1497110at2759; -.\n", "DR Proteomes; UP000087266; Genome assembly.\n", "DR GO; GO:0016021; C:integral component of membrane; IEA:UniProtKB-KW.\n", "DR Gene3D; 2.60.40.10; -; 2.\n", "DR InterPro; IPR007110; Ig-like_dom.\n", "DR InterPro; IPR036179; Ig-like_dom_sf.\n" ] } ], "source": [ "%%bash\n", "cd \"${analysis_dir}\"\n", "head -n 30 \"${perl_output}\"" ] }, { "cell_type": "markdown", "id": "e7ed5651-cdb6-490e-b0f3-ee7daa4fe5e8", "metadata": {}, "source": [ "### Parse out the stuff we want:\n", "\n", "- UniProt accession\n", "\n", "- Gene ID (NCBI gene ID)\n", "\n", "- Gene name/abbraviation\n", "\n", "- Gene description\n", "\n", "- GO terms" ] }, { "cell_type": "code", "execution_count": 18, "id": "bd3fa826-98fb-49c9-ae94-c5bd91f0551b", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "IOPub data rate exceeded.\n", "The Jupyter server will temporarily stop sending output\n", "to the client in order to avoid crashing it.\n", "To change this limit, set the config variable\n", "`--ServerApp.iopub_data_rate_limit`.\n", "\n", "Current values:\n", "ServerApp.iopub_data_rate_limit=1000000.0 (bytes/sec)\n", "ServerApp.rate_limit_window=3.0 (secs)\n", "\n" ] } ], "source": [ "%%bash\n", "cd \"${analysis_dir}\"\n", "time \\\n", "while read -r line\n", "do\n", " # Get record line descriptor\n", " descriptor=$(echo \"${line}\" | awk '{print $1}' )\n", "\n", " # Capture second field for evaluation\n", " go_line=$(echo \"${line}\" | awk '{print $2}')\n", "\n", " # Append GO IDs to array\n", " if [[ \"${go_line}\" == \"GO;\" ]]; then\n", " go_id=$(echo \"${line}\" | awk '{print $3}')\n", " go_ids_array+=(\"${go_id}\")\n", " elif [[ \"${go_line}\" == \"GeneID;\" ]]; then\n", " # Uses sed to strip trailing semi-colon\n", " gene_id=$(echo \"${line}\" | awk '{print $3}' | sed 's/;$//')\n", " fi\n", "\n", " # Get gene description\n", " if [[ \"${descriptor}\" == \"DE\" ]] && [[ \"${go_line}\" == \"SubName:\" ]]; then\n", " # Uses sed to strip trailing spaces at end of line and remove commas\n", " gene_description=$(echo \"${line}\" | awk -F\"[={]\" '{print $2}' | sed 's/[[:blank:]]*$//' | sed 's/,//g')\n", "\n", " # Get gene name\n", " elif [[ \"${descriptor}\" == \"GN\" ]] && [[ $(echo \"${line}\" | awk -F \"=\" '{print $1}') == \"GN Name\" ]]; then\n", " # Uses sed to strip trailing spaces at end of line\n", " gene=$(echo \"${line}\" | awk -F'Name=|{' '{print $2}' | sed 's/[[:blank:]]*$//')\n", "\n", " # Get UniProt accession\n", " elif [[ \"${descriptor}\" == \"AC\" ]]; then\n", " # Uses sed to strip trailing semi-colon\n", " accession=$(echo \"${line}\" | awk '{print $2}' | sed 's/;$//')\n", "\n", " # Identify beginning on new record\n", " elif [[ \"${descriptor}\" == \"//\" ]]; then\n", "\n", " # Prints other comma-separated variables, then GOID1;GOID2;GOIDn\n", " # IFS prevents spaces from being added between GO IDs\n", " # sed removes \";\" after final GO ID\n", " (IFS=; printf \"%s,%s,%s,%s,%s\\n\" \"${accession}\" \"${gene_id}\" \"${gene}\" \"${gene_description}\" \"${go_ids_array[*]}\" | sed 's/;$//')\n", "\n", " # Re-initialize variables\n", " accession=\"\" \n", " descriptor=\"\"\n", " gene=\"\"\n", " gene_description\n", " gene_id=\"\"\n", " go_id=\"\"\n", " go_ids_array=()\n", " fi\n", "\n", "\n", "done < \"${perl_output}\" >> \"${parsed_uniprot}\"" ] }, { "cell_type": "markdown", "id": "fa47aee6-8a09-41a8-887a-ce740b589e61", "metadata": {}, "source": [ "Despite notebook error message, if you check the time stamps on the files below, it looks like this took nearly 6.5hrs!!" ] }, { "cell_type": "markdown", "id": "ffb0df40-e39a-4060-8e9d-16b98c2504bb", "metadata": {}, "source": [ "#### Check parsed file" ] }, { "cell_type": "code", "execution_count": 20, "id": "83203a01-b1e4-4f14-899f-630e5917cb10", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "total 362M\n", "-rw-rw-r-- 1 samb samb 2.9M Jun 1 09:49 20210601_ssal_chrom-start-end-Dbxref.csv\n", "-rw-rw-r-- 1 samb samb 772K Jun 1 09:58 20210601_ssal_gene-list.txt\n", "-rw-rw-r-- 1 samb samb 350M Jun 1 10:04 20210601_ssal_uniprot_batch_results.txt\n", "-rw-rw-r-- 1 samb samb 8.0M Jun 1 16:31 20210601_ssal_accession-gene_id-gene-gene_description-go_ids.csv\n", "\n", "A0A1S3NLH1,106560212,LOC106560212,fibroblast growth factor receptor 3-like,GO:0016021\n", "A0A1S3SK04,106607996,LOC106607996,histone-lysine N-methyltransferase NSD2-like,GO:0005694;GO:0005634;GO:0018024;GO:0046872\n", "A0A1S3RMY2,106601976,LOC106601976,fibroblast growth factor receptor 3-like,GO:0005524;GO:0004672\n", "A0A1S3KV69,106560213,LOC106560213,phospholipase B1 membrane-associated-like,GO:0004620\n", "A0A1S3LSJ1,106566220,LOC106566220,forkhead box protein I1c-like,GO:0005634;GO:0003700;GO:0043565\n", "A0A1S3MPI8,106571988,LOC106571988,GDNF family receptor alpha-4-like,GO:0005886;GO:0038023\n", "A0A1S3NLZ6,106578259,LOC106578259,attractin-like,\n", "A0A1S3PJR5,106583877,LOC106583877,sodium bicarbonate transporter-like protein 11,GO:0016020;GO:0005452\n", "A0A1S3QEI4,106589664,LOC106589664,G-protein coupled receptor 4-like,GO:0016021;GO:0004930\n", "A0A1S3QYM9,106596642,LOC106596642,sodium bicarbonate transporter-like protein 11,GO:0016021;GO:0005452\n", "\n", "\n", "--------------------------------------------------\n", "\n", "Line count:\n", "82393 20210601_ssal_accession-gene_id-gene-gene_description-go_ids.csv\n" ] } ], "source": [ "%%bash\n", "cd \"${analysis_dir}\"\n", "ls -ltrh\n", "echo \"\"\n", "head \"${parsed_uniprot}\"\n", "\n", "echo \"\"\n", "echo \"\"\n", "echo \"--------------------------------------------------\"\n", "echo \"\"\n", "echo \"Line count:\"\n", "wc -l \"${parsed_uniprot}\"" ] }, { "cell_type": "markdown", "id": "3fc10e91-4053-4844-ae64-3b26a01b9fd6", "metadata": {}, "source": [ "Line count looks reasonable, as I know that some NCBI gene IDs are associated with multiple UniProt accessions, so we should end up with more results than were submitted." ] }, { "cell_type": "markdown", "id": "d09638c7-fbc7-431b-9243-fef7a307df60", "metadata": {}, "source": [ "### Join parsed UniProt info with chromosome names" ] }, { "cell_type": "markdown", "id": "87de8d8f-03d0-4073-b119-dafe9e6da355", "metadata": {}, "source": [ "This will sort the both files on the columns with the NCBI gene ID for joining.\n", "\n", "Then, it will replace the commas with tabs and re-order the columns so that the NCBI chromosome is in the first column." ] }, { "cell_type": "code", "execution_count": 22, "id": "acfe96dd-553c-47f9-bd79-ab7f840a571f", "metadata": {}, "outputs": [], "source": [ "%%bash\n", "cd \"${analysis_dir}\"\n", "join \\\n", "-t \",\" \\\n", "-1 4 \\\n", "-2 2 \\\n", "<(sort -t \",\" -k 4,4 \"${gtf_extractor_output}\") \\\n", "<(sort -t \",\" -k2,2 \"${parsed_uniprot}\") \\\n", "| awk 'BEGIN {FS=\",\"; OFS=\"\\t\"} {print $2, $1, $3, $4, $5, $6, $7, $8}' \\\n", "> \"${joined_output}\"" ] }, { "cell_type": "code", "execution_count": 23, "id": "0d175269-a71e-4b57-ac92-48bfd0bce38e", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "total 372M\n", "-rw-rw-r-- 1 samb samb 2.9M Jun 1 09:49 20210601_ssal_chrom-start-end-Dbxref.csv\n", "-rw-rw-r-- 1 samb samb 772K Jun 1 09:58 20210601_ssal_gene-list.txt\n", "-rw-rw-r-- 1 samb samb 350M Jun 1 10:04 20210601_ssal_uniprot_batch_results.txt\n", "-rw-rw-r-- 1 samb samb 8.0M Jun 1 16:31 20210601_ssal_accession-gene_id-gene-gene_description-go_ids.csv\n", "-rw-rw-r-- 1 samb samb 11M Jun 1 20:21 20210601_ssal_chrom-gene_id_start-end-acc-gene-gene_description-go_ids.csv\n", "\n", "Line count:\n", "82570 20210601_ssal_chrom-gene_id_start-end-acc-gene-gene_description-go_ids.csv\n" ] } ], "source": [ "%%bash\n", "cd \"${analysis_dir}\"\n", "ls -ltrh\n", "\n", "echo \"\"\n", "\n", "echo \"Line count:\"\n", "wc -l \"${joined_output}\"" ] }, { "cell_type": "markdown", "id": "a3737529-e4b1-450f-ada2-329f2cd8ace8", "metadata": {}, "source": [ "Hmmm, does this line count make sense?\n", "\n", "Submitted to UniProt: 79030\n", "\n", "Returned from UniProt: 82393\n", "\n", "Joined: 82570" ] }, { "cell_type": "markdown", "id": "dddbd0e3-58fc-41a4-9205-add3ef3d20c6", "metadata": {}, "source": [ "Look at the file. Columns will be tab-separated in this order:\n", "\n", "| chromosome | NCBI gene ID | start | end | UniProt accession | gene abbreviation/name | gene description | GO IDs |\n", "|---|----|----|----|---|----|----|----|" ] }, { "cell_type": "code", "execution_count": 24, "id": "d262ac1e-fdf3-47d3-90a8-338b9eb3b173", "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "NC_027327.1\t100135779\t7914675\t7922535\tQ7T2G8\tfsh-R\tneurexin-1a-like isoform X17\tGO:0016323;GO:0016021;GO:0016020;GO:0004963;GO:0004996;GO:0032354;GO:0034699\n", "NC_027321.1\t100136349\t24148576\t24153099\tQ8QHK5\tdj-1\t28S ribosomal protein S16 mitochondrial-like\tGO:0045121;GO:0005886;GO:0006914;GO:0007338\n", "NC_027321.1\t100136351\t63087382\t63155499\tQ2V2G8\tRAF1c\tSerine/threonine protein kinase RAF1c\tGO:0005524;GO:0030553;GO:0106310;GO:0106311;GO:0035556\n", "NC_027302.1\t100136352\t92389729\t92396132\tB5X240\tACTB\tactin cytoplasmic 1 isoform X2\tGO:0005856;GO:0097433;GO:0005925;GO:0005886\n", "NC_027302.1\t100136352\t92389729\t92396132\tO42161\tactb;\tcaskin-1-like\tGO:0015629;GO:0005737;GO:0005634;GO:0005886;GO:0005524\n", "NC_027318.1\t100136353\t42266727\t42274733\tO57560\tLOC100136353\tMBT domain-containing protein 1-like\tGO:0005524;GO:0004550;GO:0006241;GO:0006183;GO:0006228\n", "NC_027306.1\t100136354\t58238908\t58240915\tA0A1S2WYK4\tLOC100136354\tLIM domain only protein 3 isoform X1\tGO:0005840;GO:0003735;GO:0006412\n", "NC_027306.1\t100136354\t58238908\t58240915\tO57561\trpl18a;\tLIM domain only protein 3 isoform X1\tGO:0005840;GO:0003735;GO:0006412\n", "NC_027321.1\t100136355\t8772597\t8774381\tA8YTA4\ttshb\tthyrotropin subunit beta precursor\tGO:0005576;GO:0005179\n", "NC_027321.1\t100136355\t8772597\t8774381\tO73824\ttshb;\tmitochondrial glutamate carrier 1-like isoform X1\tGO:0005576;GO:0005179\n" ] } ], "source": [ "%%bash\n", "cd \"${analysis_dir}\"\n", "\n", "head \"${joined_output}\"" ] }, { "cell_type": "markdown", "id": "c0b01c2f-2c68-4821-ae0e-f19c43231119", "metadata": {}, "source": [ "### Generate cheksums\n", "\n", "Forgot to do this before closing notebook." ] }, { "cell_type": "code", "execution_count": 2, "id": "9a8e9dcb-eb74-4af4-8924-0553fe6f3596", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "e7d970782d7f531967dbfce01e5df549 20210601_ssal_accession-gene_id-gene-gene_description-go_ids.csv\n", "5288ed387a6b1155cca11f25b9a9e3ca 20210601_ssal_chrom-gene_id_start-end-acc-gene-gene_description-go_ids.csv\n", "f4182e5129978328b0e9ae2b07d0bbf7 20210601_ssal_chrom-start-end-Dbxref.csv\n", "0d330da91260189090ba2fac1ca0340f 20210601_ssal_gene-list.txt\n", "81f63345d2f2cfbabdc8d60c3326ba66 20210601_ssal_uniprot_batch_results.txt\n" ] } ], "source": [ "%%bash\n", "cd \"${analysis_dir}\"\n", "\n", "for file in *\n", "do\n", " md5sum \"${file}\" | tee --append checksums.md5\n", "done" ] }, { "cell_type": "code", "execution_count": null, "id": "b62f2075-672e-427d-8def-7b1f82743f2c", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.5" } }, "nbformat": 4, "nbformat_minor": 5 }