{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Create _P.generosa_ primary gene annotations mapping file\n",
    "\n",
    "See this [GitHub Issue](https://github.com/RobertsLab/resources/issues/1443).\n",
    "\n",
    "This notebook will utilize NCBI BLAST and DIAMOND BLAST annotations generated by our [GenSas _P.generosa_ genome annotation](https://robertslab.github.io/sams-notebook/2019/09/28/Genome-Annotation-Pgenerosa_v074-a4-Using-GenSAS.html).\n",
    "\n",
    "It will compare the two sets of SwissProt ID annotations (SPIDs) to determine lowest E-value and use that entry as the representative entry for a gene. It will then use that canonical list of SPIDs to pull gene names and gene ontology (GO) IDs from UniProt, and create a tab-deltimited annotation mapping file."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### List computer specs"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "TODAY'S DATE\n",
      "Wed 20 Apr 2022 06:24:50 AM PDT\n",
      "------------\n",
      "\n",
      "Distributor ID:\tUbuntu\n",
      "Description:\tUbuntu 20.04.4 LTS\n",
      "Release:\t20.04\n",
      "Codename:\tfocal\n",
      "\n",
      "------------\n",
      "HOSTNAME: \n",
      "computer\n",
      "\n",
      "------------\n",
      "Computer Specs:\n",
      "\n",
      "Architecture:                    x86_64\n",
      "CPU op-mode(s):                  32-bit, 64-bit\n",
      "Byte Order:                      Little Endian\n",
      "Address sizes:                   45 bits physical, 48 bits virtual\n",
      "CPU(s):                          2\n",
      "On-line CPU(s) list:             0,1\n",
      "Thread(s) per core:              1\n",
      "Core(s) per socket:              1\n",
      "Socket(s):                       2\n",
      "NUMA node(s):                    1\n",
      "Vendor ID:                       GenuineIntel\n",
      "CPU family:                      6\n",
      "Model:                           165\n",
      "Model name:                      Intel(R) Core(TM) i9-10885H CPU @ 2.40GHz\n",
      "Stepping:                        2\n",
      "CPU MHz:                         2400.006\n",
      "BogoMIPS:                        4800.01\n",
      "Hypervisor vendor:               VMware\n",
      "Virtualization type:             full\n",
      "L1d cache:                       64 KiB\n",
      "L1i cache:                       64 KiB\n",
      "L2 cache:                        512 KiB\n",
      "L3 cache:                        32 MiB\n",
      "NUMA node0 CPU(s):               0,1\n",
      "Vulnerability Itlb multihit:     KVM: Mitigation: VMX unsupported\n",
      "Vulnerability L1tf:              Mitigation; PTE Inversion\n",
      "Vulnerability Mds:               Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown\n",
      "Vulnerability Meltdown:          Mitigation; PTI\n",
      "Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\n",
      "Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization\n",
      "Vulnerability Spectre v2:        Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP disabled, RSB filling\n",
      "Vulnerability Srbds:             Not affected\n",
      "Vulnerability Tsx async abort:   Not affected\n",
      "Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon nopl xtopology tsc_reliable nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 avx2 smep bmi2 invpcid rdseed adx smap clflushopt xsaveopt xsavec xgetbv1 xsaves arat flush_l1d arch_capabilities\n",
      "\n",
      "------------\n",
      "\n",
      "Memory Specs\n",
      "\n",
      "              total        used        free      shared  buff/cache   available\n",
      "Mem:           54Gi       2.0Gi        50Gi        38Mi       2.6Gi        52Gi\n",
      "Swap:         2.0Gi          0B       2.0Gi\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "No LSB modules are available.\n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "echo \"TODAY'S DATE\"\n",
    "date\n",
    "echo \"------------\"\n",
    "echo \"\"\n",
    "lsb_release -a\n",
    "echo \"\"\n",
    "echo \"------------\"\n",
    "echo \"HOSTNAME: \"\n",
    "hostname\n",
    "echo \"\"\n",
    "echo \"------------\"\n",
    "echo \"Computer Specs:\"\n",
    "echo \"\"\n",
    "lscpu\n",
    "echo \"\"\n",
    "echo \"------------\"\n",
    "echo \"\"\n",
    "echo \"Memory Specs\"\n",
    "echo \"\"\n",
    "free -mh"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Set variables\n",
    "- `%env` indicates a bash variable\n",
    "- without `%env` is Python variablec"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "env: data_dir=/home/sam/data/P_generosa/genomes\n",
      "env: analysis_dir=/home/sam/analyses/20220419-pgen-gene_annotation_mapping\n",
      "env: base_url=https://gannet.fish.washington.edu/Atumefaciens/20190928_Pgenerosa_v074.a4_gensas_annotation\n",
      "env: blast_annotations=Panopea-generosa-vv0.74.a4.5d951a9b74287-blast_functional.tab\n",
      "env: diamond_annotations=Panopea-generosa-vv0.74.a4.5d951bcf45b4b-diamond_functional.tab\n",
      "env: uniprot_output=20220419-pgen-uniprot_batch-results.txt\n",
      "env: spid_list=Panopea-generosa-v1.0.a4-blast-diamond-functional-SPIDs.txt\n",
      "env: genome_IDs_SPIDs=Panopea-generosa-v1.0.a4-blast-diamond-functional-genome_IDs-SPIDs.txt\n",
      "env: blast_diamond_cat=Panopea-generosa-v1.0.a4-blast-diamond-functional.tab\n",
      "env: blast_diamond_cat_best=Panopea-generosa-v1.0.a4-blast-diamond-functional_best.tab\n",
      "env: parsed_uniprot=20220419-pgen-accession-gene_name-gene_description-go_ids.tab\n",
      "env: joined_output=20220419-pgen-gene-accessions-gene_id-gene_name-gene_description-alt_gene_description-go_ids.tab\n",
      "env: uniprot_mapping_script=/home/sam/programs/uniprot_mapping.pl\n"
     ]
    }
   ],
   "source": [
    "######################################################################\n",
    "### Set directories\n",
    "%env data_dir=/home/sam/data/P_generosa/genomes\n",
    "%env analysis_dir=/home/sam/analyses/20220419-pgen-gene_annotation_mapping\n",
    "analysis_dir=\"/home/sam/analyses/20220419-pgen-gene_annotation_mapping\"\n",
    "\n",
    "#####################################################################\n",
    "### Input files\n",
    "%env base_url=https://gannet.fish.washington.edu/Atumefaciens/20190928_Pgenerosa_v074.a4_gensas_annotation\n",
    "%env blast_annotations=Panopea-generosa-vv0.74.a4.5d951a9b74287-blast_functional.tab\n",
    "%env diamond_annotations=Panopea-generosa-vv0.74.a4.5d951bcf45b4b-diamond_functional.tab\n",
    "\n",
    "######################################################################\n",
    "### Output files\n",
    "# UniProt batch output\n",
    "%env uniprot_output=20220419-pgen-uniprot_batch-results.txt\n",
    "\n",
    "# Gene name list for UniProt batch submission\n",
    "%env spid_list=Panopea-generosa-v1.0.a4-blast-diamond-functional-SPIDs.txt\n",
    "\n",
    "# Genome IDs and SPIDs\n",
    "%env genome_IDs_SPIDs=Panopea-generosa-v1.0.a4-blast-diamond-functional-genome_IDs-SPIDs.txt\n",
    "\n",
    "%env blast_diamond_cat=Panopea-generosa-v1.0.a4-blast-diamond-functional.tab\n",
    "\n",
    "%env blast_diamond_cat_best=Panopea-generosa-v1.0.a4-blast-diamond-functional_best.tab\n",
    "\n",
    "# Parsed UniProt\n",
    "%env parsed_uniprot=20220419-pgen-accession-gene_name-gene_description-go_ids.tab\n",
    "\n",
    "# Final output\n",
    "%env joined_output=20220419-pgen-gene-accessions-gene_id-gene_name-gene_description-alt_gene_description-go_ids.tab\n",
    "\n",
    "######################################################################\n",
    "\n",
    "### Programs\n",
    "\n",
    "# UniProt batch submission/retrieval script\n",
    "%env uniprot_mapping_script=/home/sam/programs/uniprot_mapping.pl"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Make input/output directories"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "# If directories don't exist, make them\n",
    "mkdir --parents \"${data_dir}\" \"${analysis_dir}\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Download and inspect annotation files\n",
    "\n",
    "`--quiet`: Prevents `wget` output from overwhelming Jupyter Notebook\n",
    "\n",
    "`--continue`: If download was previously initiated, will continue where leftoff and will not create a second file if one already exists."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "total 2.6G\n",
      "-rw-rw-r-- 1 sam sam 1.5M Oct  3  2019 Panopea-generosa-vv0.74.a4.5d951a9b74287-blast_functional.tab\n",
      "-rw-rw-r-- 1 sam sam 1.3M Oct  3  2019 Panopea-generosa-vv0.74.a4.5d951bcf45b4b-diamond_functional.tab\n",
      "-rwxr-xr-x 1 sam sam 914M Nov  5  2019 Panopea-generosa-v1.0.fasta\n",
      "-rw-rw-r-- 1 sam sam 454M Mar 19 07:58 Panopea-generosa-v1.0.a4.gff3\n",
      "-rw-r--r-- 1 sam sam 503M Mar 22 06:48 Panopea-generosa-v1.0.a4_biotype.gff\n",
      "-rw-r--r-- 1 sam sam 4.8M Mar 24 07:30 Panopea-generosa-v1.0.a4_biotype-trna_strand_converted-no_RNAmmer.bed\n",
      "-rw-rw-r-- 1 sam sam  658 Mar 25 06:11 Panopea-generosa-v1.0.fa.fai\n",
      "-rw-rw-r-- 1 sam sam 9.7M Mar 30 11:03 Panopea-generosa-v1.0.a4_biotype.gtf\n",
      "-rw-rw-r-- 1 sam sam 507M Mar 30 11:43 Panopea-generosa-v1.0.a4_biotype.bed\n",
      "-rw-rw-r-- 1 sam sam  378 Mar 30 13:20 Panopea-generosa-v1.0.fa.lengths\n",
      "-rw-rw-r-- 1 sam sam 9.7M Mar 30 13:34 Panopea-generosa-v1.0.a4_biotype.sorted.gtf\n",
      "-rw-rw-r-- 1 sam sam 996K Mar 30 13:34 Panopea-generosa-v1.0.a4_biotype_non-coding.bed\n",
      "drwxrwxr-x 2 sam sam 4.0K Mar 30 13:46 feelnc_codpot_out\n",
      "-rw-rw-r-- 1 sam sam  70M Mar 31 06:00 Panopea-generosa-v1.0.a4.gtf\n",
      "-rw-rw-r-- 1 sam sam 101K Apr  6 12:22 spids.txt\n",
      "-rw-rw-r-- 1 sam sam 138M Apr  6 12:23 uniprot_mapping-all.txt\n",
      "-rw-rw-r-- 1 sam sam 282K Apr  7 07:13 uniprot_mapping-all-AC_only.txt\n",
      "-rw-rw-r-- 1 sam sam   78 Apr  7 13:57 File1.txt\n",
      "-rw-rw-r-- 1 sam sam  357 Apr  7 13:57 File2.txt\n",
      "\n",
      "---------------------------------------------------------\n",
      "\n",
      "==> Panopea-generosa-vv0.74.a4.5d951a9b74287-blast_functional.tab <==\n",
      "#\n",
      "# Output is generated by GenSAS 7.x-5.0\n",
      "#\n",
      "#name     : mRNA\n",
      "#start    : Start of alignment in subject\n",
      "#end      : End of alignment in subject\n",
      "#m_start  : Start of alignment in query\n",
      "#m_end    : End of alignment in query\n",
      "#al       : Alignment length\n",
      "#score    : Row score of the match\n",
      "#evalue   : E value of the match\n",
      "#identity : Percentage of identical matches\n",
      "mame\tstart\tend\tscore\tAccession\tMatch ID\tm_start\tm_end\tE-value\tidentity\tal\n",
      "21910-PGEN_.00g000010.m01\t121\t229\t165\tQ86IC9\tsp|Q86IC9|CAMT1_DICDI\t11\t122\t8.93e-14\t35.652\t115\n",
      "21910-PGEN_.00g000020.m01\t147\t467\t968\tP04177\tsp|P04177|TY3H_RAT\t20\t339\t3.47e-127\t55.140\t321\n",
      "21910-PGEN_.00g000050.m01\t566\t722\t182\tQ8L840\tsp|Q8L840|RQL4A_ARATH\t2\t167\t2.67e-14\t35.119\t168\n",
      "21910-PGEN_.00g000080.m01\t268\t322\t152\tA1E2V0\tsp|A1E2V0|BIRC3_CANLF\t163\t220\t3.91e-10\t53.448\t58\n",
      "21910-PGEN_.00g000090.m01\t199\t327\t161\tP34456\tsp|P34456|YMD2_CAEEL\t7\t134\t7.52e-12\t26.357\t129\n",
      "21910-PGEN_.00g000210.m01\t18\t200\t263\tO00463\tsp|O00463|TRAF5_HUMAN\t5\t191\t2.24e-25\t34.921\t189\n",
      "21910-PGEN_.00g000230.m01\t48\t155\t287\tQ00945\tsp|Q00945|CONO_LYMST\t31\t134\t1.59e-32\t50.000\t108\n",
      "21910-PGEN_.00g000240.m01\t4\t605\t1091\tQ5SWK7\tsp|Q5SWK7|RN145_MOUSE\t13\t601\t2.65e-139\t39.607\t611\n",
      "21910-PGEN_.00g000280.m01\t4\t153\t210\tQ8ZXT3\tsp|Q8ZXT3|Y1111_PYRAE\t853\t1012\t1.10e-17\t38.750\t160\n",
      "21910-PGEN_.00g000300.m01\t159\t347\t480\tQ5REG4\tsp|Q5REG4|DTX3_PONAB\t1135\t1320\t1.20e-51\t50.794\t189\n",
      "21910-PGEN_.00g000300.m02\t159\t347\t480\tQ5REG4\tsp|Q5REG4|DTX3_PONAB\t1138\t1323\t1.18e-51\t50.794\t189\n",
      "21910-PGEN_.00g000380.m01\t381\t508\t205\tQ8QG60\tsp|Q8QG60|CRY2_CHICK\t2\t145\t4.92e-18\t36.111\t144\n",
      "\n",
      "==> Panopea-generosa-vv0.74.a4.5d951bcf45b4b-diamond_functional.tab <==\n",
      "#\n",
      "# Output is generated by GenSAS 7.x-5.0\n",
      "#\n",
      "#name     : mRNA\n",
      "#start    : Start of alignment in subject\n",
      "#end      : End of alignment in subject\n",
      "#m_start  : Start of alignment in query\n",
      "#m_end    : End of alignment in query\n",
      "#al       : Alignment length\n",
      "#score    : Row score of the match\n",
      "#evalue   : E value of the match\n",
      "#identity : Percentage of identical matches\n",
      "mame\tstart\tend\tscore\tAccession\tMatch ID\tm_start\tm_end\tE-value\tidentity\tal\n",
      "21910-PGEN_.00g000020.m01\t147\t467\t945\tP04177\tsp|P04177|TY3H_RAT\t20\t339\t7.9e-101\t55.1\t321\n",
      "21910-PGEN_.00g000050.m01\t566\t722\t180\tQ8L840\tsp|Q8L840|RQL4A_ARATH\t2\t167\t2.4e-12\t35.1\t168\n",
      "21910-PGEN_.00g000060.m01\t1957\t2106\t129\tQ61043\tsp|Q61043|NIN_MOUSE\t31\t184\t1.7e-06\t26.0\t154\n",
      "21910-PGEN_.00g000080.m01\t233\t304\t134\tQ24307\tsp|Q24307|DIAP2_DROME\t174\t255\t6.2e-07\t34.1\t82\n",
      "21910-PGEN_.00g000120.m01\t6\t49\t118\tP34457\tsp|P34457|YMD3_CAEEL\t90\t133\t3.2e-05\t47.7\t44\n",
      "21910-PGEN_.00g000230.m01\t49\t155\t216\tQ00945\tsp|Q00945|CONO_LYMST\t32\t134\t9.9e-17\t49.5\t107\n",
      "21910-PGEN_.00g000240.m01\t4\t585\t1144\tQ5SWK7\tsp|Q5SWK7|RN145_MOUSE\t13\t592\t1.2e-123\t40.1\t591\n",
      "21910-PGEN_.00g000280.m01\t433\t592\t230\tQ9WYX8\tsp|Q9WYX8|Y508_THEMA\t863\t1022\t2.2e-17\t40.2\t164\n",
      "21910-PGEN_.00g000300.m01\t161\t347\t474\tQ80V91\tsp|Q80V91|DTX3_MOUSE\t1137\t1320\t1.2e-45\t51.3\t187\n",
      "21910-PGEN_.00g000300.m02\t161\t347\t474\tQ80V91\tsp|Q80V91|DTX3_MOUSE\t1140\t1323\t1.2e-45\t51.3\t187\n",
      "21910-PGEN_.00g000380.m01\t381\t508\t201\tQ8QG60\tsp|Q8QG60|CRY2_CHICK\t2\t145\t7.0e-15\t35.4\t144\n",
      "21910-PGEN_.00g000440.m01\t234\t1796\t1606\tQ9H583\tsp|Q9H583|HEAT1_HUMAN\t1\t1575\t8.0e-177\t30.0\t1624\n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "cd \"${data_dir}\"\n",
    "\n",
    "wget --quiet --continue ${base_url}/${blast_annotations}\n",
    "wget --quiet --continue ${base_url}/${diamond_annotations}\n",
    "\n",
    "ls -ltrh\n",
    "\n",
    "echo \"\"\n",
    "echo \"---------------------------------------------------------\"\n",
    "echo \"\"\n",
    "head -n 25 *.tab"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Count number of header lines (i.e. beginning with a `#`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Panopea-generosa-vv0.74.a4.5d951a9b74287-blast_functional.tab:12\n",
      "Panopea-generosa-vv0.74.a4.5d951bcf45b4b-diamond_functional.tab:12\n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "cd \"${data_dir}\"\n",
    "\n",
    "grep -c \"^#\" \"${blast_annotations}\" \"${diamond_annotations}\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Concatenate annotation files and keep only one with best `e-value`\n",
    "\n",
    "Also modifies mRNA names to generate gene names instead.\n",
    "\n",
    "- `awk 'NR > 13'`: Skips first 13 header lines\n",
    "- `sort -k1,1 -k9,9`: Sorts on first field (mRNA name), then on 9th field (e-value)\n",
    "- `sed 's/^21910-//'`: Removes leading info from each mRNA name, at the beginning of each line (`^`)\n",
    "- `sed 's/.m0[0-9]//'`: Removes `.m0N` from each mRNA name.\n",
    "- `awk '!array[$1]++'`: `awk` array that only prints line if it's the first occurrence of gene name (first field; `$1` (i.e. no duplicates)\n",
    "\n",
    "---\n",
    "\n",
    "Also replaces two obsolete SPIDs, as [identified in previous notebook run](http://localhost:8888/lab/workspaces/auto-W/tree/20220419-pgen-gene_annotation_mapping.ipynb#Manually-look-up-SPIDs-in-UniProt)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "Line count:\n",
      "31216 /home/sam/analyses/20220419-pgen-gene_annotation_mapping/Panopea-generosa-v1.0.a4-blast-diamond-functional.tab\n",
      "--------------------------------------------------\n",
      "\n",
      "Line count:\n",
      "14676 /home/sam/analyses/20220419-pgen-gene_annotation_mapping/Panopea-generosa-v1.0.a4-blast-diamond-functional_best.tab\n",
      "--------------------------------------------------\n",
      "\n",
      "PGEN_.00g000010\t121\t229\t165\tQ86IC9\tsp|Q86IC9|CAMT1_DICDI\t11\t122\t8.93e-14\t35.652\t115\n",
      "PGEN_.00g000020\t147\t467\t968\tP04177\tsp|P04177|TY3H_RAT\t20\t339\t3.47e-127\t55.140\t321\n",
      "PGEN_.00g000050\t566\t722\t180\tQ8L840\tsp|Q8L840|RQL4A_ARATH\t2\t167\t2.4e-12\t35.1\t168\n",
      "PGEN_.00g000060\t1957\t2106\t129\tQ61043\tsp|Q61043|NIN_MOUSE\t31\t184\t1.7e-06\t26.0\t154\n",
      "PGEN_.00g000080\t268\t322\t152\tA1E2V0\tsp|A1E2V0|BIRC3_CANLF\t163\t220\t3.91e-10\t53.448\t58\n",
      "PGEN_.00g000090\t199\t327\t161\tP34456\tsp|P34456|YMD2_CAEEL\t7\t134\t7.52e-12\t26.357\t129\n",
      "PGEN_.00g000120\t6\t49\t118\tP34457\tsp|P34457|YMD3_CAEEL\t90\t133\t3.2e-05\t47.7\t44\n",
      "PGEN_.00g000210\t18\t200\t263\tO00463\tsp|O00463|TRAF5_HUMAN\t5\t191\t2.24e-25\t34.921\t189\n",
      "PGEN_.00g000230\t48\t155\t287\tQ00945\tsp|Q00945|CONO_LYMST\t31\t134\t1.59e-32\t50.000\t108\n",
      "PGEN_.00g000240\t4\t585\t1144\tQ5SWK7\tsp|Q5SWK7|RN145_MOUSE\t13\t592\t1.2e-123\t40.1\t591\n",
      "PGEN_.00g000280\t4\t153\t210\tQ8ZXT3\tsp|Q8ZXT3|Y1111_PYRAE\t853\t1012\t1.10e-17\t38.750\t160\n",
      "PGEN_.00g000300\t159\t347\t480\tQ5REG4\tsp|Q5REG4|DTX3_PONAB\t1135\t1320\t1.20e-51\t50.794\t189\n",
      "PGEN_.00g000380\t381\t508\t205\tQ8QG60\tsp|Q8QG60|CRY2_CHICK\t2\t145\t4.92e-18\t36.111\t144\n",
      "PGEN_.00g000440\t792\t1796\t1362\tQ9H583\tsp|Q9H583|HEAT1_HUMAN\t539\t1575\t7.24e-156\t34.692\t1055\n",
      "PGEN_.00g000450\t347\t753\t1038\tA0JMR6\tsp|A0JMR6|MYSM1_XENLA\t138\t535\t3.16e-130\t48.426\t413\n",
      "PGEN_.00g000460\t478\t1045\t859\tO88917\tsp|O88917|AGRL1_RAT\t150\t740\t2.17e-95\t33.443\t610\n",
      "PGEN_.00g000490\t189\t393\t152\tQ7D513\tsp|Q7D513|EGTB_MYCTO\t307\t565\t1.7e-08\t24.8\t266\n",
      "PGEN_.00g000520\t118\t339\t495\tQ92968\tsp|Q92968|PEX13_HUMAN\t134\t356\t1.34e-56\t49.565\t230\n",
      "PGEN_.00g000530\t1\t191\t593\tA6H769\tsp|A6H769|RS7_BOVIN\t1\t160\t2.3e-60\t68.1\t191\n",
      "PGEN_.00g000540\t1876\t2081\t565\tQ14676\tsp|Q14676|MDC1_HUMAN\t1857\t2060\t1.19e-56\t46.117\t206\n",
      "PGEN_.00g000560\t17\t114\t135\tQ54W11\tsp|Q54W11|MCFL_DICDI\t21\t113\t3.1e-07\t32.7\t98\n",
      "PGEN_.00g000600\t49\t474\t1301\tQ9FKK7\tsp|Q9FKK7|XYLA_ARATH\t215\t641\t1.35e-173\t55.245\t429\n",
      "PGEN_.00g000660\t5\t308\t899\tQ3ZCD7\tsp|Q3ZCD7|TECR_BOVIN\t2\t303\t1.4e-95\t55.7\t305\n",
      "PGEN_.00g000670\t60\t1237\t3493\tQ9D180\tsp|Q9D180|CFA57_MOUSE\t1\t1143\t0.0\t59.102\t1181\n",
      "PGEN_.00g000680\t17\t206\t520\tQ9D6Z0\tsp|Q9D6Z0|ALKB7_MOUSE\t89\t278\t1.11e-64\t48.947\t190\n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "cd \"${data_dir}\"\n",
    "\n",
    "# Concatenate both annotation files\n",
    "for file in ${blast_annotations} ${diamond_annotations}\n",
    "do\n",
    "    awk 'NR > 13' ${file}\n",
    "done \\\n",
    ">> \"${analysis_dir}\"/\"${blast_diamond_cat}\"\n",
    "\n",
    "# Sort for best e-value and perform formatting of genome IDs\n",
    "sort -k1,1 -k9,9 \"${analysis_dir}\"/\"${blast_diamond_cat}\" \\\n",
    "| sed 's/^21910-//' \\\n",
    "| sed 's/.m0[0-9]//' \\\n",
    "| awk '!array[$1]++' \\\n",
    ">> \"${analysis_dir}\"/\"${blast_diamond_cat_best}\"\n",
    "\n",
    "# Find/replace two obsolete SPIDs\n",
    "sed -i 's/Q6ZRR9/M0R2J8/g' \"${analysis_dir}\"/\"${blast_diamond_cat_best}\"\n",
    "sed -i 's/Q9NPA5/Q9NTW7/g' \"${analysis_dir}\"/\"${blast_diamond_cat_best}\"\n",
    "\n",
    "echo \"\"\n",
    "echo \"Line count:\"\n",
    "\n",
    "wc -l \"${analysis_dir}\"/\"${blast_diamond_cat}\"\n",
    "\n",
    "echo \"--------------------------------------------------\"\n",
    "\n",
    "echo \"\"\n",
    "echo \"Line count:\"\n",
    "\n",
    "wc -l \"${analysis_dir}\"/\"${blast_diamond_cat_best}\"\n",
    "\n",
    "echo \"--------------------------------------------------\"\n",
    "echo \"\"\n",
    "\n",
    "head -n 25 \"${analysis_dir}\"/\"${blast_diamond_cat_best}\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Create list of genome IDs and SwissProt IDs"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 51,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "Line count:\n",
      "14676 Panopea-generosa-v1.0.a4-blast-diamond-functional-genome_IDs-SPIDs.txt\n",
      "--------------------------------------------------\n",
      "Q86IC9 \t PGEN_.00g000010\n",
      "P04177 \t PGEN_.00g000020\n",
      "Q8L840 \t PGEN_.00g000050\n",
      "Q61043 \t PGEN_.00g000060\n",
      "A1E2V0 \t PGEN_.00g000080\n",
      "P34456 \t PGEN_.00g000090\n",
      "P34457 \t PGEN_.00g000120\n",
      "O00463 \t PGEN_.00g000210\n",
      "Q00945 \t PGEN_.00g000230\n",
      "Q5SWK7 \t PGEN_.00g000240\n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "cd \"${analysis_dir}\"\n",
    "\n",
    "awk '{print $5,\"\\t\",$1}' \"${blast_diamond_cat_best}\" > \"$genome_IDs_SPIDs\"\n",
    "\n",
    "echo \"\"\n",
    "echo \"Line count:\"\n",
    "\n",
    "wc -l \"$genome_IDs_SPIDs\"\n",
    "\n",
    "echo \"--------------------------------------------------\"\n",
    "\n",
    "head \"$genome_IDs_SPIDs\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Create list os SwissProt IDs"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 52,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "Line count:\n",
      "14676 Panopea-generosa-v1.0.a4-blast-diamond-functional-SPIDs.txt\n",
      "--------------------------------------------------\n",
      "\n",
      "Q86IC9\n",
      "P04177\n",
      "Q8L840\n",
      "Q61043\n",
      "A1E2V0\n",
      "P34456\n",
      "P34457\n",
      "O00463\n",
      "Q00945\n",
      "Q5SWK7\n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "cd \"${analysis_dir}\"\n",
    "\n",
    "awk '{print $5}' \"${blast_diamond_cat_best}\" > \"${spid_list}\"\n",
    "\n",
    "echo \"\"\n",
    "echo \"Line count:\"\n",
    "\n",
    "wc -l \"${spid_list}\"\n",
    "\n",
    "echo \"--------------------------------------------------\"\n",
    "\n",
    "echo \"\"\n",
    "\n",
    "head \"${spid_list}\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Batch submission/retrieval to/from UniProt\n",
    "\n",
    "Perl script obtained from UniProt: https://www.uniprot.org/help/api_batch_retrieval\n",
    "\n",
    "Modified to accept file with list of IDs and to map SPID to UniProt Accession"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 53,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "use strict;\n",
      "use warnings;\n",
      "use LWP::UserAgent;\n",
      "\n",
      "my $list = $ARGV[0]; # File containg list of UniProt identifiers.\n",
      "\n",
      "my $base = 'https://www.uniprot.org';\n",
      "my $tool = 'uploadlists';\n",
      "\n",
      "my $contact = 'samwhite@uw.edu'; # Please set a contact email address here to help us debug in case of problems (see https://www.uniprot.org/help/privacy).\n",
      "my $agent = LWP::UserAgent->new(agent => \"libwww-perl $contact\");\n",
      "push @{$agent->requests_redirectable}, 'POST';\n",
      "\n",
      "my $response = $agent->post(\"$base/$tool/\",\n",
      "                            [ 'file' => [$list],\n",
      "                              'format' => 'txt',\n",
      "                              'from' => 'SWISSPROT',\n",
      "                              'to' => 'ACC',\n",
      "                            ],\n",
      "                            'Content_Type' => 'form-data');\n",
      "\n",
      "while (my $wait = $response->header('Retry-After')) {\n",
      "  print STDERR \"Waiting ($wait)...\\n\";\n",
      "  sleep $wait;\n",
      "  $response = $agent->get($response->base);\n",
      "}\n",
      "\n",
      "$response->is_success ?\n",
      "  print $response->content :\n",
      "  die 'Failed, got ' . $response->status_line .\n",
      "    ' for ' . $response->request->uri . \"\\n\";\n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "# Print script for viewing\n",
    "cat \"${uniprot_mapping_script}\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 54,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "total 142M\n",
      "-rw-rw-r-- 1 sam sam 2.8M Apr 20 21:20 Panopea-generosa-v1.0.a4-blast-diamond-functional.tab\n",
      "-rw-rw-r-- 1 sam sam 1.2M Apr 20 21:20 Panopea-generosa-v1.0.a4-blast-diamond-functional_best.tab\n",
      "-rw-rw-r-- 1 sam sam 359K Apr 20 21:20 Panopea-generosa-v1.0.a4-blast-diamond-functional-genome_IDs-SPIDs.txt\n",
      "-rw-rw-r-- 1 sam sam 101K Apr 20 21:20 Panopea-generosa-v1.0.a4-blast-diamond-functional-SPIDs.txt\n",
      "-rw-rw-r-- 1 sam sam 138M Apr 20 21:21 20220419-pgen-uniprot_batch-results.txt\n",
      "\n",
      "\n",
      "--------------------------------------------------\n",
      "\n",
      "Line count:\n",
      "2850419 20220419-pgen-uniprot_batch-results.txt\n",
      "--------------------------------------------------\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n",
      "real\t0m43.226s\n",
      "user\t0m3.070s\n",
      "sys\t0m3.310s\n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "cd \"${analysis_dir}\"\n",
    "\n",
    "# Run UniProt Prel mapping script and time how long it takes\n",
    "time \\\n",
    "perl \"${uniprot_mapping_script}\" \"${spid_list}\" > \"${uniprot_output}\"\n",
    "\n",
    "ls -ltrh\n",
    "\n",
    "echo \"\"\n",
    "echo \"\"\n",
    "echo \"--------------------------------------------------\"\n",
    "echo \"\"\n",
    "echo \"Line count:\"\n",
    "\n",
    "wc -l \"${uniprot_output}\"\n",
    "\n",
    "echo \"--------------------------------------------------\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Check mapping output\n",
    "\n",
    "Counting Accession lines (beginning with `AC`) should show a _lower_ count than the number of SwissProt IDs submitted, as UniProt automatically removes duplicates upon submission."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 55,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "ID   CAMT1_DICDI             Reviewed;         230 AA.\n",
      "AC   Q86IC9; Q552T5;\n",
      "DT   05-MAY-2009, integrated into UniProtKB/Swiss-Prot.\n",
      "DT   01-JUN-2003, sequence version 1.\n",
      "DT   23-FEB-2022, entry version 92.\n",
      "DE   RecName: Full=Probable caffeoyl-CoA O-methyltransferase 1;\n",
      "DE            EC=2.1.1.104;\n",
      "DE   AltName: Full=O-methyltransferase 5;\n",
      "GN   Name=omt5; ORFNames=DDB_G0275499;\n",
      "OS   Dictyostelium discoideum (Slime mold).\n",
      "OC   Eukaryota; Amoebozoa; Evosea; Eumycetozoa; Dictyostelia; Dictyosteliales;\n",
      "OC   Dictyosteliaceae; Dictyostelium.\n",
      "OX   NCBI_TaxID=44689;\n",
      "RN   [1]\n",
      "RP   NUCLEOTIDE SEQUENCE [LARGE SCALE GENOMIC DNA].\n",
      "RC   STRAIN=AX4;\n",
      "RX   PubMed=12097910; DOI=10.1038/nature00847;\n",
      "RA   Gloeckner G., Eichinger L., Szafranski K., Pachebat J.A., Bankier A.T.,\n",
      "RA   Dear P.H., Lehmann R., Baumgart C., Parra G., Abril J.F., Guigo R.,\n",
      "RA   Kumpf K., Tunggal B., Cox E.C., Quail M.A., Platzer M., Rosenthal A.,\n",
      "RA   Noegel A.A.;\n",
      "RT   \"Sequence and analysis of chromosome 2 of Dictyostelium discoideum.\";\n",
      "RL   Nature 418:79-85(2002).\n",
      "RN   [2]\n",
      "RP   NUCLEOTIDE SEQUENCE [LARGE SCALE GENOMIC DNA].\n",
      "RC   STRAIN=AX4;\n",
      "RX   PubMed=15875012; DOI=10.1038/nature03481;\n",
      "RA   Eichinger L., Pachebat J.A., Gloeckner G., Rajandream M.A., Sucgang R.,\n",
      "RA   Berriman M., Song J., Olsen R., Szafranski K., Xu Q., Tunggal B.,\n",
      "RA   Kummerfeld S., Madera M., Konfortov B.A., Rivero F., Bankier A.T.,\n",
      "\n",
      "----------------------------------------------------\n",
      "\n",
      "Number of accessions:\n",
      "\n",
      "10634\n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "cd \"${analysis_dir}\"\n",
    "\n",
    "head -n 30 \"${uniprot_output}\"\n",
    "\n",
    "echo \"\"\n",
    "\n",
    "echo \"----------------------------------------------------\"\n",
    "\n",
    "echo \"\"\n",
    "\n",
    "echo \"Number of accessions:\"\n",
    "\n",
    "echo \"\"\n",
    "\n",
    "grep -c \"^AC\" \"${uniprot_output}\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Parse the stuff we want\n",
    "\n",
    "- UniProt accession\n",
    "\n",
    "- Gene name/abbreviation\n",
    "\n",
    "- Gene description\n",
    "\n",
    "- GO IDs"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Check DE descriptor lines to decide pattern matching\n",
    "\n",
    "Checks lines beginning with `DE` to identify values in the 2nd field with `Name` in them.\n",
    "\n",
    "Identifies unique values. This will determine how to parse properly after this."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 56,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "AltName:\n",
      "RecName:\n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "cd \"${analysis_dir}\"\n",
    "\n",
    "grep \"^DE\" \"${uniprot_output}\" | awk '$2 ~ /Name/ { print $2 }' | sort -u"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 57,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n",
      "real\t529m51.428s\n",
      "user\t427m33.240s\n",
      "sys\t79m42.704s\n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "cd \"${analysis_dir}\"\n",
    "\n",
    "# Loop through UniProt records\n",
    "time \\\n",
    "while read -r line\n",
    "do\n",
    "  # Get record line descriptor\n",
    "  descriptor=$(echo \"${line}\" | awk '{print $1}')\n",
    "\n",
    "  # Capture second field for evaluation\n",
    "  go_line=$(echo \"${line}\" | awk '{print $2}')\n",
    "\n",
    "  # Append GO IDs to array\n",
    "  if [[ \"${go_line}\" == \"GO;\" ]]; then\n",
    "    go_id=$(echo \"${line}\" | awk '{print $3}')\n",
    "    go_ids_array+=(\"${go_id}\")\n",
    "  elif [[ \"${go_line}\" == \"GeneID;\" ]]; then\n",
    "    # Uses sed to strip trailing semi-colon\n",
    "    gene_id=$(echo \"${line}\" | awk '{print $3}' | sed 's/;$//')\n",
    "  fi\n",
    "\n",
    "  # Get gene description\n",
    "  if [[ \"${descriptor}\" == \"DE\" ]] && [[ \"${go_line}\" == \"RecName:\" ]]; then\n",
    "    # Uses sed to strip trailing spaces at end of line and remove commas\n",
    "    gene_description=$(echo \"${line}\" | awk -F \"[={]\" '{print $2}' | sed 's/[[:blank:]]*$//' | sed 's/,//g' | sed 's/;$//')\n",
    "\n",
    "  # Get alternate name\n",
    "  elif [[ \"${descriptor}\" == \"DE\" ]] && [[ \"${go_line}\" == \"AltName:\" ]]; then\n",
    "    # Uses sed to strip trailing spaces at end of line and remove commas\n",
    "    alt_gene_description=$(echo \"${line}\" | awk -F \"[={]\" '{print $2}' | sed 's/[[:blank:]]*$//' | sed 's/,//g' | sed 's/;$//')\n",
    "\n",
    "  # Get gene name\n",
    "  elif [[ \"${descriptor}\" == \"GN\"  ]] && [[ $(echo \"${line}\" | awk -F \"=\" '{print $1}') == \"GN   Name\" ]]; then\n",
    "    # Uses sed to strip trailing spaces at end of line\n",
    "    gene=$(echo \"${line}\" | awk -F 'Name=|{|;' '{print $2}' | sed 's/[[:blank:]]*$//')\n",
    "\n",
    "  # Get UniProt accession\n",
    "  elif [[ \"${descriptor}\" == \"AC\" ]]; then\n",
    "    # awk removes \"AC\" denotation\n",
    "    # sed removes all spaces\n",
    "    # sed removes trailing semi-colon\n",
    "    # Uses array to handle accessions being on multiple lines of UniProt records file\n",
    "    accession=$(echo \"${line}\" | awk '{$1=\"\";print $0}' | sed 's/[[:space:]]*//g' | sed 's/;$//')\n",
    "    accessions_array+=(\"${accession}\")\n",
    "\n",
    "  # Identify beginning on new record\n",
    "  elif [[ \"${descriptor}\" == \"//\" ]]; then\n",
    "\n",
    "    # Prints other comma-separated variables, then GOID1;GOID2;GOIDn\n",
    "    # IFS prevents spaces from being added between GO IDs\n",
    "    # sed removes \";\" after final GO ID\n",
    "    (IFS=; printf \"%s\\t%s\\t%s\\t%s\\t%s\\t%s\\n\" \"${accessions_array[*]}\" \"${gene_id}\" \"${gene}\" \"${gene_description}\" \"${alt_gene_description}\" \"${go_ids_array[*]}\" | sed 's/;$//')\n",
    "\n",
    "    # Re-initialize variables\n",
    "    accession=\"\"  \n",
    "    accessions_array=()\n",
    "    descriptor=\"\"\n",
    "    gene=\"\"\n",
    "    gene_description=\"\"\n",
    "    gene_id=\"\"\n",
    "    go_id=\"\"\n",
    "    go_ids_array=()\n",
    "  fi\n",
    "\n",
    "\n",
    "done < \"${uniprot_output}\" >> \"${parsed_uniprot}\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": []
   },
   "source": [
    "### Inspect parsed UniProt file"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 58,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "10304 20220419-pgen-accession-gene_name-gene_description-go_ids.tab\n",
      "\n",
      "------------------------------------------------------------------\n",
      "\n",
      "Q86IC9;Q552T5\t8620183\tomt5\tProbable caffeoyl-CoA O-methyltransferase 1\tO-methyltransferase 5\tGO:0042409;GO:0046872;GO:0008757;GO:0032259\n",
      "P04177\t25085\tTh\tTyrosine 3-monooxygenase\tTyrosine 3-hydroxylase\tGO:0030424;GO:0005737;GO:0009898;GO:0031410;GO:0030659;GO:0005829;GO:0030425;GO:0033162;GO:0005739;GO:0043005;GO:0043025;GO:0005634;GO:0043204;GO:0048471;GO:0005790;GO:0008021;GO:0043195;GO:0016597;GO:0035240;GO:0019899;GO:0008199;GO:0008198;GO:0042802;GO:0004497;GO:0019825;GO:0019904;GO:0034617;GO:0004511;GO:0015842;GO:0009887;GO:0042423;GO:0071312;GO:0071333;GO:0071363;GO:0071287;GO:0071316;GO:0071466;GO:0021987;GO:0042745;GO:0050890;GO:0042416;GO:0006585;GO:0042755;GO:0048596;GO:0042418;GO:0042462;GO:0006631;GO:0016137;GO:0007507;GO:1990384;GO:0033076;GO:0007612;GO:0007626;GO:0007617;GO:0007613;GO:0010259;GO:0042136;GO:0042421;GO:0018963;GO:0052314;GO:0008016;GO:0014823;GO:0001975;GO:0051412;GO:0051602;GO:0032355;GO:0045471;GO:0045472;GO:0070848;GO:0009635;GO:0001666;GO:0035902;GO:0017085;GO:0035900;GO:0009416;GO:0032496;GO:0010038;GO:0035094;GO:0031667;GO:0014070;GO:0043434;GO:0046684;GO:0009651;GO:0048545;GO:0009414;GO:0009410;GO:0010043;GO:0007605;GO:0035176;GO:0006665;GO:0001963;GO:0042214;GO:0007601\n",
      "Q8L840;O04092;Q9FT71\t837636\tRECQL4A\tATP-dependent DNA helicase Q-like 4A\tSGS1-like protein\tGO:0005694;GO:0005737;GO:0005634;GO:0009506;GO:0043138;GO:0005524;GO:0016887;GO:0009378;GO:0046872;GO:0003676;GO:0071215;GO:0070417;GO:0006974;GO:0051276;GO:0032508;GO:0006310;GO:0006281;GO:0006268;GO:0000724\n",
      "Q61043;A0A1Y7VJL5;B2RQ73;B7ZMZ9;E9Q488;E9Q4S3;Q674R4;Q6ZPM7\t18080\tNin\tNinein\tSGS1-like protein\tGO:0045177;GO:0030424;GO:0044295;GO:0120103;GO:0005814;GO:0005813;GO:0097539;GO:0005881;GO:0030425;GO:0072686;GO:0097431;GO:0005730;GO:0005654;GO:0000242;GO:0005886;GO:0000922;GO:0005509;GO:0005525;GO:0019900;GO:0051011;GO:0010457;GO:0051642;GO:0090222;GO:0048668;GO:0021540;GO:0021957;GO:0034454;GO:0050772;GO:0031116;GO:0008104\n",
      "A1E2V0\t489433\tBIRC3\tBaculoviral IAP repeat-containing protein 3\tRING-type E3 ubiquitin transferase BIRC3\tGO:0005737;GO:0005829;GO:0005654;GO:0005634;GO:0043027;GO:0046872;GO:0061630;GO:0043066;GO:0060546;GO:0031398;GO:0051726\n",
      "P34456\t186266\t\tUncharacterized protein F54H12.2\tRING-type E3 ubiquitin transferase BIRC3\tGO:0005829;GO:0004748;GO:0009263\n",
      "P34457\t\t\tPutative uncharacterized transposon-derived protein F54H12.3\tRING-type E3 ubiquitin transferase BIRC3\tGO:0003676;GO:0015074\n",
      "O00463;B4DIS9;B4E0A2;Q6FHY1\t7188\tTRAF5\tTNF receptor-associated factor 5\tRING finger protein 84\tGO:0035631;GO:0005813;GO:0009898;GO:0005829;GO:0042802;GO:0031996;GO:0005164;GO:0031625;GO:0008270;GO:0006915;GO:0097400;GO:0048255;GO:0008284;GO:0051091;GO:0043123;GO:0046330;GO:0051092;GO:0070534;GO:0042981;GO:0043122;GO:0007165;GO:0023019;GO:0033209\n",
      "Q00945\t\t\tNeurophysin\tRING finger protein 84\tGO:0005576;GO:0005185\n",
      "Q5SWK7;Q8BXX5;Q9CXG1\t74315\tRnf145\tRING finger protein 145\tRING finger protein 84\tGO:0012505;GO:0005783;GO:0005789;GO:0016021;GO:0061630;GO:0008270\n",
      "Q8ZXT3\t\t\tUncharacterized protein PAE1111\tRING finger protein 84\t\n",
      "Q5REG4\t100171717\tDTX3\tProbable E3 ubiquitin-protein ligase DTX3\tRING-type E3 ubiquitin transferase DTX3\tGO:0005737;GO:0046872;GO:0016740;GO:0007219;GO:0016567\n",
      "Q8QG60;Q8QG52;Q8QGQ5\t374092\tCRY2\tCryptochrome-2\tRING-type E3 ubiquitin transferase DTX3\tGO:0005737;GO:0005634;GO:0003677;GO:0071949;GO:0009881;GO:0032922;GO:0007623;GO:0043153;GO:0042754;GO:0045892;GO:0018298;GO:0042752;GO:0009416\n",
      "Q9H583;Q5T3Q8;Q6P197;Q9NW23\t55127\tHEATR1\tHEAT repeat-containing protein 1 N-terminally processed\tU3 small nucleolar RNA-associated protein 10 homolog\tGO:0030686;GO:0001650;GO:0016020;GO:0005739;GO:0005730;GO:0005654;GO:0032040;GO:0034455;GO:0003723;GO:0030515;GO:0000462;GO:2000234;GO:0045943\n",
      "A0JMR6\t779416\tmysm1\tHistone H2A deubiquitinase MYSM1\tMyb-like SWIRM and MPN domain-containing protein 1\tGO:0005634;GO:0003677;GO:0042393;GO:0070122;GO:0046872;GO:0140492;GO:0004843;GO:0003713;GO:0006338;GO:0035522;GO:0045944\n",
      "O88917;O09026;O35818;O88916\t65096\tAdgrl1\tAdhesion G protein-coupled receptor L1\tLatrophilin-1\tGO:0030424;GO:0098978;GO:0030426;GO:0005887;GO:0099056;GO:0043005;GO:0005886;GO:0014069;GO:0042734;GO:0045202;GO:0030246;GO:0050839;GO:0004930;GO:0016524;GO:0015643;GO:0007189;GO:0007420;GO:0035584;GO:0007166;GO:0007157;GO:0051965;GO:0090129\n",
      "Q7D513\t\tegtB\tHercynine oxygenase\tGamma-glutamyl hercynylcysteine S-oxide synthase\tGO:0044875;GO:0005506;GO:0004497\n",
      "Q92968;B2RCS1\t5194\tPEX13\tPeroxisomal membrane protein PEX13\tPeroxin-13\tGO:0005829;GO:0005779;GO:0016020;GO:1990429;GO:0005778;GO:0005777;GO:0021795;GO:0001561;GO:0007626;GO:0060152;GO:0001764;GO:0016560;GO:0001967\n",
      "A6H769\t505507\tRPS7\t40S ribosomal protein S7\tPeroxin-13\tGO:0022627;GO:0005815;GO:0032040;GO:0003735;GO:0042274;GO:0006364;GO:0006412\n",
      "Q14676;A2AB04;A2BF04;A2RRA8;A7YY86;B0S8A2;Q0EFC2;Q2L6H7;Q2TAZ4Q5JP55;Q5JP56;Q5ST83;Q68CQ3;Q86Z06;Q96QC2\t9656\tMDC1\tMediator of DNA damage checkpoint protein 1\tNuclear factor with BRCT domains 1\tGO:0005694;GO:0005925;GO:0016604;GO:0005654;GO:0005634;GO:0070975;GO:0008022;GO:0006281;GO:0031573\n",
      "Q54W11\t8622324\tmcfL\tMitochondrial substrate carrier family protein L\tNuclear factor with BRCT domains 1\tGO:0016021;GO:0005743;GO:0055085\n",
      "Q9FKK7;Q8L759\t835871\tXYLA\tXylose isomerase\tNuclear factor with BRCT domains 1\tGO:0005783;GO:0005794;GO:0000325;GO:0009536;GO:0099503;GO:0046872;GO:0009045;GO:0042843\n",
      "Q3ZCD7\t614105\tTECR\tVery-long-chain enoyl-CoA reductase\tTrans-23-enoyl-CoA reductase\tGO:0005783;GO:0030176;GO:0016491;GO:0102758;GO:0030497;GO:0006665;GO:0006694;GO:0042761\n",
      "Q9D180;A2ACY9\t68625\tCfap57\tCilia- and flagella-associated protein 57\tWD repeat-containing protein 65\t\n",
      "Q9D6Z0;Q8K1H3;Q9CY41;Q9D942\t66400\tAlkbh7\tAlpha-ketoglutarate-dependent dioxygenase alkB homolog 7 mitochondrial\tAlkylated DNA repair protein alkB homolog 7\tGO:0005759;GO:0005739;GO:0051213;GO:0046872;GO:0006974;GO:0006631;GO:0010883;GO:1902445\n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "cd \"${analysis_dir}\"\n",
    "\n",
    "wc -l \"${parsed_uniprot}\"\n",
    "\n",
    "echo \"\"\n",
    "echo \"------------------------------------------------------------------\"\n",
    "echo \"\"\n",
    "\n",
    "head -n 25 \"${parsed_uniprot}\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 59,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "# Sets markdown table align left in subsequent cell\n",
       "<style>\n",
       "  table {margin-left: 0 !important;}\n",
       "</style>\n"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "%%html\n",
    "# Sets markdown table align left in subsequent cell\n",
    "<style>\n",
    "  table {margin-left: 0 !important;}\n",
    "</style>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Combine with original list of genes and SPIDs\n",
    "\n",
    "Output format (tab-delimited):\n",
    "\n",
    "| gene_ID | SPIDs | UniProt_gene_ID | gene | gene_description | alternate_gene_description | GO_IDs |\n",
    "|---------|-------|-----------------|------|------------------|----------------------------|--------|\n",
    "\n",
    "\n",
    "Explanation:\n",
    "\n",
    "- `awk -v FS='[;[:space:]]+'`: Sets the Field Separator variable to handle `; ` in UniProt accessions. Allows for proper searching.\n",
    "\n",
    "- `FNR == NR`: Restricts next block (designated by `{}`) to work only on first input file.\n",
    "\n",
    "- `{array[$1]=$0; next}`: Adds the entire line (`$0`) of the first file to the array names `array` and then moves on to the next set of commands for the second input file.\n",
    "\n",
    "- `($1 in array)`: Looks for the value of the first column (`$1`, which is SPID) from the second file to see if there's a match from the array (which contains the line from the first file).\n",
    "\n",
    "- `{print $2,array[$1]}'`: If there's a match, print the second column (`$2`, which is gene ID) from the second file, followed by the line from the first file.\n",
    "\n",
    "- `\"${parsed_uniprot}\" \"${spid_list}\"`: The first and second input files.\n",
    "\n",
    "- `\"${joined_output}\"`: Result of the join."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 60,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "\n",
    "cd \"${analysis_dir}\"\n",
    "\n",
    "awk \\\n",
    "-v FS='[;[:space:]]+' \\\n",
    "'NR==FNR \\\n",
    "{array[$1]=$0; next} \\\n",
    "($1 in array) \\\n",
    "{print $2\"\\t\"array[$1]}' \\\n",
    "\"${parsed_uniprot}\" \"${genome_IDs_SPIDs}\" \\\n",
    "> \"${joined_output}\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Inspect final annotation file"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 61,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "14672 20220419-pgen-gene-accessions-gene_id-gene_name-gene_description-alt_gene_description-go_ids.tab\n",
      "\n",
      "------------------------------------------------------------------\n",
      "\n",
      "PGEN_.00g000010\tQ86IC9;Q552T5\t8620183\tomt5\tProbable caffeoyl-CoA O-methyltransferase 1\tO-methyltransferase 5\tGO:0042409;GO:0046872;GO:0008757;GO:0032259\n",
      "PGEN_.00g000020\tP04177\t25085\tTh\tTyrosine 3-monooxygenase\tTyrosine 3-hydroxylase\tGO:0030424;GO:0005737;GO:0009898;GO:0031410;GO:0030659;GO:0005829;GO:0030425;GO:0033162;GO:0005739;GO:0043005;GO:0043025;GO:0005634;GO:0043204;GO:0048471;GO:0005790;GO:0008021;GO:0043195;GO:0016597;GO:0035240;GO:0019899;GO:0008199;GO:0008198;GO:0042802;GO:0004497;GO:0019825;GO:0019904;GO:0034617;GO:0004511;GO:0015842;GO:0009887;GO:0042423;GO:0071312;GO:0071333;GO:0071363;GO:0071287;GO:0071316;GO:0071466;GO:0021987;GO:0042745;GO:0050890;GO:0042416;GO:0006585;GO:0042755;GO:0048596;GO:0042418;GO:0042462;GO:0006631;GO:0016137;GO:0007507;GO:1990384;GO:0033076;GO:0007612;GO:0007626;GO:0007617;GO:0007613;GO:0010259;GO:0042136;GO:0042421;GO:0018963;GO:0052314;GO:0008016;GO:0014823;GO:0001975;GO:0051412;GO:0051602;GO:0032355;GO:0045471;GO:0045472;GO:0070848;GO:0009635;GO:0001666;GO:0035902;GO:0017085;GO:0035900;GO:0009416;GO:0032496;GO:0010038;GO:0035094;GO:0031667;GO:0014070;GO:0043434;GO:0046684;GO:0009651;GO:0048545;GO:0009414;GO:0009410;GO:0010043;GO:0007605;GO:0035176;GO:0006665;GO:0001963;GO:0042214;GO:0007601\n",
      "PGEN_.00g000050\tQ8L840;O04092;Q9FT71\t837636\tRECQL4A\tATP-dependent DNA helicase Q-like 4A\tSGS1-like protein\tGO:0005694;GO:0005737;GO:0005634;GO:0009506;GO:0043138;GO:0005524;GO:0016887;GO:0009378;GO:0046872;GO:0003676;GO:0071215;GO:0070417;GO:0006974;GO:0051276;GO:0032508;GO:0006310;GO:0006281;GO:0006268;GO:0000724\n",
      "PGEN_.00g000060\tQ61043;A0A1Y7VJL5;B2RQ73;B7ZMZ9;E9Q488;E9Q4S3;Q674R4;Q6ZPM7\t18080\tNin\tNinein\tSGS1-like protein\tGO:0045177;GO:0030424;GO:0044295;GO:0120103;GO:0005814;GO:0005813;GO:0097539;GO:0005881;GO:0030425;GO:0072686;GO:0097431;GO:0005730;GO:0005654;GO:0000242;GO:0005886;GO:0000922;GO:0005509;GO:0005525;GO:0019900;GO:0051011;GO:0010457;GO:0051642;GO:0090222;GO:0048668;GO:0021540;GO:0021957;GO:0034454;GO:0050772;GO:0031116;GO:0008104\n",
      "PGEN_.00g000080\tA1E2V0\t489433\tBIRC3\tBaculoviral IAP repeat-containing protein 3\tRING-type E3 ubiquitin transferase BIRC3\tGO:0005737;GO:0005829;GO:0005654;GO:0005634;GO:0043027;GO:0046872;GO:0061630;GO:0043066;GO:0060546;GO:0031398;GO:0051726\n",
      "PGEN_.00g000090\tP34456\t186266\t\tUncharacterized protein F54H12.2\tRING-type E3 ubiquitin transferase BIRC3\tGO:0005829;GO:0004748;GO:0009263\n",
      "PGEN_.00g000120\tP34457\t\t\tPutative uncharacterized transposon-derived protein F54H12.3\tRING-type E3 ubiquitin transferase BIRC3\tGO:0003676;GO:0015074\n",
      "PGEN_.00g000210\tO00463;B4DIS9;B4E0A2;Q6FHY1\t7188\tTRAF5\tTNF receptor-associated factor 5\tRING finger protein 84\tGO:0035631;GO:0005813;GO:0009898;GO:0005829;GO:0042802;GO:0031996;GO:0005164;GO:0031625;GO:0008270;GO:0006915;GO:0097400;GO:0048255;GO:0008284;GO:0051091;GO:0043123;GO:0046330;GO:0051092;GO:0070534;GO:0042981;GO:0043122;GO:0007165;GO:0023019;GO:0033209\n",
      "PGEN_.00g000230\tQ00945\t\t\tNeurophysin\tRING finger protein 84\tGO:0005576;GO:0005185\n",
      "PGEN_.00g000240\tQ5SWK7;Q8BXX5;Q9CXG1\t74315\tRnf145\tRING finger protein 145\tRING finger protein 84\tGO:0012505;GO:0005783;GO:0005789;GO:0016021;GO:0061630;GO:0008270\n",
      "PGEN_.00g000280\tQ8ZXT3\t\t\tUncharacterized protein PAE1111\tRING finger protein 84\t\n",
      "PGEN_.00g000300\tQ5REG4\t100171717\tDTX3\tProbable E3 ubiquitin-protein ligase DTX3\tRING-type E3 ubiquitin transferase DTX3\tGO:0005737;GO:0046872;GO:0016740;GO:0007219;GO:0016567\n",
      "PGEN_.00g000380\tQ8QG60;Q8QG52;Q8QGQ5\t374092\tCRY2\tCryptochrome-2\tRING-type E3 ubiquitin transferase DTX3\tGO:0005737;GO:0005634;GO:0003677;GO:0071949;GO:0009881;GO:0032922;GO:0007623;GO:0043153;GO:0042754;GO:0045892;GO:0018298;GO:0042752;GO:0009416\n",
      "PGEN_.00g000440\tQ9H583;Q5T3Q8;Q6P197;Q9NW23\t55127\tHEATR1\tHEAT repeat-containing protein 1 N-terminally processed\tU3 small nucleolar RNA-associated protein 10 homolog\tGO:0030686;GO:0001650;GO:0016020;GO:0005739;GO:0005730;GO:0005654;GO:0032040;GO:0034455;GO:0003723;GO:0030515;GO:0000462;GO:2000234;GO:0045943\n",
      "PGEN_.00g000450\tA0JMR6\t779416\tmysm1\tHistone H2A deubiquitinase MYSM1\tMyb-like SWIRM and MPN domain-containing protein 1\tGO:0005634;GO:0003677;GO:0042393;GO:0070122;GO:0046872;GO:0140492;GO:0004843;GO:0003713;GO:0006338;GO:0035522;GO:0045944\n",
      "PGEN_.00g000460\tO88917;O09026;O35818;O88916\t65096\tAdgrl1\tAdhesion G protein-coupled receptor L1\tLatrophilin-1\tGO:0030424;GO:0098978;GO:0030426;GO:0005887;GO:0099056;GO:0043005;GO:0005886;GO:0014069;GO:0042734;GO:0045202;GO:0030246;GO:0050839;GO:0004930;GO:0016524;GO:0015643;GO:0007189;GO:0007420;GO:0035584;GO:0007166;GO:0007157;GO:0051965;GO:0090129\n",
      "PGEN_.00g000490\tQ7D513\t\tegtB\tHercynine oxygenase\tGamma-glutamyl hercynylcysteine S-oxide synthase\tGO:0044875;GO:0005506;GO:0004497\n",
      "PGEN_.00g000520\tQ92968;B2RCS1\t5194\tPEX13\tPeroxisomal membrane protein PEX13\tPeroxin-13\tGO:0005829;GO:0005779;GO:0016020;GO:1990429;GO:0005778;GO:0005777;GO:0021795;GO:0001561;GO:0007626;GO:0060152;GO:0001764;GO:0016560;GO:0001967\n",
      "PGEN_.00g000530\tA6H769\t505507\tRPS7\t40S ribosomal protein S7\tPeroxin-13\tGO:0022627;GO:0005815;GO:0032040;GO:0003735;GO:0042274;GO:0006364;GO:0006412\n",
      "PGEN_.00g000540\tQ14676;A2AB04;A2BF04;A2RRA8;A7YY86;B0S8A2;Q0EFC2;Q2L6H7;Q2TAZ4Q5JP55;Q5JP56;Q5ST83;Q68CQ3;Q86Z06;Q96QC2\t9656\tMDC1\tMediator of DNA damage checkpoint protein 1\tNuclear factor with BRCT domains 1\tGO:0005694;GO:0005925;GO:0016604;GO:0005654;GO:0005634;GO:0070975;GO:0008022;GO:0006281;GO:0031573\n",
      "PGEN_.00g000560\tQ54W11\t8622324\tmcfL\tMitochondrial substrate carrier family protein L\tNuclear factor with BRCT domains 1\tGO:0016021;GO:0005743;GO:0055085\n",
      "PGEN_.00g000600\tQ9FKK7;Q8L759\t835871\tXYLA\tXylose isomerase\tNuclear factor with BRCT domains 1\tGO:0005783;GO:0005794;GO:0000325;GO:0009536;GO:0099503;GO:0046872;GO:0009045;GO:0042843\n",
      "PGEN_.00g000660\tQ3ZCD7\t614105\tTECR\tVery-long-chain enoyl-CoA reductase\tTrans-23-enoyl-CoA reductase\tGO:0005783;GO:0030176;GO:0016491;GO:0102758;GO:0030497;GO:0006665;GO:0006694;GO:0042761\n",
      "PGEN_.00g000670\tQ9D180;A2ACY9\t68625\tCfap57\tCilia- and flagella-associated protein 57\tWD repeat-containing protein 65\t\n",
      "PGEN_.00g000680\tQ9D6Z0;Q8K1H3;Q9CY41;Q9D942\t66400\tAlkbh7\tAlpha-ketoglutarate-dependent dioxygenase alkB homolog 7 mitochondrial\tAlkylated DNA repair protein alkB homolog 7\tGO:0005759;GO:0005739;GO:0051213;GO:0046872;GO:0006974;GO:0006631;GO:0010883;GO:1902445\n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "\n",
    "cd \"${analysis_dir}\"\n",
    "\n",
    "wc -l \"${joined_output}\"\n",
    "\n",
    "echo \"\"\n",
    "echo \"------------------------------------------------------------------\"\n",
    "echo \"\"\n",
    "\n",
    "head -n 25 \"${joined_output}\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## DO NOT EDIT! CELLS BELOW THIS!!\n",
    "\n",
    "The cells below are for reference, as they were used to identify a set of obsolete/missing SPIDs. These were then used in a subsequent re-run of the noteook to refine things. The remaining cells document this process.\n",
    "\n",
    "The original number of SPIDs was 14676, but we ended up with only 14668 matches.\n",
    "\n",
    "Let's see if we can figure out why..."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Identify differences in lists of genome IDs"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2344a2345\n",
      "> PGEN_.00g050810\n",
      "5161a5163\n",
      "> PGEN_.00g119250\n",
      "6930a6933\n",
      "> PGEN_.00g162250\n",
      "7914a7918\n",
      "> PGEN_.00g185820\n",
      "8976a8981\n",
      "> PGEN_.00g209090\n",
      "9601a9607\n",
      "> PGEN_.00g222140\n",
      "11142a11149\n",
      "> PGEN_.00g258380\n",
      "12942a12950\n",
      "> PGEN_.00g304800\n"
     ]
    },
    {
     "ename": "CalledProcessError",
     "evalue": "Command 'b'\\ncd \"${analysis_dir}\"\\n\\ndiff <(awk \\'{print $1}\\' \"${joined_output}\") <(awk \\'{print $2}\\' \"${genome_IDs_SPIDs}\")\\n'' returned non-zero exit status 1.",
     "output_type": "error",
     "traceback": [
      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[0;31mCalledProcessError\u001b[0m                        Traceback (most recent call last)",
      "\u001b[0;32m<ipython-input-36-0ccf4752df95>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mget_ipython\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrun_cell_magic\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'bash'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m''\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m'\\ncd \"${analysis_dir}\"\\n\\ndiff <(awk \\'{print $1}\\' \"${joined_output}\") <(awk \\'{print $2}\\' \"${genome_IDs_SPIDs}\")\\n'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
      "\u001b[0;32m~/programs/miniconda3/lib/python3.9/site-packages/IPython/core/interactiveshell.py\u001b[0m in \u001b[0;36mrun_cell_magic\u001b[0;34m(self, magic_name, line, cell)\u001b[0m\n\u001b[1;32m   2401\u001b[0m             \u001b[0;32mwith\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mbuiltin_trap\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2402\u001b[0m                 \u001b[0margs\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m(\u001b[0m\u001b[0mmagic_arg_s\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mcell\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 2403\u001b[0;31m                 \u001b[0mresult\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mfn\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m   2404\u001b[0m             \u001b[0;32mreturn\u001b[0m \u001b[0mresult\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2405\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n",
      "\u001b[0;32m~/programs/miniconda3/lib/python3.9/site-packages/IPython/core/magics/script.py\u001b[0m in \u001b[0;36mnamed_script_magic\u001b[0;34m(line, cell)\u001b[0m\n\u001b[1;32m    140\u001b[0m             \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    141\u001b[0m                 \u001b[0mline\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mscript\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 142\u001b[0;31m             \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mshebang\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mline\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mcell\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    143\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    144\u001b[0m         \u001b[0;31m# write a basic docstring:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
      "\u001b[0;32m~/programs/miniconda3/lib/python3.9/site-packages/decorator.py\u001b[0m in \u001b[0;36mfun\u001b[0;34m(*args, **kw)\u001b[0m\n\u001b[1;32m    230\u001b[0m             \u001b[0;32mif\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0mkwsyntax\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    231\u001b[0m                 \u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mkw\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mfix\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mkw\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0msig\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 232\u001b[0;31m             \u001b[0;32mreturn\u001b[0m \u001b[0mcaller\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mfunc\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m*\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mextras\u001b[0m \u001b[0;34m+\u001b[0m \u001b[0margs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkw\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    233\u001b[0m     \u001b[0mfun\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m__name__\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mfunc\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m__name__\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    234\u001b[0m     \u001b[0mfun\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m__doc__\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mfunc\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m__doc__\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
      "\u001b[0;32m~/programs/miniconda3/lib/python3.9/site-packages/IPython/core/magic.py\u001b[0m in \u001b[0;36m<lambda>\u001b[0;34m(f, *a, **k)\u001b[0m\n\u001b[1;32m    185\u001b[0m     \u001b[0;31m# but it's overkill for just that one bit of state.\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    186\u001b[0m     \u001b[0;32mdef\u001b[0m \u001b[0mmagic_deco\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0marg\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 187\u001b[0;31m         \u001b[0mcall\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;32mlambda\u001b[0m \u001b[0mf\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m*\u001b[0m\u001b[0ma\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mk\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mf\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0ma\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mk\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    188\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    189\u001b[0m         \u001b[0;32mif\u001b[0m \u001b[0mcallable\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0marg\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
      "\u001b[0;32m~/programs/miniconda3/lib/python3.9/site-packages/IPython/core/magics/script.py\u001b[0m in \u001b[0;36mshebang\u001b[0;34m(self, line, cell)\u001b[0m\n\u001b[1;32m    243\u001b[0m             \u001b[0msys\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mstderr\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mflush\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    244\u001b[0m         \u001b[0;32mif\u001b[0m \u001b[0margs\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mraise_error\u001b[0m \u001b[0;32mand\u001b[0m \u001b[0mp\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mreturncode\u001b[0m\u001b[0;34m!=\u001b[0m\u001b[0;36m0\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 245\u001b[0;31m             \u001b[0;32mraise\u001b[0m \u001b[0mCalledProcessError\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mp\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mreturncode\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mcell\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0moutput\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mout\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mstderr\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0merr\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    246\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    247\u001b[0m     \u001b[0;32mdef\u001b[0m \u001b[0m_run_script\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mp\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mcell\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mto_close\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
      "\u001b[0;31mCalledProcessError\u001b[0m: Command 'b'\\ncd \"${analysis_dir}\"\\n\\ndiff <(awk \\'{print $1}\\' \"${joined_output}\") <(awk \\'{print $2}\\' \"${genome_IDs_SPIDs}\")\\n'' returned non-zero exit status 1."
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "\n",
    "cd \"${analysis_dir}\"\n",
    "\n",
    "diff <(awk '{print $1}' \"${joined_output}\") <(awk '{print $2}' \"${genome_IDs_SPIDs}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Get SPIDs of \"missing\" genome IDs"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Q3UN21 \t PGEN_.00g050810\n",
      "P0DN79 \t PGEN_.00g119250\n",
      "Q6ZRR9 \t PGEN_.00g162250\n",
      "Q9NPA5 \t PGEN_.00g185820\n",
      "Q6ZR98 \t PGEN_.00g209090\n",
      "Q5T699 \t PGEN_.00g222140\n",
      "Q9NPA5 \t PGEN_.00g258380\n",
      "Q9NPA5 \t PGEN_.00g304800\n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "\n",
    "cd \"${analysis_dir}\"\n",
    "\n",
    "for id in PGEN_.00g050810 PGEN_.00g119250 PGEN_.00g162250 PGEN_.00g185820 PGEN_.00g209090 PGEN_.00g222140 PGEN_.00g258380 PGEN_.00g304800\n",
    "do\n",
    "  grep \"${id}\" \"${genome_IDs_SPIDs}\"\n",
    "done"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Manually look up SPIDs in UniProt"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- [Q3UN21](https://www.uniprot.org/uniprot/Q3UN21): Obsolete/deleted from UniProt.\n",
    "\n",
    "- [P0DN79](https://www.uniprot.org/uniprot/P0DN79): Obsolete/deleted from UniProt.\n",
    "\n",
    "- [Q6ZRR9](https://www.uniprot.org/uniprot/M0R2J8): Redirects to SPID M0R2J8.\n",
    "\n",
    "- [Q9NPA5](https://www.uniprot.org/uniprot/Q9NTW7): Redirects to SPID Q9NTW7.\n",
    "\n",
    "- [Q6ZR98](https://www.uniprot.org/uniprot/Q6ZR98): Obsolete/deleted from UniProt.\n",
    "\n",
    "- [Q5T699](https://www.uniprot.org/uniprot/Q5T699): Obsolete/deleted from UniProt.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Using information above, will add code to find/replace the redirected SPIDs and then re-run rest of notebook."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}